ACM SIGMOD Conference 2016: San Francisco, CA, USA

Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016. ACM 【DBLP Link】

Paper Num: 190 || Session Num: 26

Keynote - Jeff Dean 1

1. Building Machine Learning Systems that Understand.

【Paper Link】 【Pages】:1

【Authors】: Jeff Dean

【Abstract】: Over the past five years, deep learning and large-scale neural networks have made significant advances in speech recognition, computer vision, language understanding and translation, robotics, and many other fields. Deep learning allows the use of very raw forms of data in order to build higher-level understanding of data automatically, and can also be used to learn to accomplish complex tasks. In the next decade, it is likely that a fruitful direction for research in data management will be in how to seamlessly integrate these kinds of machine learning models into systems that store and manage data. In this talk, I will highlight some of the advances that have been made in deep learning and suggest some interesting directions for future research.

【Keywords】: keynote talk

Session 1 - Scalable Analytics and Machine Learning 7

2. Learning Linear Regression Models over Factorized Joins.

【Paper Link】 【Pages】:3-18

【Authors】: Maximilian Schleich ; Dan Olteanu ; Radu Ciucanu

【Abstract】: We investigate the problem of building least squares regression models over training datasets defined by arbitrary join queries on database tables. Our key observation is that joins entail a high degree of redundancy in both computation and data representation, which is not required for the end-to-end solution to learning over joins. We propose a new paradigm for computing batch gradient descent that exploits the factorized computation and representation of the training datasets, a rewriting of the regression objective function that decouples the computation of cofactors of model parameters from their convergence, and the commutativity of cofactor computation with relational union and projection. We introduce three flavors of this approach: F/FDB computes the cofactors in one pass over the materialized factorized join; F avoids this materialization and intermixes cofactor and join computation; F/SQL expresses this mixture as one SQL query. Our approach has the complexity of join factorization, which can be exponentially lower than that of standard joins. Experiments with commercial, public, and synthetic datasets show that it outperforms MADlib, Python StatsModels, and R, by up to three orders of magnitude.

【Keywords】: factorized databases; join processing; linear regression
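
The heart of this approach is that least-squares learning only touches the data through cofactor aggregates, which commute with how the join is evaluated. Below is a minimal sketch of that observation (plain batch gradient descent over a materialized training matrix, not the authors' F/FDB, F, or F/SQL variants): once the cofactors C = X^T X and d = X^T y are available, each gradient step needs no further scan of the data.

```python
import numpy as np

# Toy training matrix standing in for a (conceptually) joined dataset; a factorized
# engine would compute the same cofactors without materializing the join.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

C = X.T @ X                      # cofactor matrix, one pass over the data
d = X.T @ y

L = 2 * np.linalg.norm(C, 2)     # Lipschitz constant of the gradient of ||X w - y||^2
w = np.zeros(3)
for _ in range(500):
    grad = 2 * (C @ w - d)       # gradient expressed purely through the cofactors
    w -= grad / L

print(np.round(w, 3))            # ~ [ 2.  -1.   0.5]
```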

3. To Join or Not to Join?: Thinking Twice about Joins before Feature Selection.

【Paper Link】 【Pages】:19-34

【Authors】: Arun Kumar ; Jeffrey F. Naughton ; Jignesh M. Patel ; Xiaojin Zhu

【Abstract】: Closer integration of machine learning (ML) with data processing is a booming area in both the data management industry and academia. Almost all ML toolkits assume that the input is a single table, but many datasets are not stored as single tables due to normalization. Thus, analysts often perform key-foreign key joins to obtain features from all base tables and apply a feature selection method, either explicitly or implicitly, with the aim of improving accuracy. In this work, we show that the features brought in by such joins can often be ignored without affecting ML accuracy significantly, i.e., we can "avoid joins safely." We identify the core technical issue that could cause accuracy to decrease in some cases and analyze this issue theoretically. Using simulations, we validate our analysis and measure the effects of various properties of normalized data on accuracy. We apply our analysis to design easy-to-understand decision rules to predict when it is safe to avoid joins in order to help analysts exploit this runtime-accuracy trade-off. Experiments with multiple real normalized datasets show that our rules are able to accurately predict when joins can be avoided safely, and in some cases, this led to significant reductions in the runtime of some popular feature selection methods.

【Keywords】: VC dimension; advanced analytics; feature engineering; feature selection; functional dependencies; key-foreign key joins; machine learning

4. Real-time Video Recommendation Exploration.

【Paper Link】 【Pages】:35-46

【Authors】: Yanxiang Huang ; Bin Cui ; Jie Jiang ; Kunqian Hong ; Wenyu Zhang ; Yiran Xie

【Abstract】: Video recommendation has attracted growing attention in recent years. However, conventional techniques have limitations in real-time processing, accuracy or scalability for the large-scale video data. To address the deficiencies of current recommendation systems, we introduce some new techniques to provide real-time and accurate recommendations to users in the video recommendation system of Tencent Inc. We develop a scalable online collaborative filtering algorithm based upon matrix factorization, with an adjustable updating strategy considering implicit feedback solution of different user actions. To select high-quality candidate videos for real-time top-N recommendation generation, we utilize additional factors like video type and time factor to compute similar videos. In addition, we propose the scalable implementation of our algorithm together with some optimizations to make the recommendations more efficient and accurate, including the demographic filtering and demographic training. To demonstrate the effectiveness and efficiency of our model, we conduct comprehensive experiments by collecting real data from Tencent Video. Furthermore, our video recommendation system is in production to provide recommendation services in Tencent Video, one of the largest video sites in China, and verifies its superiority in performance.

【Keywords】: big data; online matrix factorization; real-time; real-time collaborative filtering; video recommendation
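
As a rough illustration of the kind of online matrix factorization the abstract builds on (not Tencent's production algorithm), the sketch below applies one stochastic-gradient update per implicit-feedback event; the per-action confidence weights are a hypothetical stand-in for the paper's adjustable updating strategy.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_videos, k = 100, 200, 16
P = 0.1 * rng.normal(size=(n_users, k))       # user latent factors
Q = 0.1 * rng.normal(size=(n_videos, k))      # video latent factors

# Hypothetical per-action confidence weights for implicit feedback.
ACTION_WEIGHT = {"view": 1.0, "like": 2.0, "share": 3.0}

def online_update(user, video, action, lr=0.05, reg=0.01):
    """One stochastic-gradient step on a single implicit-feedback event (target = 1)."""
    p, q = P[user].copy(), Q[video].copy()
    w = ACTION_WEIGHT[action]
    err = 1.0 - p @ q
    P[user] += lr * (w * err * q - reg * p)
    Q[video] += lr * (w * err * p - reg * q)

def top_n(user, n=10):
    return np.argsort(-(Q @ P[user]))[:n]     # rank all candidate videos for this user

online_update(3, 42, "like")
print(top_n(3)[:5])
```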

5. Towards Globally Optimal Crowdsourcing Quality Management: The Uniform Worker Setting.

【Paper Link】 【Pages】:47-62

【Authors】: Akash Das Sarma ; Aditya G. Parameswaran ; Jennifer Widom

【Abstract】: We study crowdsourcing quality management, that is, given worker responses to a set of tasks, our goal is to jointly estimate the true answers for the tasks, as well as the quality of the workers. Prior work on this problem relies primarily on applying Expectation-Maximization (EM) on the underlying maximum likelihood problem to estimate true answers as well as worker quality. Unfortunately, EM only provides a locally optimal solution rather than a globally optimal one. Other solutions to the problem (that do not leverage EM) fail to provide global optimality guarantees as well. In this paper, we focus on filtering, where tasks require the evaluation of a yes/no predicate, and rating, where tasks elicit integer scores from a finite domain. We design algorithms for finding the global optimal estimates of correct task answers and worker quality for the underlying maximum likelihood problem, and characterize the complexity of these algorithms. Our algorithms conceptually consider all mappings from tasks to true answers (typically a very large number), leveraging two key ideas to reduce, by several orders of magnitude, the number of mappings under consideration, while preserving optimality. We also demonstrate that these algorithms often find more accurate estimates than EM-based algorithms. This paper makes an important contribution towards understanding the inherent complexity of globally optimal crowdsourcing quality management.

【Keywords】: EM; crowdsourcing; expectation-maximization; filtering; human computation; maximum likelihood; quality management; rating
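
Since the paper positions itself against EM, a minimal EM baseline for filtering (yes/no) tasks is sketched below for reference; it jointly estimates task answers and worker accuracies but, as the abstract notes, only reaches a local optimum.

```python
from collections import defaultdict

def em_filtering(responses, iters=20):
    """responses: list of (task, worker, answer) with answer in {0, 1}.
    Returns (posterior P(true answer = 1) per task, estimated worker accuracies).
    A plain EM baseline, which converges only to a local optimum."""
    tasks = {t for t, _, _ in responses}
    workers = {w for _, w, _ in responses}
    acc = {w: 0.7 for w in workers}                  # initial guess of worker accuracy
    post = {t: 0.5 for t in tasks}
    by_task = defaultdict(list)
    for t, w, a in responses:
        by_task[t].append((w, a))
    for _ in range(iters):
        # E-step: posterior over the true answer, assuming a uniform prior.
        for t in tasks:
            p1 = p0 = 1.0
            for w, a in by_task[t]:
                p1 *= acc[w] if a == 1 else 1 - acc[w]
                p0 *= acc[w] if a == 0 else 1 - acc[w]
            post[t] = p1 / (p1 + p0)
        # M-step: a worker's accuracy is the expected fraction of matching answers.
        hits, total = defaultdict(float), defaultdict(float)
        for t, w, a in responses:
            hits[w] += post[t] if a == 1 else 1 - post[t]
            total[w] += 1
        acc = {w: min(max(hits[w] / total[w], 0.01), 0.99) for w in workers}
    return post, acc

votes = [("t1", "w1", 1), ("t1", "w2", 1), ("t1", "w3", 0),
         ("t2", "w1", 0), ("t2", "w2", 0), ("t2", "w3", 0)]
print(em_filtering(votes))
```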

6. Building the Enterprise Fabric for Big Data with Vertica and Spark Integration.

【Paper Link】 【Pages】:63-75

【Authors】: Jeff LeFevre ; Rui Liu ; Cornelio Inigo ; Lupita Paz ; Edward Ma ; Malú Castellanos ; Meichun Hsu

【Abstract】: Enterprise customers increasingly require greater flexibility in the way they access and process their Big Data while at the same time they continue to request advanced analytics and access to diverse data sources. Yet customers also still require the robustness of enterprise class analytics for their mission-critical data. In this paper, we present our initial efforts toward a solution that satisfies the above requirements by integrating the HPE Vertica enterprise database with Apache Spark's open source big data computation engine. In particular, it enables fast, reliable transferring of data between Vertica and Spark; and deploying Machine Learning models created by Spark into Vertica for predictive analytics on Vertica data. This integration provides a fabric on which our customers get the best of both worlds: it extends Vertica's extensive SQL analytics capabilities with Spark's machine learning library (MLlib), giving Vertica users access to a wide range of ML functions; it also enables customers to leverage Spark as an advanced ETL engine for all data that require the guarantees offered by Vertica.

【Keywords】: PMML; analytics; big data; connector; database; spark; vertica

7. Truss Decomposition of Probabilistic Graphs: Semantics and Algorithms.

【Paper Link】 【Pages】:77-90

【Authors】: Xin Huang ; Wei Lu ; Laks V. S. Lakshmanan

【Abstract】: A key operation in network analysis is the discovery of cohesive subgraphs. The notion of k-truss has gained considerable popularity in this regard, based on its rich structure and efficient computability. However, many complex networks such as social, biological and communication networks feature uncertainty, best modeled using probabilities. Unfortunately the problem of discovering k-trusses in probabilistic graphs has received little attention to date. In this paper, given a probabilistic graph G, a number k and a parameter γ ∈ (0,1], we define a (k,γ)-truss as a maximal connected subgraph H ⊆ G, in which for each edge, the probability that it is contained in at least (k-2) triangles is at least γ. We develop an efficient dynamic programming algorithm for decomposing a probabilistic graph into such maximal (k,γ)-trusses. The above definition of a (k,γ)-truss is local in that the "witness" graphs that have the (k-2) triangles containing an edge in H may be quite different for distinct edges. Hence, we also propose a global (k,γ)-truss, which in addition to being a local (k,γ)-truss, has to satisfy the condition that the probability that H contains a k-truss is at least γ. We show that unlike local (k,γ)-trusses, the global (k,γ)-truss decomposition on a probabilistic graph is intractable. We propose a novel sampling technique which enables approximate discovery of global (k,γ)-trusses with high probability. Our extensive experiments on real datasets demonstrate the efficacy of our proposed approach and the usefulness of local and global (k,γ)-trusses.

【Keywords】: cohesive subgraphs; k-truss decomposition; probabilistic graphs; uncertain networks
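
The per-edge condition in a local (k,γ)-truss requires a tail probability: that the edge lies in at least k-2 triangles. Assuming the usual possible-worlds model with independent edges (and glossing over whether the edge itself is conditioned on, which the paper's semantics makes precise), this is a Poisson-binomial tail that a small dynamic program computes:

```python
def prob_at_least(triangle_probs, threshold):
    """P(at least `threshold` of the independent triangles exist).
    triangle_probs[i] = probability that the i-th triangle containing the edge
    is present (product of the probabilities of its other two edges)."""
    # dp[j] = probability that exactly j triangles exist, over prefixes of the list
    dp = [1.0] + [0.0] * len(triangle_probs)
    for p in triangle_probs:
        for j in range(len(dp) - 1, 0, -1):
            dp[j] = dp[j] * (1 - p) + dp[j - 1] * p
        dp[0] *= 1 - p
    return sum(dp[threshold:])

# Edge (u, v) with three common neighbours; the triangle through neighbour w exists
# with probability p(u,w) * p(v,w) under edge independence.
print(prob_at_least([0.9 * 0.8, 0.7 * 0.6, 0.5 * 0.5], threshold=2))  # k = 4 -> k-2 = 2
```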

8. Efficient and Progressive Group Steiner Tree Search.

【Paper Link】 【Pages】:91-106

【Authors】: Rong-Hua Li ; Lu Qin ; Jeffrey Xu Yu ; Rui Mao

【Abstract】: The Group Steiner Tree (GST) problem is a fundamental problem in the database area that has been successfully applied to keyword search in relational databases and team search in social networks. The state-of-the-art algorithm for the GST problem is a parameterized dynamic programming (DP) algorithm, which finds the optimal tree in O(3^k n + 2^k (n log n + m)) time, where k is the number of given groups, and m and n are the numbers of edges and nodes of the graph, respectively. The major limitations of the parameterized DP algorithm are twofold: (i) it is intractable even for very small values of k (e.g., k=8) in large graphs due to its exponential complexity, and (ii) it cannot generate a solution until the algorithm has completed its entire execution. To overcome these limitations, we propose an efficient and progressive GST algorithm in this paper, called PrunedDP. It is based on newly-developed optimal-tree decomposition and conditional tree merging techniques. The proposed algorithm not only drastically reduces the search space of the parameterized DP algorithm, but it also produces progressively-refined feasible solutions during algorithm execution. To further speed up the PrunedDP algorithm, we propose a progressive A*-search algorithm, based on several carefully-designed lower-bounding techniques. We conduct extensive experiments to evaluate our algorithms on several large scale real-world graphs. The results show that our best algorithm is not only able to generate progressively-refined feasible solutions, but it also finds the optimal solution with at least two orders of magnitude acceleration over the state-of-the-art algorithm, using much less memory.

【Keywords】: DP; A*-search algorithm; group Steiner tree
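
For reference, the parameterized DP cited as the state of the art (DPBF-style) maintains T(v, X), the minimum weight of a tree rooted at v covering the group subset X, and alternates between growing a tree along an edge and merging two trees at a common root; a hedged sketch of the standard recurrence:

```latex
T(v, \{g_i\}) = 0 \quad \text{if } v \in g_i
T(v, X) = \min\Big(\, \min_{(v,u) \in E} \big( w(v,u) + T(u, X) \big),\;
                     \min_{\emptyset \neq X' \subsetneq X} \big( T(v, X') + T(v, X \setminus X') \big) \Big)
```

In DPBF-style analyses, the 2^k(n log n + m) term comes from the edge-growing step (a Dijkstra-like relaxation per group subset) and the 3^k n term from enumerating subset splits; PrunedDP prunes this space and emits feasible trees before the DP completes.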

Session 2 - Privacy and Security 7

9. Publishing Attributed Social Graphs with Formal Privacy Guarantees.

【Paper Link】 【Pages】:107-122

【Authors】: Zach Jorgensen ; Ting Yu ; Graham Cormode

【Abstract】: Many data analysis tasks rely on the abstraction of a graph to represent relations between entities, with attributes on the nodes and edges. Since the relationships encoded are often sensitive, we seek effective ways to release representative graphs which nevertheless protect the privacy of the data subjects. Prior work on this topic has focused primarily on the graph structure in isolation, and has not provided ways to handle richer graphs with correlated attributes. We introduce an approach to release such graphs under the strong guarantee of differential privacy. We adapt existing graph models, and introduce a new one, and show how to augment them with meaningful privacy. This provides a complete workflow, where the input is a sensitive graph, and the output is a realistic synthetic graph. Our experimental study demonstrates that our process produces useful, accurate attributed graphs.

【Keywords】: attributed graphs; differential privacy; social graphs; social network analysis

10. Publishing Graph Degree Distribution with Node Differential Privacy.

【Paper Link】 【Pages】:123-138

【Authors】: Wei-Yen Day ; Ninghui Li ; Min Lyu

【Abstract】: Graph data publishing under node-differential privacy (node-DP) is challenging due to the huge sensitivity of queries. However, since a node in graph data oftentimes represents a person, node-DP is necessary to achieve personal data protection. In this paper, we investigate the problem of publishing the degree distribution of a graph under node-DP by exploring the projection approach to reduce the sensitivity. We propose two approaches based on aggregation and cumulative histogram to publish the degree distribution. The experiments demonstrate that our approaches greatly reduce the error of approximating the true degree distribution and have significant improvement over existing works. We also present the introspective analysis for understanding the factors of publishing the degree distribution with node-DP.

【Keywords】: degree distribution; differential privacy; private graph publishing
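
A minimal sketch of the projection idea the abstract explores: bound every degree by θ, then release a Laplace-noised degree histogram calibrated to the usual 2θ+1 L1-sensitivity of θ-bounded graphs under node-DP. The naive edge-dropping projection used here is itself sensitive to single-node changes, which is exactly the difficulty the paper's aggregation and cumulative-histogram approaches address; this is not their algorithm.

```python
import numpy as np

def private_degree_histogram(adj, theta, epsilon, seed=7):
    """adj: dict node -> set of neighbours (undirected). Returns a noisy histogram
    over degrees 0..theta after projecting the graph to maximum degree theta."""
    rng = np.random.default_rng(seed)
    adj = {u: set(vs) for u, vs in adj.items()}
    for u in list(adj):                          # naive projection: drop excess edges
        while len(adj[u]) > theta:
            v = adj[u].pop()
            adj[v].discard(u)
    hist = np.zeros(theta + 1)
    for u in adj:
        hist[len(adj[u])] += 1
    # In a theta-bounded graph, adding or removing one node changes the histogram by
    # at most 2*theta + 1 in L1 norm, so calibrate the Laplace noise to that bound.
    return hist + rng.laplace(scale=(2 * theta + 1) / epsilon, size=theta + 1)

graph = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(private_degree_histogram(graph, theta=2, epsilon=1.0))
```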

11. Principled Evaluation of Differentially Private Algorithms using DPBench.

【Paper Link】 【Pages】:139-154

【Authors】: Michael Hay ; Ashwin Machanavajjhala ; Gerome Miklau ; Yan Chen ; Dan Zhang

【Abstract】: Differential privacy has become the dominant standard in the research community for strong privacy protection. There has been a flood of research into query answering algorithms that meet this standard. Algorithms are becoming increasingly complex, and in particular, the performance of many emerging algorithms is data dependent, meaning the distribution of the noise added to query answers may change depending on the input data. Theoretical analysis typically only considers the worst case, making empirical study of average case performance increasingly important. In this paper we propose a set of evaluation principles which we argue are essential for sound evaluation. Based on these principles we propose DPBench, a novel evaluation framework for standardized evaluation of privacy algorithms. We then apply our benchmark to evaluate algorithms for answering 1- and 2-dimensional range queries. The result is a thorough empirical study of 15 published algorithms on a total of 27 datasets that offers new insights into algorithm behavior---in particular the influence of dataset scale and shape---and a more complete characterization of the state of the art. Our methodology is able to resolve inconsistencies in prior empirical studies and place algorithm performance in context through comparison to simple baselines. Finally, we pose open research questions which we hope will guide future algorithm design.

【Keywords】: algorithm evaluation; differential privacy; privacy

12. PrivTree: A Differentially Private Algorithm for Hierarchical Decompositions.

【Paper Link】 【Pages】:155-170

【Authors】: Jun Zhang ; Xiaokui Xiao ; Xing Xie

【Abstract】: Given a set D of tuples defined on a domain Ω, we study differentially private algorithms for constructing a histogram over Ω to approximate the tuple distribution in D. Existing solutions for the problem mostly adopt a hierarchical decomposition approach, which recursively splits Ω into sub-domains and computes a noisy tuple count for each sub-domain, until all noisy counts are below a certain threshold. This approach, however, requires that we (i) impose a limit h on the recursion depth in the splitting of Ω and (ii) set the noise in each count to be proportional to h. The choice of h is a serious dilemma: a small h makes the resulting histogram too coarse-grained, while a large h leads to excessive noise in the tuple counts used in deciding whether sub-domains should be split. Furthermore, h cannot be directly tuned based on D; otherwise, the choice of h itself reveals private information and violates differential privacy. To remedy the deficiency of existing solutions, we present PrivTree, a histogram construction algorithm that adopts hierarchical decomposition but completely eliminates the dependency on a pre-defined h. The core of PrivTree is a novel mechanism that (i) exploits a new analysis on the Laplace distribution and (ii) enables us to use only a constant amount of noise in deciding whether a sub-domain should be split, without worrying about the recursion depth of splitting. We demonstrate the application of PrivTree in modelling spatial data, and show that it can be extended to handle sequence data (where the decision in sub-domain splitting is not based on tuple counts but a more sophisticated measure). Our experiments on a variety of real datasets show that PrivTree considerably outperforms the states of the art in terms of data utility.

【Keywords】: differential privacy; hierarchical decompositions

13. Adaptive Indexing over Encrypted Numeric Data.

【Paper Link】 【Pages】:171-183

【Authors】: Panagiotis Karras ; Artyom Nikitin ; Muhammad Saad ; Rudrika Bhatt ; Denis Antyukhov ; Stratos Idreos

【Abstract】: Today, outsourcing query processing tasks to remote cloud servers becomes a viable option; such outsourcing calls for encrypting data stored at the server so as to render it secure against eavesdropping adversaries and/or an honest-but-curious server itself. At the same time, to be efficiently managed, outsourced data should be indexed, and even adaptively so, as a side-effect of query processing. Computationally heavy encryption schemes render such outsourcing unattractive; an alternative, Order-Preserving Encryption Scheme (OPES), intentionally preserves and reveals the order in the data, hence is unattractive from the security viewpoint. In this paper, we propose and analyze a scheme for lightweight and indexable encryption, based on linear-algebra operations. Our scheme provides higher security than OPES and allows for range and point queries to be efficiently evaluated over encrypted numeric data, with decryption performed at the client side. We implement a prototype that performs incremental, query-triggered adaptive indexing over encrypted numeric data based on this scheme, without leaking order information in advance, and without prohibitive overhead, as our extensive experimental study demonstrates.

【Keywords】: adaptive indexing; cloud computing; data encryption; database cracking

14. Practical Private Range Search Revisited.

【Paper Link】 【Pages】:185-198

【Authors】: Ioannis Demertzis ; Stavros Papadopoulos ; Odysseas Papapetrou ; Antonios Deligiannakis ; Minos N. Garofalakis

【Abstract】: We consider a data owner that outsources its dataset to an untrusted server. The owner wishes to enable the server to answer range queries on a single attribute, without compromising the privacy of the data and the queries. There are several schemes on "practical" private range search (mainly in Databases venues) that attempt to strike a trade-off between efficiency and security. Nevertheless, these methods either lack provable security guarantees, or permit unacceptable privacy leakages. In this paper, we take an interdisciplinary approach, which combines the rigor of Security formulations and proofs with efficient Data Management techniques. We construct a wide set of novel schemes with realistic security/performance trade-offs, adopting the notion of Searchable Symmetric Encryption (SSE) primarily proposed for keyword search. We reduce range search to multi-keyword search using range covering techniques with tree-like indexes. We demonstrate that, given any secure SSE scheme, the challenge boils down to (i) formulating leakages that arise from the index structure, and (ii) minimizing false positives incurred by some schemes under heavy data skew. We analytically detail the superiority of our proposals over prior work and experimentally confirm their practicality.

【Keywords】: database security and privacy; range queries; searchable symmetric encryption
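
The reduction from range search to multi-keyword SSE rests on covering a query range with the nodes of a binary tree over the domain (dyadic intervals), each of which is indexed as a "keyword". Only the covering step is sketched below; the SSE layer, the leakage analysis, and the false-positive handling under skew are the paper's subject and are omitted.

```python
def dyadic_cover(lo, hi, domain_bits):
    """Cover the query range [lo, hi] with O(domain_bits) dyadic intervals,
    i.e., nodes of a complete binary tree over the domain [0, 2^domain_bits - 1]."""
    cover = []
    def recurse(node_lo, node_hi):
        if node_hi < lo or node_lo > hi:
            return                                  # node disjoint from the query
        if lo <= node_lo and node_hi <= hi:
            cover.append((node_lo, node_hi))        # node fully inside: one "keyword"
            return
        mid = (node_lo + node_hi) // 2
        recurse(node_lo, mid)
        recurse(mid + 1, node_hi)
    recurse(0, (1 << domain_bits) - 1)
    return cover

# A range query becomes a small multi-keyword SSE query, one keyword per interval.
print(dyadic_cover(3, 12, domain_bits=4))           # [(3, 3), (4, 7), (8, 11), (12, 12)]
```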

15. Privacy Preserving Subgraph Matching on Large Graphs in Cloud.

【Paper Link】 【Pages】:199-213

【Authors】: Zhao Chang ; Lei Zou ; Feifei Li

【Abstract】: The wide presence of large graph data and the increasing popularity of storing data in the cloud drive the needs for graph query processing on a remote cloud. But a fundamental challenge is to process user queries without compromising sensitive information. This work focuses on privacy preserving subgraph matching in a cloud server. The goal is to minimize the overhead on both cloud and client sides for subgraph matching, without compromising users' sensitive information. To that end, we transform an original graph G into a privacy preserving graph Gk, which meets the requirement of an existing privacy model known as k-automorphism. By making use of the symmetry in a k-automorphic graph, a subgraph matching query can be efficiently answered using a graph Go, a small subset of Gk. This approach saves both space and query cost in the cloud server. We also anonymize the query graphs to protect their label information using label generalization technique. To reduce the search space for a subgraph matching query, we propose a cost model to select the more effective label combinations. The effectiveness and efficiency of our method are demonstrated through extensive experimental results on real datasets.

【Keywords】: cloud; graph; privacy; subgraph match

Session 3 - Logical and Physical Database Design 7

16. The Snowflake Elastic Data Warehouse.

【Paper Link】 【Pages】:215-226

【Authors】: Benoît Dageville ; Thierry Cruanes ; Marcin Zukowski ; Vadim Antonov ; Artin Avanes ; Jon Bock ; Jonathan Claybaugh ; Daniel Engovatov ; Martin Hentschel ; Jiansheng Huang ; Allison W. Lee ; Ashish Motivala ; Abdul Q. Munir ; Steven Pelley ; Peter Povinec ; Greg Rahn ; Spyridon Triantafyllis ; Philipp Unterbrunner

【Abstract】: We live in the golden age of distributed computing. Public cloud platforms now offer virtually unlimited compute and storage resources on demand. At the same time, the Software-as-a-Service (SaaS) model brings enterprise-class systems to users who previously could not afford such systems due to their cost and complexity. Alas, traditional data warehousing systems are struggling to fit into this new environment. For one thing, they have been designed for fixed resources and are thus unable to leverage the cloud's elasticity. For another thing, their dependence on complex ETL pipelines and physical tuning is at odds with the flexibility and freshness requirements of the cloud's new types of semi-structured data and rapidly evolving workloads. We decided a fundamental redesign was in order. Our mission was to build an enterprise-ready data warehousing solution for the cloud. The result is the Snowflake Elastic Data Warehouse, or "Snowflake" for short. Snowflake is a multi-tenant, transactional, secure, highly scalable and elastic system with full SQL support and built-in extensions for semi-structured and schema-less data. The system is offered as a pay-as-you-go service in the Amazon cloud. Users upload their data to the cloud and can immediately manage and query it using familiar tools and interfaces. Implementation began in late 2012 and Snowflake has been generally available since June 2015. Today, Snowflake is used in production by a growing number of small and large organizations alike. The system runs several million queries per day over multiple petabytes of data. In this paper, we describe the design of Snowflake and its novel multi-cluster, shared-data architecture. The paper highlights some of the key features of Snowflake: extreme elasticity and availability, semi-structured and schema-less data, time travel, and end-to-end security. It concludes with lessons learned and an outlook on ongoing work.

【Keywords】: data warehousing; database as a service; multi-cluster shared data architecture

17. Closing the Functional and Performance Gap between SQL and NoSQL.

【Paper Link】 【Pages】:227-238

【Authors】: Zhen Hua Liu ; Beda Christoph Hammerschmidt ; Doug McMahon ; Ying Liu ; Hui Joe Chang

【Abstract】: Oracle release 12cR1 supports JSON data management that enables users to store, index and query JSON data along with relational data. The integration of the JSON data model into the RDBMS allows a new paradigm of data management where data is storable, indexable and queryable without upfront schema definition. We call this new paradigm Flexible Schema Data Management (FSDM). In this paper, we present enhancements to Oracle's JSON data management in the upcoming 12cR2 release. We present JSON DataGuide, an auto-computed dynamic soft schema for JSON collections that closes the functional gap between the fixed-schema SQL world and the schema-less NoSQL world. We present a self-contained query friendly binary format for encoding JSON (OSON) to close the query performance gap between schema-encoded relational data and schema-free JSON textual data. The addition of these new features makes the Oracle RDBMS well suited to both fixed-schema SQL and flexible-schema NoSQL use cases, and allows users to freely mix the two paradigms in a single data management system.

【Keywords】: JSON; SQL/JSON; SQL/XML; XML; flexible schema; noSQL; schema-less
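
Not Oracle's JSON DataGuide, but a toy version of an auto-computed soft schema: walk a collection of JSON documents and record, for every path, the set of scalar types observed, so that schema drift becomes visible without any upfront schema definition.

```python
import json
from collections import defaultdict

def soft_schema(documents):
    """Collect path -> set of JSON scalar types observed across a document collection."""
    paths = defaultdict(set)
    def walk(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}")
        elif isinstance(value, list):
            for child in value:
                walk(child, f"{path}[*]")
        else:
            paths[path].add(type(value).__name__)
    for doc in documents:
        walk(doc, "$")
    return dict(paths)

docs = [json.loads(s) for s in (
    '{"name": "a", "tags": ["x", "y"], "views": 10}',
    '{"name": "b", "views": "10"}',          # type drift shows up in the schema
)]
print(soft_schema(docs))
# e.g. {'$.name': {'str'}, '$.tags[*]': {'str'}, '$.views': {'int', 'str'}}
```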

18. Have Your Data and Query It Too: From Key-Value Caching to Big Data Management.

【Paper Link】 【Pages】:239-251

【Authors】: Dipti Borkar ; Ravi Mayuram ; Gerald Sangudi ; Michael Carey

【Abstract】: Couchbase Server is a rethinking of the database given the current set of realities. Memory today is much cheaper than disks were when traditional databases were designed back in the 1970's, and networks are much faster and much more reliable than ever before. Application agility is also an extremely important requirement. Today's Couchbase Server is a memory- and network-centric, shared-nothing, auto-partitioned, and distributed NoSQL database system that offers both key-based and secondary index-based data access paths as well as API- and query-based data access capabilities. This is a major change from Couchbase's roots; in its early days, its focus was entirely on high performance and highly available key-value (memcache) based caching. Customer needs and competitive pressures in the evolving non-relational database market also accelerated this change. This paper describes the architectural changes needed to address the requirements posed by next-generation database applications. In addition, it details the implementation of such an architecture using Couchbase Server and explains the evolution of Couchbase Server from its early roots to its present form. Particular attention is paid to how today's Couchbase Server cluster architecture is influenced by the memory-first, high-performance, and scalability demands of typical customer deployments. Key features include a layer-consolidated cache, a consistency-controllable interplay between updates, indexes, and queries, and a unique "multi-dimensional" approach to cluster scaling. The paper closes with a look at future plans for supporting semi-structured operational data analytics in addition to today's more OLTP-like, front-facing use cases.

【Keywords】: document database; key-value; noSQL

19. Ambry: LinkedIn's Scalable Geo-Distributed Object Store.

【Paper Link】 【Pages】:253-265

【Authors】: Shadi A. Noghabi ; Sriram Subramanian ; Priyesh Narayanan ; Sivabalan Narayanan ; Gopalakrishna Holla ; Mammad Zadeh ; Tianwei Li ; Indranil Gupta ; Roy H. Campbell

【Abstract】: The infrastructure beneath a worldwide social network has to continually serve billions of variable-sized media objects such as photos, videos, and audio clips. These objects must be stored and served with low latency and high throughput by a system that is geo-distributed, highly scalable, and load-balanced. Existing file systems and object stores face several challenges when serving such large objects. We present Ambry, a production-quality system for storing large immutable data (called blobs). Ambry is designed in a decentralized way and leverages techniques such as logical blob grouping, asynchronous replication, rebalancing mechanisms, zero-cost failure detection, and OS caching. Ambry has been running in LinkedIn's production environment for the past 2 years, serving up to 10K requests per second across more than 400 million users. Our experimental evaluation reveals that Ambry offers high efficiency (utilizing up to 88% of the network bandwidth), low latency (less than 50 ms latency for a 1 MB object), and load balancing (improving imbalance of request rate among disks by 8x-10x).

【Keywords】: geographically distributed; load balancing; object store; scalable

20. SQL Schema Design: Foundations, Normal Forms, and Normalization.

【Paper Link】 【Pages】:267-279

【Authors】: Henning Köhler ; Sebastian Link

【Abstract】: Normalization helps us find a database schema at design time that can process the most frequent updates efficiently at run time. Unfortunately, relational normalization only works for idealized database instances in which duplicates and null markers are not present. On one hand, these features occur frequently in real-world data compliant with the industry standard SQL, and especially in modern application domains. On the other hand, the features impose challenges that have made it impossible so far to extend the existing forty year old normalization framework to SQL. We introduce a new class of functional dependencies and show that they provide the right notion for SQL schema design. Axiomatic and linear-time algorithmic characterizations of the associated implication problem are established. These foundations enable us to propose a Boyce-Codd normal form for SQL. Indeed, we justify the normal form by showing that it permits precisely those SQL instances which are free from data redundancy. Unlike the relational case, there are SQL schemata that cannot be converted into Boyce-Codd normal form. Nevertheless, for an expressive sub-class of our functional dependencies we establish a normalization algorithm that always produces a schema in Value-Redundancy free normal form. This normal form permits precisely those instances which are free from any redundant data value occurrences other than the null marker. Experiments show that our functional dependencies occur frequently in real-world data and that they are effective in eliminating redundant values from these data sets without loss of information.

【Keywords】: Boyce-Codd normal form; SQL; axiomatization; discovery; functional dependency; implication problem; key; normalization; null; redundancy

21. SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment.

【Paper Link】 【Pages】:281-293

【Authors】: Shrainik Jain ; Dominik Moritz ; Daniel Halperin ; Bill Howe ; Ed Lazowska

【Abstract】: We analyze the workload from a multi-year deployment of a database-as-a-service platform targeting scientists and data scientists with minimal database experience. Our hypothesis was that relatively minor changes to the way databases are delivered can increase their use in ad hoc analysis environments. The web-based SQLShare system emphasizes easy dataset-at-a-time ingest, relaxed schemas and schema inference, easy view creation and sharing, and full SQL support. We find that these features have helped attract workloads typically associated with scripts and files rather than relational databases: complex analytics, routine processing pipelines, data publishing, and collaborative analysis. Quantitatively, these workloads are characterized by shorter dataset "lifetimes", higher query complexity, and higher data complexity. We report on usage scenarios that suggest SQL is being used in place of scripts for one-off data analysis and ad hoc data sharing. The workload suggests that a new class of relational systems emphasizing short-term, ad hoc analytics over engineered schemas may improve uptake of database technology in data science contexts. Our contributions include a system design for delivering databases into these contexts, a description of a public research query workload dataset released to advance research in analytic data systems, and an initial analysis of the workload that provides evidence of new use cases under-supported in existing systems.

【Keywords】: database management as a cloud service; database management systems; relational databases

22. Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data.

【Paper Link】 【Pages】:295-310

【Authors】: Michael DiScala ; Daniel J. Abadi

【Abstract】: Self-describing key-value data formats such as JSON are becoming increasingly popular as application developers choose to avoid the rigidity imposed by the relational model. Database systems designed for these self-describing formats, such as MongoDB, encourage users to use denormalized, heavily nested data models so that relationships across records and other schema information need not be predefined or standardized. Such data models contribute to long-term development complexity, as their lack of explicit entity and relationship tracking burdens new developers unfamiliar with the dataset. Furthermore, the large amount of data repetition present in such data layouts can introduce update anomalies and poor scan performance, which reduce both the quality and performance of analytics over the data. In this paper we present an algorithm that automatically transforms the denormalized, nested data commonly found in NoSQL systems into traditional relational data that can be stored in a standard RDBMS. This process includes a schema generation algorithm that discovers relationships across the attributes of the denormalized datasets in order to organize those attributes into relational tables. It further includes a matching algorithm that discovers sets of attributes that represent overlapping entities and merges those sets together. These algorithms reduce data repetition, allow the use of data analysis tools targeted at relational data, accelerate scan-intensive algorithms over the data, and help users gain a semantic understanding of complex, nested datasets.

【Keywords】: JSON; deduplication; denormalized data; entity extraction; functional dependencies; functional dependency mining; key-value data; normalization; relational databases; schema extraction; schema generation; schema matching; semistructured data; semistructured-to-relational mappings

Session 4 - New Storage and Network Architectures 6

23. Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation.

【Paper Link】 【Pages】:311-326

【Authors】: Harald Lang ; Tobias Mühlbauer ; Florian Funke ; Peter A. Boncz ; Thomas Neumann ; Alfons Kemper

【Abstract】: This work aims at reducing the main-memory footprint in high performance hybrid OLTP & OLAP databases, while retaining high query performance and transactional throughput. For this purpose, an innovative compressed columnar storage format for cold data, called Data Blocks is introduced. Data Blocks further incorporate a new light-weight index structure called Positional SMA that narrows scan ranges within Data Blocks even if the entire block cannot be ruled out. To achieve highest OLTP performance, the compression schemes of Data Blocks are very light-weight, such that OLTP transactions can still quickly access individual tuples. This sets our storage scheme apart from those used in specialized analytical databases where data must usually be bit-unpacked. Up to now, high-performance analytical systems use either vectorized query execution or just-in-time (JIT) query compilation. The fine-grained adaptivity of Data Blocks necessitates the integration of the best features of each approach by an interpreted vectorized scan subsystem feeding into JIT-compiled query pipelines. Experimental evaluation of HyPer, our full-fledged hybrid OLTP & OLAP database system, shows that Data Blocks accelerate performance on a variety of query workloads while retaining high transaction throughput.

【Keywords】: compression; hybrid OLTP&OLAP; query compilation; vectorization
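
A toy version of block-level small materialized aggregates: each cold block keeps its min/max so that a scan can rule the block out without touching its tuples. HyPer's Positional SMA additionally narrows the scan range inside a block and operates on compressed data; none of that is modeled here.

```python
# Block-wise min/max "small materialized aggregates" used to prune a full scan.
BLOCK = 1024

def build_blocks(values):
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = values[start:start + BLOCK]
        blocks.append({"start": start, "data": chunk,
                       "min": min(chunk), "max": max(chunk)})
    return blocks

def scan_equal(blocks, needle):
    hits = []
    for b in blocks:
        if needle < b["min"] or needle > b["max"]:
            continue                       # whole block ruled out by the aggregates
        for offset, v in enumerate(b["data"]):
            if v == needle:
                hits.append(b["start"] + offset)
    return hits

values = list(range(5000)) + [42] * 3
blocks = build_blocks(values)
print(scan_equal(blocks, 42))              # only blocks whose [min, max] covers 42 are touched
```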

24. GeckoFTL: Scalable Flash Translation Techniques For Very Large Flash Devices.

【Paper Link】 【Pages】:327-342

【Authors】: Niv Dayan ; Philippe Bonnet ; Stratos Idreos

【Abstract】: The volume of metadata needed by a flash translation layer (FTL) is proportional to the storage capacity of a flash device. Ideally, this metadata should reside in the device's integrated RAM to enable fast access. However, as flash devices scale to terabytes, the necessary volume of metadata is exceeding the available integrated RAM. Moreover, recovery time after power failure, which is proportional to the size of the metadata, is becoming impractical. The simplest solution is to persist more metadata in flash. The problem is that updating metadata in flash increases the amount of internal IOs thereby harming performance and device lifetime. In this paper, we identify a key component of the metadata called the Page Validity Bitmap (PVB) as the bottleneck. PVB is used by the garbage-collectors of state-of-the-art FTLs to keep track of which physical pages in the device are invalid. PVB constitutes 95% of the FTL's RAM-resident metadata, and recovering PVB after power fails takes a significant proportion of the overall recovery time. To solve this problem, we propose a page-associative FTL called GeckoFTL, whose central innovation is replacing PVB with a new data structure called Logarithmic Gecko. Logarithmic Gecko is similar to an LSM-tree in that it first logs updates and later reorganizes them to ensure fast and scalable access time. Relative to the baseline of storing PVB in flash, Logarithmic Gecko enables cheaper updates at the cost of slightly more expensive garbage-collection queries. We show that this is a good trade-off because (1) updates are intrinsically more frequent than garbage-collection queries to page validity metadata, and (2) flash writes are more expensive than flash reads. We demonstrate analytically and empirically through simulation that GeckoFTL achieves a 95% reduction in space requirements and at least a 51% reduction in recovery time by storing page validity metadata in flash while keeping the contribution to internal IO overheads 98% lower than the baseline.

【Keywords】: EMMC; FTL; LSM-tree; SSD; embedded multimedia card; flash; flash translation table; log-structured-merge tree; lsm-tree; solid state drive

25. SHARE Interface in Flash Storage for Relational and NoSQL Databases.

【Paper Link】 【Pages】:343-354

【Authors】: Gihwan Oh ; Chiyoung Seo ; Ravi Mayuram ; Yang-Suk Kee ; Sang-Won Lee

【Abstract】: Database consistency and recoverability require guaranteeing write atomicity for one or more pages. However, contemporary database systems consider write operations non-atomic. Thus, many database storage engines have traditionally relied on either journaling or copy-on-write approaches for atomic propagation of updated pages to the storage. This reliance achieves write atomicity at the cost of various write amplifications such as redundant writes, tree-wandering, and compaction. This write amplification results in reduced performance and, for flash storage, accelerates device wear-out. In this paper, we propose a flash storage interface, SHARE. Being able to explicitly remap the address mapping inside flash storage using the SHARE interface enables host-side database storage engines to achieve write atomicity without causing write amplification. We have implemented SHARE on a real SSD board, OpenSSD, and modified MySQL/InnoDB and Couchbase NoSQL storage engines to make them compatible with the extended SHARE interface. Our experimental results show that these SHARE-based MySQL/InnoDB and Couchbase configurations can significantly boost database performance. In particular, the inevitable and costly Couchbase compaction process can complete without copying any data pages.

【Keywords】: address remapping; flash storage devices; flash translation layer; share interface; write atomicity

26. Accelerating Relational Databases by Leveraging Remote Memory and RDMA.

【Paper Link】 【Pages】:355-370

【Authors】: Feng Li ; Sudipto Das ; Manoj Syamala ; Vivek R. Narasayya

【Abstract】: Memory is a crucial resource in relational databases (RDBMSs). When there is insufficient memory, RDBMSs are forced to use slower media such as SSDs or HDDs, which can significantly degrade workload performance. Cloud database services are deployed in data centers where network adapters supporting remote direct memory access (RDMA) at low latency and high bandwidth are becoming prevalent. We study the novel problem of how a Symmetric Multi-Processing (SMP) RDBMS, whose memory demands exceed locally-available memory, can leverage available remote memory in the cluster accessed via RDMA to improve query performance. We expose available memory on remote servers using a lightweight file API that allows an SMP RDBMS to leverage the benefits of remote memory with modest changes. We identify and implement several novel scenarios to demonstrate these benefits, and address design challenges that are crucial for efficient implementation. We implemented the scenarios in Microsoft SQL Server engine and present the first end-to-end study to demonstrate benefits of remote memory for a variety of micro-benchmarks and industry-standard benchmarks. Compared to using disks when memory is insufficient, we improve the throughput and latency of queries with short reads and writes by 3X to 10X, while improving the latency of multiple TPC-H and TPC-DS queries by 2X to 100X.

【Keywords】: RDMA; buffer pool extension; opportunistic caching; relational databases; remote memory; semantic caching

27. FPTree: A Hybrid SCM-DRAM Persistent and Concurrent B-Tree for Storage Class Memory.

【Paper Link】 【Pages】:371-386

【Authors】: Ismail Oukid ; Johan Lasperas ; Anisoara Nica ; Thomas Willhalm ; Wolfgang Lehner

【Abstract】: The advent of Storage Class Memory (SCM) is driving a rethink of storage systems towards a single-level architecture where memory and storage are merged. In this context, several works have investigated how to design persistent trees in SCM as a fundamental building block for these novel systems. However, these trees are significantly slower than DRAM-based counterparts since trees are latency-sensitive and SCM exhibits higher latencies than DRAM. In this paper we propose a novel hybrid SCM-DRAM persistent and concurrent B-Tree, named Fingerprinting Persistent Tree (FPTree) that achieves similar performance to DRAM-based counterparts. In this novel design, leaf nodes are persisted in SCM while inner nodes are placed in DRAM and rebuilt upon recovery. The FPTree uses Fingerprinting, a technique that limits the expected number of in-leaf probed keys to one. In addition, we propose a hybrid concurrency scheme for the FPTree that is partially based on Hardware Transactional Memory. We conduct a thorough performance evaluation and show that the FPTree outperforms state-of-the-art persistent trees with different SCM latencies by up to a factor of 8.2. Moreover, we show that the FPTree scales very well on a machine with 88 logical cores. Finally, we integrate the evaluated trees in memcached and a prototype database. We show that the FPTree incurs an almost negligible performance overhead over using fully transient data structures, while significantly outperforming other persistent trees.

【Keywords】: B-tree; data management; data structures; database recovery; hardware transactional memory; storage class memory
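
A minimal, in-memory illustration of fingerprinting: a one-byte hash per key is stored in the leaf and scanned first, so a successful lookup compares roughly one full key in expectation. Persistence in SCM, the DRAM-resident inner nodes, and the HTM-based concurrency scheme are all outside this sketch.

```python
import hashlib

LEAF_CAPACITY = 32

def fingerprint(key: bytes) -> int:
    return hashlib.blake2b(key, digest_size=1).digest()[0]   # one-byte hash

class Leaf:
    def __init__(self):
        self.fps = []        # fingerprints, scanned first (cheap, cache-friendly)
        self.keys = []       # full keys, probed only on a fingerprint match
        self.values = []

    def insert(self, key: bytes, value):
        if len(self.keys) >= LEAF_CAPACITY:
            raise RuntimeError("leaf full; a real tree would split here")
        self.fps.append(fingerprint(key))
        self.keys.append(key)
        self.values.append(value)

    def get(self, key: bytes):
        fp = fingerprint(key)
        for i, f in enumerate(self.fps):
            if f == fp and self.keys[i] == key:   # ~1 expected full-key comparison
                return self.values[i]
        return None

leaf = Leaf()
leaf.insert(b"sigmod", 2016)
print(leaf.get(b"sigmod"))
```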

28. Micro-architectural Analysis of In-memory OLTP.

【Paper Link】 【Pages】:387-402

【Authors】: Utku Sirin ; Pinar Tözün ; Danica Porobic ; Anastasia Ailamaki

【Abstract】: Micro-architectural behavior of traditional disk-based online transaction processing (OLTP) systems has been investigated extensively over the past couple of decades. Results show that traditional OLTP mostly under-utilizes the available micro-architectural resources. In-memory OLTP systems, on the other hand, process all the data in main-memory, and therefore, can omit the buffer pool. In addition, they usually adopt more lightweight concurrency control mechanisms, cache-conscious data structures, and cleaner codebases since they are usually designed from scratch. Hence, we expect significant differences in micro-architectural behavior when running OLTP on platforms optimized for in-memory processing as opposed to disk-based database systems. In particular, we expect that in-memory systems exploit micro-architectural features such as instruction and data caches significantly better than disk-based systems. This paper sheds light on the micro-architectural behavior of in-memory database systems by analyzing and contrasting it to the behavior of disk-based systems when running OLTP workloads. The results show that despite all the design changes, in-memory OLTP exhibits very similar micro-architectural behavior to disk-based OLTP systems: more than half of the execution time goes to memory stalls where L1 instruction misses and the long-latency data misses from the last-level cache are the dominant factors in the overall stall time. Even though aggressive compilation optimizations can almost eliminate instruction misses, the reduction in instruction stalls amplifies the impact of last-level cache data misses. As a result, the number of instructions retired per cycle barely reaches one on machines that are able to retire up to four for both traditional disk-based and new generation in-memory OLTP.

【Keywords】: OLTP; in-memory OLTP systems; micro-architectural analysis; workload characterization

Session 5 - Graphs 1: Infrastructure and Processing on Modern Hardware 6

29. iBFS: Concurrent Breadth-First Search on GPUs.

【Paper Link】 【Pages】:403-416

【Authors】: Hang Liu ; H. Howie Huang ; Yang Hu

【Abstract】: Breadth-First Search (BFS) is a key graph algorithm with many important applications. In this work, we focus on a special class of graph traversal algorithms - concurrent BFS - where multiple breadth-first traversals are performed simultaneously on the same graph. We have designed and developed a new approach called iBFS that is able to run i concurrent BFSes from i distinct source vertices, very efficiently on Graphics Processing Units (GPUs). iBFS consists of three novel designs. First, iBFS develops a single GPU kernel for joint traversal of concurrent BFS to take advantage of shared frontiers across different instances. Second, outdegree-based GroupBy rules enable iBFS to selectively run a group of BFS instances which further maximizes the frontier sharing within such a group. Third, iBFS brings additional performance benefit by utilizing highly optimized bitwise operations on GPUs, which allows a single GPU thread to inspect a vertex for concurrent BFS instances. The evaluation on a wide spectrum of graph benchmarks shows that iBFS on one GPU runs up to 30x faster than executing BFS instances sequentially, and on 112 GPUs achieves near linear speedup with the maximum performance of 57,267 billion traversed edges per second (TEPS).

【Keywords】: GPUs
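
A CPU-side sketch of the bitwise idea: each bit of a per-vertex word records whether the corresponding BFS instance has reached that vertex, so inspecting one word serves up to 64 concurrent traversals at once. The GPU kernel, frontier sharing, and outdegree-based GroupBy batching are the paper's contributions and are not modeled.

```python
from collections import defaultdict

def concurrent_bfs(adj, sources):
    """adj: dict vertex -> list of neighbours; sources: up to 64 source vertices.
    Returns visited[v] = bitmask of the BFS instances that reached v."""
    assert len(sources) <= 64
    visited = defaultdict(int)
    frontier = defaultdict(int)                    # vertex -> bitmask of instances whose
    for i, s in enumerate(sources):                # frontier currently contains it
        visited[s] |= 1 << i
        frontier[s] |= 1 << i
    while frontier:
        nxt = defaultdict(int)
        for v, mask in frontier.items():
            for u in adj.get(v, ()):
                new_bits = mask & ~visited[u]      # instances reaching u for the first time
                if new_bits:
                    visited[u] |= new_bits
                    nxt[u] |= new_bits
        frontier = nxt
    return dict(visited)

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(concurrent_bfs(adj, sources=[0, 1]))
```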

30. Tornado: A System For Real-Time Iterative Analysis Over Evolving Data.

【Paper Link】 【Pages】:417-430

【Authors】: Xiaogang Shi ; Bin Cui ; Yingxia Shao ; Yunhai Tong

【Abstract】: There is an increasing demand for real-time iterative analysis over evolving data. In this paper, we propose a novel execution model to obtain timely results at given instants. We notice that a loop starting from a good initial guess usually converges fast. Hence we organize the execution of iterative methods over evolving data into a main loop and several branch loops. The main loop is responsible for the gathering of inputs and maintains the approximation to the timely results. When the results are requested by a user, a branch loop is forked from the main loop and iterates until convergence to produce the results. Using the approximation of the main loop, the branch loops can start from a place near the fixed-point and converge quickly. Since the approximation error depends on the inputs not yet reflected in the approximation, we develop a novel bounded asynchronous iteration model to enhance the timeliness. The bounded asynchronous iteration model can achieve fine-grained updates while ensuring correctness for general iterative methods. Based on the proposed execution model, we design and implement a prototype system named Tornado on top of Storm. Tornado provides a graph-parallel programming model which eases the programming of most real-time iterative analysis tasks. The reliability is also enhanced by provisioning efficient fault tolerance mechanisms. Empirical evaluation conducted on Tornado validates that various real-time iterative analysis tasks can improve their performance and efficiently tolerate failures with our execution model.

【Keywords】: approximation method; concurrency control; iterative computation; stream processing

31. EmptyHeaded: A Relational Engine for Graph Processing.

【Paper Link】 【Pages】:431-446

【Authors】: Christopher R. Aberger ; Susan Tu ; Kunle Olukotun ; Christopher Ré

【Abstract】: There are two types of high-performance graph processing engines: low- and high-level engines. Low-level engines (Galois, PowerGraph, Snap) provide optimized data structures and computation models but require users to write low-level imperative code, hence ensuring that efficiency is the burden of the user. In high-level engines, users write in query languages like datalog (SociaLite) or SQL (Grail). High-level engines are easier to use but are orders of magnitude slower than the low-level graph engines. We present EmptyHeaded, a high-level engine that supports a rich datalog-like query language and achieves performance comparable to that of low-level engines. At the core of EmptyHeaded's design is a new class of join algorithms that satisfy strong theoretical guarantees but have thus far not achieved performance comparable to that of specialized graph processing engines. To achieve high performance, EmptyHeaded introduces a new join engine architecture, including a novel query optimizer and data layouts that leverage single-instruction multiple data (SIMD) parallelism. With this architecture, EmptyHeaded outperforms high-level approaches by up to three orders of magnitude on graph pattern queries, PageRank, and Single-Source Shortest Paths (SSSP) and is an order of magnitude faster than many low-level baselines. We validate that EmptyHeaded competes with the best-of-breed low-level engine (Galois), achieving comparable performance on PageRank and at most 3x worse performance on SSSP.

【Keywords】: GHD; SIMD; generalized hypertree decomposition; graph processing; single instruction multiple data; worst-case optimal join
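
To give a flavor of the join algorithms involved, the sketch below evaluates the canonical triangle query Q(a,b,c) :- R(a,b), R(b,c), R(a,c) by intersecting adjacency sets rather than composing pairwise joins; EmptyHeaded's generalized hypertree decompositions, SIMD-friendly layouts, and query compiler go far beyond this.

```python
def triangles(edges):
    """Count triangles in an undirected graph by intersecting adjacency sets,
    i.e., evaluating the triangle query as a multiway join."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    count = 0
    for a in adj:
        for b in adj[a]:
            if b <= a:
                continue                       # fix an order a < b < c to count each once
            for c in adj[a] & adj[b]:          # intersection binds the third variable
                if c > b:
                    count += 1
    return count

print(triangles([(0, 1), (1, 2), (0, 2), (2, 3)]))   # one triangle: {0, 1, 2}
```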

32. GTS: A Fast and Scalable Graph Processing Method based on Streaming Topology to GPUs.

【Paper Link】 【Pages】:447-461

【Authors】: Min-Soo Kim ; Kyuhyeon An ; Himchan Park ; Hyunseok Seo ; Jinwook Kim

【Abstract】: A fast and scalable graph processing method becomes increasingly important as graphs become popular in a wide range of applications and their sizes are growing rapidly. Most of distributed graph processing methods require a lot of machines equipped with a total of thousands of CPU cores and a few terabyte main memory for handling billion-scale graphs. Meanwhile, GPUs could be a promising direction toward fast processing of large-scale graphs by exploiting thousands of GPU cores. All of the existing methods using GPUs, however, fail to process large-scale graphs that do not fit in main memory of a single machine. Here, we propose a fast and scalable graph processing method GTS that handles even RMAT32 (64 billion edges) very efficiently only by using a single machine. The proposed method stores graphs in PCI-E SSDs and executes a graph algorithm using thousands of GPU cores while streaming topology data of graphs to GPUs via PCI-E interface. GTS is fast due to no communication overhead and scalable due to no data duplication from graph partitioning among machines. Through extensive experiments, we show that GTS consistently and significantly outperforms the major distributed graph processing methods, GraphX, Giraph, and PowerGraph, and the state-of-the-art GPU-based method TOTEM.

【Keywords】: GPUs; SSDs; graph processing; stream

33. Graph Analytics Through Fine-Grained Parallelism.

【Paper Link】 【Pages】:463-478

【Authors】: Zechao Shang ; Feifei Li ; Jeffrey Xu Yu ; Zhiwei Zhang ; Hong Cheng

【Abstract】: Large graphs are getting increasingly popular and even indispensable in many applications, for example, in social media data, large networks, and knowledge bases. Efficient graph analytics thus becomes an important subject of study. To increase efficiency and scalability, in-memory computation and parallelism have been explored extensively to speed up various graph analytical workloads. In many graph analytical engines (e.g., Pregel, Neo4j, GraphLab), parallelism is achieved via one of the three concurrency control models, namely, bulk synchronization processing (BSP), asynchronous processing, and synchronous processing. Among them, synchronous processing has the potential to achieve the best performance due to fine-grained parallelism, while ensuring the correctness and the convergence of the computation, if an effective concurrency control scheme is used. This paper explores the topological properties of the underlying graph to design and implement a highly effective concurrency control scheme for efficient synchronous processing in an in-memory graph analytical engine. Our design uses a novel hybrid approach that combines 2PL (two-phase locking) with OCC (optimistic concurrency control), for high degree and low degree vertices in a graph respectively. Our results show that the proposed hybrid synchronous scheduler has significantly outperformed other synchronous schedulers in existing graph analytical engines, as well as BSP and asynchronous schedulers.

【Keywords】: fine-grained parallelism; graph engine; parallel graph analytics; synchronous processing; transaction processing

34. Hybrid Pulling/Pushing for I/O-Efficient Distributed and Iterative Graph Computing.

【Paper Link】 【Pages】:479-494

【Authors】: Zhigang Wang ; Yu Gu ; Yubin Bao ; Ge Yu ; Jeffrey Xu Yu

【Abstract】: Billion-node graphs are rapidly growing in size in many applications such as online social networks. Most graph algorithms generate a large number of messages during iterative computations. Vertex-centric distributed systems usually store graph data and message data on disk to improve scalability. Currently, these distributed systems with disk-resident data take a push-based approach to handle messages. This works well if few messages reside on disk. Otherwise, it is I/O-inefficient due to expensive random writes. By contrast, the existing memory-resident pull-based approach individually pulls messages for each vertex on demand. Although it can be used to avoid disk operations regarding messages, expensive I/O costs are incurred by random and frequent access to vertices. This paper proposes a hybrid solution to support switching between push and pull adaptively, to obtain optimal performance for distributed systems with disk-resident data in different scenarios. We first employ a new block-centric technique (b-pull) to improve the I/O-performance of pulling messages, although the iterative computation is vertex-centric. I/O costs of data accesses are shifted from the receiver side where messages are written/read by push to the sender side where graph data are read by b-pull. Graph data are organized by clustering vertices and edges to achieve high I/O-efficiency in b-pull. Second, we design a seamless switching mechanism and a prominent performance prediction method to guarantee efficiency when switching between push and b-pull. We conduct extensive performance studies to confirm the effectiveness of our proposals over existing up-to-date solutions using a broad spectrum of real-world graphs.

【Keywords】: I/O-efficient; distributed graph computing; pull; push

Session 6 - Streaming 1: Systems and Outlier Detection 6

35. Scalable Pattern Sharing on Event Streams.

Paper Link】 【Pages】:495-510

【Authors】: Medhabi Ray ; Chuan Lei ; Elke A. Rundensteiner

【Abstract】: Complex Event Processing (CEP) has emerged as a technology of choice for high-performance event analytics in time-critical decision-making applications. Yet it is becoming increasingly difficult to support high-performance event processing due to the rising number and complexity of event pattern queries and the increasingly high velocity of event streams. In this work we design the SPASS framework that successfully tackles these demanding CEP workloads. Our SPASS optimizer identifies opportunities for effective shared processing among CEP queries by leveraging time-based event correlations among queries. The problem of pattern sharing is shown to be NP-hard by reducing the Minimum Substring Cover problem to our CEP pattern sharing problem. The SPASS optimizer is designed to find, in polynomial time, a shared pattern plan that covers all sequence patterns while still guaranteeing an optimality bound. To execute this shared pattern plan, the SPASS runtime employs stream transactions that assure concurrent shared maintenance and re-use of sub-patterns across queries. Our experimental study confirms that the SPASS framework achieves over 16-fold performance improvement for a wide range of experiments compared to the state-of-the-art solution.

【Keywords】: complex event processing; sequence pattern; sharing

36. How to Win a Hot Dog Eating Contest: Distributed Incremental View Maintenance with Batch Updates.

Paper Link】 【Pages】:511-526

【Authors】: Milos Nikolic ; Mohammad Dashti ; Christoph Koch

【Abstract】: In the quest for valuable information, modern big data applications continuously monitor streams of data. These applications demand low latency stream processing even when faced with high volume and velocity of incoming changes and the user's desire to ask complex queries. In this paper, we study low-latency incremental computation of complex SQL queries in both local and distributed streaming environments. We develop a technique for the efficient incrementalization of queries with nested aggregates for batch updates. We identify the cases in which batch processing can boost the performance of incremental view maintenance but also demonstrate that tuple-at-a-time processing often can achieve better performance in local mode. Batch updates are essential for enabling distributed incremental view maintenance and amortizing the cost of network communication and synchronization. We show how to derive incremental programs optimized for running on large-scale processing platforms. Our implementation of distributed incremental view maintenance can process tens of millions of tuples with a latency of a few seconds using hundreds of nodes.
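
As a concrete picture of batched incremental view maintenance (far simpler than the paper's treatment of nested aggregates), the sketch below maintains a SUM/COUNT group-by view by pre-aggregating each batch of inserts and deletes and touching every affected group once:

    from collections import defaultdict

    # Minimal sketch of batched incremental maintenance for a
    # SELECT key, SUM(val), COUNT(*) GROUP BY key view (illustrative only).

    view_sum = defaultdict(float)
    view_cnt = defaultdict(int)

    def apply_batch(deltas):
        """deltas: iterable of (key, val, sign), sign = +1 (insert) or -1 (delete)."""
        # Pre-aggregate the batch locally, then touch each affected group once.
        batch_sum = defaultdict(float)
        batch_cnt = defaultdict(int)
        for key, val, sign in deltas:
            batch_sum[key] += sign * val
            batch_cnt[key] += sign
        for key in batch_sum:
            view_sum[key] += batch_sum[key]
            view_cnt[key] += batch_cnt[key]
            if view_cnt[key] == 0:
                del view_sum[key], view_cnt[key]

    apply_batch([("a", 3.0, +1), ("a", 5.0, +1), ("b", 1.0, +1), ("a", 3.0, -1)])
    print(dict(view_sum), dict(view_cnt))   # {'a': 5.0, 'b': 1.0} {'a': 1, 'b': 1}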

【Keywords】: batch processing; distributed query processing; query compilation; query optimization; recursive incremental view maintenance; stream management system

37. Sharing-Aware Outlier Analytics over High-Volume Data Streams.

Paper Link】 【Pages】:527-540

【Authors】: Lei Cao ; Jiayuan Wang ; Elke A. Rundensteiner

【Abstract】: Real-time analytics of anomalous phenomena on streaming data typically relies on processing a large variety of continuous outlier detection requests, each configured with different parameter settings. The processing of such complex outlier analytics workloads is resource consuming due to the algorithmic complexity of the outlier mining process. In this work we propose a sharing-aware multi-query execution strategy for outlier detection on data streams called SOP. A key insight of SOP is to transform the problem of handling a multi-query outlier analytics workload into a single-query skyline computation problem. We prove that the output of the skyline computation process corresponds to the minimal information needed for determining the outlier status of any point in the stream. Based on this new formulation, we design a customized skyline algorithm called K-SKY that leverages the domination relationships among the streaming data points to minimize the number of data points that must be evaluated for supporting multi-query outlier detection. Based on this K-SKY algorithm, our SOP solution achieves minimal utilization of both computational and memory resources for the processing of these complex outlier analytics workloads. Our experimental study demonstrates that SOP consistently outperforms the state-of-the-art solutions by three orders of magnitude in CPU time, while only consuming 5% of their memory footprint - a clear win-win. Furthermore, SOP is shown to scale to large workloads composed of thousands of parameterized queries.

【Keywords】: multi-query; outlier; stream

38. THEMIS: Fairness in Federated Stream Processing under Overload.

Paper Link】 【Pages】:541-553

【Authors】: Evangelia Kalyvianaki ; Marco Fiscato ; Theodoros Salonidis ; Peter R. Pietzuch

【Abstract】: Federated stream processing systems, which utilise nodes from multiple independent domains, can be found increasingly in multi-provider cloud deployments, internet-of-things systems, collaborative sensing applications and large-scale grid systems. To pool resources from several sites and take advantage of local processing, submitted queries are split into query fragments, which are executed collaboratively by different sites. When supporting many concurrent users, however, queries may exhaust available processing resources, thus requiring constant load shedding. Given that individual sites have autonomy over how they allocate query fragments on their nodes, it is an open challenge how to ensure global fairness on processing quality experienced by queries in a federated scenario. We describe THEMIS, a federated stream processing system for resource-starved, multi-site deployments. It executes queries in a globally fair fashion and provides users with constant feedback on the experienced processing quality for their queries. THEMIS associates stream data with its source information content (SIC), a metric that quantifies the contribution of that data towards the query result, based on the amount of source data used to generate it. We provide the BALANCE-SIC distributed load shedding algorithm that balances the SIC values of result data. Our evaluation shows that the BALANCE-SIC algorithm yields balanced SIC values across queries, as measured by Jain's Fairness Index. Our approach also incurs a low execution time overhead.

【Keywords】: approximate data processing; fairness; federated data stream processing; tuple shedding

39. SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures.

Paper Link】 【Pages】:555-569

【Authors】: Alexandros Koliousis ; Matthias Weidlich ; Raul Castro Fernandez ; Alexander L. Wolf ; Paolo Costa ; Peter Pietzuch

【Abstract】: Modern servers have become heterogeneous, often combining multi-core CPUs with many-core GPGPUs. Such heterogeneous architectures have the potential to improve the performance of data-intensive stream processing applications, but they are not supported by current relational stream processing engines. For an engine to exploit a heterogeneous architecture, it must execute streaming SQL queries with sufficient data-parallelism to fully utilise all available heterogeneous processors, and decide how to use each in the most effective way. It must do this while respecting the semantics of streaming SQL queries, in particular with regard to window handling. We describe Saber, a hybrid high-performance relational stream processing engine for CPUs and GPGPUs. Saber executes window-based streaming SQL queries in a data-parallel fashion using all available CPU and GPGPU cores. Instead of statically assigning query operators to heterogeneous processors, Saber employs a new adaptive heterogeneous lookahead scheduling strategy, which increases the share of queries executing on the processor that yields the highest performance. To hide data movement costs, Saber pipelines the transfer of stream data between CPU and GPGPU memory. Our experimental comparison against state-of-the-art engines shows that Saber increases processing throughput while maintaining low latency for a wide range of streaming SQL queries with both small and large window sizes.

【Keywords】: data parallelism; gpgpus; heterogeneous hardware; hybrid scheduling; stream processing; windowing support

40. Range Thresholding on Streams.

Paper Link】 【Pages】:571-582

【Authors】: Miao Qiao ; Junhao Gan ; Yufei Tao

【Abstract】: This paper studies a type of continuous queries called range thresholding on streams (RTS). Imagine the stream as an unbounded sequence of elements each of which is a real value. A query registers an interval, and must be notified as soon as a certain number of incoming elements fall into the interval. The system needs to support multiple queries simultaneously, and aims to minimize the space consumption and computation time. Currently, all the solutions to this problem entail quadratic time O(nm) to process n stream elements and m queries, which severely limits their applicability to only a small number of queries. We propose the first algorithm that breaks the quadratic barrier, by reducing the computation cost dramatically to O(n + m), subject only to a polylogarithmic factor. The algorithm is general enough to guarantee the same on weighted versions of the queries even in d-dimensional space of any constant d. Its vast advantage over the previous methods in practical environments has been confirmed through extensive experimentation.
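
For reference, the straightforward baseline that incurs the O(nm) cost the paper eliminates looks as follows (illustrative class and function names); each stream element is checked against every registered query:

    # The O(nm) baseline: every stream element is tested against every query.

    class RangeThresholdQuery:
        def __init__(self, lo, hi, threshold):
            self.lo, self.hi, self.threshold = lo, hi, threshold
            self.count = 0
            self.fired = False

    def process_element(x, queries):
        matured = []
        for q in queries:
            if not q.fired and q.lo <= x <= q.hi:
                q.count += 1
                if q.count >= q.threshold:
                    q.fired = True
                    matured.append(q)
        return matured   # queries whose threshold was just reached

    queries = [RangeThresholdQuery(0.0, 1.0, 3), RangeThresholdQuery(0.5, 2.0, 2)]
    for x in [0.7, 0.9, 1.5, 0.2]:
        for q in process_element(x, queries):
            print("notify:", q.lo, q.hi)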

【Keywords】: algorithms; range thresholding; streams

Session 7 - Approximate Query Processing 7

41. Bridging the Archipelago between Row-Stores and Column-Stores for Hybrid Workloads.

Paper Link】 【Pages】:583-598

【Authors】: Joy Arulraj ; Andrew Pavlo ; Prashanth Menon

【Abstract】: Data-intensive applications seek to obtain insights in real time by analyzing a combination of historical data sets alongside recently collected data. This means that to support such hybrid workloads, database management systems (DBMSs) need to handle both fast ACID transactions and complex analytical queries on the same database. But the current trend is to use specialized systems that are optimized for only one of these workloads, and thus require an organization to maintain separate copies of the database. This adds cost to deploying a database application in terms of both storage and administration overhead. To overcome this barrier, we present a hybrid DBMS architecture that efficiently supports varied workloads on the same database. Our approach differs from previous methods in that we use a single execution engine that is oblivious to the storage layout of data without sacrificing the performance benefits of the specialized systems. This obviates the need to maintain separate copies of the database in multiple independent systems. We also present a technique to continuously evolve the database's physical storage layout by analyzing the queries' access patterns and choosing the optimal layout for different segments of data within the same table. To evaluate this work, we implemented our architecture in an in-memory DBMS. Our results show that our approach delivers up to 3x higher throughput compared to static storage layouts across different workloads. We also demonstrate that our continuous adaptation mechanism allows the DBMS to achieve a near-optimal layout for an arbitrary workload without requiring any manual tuning.

【Keywords】: HTAP; column-stores; hybrid workloads

42. An Effective Syntax for Bounded Relational Queries.

Paper Link】 【Pages】:599-614

【Authors】: Yang Cao ; Wenfei Fan

【Abstract】: A query Q is boundedly evaluable under a set A of access constraints if for all datasets D that satisfy A, there exists a fraction DQ of D such that Q(D) = Q(DQ), and the size of DQ and time for identifying DQ are both independent of the size of D. That is, we can compute Q(D) by accessing a bounded amount of data no matter how big D grows. However, while desirable, it is undecidable to determine whether a query in relational algebra (RA) is bounded under A. In light of the undecidability, this paper develops an effective syntax for bounded RA queries. We identify a class of covered RA queries such that under A, (a) every boundedly evaluable RA query is equivalent to a covered query, (b) every covered RA query is boundedly evaluable, and (c) it takes PTIME in |Q| and |A| to check whether Q is covered by A. We provide quadratic-time algorithms to check the coverage of Q, and to generate a bounded query plan for covered Q. We also study a new optimization problem for minimizing access constraints for covered queries. Using real-life data, we experimentally verify that a large number of RA queries in practice are covered, and that bounded query plans improve RA query evaluation by orders of magnitude.

【Keywords】: big data; bounded evaluability; query evaluation

43. Wander Join: Online Aggregation via Random Walks.

Paper Link】 【Pages】:615-629

【Authors】: Feifei Li ; Bin Wu ; Ke Yi ; Zhuoyue Zhao

【Abstract】: Joins are expensive, and online aggregation over joins was proposed to mitigate the cost, which offers users a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and even needs unrealistic assumptions (e.g., tuples in a table are stored in random order). This paper proposes a new approach, the wander join algorithm, to the online aggregation problem by performing random walks over the underlying join graph. We also design an optimizer that chooses the optimal plan for conducting the random walks without having to collect any statistics a priori. Compared with ripple join, wander join is particularly efficient for equality joins involving multiple tables, but also supports θ-joins. Selection predicates and group-by clauses can be handled as well. Extensive experiments using the TPC-H benchmark have demonstrated the superior performance of wander join over ripple join. In particular, we have integrated and tested wander join in the latest version of PostgreSQL, demonstrating its practicality in a full-fledged database system.
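
The random-walk idea can be illustrated on a two-table equality join (the paper handles general multi-way joins and θ-joins). In the hedged sketch below, each walk picks a uniform tuple from one table, follows an index on the join key to a uniform joining tuple, and scales the sampled value by the inverse of the walk's probability, giving an unbiased estimate of SUM over the join; table contents and names are toy assumptions:

    import random
    from collections import defaultdict

    # Toy two-table wander-join-style estimator for SUM(S.v) over R JOIN S
    # on R.k = S.k (assumed schema, illustrative data).

    R = [("a",), ("b",), ("c",)]
    S = [("a", 10.0), ("a", 20.0), ("b", 5.0)]

    # Index on the join key of S: the walk proceeds via index lookups.
    S_index = defaultdict(list)
    for k, v in S:
        S_index[k].append(v)

    def one_walk():
        k, = random.choice(R)              # step 1: uniform tuple from R, prob 1/|R|
        matches = S_index.get(k, [])
        if not matches:                    # dead-end walk contributes 0
            return 0.0
        v = random.choice(matches)         # step 2: uniform joining tuple, prob 1/|matches|
        prob = (1.0 / len(R)) * (1.0 / len(matches))
        return v / prob                    # Horvitz-Thompson estimate from this walk

    def estimate_sum(num_walks=10000):
        return sum(one_walk() for _ in range(num_walks)) / num_walks

    print(estimate_sum())    # converges to 10 + 20 + 5 = 35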

【Keywords】: joins; online aggregation; random walks

44. Quickr: Lazily Approximating Complex AdHoc Queries in BigData Clusters.

Paper Link】 【Pages】:631-646

【Authors】: Srikanth Kandula ; Anil Shanbhag ; Aleksandar Vitorovic ; Matthaios Olma ; Robert Grandl ; Surajit Chaudhuri ; Bolin Ding

【Abstract】: We present a system that approximates the answer to complex ad-hoc queries in big-data clusters by injecting samplers on-the-fly and without requiring pre-existing samples. Improvements can be substantial when big-data queries take multiple passes over data and when samplers execute early in the query plan. We present a new sampler, the universe sampler, which is able to sample multiple join inputs. By incorporating samplers natively into a cost-based query optimizer, we automatically generate plans with appropriate samplers at appropriate locations. We devise an accuracy analysis method using which we ensure that query plans with samplers will not miss groups and that aggregate values are within a small ratio of their true value. An implementation on a cluster with tens of thousands of machines shows that queries in the TPC-DS benchmark use a median of 2X fewer resources. In contrast, approaches that construct input samples even when given 10X the size of the input to store samples improve only 22% of the queries, i.e., a median speed up of 0X.

【Keywords】: accuracy analysis; analytics; approximation; big-data; clusters; consistent sampler; data-parallel; parallel; query optimization; relational; sampling; sampling joins; streaming; universe sampler

45. A Study of Sorting Algorithms on Approximate Memory.

Paper Link】 【Pages】:647-662

【Authors】: Shuang Chen ; Shunning Jiang ; Bingsheng He ; Xueyan Tang

【Abstract】: Hardware evolution has been one of the driving factors for the redesign of database systems. Recently, approximate storage has emerged in the area of computer architecture. It trades off precision for better performance and/or energy consumption. Previous studies have demonstrated the benefits of approximate storage for applications that are tolerant to imprecision such as image processing. However, it is still an open question whether and how approximate storage can be used for applications that do not expose such intrinsic tolerance. In this paper, we study one of the most basic operations in databases, sorting, on a hybrid storage system with both precise storage and approximate storage. Particularly, we start with a study of three common sorting algorithms on approximate storage. Experimental results show that a 95% sorted sequence can be obtained with up to 40% reduction in total write latencies. Thus, we propose an approx-refine execution mechanism to improve the performance of sorting algorithms on the hybrid storage system to produce precise results. Our optimization gains the performance benefits by offloading the sorting operation to approximate storage, followed by an efficient refinement to resolve the unsortedness on the output of the approximate storage. Our experiments show that our approx-refine can reduce the total memory access time by up to 11%. These studies shed light on the potential of approximate hardware for improving the performance of applications that require precise results.
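
The approx-refine idea can be sketched as follows: sort using keys that may have been slightly corrupted (here the corruption is merely simulated with random bit flips, an assumption for illustration), then repair the residual unsortedness on precise storage with an insertion pass, which is cheap on a nearly sorted sequence:

    import random

    # Sketch of approx-refine: sort a possibly corrupted view of the keys,
    # then restore exactness with an insertion pass over precise keys.

    def simulate_approximate_sort(keys, flip_prob=0.05):
        noisy = [k ^ 1 if random.random() < flip_prob else k for k in keys]  # toy bit flip
        order = sorted(range(len(keys)), key=lambda i: noisy[i])
        return [keys[i] for i in order]    # nearly sorted w.r.t. the precise keys

    def refine(nearly_sorted):
        # Insertion sort is cheap on a ~95%-sorted sequence and restores exactness.
        a = list(nearly_sorted)
        for i in range(1, len(a)):
            x, j = a[i], i - 1
            while j >= 0 and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x
        return a

    data = [random.randrange(1000) for _ in range(50)]
    assert refine(simulate_approximate_sort(data)) == sorted(data)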

【Keywords】: approximate storage; database; hybrid storage; phase change memory; sorting algorithms

46. Distributed Wavelet Thresholding for Maximum Error Metrics.

Paper Link】 【Pages】:663-677

【Authors】: Ioannis Mytilinis ; Dimitrios Tsoumakos ; Nectarios Koziris

【Abstract】: Modern data analytics involve simple and complex computations over enormous numbers of data records. The volume of data and the increasingly stringent response-time requirements place increasing emphasis on the efficiency of approximate query processing. A major challenge over the past years has been the efficient construction of fixed-space synopses that provide a deterministic quality guarantee, often expressed in terms of a maximum error metric. For data reduction, wavelet decomposition has proved to be a very effective tool, as it can successfully approximate sharp discontinuities and provide accurate answers to queries. However, existing polynomial time wavelet thresholding schemes that minimize maximum error metrics are constrained with impractical time and space complexities for large datasets. In order to provide a practical solution to the problem, we develop parallel algorithms that take advantage of key-properties of the wavelet decomposition and allocate tasks to multiple workers. To that end, we present i) a general framework for the parallelization of existing dynamic programming algorithms, ii) a parallel version of one such DP-based algorithm and iii) a new parallel greedy algorithm for the problem. To the best of our knowledge, this is the first attempt to scale algorithms for wavelet thresholding for maximum error metrics via a state-of-the-art distributed runtime. Our extensive experiments on both real and synthetic datasets over Hadoop show that the proposed algorithms achieve linear scalability and superior running-time performance compared to their centralized counterparts. Furthermore, our distributed greedy algorithm outperforms the distributed version of the current state-of-the-art dynamic programming algorithm by 2 to 4 times, without compromising the quality of results.

【Keywords】: approximate query processing; distributed data management; greedy thresholding; wavelet synopses

47. Sample + Seek: Approximating Aggregates with Distribution Precision Guarantee.

Paper Link】 【Pages】:679-694

【Authors】: Bolin Ding ; Silu Huang ; Surajit Chaudhuri ; Kaushik Chakrabarti ; Chi Wang

【Abstract】: Data volumes are growing exponentially for our decision-support systems, making it challenging to ensure interactive response time for ad-hoc queries without increasing cost of hardware. Aggregation queries with Group By that produce an aggregate value for every combination of values in the grouping columns are the most important class of ad-hoc queries. As small errors are usually tolerable for such queries, approximate query processing (AQP) has the potential to answer them over very large datasets much faster. In many cases analysts require the distribution of (group, aggvalue) pairs in the estimated answer to be guaranteed within a certain error threshold of the exact distribution. Existing AQP techniques are inadequate for two main reasons. First, users cannot express such guarantees. Second, sampling techniques used in traditional AQP can produce arbitrarily large errors even for SUM queries. To address those limitations, we first introduce a new precision metric, called distribution precision, to express such error guarantees. We then study how to provide fast approximate answers to aggregation queries with distribution precision guaranteed within a user-specified error bound. The main challenges are to provide rigorous error guarantees and to handle arbitrary highly selective predicates without maintaining large-sized samples. We propose a novel sampling scheme called measure-biased sampling to address the former challenge. For the latter, we propose two new indexes to augment in-memory samples. Like other sampling-based AQP techniques, our solution supports any aggregate that can be estimated from random samples. In addition to deriving theoretical guarantees, we conduct an experimental study to compare our system with state-of-the-art AQP techniques and a commercial column-store database system on both synthetic and real enterprise datasets. Our system provides a median speed-up of more than 100x with around 5% distribution error compared with the commercial database.
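
A minimal illustration of measure-biased sampling for SUM(measure) GROUP BY g, with toy data and without the paper's indexes or error bounds: rows are drawn with probability proportional to their measure, so each draw represents total_measure / n of the answer and group sums are estimated by counting draws per group:

    import random
    from collections import defaultdict

    # Toy (group, measure) relation; names are illustrative.
    rows = [("east", 10.0), ("east", 1.0), ("west", 4.0), ("west", 5.0)]
    total = sum(m for _, m in rows)

    def measure_biased_sample(n):
        groups = [g for g, _ in rows]
        weights = [m for _, m in rows]
        return random.choices(groups, weights=weights, k=n)   # with replacement

    def estimate_group_sums(n=100000):
        sample = measure_biased_sample(n)
        est = defaultdict(float)
        for g in sample:
            est[g] += total / n        # each draw carries weight total/n
        return dict(est)

    print(estimate_group_sums())   # converges to {'east': 11.0, 'west': 9.0}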

【Keywords】: approximate query processing; indexing; precision guarantee; sampling

Session 8 - Networks and the Web 6

48. Stop-and-Stare: Optimal Sampling Algorithms for Viral Marketing in Billion-scale Networks.

Paper Link】 【Pages】:695-710

【Authors】: Hung T. Nguyen ; My T. Thai ; Thang N. Dinh

【Abstract】: Influence Maximization (IM), which seeks a small set of key users who spread the influence widely into the network, is a core problem in multiple domains. It finds applications in viral marketing, epidemic control, and assessing cascading failures within complex systems. Despite the huge amount of effort, IM in billion-scale networks such as Facebook, Twitter, and the World Wide Web has not been satisfactorily solved. Even the state-of-the-art methods such as TIM+ and IMM may take days on those networks. In this paper, we propose SSA and D-SSA, two novel sampling frameworks for IM-based viral marketing problems. SSA and D-SSA are up to 1200 times faster than the SIGMOD'15 best method, IMM, while providing the same (1-1/e-ε) approximation guarantee. Underlying our frameworks is an innovative Stop-and-Stare strategy in which they stop at exponential check points to verify (stare) if there is adequate statistical evidence on the solution quality. Theoretically, we prove that SSA and D-SSA are the first approximation algorithms that use (asymptotically) minimum numbers of samples, meeting strict theoretical thresholds characterized for IM. The absolute superiority of SSA and D-SSA is confirmed through extensive experiments on real network data for IM and another topic-aware viral marketing problem, named TVM.

【Keywords】: influence maximization; sampling; stop-and-stare

49. Spheres of Influence for More Effective Viral Marketing.

Paper Link】 【Pages】:711-726

【Authors】: Yasir Mehmood ; Francesco Bonchi ; David García-Soriano

【Abstract】: What is the set of nodes of a social network that, under a probabilistic contagion model, would get infected if a given node s gets infected? We call this set the sphere of influence of s. Due to the stochastic nature of the contagion model we need to define a notion of "expected" or "typical" cascade: this is a set of nodes which is the closest to all the possible cascades starting from s. We thus formalize the Typical Cascade problem which requires, for a given source node s, to find the set of nodes minimizing the expected Jaccard distance to all the possible cascades from s. The expected cost of a typical cascade also provides us with a measure of the stability of cascade propagation, i.e., how much random cascades from a source node s deviate from the "typical" cascade. In this sense source nodes with lower expected costs are more reliable. We show that, while computing the quality of a candidate solution is #P-hard, a method based on (1) sampling random cascades and (2) computing their Jaccard median can obtain a multiplicative approximation with just O(1) samples. We then devise an index that allows us to efficiently compute the sphere of influence for any node in the network. Finally, we propose to approach the influence maximization problem as an instance of set cover on the spheres of influence. Through exhaustive evaluation using real-world networks and different methods of assigning the influence probability to each edge, we show that our approach outperforms in quality the theoretically optimal greedy algorithm.

【Keywords】: influence maximization; sphere of influence; typical cascade; uncertain graphs

50. Continuous Influence Maximization: What Discounts Should We Offer to Social Network Users?

Paper Link】 【Pages】:727-741

【Authors】: Yu Yang ; Xiangbo Mao ; Jian Pei ; Xiaofei He

【Abstract】: Imagine we are introducing a new product through a social network, where we know for each user in the network the purchase probability curve with respect to discount. Then, what discount should we offer to those social network users so that the adoption of the product is maximized in expectation under a predefined budget? Although influence maximization has been extensively explored, surprisingly, this appealing practical problem still cannot be answered by the existing influence maximization methods. In this paper, we tackle the problem systematically. We formulate the general continuous influence maximization problem, investigate the essential properties, and develop a general coordinate descent algorithm as well as the engineering techniques for practical implementation. Our investigation does not assume any specific influence model and thus is general and principled. At the same time, using the most popularly adopted independent influence model as a concrete example, we demonstrate that more efficient methods are feasible under specific influence models. Our extensive empirical study on four benchmark real world networks with synthesized purchase probability curves clearly illustrates that continuous influence maximization can improve influence spread significantly with very moderate extra running time compared to the classical influence maximization methods.

【Keywords】: coordinate descent; influence maximization

51. Holistic Influence Maximization: Combining Scalability and Efficiency with Opinion-Aware Models.

Paper Link】 【Pages】:743-758

【Authors】: Sainyam Galhotra ; Akhil Arora ; Shourya Roy

【Abstract】: The steady growth of graph data from social networks has resulted in wide-spread research in finding solutions to the influence maximization problem. In this paper, we propose a holistic solution to the influence maximization (IM) problem. (1) We introduce an opinion-cum-interaction (OI) model that closely mirrors the real-world scenarios. Under the OI model, we introduce a novel problem of Maximizing the Effective Opinion (MEO) of influenced users. We prove that the MEO problem is NP-hard and cannot be approximated within a constant ratio unless P=NP. (2) We propose a heuristic algorithm OSIM to efficiently solve the MEO problem. To better explain the OSIM heuristic, we first introduce EaSyIM - the opinion-oblivious version of OSIM, a scalable algorithm capable of running within practical compute times on commodity hardware. In addition to serving as a fundamental building block for OSIM, EaSyIM is capable of addressing the scalability aspect - memory consumption and running time, of the IM problem as well. Empirically, our algorithms are capable of maintaining the deviation in the spread always within 5% of the best known methods in the literature. In addition, our experiments show that both OSIM and EaSyIM are effective, efficient, scalable and significantly enhance the ability to analyze real datasets.

【Keywords】: efficiency; greedy algorithm; influence maximization; opinion; scalability; social networks; viral marketing

52. Potential and Pitfalls of Domain-Specific Information Extraction at Web Scale.

Paper Link】 【Pages】:759-771

【Authors】: Astrid Rheinländer ; Mario Lehmann ; Anja Kunkel ; Jörg Meier ; Ulf Leser

【Abstract】: In many domains, a plethora of textual information is available on the web as news reports, blog posts, community portals, etc. Information extraction (IE) is the default technique to turn unstructured text into structured fact databases, but systematically applying IE techniques to web input requires highly complex systems, starting from focused crawlers over quality assurance methods to cope with the HTML input to long pipelines of natural language processing and IE algorithms. Although a number of tools for each of these steps exists, their seamless, flexible, and scalable combination into a web-scale end-to-end text analytics system is still a true challenge. In this paper, we report our experiences from building such a system for comparing the "web view" on health-related topics with that derived from a controlled scientific corpus, i.e., Medline. The system combines a focused crawler, applying shallow text analysis and classification to maintain focus, with a sophisticated text analytic engine inside the Big Data processing system Stratosphere. We describe a practical approach to seed generation which led us to crawl a corpus of ~1 TB of web pages highly enriched for the biomedical domain. Pages were run through a complex pipeline of best-of-breed tools for a multitude of necessary tasks, such as HTML repair, boilerplate detection, sentence detection, linguistic annotation, parsing, and eventually named entity recognition for several types of entities. Results are compared with those from running the same pipeline (without the web-related tasks) on a corpus of 24 million scientific abstracts and a third corpus made of ~250K scientific full texts. We evaluate scalability, quality, and robustness of the employed methods and tools. The focus of this paper is to provide a large, real-life use case to inspire future research into robust, easy-to-use, and scalable methods for domain-specific IE at web scale.

【Keywords】: focused crawling; information extraction; massively parallel data analysis

53. Robust and Noise Resistant Wrapper Induction.

Paper Link】 【Pages】:773-784

【Authors】: Tim Furche ; Jinsong Guo ; Sebastian Maneth ; Christian Schallhart

【Abstract】: Wrapper induction is the problem of automatically inferring a query from annotated web pages of the same template. This query should not only select the annotated content accurately but also other content following the same template. Beyond accurately matching the template, we consider two additional requirements: (1) wrappers should be robust against a large class of changes to the web pages, and (2) the induction process should be noise resistant, i.e., tolerate slightly erroneous (e.g., machine generated) samples. Key to our approach is a query language that is powerful enough to permit accurate selection, but limited enough to force noisy samples to be generalized into wrappers that select the likely intended items. We introduce such a language as a subset of XPath and show that even for such a restricted language, inducing optimal queries according to a suitable scoring is infeasible. Nevertheless, our wrapper induction framework infers highly robust and noise resistant queries. We evaluate the queries on snapshots from web pages that change over time as provided by the Internet Archive, and show that the induced queries are as robust as the human-made queries. The queries often survive hundreds, sometimes thousands, of days, with many changes to the relative position of the selected nodes (including changes on template level). This is due to the few and discriminative anchor (intermediately selected) nodes of the generated queries. The queries are highly resistant against positive noise (up to 50%) and negative noise (up to 20%).

【Keywords】: XPath; wrapper; wrapper induction; wrapper maintenance

Session 9 - Data Discovery and Extraction 6

54. Goods: Organizing Google's Datasets.

Paper Link】 【Pages】:795-806

【Authors】: Alon Y. Halevy ; Flip Korn ; Natalya Fridman Noy ; Christopher Olston ; Neoklis Polyzotis ; Sudip Roy ; Steven Euijong Whang

【Abstract】: Enterprises increasingly rely on structured datasets to run their businesses. These datasets take a variety of forms, such as structured files, databases, spreadsheets, or even services that provide access to the data. The datasets often reside in different storage systems, may vary in their formats, and may change every day. In this paper, we present GOODS, a project to rethink how we organize structured datasets at scale, in a setting where teams use diverse and often idiosyncratic ways to produce the datasets and where there is no centralized system for storing and querying them. GOODS extracts metadata ranging from salient information about each dataset (owners, timestamps, schema) to relationships among datasets, such as similarity and provenance. It then exposes this metadata through services that allow engineers to find datasets within the company, to monitor datasets, to annotate them in order to enable others to use their datasets, and to analyze relationships between them. We discuss the technical challenges that we had to overcome in order to crawl and infer the metadata for billions of datasets, to maintain the consistency of our metadata catalog at scale, and to expose the metadata to users. We believe that many of the lessons that we learned are applicable to building large-scale enterprise-level data-management systems in general.

【Keywords】: data culture; data flow; data lakes; data monitoring; data organization; data provenance; data search; enterprise data management; metadata extraction

55. Multi-Source Uncertain Entity Resolution at Yad Vashem: Transforming Holocaust Victim Reports into People.

Paper Link】 【Pages】:807-819

【Authors】: Tomer Sagi ; Avigdor Gal ; Omer Barkol ; Ruth Bergman ; Alexander Avram

【Abstract】: In this work we describe an entity resolution project performed at Yad Vashem, the central repository of Holocaust-era information. The Yad Vashem dataset is unique with respect to classic entity resolution, by virtue of being both massively multi-source and by requiring multi-level entity resolution. With today's abundance of information sources, this project sets an example for multi-source resolution on a big-data scale. We discuss a set of requirements that led us to choose the MFIBlocks entity resolution algorithm in achieving the goals of the application. We also provide a machine learning approach, based upon decision trees to transform soft clusters into ranked clustering of records, representing possible entities. An extensive empirical evaluation demonstrates the unique properties of this dataset, highlighting the shortcomings of current methods and proposing avenues for future research in this realm.

【Keywords】: blocking; holocaust; uncertain entity resolution

56. A Hybrid Approach to Functional Dependency Discovery.

Paper Link】 【Pages】:821-833

【Authors】: Thorsten Papenbrock ; Felix Naumann

【Abstract】: Functional dependencies are structural metadata that can be used for schema normalization, data integration, data cleansing, and many other data management tasks. Despite their importance, the functional dependencies of a specific dataset are usually unknown and almost impossible to discover manually. For this reason, database research has proposed various algorithms for functional dependency discovery. None, however, are able to process datasets of typical real-world size, e.g., datasets with more than 50 attributes and a million records. We present a hybrid discovery algorithm called HyFD, which combines fast approximation techniques with efficient validation techniques in order to find all minimal functional dependencies in a given dataset. While operating on compact data structures, HyFD not only outperforms all existing approaches, it also scales to much larger datasets.
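
For intuition, validating a single candidate functional dependency X -> A boils down to checking that no two rows agree on X but disagree on A; the sketch below does exactly that over a list-of-dicts relation (it is only the validation building block, not HyFD's sampling-plus-validation pipeline):

    # Minimal validator for a candidate FD lhs -> rhs over a list-of-dicts relation.

    def holds(rows, lhs, rhs):
        """Check whether lhs -> rhs holds; return a violating pair of rows if not."""
        seen = {}
        for row in rows:
            key = tuple(row[a] for a in lhs)
            if key in seen and seen[key][rhs] != row[rhs]:
                return False, (seen[key], row)
            seen.setdefault(key, row)
        return True, None

    rows = [
        {"zip": "10001", "city": "NYC"},
        {"zip": "10001", "city": "NYC"},
        {"zip": "94103", "city": "SF"},
    ]
    print(holds(rows, ["zip"], "city"))   # (True, None)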

【Keywords】: data profiling; discovery; efficient; functional dependencies; functional dependency; hybrid; metadata; metanome; parallel

57. Ontological Pathfinding.

Paper Link】 【Pages】:835-846

【Authors】: Yang Chen ; Sean Louis Goldberg ; Daisy Zhe Wang ; Soumitra Siddharth Johri

【Abstract】: Recent years have seen a drastic rise in the construction of web-scale knowledge bases (e.g., Freebase, YAGO, DBPedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to limitations of human knowledge and information extraction algorithms, these knowledge bases are still far from complete. In this paper, we study the problem of mining first-order inference rules to facilitate knowledge expansion. We propose the Ontological Pathfinding algorithm (OP) that scales to web-scale knowledge bases via a series of parallelization and optimization techniques: a relational knowledge base model to apply inference rules in batches, a new rule mining algorithm that parallelizes the join queries, a novel partitioning algorithm to break the mining tasks into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing approach achieves this scale.

【Keywords】: data mining; first-order logic; knowledge bases; scalability

58. Extracting Databases from Dark Data with DeepDive.

Paper Link】 【Pages】:847-859

【Authors】: Ce Zhang ; Jaeho Shin ; Christopher Ré ; Michael J. Cafarella ; Feng Niu

【Abstract】: DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but which cannot be exploited by standard relational tools. If the information in dark data --- scientific papers, Web classified ads, customer service notes, and so on --- were instead in a relational database, it would give analysts access to a massive and highly-valuable new set of "big data" to exploit. DeepDive is distinctive when compared to previous information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases with accuracy that meets that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontology, law enforcement, and others. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers. DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. This design is enabled by several core innovations around probabilistic training and inference.

【Keywords】: dark data; data integration; information extraction; knowledge base construction

59. Estimating the Impact of Unknown Unknowns on Aggregate Query Results.

Paper Link】 【Pages】:861-876

【Authors】: Yeounoh Chung ; Michael Lind Mortensen ; Carsten Binnig ; Tim Kraska

【Abstract】: It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete and (2) what is the impact of any unknown (i.e., unobserved) data on query results? In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a., unknown unknowns) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution. Through a series of experiments, we show that estimating the impact of unknown unknowns is invaluable to better assess the results of aggregate queries over integrated data sources.
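
The overlap intuition can be illustrated with a classical species estimator such as Chao1, used here purely for illustration; the paper's estimators and bias corrections differ. Items reported by only one source (singletons) relative to items reported by exactly two sources hint at how many items no source observed, and a naive correction then scales the observed sum accordingly:

    from collections import Counter

    # Chao1-style illustration of the overlap idea (not the paper's estimators).

    def chao1_missing(observation_counts):
        """observation_counts[i] = how many sources reported observed item i."""
        freq = Counter(observation_counts)
        f1, f2 = freq.get(1, 0), freq.get(2, 0)
        if f2 == 0:
            return f1 * (f1 - 1) / 2.0    # bias-corrected fallback when no doubletons
        return f1 * f1 / (2.0 * f2)       # estimated number of unseen items

    def corrected_sum(values_by_item, observation_counts):
        observed_sum = sum(values_by_item)
        mean_value = observed_sum / len(values_by_item)
        missing = chao1_missing(observation_counts)
        # Naive impact estimate: unseen items look, on average, like seen ones.
        return observed_sum + missing * mean_value

    values = [10.0, 12.0, 9.0, 11.0, 10.0]
    counts = [1, 1, 2, 3, 1]              # how many sources reported each item
    print(chao1_missing(counts), corrected_sum(values, counts))   # ~4.5 unseen, sum ~98.8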

【Keywords】: aggregate query; data integration; open-world assumption; species estimation; unknown unknowns

Session 10 - Data Integration / Cleaning 7

60. Constraint-Variance Tolerant Data Repairing.

Paper Link】 【Pages】:877-892

【Authors】: Shaoxu Song ; Han Zhu ; Jianmin Wang

【Abstract】: Integrity constraints, guiding the cleaning of dirty data, are often found to be imprecise as well. Existing studies consider inaccurate constraints that are oversimplified, and thus refine the constraints via inserting more predicates (attributes). We note that imprecise constraints may not only be oversimplified, so that correct data are erroneously identified as violations, but may also be overrefined, such that the constraints overfit the data and fail to identify true violations. In the latter case, deleting excessive predicates applies. To address the oversimplified and overrefined constraint inaccuracies, in this paper, we propose to repair data by allowing a small variation (with both predicate insertion and deletion) on the constraints. A novel θ-tolerant repair model is introduced, which returns a (minimum) data repair that satisfies at least one variant of the constraints (with constraint variation no greater than θ compared to the given constraints). To efficiently repair data among various constraint variants, we propose a single-round, sharing-enabled approach. Results on real data sets demonstrate that our proposal can capture more accurate data repairs compared to the existing methods with/without constraint repairs.

【Keywords】: data repairing; denial constraints

61. Interactive and Deterministic Data Cleaning: A Tossed Stone Raises a Thousand Ripples.

Paper Link】 【Pages】:893-907

【Authors】: Jian He ; Enzo Veltri ; Donatello Santoro ; Guoliang Li ; Giansalvatore Mecca ; Paolo Papotti ; Nan Tang

【Abstract】: We present Falcon, an interactive, deterministic, and declarative data cleaning system, which uses SQL update queries as the language to repair data. Falcon does not rely on the existence of a set of pre-defined data quality rules. On the contrary, it encourages users to explore the data, identify possible problems, and make updates to fix them. Bootstrapped by one user update, Falcon guesses a set of possible SQL update queries that can be used to repair the data. The main technical challenge addressed in this paper consists in finding a set of SQL update queries that is minimal in size and at the same time fixes the largest number of errors in the data. We formalize this problem as a search in a lattice-shaped space. To guarantee that the chosen updates are semantically correct, Falcon navigates the lattice by interacting with users to gradually validate the set of SQL update queries. Besides using traditional one-hop based traversal algorithms (e.g., BFS or DFS), we describe novel multi-hop search algorithms such that Falcon can dive over the lattice and conduct the search efficiently. Our novel search strategy is coupled with a number of optimization techniques to further prune the search space and efficiently maintain the lattice. We have conducted extensive experiments using both real-world and synthetic datasets to show that Falcon can effectively communicate with users in data repairing.

【Keywords】: data cleaning; declarative; deterministic; interactive

62. Sequential Data Cleaning: A Statistical Approach.

Paper Link】 【Pages】:909-924

【Authors】: Aoqian Zhang ; Shaoxu Song ; Jianmin Wang

【Abstract】: Errors are prevalent in data sequences, such as GPS trajectories or sensor readings. Existing methods on cleaning sequential data employ a constraint on value changing speeds and perform constraint-based repairing. While such speed constraints are effective in identifying large spike errors, the small errors that do not significantly deviate from the truth and indeed satisfy the speed constraints can hardly be identified and repaired. To handle such small errors, in this paper, we propose a statistical based cleaning method. Rather than declaring a broad constraint of max/min speeds, we model the probability distribution of speed changes. The repairing problem is thus to maximize the likelihood of the sequence w.r.t. the probability of speed changes. We formalize the likelihood-based cleaning problem, show its NP-hardness, devise exact algorithms, and propose several approximate/heuristic methods to trade off effectiveness for efficiency. Experiments on real data sets (in various applications) demonstrate the superiority of our proposal.
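
A toy version of the likelihood view (with an assumed Gaussian model for speed changes; the paper learns the distribution and searches for the maximum-likelihood repair): a candidate repaired sequence is scored by the log-likelihood of its speed changes, so a repair that removes a small drift scores higher than the dirty sequence even though both satisfy a broad speed constraint:

    import math

    # Toy scoring function in the spirit of likelihood-based cleaning; the
    # Gaussian model and its parameters are assumptions for illustration.

    def speeds(points):
        """points: list of (timestamp, value) with strictly increasing timestamps."""
        return [(v2 - v1) / (t2 - t1) for (t1, v1), (t2, v2) in zip(points, points[1:])]

    def log_likelihood(points, mu=0.0, sigma=1.0):
        sp = speeds(points)
        changes = [s2 - s1 for s1, s2 in zip(sp, sp[1:])]
        return sum(-0.5 * ((c - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
                   for c in changes)

    dirty    = [(0, 0.0), (1, 1.0), (2, 2.4), (3, 3.0)]   # small drift at t=2
    repaired = [(0, 0.0), (1, 1.0), (2, 2.0), (3, 3.0)]
    print(log_likelihood(dirty), log_likelihood(repaired))   # the repair scores higher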

【Keywords】: likelihood-based cleaning; speed changes

63. Learning-Based Cleansing for Indoor RFID Data.

Paper Link】 【Pages】:925-936

【Authors】: Asif Iqbal Baba ; Manfred Jaeger ; Hua Lu ; Torben Bach Pedersen ; Wei-Shinn Ku ; Xike Xie

【Abstract】: RFID is widely used for object tracking in indoor environments, e.g., airport baggage tracking. Analyzing RFID data offers insight into the underlying tracking systems as well as the associated business processes. However, the inherent uncertainty in RFID data, including noise (cross readings) and incompleteness (missing readings), poses challenges to high-level RFID data querying and analysis. In this paper, we address these challenges by proposing a learning-based data cleansing approach that, unlike existing approaches, requires no detailed prior knowledge about the spatio-temporal properties of the indoor space and the RFID reader deployment. Requiring only minimal information about RFID deployment, the approach learns relevant knowledge from raw RFID data and uses it to cleanse the data. In particular, we model raw RFID readings as time series that are sparse because the indoor space is only partly covered by a limited number of RFID readers. We propose the Indoor RFID Multi-variate Hidden Markov Model (IR-MHMM) to capture the uncertainties of indoor RFID data as well as the correlation of moving object locations and object RFID readings. We propose three state space design methods for IR-MHMM that enable the learning of parameters while contending with raw RFID data time series. We solely use raw uncleansed RFID data for the learning of model parameters, requiring no special labeled data or ground truth. The resulting IR-MHMM based RFID data cleansing approach is able to recover missing readings and reduce cross readings with high effectiveness and efficiency, as demonstrated by extensive experimental studies with both synthetic and real data. Given enough indoor RFID data for learning, the proposed approach achieves a data cleansing accuracy comparable to or even better than state-of-the-art techniques requiring very detailed prior knowledge, making our solution superior in terms of both effectiveness and employability.

【Keywords】: data cleansing; hidden markov models; indoor moving objects; rfid

64. PrivateClean: Data Cleaning and Differential Privacy.

Paper Link】 【Pages】:937-951

【Authors】: Sanjay Krishnan ; Jiannan Wang ; Michael J. Franklin ; Ken Goldberg ; Tim Kraska

【Abstract】: Recent advances in differential privacy make it possible to guarantee user privacy while preserving the main characteristics of the data. However, most differential privacy mechanisms assume that the underlying dataset is clean. This paper explores the link between data cleaning and differential privacy in a framework we call PrivateClean. PrivateClean includes a technique for creating private datasets of numerical and discrete-valued attributes, a formalism for privacy-preserving data cleaning, and techniques for answering sum, count, and avg queries after cleaning. We show: (1) how the degree of privacy affects subsequent aggregate query accuracy, (2) how privacy potentially amplifies certain types of errors in a dataset, and (3) how this analysis can be used to tune the degree of privacy. The key insight is to maintain a bipartite graph relating dirty values to clean values and use this graph to estimate biases due to the interaction between cleaning and privacy. We validate these results on four datasets with a variety of well-studied cleaning techniques including using functional dependencies, outlier filtering, and resolving inconsistent attributes.
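
One standard way to privatize a discrete attribute is generalized randomized response, sketched below with assumed parameters; PrivateClean's actual mechanism, privacy accounting, and bias correction for cleaned data differ, but the sketch shows the kind of debiasing that aggregate queries over a privatized column require:

    import random

    # Generalized randomized response: keep the true value with probability p,
    # otherwise report a uniformly random value from the domain. The choice of
    # p and its epsilon calibration are omitted (assumptions for illustration).

    def privatize(value, domain, p=0.75):
        return value if random.random() < p else random.choice(domain)

    def debiased_count(reports, target, domain, p=0.75):
        """Unbiased estimate of how many true values equal `target`."""
        n, d = len(reports), len(domain)
        observed = sum(1 for r in reports if r == target)
        return (observed - n * (1 - p) / d) / p

    domain = ["NY", "SF", "LA"]
    true_values = ["NY"] * 700 + ["SF"] * 200 + ["LA"] * 100
    reports = [privatize(v, domain) for v in true_values]
    print(debiased_count(reports, "NY", domain))   # close to 700 in expectation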

【Keywords】: data cleaning; differential privacy; local differential privacy

65. RDFind: Scalable Conditional Inclusion Dependency Discovery in RDF Datasets.

Paper Link】 【Pages】:953-967

【Authors】: Sebastian Kruse ; Anja Jentzsch ; Thorsten Papenbrock ; Zoi Kaoudi ; Jorge-Arnulfo Quiané-Ruiz ; Felix Naumann

【Abstract】: Inclusion dependencies (INDs) form an important integrity constraint on relational databases, supporting data management tasks, such as join path discovery and query optimization. Conditional inclusion dependencies (CINDs), which define including and included data in terms of conditions, allow to transfer these capabilities to RDF data. However, CIND discovery is computationally much more complex than IND discovery and the number of CINDs even on small RDF datasets is intractable. To cope with both problems, we first introduce the notion of pertinent CINDs with an adjustable relevance criterion to filter and rank CINDs based on their extent and implications among each other. Second, we present RDFind, a distributed system to efficiently discover all pertinent CINDs in RDF data. RDFind employs a lazy pruning strategy to drastically reduce the CIND search space. Also, its exhaustive parallelization strategy and robust data structures make it highly scalable. In our experimental evaluation, we show that RDFind is up to 419 times faster than the state-of-the-art, while considering a more general class of CINDs. Furthermore, it is capable of processing a very large dataset of billions of triples, which was entirely infeasible before.

【Keywords】: CIND discovery; RDF; RDFind; conditional inclusion dependencies; data profiling

66. Cost-Effective Crowdsourced Entity Resolution: A Partial-Order Approach.

Paper Link】 【Pages】:969-984

【Authors】: Chengliang Chai ; Guoliang Li ; Jian Li ; Dong Deng ; Jianhua Feng

【Abstract】: Crowdsourced entity resolution has recently attracted significant attention because it can harness the wisdom of the crowd to improve the quality of entity resolution. However, existing techniques either cannot achieve high quality or incur huge monetary costs. To address these problems, we propose a cost-effective crowdsourced entity resolution framework, which significantly reduces the monetary cost while keeping high quality. We first define a partial order on the pairs of records. Then we select a pair as a question and ask the crowd to check whether the records in the pair refer to the same entity. After getting the answer of this pair, we infer the answers of other pairs based on the partial order. Next we iteratively select pairs without answers to ask until we get the answers of all pairs. We devise effective algorithms to judiciously select the pairs to ask in order to minimize the number of asked pairs. To further reduce the cost, we propose a grouping technique to group the pairs and we only ask one pair instead of all pairs in each group. We develop error-tolerant techniques to tolerate the errors introduced by the partial order and the crowd. Experimental results show that our method reduces the cost to 1.25% of that of existing approaches (i.e., existing approaches incur 80x the monetary cost of our method) while not sacrificing quality.
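
A one-dimensional simplification of the partial-order idea (the paper uses a true partial order over multiple similarity attributes and tolerates crowd errors): if pairs are ordered by a single similarity score, confirming one pair as a match implies that all higher-similarity pairs match, and a binary search resolves all pairs with a logarithmic number of crowd questions. Names and the simulated crowd are assumptions:

    def resolve(pairs, ask_crowd):
        """pairs: list of (similarity, record_pair), sorted by ascending similarity."""
        lo, hi = 0, len(pairs)     # invariant: pairs[hi:] match, pairs[:lo] do not
        questions = 0
        while lo < hi:
            mid = (lo + hi) // 2
            questions += 1
            if ask_crowd(pairs[mid][1]):   # crowd says: this pair is a match
                hi = mid
            else:
                lo = mid + 1
        labels = [i >= hi for i in range(len(pairs))]
        return labels, questions

    # Toy workload: pairs (r_i, s_i) with similarity 0.1*i; ground truth: i >= 7 match.
    pairs = [(0.1 * i, ("r%d" % i, "s%d" % i)) for i in range(10)]
    labels, asked = resolve(pairs, ask_crowd=lambda p: int(p[0][1:]) >= 7)
    print(labels, asked)   # last three pairs inferred as matches after ~4 questions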

【Keywords】: crowdsourcing; entity resolution; partial order

Session 11 - Spatio / Temporal Databases 7

67. Topic Exploration in Spatio-Temporal Document Collections.

Paper Link】 【Pages】:985-998

【Authors】: Kaiqi Zhao ; Lisi Chen ; Gao Cong

【Abstract】: Huge amounts of data with both spatial and temporal information (e.g., geo-tagged tweets) are being generated, and are often used to share and spread personal updates, spontaneous ideas, and breaking news. We refer to such data as spatio-temporal documents. It is of great interest to explore topics in a collection of spatio-temporal documents. In this paper, we study the problem of efficiently mining topics from spatio-temporal documents within a user specified bounded region and timespan, to provide users with insights about events, trends, and public concerns within the specified region and time period. We propose a novel algorithm that is able to efficiently combine two pre-trained topic models learnt from two document sets with a bounded error, based on which we develop an efficient approach to mining topics from a large number of spatio-temporal documents within a region and a timespan. Our experimental results show that our approach is able to improve the runtime by at least an order of magnitude compared with the baselines. Meanwhile, the effectiveness of our proposed method is close to the baselines.

【Keywords】: algorithm; exploratory queries; spatial-temporal data; topic model

68. ParTime: Parallel Temporal Aggregation.

Paper Link】 【Pages】:999-1010

【Authors】: Markus Pilman ; Martin Kaufmann ; Florian Köhl ; Donald Kossmann ; Damien Profeta

【Abstract】: This paper presents ParTime, a parallel algorithm for temporal aggregation. Temporal aggregation is one of the most important, yet most complex temporal query operators. It has been extensively studied in the past, but so far there has only been one attempt to parallelize this operator. ParTime supports data parallelism and has a number of additional advantages: It supports the full bitemporal data model, it requires no a-priori indexing, it supports shared computation, and it runs well on modern hardware (i.e., NUMA machines with large main memories). We implemented ParTime in a parallel database system and carried out comprehensive performance experiments with a real workload from the airline industry and a synthetic benchmark, the TPC-BiH benchmark. The results show that ParTime significantly outperforms any other available temporal database system. Furthermore, the results show that ParTime is competitive as compared to the Timeline Index, the best known technique to process temporal queries from the research literature and which is based on pre-computation and indexing.
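
The core of temporal COUNT aggregation is an endpoint sweep over the validity intervals, as in the minimal sequential sketch below (ParTime's contribution is parallelizing this kind of computation over the full bitemporal model, which the sketch does not attempt):

    # Sequential endpoint sweep for temporal COUNT over half-open intervals
    # [start, end); reports the count for every constant interval where it is non-zero.

    def temporal_count(intervals):
        events = []
        for start, end in intervals:
            events.append((start, +1))
            events.append((end, -1))
        events.sort()
        result, active, prev = [], 0, None
        for t, delta in events:
            if prev is not None and t != prev and active > 0:
                result.append(((prev, t), active))
            active += delta
            prev = t
        return result

    print(temporal_count([(1, 5), (3, 9), (4, 6)]))
    # [((1, 3), 1), ((3, 4), 2), ((4, 5), 3), ((5, 6), 2), ((6, 9), 1)]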

【Keywords】: in-memory databases; query processing; shared scans; temporal data; temporal query processing

69. Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets.

Paper Link】 【Pages】:1011-1025

【Authors】: Fernando Chirigati ; Harish Doraiswamy ; Theodoros Damoulas ; Juliana Freire

【Abstract】: The increasing ability to collect data from urban environments, coupled with a push towards openness by governments, has resulted in the availability of numerous spatio-temporal data sets covering diverse aspects of a city. Discovering relationships between these data sets can produce new insights by enabling domain experts to not only test but also generate hypotheses. However, discovering these relationships is difficult. First, a relationship between two data sets may occur only at certain locations and/or time periods. Second, the sheer number and size of the data sets, coupled with the diverse spatial and temporal scales at which the data is available, presents computational challenges on all fronts, from indexing and querying to analyzing them. Finally, it is non-trivial to differentiate between meaningful and spurious relationships. To address these challenges, we propose Data Polygamy, a scalable topology-based framework that allows users to query for statistically significant relationships between spatio-temporal data sets. We have performed an experimental evaluation using over 300 spatial-temporal urban data sets which shows that our approach is scalable and effective at identifying interesting relationships.

【Keywords】: relationship querying; spatio-temporal data analysis; topology-based techniques; urban data

70. Distributed Evaluation of Top-k Temporal Joins.

Paper Link】 【Pages】:1027-1039

【Authors】: Julien Pilourdault ; Vincent Leroy ; Sihem Amer-Yahia

【Abstract】: We study a particular kind of join, coined Ranked Temporal Join (RTJ), featuring predicates that compare time intervals and a scoring function associated with each predicate to quantify how well it is satisfied. RTJ queries are prevalent in a variety of applications such as network traffic monitoring, task scheduling, and tweet analysis. RTJ queries are often best interpreted as top-k queries where only the best matches are returned. We show how to exploit the nature of temporal predicates and the properties of their associated scoring semantics to design TKIJ, an efficient query evaluation approach on a distributed Map-Reduce architecture. TKIJ relies on an offline statistics computation that, given a time partitioning into granules, computes the distribution of intervals' endpoints in each granule, and an online computation that generates query-dependent score bounds. Those statistics are used for workload assignment to reducers. This aims at reducing data replication, to limit I/O cost. Additionally, high-scoring results are distributed evenly to enable each reducer to prune unnecessary results. Our extensive experiments on synthetic and real datasets show that TKIJ outperforms state-of-the-art competitors and provides very good performance for n-ary RTJ queries on temporal data.

【Keywords】: distributed processing; join; temporal data; top-k
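
As a point of reference for the query semantics, the sketch below is a naive, centralized baseline for a ranked temporal join: it scores every interval pair by overlap length and keeps the top-k with a bounded heap. The scoring function and all names are illustrative assumptions; TKIJ's statistics-driven distributed evaluation is what makes this tractable at scale and is not shown.

```python
import heapq

def overlap_score(a, b):
    """Score an interval pair by the length of their overlap."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return max(0, min(a_hi, b_hi) - max(a_lo, b_lo))

def naive_ranked_temporal_join(R, S, k):
    """Top-k pairs (r, s) by overlap_score, computed by brute force."""
    heap = []  # min-heap of (score, r, s); size bounded by k
    for r in R:
        for s in S:
            sc = overlap_score(r, s)
            if len(heap) < k:
                heapq.heappush(heap, (sc, r, s))
            elif sc > heap[0][0]:
                heapq.heapreplace(heap, (sc, r, s))
    return sorted(heap, reverse=True)

R = [(0, 10), (5, 6)]
S = [(4, 12), (9, 20)]
print(naive_ranked_temporal_join(R, S, 2))
# [(6, (0, 10), (4, 12)), (1, (0, 10), (9, 20))]
```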

71. AT-GIS: Highly Parallel Spatial Query Processing with Associative Transducers.

Paper Link】 【Pages】:1041-1054

【Authors】: Peter Ogden ; David B. Thomas ; Peter Pietzuch

【Abstract】: Users in many domains, including urban planning, transportation, and environmental science, want to execute analytical queries over continuously updated spatial datasets. Current solutions for large-scale spatial query processing either rely on extensions to RDBMS, which entails expensive loading and indexing phases when the data changes, or distributed map/reduce frameworks, running on resource-hungry compute clusters. Both solutions struggle with the sequential bottleneck of parsing complex, hierarchical spatial data formats, which frequently dominates query execution time. Our goal is to fully exploit the parallelism offered by modern multi-core CPUs for parsing and query execution, thus providing the performance of a cluster with the resources of a single machine. We describe AT-GIS, a highly-parallel spatial query processing system that scales linearly to a large number of CPU cores. AT-GIS integrates the parsing and querying of spatial data using a new computational abstraction called associative transducers (ATs). ATs can form a single data-parallel pipeline for computation without requiring the spatial input data to be split into logically independent blocks. Using ATs, AT-GIS can execute, in parallel, spatial query operators on the raw input data in multiple formats, without any pre-processing. On a single 64-core machine, AT-GIS provides 3x the performance of an 8-node Hadoop cluster with 192 cores for containment queries, and 10x for aggregation queries.

【Keywords】: JSON; NODB; XML; multi-core CPUs; parallel automata; spatial query processing

Paper Link】 【Pages】:1055-1070

【Authors】: Kaiyu Feng ; Gao Cong ; Sourav S. Bhowmick ; Wen-Chih Peng ; Chunyan Miao

【Abstract】: The increasing popularity and growth of mobile devices and location-based services enable us to utilize large-scale geo-tagged data to support novel location-based applications. This paper introduces a novel problem called the best region search (BRS) problem and provides efficient solutions to it. Given a set O of spatial objects, a submodular monotone aggregate score function, and the size a x b of a query rectangle, the BRS problem aims to find an a x b rectangular region such that the aggregate score of the spatial objects inside the region is maximized. This problem is fundamental to support several real-world applications such as most influential region search (e.g., the best location for signage to attract the largest audience) and most diversified region search (e.g., the region with the most diverse facilities). We propose an efficient algorithm called SliceBRS to find the exact answer to the BRS problem. Furthermore, we propose an approximate solution called CoverBRS and prove that the answer found by it is bounded by a constant. Our experimental study with real-world datasets and applications demonstrates the effectiveness and superiority of our proposed algorithms.

【Keywords】: aggregation; data exploration; optimization; region search
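
To make the problem statement concrete, the brute-force sketch below solves a tiny BRS instance with a SUM aggregate (which is monotone and submodular). It relies on the standard observation that some optimal a x b rectangle can be shifted so that its right and top edges touch object coordinates, so only those candidate corners are enumerated. This is a naive baseline for intuition, not SliceBRS or CoverBRS; all names are hypothetical.

```python
def best_region_search_bruteforce(points, a, b):
    """points: list of (x, y, score). Returns (best_score, (x_hi, y_hi)),
    where the best a x b rectangle is [x_hi - a, x_hi] x [y_hi - b, y_hi]."""
    xs = [x for x, _, _ in points]
    ys = [y for _, y, _ in points]
    best = (0.0, None)
    for x_hi in xs:                      # right edge touches some point
        for y_hi in ys:                  # top edge touches some point
            score = sum(s for x, y, s in points
                        if x_hi - a <= x <= x_hi and y_hi - b <= y <= y_hi)
            if score > best[0]:
                best = (score, (x_hi, y_hi))
    return best

pts = [(0, 0, 1.0), (1, 1, 2.0), (5, 5, 1.5)]
print(best_region_search_bruteforce(pts, 2, 2))   # (3.0, (1, 1))
```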

73. Simba: Efficient In-Memory Spatial Analytics.

Paper Link】 【Pages】:1071-1085

【Authors】: Dong Xie ; Feifei Li ; Bin Yao ; Gefei Li ; Liang Zhou ; Minyi Guo

【Abstract】: Large spatial data has become ubiquitous. As a result, it is critical to provide fast, scalable, and high-throughput spatial queries and analytics for numerous applications in location-based services (LBS). Traditional spatial databases and spatial analytics systems are disk-based and optimized for IO efficiency. But increasingly, data are stored and processed in memory to achieve low latency, and CPU time becomes the new bottleneck. We present the Simba (Spatial In-Memory Big data Analytics) system that offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba is based on Spark and runs over a cluster of commodity machines. In particular, Simba extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API. It introduces indexes over RDDs in order to work with big spatial data and complex spatial operations. Lastly, Simba implements an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput. Extensive experiments over large data sets demonstrate Simba's superior performance compared against other spatial analytics systems.

【Keywords】: big spatial data; in-memory computing; main-memory DBMS; spark; spark SQL; spatial analytics; spatial queries

Session 12 - Distributed Data Processing 6

74. Realtime Data Processing at Facebook.

Paper Link】 【Pages】:1087-1098

【Authors】: Guoqiang Jerry Chen ; Janet L. Wiener ; Shridhar Iyer ; Anshul Jaiswal ; Ran Lei ; Nikhil Simha ; Wei Wang ; Kevin Wilfong ; Tim Williamson ; Serhat Yilmaz

【Abstract】: Realtime data processing powers many use cases at Facebook, including realtime reporting of the aggregated, anonymized voice of Facebook users, analytics for mobile applications, and insights for Facebook page administrators. Many companies have developed their own systems; we have a realtime data processing ecosystem at Facebook that handles hundreds of gigabytes per second across hundreds of data pipelines. Many decisions must be made while designing a realtime stream processing system. In this paper, we identify five important design decisions that affect the ease of use, performance, fault tolerance, scalability, and correctness of such systems. We compare the alternative choices for each decision and contrast what we built at Facebook to other published systems. Our main decision was targeting seconds of latency, not milliseconds. Seconds is fast enough for all of the use cases we support and it allows us to use a persistent message bus for data transport. This data transport mechanism then paved the way for fault tolerance, scalability, and multiple options for correctness in our stream processing systems Puma, Swift, and Stylus. We then illustrate how our decisions and systems satisfy our requirements for multiple use cases at Facebook. Finally, we reflect on the lessons we learned as we built and operated these systems.

【Keywords】: realtime data processing; stream processing

75. SparkR: Scaling R Programs with Spark.

Paper Link】 【Pages】:1099-1104

【Authors】: Shivaram Venkataraman ; Zongheng Yang ; Davies Liu ; Eric Liang ; Hossein Falaki ; Xiangrui Meng ; Reynold Xin ; Ali Ghodsi ; Michael J. Franklin ; Ion Stoica ; Matei Zaharia

【Abstract】: R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the R runtime is single threaded and can only process data sets that fit in a single machine's memory. We present SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large scale data analysis from the R shell. We describe the main design goals of SparkR, discuss how the high-level DataFrame API enables scalable computation and present some of the key details of our implementation.

【Keywords】: R; spark; statistical computing

76. VectorH: Taking SQL-on-Hadoop to the Next Level.

Paper Link】 【Pages】:1105-1117

【Authors】: Andrei Costea ; Adrian Ionescu ; Bogdan Raducanu ; Michal Switakowski ; Cristian Bârca ; Juliusz Sompolski ; Alicja Luszczak ; Michal Szafranski ; Giel de Nijs ; Peter A. Boncz

【Abstract】: Actian Vector in Hadoop (VectorH for short) is a new SQL-on-Hadoop system built on top of the fast Vectorwise analytical database system. VectorH achieves fault tolerance and storage scalability by relying on HDFS, and extends the state-of-the-art in SQL-on-Hadoop systems by instrumenting the HDFS replication policy to optimize read locality. VectorH integrates with YARN for workload management, achieving a high degree of elasticity. Even though HDFS is an append-only filesystem, and VectorH supports (update-averse) ordered tables, trickle updates are possible thanks to Positional Delta Trees (PDTs), a differential update structure that can be queried efficiently. We describe the changes made to single-server Vectorwise to turn it into a Hadoop-based MPP system, encompassing workload management, parallel query optimization and execution, HDFS storage, transaction processing and Spark integration. We evaluate VectorH against HAWQ, Impala, SparkSQL and Hive, showing orders of magnitude better performance.

【Keywords】: SQL-on-Hadoop; column stores; distributed systems; parallel database systems; query optimization; query processing

77. Adaptive Logging: Optimizing Logging and Recovery Costs in Distributed In-memory Databases.

Paper Link】 【Pages】:1119-1134

【Authors】: Chang Yao ; Divyakant Agrawal ; Gang Chen ; Beng Chin Ooi ; Sai Wu

【Abstract】: By maintaining the data in main memory, in-memory databases dramatically reduce the I/O cost of transaction processing. However, for recovery purposes, in-memory systems still need to flush the log to disk, which incurs a substantial number of I/Os. Recently, command logging has been proposed to replace the traditional data log (e.g., ARIES logging) in in-memory databases. Instead of recording how the tuples are updated, command logging only tracks the transactions that are being executed, thereby effectively reducing the size of the log and improving the performance. However, when a failure occurs, all the transactions in the log after the last checkpoint must be redone sequentially and this significantly increases the cost of recovery. In this paper, we first extend the command logging technique to a distributed system, where all the nodes can perform their recovery in parallel. We show that in a distributed system, the only bottleneck of recovery caused by command logging is the synchronization process that attempts to resolve the data dependency among the transactions. We then propose an adaptive logging approach by combining data logging and command logging. The percentage of data logging versus command logging becomes a tuning knob between the performance of transaction processing and recovery to meet different OLTP requirements, and a model is proposed to guide such tuning. Our experimental study compares the performance of our proposed adaptive logging, ARIES-style data logging and command logging on top of H-Store. The results show that adaptive logging can achieve a 10x boost for recovery and a transaction throughput that is comparable to that of command logging.

【Keywords】: OLTP; aries logging; command logging; distributed in-memory database; transaction logging
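
To make the contrast in the abstract concrete, the toy below shows the two logging styles on a single-node key-value "database": the data log records after-images of the state, while the command log records only the procedure name and arguments and re-executes deterministic logic at recovery. Everything here (procedure registry, full-state after-images) is a simplification for illustration, not H-Store's or the paper's implementation.

```python
def transfer(db, src, dst, amt):
    """A deterministic stored procedure; command logging re-executes it on recovery."""
    db[src] -= amt
    db[dst] += amt

PROCEDURES = {"transfer": transfer}       # registry of stored procedures

def run(db, data_log, command_log, proc, *args):
    PROCEDURES[proc](db, *args)
    data_log.append(dict(db))             # data logging (simplified): full after-image
    command_log.append((proc, args))      # command logging: just name and arguments

def recover_from_data_log(data_log, checkpoint):
    return dict(data_log[-1]) if data_log else dict(checkpoint)

def recover_from_command_log(command_log, checkpoint):
    db = dict(checkpoint)
    for proc, args in command_log:        # transactions must be replayed in order
        PROCEDURES[proc](db, *args)
    return db

checkpoint = {"alice": 100, "bob": 0}
db, data_log, command_log = dict(checkpoint), [], []
run(db, data_log, command_log, "transfer", "alice", "bob", 30)
assert recover_from_data_log(data_log, checkpoint) \
       == recover_from_command_log(command_log, checkpoint) == {"alice": 70, "bob": 30}
```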

78. Big Data Analytics with Datalog Queries on Spark.

Paper Link】 【Pages】:1135-1149

【Authors】: Alexander Shkapsky ; Mohan Yang ; Matteo Interlandi ; Hsuan Chiu ; Tyson Condie ; Carlo Zaniolo

【Abstract】: There is great interest in exploiting the opportunity provided by cloud computing platforms for large-scale analytics. Among these platforms, Apache Spark is growing in popularity for machine learning and graph analytics. Developing efficient complex analytics in Spark requires deep understanding of both the algorithm at hand and the Spark API or subsystem APIs (e.g., Spark SQL, GraphX). Our BigDatalog system addresses the problem by providing concise declarative specification of complex queries amenable to efficient evaluation. Towards this goal, we propose compilation and optimization techniques that tackle the important problem of efficiently supporting recursion in Spark. We perform an experimental comparison with other state-of-the-art large-scale Datalog systems and verify the efficacy of our techniques and effectiveness of Spark in supporting Datalog-based analytics.

【Keywords】: datalog; monotonic aggregates; recursive queries; spark

79. An Efficient MapReduce Cube Algorithm for Varied Data Distributions.

Paper Link】 【Pages】:1151-1165

【Authors】: Tova Milo ; Eyal Altshuler

【Abstract】: Data cubes allow users to discover insights from their data and are commonly used in data analysis. While very useful, the data cube is expensive to compute, in particular when the input relation is very large. To address this problem, we consider cube computation in MapReduce, the popular paradigm for distributed big data processing, and present an efficient algorithm for computing cubes over large data sets. We show that our new algorithm consistently performs better than the previous solutions. In particular, existing techniques for cube computation in MapReduce suffer from sensitivity to the distribution of the input data and their performance heavily depends on whether or not, and how exactly, the data is skewed. In contrast, the cube algorithm that we present here is resilient and significantly outperforms previous solutions for varying data distributions. At the core of our solution is a dedicated data structure called the Skews and Partitions Sketch (SP-Sketch for short). The SP-Sketch is compact in size and fast to compute, and records all needed information for identifying skews and effectively partitioning the workload between the machines. Our algorithm uses the sketch to speed up computation and minimize communication overhead. Our theoretical analysis and thorough experimental study demonstrate the feasibility and efficiency of our solution, including comparisons to state of the art tools for big data processing such as Pig and Hive.

【Keywords】: MapReduce; datacubes
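
For context on what a cube computation produces, the sketch below materializes every group-by of a small relation on a single machine. The SP-Sketch and the skew-aware MapReduce partitioning that the paper contributes are precisely what this naive version lacks; names are illustrative.

```python
from itertools import combinations
from collections import defaultdict

def cube(rows, dims, measure):
    """Compute SUM(measure) for every subset of the grouping dimensions.
    rows: list of dicts. Returns {dims_subset: {group_key: sum}}."""
    result = {}
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            agg = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in subset)   # () is the grand total
                agg[key] += row[measure]
            result[subset] = dict(agg)
    return result

rows = [{"city": "NYC", "year": 2015, "sales": 10.0},
        {"city": "NYC", "year": 2016, "sales": 20.0},
        {"city": "SF",  "year": 2016, "sales": 5.0}]
out = cube(rows, ["city", "year"], "sales")
print(out[()])            # {(): 35.0}  -- grand total
print(out[("city",)])     # {('NYC',): 30.0, ('SF',): 5.0}
```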

Session 13 - Graphs 2: Subgraph-based Optimization Techniques 6

80. Diversified Top-k Subgraph Querying in a Large Graph.

Paper Link】 【Pages】:1167-1182

【Authors】: Zhengwei Yang ; Ada Wai-Chee Fu ; Ruifeng Liu

【Abstract】: Subgraph querying in a large data graph is interesting for different applications. A recent study shows that top-k diversified results are useful since the number of matching subgraphs can be very large. In this work, we study the problem of top-k diversified subgraph querying that asks for a set of up to k subgraphs isomorphic to a given query graph, and that covers the largest number of vertices. We propose a novel level-based algorithm for this problem which supports early termination and has a theoretical approximation guarantee. In our experiments on real datasets used in previous works, most results are near-optimal, with query times within 10ms on a commodity machine.

【Keywords】: diversity; maximum k-coverage; subgraph isomorphism; top-k
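
The coverage objective described above is an instance of maximum k-coverage, for which the classic greedy algorithm gives a (1 - 1/e) approximation. The sketch below applies that greedy step to an already-materialized set of matches; the paper's level-based algorithm with early termination avoids materializing all matches and is not reproduced here.

```python
def greedy_diversified_topk(matches, k):
    """matches: list of vertex sets (isomorphic embeddings of the query).
    Greedily pick up to k matches maximizing the number of covered vertices."""
    covered, chosen = set(), []
    remaining = list(matches)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda m: len(m - covered))
        if not best - covered:       # no remaining match adds new vertices
            break
        chosen.append(best)
        covered |= best
        remaining.remove(best)
    return chosen, covered

matches = [{1, 2, 3}, {2, 3, 4}, {7, 8, 9}, {1, 4, 7}]
chosen, covered = greedy_diversified_topk(matches, 2)
print(chosen)        # [{1, 2, 3}, {7, 8, 9}]
print(len(covered))  # 6
```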

81. Graph Indexing for Shortest-Path Finding over Dynamic Sub-Graphs.

Paper Link】 【Pages】:1183-1197

【Authors】: Mohamed S. Hassan ; Walid G. Aref ; Ahmed M. Aly

【Abstract】: A variety of applications spanning various domains, e.g., social networks, transportation, and bioinformatics, have graphs as first-class citizens. These applications share a vital operation, namely, finding the shortest path between two nodes. In many scenarios, users are interested in filtering the graph before finding the shortest path. For example, in social networks, one may need to compute the shortest path between two persons on a sub-graph containing only family relationships. This paper focuses on dynamic graphs with labeled edges, where the target is to find a shortest path after filtering some edges based on user-specified query labels. This problem is termed the Edge-Constrained Shortest Path query (or ECSP, for short). This paper introduces Edge-Disjoint Partitioning (EDP, for short), a new technique for efficiently answering ECSP queries over dynamic graphs. EDP has two main components: a dynamic index that is based on graph partitioning, and a traversal algorithm that exploits the regular patterns of the answers of ECSP queries. The main idea of EDP is to partition the graph based on the labels of the edges. On demand, EDP computes specific sub-paths within each partition and updates its index. The computed sub-paths act as pre-computations that can be leveraged by future queries. To answer an ECSP query, EDP connects sub-paths from different partitions using its efficient traversal algorithm. EDP can dynamically handle various types of graph updates, e.g., label, edge, and node updates. The index entries that are potentially affected by graph updates are invalidated and re-computed on demand. EDP is evaluated using real graph datasets from various domains. Experimental results demonstrate that EDP can achieve query performance gains of up to four orders of magnitude in comparison to state of the art techniques.

【Keywords】: graph data management; graph indexing; graph partitioning; graph query; query optimization; query processing; shortest path
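
As a baseline for what an ECSP query computes, the sketch below runs a plain Dijkstra search that skips edges whose label is outside the query's allowed set. EDP's partition-based index and on-demand sub-path materialization, which provide the reported speedups, are not modeled; the graph encoding is an assumption for the example.

```python
import heapq

def ecsp_baseline(graph, source, target, allowed_labels):
    """graph: {u: [(v, weight, label), ...]}. Shortest-path distance from
    source to target using only edges whose label is in allowed_labels."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue                      # stale queue entry
        for v, w, label in graph.get(u, []):
            if label not in allowed_labels:
                continue                  # edge filtered out by the query
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

g = {"a": [("b", 1.0, "family"), ("c", 1.0, "friend")],
     "b": [("d", 1.0, "family")],
     "c": [("d", 0.1, "friend")]}
print(ecsp_baseline(g, "a", "d", {"family"}))   # 2.0
```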

82. Efficient Subgraph Matching by Postponing Cartesian Products.

Paper Link】 【Pages】:1199-1214

【Authors】: Fei Bi ; Lijun Chang ; Xuemin Lin ; Lu Qin ; Wenjie Zhang

【Abstract】: In this paper, we study the problem of subgraph matching that extracts all subgraph isomorphic embeddings of a query graph q in a large data graph G. The existing algorithms for subgraph matching follow Ullmann's backtracking approach; that is, they iteratively map query vertices to data vertices by following a matching order of query vertices. It has been shown that the matching order of query vertices is a very important aspect of the efficiency of a subgraph matching algorithm. Recently, many advanced techniques, such as enforcing connectivity and merging similar vertices in query or data graphs, have been proposed to provide an effective matching order with the aim of reducing unpromising intermediate results, especially the ones caused by redundant Cartesian products. In this paper, for the first time we address the issue of unpromising results by Cartesian products from "dissimilar" vertices. We propose a new framework by postponing the Cartesian products based on the structure of a query to minimize the redundant Cartesian products. Our second contribution is proposing a new path-based auxiliary data structure, with the size O(|E(G)| x |V(q)|), to generate a matching order and conduct subgraph matching, which significantly reduces the exponential size O(|V(G)|^(|V(q)|-1)) of the existing path-based auxiliary data structure, where V(G) and E(G) are the vertex and edge sets of a data graph G, respectively, and V(q) is the vertex set of a query q. Extensive empirical studies on real and synthetic graphs demonstrate that our techniques outperform the state-of-the-art algorithms by up to 3 orders of magnitude.

【Keywords】: compact path index; core-forest-leaf decomposition; postpone cartesian products; subgraph isomorphism
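
For readers unfamiliar with the baseline the abstract refers to, the sketch below is a bare Ullmann-style backtracking matcher with a fixed, degree-based matching order. It illustrates where redundant Cartesian products come from; the paper's postponing framework and compact path index are not reproduced, and the graph encoding is an assumption.

```python
def subgraph_matches(query, data):
    """All embeddings of query in data (both graphs: {vertex: set(neighbors)},
    undirected). Ullmann-style backtracking with a fixed matching order."""
    order = sorted(query, key=lambda u: -len(query[u]))   # high degree first
    results = []

    def backtrack(mapping):
        if len(mapping) == len(order):
            results.append(dict(mapping))
            return
        u = order[len(mapping)]
        for v in data:                                    # candidate data vertex
            if v in mapping.values():
                continue
            # every already-mapped query neighbor must map to a data neighbor
            if all(mapping[q] in data[v] for q in query[u] if q in mapping):
                mapping[u] = v
                backtrack(mapping)
                del mapping[u]

    backtrack({})
    return results

query = {"x": {"y"}, "y": {"x"}}                          # a single edge
data = {1: {2}, 2: {1, 3}, 3: {2}}
print(len(subgraph_matches(query, data)))                 # 4 embeddings
```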

83. Adding Counting Quantifiers to Graph Patterns.

Paper Link】 【Pages】:1215-1230

【Authors】: Wenfei Fan ; Yinghui Wu ; Jingbo Xu

【Abstract】: This paper proposes quantified graph patterns (QGPs), an extension of graph patterns by supporting simple counting quantifiers on edges. We show that QGPs naturally express universal and existential quantification, numeric and ratio aggregates, as well as negation. Better still, the increased expressivity does not come with a much higher price. We show that quantified matching, i.e., graph pattern matching with QGPs, remains NP-complete in the absence of negation, and is DP-complete for general QGPs. We show how quantified matching can be conducted by incorporating quantifier checking into conventional subgraph isomorphism methods. We also develop parallel scalable algorithms for quantified matching. As an application of QGPs, we introduce quantified graph association rules defined with QGPs, to identify potential customers in social media marketing. Using real-life and synthetic graphs, we experimentally verify the effectiveness of QGPs and the scalability of our algorithms.

【Keywords】: graph association rules; quantified graph patterns

84. DUALSIM: Parallel Subgraph Enumeration in a Massive Graph on a Single Machine.

Paper Link】 【Pages】:1231-1245

【Authors】: Hyeonji Kim ; Juneyoung Lee ; Sourav S. Bhowmick ; Wook-Shin Han ; Jeong-Hoon Lee ; Seongyun Ko ; Moath H. A. Jarrah

【Abstract】: Subgraph enumeration is important for many applications such as subgraph frequencies, network motif discovery, graphlet kernel computation, and studying the evolution of social networks. Most earlier work on subgraph enumeration assumes that graphs are resident in memory, which results in serious scalability problems. Recently, efforts to enumerate all subgraphs in a large-scale graph have seemed to enjoy some success by partitioning the data graph and exploiting the distributed frameworks such as MapReduce and distributed graph engines. However, we notice that all existing distributed approaches have serious performance problems for subgraph enumeration due to the explosive number of partial results. In this paper, we design and implement a disk-based, single machine parallel subgraph enumeration solution called DualSim that can handle massive graphs without maintaining exponential numbers of partial results. Specifically, we propose a novel concept of the dual approach for subgraph enumeration. The dual approach swaps the roles of the data graph and the query graph. Specifically, instead of fixing the matching order in the query and then matching data vertices, it fixes the data vertices by fixing a set of disk pages and then finds all subgraph matchings in these pages. This enables us to significantly reduce the number of disk reads. We conduct extensive experiments with various real-world graphs to systematically demonstrate the superiority of DualSim over state-of-the-art distributed subgraph enumeration methods. DualSim outperforms the state-of-the-art methods by up to orders of magnitude, while they fail for many queries due to explosive intermediate results.

【Keywords】: dual approach; graph analytics; subgraph enumeration

85. Distributed Set Reachability.

Paper Link】 【Pages】:1247-1261

【Authors】: Sairam Gurajada ; Martin Theobald

【Abstract】: In this paper, we focus on the efficient and scalable processing of set-reachability queries over a distributed, directed data graph. A "set-reachability query" is a generalized form of a reachability query, in which we consider two sets S and T of source and target vertices, respectively, to be given as the query. The result of a set-reachability query is the set of all pairs of source and target vertices (s, t), with s ∈ S and t ∈ T, such that s can reach t (denoted as S ↝ T). In case the data graph is partitioned into multiple, edge- and vertex-disjoint subgraphs (e.g., when distributed across multiple compute nodes in a cluster), we refer to the resulting set-reachability problem as "distributed set reachability". The key goal in processing a distributed set-reachability query over a partitioned data graph both efficiently and in a scalable manner is (1) to avoid redundant computations within the local compute nodes as much as possible, (2) to partially evaluate the local components of a set-reachability query S ↝ T among all compute nodes in parallel, and (3) to minimize both the size and number of messages exchanged among the compute nodes. Distributed set reachability has a plethora of applications in graph analytics and for query processing. The current W3C recommendation for SPARQL 1.1, for example, introduces a notion of "labeled property paths" which resolves to processing a form of generalized graph-pattern queries with set-reachability predicates. Moreover, analyzing dependencies among "social-network communities" inherently involves reachability checks between large sets of source and target vertices. Our experiments confirm very significant performance gains of our approach in comparison to state-of-the-art graph engines such as Giraph++, and over a variety of graph collections with up to 1.4 billion edges.

【Keywords】: SparQL 1.1 property paths; distributed system; large scale graphs; set reachability
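
To pin down the query semantics, the single-machine sketch below computes S ↝ T by running one BFS per source vertex and collecting the targets it reaches. The paper's contribution, evaluating this over a partitioned graph while minimizing redundant local work and message volume, is deliberately absent from this baseline.

```python
from collections import deque

def set_reachability(graph, sources, targets):
    """graph: {u: [v, ...]} adjacency lists of a directed graph.
    Returns all pairs (s, t) with s in sources, t in targets, and s reaching t."""
    targets = set(targets)
    pairs = []
    for s in sources:
        seen, queue = {s}, deque([s])
        while queue:
            u = queue.popleft()
            if u in targets:
                pairs.append((s, u))
            for v in graph.get(u, []):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
    return pairs

g = {1: [2], 2: [3], 4: [1]}
print(set_reachability(g, sources=[1, 4], targets=[3]))
# [(1, 3), (4, 3)]
```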

Session 14 - Main Memory Analytics 5

86. Fast Multi-Column Sorting in Main-Memory Column-Stores.

Paper Link】 【Pages】:1263-1278

【Authors】: Wenjian Xu ; Ziqiang Feng ; Eric Lo

【Abstract】: Sorting is a crucial operation that could be used to implement SQL operators such as GROUP BY, ORDER BY, and SQL:2003 PARTITION BY. Queries with multiple attributes in those clauses are common in real workloads. When executing queries of that kind, state-of-the-art main-memory column-stores require one round of sorting per input column. With the advent of recent fast scans and denormalization techniques, that kind of multi-column sorting could become a bottleneck. In this paper, we propose a new technique called "code massaging", which manipulates the bits across the columns so that the overall sorting time can be reduced by eliminating some rounds of sorting and/or by improving the degree of SIMD data level parallelism. Empirical results show that a main-memory column-store with code massaging can achieve speedup of up to 4.7X, 4.7X, 4X, and 3.2X on TPC-H, TPC-H skew, TPC-DS, and real workload, respectively.

【Keywords】: SIMD; column-store; main-memory; multi-column sorting
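
The toy below illustrates only the general idea of collapsing a multi-column sort into a single sort by packing dictionary-encoded, non-negative columns into one integer key. The paper's code massaging additionally rearranges bits across columns to improve SIMD data-level parallelism, which this sketch does not attempt; the column contents are made up.

```python
def pack_keys(col_hi, col_lo):
    """Pack two dictionary-encoded columns into one integer sort key so that
    a single sort by the packed key equals ORDER BY col_hi, col_lo."""
    bits_lo = max(1, max(col_lo).bit_length())
    return [(hi << bits_lo) | lo for hi, lo in zip(col_hi, col_lo)]

col_a = [2, 1, 2, 1]          # major sort column
col_b = [0, 3, 1, 1]          # minor sort column
packed = pack_keys(col_a, col_b)

one_sort = sorted(range(4), key=lambda i: packed[i])
two_column_sort = sorted(range(4), key=lambda i: (col_a[i], col_b[i]))
assert one_sort == two_column_sort == [3, 1, 0, 2]
```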

87. Elastic Pipelining in an In-Memory Database Cluster.

Paper Link】 【Pages】:1279-1294

【Authors】: Li Wang ; Minqi Zhou ; Zhenjie Zhang ; Yin Yang ; Aoying Zhou ; Dina Bitton

【Abstract】: An in-memory database cluster consists of multiple interconnected nodes with a large capacity of RAM and modern multi-core CPUs. As a conventional query processing strategy, pipelining remains a promising solution for in-memory parallel database systems, as it avoids expensive intermediate result materialization and parallelizes the data processing among nodes. However, to fully unleash the power of pipelining in a cluster with multi-core nodes, it is crucial for the query optimizer to generate good query plans with appropriate intra-node parallelism, in order to maximize CPU and network bandwidth utilization. A suboptimal plan, on the contrary, causes load imbalance in the pipelines and consequently degrades the query performance. Parallelism assignment optimization at compile time is nearly impossible, as the workload in each node is affected by numerous factors and is highly dynamic during query evaluation. To tackle this problem, we propose elastic pipelining, which makes it possible to optimize intra-node parallelism assignments in the pipelines based on the actual workload at runtime. This is achieved by adopting a new elastic iterator model and a fully optimized dynamic scheduler. The elastic iterator model extends the traditional iterator model with the ability to adjust multi-core execution dynamically, and the dynamic scheduler efficiently provisions CPU cores to query execution segments in the pipelines based on light-weight measurements of the operators. Extensive experiments on real and synthetic (TPC-H) data show that our proposal achieves almost full CPU utilization on typical decision-making analytical queries, outperforming state-of-the-art open-source systems by a huge margin.

【Keywords】: distributed query processing; in-memory database; multi-core; numa architecture; query processing

88. Page As You Go: Piecewise Columnar Access In SAP HANA.

Paper Link】 【Pages】:1295-1306

【Authors】: Reza Sherkat ; Colin Florendo ; Mihnea Andrei ; Anil K. Goel ; Anisoara Nica ; Peter Bumbulis ; Ivan Schreter ; Günter Radestock ; Christian Bensberg ; Daniel Booss ; Heiko Gerwens

【Abstract】: In-memory columnar databases such as SAP HANA achieve extreme performance by means of vector processing over logical units of main memory resident columns. The core in-memory algorithms can be challenged when the working set of an application does not fit into main memory. To deal with memory pressure, most in-memory columnar databases evict candidate columns (or tables) using a set of heuristics gleaned from recent workload. As an alternative approach, we propose to reduce the unit of load and eviction from column to a contiguous portion of the in-memory columnar representation, which we call a page. In this paper, we adapt the core algorithms to be able to operate with partially loaded columns while preserving the performance benefits of vector processing. Our approach has two key advantages. First, partial column loading reduces the mandatory memory footprint for each column, making more memory available for other purposes. Second, partial eviction extends the in-memory lifetime of a partially loaded column. We present a new in-memory columnar implementation for our approach, which we term the page loadable column. We design a new persistency layout and access algorithms for the encoded data vector of the column, the order-preserving dictionary, and the inverted index. We compare the performance attributes of page loadable columns with those of regular in-memory columns and present a use-case for page loadable columns for cold data in data aging scenarios. Page loadable columns are completely integrated in SAP HANA, and we present extensive experimental results that quantify the performance overhead and the resource consumption when these columns are deployed.

【Keywords】: data aging; in-memory columnar databases

89. Hybrid Garbage Collection for Multi-Version Concurrency Control in SAP HANA.

Paper Link】 【Pages】:1307-1318

【Authors】: Juchang Lee ; Hyungyu Shin ; Chang Gyoo Park ; Seongyun Ko ; Jaeyun Noh ; Yongjae Chuh ; Wolfgang Stephan ; Wook-Shin Han

【Abstract】: While multi-version concurrency control (MVCC) supports fast and robust performance in in-memory, relational databases, it has the potential problem of a growing number of versions over time due to obsolete versions. Although a few TB of main memory is available for enterprise machines, the memory resource should be used carefully for economic and practical reasons. Thus, in order to maintain the necessary number of versions in MVCC, versions which will no longer be used need to be deleted. This process is called garbage collection. MVCC uses the concept of visibility to define garbage. A set of versions for each record is first identified as candidates if their version timestamps are lower than the minimum value of snapshot timestamps of active snapshots in the system. All such candidates, except the one which has the maximum version timestamp, are safely reclaimed as garbage versions. In mixed OLTP and OLAP workloads, the typical garbage collector may not effectively reclaim record versions. In these workloads, OLTP applications generate a high volume of new versions, while long-lived queries or transactions in OLAP applications often block garbage collection, since we need to compare the version timestamp of each record version with the snapshot timestamp of the oldest, long-lived snapshot. Thus, these workloads typically cause the in-memory version space to grow. Additionally, the increasing version chains of records over time may also increase the traversal cost for them. In this paper, we present an efficient and effective garbage collector called HybridGC in SAP HANA. HybridGC integrates three novel concepts of garbage collection: timestamp-based group garbage collection, table garbage collection, and interval garbage collection. Through experiments using mixed OLTP and OLAP workloads, we show that HybridGC effectively and efficiently collects garbage versions with negligible overhead.

【Keywords】: SAP HANA; garbage collection; multi-version concurrency control
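
Below is a minimal model of the visibility rule the abstract describes: versions whose timestamps fall below the minimum active snapshot timestamp are candidates, and all candidates except the newest one are reclaimed. HybridGC's group, table, and interval collectors refine this rule; the sketch is only the textbook baseline, with hypothetical names.

```python
def collect_garbage(version_chain, active_snapshot_ts):
    """version_chain: list of (version_ts, value), in any order.
    Returns (kept_versions, reclaimed_versions) under the basic MVCC rule."""
    min_snapshot = min(active_snapshot_ts) if active_snapshot_ts else float("inf")
    chain = sorted(version_chain)                       # oldest first
    candidates = [v for v in chain if v[0] < min_snapshot]
    garbage = candidates[:-1]                           # keep the newest candidate
    kept = [v for v in chain if v not in garbage]
    return kept, garbage

chain = [(10, "a"), (20, "b"), (35, "c")]
print(collect_garbage(chain, active_snapshot_ts=[30, 50]))
# ([(20, 'b'), (35, 'c')], [(10, 'a')])
```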

90. UpBit: Scalable In-Memory Updatable Bitmap Indexing.

Paper Link】 【Pages】:1319-1332

【Authors】: Manos Athanassoulis ; Zheng Yan ; Stratos Idreos

【Abstract】: Bitmap indexes are widely used in both scientific and commercial databases. They bring fast read performance for specific types of queries, such as equality and selective range queries. A major drawback of bitmap indexes, however, is that supporting updates is particularly costly. Bitmap indexes are kept compressed to minimize storage footprint; as a result, updating a bitmap index requires the expensive step of decoding and then encoding a bitvector. Today, more and more applications need support for both reads and writes, blurring the boundaries between analytical processing and transaction processing. This requires new system designs and access methods that support general updates and, at the same time, offer competitive read performance. In this paper, we propose scalable in-memory Updatable Bitmap indexing (UpBit), which offers efficient updates, without hurting read performance. UpBit relies on two design points. First, in addition to the main bitvector for each domain value, UpBit maintains an update bitvector, to keep track of updated values. Effectively, every update can now be directed to a highly-compressible, easy-to-update bitvector. While update bitvectors double the amount of uncompressed data, they are sparse, and as a result their compressed size is small. Second, we introduce fence pointers in all update bitvectors which allow for efficient retrieval of a value at an arbitrary position. Using both synthetic and real-life data, we demonstrate that UpBit significantly outperforms state-of-the-art bitmap indexes for workloads that contain both reads and writes. In particular, compared to update-optimized bitmap index designs UpBit is 15-29x faster in terms of update time and 2.7x faster in terms of read performance. In addition, compared to read-optimized bitmap index designs UpBit achieves efficient and scalable updates (51-115x lower update latency), while allowing for comparable read performance, having up to 8% overhead.

【Keywords】: bitmap index; efficient updates; fence pointers; upbit; update bitvectors
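
The read/update path of the update-bitvector idea can be shown with plain Python integers standing in for (uncompressed) bitvectors: a read XORs the value bitvector with its update bitvector, and an update only flips bits in the update bitvector. Compression, fence pointers, and the periodic merge that UpBit relies on for performance are omitted, and all names are illustrative.

```python
class ToyUpdatableBitmap:
    """One value bitvector + one update bitvector per domain value.
    Bit i of get(v) is 1 iff row i currently holds value v."""
    def __init__(self, domain):
        self.value_bv = {v: 0 for v in domain}    # built at load time
        self.update_bv = {v: 0 for v in domain}   # sparse, cheap to update

    def load(self, rows):
        for i, v in enumerate(rows):
            self.value_bv[v] |= 1 << i

    def update(self, row, old_value, new_value):
        self.update_bv[old_value] ^= 1 << row     # flip row out of the old value
        self.update_bv[new_value] ^= 1 << row     # flip row into the new value

    def get(self, value):
        return self.value_bv[value] ^ self.update_bv[value]

bm = ToyUpdatableBitmap(domain=["red", "blue"])
bm.load(["red", "blue", "red"])
bm.update(row=2, old_value="red", new_value="blue")
print(bin(bm.get("red")), bin(bm.get("blue")))    # 0b1 0b110
```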

Session 15 - Interactive Analytics 6

91. FluxQuery: An Execution Framework for Highly Interactive Query Workloads.

Paper Link】 【Pages】:1333-1345

【Authors】: Roee Ebenstein ; Niranjan Kamat ; Arnab Nandi

【Abstract】: Modern computing devices and user interfaces have necessitated highly interactive querying. Some of these interfaces issue a large number of dynamically changing and continuous queries to the backend. In others, users expect to inspect results during the query formulation process, in order to guide or help them towards specifying a full-fledged query. Thus, users end up issuing a fast-changing workload to the underlying database. In such situations, the user's query intent can be thought of as being in flux. In this paper, we show that the traditional query execution engines are not well-suited for this new class of highly interactive workloads. We propose a novel model to interpret the variability of likely queries in a workload. We implemented a cyclic scan-based approach to process queries from such workloads in an efficient and practical manner while reducing the overall system load. We evaluate and compare our methods with traditional systems and demonstrate the scalability of our approach, enabling thousands of queries to run simultaneously within interactive response times given low memory and CPU requirements.

【Keywords】: cyclic join; cyclic scan; database querying; elevator scan; fast fluxjoin; fluxjoin; fluxquery; gestural querying; interactive databases; join; query intent; query intent model; shared scan; shared server

92. iOLAP: Managing Uncertainty for Efficient Incremental OLAP.

Paper Link】 【Pages】:1347-1361

【Authors】: Kai Zeng ; Sameer Agarwal ; Ion Stoica

【Abstract】: The size of data and the complexity of analytics continue to grow along with the need for timely and cost-effective analysis. However, the growth of computation power cannot keep up with the growth of data. This calls for a paradigm shift from traditional batch OLAP processing model to an incremental OLAP processing model. In this paper, we propose iOLAP, an incremental OLAP query engine that provides a smooth trade-off between query accuracy and latency, and fulfills a full spectrum of user requirements from approximate but timely query execution to a more traditional accurate query execution. iOLAP enables interactive incremental query processing using a novel mini-batch execution model---given an OLAP query, iOLAP first randomly partitions the input dataset into smaller sets (mini-batches) and then incrementally processes through these mini-batches by executing a delta update query on each mini-batch, where each subsequent delta update query computes an update based on the output of the previous one. The key idea behind iOLAP is a novel delta update algorithm that models delta processing as an uncertainty propagation problem, and minimizes the recomputation during each subsequent delta update by minimizing the uncertainties in the partial (including intermediate) query results. We implement iOLAP on top of Apache Spark and have successfully demonstrated it at scale on over 100 machines. Extensive experiments on a multitude of queries and datasets demonstrate that iOLAP can deliver approximate query answers for complex OLAP queries orders of magnitude faster than traditional OLAP engines, while continuously delivering updates every few seconds.

【Keywords】: OLAP; approximate query processing; bootstrap; incremental
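
A stripped-down view of mini-batch incremental processing for a single aggregate: the input is randomly partitioned, each delta step folds one mini-batch into the running state, and an approximate answer is emitted after every batch. iOLAP's uncertainty propagation and bootstrap-based error estimates are not modeled; the aggregate and batch count are arbitrary choices for illustration.

```python
import random

def incremental_avg(values, num_batches, seed=0):
    """Yield an estimate of AVG(values) after each mini-batch."""
    rng = random.Random(seed)
    shuffled = values[:]
    rng.shuffle(shuffled)                 # random partitioning into mini-batches
    batch_size = max(1, len(shuffled) // num_batches)
    total, count = 0.0, 0
    for i in range(0, len(shuffled), batch_size):
        batch = shuffled[i:i + batch_size]
        total += sum(batch)               # delta update of the running state
        count += len(batch)
        yield total / count               # current approximate answer

data = [float(x) for x in range(1, 101)]          # true AVG is 50.5
for step, estimate in enumerate(incremental_avg(data, num_batches=5), 1):
    print(f"after batch {step}: avg ~= {estimate:.1f}")
```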

93. Dynamic Prefetching of Data Tiles for Interactive Visualization.

Paper Link】 【Pages】:1363-1375

【Authors】: Leilani Battle ; Remco Chang ; Michael Stonebraker

【Abstract】: In this paper, we present ForeCache, a general-purpose tool for exploratory browsing of large datasets. ForeCache utilizes a client-server architecture, where the user interacts with a lightweight client-side interface to browse datasets, and the data to be browsed is retrieved from a DBMS running on a back-end server. We assume a detail-on-demand browsing paradigm, and optimize the back-end support for this paradigm by inserting a separate middleware layer in front of the DBMS. To improve response times, the middleware layer fetches data ahead of the user as she explores a dataset. We consider two different mechanisms for prefetching: (a) learning what to fetch from the user's recent movements, and (b) using data characteristics (e.g., histograms) to find data similar to what the user has viewed in the past. We incorporate these mechanisms into a single prediction engine that adjusts its prediction strategies over time, based on changes in the user's behavior. We evaluated our prediction engine with a user study, and found that our dynamic prefetching strategy provides: (1) significant improvements in overall latency when compared with non-prefetching systems (430% improvement); and (2) substantial improvements in both prediction accuracy (25% improvement) and latency (88% improvement) relative to existing prefetching techniques.

【Keywords】: array browsing; data exploration; predictive caching; visual exploration

94. Expressive Query Construction through Direct Manipulation of Nested Relational Results.

Paper Link】 【Pages】:1377-1392

【Authors】: Eirik Bakke ; David R. Karger

【Abstract】: Despite extensive research on visual query systems, the standard way to interact with relational databases remains to be through SQL queries and tailored form interfaces. We consider three requirements to be essential to a successful alternative: (1) query specification through direct manipulation of results, (2) the ability to view and modify any part of the current query without departing from the direct manipulation interface, and (3) SQL-like expressiveness. This paper presents the first visual query system to meet all three requirements in a single design. By directly manipulating nested relational results, and using spreadsheet idioms such as formulas and filters, the user can express a relationally complete set of query operators plus calculation, aggregation, outer joins, sorting, and nesting, while always remaining able to track and modify the state of the complete query. Our prototype gives the user an experience of responsive, incremental query building while pushing all actual query processing to the database layer. We evaluate our system with formative and controlled user studies on 28 spreadsheet users; the controlled study shows our system significantly outperforming Microsoft Access on the System Usability Scale.

【Keywords】: direct manipulation; hierarchical data models; nested relations; report generation; spreadsheet interfaces; user studies; visual query languages; visual query systems

95. Shasta: Interactive Reporting At Scale.

Paper Link】 【Pages】:1393-1404

【Authors】: Gokul Nath Babu Manoharan ; Stephan Ellner ; Karl Schnaitter ; Sridatta Chegu ; Alejandro Estrella-Balderrama ; Stephan Gudmundson ; Apurv Gupta ; Ben Handy ; Bart Samwel ; Chad Whipkey ; Larysa Aharkava ; Himani Apte ; Nitin Gangahar ; Jun Xu ; Shivakumar Venkataraman ; Divyakant Agrawal ; Jeffrey D. Ullman

【Abstract】: We describe Shasta, a middleware system built at Google to support interactive reporting in complex user-facing applications related to Google's Internet advertising business. Shasta targets applications with challenging requirements: First, user query latencies must be low. Second, underlying transactional data stores have complex "read-unfriendly" schemas, placing significant transformation logic between stored data and the read-only views that Shasta exposes to its clients. This transformation logic must be expressed in a way that scales to large and agile engineering teams. Finally, Shasta targets applications with strong data freshness requirements, making it challenging to precompute query results using common techniques such as ETL pipelines or materialized views. Instead, online queries must go all the way from primary storage to user-facing views, resulting in complex queries joining 50 or more tables. Designed as a layer on top of Google's F1 RDBMS and Mesa data warehouse, Shasta combines language and system techniques to meet these requirements. To help with expressing complex view specifications, we developed a query language called RVL, with support for modularized view templates that can be dynamically compiled into SQL. To execute these SQL queries with low latency at scale, we leveraged and extended F1's distributed query engine with facilities such as safe execution of C++ and Java UDFs. To reduce latency and increase read parallelism, we extended F1 storage with a distributed read-only in-memory cache. The system we describe is in production at Google, powering critical applications used by advertisers and internal sales teams. Shasta has significantly improved system scalability and software engineering efficiency compared to the middleware solutions it replaced.

【Keywords】: SQL generation; caching; heterogeneous data; middleware; user-defined functions

96. Datometry Hyper-Q: Bridging the Gap Between Real-Time and Historical Analytics.

Paper Link】 【Pages】:1405-1416

【Authors】: Lyublena Antova ; Rhonda Baldwin ; Derrick Bryant ; Tuan Cao ; Michael Duller ; John Eshleman ; Zhongxian Gu ; Entong Shen ; Mohamed A. Soliman ; F. Michael Waas

【Abstract】: Wall Street's trading engines are complex database applications written for time series databases like kdb+ that use the query language Q to perform real-time analysis. Extending the models to include other data sources, e.g., historic data, is critical for backtesting and compliance. However, Q applications cannot run directly on SQL databases. Therefore, financial institutions face the dilemma of either maintaining two separate application stacks, one written in Q and the other in SQL, which means increased IT cost and increased risk, or migrating all Q applications to SQL, which results in losing the inherent competitive advantage on Q real-time processing. Neither solution is desirable as both alternatives are costly, disruptive, and suboptimal. In this paper we present Hyper-Q, a data virtualization platform that overcomes the chasm. Hyper-Q enables Q applications to run natively on PostgreSQL-compatible databases by translating queries and results on the fly. We outline the basic concepts, detail specific difficulties, and demonstrate the viability of the approach with a case study.

【Keywords】: big data; data analytics; data virtualization; financial services; query processing

Session 16 - Streaming 2: Sketches 6

97. Time Adaptive Sketches (Ada-Sketches) for Summarizing Data Streams.

Paper Link】 【Pages】:1417-1432

【Authors】: Anshumali Shrivastava ; Arnd Christian König ; Mikhail Bilenko

【Abstract】: Obtaining frequency information of data streams, in limited space, is a well-recognized problem in literature. A number of recent practical applications (such as those in computational advertising) require temporally-aware solutions: obtaining historical count statistics for both time-points as well as time-ranges. In these scenarios, accuracy of estimates is typically more important for recent instances than for older ones; we call this desirable property Time Adaptiveness. With this observation, [20] introduced the Hokusai technique based on count-min sketches for estimating the frequency of any given item at any given time. The proposed approach is problematic in practice, as its memory requirements grow linearly with time, and it produces discontinuities in the estimation accuracy. In this work, we describe a new method, Time-adaptive Sketches, (Ada-sketch), that overcomes these limitations, while extending and providing a strict generalization of several popular sketching algorithms. The core idea of our method is inspired by the well-known digital Dolby noise reduction procedure that dates back to the 1960s. The theoretical analysis presented could be of independent interest in itself, as it provides clear results for the time-adaptive nature of the errors. An experimental evaluation on real streaming datasets demonstrates the superiority of the described method over Hokusai in estimating point and range queries over time. The method is simple to implement and offers a variety of design choices for future extensions. The simplicity of the procedure and the method's generalization of classic sketching techniques give hope for wide applicability of Ada-sketches in practice.

【Keywords】: approximate counting algorithms; big-data mining; count-min sketches; hashing; randomized algorithms; sketching; streaming
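
The pre-emphasis/de-emphasis mechanism can be sketched on top of an ordinary count-min sketch: increments at time t are scaled up by a growing weight f(t) before insertion, and point queries for time t scale the estimate back down by f(t), so older counts effectively receive less of the sketch's fixed accuracy budget. The growth factor, hashing scheme, and class name below are assumptions for illustration, not the paper's exact construction or analysis.

```python
import random

class ToyAdaSketch:
    """Count-min sketch with time-adaptive pre-emphasis weights f(t)."""
    def __init__(self, depth=4, width=2048, growth=1.05, seed=0):
        self.table = [[0.0] * width for _ in range(depth)]
        rng = random.Random(seed)
        self.salts = [rng.randrange(1 << 30) for _ in range(depth)]
        self.width, self.growth = width, growth

    def _f(self, t):
        return self.growth ** t            # pre-emphasis weight, grows with time

    def add(self, item, t, count=1):
        for row, salt in zip(self.table, self.salts):
            row[hash((item, t, salt)) % self.width] += count * self._f(t)

    def estimate(self, item, t):
        return min(row[hash((item, t, salt)) % self.width]
                   for row, salt in zip(self.table, self.salts)) / self._f(t)

sk = ToyAdaSketch()
for t in range(100):
    sk.add("query_x", t, count=10)
print(round(sk.estimate("query_x", t=99)))   # close to 10; recent times stay accurate
print(round(sk.estimate("query_x", t=0)))    # also ~10 here, but older times absorb
                                             # relatively more collision noise
```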

98. Streaming Algorithms for Robust Distinct Elements.

Paper Link】 【Pages】:1433-1447

【Authors】: Di Chen ; Qin Zhang

【Abstract】: We study the problem of estimating distinct elements in the data stream model, which has a central role in traffic monitoring, query optimization, data mining and data integration. Different from all previous work, we study the problem in the noisy data setting, where two different looking items in the stream may reference the same entity (determined by a distance function and a threshold value), and the goal is to estimate the number of distinct entities in the stream. In this paper, we formalize the problem of robust distinct elements, and develop space and time-efficient streaming algorithms for datasets in the Euclidean space, using a novel technique we call bucket sampling. We also extend our algorithmic framework to other metric spaces by establishing a connection between bucket sampling and the theory of locality sensitive hashing. Moreover, we formally prove that our algorithms are still effective under small distinct elements ambiguity. Our experiments demonstrate the practicality of our algorithms.

【Keywords】: distinct elements; noisy data; streaming algorithms

99. Augmented Sketch: Faster and More Accurate Stream Processing.

Paper Link】 【Pages】:1449-1463

【Authors】: Pratanu Roy ; Arijit Khan ; Gustavo Alonso

【Abstract】: Approximated algorithms are often used to estimate the frequency of items on high volume, fast data streams. The most common ones are variations of Count-Min sketch, which use sub-linear space for the count, but can produce errors in the counts of the most frequent items and can misclassify low-frequency items. In this paper, we improve the accuracy of sketch-based algorithms by increasing the frequency estimation accuracy of the most frequent items and reducing the possible misclassification of low-frequency items, while also improving the overall throughput. Our solution, called Augmented Sketch (ASketch), is based on a pre-filtering stage that dynamically identifies and aggregates the most frequent items. Items overflowing the pre-filtering stage are processed using a conventional sketch algorithm, thereby making the solution general and applicable in a wide range of contexts. The pre-filtering stage can be efficiently implemented with SIMD instructions on multi-core machines and can be further parallelized through pipeline parallelism where the filtering stage runs in one core and the sketch algorithm runs in another core.

【Keywords】: approximated algorithms; data streams; data structures; sketch; stream summary
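
The pre-filtering idea can be sketched as a small exact-count filter in front of an ordinary count-min sketch: frequent items are counted exactly in the filter, everything else falls through to the sketch, and an item whose sketch estimate overtakes the lightest filtered item is promoted. The admission/exchange rule below is a simplification, and ASketch's SIMD and pipelined implementation is not reproduced; all class names are hypothetical.

```python
import random

class ToyCountMin:
    def __init__(self, depth=4, width=1024, seed=0):
        rng = random.Random(seed)
        self.salts = [rng.randrange(1 << 30) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]
        self.width = width

    def add(self, item, count=1):
        for row, salt in zip(self.rows, self.salts):
            row[hash((item, salt)) % self.width] += count

    def estimate(self, item):
        return min(row[hash((item, salt)) % self.width]
                   for row, salt in zip(self.rows, self.salts))

class ToyAugmentedSketch:
    """A small exact filter for heavy hitters in front of a count-min sketch."""
    def __init__(self, filter_size=8):
        self.filter, self.filter_size = {}, filter_size
        self.sketch = ToyCountMin()

    def add(self, item, count=1):
        if item in self.filter:
            self.filter[item] += count
        elif len(self.filter) < self.filter_size:
            self.filter[item] = count
        else:
            self.sketch.add(item, count)
            lightest = min(self.filter, key=self.filter.get)
            if self.sketch.estimate(item) > self.filter[lightest]:
                # exchange: demote the lightest filtered item, promote this one
                self.sketch.add(lightest, self.filter.pop(lightest))
                self.filter[item] = self.sketch.estimate(item)

    def estimate(self, item):
        return self.filter.get(item, 0) or self.sketch.estimate(item)

agg = ToyAugmentedSketch()
for _ in range(1000):
    agg.add("hot_key")
for i in range(200):
    agg.add(f"cold_{i}")
print(agg.estimate("hot_key"))   # exact: 1000 (served from the filter)
```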

100. Matrix Sketching Over Sliding Windows.

Paper Link】 【Pages】:1465-1480

【Authors】: Zhewei Wei ; Xuancheng Liu ; Feifei Li ; Shuo Shang ; Xiaoyong Du ; Ji-Rong Wen

【Abstract】: Large-scale matrix computation has become essential for many data applications, and hence the problem of sketching a matrix with small space and high precision has received extensive study over the past few years. This problem is often considered in the row-update streaming model, where the data set is a matrix A ∈ R^(n x d), and the processor receives a row (1 x d) of A at each timestamp. The goal is to maintain a smaller matrix (termed the approximation matrix, or simply approximation) B ∈ R^(l x d) as an approximation to A, such that the covariance error ||A^T A - B^T B|| is small and l << n. This paper studies continuously tracking approximations to the matrix defined by a sliding window of the most recent rows. We consider both sequence-based and time-based windows. We show that maintaining A^T A exactly requires linear space in the sliding window model, as opposed to O(d^2) space in the streaming model. With this observation, we present three general frameworks for matrix sketching on sliding windows. The sampling techniques give random samples of the rows in the window according to their squared norms. The Logarithmic Method converts a mergeable streaming matrix sketch into a matrix sketch on time-based sliding windows. The Dyadic Interval framework converts an arbitrary streaming matrix sketch into a matrix sketch on sequence-based sliding windows. In addition to proving all algorithmic properties theoretically, we also conduct an extensive empirical study with real data sets to demonstrate the efficiency of these algorithms.

【Keywords】: matrix sketching; sliding window
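
The row-sampling primitive mentioned in the abstract can be illustrated offline with NumPy: sample rows with probability proportional to their squared norms and rescale them so that B^T B matches A^T A in expectation. The sliding-window bookkeeping (expiring old rows) and the Logarithmic and Dyadic Interval frameworks are exactly what this non-windowed sketch leaves out; sizes and names are arbitrary.

```python
import numpy as np

def norm_sampling_sketch(A, l, seed=0):
    """Sample l rows of A with probability proportional to their squared norms
    and rescale, so that B.T @ B approximates A.T @ A (exactly in expectation)."""
    rng = np.random.default_rng(seed)
    sq_norms = (A * A).sum(axis=1)
    probs = sq_norms / sq_norms.sum()
    idx = rng.choice(len(A), size=l, replace=True, p=probs)
    scale = 1.0 / np.sqrt(l * probs[idx])
    return A[idx] * scale[:, None]

rng = np.random.default_rng(1)
A = rng.standard_normal((5000, 20))
B = norm_sampling_sketch(A, l=500)
err = np.linalg.norm(A.T @ A - B.T @ B, 2)
print(err / np.linalg.norm(A.T @ A, 2))   # relative covariance error, well below 1
```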

101. Graph Stream Summarization: From Big Bang to Big Crunch.

Paper Link】 【Pages】:1481-1496

【Authors】: Nan Tang ; Qing Chen ; Prasenjit Mitra

【Abstract】: A graph stream, which refers to the graph with edges being updated sequentially in a form of a stream, has important applications in cyber security and social networks. Due to the sheer volume and highly dynamic nature of graph streams, the practical way of handling them is by summarization. Given a graph stream G, directed or undirected, the problem of graph stream summarization is to summarize G as SG with a much smaller (sublinear) space, linear construction time and constant maintenance cost for each edge update, such that SG allows many queries over G to be approximately conducted efficiently. The widely used practice of summarizing data streams is to treat each stream element independently by e.g., hash- or sample-based methods, without maintaining the connections (or relationships) between elements. Hence, existing methods can only solve ad-hoc problems, without supporting diversified and complicated analytics over graph streams. We present TCM, a novel generalized graph stream summary. Given an incoming edge, it summarizes both node and edge information in constant time. Consequently, the summary forms a graphical sketch where edges capture the connections inside elements, and nodes maintain relationships across elements. We discuss a wide range of supported queries and establish some error bounds. In addition, we experimentally show that TCM can effectively and efficiently support analytics over graph streams, which demonstrates its potential to start a new line of research and applications in graph stream management.

【Keywords】: data streams; graph streams; sketch; summarization
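
The graphical-sketch idea can be illustrated with a single hash function mapping node ids into a small m x m counter matrix; edge-weight queries then read one counter and never underestimate. TCM itself uses several independent sketches and takes the minimum across them to tighten the error bounds, which this toy omits; all names are illustrative.

```python
class ToyGraphSketch:
    """Summarize a stream of weighted edges (u, v, w) into an m x m counter
    matrix by hashing node ids; supports approximate edge-weight queries."""
    def __init__(self, m=64, salt=12345):
        self.m, self.salt = m, salt
        self.counters = [[0.0] * m for _ in range(m)]

    def _h(self, node):
        return hash((node, self.salt)) % self.m

    def add_edge(self, u, v, w=1.0):
        self.counters[self._h(u)][self._h(v)] += w

    def edge_weight(self, u, v):
        # Never underestimates: hash collisions can only add weight.
        return self.counters[self._h(u)][self._h(v)]

sk = ToyGraphSketch()
sk.add_edge("a", "b", 3.0)
sk.add_edge("a", "b", 2.0)
sk.add_edge("c", "d", 1.0)
print(sk.edge_weight("a", "b"))   # >= 5.0 (exactly 5.0 unless hashes collide)
```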

102. Scalable Approximate Query Tracking over Highly Distributed Data Streams.

Paper Link】 【Pages】:1497-1512

【Authors】: Nikos Giatrakos ; Antonios Deligiannakis ; Minos N. Garofalakis

【Abstract】: The recently-proposed Geometric Monitoring (GM) method has provided a general tool for the distributed monitoring of arbitrary non-linear queries over streaming data observed by a collection of remote sites, with numerous practical applications. Unfortunately, GM-based techniques can suffer from serious scalability issues with increasing numbers of remote sites. In this paper, we propose novel techniques that effectively tackle the aforementioned scalability problems by exploiting a carefully designed sample of the remote sites for efficient approximate query tracking. Our novel sampling-based scheme utilizes a sample of cardinality proportional to √N (compared to N for the original GM), where N is the number of sites in the network, to perform the monitoring process. Our experimental evaluation over a variety of real-life data streams demonstrates that our sampling-based techniques can significantly reduce the communication cost during distributed monitoring with controllable, predefined accuracy guarantees.

【Keywords】: data streams; distributed monitoring; sampling

Session 17 - Transaction Processing 6

103. A Hybrid B+-tree as Solution for In-Memory Indexing on CPU-GPU Heterogeneous Computing Platforms.

Paper Link】 【Pages】:1523-1538

【Authors】: Amirhesam Shahvarani ; Hans-Arno Jacobsen

【Abstract】: An in-memory indexing tree is a critical component of many databases. Modern many-core processors, such as GPUs, are offering tremendous amounts of computing power, making them an attractive choice for accelerating indexing. However, the memory available to the accelerating co-processor is rather limited and expensive in comparison to the memory available to the CPU. This drawback is a barrier to exploiting the computing power of co-processors for arbitrarily large index trees. In this paper, we propose a novel design for a B+-tree based on the heterogeneous computing platform and the hybrid memory architecture found in GPUs. We propose a hybrid CPU-GPU B+-tree, "HB+-tree," which targets high search throughput use cases. Unique to our design is the joint and simultaneous use of computing and memory resources of CPU-GPU systems. Our experiments show that our HB+-tree can perform up to 240 million index queries per second, which is 2.4X higher than our CPU-optimized solution.

【Keywords】: B+-tree; heterogeneous computing; in-memory database; indexing

104. Low-Overhead Asynchronous Checkpointing in Main-Memory Database Systems.

Paper Link】 【Pages】:1539-1551

【Authors】: Kun Ren ; Thaddeus Diamond ; Daniel J. Abadi ; Alexander Thomson

【Abstract】: As it becomes increasingly common for transaction processing systems to operate on datasets that fit within the main memory of a single machine or a cluster of commodity machines, traditional mechanisms for guaranteeing transaction durability---which typically involve synchronous log flushes---incur increasingly unappealing costs to otherwise lightweight transactions. Many applications have turned to periodically checkpointing full database state. However, existing checkpointing methods---even those which avoid freezing the storage layer---often come with significant costs to operation throughput, end-to-end latency, and total memory usage. This paper presents Checkpointing Asynchronously using Logical Consistency (CALC), a lightweight, asynchronous technique for capturing database snapshots that does not require a physical point of consistency to create a checkpoint, and avoids conspicuous latency spikes incurred by other database snapshotting schemes. Our experiments show that CALC can capture frequent checkpoints across a variety of transactional workloads with extremely small cost to transactional throughput and low additional memory usage compared to other state-of-the-art checkpointing systems.

【Keywords】: checkpointing; consistency; logging; main-memory; recovery; transaction processing

105. T-Part: Partitioning of Transactions for Forward-Pushing in Deterministic Database Systems.

Paper Link】 【Pages】:1553-1565

【Authors】: Shan-Hung Wu ; Tsai-Yu Feng ; Meng-Kai Liao ; Shao-Kan Pi ; Yu-Shan Lin

【Abstract】: Deterministic database systems have been shown to yield high throughput on a cluster of commodity machines while ensuring the strong consistency between replicas, provided that the data can be well-partitioned on these machines. However, data partitioning can be suboptimal for many reasons in real-world applications. In this paper, we present T-Part, a transaction execution engine that partitions transactions in a deterministic database system to deal with the unforeseeable workloads or workloads whose data are hard to partition. By modeling the dependency between transactions as a T-graph and continuously partitioning that graph, T-Part allows each transaction to know which later transactions on other machines will read its writes so that it can push forward the writes to those later transactions immediately after committing. This forward-pushing reduces the chance that the later transactions stall due to the unavailability of remote data. We implement a prototype for T-Part. Extensive experiments are conducted and the results demonstrate the effectiveness of T-Part.

【Keywords】: deterministic database systems; forward pushing; transaction partitioning; transaction processing

106. Reducing the Storage Overhead of Main-Memory OLTP Databases with Hybrid Indexes.

Paper Link】 【Pages】:1567-1581

【Authors】: Huanchen Zhang ; David G. Andersen ; Andrew Pavlo ; Michael Kaminsky ; Lin Ma ; Rui Shen

【Abstract】: Using indexes for query execution is crucial for achieving high performance in modern on-line transaction processing databases. For a main-memory database, however, these indexes consume a large fraction of the total memory available and are thus a major source of storage overhead of in-memory databases. To reduce this overhead, we propose using a two-stage index: The first stage ingests all incoming entries and is kept small for fast read and write operations. The index periodically migrates entries from the first stage to the second, which uses a more compact, read-optimized data structure. Our first contribution is hybrid index, a dual-stage index architecture that achieves both space efficiency and high performance. Our second contribution is Dual-Stage Transformation (DST), a set of guidelines for converting any order-preserving index structure into a hybrid index. Our third contribution is applying DST to four popular order-preserving index structures and evaluating them in both standalone microbenchmarks and a full in-memory DBMS using several transaction processing workloads. Our results show that hybrid indexes provide comparable throughput to the original ones while reducing the memory overhead by up to 70%.

【Keywords】: hybrid index; in-memory OLTP database
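
Illustrative sketch (not from the paper): to convey only the dual-stage idea, the toy Python index below absorbs inserts into a small first stage and periodically migrates entries into a compact, sorted second stage; lookups consult the first stage before the second. The class name, the migration trigger, and the data structures are assumptions, not the paper's DST guidelines.

```python
from bisect import bisect_left

class DualStageIndex:
    """Toy dual-stage index: a small mutable first stage plus a compact sorted second stage."""

    def __init__(self, migrate_threshold=4):
        self.stage1 = {}          # write-friendly stage that absorbs all inserts
        self.stage2_keys = []     # read-optimized stage, kept sorted and compact
        self.stage2_vals = []
        self.migrate_threshold = migrate_threshold

    def insert(self, key, value):
        self.stage1[key] = value
        if len(self.stage1) >= self.migrate_threshold:
            self._migrate()

    def _migrate(self):
        # periodically merge first-stage entries into the sorted second stage
        for key, value in sorted(self.stage1.items()):
            i = bisect_left(self.stage2_keys, key)
            if i < len(self.stage2_keys) and self.stage2_keys[i] == key:
                self.stage2_vals[i] = value
            else:
                self.stage2_keys.insert(i, key)
                self.stage2_vals.insert(i, value)
        self.stage1.clear()

    def lookup(self, key):
        if key in self.stage1:    # newest entries win
            return self.stage1[key]
        i = bisect_left(self.stage2_keys, key)
        if i < len(self.stage2_keys) and self.stage2_keys[i] == key:
            return self.stage2_vals[i]
        return None

idx = DualStageIndex()
for k in [5, 1, 9, 3, 7]:
    idx.insert(k, f"row-{k}")
print(idx.lookup(3), idx.lookup(7))  # row-3 row-7
```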

107. Design Principles for Scaling Multi-core OLTP Under High Contention.

Paper Link】 【Pages】:1583-1598

【Authors】: Kun Ren ; Jose M. Faleiro ; Daniel J. Abadi

【Abstract】: Although significant recent progress has been made in improving the multi-core scalability of high throughput transactional database systems, modern systems still fail to achieve scalable throughput for workloads involving frequent access to highly contended data. Most of this inability to achieve high throughput is explained by the fundamental constraints involved in guaranteeing ACID --- the addition of cores results in more concurrent transactions accessing the same contended data for which access must be serialized in order to guarantee isolation. Thus, linear scalability for contended workloads is impossible. However, there exist flaws in many modern architectures that exacerbate their poor scalability, and result in throughput that is much worse than fundamentally required by the workload. In this paper we identify two prevalent design principles that limit the multi-core scalability of many (but not all) transactional database systems on contended workloads: the multi-purpose nature of execution threads in these systems, and the lack of advanced planning of data access. We demonstrate the deleterious results of these design principles by implementing a prototype system, Orthrus, that is motivated by the principles of separation of database component functionality and advanced planning of transactions. We find that these two principles alone result in significantly improved scalability on high-contention workloads, and an order of magnitude increase in throughput for a non-trivial subset of these contended workloads.

【Keywords】: concurrency control; database architecture; multicore; scalability; transaction processing

108. DBSherlock: A Performance Diagnostic Tool for Transactional Databases.

Paper Link】 【Pages】:1599-1614

【Authors】: Dong Young Yoon ; Ning Niu ; Barzan Mozafari

【Abstract】: Running an online transaction processing (OLTP) system is one of the most daunting tasks required of database administrators (DBAs). As businesses rely on OLTP databases to support their mission-critical and real-time applications, poor database performance directly impacts their revenue and user experience. As a result, DBAs constantly monitor, diagnose, and rectify any performance decays. Unfortunately, the manual process of debugging and diagnosing OLTP performance problems is extremely tedious and non-trivial. Rather than being caused by a single slow query, performance problems in OLTP databases are often due to a large number of concurrent and competing transactions adding up to compounded, non-linear effects that are difficult to isolate. Sudden changes in request volume, transactional patterns, network traffic, or data distribution can cause previously abundant resources to become scarce, and the performance to plummet. This paper presents a practical tool for assisting DBAs in quickly and reliably diagnosing performance problems in an OLTP database. By analyzing hundreds of statistics and configurations collected over the lifetime of the system, our algorithm quickly identifies a small set of potential causes and presents them to the DBA. The root-cause established by the DBA is reincorporated into our algorithm as a new causal model to improve future diagnoses. Our experiments show that this algorithm is substantially more accurate than the state-of-the-art algorithm in finding correct explanations.

【Keywords】: OLTP; anomaly detection; performance diagnosis; transactions
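
Illustrative sketch (not from the paper): DBSherlock's causal-model machinery is far richer, but the flavor of "analyze many collected statistics and surface a few likely causes" can be hinted at with the toy ranking below, which scores each metric by how far its values during a DBA-marked anomalous interval deviate from its normal behavior. The separation score, the data layout, and the metric names are assumptions.

```python
from statistics import mean, pstdev

def rank_suspect_metrics(history, anomaly_interval):
    """history: {metric name: [value per time tick]}; anomaly_interval: (start, end) ticks.
    Rank metrics by how strongly their anomalous values deviate from normal behaviour."""
    start, end = anomaly_interval
    scores = {}
    for metric, values in history.items():
        normal = values[:start] + values[end:]
        abnormal = values[start:end]
        spread = pstdev(normal) or 1e-9   # avoid division by zero for flat metrics
        scores[metric] = abs(mean(abnormal) - mean(normal)) / spread
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

history = {
    "lock_waits":   [2, 3, 2, 3, 40, 45, 42, 3, 2],
    "cpu_user_pct": [55, 60, 58, 57, 59, 61, 60, 58, 56],
}
print(rank_suspect_metrics(history, anomaly_interval=(4, 7)))  # lock_waits ranked first
```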

Session 18 - Transactions and Consistency 6

109. TARDiS: A Branch-and-Merge Approach To Weak Consistency.

Paper Link】 【Pages】:1615-1628

【Authors】: Natacha Crooks ; Youer Pu ; Nancy Estrada ; Trinabh Gupta ; Lorenzo Alvisi ; Allen Clement

【Abstract】: This paper presents the design, implementation, and evaluation of TARDiS (Transactional Asynchronously Replicated Divergent Store), a transactional key-value store explicitly designed for weakly-consistent systems. Reasoning about these systems is hard, as neither causal consistency nor per-object eventual convergence allow applications to deal satisfactorily with write-write conflicts. TARDiS instead exposes as its fundamental abstraction the set of conflicting branches that arise in weakly-consistent systems. To this end, TARDiS introduces a new concurrency control mechanism: branch-on-conflict. On the one hand, TARDiS guarantees that storage will appear sequential to any thread of execution that extends a branch, keeping application logic simple. On the other, TARDiS provides applications, when needed, with the tools and context necessary to merge branches atomically, when and how applications want. Since branch-on-conflict in TARDiS is fast, weakly-consistent applications can benefit from adopting this paradigm not only for operations issued by different sites, but also, when appropriate, for conflicting local operations. We find that TARDiS reduces coding complexity for these applications and that judicious branch-on-conflict can improve their local throughput at each site by two to eight times.

【Keywords】: causal consistency; databases; distributed systems; eventual consistency; geo-distribution; merging; replication; transactions; weak consistency

110. TicToc: Time Traveling Optimistic Concurrency Control.

Paper Link】 【Pages】:1629-1642

【Authors】: Xiangyao Yu ; Andrew Pavlo ; Daniel Sanchez ; Srinivas Devadas

【Abstract】: Concurrency control for on-line transaction processing (OLTP) database management systems (DBMSs) is a nasty game. Achieving higher performance on emerging many-core systems is difficult. Previous research has shown that timestamp management is the key scalability bottleneck in concurrency control algorithms. This prevents the system from scaling to large numbers of cores. In this paper we present TicToc, a new optimistic concurrency control algorithm that avoids the scalability and concurrency bottlenecks of prior T/O schemes. TicToc relies on a novel and provably correct data-driven timestamp management protocol. Instead of assigning timestamps to transactions, this protocol assigns read and write timestamps to data items and uses them to lazily compute a valid commit timestamp for each transaction. TicToc removes the need for centralized timestamp allocation, and commits transactions that would be aborted by conventional T/O schemes. We implemented TicToc along with four other concurrency control algorithms in an in-memory, shared-everything OLTP DBMS and compared their performance on different workloads. Our results show that TicToc achieves up to 92% better throughput while reducing the abort rate by 3.3x over these previous algorithms.

【Keywords】: DBMS; concurrency control; tictoc; timestamp
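
Illustrative sketch (not from the paper): per the abstract, TicToc attaches read and write timestamps to data items and lazily derives a commit timestamp per transaction instead of assigning timestamps up front. The toy Python fragment below shows one way such a commit timestamp could be computed from the accessed items; validation, timestamp extension, locking, and the correctness argument are omitted, and the exact rules shown here are a simplification.

```python
class DataItem:
    """Each data item carries a write timestamp (wts) and a read timestamp (rts)."""

    def __init__(self, value, wts=0, rts=0):
        self.value = value
        self.wts = wts   # logical time at which the current version was written
        self.rts = rts   # latest logical time at which this version is known valid

def compute_commit_ts(read_set, write_set):
    """Lazily derive a commit timestamp from the accessed items: reads must commit
    no earlier than the version they saw, and writes must commit after every
    read of the version they overwrite."""
    commit_ts = 0
    for item in read_set:
        commit_ts = max(commit_ts, item.wts)
    for item in write_set:
        commit_ts = max(commit_ts, item.rts + 1)
    return commit_ts

x = DataItem("x0", wts=3, rts=5)
y = DataItem("y0", wts=2, rts=2)
print(compute_commit_ts(read_set=[x], write_set=[y]))  # 3
```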

111. Scaling Multicore Databases via Constrained Parallel Execution.

Paper Link】 【Pages】:1643-1658

【Authors】: Zhaoguo Wang ; Shuai Mu ; Yang Cui ; Han Yi ; Haibo Chen ; Jinyang Li

【Abstract】: Multicore in-memory databases often rely on traditional concurrency control schemes such as two-phase locking (2PL) or optimistic concurrency control (OCC). Unfortunately, when the workload exhibits a non-trivial amount of contention, both 2PL and OCC sacrifice much parallel execution opportunity. In this paper, we describe a new concurrency control scheme, interleaving constrained concurrency control (IC3), which provides serializability while allowing for parallel execution of certain conflicting transactions. IC3 combines the static analysis of the transaction workload with runtime techniques that track and enforce dependencies among concurrent transactions. The use of static analysis simplifies IC3's runtime design, allowing it to scale to many cores. Evaluations on a 64-core machine using the TPC-C benchmark show that IC3 outperforms traditional concurrency control schemes under contention. It achieves a throughput of 434K transactions/sec on the TPC-C benchmark configured with only one warehouse. It also scales better than several recent concurrency control schemes that also target contended workloads.

【Keywords】: concurrency control; in-memory database; multicore; transaction processing

112. Towards a Non-2PC Transaction Management in Distributed Database Systems.

Paper Link】 【Pages】:1659-1674

【Authors】: Qian Lin ; Pengfei Chang ; Gang Chen ; Beng Chin Ooi ; Kian-Lee Tan ; Zhengkui Wang

【Abstract】: Shared-nothing architecture has been widely used in distributed databases to achieve good scalability. While it offers superior performance for local transactions, the overhead of processing distributed transactions can degrade the system performance significantly. The key contributor to the degradation is the expensive two-phase commit (2PC) protocol used to ensure atomic commitment of distributed transactions. In this paper, we propose a transaction management scheme called LEAP to avoid the 2PC protocol within distributed transaction processing. Instead of processing a distributed transaction across multiple nodes, LEAP converts the distributed transaction into a local transaction. This benefits the processing locality and facilitates adaptive data repartitioning when there is a change in the data access pattern. Based on LEAP, we develop an online transaction processing (OLTP) system, L-Store, and compare it with the state-of-the-art distributed in-memory OLTP system, H-Store, which relies on the 2PC protocol for distributed transaction processing, and H^L-Store, an H-Store variant that has been modified to make use of LEAP. Results of an extensive experimental evaluation show that our LEAP-based engines are superior to H-Store by a wide margin, especially for workloads that exhibit locality-based data accesses.

【Keywords】: 2PC; OLTP; distributed database; transaction management

113. ERMIA: Fast Memory-Optimized Database System for Heterogeneous Workloads.

Paper Link】 【Pages】:1675-1687

【Authors】: Kangnyeon Kim ; Tianzheng Wang ; Ryan Johnson ; Ippokratis Pandis

【Abstract】: Large main memories and massively parallel processors have triggered not only a resurgence of high-performance transaction processing systems optimized for large main-memory and massively parallel processors, but also an increasing demand for processing heterogeneous workloads that include read-mostly transactions. Many modern transaction processing systems adopt a lightweight optimistic concurrency control (OCC) scheme to leverage its low overhead in low contention workloads. However, we observe that the lightweight OCC is not suitable for heterogeneous workloads, causing significant starvation of read-mostly transactions and overall performance degradation. In this paper, we present ERMIA, a memory-optimized database system built from scratch to cater to the need of handling heterogeneous workloads. ERMIA adopts snapshot isolation concurrency control to coordinate heterogeneous transactions and provides serializability when desired. Its physical layer supports the concurrency control schemes in a scalable way. Experimental results show that ERMIA delivers comparable or superior performance and near-linear scalability in a variety of workloads, compared to a recent lightweight OCC-based system. At the same time, ERMIA maintains high throughput on read-mostly transactions when the performance of the OCC-based system drops by orders of magnitude.

【Keywords】: append-only storage; epoch-based resource management; heterogeneous workloads; indirection arrays; log manager; main-memory databases; multicore scalability; multiversion concurrency control; serial safety net

114. Transaction Healing: Scaling Optimistic Concurrency Control on Multicores.

Paper Link】 【Pages】:1689-1704

【Authors】: Yingjun Wu ; Chee Yong Chan ; Kian-Lee Tan

【Abstract】: Today's main-memory databases can support very high transaction rates for OLTP applications. However, when a large number of concurrent transactions contend on the same data records, the system performance can deteriorate significantly. This is especially the case when scaling transaction processing with optimistic concurrency control (OCC) on multicore machines. In this paper, we propose a new concurrency-control mechanism, called transaction healing, that exploits program semantics to scale the conventional OCC towards dozens of cores even under highly contended workloads. Transaction healing captures the dependencies across operations within a transaction prior to its execution. Instead of blindly rejecting a transaction once its validation fails, the proposed mechanism judiciously restores any non-serializable operation and heals inconsistent transaction states as well as query results according to the extracted dependencies. Transaction healing can partially update the membership of read/write sets when processing dependent transactions. Such overhead, however, is largely reduced by carefully avoiding false aborts and rearranging validation orders. We implemented the idea of transaction healing in TheDB, a main-memory database prototype that provides full ACID guarantees with a scalable commit protocol. By evaluating TheDB on a 48-core machine with two widely-used benchmarks, we confirm that transaction healing can scale near-linearly, yielding significantly higher transaction rates than the state-of-the-art OCC implementations.

【Keywords】: multicore database; transaction processing

Session 19 - Query Optimization 6

115. Enabling Incremental Query Re-Optimization.

Paper Link】 【Pages】:1705-1720

【Authors】: Mengmeng Liu ; Zachary G. Ives ; Boon Thau Loo

【Abstract】: As declarative query processing techniques expand to the Web, data streams, network routers, and cloud platforms, there is an increasing need to re-plan execution in the presence of unanticipated performance changes. New runtime information may affect which query plan we prefer to run. Adaptive techniques require innovation both in terms of the algorithms used to estimate costs, and in terms of the search algorithm that finds the best plan. We investigate how to build a cost-based optimizer that recomputes the optimal plan incrementally given new cost information, much as a stream engine constantly updates its outputs given new data. Our implementation especially shows benefits for stream processing workloads. It lays the foundations upon which a variety of novel adaptive optimization algorithms can be built. We start by leveraging the recently proposed approach of formulating query plan enumeration as a set of recursive datalog queries; we develop a variety of novel optimization approaches to ensure effective pruning in both static and incremental cases. We further show that the lessons learned in the declarative implementation can be equally applied to more traditional optimizer implementations.

【Keywords】: incremental query re-optimization

116. Sampling-Based Query Re-Optimization.

Paper Link】 【Pages】:1721-1736

【Authors】: Wentao Wu ; Jeffrey F. Naughton ; Harneet Singh

【Abstract】: Despite decades of work, query optimizers still make mistakes on "difficult" queries because of bad cardinality estimates, often due to the interaction of multiple predicates and correlations in the data. In this paper, we propose a low-cost post-processing step that can take a plan produced by the optimizer, detect when it is likely to have made such a mistake, and take steps to fix it. Specifically, our solution is a sampling-based iterative procedure that requires almost no changes to the original query optimizer or query evaluation mechanism of the system. We show that this indeed imposes low overhead and catches cases where three widely used optimizers (PostgreSQL and two commercial systems) make large errors.

【Keywords】: query optimization; sampling; selectivity estimation
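
Illustrative sketch (not from the paper): the entry does not spell out its estimator, but the general idea of sampling-based cardinality (re-)estimation can be illustrated by evaluating a predicate on a uniform row sample and scaling up the hit fraction, as in the toy Python function below; the sampling rate, the names, and the absence of the iterative re-optimization loop are assumptions.

```python
import random

def estimate_cardinality(table_rows, predicate, sample_rate=0.01, seed=7):
    """Estimate |sigma_predicate(table)| by scaling up the hit fraction on a uniform sample."""
    rng = random.Random(seed)
    sample = [row for row in table_rows if rng.random() < sample_rate]
    if not sample:
        return 0
    hit_fraction = sum(1 for row in sample if predicate(row)) / len(sample)
    return round(hit_fraction * len(table_rows))

rows = [{"age": i} for i in range(100_000)]
est = estimate_cardinality(rows, lambda r: 20_000 <= r["age"] < 30_000)
print(est)  # close to the true cardinality of 10,000
```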

117. A Fast Randomized Algorithm for Multi-Objective Query Optimization.

Paper Link】 【Pages】:1737-1752

【Authors】: Immanuel Trummer ; Christoph Koch

【Abstract】: Query plans are compared according to multiple cost metrics in multi-objective query optimization. The goal is to find the set of Pareto plans realizing optimal cost tradeoffs for a given query. So far, only algorithms with exponential complexity in the number of query tables have been proposed for multi-objective query optimization. In this work, we present the first algorithm with polynomial complexity in the query size. Our algorithm is randomized and iterative. It improves query plans via a multi-objective version of hill climbing that applies multiple transformations in each climbing step for maximal efficiency. Based on a locally optimal plan, we approximate the Pareto plan set within the restricted space of plans with similar join orders. We maintain a cache of Pareto-optimal plans for each potentially useful intermediate result to share partial plans that were discovered in different iterations. We show that each iteration of our algorithm performs in expected polynomial time based on an analysis of the expected path length between a random plan and local optima reached by hill climbing. We experimentally show that our algorithm can optimize queries with hundreds of tables and outperforms other randomized algorithms such as the NSGA-II genetic algorithm over a wide range of scenarios.

【Keywords】: multi-objective; query optimization; randomized algorithms
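
Illustrative sketch (not from the paper): the central notion above is the set of Pareto plans under multiple cost metrics. The short Python fragment below shows Pareto dominance and the filtering of candidate plans with cost vectors; the plan representation and the cost metrics are assumptions, and the paper's randomized hill-climbing search is not reproduced.

```python
def dominates(cost_a, cost_b):
    """cost_a dominates cost_b if it is no worse in every metric and better in at least one."""
    return (all(a <= b for a, b in zip(cost_a, cost_b))
            and any(a < b for a, b in zip(cost_a, cost_b)))

def pareto_frontier(plans):
    """Keep only plans whose cost vector is not dominated by any other plan."""
    return [p for p in plans
            if not any(dominates(q["cost"], p["cost"]) for q in plans if q is not p)]

candidates = [
    {"plan": "hash join first",   "cost": (120.0, 4.0)},  # (runtime, monetary cost)
    {"plan": "index nested loop", "cost": (90.0, 7.0)},
    {"plan": "sort-merge join",   "cost": (150.0, 9.0)},  # dominated by both others
]
print([p["plan"] for p in pareto_frontier(candidates)])
```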

118. Operator and Query Progress Estimation in Microsoft SQL Server Live Query Statistics.

Paper Link】 【Pages】:1753-1764

【Authors】: Kukjin Lee ; Arnd Christian König ; Vivek R. Narasayya ; Bolin Ding ; Surajit Chaudhuri ; Brent Ellwein ; Alexey Eksarevskiy ; Manbeen Kohli ; Jacob Wyant ; Praneeta Prakash ; Rimma V. Nehme ; Jiexing Li ; Jeffrey F. Naughton

【Abstract】: We describe the design and implementation of the new Live Query Statistics (LQS) feature in Microsoft SQL Server 2016. The functionality includes the display of overall query progress as well as progress of individual operators in the query execution plan. We describe the overall functionality of LQS, give usage examples and detail all areas where we had to extend the current state-of-the-art to build the complete LQS feature. Finally, we evaluate the effect these extensions have on progress estimation accuracy with a series of experiments using a large set of synthetic and real workloads.

【Keywords】: database administration; query progress estimation

119. Optimization of Nested Queries using the NF2 Algebra.

Paper Link】 【Pages】:1765-1780

【Authors】: Jürgen Hölsch ; Michael Grossniklaus ; Marc H. Scholl

【Abstract】: A key promise of SQL is that the optimizer will find the most efficient execution plan, regardless of how the query is formulated. In general, query optimizers of modern database systems are able to keep this promise, with the notable exception of nested queries. While several optimization techniques for nested queries have been proposed, their adoption in practice has been limited. In this paper, we argue that the NF2 (non-first normal form) algebra, which was originally designed to process nested tables, is a better approach to nested query optimization as it fulfills two key requirements. First, the NF2 algebra can represent all types of nested queries as well as both existing and novel optimization techniques based on its equivalences. Second, performance benefits can be achieved with little changes to existing transformation-based query optimizers as the NF2 algebra is an extension of the relational algebra.

【Keywords】: NF2 algebra; SQL; nested queries; non-first normal form; performance

120. Extracting Equivalent SQL from Imperative Code in Database Applications.

Paper Link】 【Pages】:1781-1796

【Authors】: K. Venkatesh Emani ; Karthik Ramachandra ; Subhro Bhattacharya ; S. Sudarshan

【Abstract】: Optimizing the performance of database applications is an area of practical importance, and has received significant attention in recent years. In this paper we present an approach to this problem which is based on extracting a concise algebraic representation of (parts of) an application, which may include imperative code as well as SQL queries. The algebraic representation can then be translated into SQL to improve application performance, by reducing the volume of data transferred, as well as reducing latency by minimizing the number of network round trips. Our techniques can be used for performing optimizations of database applications that techniques proposed earlier cannot perform. The algebraic representations can also be used for other purposes such as extracting equivalent queries for keyword search on form results. Our experiments indicate that the techniques we present are widely applicable to real world database applications, in terms of successfully extracting algebraic representations of application behavior, as well as in terms of providing performance benefits when used for optimization.

【Keywords】: SQL; data access optimization; hibernate; program analysis

Session 20 - Graphs 3: Potpourri 6

121. Generating Preview Tables for Entity Graphs.

Paper Link】 【Pages】:1797-1811

【Authors】: Ning Yan ; Sona Hasani ; Abolfazl Asudeh ; Chengkai Li

【Abstract】: Users are tapping into massive, heterogeneous entity graphs for many applications. It is challenging to select entity graphs for a particular need, given abundant datasets from many sources and the oftentimes scarce information for them. We propose methods to produce preview tables for compact presentation of important entity types and relationships in entity graphs. The preview tables assist users in attaining a quick and rough preview of the data. They can be shown in a limited display space for a user to browse and explore, before she decides to spend time and resources to fetch and investigate the complete dataset. We formulate several optimization problems that look for previews with the highest scores according to intuitive goodness measures, under various constraints on preview size and distance between preview tables. The optimization problem under distance constraint is NP-hard. We design a dynamic-programming algorithm and an Apriori-style algorithm for finding optimal previews. Results from experiments, comparison with related work and user studies demonstrated the scoring measures' accuracy and the discovery algorithms' efficiency.

【Keywords】: data exploration; entity graph; knowledge graph; schema summarization

122. Speedup Graph Processing by Graph Ordering.

Paper Link】 【Pages】:1813-1828

【Authors】: Hao Wei ; Jeffrey Xu Yu ; Can Lu ; Xuemin Lin

【Abstract】: CPU cache performance is one of the key issues for efficiency in database systems. It is reported that cache miss latency takes up half of the execution time in database systems. To improve CPU cache performance, there are studies on supporting search, including cache-oblivious and cache-conscious trees. In this paper, we focus on CPU speedup for graph computing in general by reducing the CPU cache miss ratio for different graph algorithms. The approaches dealing with trees are not applicable to graphs, which are complex in nature. In this paper, we explore a general approach to speed up CPU computing, in order to further enhance the efficiency of the graph algorithms without changing the graph algorithms (implementations) and the data structures used. That is, we aim at designing a general solution that is not for a specific graph algorithm, nor for a specific data structure. The approach studied in this work is graph ordering, which is to find the optimal permutation among all nodes in a given graph by keeping nodes that will be frequently accessed together locally, to minimize the CPU cache miss ratio. We prove the graph ordering problem is NP-hard, and give a basic algorithm with a bounded approximation. To improve the time complexity of the basic algorithm, we further propose a new algorithm to reduce the time complexity and improve the efficiency with new optimization techniques based on a new data structure. We conducted extensive experiments to evaluate our approach in comparison with 9 other possible graph orderings (such as the one obtained by METIS) using 8 large real graphs and 9 representative graph algorithms. We confirm that our approach can achieve high performance by reducing the CPU cache miss ratios.

【Keywords】: CPU performance; graph algorithms; graph ordering
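
Illustrative sketch (not from the paper): graph ordering, as described above, seeks a node permutation that keeps frequently co-accessed nodes close together. The toy Python code below scores an ordering by how many edges fall within a small positional window and builds one greedily; both the scoring function and the greedy strategy are simplifications of mine, not the paper's approximation algorithm.

```python
def locality_score(order, edges, window=4):
    """Number of edges whose endpoints end up within `window` positions of each other."""
    pos = {node: i for i, node in enumerate(order)}
    return sum(1 for u, v in edges if abs(pos[u] - pos[v]) <= window)

def greedy_order(nodes, edges, window=4):
    """Greedily append the node that is best connected to the last `window` placed nodes."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    order, remaining = [], set(nodes)
    while remaining:
        recent = set(order[-window:])
        order.append(max(remaining, key=lambda n: len(adj[n] & recent)))
        remaining.remove(order[-1])
    return order

nodes = list(range(6))
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
ordering = greedy_order(nodes, edges, window=2)
print(ordering, locality_score(ordering, edges, window=2))
```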

123. ROLL: Fast In-Memory Generation of Gigantic Scale-free Networks.

Paper Link】 【Pages】:1829-1842

【Authors】: Ali Hadian ; Sadegh Nobari ; Behrouz Minaei-Bidgoli ; Qiang Qu

【Abstract】: Real-world graphs are not always publicly available or sometimes do not meet specific research requirements. These challenges call for generating synthetic networks that follow properties of the real-world networks. Barabási-Albert (BA) is a well-known model for generating scale-free graphs, i.e., graphs with a power-law degree distribution. In the BA model, the network is generated through an iterative stochastic process called preferential attachment. Although BA is in high demand, due to the inherent complexity of preferential attachment, this model cannot be scaled to generate billion-node graphs. In this paper, we propose ROLL-tree, a fast in-memory roulette wheel data structure that accelerates the BA network generation process by exploiting the statistical behaviors of the underlying growth model. Our proposed method has the following properties: (a) Fast: it performs more than 1,000 times faster than the state-of-the-art on a single-node PC; (b) Exact: it strictly follows the BA model, using an efficient data structure instead of approximation techniques; (c) Generalizable: it can be adapted for other "rich-get-richer" stochastic growth models. Our extensive experiments prove that ROLL-tree can effectively accelerate graph generation through the preferential attachment process. On a commodity single processor machine, for example, ROLL-tree generates a scale-free graph of 1.1 billion nodes and 6.6 billion edges (the size of Yahoo's Webgraph) in 62 minutes while the state-of-the-art (SA) takes about four years on the same machine.

【Keywords】: barabási-albert; efficient; exact; fast; generalizable; gigantic; graph; in-memory; networks; power-law; preferential attachment; random; real-world; roll; roulette wheel; scale-free; synthetic
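
Illustrative sketch (not from the paper): ROLL-tree accelerates the roulette-wheel selection step of Barabási-Albert preferential attachment; the naive baseline it speeds up looks roughly like the Python generator below, where each endpoint occurrence acts as one roulette-wheel slot. This is the textbook BA process for illustration only, not the ROLL-tree data structure, and the seed-clique initialization is an assumption.

```python
import random

def barabasi_albert(n, m, seed=1):
    """Naive BA generator: each new node attaches to m distinct existing nodes chosen
    with probability proportional to their current degree (roulette-wheel selection)."""
    rng = random.Random(seed)
    edges = [(u, v) for u in range(m) for v in range(u + 1, m)]  # small seed clique
    # roulette wheel: every endpoint occurrence is one slot, so a node of degree d
    # occupies d slots and is drawn with probability d / (2 * |edges|)
    slots = [node for edge in edges for node in edge]
    for new_node in range(m, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(slots))
        for t in targets:
            edges.append((new_node, t))
            slots.extend((new_node, t))
    return edges

graph = barabasi_albert(n=1_000, m=3)
print(len(graph))  # about 3,000 edges, with a heavy-tailed degree distribution
```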

124. Functional Dependencies for Graphs.

Paper Link】 【Pages】:1843-1857

【Authors】: Wenfei Fan ; Yinghui Wu ; Jingbo Xu

【Abstract】: We propose a class of functional dependencies for graphs, referred to as GFDs. GFDs capture both attribute-value dependencies and topological structures of entities, and subsume conditional functional dependencies (CFDs) as a special case. We show that the satisfiability and implication problems for GFDs are coNP-complete and NP-complete, respectively, no worse than their CFD counterparts. We also show that the validation problem for GFDs is coNP-complete. Despite the intractability, we develop parallel scalable algorithms for catching violations of GFDs in large-scale graphs. Using real-life and synthetic data, we experimentally verify that GFDs provide an effective approach to detecting inconsistencies in knowledge and social graphs.

【Keywords】: functional dependencies; graphs; implication; satisfiability; validation

125. SLING: A Near-Optimal Index Structure for SimRank.

Paper Link】 【Pages】:1859-1874

【Authors】: Boyu Tian ; Xiaokui Xiao

【Abstract】: SimRank is a similarity measure for graph nodes that has numerous applications in practice. Scalable SimRank computation has been the subject of extensive research for more than a decade, and yet, none of the existing solutions can efficiently derive SimRank scores on large graphs with provable accuracy guarantees. In particular, the state-of-the-art solution requires up to a few seconds to compute a SimRank score in million-node graphs, and does not offer any worst-case assurance in terms of the query error. This paper presents SLING, an efficient index structure for SimRank computation. SLING guarantees that each SimRank score returned has at most ε additive error, and it answers any single-pair and single-source SimRank queries in O(1/ε) and O(n/ε) time, respectively. These time complexities are near-optimal, and are significantly better than the asymptotic bounds of the most recent approach. Furthermore, SLING requires only O(n/ε) space (which is also near-optimal in an asymptotic sense) and O(m/ε + n log n/δ/ε2) pre-computation time, where δ is the failure probability of the preprocessing algorithm. We experimentally evaluate SLING with a variety of real-world graphs with up to several millions of nodes. Our results demonstrate that SLING is up to 10000 times (resp. 110 times) faster than competing methods for single-pair (resp. single-source) SimRank queries, at the cost of higher space overheads.

【Keywords】: SimRank; approximate algorithm; graphs; indexing; query processing

126. Query Planning for Evaluating SPARQL Property Paths.

Paper Link】 【Pages】:1875-1889

【Authors】: Nikolay Yakovets ; Parke Godfrey ; Jarek Gryz

【Abstract】: The extension of SPARQL in version 1.1 with property paths offers a type of regular path query for RDF graph databases. Such queries are difficult to optimize and evaluate efficiently, however. We have embarked on a project, Waveguide, to build a cost-based optimizer for SPARQL queries with property paths. Waveguide builds a query plan---which we call a waveguide plan (WGP)---which guides the query evaluation. There are numerous choices in the construction of a plan, and a number of optimization methods, so the space of plans for a query can be quite large. Execution costs of plans for the same query can vary by orders of magnitude. A WGP's costs can be estimated, which opens the way to cost-based optimization. We demonstrate that the plan space of Waveguide properly subsumes existing techniques and that the new plans it adds are relevant.

【Keywords】: property paths; query optimization; rdf; regular path queries; sparql

Session 21 - Hardware Acceleration and Query Compilation 6

127. Robust Query Processing in Co-Processor-accelerated Databases.

Paper Link】 【Pages】:1891-1906

【Authors】: Sebastian Breß ; Henning Funke ; Jens Teubner

【Abstract】: Technology limitations are making the use of heterogeneous computing devices much more than an academic curiosity. In fact, the use of such devices is widely acknowledged to be the only promising way to achieve application-speedups that users urgently need and expect. However, building a robust and efficient query engine for heterogeneous co-processor environments is still a significant challenge. In this paper, we identify two effects that limit performance in case co-processor resources become scarce. Cache thrashing occurs when the working set of queries does not fit into the co-processor's data cache, resulting in performance degradations up to a factor of 24. Heap contention occurs when multiple operators run in parallel on a co-processor and when their accumulated memory footprint exceeds the main memory capacity of the co-processor, slowing down query execution by up to a factor of six. We propose solutions for both effects. Data-driven operator placement avoids data movements when they might be harmful; query chopping limits co-processor memory usage and thus avoids contention. The combined approach-data-driven query chopping-achieves robust and scalable performance on co-processors. We validate our proposal with our open-source GPU-accelerated database engine CoGaDB and the popular star schema and TPC-H benchmarks.

【Keywords】: co-processing; co-processor; databases; query optimization; query processing

128. How to Architect a Query Compiler.

Paper Link】 【Pages】:1907-1922

【Authors】: Amir Shaikhha ; Yannis Klonatos ; Lionel Parreaux ; Lewis Brown ; Mohammad Dashti ; Christoph Koch

【Abstract】: This paper studies architecting query compilers. The state of the art in query compiler construction is lagging behind that in the compilers field. We attempt to remedy this by exploring the key causes of technical challenges in need of well founded solutions, and by gathering the most relevant ideas and approaches from the PL and compilers communities for easy digestion by database researchers. All query compilers known to us are more or less monolithic template expanders that do the bulk of the compilation task in one large leap. Such systems are hard to build and maintain. We propose to use a stack of multiple DSLs on different levels of abstraction with lowering in multiple steps to make query compilers easier to build and extend, ultimately allowing us to create more convincing and sustainable compiler-based data management systems. We attempt to derive our advice for creating such DSL stacks from widely acceptable principles. We have also re-created a well-known query compiler following these ideas and report on this effort.

【Keywords】: domain-specific languages; query compilation

129. Automated Demand-driven Resource Scaling in Relational Database-as-a-Service.

Paper Link】 【Pages】:1923-1934

【Authors】: Sudipto Das ; Feng Li ; Vivek R. Narasayya ; Arnd Christian König

【Abstract】: Relational Database-as-a-Service (DaaS) platforms today support the abstraction of a resource container that guarantees a fixed amount of resources. Tenants are responsible for selecting a container size suitable for their workloads, which they can change to leverage the cloud's elasticity. However, automating this task is daunting for most tenants since estimating resource demands for arbitrary SQL workloads in an RDBMS is complex and challenging. In addition, workloads and resource requirements can vary significantly within minutes to hours, and container sizes vary by orders of magnitude both in the amount of resources as well as monetary cost. We present a solution to enable a DaaS to auto-scale container sizes on behalf of its tenants. Approaches to auto-scale stateless services, such as web servers, that rely on historical resource utilization as the primary signal, often perform poorly for stateful database servers which are significantly more complex. Our solution derives a set of robust signals from database engine telemetry and combines them to significantly improve accuracy of demand estimation for database workloads resulting in more accurate scaling decisions. Our solution raises the abstraction by allowing tenants to reason about monetary budget and query latency rather than resources. We prototyped our approach in Microsoft Azure SQL Database and ran extensive experiments using workloads with realistic time-varying resource demand patterns obtained from production traces. Compared to an approach that uses only resource utilization to estimate demand, our approach results in 1.5x to 3x lower monetary costs while achieving comparable query latencies.

【Keywords】: auto-scaling; elasticity; relational database-as-a-service; resource demand estimation

130. GPL: A GPU-based Pipelined Query Processing Engine.

Paper Link】 【Pages】:1935-1950

【Authors】: Johns Paul ; Jiong He ; Bingsheng He

【Abstract】: Graphics Processing Units (GPUs) have evolved as a powerful query co-processor for main memory On-Line Analytical Processing (OLAP) databases. However, existing GPU-based query processors adopt a kernel-based execution approach which optimizes individual kernels for resource utilization and executes the GPU kernels involved in the query plan one by one. Such a kernel-based approach cannot utilize all GPU resources efficiently due to the resource underutilization of individual kernels and memory ping-pong across kernel executions. In this paper, we propose GPL, a novel pipelined query execution engine to improve the resource utilization of query co-processing on the GPU. Different from the existing kernel-based execution, GPL takes advantage of hardware features of new-generation GPUs including concurrent kernel execution and efficient data communication channel between kernels. We further develop an analytical model to guide the generation of the optimal pipelined query plan. Thus, the tile size of the pipelined query execution can be adapted in a cost-based manner. We evaluate GPL with TPC-H queries on both AMD and NVIDIA GPUs. The experimental results show that 1) the analytical model is able to guide determining the suitable parameter values in pipelined query execution plan, and 2) GPL is able to significantly outperform the state-of-the-art kernel-based query processing approaches, with improvement up to 48%.

【Keywords】: KBE; channel; pipelined execution; tiling

131. Towards a Hybrid Design for Fast Query Processing in DB2 with BLU Acceleration Using Graphical Processing Units: A Technology Demonstration.

Paper Link】 【Pages】:1951-1960

【Authors】: Sina Meraji ; Berni Schiefer ; Lan Pham ; Lee Chu ; Peter Kokosielis ; Adam J. Storm ; Wayne Young ; Chang Ge ; Geoffrey Ng ; Kajan Kanagaratnam

【Abstract】: In this paper, we show how we use Nvidia GPUs and host CPU cores for faster query processing in a DB2 database using BLU Acceleration (DB2's column store technology). Moreover, we show the benefits and problems of using hardware accelerators (more specifically GPUs) in a real commercial Relational Database Management System (RDBMS). We investigate the effect of off-loading specific database operations to a GPU, and show how doing so results in a significant performance improvement. We then demonstrate that for some queries, using just the CPU to perform the entire operation is more beneficial. While we use some of Nvidia's fast kernels for operations like sort, we have also developed our own high performance kernels for operations such as group by and aggregation. Finally, we show how we use a dynamic design that can make use of optimizer metadata to intelligently choose a GPU kernel to run. For the first time in the literature, we use benchmarks representative of customer environments to gauge the performance of our prototype, the results of which show that we can get a speed increase upwards of 2x, using a realistic set of queries.

【Keywords】: DB2 blu; GPU; groupby/aggregate; query optimization; relational database

132. An Experimental Comparison of Thirteen Relational Equi-Joins in Main Memory.

Paper Link】 【Pages】:1961-1976

【Authors】: Stefan Schuh ; Xiao Chen ; Jens Dittrich

【Abstract】: Relational equi-joins are at the heart of almost every query plan. They have been studied, improved, and reexamined on a regular basis since the existence of the database community. In the past four years several new join algorithms have been proposed and experimentally evaluated. Some of those papers contradict each other in their experimental findings. This makes it surprisingly hard to answer a very simple question: what is the fastest join algorithm in 2015? In this paper we will try to develop an answer. We start with an end-to-end black box comparison of the most important methods. Afterwards, we inspect the internals of these algorithms in a white box comparison. We derive improved variants of state-of-the-art join algorithms by applying optimizations like software write-combine buffers, various hash table implementations, as well as NUMA-awareness in terms of data placement and scheduling. We also inspect various radix partitioning strategies. Eventually, we are in the position to perform a comprehensive comparison of thirteen different join algorithms. We factor in scaling effects in terms of size of the input datasets, the number of threads, different page sizes, and data distributions. Furthermore, we analyze the impact of various joins on an (unchanged) TPC-H query. Finally, we conclude with a list of major lessons learned from our study and a guideline for practitioners implementing massive main-memory joins. As is the case with almost all algorithms in databases, we will learn that there is no single best join algorithm. Each algorithm has its strengths and weaknesses and shines in different areas of the parameter space.

【Keywords】: Main Memory

133. Top-k Relevant Semantic Place Retrieval on Spatial RDF Data.

Paper Link】 【Pages】:1977-1990

【Authors】: Jieming Shi ; Dingming Wu ; Nikos Mamoulis

【Abstract】: RDF data are traditionally accessed using structured query languages, such as SPARQL. However, this requires users to understand the language as well as the RDF schema. Keyword search on RDF data aims at relieving the user from these requirements; the user only inputs a set of keywords and the goal is to find small RDF subgraphs which contain all keywords. At the same time, popular RDF knowledge bases also include spatial semantics, which opens the road to location-based search operations. In this work, we propose and study a novel location-based keyword search query on RDF data. The objective of top-k relevant semantic places (kSP) retrieval is to find RDF subgraphs which contain the query keywords and are rooted at spatial entities close to the query location. The novelty of kSP queries is that they are location-aware and that they do not rely on the use of structured query languages. We design a basic method for the processing of kSP queries. To further accelerate kSP retrieval, two pruning approaches and a data preprocessing technique are proposed. Extensive empirical studies on two real datasets demonstrate the superior and robust performance of our proposals compared to the basic method.

【Keywords】: graph processing; knowledge bases; semantic retrieval; spatial RDF

134. Local Similarity Search for Unstructured Text.

Paper Link】 【Pages】:1991-2005

【Authors】: Pei Wang ; Chuan Xiao ; Jianbin Qin ; Wei Wang ; Xiaoyang Zhang ; Yoshiharu Ishikawa

【Abstract】: With the growing popularity of electronic documents, replication can occur for many reasons. People may copy text segments from various sources and make modifications. In this paper, we study the problem of local similarity search to find partially replicated text. Unlike existing studies on similarity search which find entirely duplicated documents, our target is to identify documents that approximately share a pair of sliding windows which differ by no more than τ tokens. Our problem is technically challenging because for sliding windows the tokens to be indexed are less selective than entire documents, rendering set similarity join-based algorithms less efficient. Our proposed method is based on enumerating token combinations to obtain signatures with high selectivity. In order to strike a balance between signature and candidate generation, we partition the token universe and for different partitions we generate combinations composed of different numbers of tokens. A cost-aware algorithm is devised to find a good partitioning of the token universe. We also propose to leverage the overlap between adjacent windows to share computation and thus speed up query processing. In addition, we develop techniques to support large thresholds. Experiments on real datasets demonstrate the efficiency of our method against alternative solutions.

【Keywords】: k-wise signature; local similarity search; prefix filtering; unstructured text

135. Similarity Join over Array Data.

Paper Link】 【Pages】:2007-2022

【Authors】: Weijie Zhao ; Florin Rusu ; Bin Dong ; Kesheng Wu

【Abstract】: Scientific applications are generating an ever-increasing volume of multi-dimensional data that are largely processed inside distributed array databases and frameworks. Similarity join is a fundamental operation across scientific workloads that requires complex processing over an unbounded number of pairs of multi-dimensional points. In this paper, we introduce a novel distributed similarity join operator for multi-dimensional arrays. Unlike immediate extensions to array join and relational similarity join, the proposed operator minimizes the overall data transfer and network congestion while providing load-balancing, without completely repartitioning and replicating the input arrays. We define formally array similarity join and present the design, optimization strategies, and evaluation of the first array similarity join operator.

【Keywords】: array databases; integer programming; join processing; query optimization; similarity join

136. LazyLSH: Approximate Nearest Neighbor Search for Multiple Distance Functions with a Single Index.

Paper Link】 【Pages】:2023-2037

【Authors】: Yuxin Zheng ; Qi Guo ; Anthony K. H. Tung ; Sai Wu

【Abstract】: Due to the "curse of dimensionality" problem, it is very expensive to process the nearest neighbor (NN) query in high-dimensional spaces; and hence, approximate approaches, such as Locality-Sensitive Hashing (LSH), are widely used for their theoretical guarantees and empirical performance. Current LSH-based approaches target the L1 and L2 spaces, while, as shown in previous work, the fractional distance metrics (Lp metrics with 0 < p < 1) can provide more insightful results than the usual L1 and L2 metrics for data mining and multimedia applications. However, none of the existing work can support multiple fractional distance metrics using one index. In this paper, we propose LazyLSH that answers approximate nearest neighbor queries for multiple Lp metrics with theoretical guarantees. Different from previous LSH approaches which need to build one dedicated index for every query space, LazyLSH uses a single base index to support the computations in multiple Lp spaces, significantly reducing the maintenance overhead. Extensive experiments show that LazyLSH provides more accurate results for approximate kNN search under fractional distance metrics.

【Keywords】: LP metrics; locality sensitive hashing; nearest neighbor search
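
Background sketch (not from the paper): LazyLSH builds on locality-sensitive hashing; a standard p-stable LSH hash function of the form h(v) = floor((a·v + b)/w), shown below in Python with Cauchy-distributed projections suited to L1, is the kind of building block such indexes rely on. The parameter values are arbitrary, and LazyLSH's single-base-index construction and its multi-metric guarantees are not reproduced here.

```python
import math
import random

def cauchy(rng):
    """Sample from the standard Cauchy distribution (1-stable, suited to the L1 distance)."""
    return math.tan(math.pi * (rng.random() - 0.5))

def make_hash(dim, bucket_width=4.0, seed=0):
    """One p-stable LSH function h(v) = floor((a . v + b) / w)."""
    rng = random.Random(seed)
    a = [cauchy(rng) for _ in range(dim)]
    b = rng.uniform(0.0, bucket_width)

    def h(vector):
        projection = sum(ai * vi for ai, vi in zip(a, vector))
        return math.floor((projection + b) / bucket_width)

    return h

h = make_hash(dim=3, seed=42)
# nearby vectors tend to land in the same bucket, distant ones usually do not
print(h([1.0, 2.0, 3.0]), h([1.1, 2.0, 3.0]), h([9.0, -4.0, 0.5]))
```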

137. Set-based Similarity Search for Time Series.

Paper Link】 【Pages】:2039-2052

【Authors】: Jinglin Peng ; Hongzhi Wang ; Jianzhong Li ; Hong Gao

【Abstract】: A fundamental problem of time series is k nearest neighbor (k-NN) query processing. However, existing methods are not fast enough for large datasets. In this paper, we propose a novel approach, STS3, to process k-NN queries by transforming time series into sets and measuring the similarity under the Jaccard metric. Our approach is more accurate than Dynamic Time Warping (DTW) in suitable scenarios and is faster than most of the existing methods, due to the efficient similarity search for sets. Besides, we also develop an index, a pruning technique, and an approximation technique to improve the k-NN query procedure. As shown in the experimental results, all of them accelerate query processing effectively.

【Keywords】: k-NN; representation; similarity search; time series
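
Illustrative sketch (not from the paper): the abstract only states that time series are transformed into sets and compared under the Jaccard metric; one simple, hypothetical transformation is to discretize each (time, value) point into a grid cell, as in the Python toy below. STS3's actual transformation, index, pruning, and approximation techniques are not modeled.

```python
def series_to_set(series, time_bin=1, value_bin=0.5):
    """Discretize a time series into a set of (time-bucket, value-bucket) cells."""
    return {(int(t // time_bin), int(v // value_bin)) for t, v in enumerate(series)}

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

s1 = [0.1, 0.2, 0.9, 1.4, 1.5]
s2 = [0.1, 0.6, 0.8, 1.4, 2.1]
print(round(jaccard(series_to_set(s1), series_to_set(s2)), 3))  # 0.429
```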

138. Range-based Obstructed Nearest Neighbor Queries.

Paper Link】 【Pages】:2053-2068

【Authors】: Huaijie Zhu ; Xiaochun Yang ; Bin Wang ; Wang-Chien Lee

【Abstract】: In this paper, we study a novel variant of obstructed nearest neighbor queries, namely, range-based obstructed nearest neighbor (RONN) search. A natural generalization of continuous obstructed nearest-neighbor (CONN), an RONN query retrieves the obstructed nearest neighbor for every point in a specified range. To process RONN, we first propose a CONN-Based (CONNB) algorithm as our baseline, which reduces the RONN query into a range query and four CONN queries processed using an R-tree. To address the shortcomings of the CONNB algorithm, we then propose a new RONN by R-tree Filtering (RONN-RF) algorithm, which explores effective filtering, also using an R-tree. Next, we propose a new index, called O-tree, dedicated to indexing objects in the obstructed space. The novelty of O-tree lies in the idea of dividing the obstructed space into non-obstructed subspaces, aiming to efficiently retrieve highly qualified candidates for RONN processing. We develop an O-tree construction algorithm and propose a space division scheme, called optimal obstacle balance (OOB) scheme, to address the tree balance problem. Accordingly, we propose an efficient algorithm, called RONN by O-tree Acceleration (RONN-OA), which exploits O-tree to accelerate query processing of RONN. In addition, we extend O-tree for indexing polygons. Finally, we conduct a comprehensive performance evaluation using both real and synthetic datasets to validate our ideas and the proposed algorithms. The experimental results show that the RONN-OA algorithm outperforms the two R-tree based algorithms significantly. Moreover, we show that the OOB scheme achieves the best tree balance in O-tree and outperforms two baseline schemes.

【Keywords】: nearest neighbor; obstacle; range-based nearest neighbor

Session 23 - Demonstrations 31

139. Rheem: Enabling Multi-Platform Task Execution.

Paper Link】 【Pages】:2069-2072

【Authors】: Divy Agrawal ; Mouhamadou Lamine Ba ; Laure Berti-Equille ; Sanjay Chawla ; Ahmed K. Elmagarmid ; Hossam Hammady ; Yasser Idris ; Zoi Kaoudi ; Zuhair Khayyat ; Sebastian Kruse ; Mourad Ouzzani ; Paolo Papotti ; Jorge-Arnulfo Quiané-Ruiz ; Nan Tang ; Mohammed J. Zaki

【Abstract】: Many emerging applications, from domains such as healthcare and oil & gas, require several data processing systems for complex analytics. This demo paper showcases Rheem, a framework that provides multi-platform task execution for such applications. It features a three-layer data processing abstraction and a new query optimization approach for multi-platform settings. We will demonstrate the strengths of Rheem by using real-world scenarios from three different applications, namely, machine learning, data cleaning, and data fusion.

【Keywords】: big data; cross-platform execution; data analytics

140. Emma in Action: Declarative Dataflows for Scalable Data Analysis.

Paper Link】 【Pages】:2073-2076

【Authors】: Alexander Alexandrov ; Andreas Salzmann ; Georgi Krastev ; Asterios Katsifodimos ; Volker Markl

【Abstract】: Parallel dataflow APIs based on second-order functions were originally seen as a flexible alternative to SQL. Over time, however, their complexity increased due to the number of physical aspects that had to be exposed by the underlying engines in order to facilitate efficient execution. To retain a sufficient level of abstraction and lower the barrier of entry for data scientists, projects like Spark and Flink currently offer domain-specific APIs on top of their parallel collection abstractions. This demonstration highlights the benefits of an alternative design based on deep language embedding. We showcase Emma - a programming language embedded in Scala. Emma promotes parallel collection processing through native constructs like Scala's for-comprehensions - a declarative syntax akin to SQL. In addition, Emma also advocates quasi-quoting the entire data analysis algorithm rather than its individual dataflow expressions. This allows for decomposing the quoted code into (sequential) control flow and (parallel) dataflow fragments, optimizing the dataflows in context, and transparently offloading them to an engine like Spark or Flink. The proposed design promises increased programmer productivity due to avoiding an impedance mismatch, thereby reducing the lag times and cost of data analysis.

【Keywords】: data-parallel execution; emma; large-scale data analysis; mapreduce; monad comprehensions; parallel dataflows; scala macros

141. Wildfire: Concurrent Blazing Data Ingest and Analytics.

Paper Link】 【Pages】:2077-2080

【Authors】: Ronald Barber ; Matt Huras ; Guy M. Lohman ; C. Mohan ; René Müller ; Fatma Özcan ; Hamid Pirahesh ; Vijayshankar Raman ; Richard Sidle ; Oleg Sidorkin ; Adam J. Storm ; Yuanyuan Tian ; Pinar Tözün

【Abstract】: We demonstrate Hybrid Transactional and Analytics Processing (HTAP) on the Spark platform by the Wildfire prototype, which can ingest up to ~6 million inserts per second per node and simultaneously perform complex SQL analytics queries. Here, a simplified mobile application uses Wildfire to recommend advertising to mobile customers based upon their distance from stores and their interest in products sold by these stores, while continuously graphing analytics results as those customers move and respond to the ads with purchases.

【Keywords】: hybrid transactional and analytical processing; mobile applications; spark

142. Efficient Query Processing on Many-core Architectures: A Case Study with Intel Xeon Phi Processor.

Paper Link】 【Pages】:2081-2084

【Authors】: Xuntao Cheng ; Bingsheng He ; Mian Lu ; Chiew Tong Lau ; Huynh Phung Huynh ; Rick Siow Mong Goh

【Abstract】: Recently, Intel Xeon Phi is emerging as a many-core processor with up to 61 x86 cores. In this demonstration, we present PhiDB, an OLAP query processor with simultaneous multi-threading (SMT) capabilities on Xeon Phi as a case study for parallel database performance on future many-core processors. With the trend towards many-core architectures, query operator optimizations, and efficient query scheduling on such many-core architectures remain as challenging issues. This motivates us to redesign and evaluate query processors. In PhiDB, we apply Xeon Phi aware optimizations on query operators to exploit hardware features of Xeon Phi, and design a heuristic algorithm to schedule the concurrent execution of query operators for better performance, to demonstrate the performance impact of Xeon Phi aware optimizations. We have also developed a user interface for users to explore the underlying performance impacts of hardware-conscious optimizations and scheduling plans.

【Keywords】: Xeon Phi; query optimization; query processing

143. ReproZip: Computational Reproducibility With Ease.

Paper Link】 【Pages】:2085-2088

【Authors】: Fernando Chirigati ; Rémi Rampin ; Dennis Shasha ; Juliana Freire

【Abstract】: We present ReproZip, the recommended packaging tool for the SIGMOD Reproducibility Review. ReproZip was designed to simplify the process of making an existing computational experiment reproducible across platforms, even when the experiment was put together without reproducibility in mind. The tool creates a self-contained package for an experiment by automatically tracking and identifying all its required dependencies. The researcher can share the package with others, who can then use ReproZip to unpack the experiment, reproduce the findings on their favorite operating system, as well as modify the original experiment for reuse in new research, all with little effort. The demo will consist of examples of non-trivial experiments, showing how these can be packed in a Linux machine and reproduced on different machines and operating systems. Demo visitors will also be able to pack and reproduce their own experiments.

【Keywords】: computational reproducibility; provenance; reprozip

144. CLAMS: Bringing Quality to Data Lakes.

Paper Link】 【Pages】:2089-2092

【Authors】: Mina H. Farid ; Alexandra Roatis ; Ihab F. Ilyas ; Hella-Franziska Hoffmann ; Xu Chu

【Abstract】: With the increasing incentive of enterprises to ingest as much data as they can in what is commonly referred to as "data lakes", and with the recent development of multiple technologies to support this "load-first" paradigm, the new environment presents serious data management challenges. Among them, assessing data quality and cleaning large volumes of heterogeneous data become essential tasks in unveiling the value of big data. The widespread use of unstructured and semi-structured data in large volumes makes current data cleaning tools (primarily designed for relational data) not directly applicable. We present CLAMS, a system to discover and enforce expressive integrity constraints from large amounts of lake data with very limited schema information (e.g., represented as RDF triples). This demonstration shows how CLAMS is able to discover the constraints and the schemas they are defined on simultaneously. CLAMS also introduces a scale-out solution to efficiently detect errors in the raw data. CLAMS interacts with human experts to both validate the discovered constraints and suggest data repairs. CLAMS has been deployed in a real large-scale enterprise data lake and evaluated on a real data set of 1.2 billion triples. It has been able to spot multiple obscure data inconsistencies and errors early in the data processing stack, providing substantial value to the enterprise.

【Keywords】: RDF; data lakes; data quality

145. FERARI: A Prototype for Complex Event Processing over Streaming Multi-cloud Platforms.

Paper Link】 【Pages】:2093-2096

【Authors】: Ioannis Flouris ; Vasiliki Manikaki ; Nikos Giatrakos ; Antonios Deligiannakis ; Minos N. Garofalakis ; Michael Mock ; Sebastian Bothe ; Inna Skarbovsky ; Fabiana Fournier ; Marko Stajcer ; Tomislav Krizan ; Jonathan Yom-Tov ; Taji Curin

【Abstract】: In this demo, we present FERARI, a prototype that enables real-time Complex Event Processing (CEP) for large-volume event data streams over distributed topologies. Our prototype constitutes, to our knowledge, the first complete, multi-cloud based end-to-end CEP solution, incorporating: (a) a user-friendly, web-based query authoring tool, (b) a powerful CEP engine implemented on top of a streaming cloud platform, (c) a CEP optimizer that chooses the best query execution plan with respect to low latency and/or reduced inter-cloud communication burden, and (d) a query analytics dashboard encompassing graph and map visualization tools that gives final stakeholders a holistic picture of the detected complex events. As a proof-of-concept, we apply FERARI to enable mobile fraud detection over real, properly anonymized, telecommunication data from the T-Hrvatski Telekom network in Croatia.

【Keywords】: Multi-cloud Platforms

146. Constance: An Intelligent Data Lake System.

Paper Link】 【Pages】:2097-2100

【Authors】: Rihan Hai ; Sandra Geisler ; Christoph Quix

【Abstract】: Big Data, as the challenge of our time, still poses many research problems, especially regarding the variety of data. The high diversity of data sources often results in information silos, a collection of non-integrated data management systems with heterogeneous schemas, query languages, and APIs. Data Lake systems have been proposed as a solution to this problem, by providing a schema-less repository for raw data with a common access interface. However, just dumping all data into a data lake without any metadata management would only lead to a 'data swamp'. To avoid this, we propose Constance, a Data Lake system with sophisticated metadata management over raw data extracted from heterogeneous data sources. Constance discovers, extracts, and summarizes the structural metadata from the data sources, and annotates data and metadata with semantic information to avoid ambiguities. With embedded query rewriting engines supporting structured and semi-structured data, Constance provides users with a unified interface for query processing and data exploration. During the demo, we will walk through each functional component of Constance. Constance will be applied to two real-life use cases in order to show attendees the importance and usefulness of our generic and extensible data lake system.

【Keywords】: data integration; data lake; data quality

147. Exploring Privacy-Accuracy Tradeoffs using DPComp.

Paper Link】 【Pages】:2101-2104

【Authors】: Michael Hay ; Ashwin Machanavajjhala ; Gerome Miklau ; Yan Chen ; Dan Zhang ; George Bissias

【Abstract】: The emergence of differential privacy as a primary standard for privacy protection has led to the development, by the research community, of hundreds of algorithms for various data analysis tasks. Yet deployment of these techniques has been slowed by the complexity of algorithms and an incomplete understanding of the cost to accuracy implied by the adoption of differential privacy. In this demonstration we present DPComp, a publicly-accessible web-based system, designed to support a broad community of users, including data analysts, privacy researchers, and data owners. Users can use DPComp to assess the accuracy of state-of-the-art privacy algorithms and interactively explore algorithm output in order to understand, both quantitatively and qualitatively, the error introduced by the algorithms. In addition, users can contribute new algorithms and new (non-sensitive) datasets. DPComp automatically incorporates user contributions into an evolving benchmark based on a rigorous evaluation methodology articulated by Hay et al. (SIGMOD 2016).

【Keywords】: algorithm evaluation; differential privacy; privacy
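
The DPComp entry above is about measuring the accuracy cost of differentially private algorithms. As a rough illustration of the kind of error measurement such a benchmark performs, the Python sketch below releases a histogram with the classic Laplace mechanism and reports the average per-bin L1 error at several privacy levels; the histogram values and trial counts are made up, and real benchmarks such as DPComp evaluate far more sophisticated algorithms and error metrics.

import numpy as np

def laplace_histogram(counts, epsilon, rng):
    # Laplace mechanism: a histogram has sensitivity 1 per record, so noise scale is 1/epsilon.
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=len(counts))
    return np.asarray(counts, dtype=float) + noise

def average_l1_error(true_counts, epsilon, trials=100, seed=0):
    # Average per-bin L1 error over repeated noisy releases.
    rng = np.random.default_rng(seed)
    true_counts = np.asarray(true_counts, dtype=float)
    errors = [np.abs(laplace_histogram(true_counts, epsilon, rng) - true_counts).mean()
              for _ in range(trials)]
    return float(np.mean(errors))

if __name__ == "__main__":
    hist = [120, 45, 3, 0, 27, 310, 88, 14]   # hypothetical sensitive histogram
    for eps in (0.01, 0.1, 1.0):
        print(f"epsilon={eps}: average L1 error per bin = {average_l1_error(hist, eps):.2f}")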

Paper Link】 【Pages】:2105-2108

【Authors】: Alexander Kalinin ; Ugur Çetintemel ; Stan Zdonik

【Abstract】: Searchlight enables interactive search and exploration of large, multi-dimensional data sets. It allows users to explore by specifying rich constraints for the "objects" they are interested in identifying. Constraints can express a variety of properties, including the shape of the object (e.g., a waveform interval of length 10-100ms), its aggregate properties (e.g., the average amplitude of the signal over the interval is greater than 10), and similarity to another object (e.g., the distance between the interval's waveform and the query waveform is less than 5). Searchlight allows users to specify an arbitrary number of such constraints, mixing different types of constraints in the same query. Searchlight enhances the query execution engine of an array DBMS (currently SciDB) with the ability to perform sophisticated search using the power of Constraint Programming (CP). This allows an existing CP solver from Or-Tools (an open-source suite of operations research tools from Google) to directly access data inside the DBMS without the need to extract and transform it. This demo will illustrate the rich search and exploration capabilities of Searchlight, and its innovative technical features, using the real-world MIMIC II data set, which contains waveform data for multi-parameter recordings of ICU patients, such as ABP (Arterial Blood Pressure) and ECG (electrocardiogram). Users will be able to search for interesting waveform intervals by specifying aggregate properties of the corresponding signals. In addition, they will be able to search for intervals similar to those already found, where similarity is defined as the distance between the signal sequences.

【Keywords】: array databases; constraint programming; data exploration; interactive results; query processing
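
To make the constraint-style interval queries described in the Searchlight entry above concrete, here is a brute-force Python sketch that enumerates waveform intervals satisfying a length constraint, an average-amplitude constraint, and a maximum distance to a query waveform. The signal, query, and thresholds are invented; Searchlight itself delegates this search to a CP solver inside SciDB rather than enumerating naively.

import math

def find_intervals(signal, min_len, max_len, min_avg, query, max_dist):
    # Brute-force analogue of a Searchlight-style query: return (start, length) of every
    # interval whose length, average value, and distance to a query waveform satisfy the
    # given constraints. A CP solver would prune this search space instead.
    results = []
    n = len(signal)
    for start in range(n):
        for length in range(min_len, max_len + 1):
            if start + length > n:
                break
            window = signal[start:start + length]
            if sum(window) / length <= min_avg:
                continue
            # Euclidean distance to the query waveform, compared over the shorter prefix.
            m = min(length, len(query))
            dist = math.sqrt(sum((window[i] - query[i]) ** 2 for i in range(m)))
            if dist < max_dist:
                results.append((start, length))
    return results

if __name__ == "__main__":
    signal = [0, 1, 9, 11, 10, 12, 2, 1, 10, 11, 12, 10, 1, 0]   # toy "waveform"
    query = [10, 11, 12, 10]
    print(find_intervals(signal, min_len=4, max_len=5, min_avg=9.0,
                         query=query, max_dist=3.0))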

149. Ontology-Based Integration of Streaming and Static Relational Data with Optique.

Paper Link】 【Pages】:2109-2112

【Authors】: Evgeny Kharlamov ; Sebastian Brandt ; Ernesto Jiménez-Ruiz ; Yannis Kotidis ; Steffen Lamparter ; Theofilos Mailis ; Christian Neuenstadt ; Özgür L. Özçep ; Christoph Pinkel ; Christoforos Svingos ; Dmitriy Zheleznyakov ; Ian Horrocks ; Yannis E. Ioannidis ; Ralf Möller

【Abstract】: Real-time processing of data coming from multiple heterogeneous data streams and static databases is a typical task in many industrial scenarios such as diagnostics of large machines. A complex diagnostic task may require a collection of up to hundreds of queries over such data. Although many of these queries retrieve data of the same kind, such as temperature measurements, they access structurally different data sources. In this work we show how Semantic Technologies implemented in our system optique can simplify such complex diagnostics by providing an abstraction layer---ontology---that integrates heterogeneous data. In a nutshell, optique allows complex diagnostic tasks to be expressed with just a few high-level semantic queries. The system can then automatically enrich these queries, translate them into a collection with a large number of low-level data queries, and finally optimise and efficiently execute the collection in a heavily distributed environment. We will demo the benefits of optique on a real world scenario from Siemens.

【Keywords】: information integration; ontologies; semantic web; siemens

150. The CloudMdsQL Multistore System.

Paper Link】 【Pages】:2113-2116

【Authors】: Boyan Kolev ; Carlyna Bondiombouy ; Patrick Valduriez ; Ricardo Jiménez-Peris ; Raquel Pau ; José Pereira

【Abstract】: The blooming of different cloud data management infrastructures has turned multistore systems into a major topic in today's cloud landscape. In this demonstration, we present a Cloud Multidatastore Query Language (CloudMdsQL) and its query engine. CloudMdsQL is a functional SQL-like language, capable of querying multiple heterogeneous data stores (relational and NoSQL) within a single query that may contain embedded invocations to each data store's native query interface. The major innovation is that a CloudMdsQL query can exploit the full power of the local data stores, by allowing native queries of the local data stores (e.g., a breadth-first search query against a graph database) to be called as functions while still being optimized. Within our demonstration, we focus on two use cases, each involving four diverse data stores (graph, document, relational, and key-value) and the corresponding CloudMdsQL queries. The query execution flows are visualized by an embedded real-time monitoring subsystem. Users can also try out different ad-hoc queries, not necessarily in the context of the use cases.

【Keywords】: SQL and noSQL integration; cloud; heterogeneous data stores; multistore system

151. ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning.

Paper Link】 【Pages】:2117-2120

【Authors】: Sanjay Krishnan ; Michael J. Franklin ; Ken Goldberg ; Jiannan Wang ; Eugene Wu

【Abstract】: Databases can be corrupted with various errors such as missing, incorrect, or inconsistent values. Increasingly, modern data analysis pipelines involve Machine Learning, and the effects of dirty data can be difficult to debug. Dirty data is often sparse, and naive sampling solutions are not suited for high-dimensional models. We propose ActiveClean, a progressive framework for training Machine Learning models with data cleaning. Our framework updates a model iteratively as the analyst cleans small batches of data, and includes numerous optimizations such as importance weighting and dirty data detection. We designed a visual interface to wrap around this framework and demonstrate ActiveClean for a video classification problem and a topic modeling problem.

【Keywords】: data cleaning; machine learning
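
As a loose sketch of the train-while-cleaning idea described in the ActiveClean entry above (not the paper's actual algorithm), the following Python code takes a small batch of dirty records at a time, applies a stand-in cleaning function, and performs a logistic-regression gradient step on the cleaned batch. The data, the corruption, and the cleaner are all hypothetical, and importance weighting and dirty-data detection are omitted.

import numpy as np

def progressive_clean_train(X_dirty, y, clean_fn, batch=10, lr=0.1, seed=0):
    # Sample a small batch, let the (human or rule-based) cleaner fix it, and take one
    # logistic-regression gradient step on the cleaned batch. Repeats until all records
    # have been cleaned once.
    rng = np.random.default_rng(seed)
    n, d = X_dirty.shape
    w = np.zeros(d)
    unclean = list(range(n))
    while unclean:
        idx = [unclean.pop(rng.integers(len(unclean)))
               for _ in range(min(batch, len(unclean)))]
        Xc, yc = clean_fn(X_dirty[idx], y[idx])           # analyst "cleans" this batch
        p = 1.0 / (1.0 + np.exp(-Xc @ w))                 # logistic regression gradient step
        w -= lr * Xc.T @ (p - yc) / len(yc)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)
    X_dirty = X.copy()
    X_dirty[::5] *= 100                                   # hypothetical corruption
    cleaner = lambda Xb, yb: (np.clip(Xb, -3, 3), yb)     # stand-in for human cleaning
    print(progressive_clean_train(X_dirty, y, cleaner))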

152. Wander Join: Online Aggregation for Joins.

Paper Link】 【Pages】:2121-2124

【Authors】: Feifei Li ; Bin Wu ; Ke Yi ; Zhuoyue Zhao

【Abstract】: Joins are expensive, and online aggregation over joins was proposed to mitigate the cost, which offers a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and may also need very restrictive assumptions (e.g., tuples in a table are stored in random order). We introduce a new approach, wander join, to the online aggregation problem by performing random walks over the underlying join graph. We have also implemented and tested wander join in the latest PostgreSQL.

【Keywords】: database engines; online aggregation; online join aggregation; query optimization
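
The wander-join idea from the entry above can be illustrated in a few lines of Python: perform random walks over a three-table join graph, weight each successful walk by the inverse of its sampling probability, and average. The toy tables below are invented; the paper also covers index-assisted walks, confidence intervals, and the PostgreSQL integration, none of which appear here.

import random
from collections import defaultdict

def wander_join_sum(R, S, T, walks=20000, seed=42):
    # Estimate SUM(T.v) over the join of R(a,b), S(b,c), T(c,v) on b and c by random
    # walks: each successful walk contributes value/probability, failed walks contribute
    # 0, and the average is an unbiased Horvitz-Thompson estimate of the sum.
    random.seed(seed)
    S_by_b = defaultdict(list)
    for b, c in S:
        S_by_b[b].append(c)
    T_by_c = defaultdict(list)
    for c, v in T:
        T_by_c[c].append(v)

    total = 0.0
    for _ in range(walks):
        a, b = random.choice(R)                      # step 1: uniform tuple of R
        cs = S_by_b.get(b, [])
        if not cs:
            continue                                 # failed walk contributes 0
        c = random.choice(cs)                        # step 2: uniform matching S tuple
        vs = T_by_c.get(c, [])
        if not vs:
            continue
        v = random.choice(vs)                        # step 3: uniform matching T tuple
        prob = (1 / len(R)) * (1 / len(cs)) * (1 / len(vs))
        total += v / prob
    return total / walks

if __name__ == "__main__":
    R = [(i, i % 10) for i in range(200)]
    S = [(i % 10, i % 7) for i in range(100)]
    T = [(i % 7, float(i)) for i in range(60)]
    exact = sum(v for _, b in R for (b2, c) in S if b == b2
                  for (c2, v) in T if c == c2)
    print("exact:", exact, " estimate:", round(wander_join_sum(R, S, T), 1))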

Paper Link】 【Pages】:2125-2128

【Authors】: Yaguang Li ; Han Su ; Ugur Demiryurek ; Bolong Zheng ; Kai Zeng ; Cyrus Shahabi

【Abstract】: In this paper, we study PerNav, a route summarization framework for personalized navigation whose goal is to generate more intuitive and customized turn-by-turn directions based on user-generated content. The turn-by-turn directions provided by existing navigation applications are derived exclusively from the underlying road network topology, i.e., the connectivity of nodes. Hence, the directions amount to a metric translation of the physical world (e.g., distance or time to a turn) into spoken language. Such a translation, which ignores human cognition of geographic space, is often verbose and redundant for drivers who already know the area. PerNav utilizes a wealth of user-generated historical trajectory data to extract "landmarks" (e.g., points of interest or intersections) and frequently visited routes between them from the road network. This extracted information is then used to produce cognitive turn-by-turn directions customized for each user.

【Keywords】: personalized navigation; route summarization; trajectory

154. Making the Case for Query-by-Voice with EchoQuery.

Paper Link】 【Pages】:2129-2132

【Authors】: Gabriel Lyons ; Vinh Tran ; Carsten Binnig ; Ugur Çetintemel ; Tim Kraska

【Abstract】: Recent advances in automatic speech recognition and natural language processing have led to a new generation of robust voice-based interfaces. Yet, there is very little work on using voice-based interfaces to query database systems. In fact, one might even wonder who in her right mind would want to query a database system using voice commands! With this demonstration, we make the case for querying database systems using a voice-based interface, a new querying and interaction paradigm we call Query-by-Voice (QbV). We will demonstrate the practicality and utility of QbV for relational DBMSs using a proof-of-concept system called EchoQuery. To achieve a smooth and intuitive interaction, the query interface of EchoQuery is inspired by casual human-to-human conversations. Our demo will show that voice-based interfaces present an intuitive means of querying and consuming data in a database. It will also highlight the unique advantages of QbV over the more traditional approaches, text-based or visual interfaces, for applications where context switching is too expensive, too risky or not possible at all.

【Keywords】: data exploration; natural language interfaces

155. QUEPA: QUerying and Exploring a Polystore by Augmentation.

Paper Link】 【Pages】:2133-2136

【Authors】: Antonio Maccioni ; Edoardo Basili ; Riccardo Torlone

【Abstract】: Polystore systems (or simply polystores) have been recently proposed to support a common scenario in which enterprise data are stored in a variety of database technologies relying on different data models and languages. Polystores provide a loosely coupled integration of data sources and support direct access, with the local language, to each specific storage engine to exploit its distinctive features. Given the absence of a global schema, new challenges for accessing data arise in these environments. In fact, it is usually hard to know in advance whether a query to a specific data store can be satisfied with data stored elsewhere in the polystore. QUEPA addresses these issues by introducing augmented search and augmented exploration in a polystore, two access methods based on the automatic enrichment of the result of a query over a storage system with related data in the rest of the polystore. These features do not impact the applications running on top of the polystore and are compatible with the most common database systems. QUEPA thus implements a lightweight mechanism for data integration in the polystore and operates in a plug-and-play mode, reducing the need for ad-hoc configurations and for middleware layers involving standard APIs, unified query languages or shared data models. In our demonstration, the audience can experiment with the augmentation construct using the native query languages of the database systems available in the polystore.

【Keywords】: BYOL; augmentation; augmented exploration; augmented search; explorative querying; noSQL; polyglot persistence; polystore

156. REACT: Context-Sensitive Recommendations for Data Analysis.

Paper Link】 【Pages】:2137-2140

【Authors】: Tova Milo ; Amit Somech

【Abstract】: Data analysis may be a difficult task, especially for non-expert users, as it requires deep understanding of the investigated domain and the particular context. In this demo we present REACT, a system that hooks into the analysis UI and provides the users with personalized recommendations of analysis actions. By matching the current user session to previous sessions of analysts working with the same or other data sets, REACT is able to identify the potentially best next analysis actions in the given user context. Unlike previous work that mainly focused on individual components of the analysis work, REACT provides a holistic approach that captures a wider range of analysis action types by utilizing novel notions of similarity in terms of the individual actions, the analyzed data and the entire analysis workflow. We demonstrate the functionality of REACT, as well as its effectiveness, through a digital forensics scenario where users are challenged to detect cyber attacks in real-life data obtained from honeypot servers.

【Keywords】: data analysis; react; recommender system

157. PerfEnforce Demonstration: Data Analytics with Performance Guarantees.

Paper Link】 【Pages】:2141-2144

【Authors】: Jennifer Ortiz ; Brendan Lee ; Magdalena Balazinska

【Abstract】: We demonstrate PerfEnforce, a dynamic scaling engine for analytics services. PerfEnforce automatically scales a cluster of virtual machines in order to minimize costs while probabilistically meeting the query runtime guarantees offered by a performance-oriented service level agreement (SLA). The demonstration will show three families of dynamic scaling algorithms (feedback control, reinforcement learning, and online machine learning) and will enable attendees to change tuning parameters, performance thresholds, and workloads to compare and contrast the algorithms in different settings.

【Keywords】: SLA; cloud; database; elasticity
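
Of the three algorithm families mentioned above, the feedback-control one is the easiest to sketch. The hypothetical Python loop below rescales the number of workers in proportion to how far each observed query runtime is from the SLA threshold; the runtimes, bounds, and gain are made up, and PerfEnforce's actual controllers are considerably more careful.

def feedback_scale(runtimes, sla, start_workers=4, min_w=1, max_w=32, alpha=1.0):
    # Toy feedback controller: runtime/sla > 1 means the SLA was violated and more
    # workers are needed; < 1 means the cluster is over-provisioned and can shrink.
    workers = start_workers
    history = []
    for t in runtimes:
        error = t / sla
        workers = round(workers * (1 + alpha * (error - 1)))
        workers = max(min_w, min(max_w, workers))
        history.append(workers)
    return history

if __name__ == "__main__":
    observed = [12.0, 9.5, 6.1, 5.2, 4.8, 5.1, 7.3, 5.0]   # hypothetical query runtimes (s)
    print(feedback_scale(observed, sla=5.0))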

158. High-Performance Geospatial Analytics in HyPerSpace.

Paper Link】 【Pages】:2145-2148

【Authors】: Varun Pandey ; Andreas Kipf ; Dimitri Vorona ; Tobias Mühlbauer ; Thomas Neumann ; Alfons Kemper

【Abstract】: In the past few years, massive amounts of location-based data have been captured. Numerous datasets containing user location information are readily available to the public. Analyzing such datasets can lead to fascinating insights into the mobility patterns and behaviors of users. Moreover, in recent times a number of geospatial data-driven companies like Uber, Lyft, and Foursquare have emerged. Real-time analysis of geospatial data is essential and enables an emerging class of applications. Database support for geospatial operations is turning into a necessity instead of a distinct feature provided by only a few databases. Even though a lot of database systems provide geospatial support nowadays, queries often do not consider the most current database state. Geospatial queries are inherently slow, since some of them require a number of geometric computations. Disk-based database systems that do support geospatial datatypes and queries provide rich features and functions, but they fall behind in performance, specifically when real-time analysis of the latest transactional state is required. In this demonstration, we present HyPerSpace, an extension to the high-performance main-memory database system HyPer developed at the Technical University of Munich, capable of processing geospatial queries with sub-second latencies.

【Keywords】: geospatial data processing; indexing schemes

159. What Makes a Good Physical plan?: Experiencing Hardware-Conscious Query Optimization with Candomblé.

Paper Link】 【Pages】:2149-2152

【Authors】: Holger Pirk ; Oscar Moll ; Sam Madden

【Abstract】: Query optimization is hard, and the current proliferation of "modern" hardware does nothing to make it any easier. In addition, the tools that are commonly used by performance engineers, such as compiler intrinsics, static analyzers or hardware performance counters, are neither integrated with data management systems nor easy to learn. This makes it (unnecessarily) hard to educate engineers, to prototype and to optimize database query plans for modern hardware. To address this problem, we developed a system called Candomblé that lets database performance engineers interactively examine, optimize and evaluate query plans using a touch-based interface. Candomblé puts attendees in the place of a physical query optimizer that has to rewrite a physical query plan into a better equivalent plan. Attendees experience the challenges of ad-hoc optimization of a physical plan for processing devices such as GPUs and CPUs, and capture their gained knowledge in rules to be used by a rule-based optimizer.

【Keywords】: interactive; learning; query optimization; teaching

160. SnappyData: A Hybrid Transactional Analytical Store Built On Spark.

Paper Link】 【Pages】:2153-2156

【Authors】: Jags Ramnarayan ; Barzan Mozafari ; Sumedh Wale ; Sudhir Menon ; Neeraj Kumar ; Hemant Bhanawat ; Soubhik Chakraborty ; Yogesh Mahajan ; Rishitesh Mishra ; Kishor Bachhav

【Abstract】: In recent years, our customers have expressed frustration with the traditional approach of using a combination of disparate products to handle their streaming, transactional and analytical needs. The common practice of stitching heterogeneous environments together in custom ways has caused enormous production woes by increasing development complexity and total cost of ownership. With SnappyData, an open source platform, we propose a unified engine for real-time operational analytics, delivering stream analytics, OLTP and OLAP in a single integrated solution. We realize this platform through a seamless integration of Apache Spark (as a big data computational engine) with GemFire (as an in-memory transactional store with scale-out SQL semantics). In this demonstration, after presenting a few use case scenarios, we exhibit SnappyData as our in-memory solution for delivering truly interactive analytics (i.e., a couple of seconds), when faced with large data volumes or high-velocity streams. We show that SnappyData can exploit state-of-the-art approximate query processing techniques and a variety of data synopses. Finally, we allow the audience to define various high-level accuracy contracts (HAC), to communicate their accuracy requirements with SnappyData in an intuitive fashion.

【Keywords】: OLAP; OLTP; in-memory database; spark; spark streaming; stream analytics; stream processing
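
One ingredient mentioned in the SnappyData entry above is approximate query processing over data synopses. The self-contained Python sketch below shows the basic idea for a uniform sample: estimate a SUM from a small sample and report a CLT-style error bound, which is the kind of accuracy/latency trade-off a high-level accuracy contract would govern. The data and sampling rate are arbitrary, and real AQP engines use stratified samples and other synopses.

import math
import random

def approx_sum(values, sample_frac=0.01, confidence_z=1.96, seed=7):
    # Estimate SUM(values) from a uniform sample and report a CLT-based half-width
    # for the given confidence level (1.96 ~ 95%). Finite-population correction and
    # stratification are intentionally left out of this sketch.
    random.seed(seed)
    n = len(values)
    k = max(2, int(n * sample_frac))
    sample = random.sample(values, k)
    mean = sum(sample) / k
    var = sum((x - mean) ** 2 for x in sample) / (k - 1)
    est = n * mean
    half_width = confidence_z * n * math.sqrt(var / k)
    return est, half_width

if __name__ == "__main__":
    data = [float(i % 100) for i in range(1_000_000)]
    est, hw = approx_sum(data)
    print(f"estimate = {est:,.0f} +/- {hw:,.0f}   (exact = {sum(data):,.0f})")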

161. SourceSight: Enabling Effective Source Selection.

Paper Link】 【Pages】:2157-2160

【Authors】: Theodoros Rekatsinas ; Amol Deshpande ; Xin Luna Dong ; Lise Getoor ; Divesh Srivastava

【Abstract】: Recently there has been a rapid increase in the number of data sources and data services, such as cloud-based data markets and data portals, that facilitate the collection, publishing and trading of data. Data sources typically exhibit large heterogeneity in the type and quality of data they provide. Unfortunately, when the number of data sources is large, it is difficult for users to reason about the actual usefulness of sources for their applications and the trade-offs between the benefits and costs of acquiring and integrating sources. In this demonstration we present SourceSight, a system that allows users to interactively explore a large number of heterogeneous data sources, and discover valuable sets of sources for diverse integration tasks. SourceSight uses a novel multi-level source quality index that enables effective source selection at different granularity levels, and introduces a collection of new techniques to discover and evaluate relevant sources for integration.

【Keywords】: SourceSight

162. BART in Action: Error Generation and Empirical Evaluations of Data-Cleaning Systems.

Paper Link】 【Pages】:2161-2164

【Authors】: Donatello Santoro ; Patricia C. Arocena ; Boris Glavic ; Giansalvatore Mecca ; Renée J. Miller ; Paolo Papotti

【Abstract】: Repairing erroneous or conflicting data that violate a set of constraints is an important problem in data management. Many automatic or semi-automatic data-repairing algorithms have been proposed in the last few years, each with its own strengths and weaknesses. Bart is an open-source error-generation system conceived to support thorough experimental evaluations of these data-repairing systems. The demo is centered around three main lessons. To start, we discuss how generating errors in data is a complex problem, with several facets. We introduce the important notions of detectability and repairability of an error, that stand at the core of Bart. Then, we show how, by changing the features of errors, it is possible to influence quite significantly the performance of the tools. Finally, we concretely put to work five data-repairing algorithms on dirty data of various kinds generated using Bart, and discuss their performance.

【Keywords】: data cleaning; data repairing; empirical evaluation; error generation
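
To illustrate the notion of a detectable error that the BART entry above builds on, this hypothetical Python sketch injects a violation of a functional dependency zip -> city by perturbing one tuple that shares its zip with another, and then shows that a simple FD checker flags it. BART itself controls detectability and repairability much more precisely and supports many other error types; the table, attributes, and the "_typo" perturbation are all invented.

import copy
import random

def inject_fd_violation(table, lhs, rhs, seed=3):
    # Pick two tuples that agree on lhs and change rhs in one of them, so the
    # dependency lhs -> rhs now flags the error (i.e., the error is detectable).
    random.seed(seed)
    dirty = copy.deepcopy(table)
    groups = {}
    for i, row in enumerate(dirty):
        groups.setdefault(row[lhs], []).append(i)
    candidates = [idxs for idxs in groups.values() if len(idxs) >= 2]
    i = random.choice(random.choice(candidates))
    dirty[i][rhs] = dirty[i][rhs] + "_typo"            # the injected, detectable error
    return dirty, i

def fd_violations(table, lhs, rhs):
    # Return indices of tuples whose rhs value disagrees with the first tuple
    # seen for the same lhs value (a simple, incomplete violation check).
    seen, bad = {}, set()
    for i, row in enumerate(table):
        if row[lhs] in seen and seen[row[lhs]][1] != row[rhs]:
            bad.update({i, seen[row[lhs]][0]})
        seen.setdefault(row[lhs], (i, row[rhs]))
    return sorted(bad)

if __name__ == "__main__":
    clean = [{"zip": "10001", "city": "New York"},
             {"zip": "10001", "city": "New York"},
             {"zip": "60601", "city": "Chicago"}]
    dirty, changed = inject_fd_violation(clean, "zip", "city")
    print("changed tuple:", changed, " violations detected:", fd_violations(dirty, "zip", "city"))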

163. RxSpatial: Reactive Spatial Library for Real-Time Location Tracking and Processing.

Paper Link】 【Pages】:2165-2168

【Authors】: Youying Shi ; Abdeltawab M. Hendawi ; Hossam Fattah ; Mohamed H. Ali

【Abstract】: Current commercial spatial libraries provide strong support for functionality such as intersection, distance, and area computations over stationary geospatial objects. What is missing is support for moving objects. Performing real-time location tracking and computation for moving objects on the server side of a GIS application is challenging because of the high volume of moving objects to track, the time complexity of analysis and computation, and the real-time requirements. In this demo, we present RxSpatial, a real-time reactive spatial library that consists of (1) a front-end, a programming interface for developers who are familiar with the Reactive framework and the Microsoft Spatial Library, and (2) a back-end for processing spatial operations in a streaming fashion. We then present demonstration scenarios that show how RxSpatial is employed in real-world applications, including criminal activity tracking, a collaborative vehicle system, performance analysis, and interactive internal inspection.

【Keywords】: GIS; location aware services; moving objects; reactive spatial; spatial database; spatial library; spatio-temporal data

164. Web-based Benchmarks for Forecasting Systems: The ECAST Platform.

Paper Link】 【Pages】:2169-2172

【Authors】: Robert Ulbricht ; Claudio Hartmann ; Martin Hahmann ; Hilko Donker ; Wolfgang Lehner

【Abstract】: The role of precise forecasts in the energy domain has changed dramatically. New supply forecasting methods are being developed to better address this challenge, but meaningful benchmarks are rare and time-intensive. We propose the ECAST online platform in order to solve that problem. The system's capability is demonstrated on a real-world use case by comparing the performance of different prediction tools.

【Keywords】: benchmark; time series forecasting; transparency

165. Energy Elasticity on Heterogeneous Hardware using Adaptive Resource Reconfiguration LIVE.

Paper Link】 【Pages】:2173-2176

【Authors】: Annett Ungethüm ; Thomas Kissinger ; Willi-Wolfram Mentzel ; Dirk Habich ; Wolfgang Lehner

【Abstract】: Energy awareness of database systems has emerged as a critical research topic, since energy consumption is becoming a major limiter for their scalability. Recent energy-related hardware developments trend towards offering more and more configuration opportunities for the software to control its own energy consumption. Existing research so far mainly focused on leveraging this configuration spectrum to find the most energy-efficient configuration for specific operators or entire queries. In this demo, we introduce the concept of energy elasticity and propose the energy-control loop as an implementation of this concept. Energy elasticity refers to the ability of software to behave energy-proportional and energy-efficient at the same time while maintaining a certain quality of service. Thus, our system does not draw the least energy possible but the least energy necessary to still perform reasonably. We demonstrate our overall approach using a rich interactive GUI to give attendees the opportunity to learn more about our concept.

【Keywords】: adaptivity; energy; energy elasticity; heterogeneity

166. QFix: Demonstrating Error Diagnosis in Query Histories.

Paper Link】 【Pages】:2177-2180

【Authors】: Xiaolan Wang ; Alexandra Meliou ; Eugene Wu

【Abstract】: An increasing number of applications in all aspects of society rely on data. Despite the long line of research in data cleaning and repairs, data correctness has been an elusive goal. Errors in the data can be extremely disruptive, and are detrimental to the effectiveness and proper function of data-driven applications. Even when data is cleaned, new errors can be introduced by applications and users who interact with the data. Subsequent valid updates can obscure these errors and propagate them through the dataset causing more discrepancies. Any discovered errors tend to be corrected superficially, on a case-by-case basis, further obscuring the true underlying cause, and making detection of the remaining errors harder. In this demo proposal, we outline the design of QFix, a query-centric framework that derives explanations and repairs for discrepancies in relational data based on potential errors in the queries that operated on the data. This is a marked departure from traditional data-centric techniques that directly fix the data. We then describe how users will use QFix in a demonstration scenario. Participants will be able to select from a number of transactional benchmarks, introduce errors into the queries that are executed, and compare the fixes to the queries proposed by QFix as well as existing alternative algorithms such as decision trees.

【Keywords】: data cleaning; mixed-integer linear programming; query provenance

167. CoDAR: Revealing the Generalized Procedure & Recommending Algorithms of Community Detection.

Paper Link】 【Pages】:2181-2184

【Authors】: Xiang Ying ; Chaokun Wang ; Meng Wang ; Jeffrey Xu Yu ; Jun Zhang

【Abstract】: Community detection has attracted great interest in graph analysis and mining during the past decade, and a great number of approaches have been developed to address this problem. However, the lack of a uniform framework and a reasonable evaluation method makes it a puzzle to analyze, compare and evaluate the extensive body of work, let alone to pick out the best one when necessary. In this paper, we design a tool called CoDAR, which reveals the generalized procedure of community detection and monitors the real-time structural changes of the network during the detection process. Moreover, CoDAR adopts 12 recognized metrics and builds a rating model for performance evaluation of communities to recommend the best-performing algorithm. Finally, the tool also provides nice interactive windows for display.

【Keywords】: algorithm recommendation; community detection; procedure monitoring

168. DB-Risk: The Game of Global Database Placement.

Paper Link】 【Pages】:2185-2188

【Authors】: Victor Zakhary ; Faisal Nawab ; Divyakant Agrawal ; Amr El Abbadi

【Abstract】: Geo-replication is the process of maintaining copies of data at geographically dispersed datacenters for better availability and fault-tolerance. The distinguishing characteristic of geo-replication is the large wide-area latency between datacenters that varies widely depending on the location of the datacenters. Thus, choosing the datacenters in which to deploy a cloud application has a direct impact on the observed response time. We propose an optimization framework that automatically derives a geo-replication placement plan with the objective of minimizing latency. By running the optimization framework on real placement scenarios, we learn a set of placement optimizations for geo-replication. Some of these optimizations are surprising while others are, in retrospect, straightforward. In this demonstration, we highlight the geo-replication placement optimizations through the DB-Risk game. DB-Risk invites players to create different placement scenarios while experimenting with the proposed optimizations. The placements created by the players are tested on real cloud deployments.

【Keywords】: geo-replication; placement; transactions
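
A minimal version of the placement question behind DB-Risk can be phrased as a small optimization: given round-trip times between client regions and candidate datacenters, choose the replica set that minimizes the worst-case latency to assemble a majority quorum. The Python sketch below brute-forces this with invented latency numbers; the paper's framework optimizes richer objectives over real deployments.

from itertools import combinations

def best_placement(latency, clients, candidates, replicas=3):
    # latency[c][d] is the round-trip time (ms) from client region c to datacenter d.
    # A commit needs acknowledgements from a majority quorum of the chosen replicas,
    # so each client waits for the quorum-th fastest replica.
    quorum = replicas // 2 + 1
    best = None
    for placement in combinations(candidates, replicas):
        worst = 0.0
        for c in clients:
            rtts = sorted(latency[c][d] for d in placement)
            worst = max(worst, rtts[quorum - 1])
        if best is None or worst < best[1]:
            best = (placement, worst)
    return best

if __name__ == "__main__":
    lat = {                                             # hypothetical RTTs in ms
        "us":   {"virginia": 20, "oregon": 40, "ireland": 90, "tokyo": 150},
        "eu":   {"virginia": 90, "oregon": 140, "ireland": 15, "tokyo": 220},
        "asia": {"virginia": 160, "oregon": 110, "ireland": 200, "tokyo": 25},
    }
    dcs = ["virginia", "oregon", "ireland", "tokyo"]
    print(best_placement(lat, clients=["us", "eu", "asia"], candidates=dcs, replicas=3))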

169. Quegel: A General-Purpose System for Querying Big Graphs.

Paper Link】 【Pages】:2189-2192

【Authors】: Qizhen Zhang ; Da Yan ; James Cheng

【Abstract】: Inspired by Google's Pregel, many distributed graph processing systems have been developed recently to process big graphs. These systems expose a vertex-centric programming interface to users, where a programmer thinks like a vertex when designing parallel graph algorithms. However, existing systems are designed for tasks where most vertices in a graph participate in the computation, and they are not suitable for processing light-workload graph queries which only access a small portion of the vertices. This is because their programming model can seriously under-utilize the resources in a cluster when processing graph queries. In this demonstration, we introduce a general-purpose system for querying big graphs, called Quegel, which treats queries as first-class citizens in the design of its computing model. Quegel adopts a novel superstep-sharing execution model to overcome the weaknesses of existing systems. We demonstrate how user-friendly it is to write parallel graph-querying programs with Quegel's interface, and we also show that Quegel is able to achieve real-time response times in various applications, including the two applications that we plan to demonstrate: point-to-point shortest-path queries and XML keyword search.

【Keywords】: big data; distributed; graph; pregel; query
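
As a toy illustration of vertex-centric query processing in the style described above, the following Python sketch runs a point-to-point shortest-path computation in supersteps: only vertices that received a shorter tentative distance are active in the next round. Quegel's superstep-sharing, distribution, and indexing are not modeled, and the graph is made up.

def p2p_shortest_path(adj, src, dst):
    # Superstep-style single-pair shortest path: in each superstep, active vertices
    # (those that received an improved distance) relax their outgoing edges and send
    # messages; the computation stops when no messages remain.
    dist = {src: 0.0}
    messages = {src: 0.0}                     # vertex -> best distance received this round
    while messages:
        next_messages = {}
        for u, du in messages.items():        # "compute" runs on active vertices only
            for v, w in adj.get(u, []):
                d = du + w
                if d < dist.get(v, float("inf")):
                    dist[v] = d
                    next_messages[v] = min(d, next_messages.get(v, float("inf")))
        messages = next_messages
    return dist.get(dst, float("inf"))

if __name__ == "__main__":
    graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 1), ("d", 5)], "c": [("d", 1)]}
    print(p2p_shortest_path(graph, "a", "d"))   # expected 3.0 via a-b-c-d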

Session 24 - Tutorials 10

170. Introduction to Spark 2.0 for Database Researchers.

Paper Link】 【Pages】:2193-2194

【Authors】: Michael Armbrust ; Doug Bateman ; Reynold Xin ; Matei Zaharia

【Abstract】: Originally started as an academic research project at UC Berkeley, Apache Spark is one of the most popular open source projects for big data analytics. Over 1000 volunteers have contributed code to the project; it is supported by virtually every commercial vendor; many universities are now offering courses on Spark. Spark has evolved significantly since the 2010 research paper: its foundational APIs are becoming more relational and structural with the introduction of the Catalyst relational optimizer, and its execution engine is developing quickly to adopt the latest research advances in database systems such as whole-stage code generation. This tutorial is designed for database researchers (graduate students, faculty members, and industrial researchers) interested in a brief hands-on overview of Spark. This tutorial covers the core APIs for using Spark 2.0, including DataFrames, Datasets, SQL, streaming and machine learning pipelines. Each topic includes slide and lecture content along with hands-on use of a Spark cluster through a web-based notebook environment. In addition, we will dive into the engine internals to discuss architectural design choices and their implications in practice. We will guide the audience to "hack" Spark by extending its query optimizer to speed up distributed join execution.

【Keywords】: Hadoop; SQL; big data; machine learning; spark; streaming

171. Design Tradeoffs of Data Access Methods.

Paper Link】 【Pages】:2195-2200

【Authors】: Manos Athanassoulis ; Stratos Idreos

【Abstract】: Database researchers and practitioners have been building methods to store, access, and update data for more than five decades. Designing access methods has been a constant effort to adapt to the ever-changing underlying hardware and workload requirements. The recent explosion in data system designs - including, in addition to traditional SQL systems, NoSQL, NewSQL, and other relational and non-relational systems - makes understanding the tradeoffs of designing access methods more important than ever. Access methods are at the core of any new data system. In this tutorial we survey recent developments in access method design and we place them in the design space where each approach focuses primarily on one or a subset of read performance, update performance, and memory utilization. We discuss how to utilize designs and lessons learned from past research. In addition, we discuss new ideas on how to build access methods with tunable behavior, as well as the landscape of open research problems.

【Keywords】: access methods; approximate indexing; cache optimizations; continuous reorganization; data skipping; design tradeoffs; differential updates; log-structure design; logarithmic structure; read-optimized indexing; rum tradeoffs; space-efficient indexing; update-optimized indexing

172. Data Cleaning: Overview and Emerging Challenges.

Paper Link】 【Pages】:2201-2206

【Authors】: Xu Chu ; Ihab F. Ilyas ; Sanjay Krishnan ; Jiannan Wang

【Abstract】: Detecting and repairing dirty data is one of the perennial challenges in data analytics, and failure to do so can result in inaccurate analytics and unreliable decisions. Over the past few years, there has been a surge of interest from both industry and academia on data cleaning problems including new abstractions, interfaces, approaches for scalability, and statistical techniques. To better understand the new advances in the field, we will first present a taxonomy of the data cleaning literature in which we highlight the recent interest in techniques that use constraints, rules, or patterns to detect errors, which we call qualitative data cleaning. We will describe the state-of-the-art techniques and also highlight their limitations with a series of illustrative examples. While traditionally such approaches are distinct from quantitative approaches such as outlier detection, we also discuss recent work that casts such approaches into a statistical estimation framework including: using Machine Learning to improve the efficiency and accuracy of data cleaning and considering the effects of data cleaning on statistical analysis.

【Keywords】: data cleaning; data quality; integrity constraints; sampling; statistical cleaning

173. Querying Geo-Textual Data: Spatial Keyword Queries and Beyond.

Paper Link】 【Pages】:2207-2212

【Authors】: Gao Cong ; Christian S. Jensen

【Abstract】: Over the past decade, we have moved from a predominantly desktop-based web to a predominantly mobile web, where users most often access the web from mobile devices such as smartphones. In addition, we are witnessing a proliferation of geo-located, textual web content. Motivated in part by these developments, the research community has been hard at work enabling the efficient computation of a variety of query functionality on geo-textual data, yielding a sizable body of literature on the querying of geo-textual data. With a focus on different types of keyword-based queries on geo-textual data, the tutorial also explores topics such as continuous queries on streaming geo-textual data, queries that retrieve attractive regions of geo-textual objects, and queries that extract properties, e.g., topics and top-k frequent words, of the objects in regions. The tutorial is designed to offer an overview of the problems addressed in this body of literature and offers an overview of pertinent concepts and techniques. In addition, the tutorial suggests open problems and new research directions.

【Keywords】: geo-textual data management; spatial keyword query

174. Provenance: On and Behind the Screens.

Paper Link】 【Pages】:2213-2217

【Authors】: Melanie Herschel ; Marcel Hlawatsch

【Abstract】: Collecting and processing provenance, i.e., information describing the production process of some end product, is important in various applications, e.g., to assess quality, to ensure reproducibility, or to reinforce trust in the end product. In the past, different types of provenance meta-data have been proposed, each with a different scope. The first part of the proposed tutorial provides an overview and comparison of these different types of provenance. To put provenance to good use, it is essential to be able to interact with and present provenance data in a user-friendly way. Often, users interested in provenance are not necessarily experts in databases or query languages, as they are typically domain experts of the product and production process for which provenance is collected (biologists, journalists, etc.). Furthermore, in some scenarios, it is difficult to use solely queries for analyzing and exploring provenance data. The second part of this tutorial therefore focuses on enabling users to leverage provenance through adapted visualizations. To this end, we will present some fundamental concepts of visualization before we discuss possible visualizations for provenance.

【Keywords】: data provenance; provenance visualization; tutorial; workflow provenance

175. Microblogs Data Management Systems: Querying, Analysis, and Visualization.

Paper Link】 【Pages】:2219-2222

【Authors】: Mohamed F. Mokbel ; Amr Magdy

【Abstract】: Microblogs data, e.g., tweets, reviews, news comments, and social media comments, has gained considerable attention in recent years due to its popularity and rich contents. Nowadays, microblogs applications span a wide spectrum of interests, including analyzing events and users activities and critical applications like discovering health issues and rescue services. Consequently, major research efforts are spent to manage, analyze, and visualize microblogs data to support different applications. In this tutorial, we give a 1.5 hours overview about microblogs data management, analysis, visualization, and systems. The tutorial gives a comprehensive review for research on core data management components to support microblogs queries at scale. This includes system-level issues and on-going work on supporting microblogs data through the rising wave of big data systems. In addition, the tutorial reviews research on microblogs data analysis and visualization. Through its different parts, the tutorial highlights the challenges and opportunities in microblogs data research.

【Keywords】: big data systems; microblogs; social media analysis; visualization

176. The Challenges of Global-scale Data Management.

Paper Link】 【Pages】:2223-2227

【Authors】: Faisal Nawab ; Divyakant Agrawal ; Amr El Abbadi

【Abstract】: Global-scale data management (GSDM) empowers systems by providing higher levels of fault-tolerance, read availability, and efficiency in utilizing cloud resources. This has led to the emergence of global-scale data management and event processing. However, the Wide-Area Network (WAN) latency separating data is orders of magnitude larger than conventional network latencies, and this requires a reevaluation of many of the traditional design trade-offs of data management systems. Therefore, data management problems must be revisited to account for the new design space. In this tutorial we survey recent developments in GSDM focusing on identifying fundamental challenges and advancements in addition to open research opportunities.

【Keywords】: availability; fault-tolerance; global-scale

177. Semistructured Models, Queries and Algebras in the Big Data Era: Tutorial Summary.

Paper Link】 【Pages】:2229-2233

【Authors】: Yannis Papakonstantinou

【Abstract】: Numerous databases promoted as SQL-on-Hadoop, NewSQL and NoSQL support semi-structured, schemaless and heterogeneous data, typically in the form of enriched JSON. They also provide corresponding query languages. In addition to these genuine JSON databases, relational databases also provide special functions and language features for the support of JSON columns, typically piggybacking on non-1NF (non first normal form) features that SQL acquired over the years. We refer to SQL databases with JSON support as SQL/JSON databases. The evolving query languages present multiple variations: some are superficial syntactic ones, while others are genuine differences in modeling, language capabilities and semantics. Incompatibility with SQL presents a learning challenge for genuine JSON databases, while the table orientation of SQL/JSON databases often leads to cumbersome syntactic/semantic structures that are contrary to the semistructured nature of JSON. Furthermore, the query languages often fall short of full-fledged semistructured query language capabilities, when compared to the yardstick set by XQuery and prior works on semistructured data (even after superficial model differences are abstracted out). We survey features, the designers' options and differences in the approaches taken by actual systems. In particular, we first present a SQL backwards-compatible language, named SQL++, which can access both SQL and JSON data. SQL++ is expected to be supported by Couchbase's CouchDB and UCI's AsterixDB semistructured databases. Then we expand SQL++ into Configurable SQL++, where multiple possible (and different) semantics are formally captured by the values that the language's semantic configuration options can take. We show how appropriate setting of the configuration options morphs the Configurable SQL++ semantics into the semantics of 10 surveyed languages, hence providing a compact and formal tool to understand the essential semantic differences between different systems. We briefly comment on the utility of formally capturing semantic variations in polystore systems. Finally, we discuss the comparison with prior nested and semistructured query languages (notably OQL and XQuery) and describe a key aspect of query processor implementation: set-oriented semistructured query algebras. In particular, we transfer into the JSON era lessons from the semistructured query processing research of the 90s and 00s and combine them with insights on current JSON databases. Again, the tutorial presents the algebras' fundamentals while it abstracts away modeling differences that are not applicable.

【Keywords】: JSON query languages; SQL; newSQL; noSQL; polystores; query algebras; semistructured query languages

178. Automatic Entity Recognition and Typing in Massive Text Data.

Paper Link】 【Pages】:2235-2239

【Authors】: Xiang Ren ; Ahmed El-Kishky ; Heng Ji ; Jiawei Han

【Abstract】: In today's computerized and information-based society, individuals are constantly presented with vast amounts of text data, ranging from news articles, scientific publications, product reviews, to a wide range of textual information from social media. To extract value from these large, multi-domain pools of text, it is of great importance to gain an understanding of entities and their relationships. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in massive, domain-specific text corpora. These methods can automatically identify token spans as entity mentions in documents and label their fine-grained types (e.g., people, product and food) in a scalable way. Since these methods do not rely on annotated data, predefined typing schema or hand-crafted features, they can be quickly adapted to a new domain, genre and language. We demonstrate on real datasets including various genres (e.g., news articles, discussion forum posts, and tweets), domains (general vs. bio-medical domains) and languages (e.g., English, Chinese, Arabic, and even low-resource languages like Hausa and Yoruba) how these typed entities aid in knowledge discovery and management.

【Keywords】: entity; entity recognition; entity typing; phrase mining; phrases; text mining; typing

179. Big Graph Analytics Systems.

Paper Link】 【Pages】:2241-2243

【Authors】: Da Yan ; Yingyi Bu ; Yuanyuan Tian ; Amol Deshpande ; James Cheng

【Abstract】: In recent years we have witnessed a surging interest in developing Big Graph processing systems. To date, tens of Big Graph systems have been proposed. This tutorial provides a timely and comprehensive review of existing Big Graph systems, and summarizes their pros and cons from various perspectives. We start from the existing vertex-centric systems, with which a programmer thinks intuitively like a vertex when developing parallel graph algorithms. We then introduce systems that adopt other computation paradigms and execution settings. The topics covered in this tutorial include programming models and algorithm design, computation models, communication mechanisms, out-of-core support, fault tolerance, dynamic graph support, and so on. We also highlight future research opportunities in Big Graph analytics.

【Keywords】: graph; graphlab; platform; pregel; system; vertex-centric

Session 25 - Undergraduate Student Abstracts 11

180. Constructing Join Histograms from Histograms with q-error Guarantees.

Paper Link】 【Pages】:2245-2246

【Authors】: Kaleb Alway ; Anisoara Nica

【Abstract】: Histograms are implemented and used in virtually every database system, usually defined on a single column of a database table. However, among the most desired statistics in such systems are statistics on the correlation among columns. In this paper we present a novel construction algorithm for building a join histogram that accepts two single-column histograms over different attributes, each with q-error guarantees, and produces a histogram over the result of the join operation on these attributes. The join histogram is built only from the input histograms, without accessing the base data or computing the join relation. Under certain restrictions, a q-error guarantee can be placed on the produced join histogram. It is possible to construct adversarial input histograms that produce arbitrarily large q-error in the resulting join histogram, but across several experiments, this type of input does not occur in either randomly generated data or real-world data. Our construction algorithm runs in linear time with respect to the size of the input histograms, and produces a join histogram that is at most as large as the sum of the sizes of the input histograms. These join histograms can be used to efficiently and accurately estimate the cardinality of join queries.

【Keywords】: cardinality estimation; histogram; query optimization
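
A stripped-down version of the construction described in the entry above can be sketched as follows: given two single-column histograms with (lo, hi, count) buckets over an integer domain, intersect bucket ranges and, assuming values are spread uniformly inside a bucket, multiply per-value densities to get the join histogram; the resulting size estimate is then compared to the true join size via the q-error. The bucket boundaries and data below are invented, and the paper's algorithm and its guarantee conditions are more involved.

def join_histogram(h1, h2):
    # h1 and h2 are lists of (lo, hi, count) buckets with hi inclusive. Under the
    # uniform-spread assumption, each value v in an overlap contributes
    # density1(v) * density2(v) joining pairs, so an overlap of width w gets
    # count = w * density1 * density2.
    out = []
    for lo1, hi1, c1 in h1:
        for lo2, hi2, c2 in h2:
            lo, hi = max(lo1, lo2), min(hi1, hi2)
            if lo > hi:
                continue
            d1 = c1 / (hi1 - lo1 + 1)
            d2 = c2 / (hi2 - lo2 + 1)
            out.append((lo, hi, (hi - lo + 1) * d1 * d2))
    return out

def q_error(estimate, true):
    # q-error = max(estimate/true, true/estimate), the multiplicative factor we are off by.
    return max(estimate / true, true / estimate)

if __name__ == "__main__":
    r = [1, 1, 2, 3, 3, 3, 7, 8]
    s = [1, 3, 3, 3, 8, 8, 9]
    h_r = [(1, 4, 6), (5, 8, 2)]                 # hand-built histograms on r and s
    h_s = [(1, 4, 4), (5, 9, 3)]
    jh = join_histogram(h_r, h_s)
    est = sum(c for _, _, c in jh)
    true = sum(1 for x in r for y in s if x == y)
    print("buckets:", [(lo, hi, round(c, 2)) for lo, hi, c in jh])
    print("estimated join size:", round(est, 2), " true join size:", true,
          " q-error:", round(q_error(est, true), 2))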

Paper Link】 【Pages】:2247-2248

【Authors】: Colin Biafore ; Faisal Nawab

【Abstract】: Trends detection in social networks is possible via a multitude of models with different characteristics. These models are pre-defined and rigid, which creates the need to expose the social network graph to data scientists in order to introduce the human element into trends detection. However, inspecting large social network graphs visually is tiresome. We tackle this problem by providing effective graph summarizations aimed at the application of geo-correlated trends detection in social networks.

【Keywords】: graphs; networks; summarization; trends

182. M3: Scaling Up Machine Learning via Memory Mapping.

Paper Link】 【Pages】:2249-2250

【Authors】: Dezhi Fang ; Duen Horng Chau

【Abstract】: To process data that do not fit in RAM, conventional wisdom would suggest using distributed approaches. However, recent research has demonstrated virtual memory's strong potential in scaling up graph mining algorithms on a single machine. We propose to use a similar approach for general machine learning. We contribute: (1) our latest finding that memory mapping is also a feasible technique for scaling up general machine learning algorithms like logistic regression and k-means, when data fits in or exceeds RAM (we tested datasets up to 190GB); (2) an approach, called M3, that enables existing machine learning algorithms to work with out-of-core datasets through memory mapping, achieving a speed that is significantly faster than a 4-instance Spark cluster, and comparable to an 8-instance cluster.

【Keywords】: Memory Mapping
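
The core trick described in the M3 entry above is to let the operating system page a file-backed array in and out of RAM. The Python sketch below uses numpy's memmap to compute a column mean over a file-resident matrix in batches; the same access pattern underlies mini-batch k-means or SGD over out-of-core data, though M3 itself operates at a lower level and on much larger inputs. The file name and sizes are arbitrary.

import numpy as np

def out_of_core_mean(path, shape, dtype=np.float32, batch_rows=10_000):
    # Treat the file-backed array like an in-memory array and stream over it in
    # batches; the OS pages in only the slices that are touched.
    X = np.memmap(path, mode="r", dtype=dtype, shape=shape)
    total = np.zeros(shape[1], dtype=np.float64)
    for start in range(0, shape[0], batch_rows):
        total += X[start:start + batch_rows].sum(axis=0)
    return total / shape[0]

if __name__ == "__main__":
    rows, cols = 50_000, 8
    data = np.memmap("demo.bin", mode="w+", dtype=np.float32, shape=(rows, cols))
    data[:] = np.random.default_rng(0).normal(size=(rows, cols))   # fill once, backed by disk
    data.flush()
    print(out_of_core_mean("demo.bin", (rows, cols)))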

183. K-means Split Revisited: Well-grounded Approach and Experimental Evaluation.

Paper Link】 【Pages】:2251-2252

【Authors】: Valentin Grigorev ; George Chernishev

【Abstract】: R-tree is a data structure used for multidimensional indexing. Essentially, it is a balanced tree consisting of nested hyper-rectangles which are used to locate the data. One of the most performance-sensitive parts of this data structure is its split algorithm, which runs during node overflows. The split can be performed in multiple ways, according to many different criteria, and in general the problem of finding an optimal solution is NP-hard. There are many heuristic split algorithms. In this paper we study an existing k-means node split algorithm. We describe a number of serious issues in its theoretical foundation, which led us to re-design the k-means split. We propose several well-grounded solutions to the re-emerged problem of k-means split. Finally, we report comparison results using PostgreSQL and a contemporary benchmark for multidimensional structures.

【Keywords】: k-means; multidimensional indexing; r-tree; r-tree split
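
For intuition about what a k-means node split does, the sketch below clusters the centers of an overflowing R-tree node's rectangles into two groups with plain 2-means; the two groups would become the two new nodes. This ignores minimum-fill constraints, area/overlap objectives, and the theoretical issues the paper addresses, and the rectangles are invented.

import random

def center(rect):
    x1, y1, x2, y2 = rect
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def kmeans_split(rects, iters=10, seed=5):
    # 2-means over rectangle centers: assign each rectangle to the closer of two
    # centroids, recompute the centroids, repeat, and return the two groups.
    random.seed(seed)
    pts = [center(r) for r in rects]
    c1, c2 = random.sample(pts, 2)
    for _ in range(iters):
        g1, g2 = [], []
        for r, p in zip(rects, pts):
            d1 = (p[0] - c1[0]) ** 2 + (p[1] - c1[1]) ** 2
            d2 = (p[0] - c2[0]) ** 2 + (p[1] - c2[1]) ** 2
            (g1 if d1 <= d2 else g2).append((r, p))
        if not g1 or not g2:
            break
        c1 = tuple(sum(p[i] for _, p in g1) / len(g1) for i in range(2))
        c2 = tuple(sum(p[i] for _, p in g2) / len(g2) for i in range(2))
    return [r for r, _ in g1], [r for r, _ in g2]

if __name__ == "__main__":
    node = [(0, 0, 1, 1), (1, 1, 2, 2), (0, 1, 1, 2),
            (8, 8, 9, 9), (9, 8, 10, 9), (8, 9, 9, 10)]   # an overflowing node's entries
    left, right = kmeans_split(node)
    print(left, right)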

184. Main Memory Adaptive Denormalization.

Paper Link】 【Pages】:2253-2254

【Authors】: Zezhou Liu ; Stratos Idreos

【Abstract】: Joins have traditionally been the most expensive database operator, but they are required to query normalized schemas. In turn, normalized schemas are necessary to minimize update costs and space usage. Joins can be avoided altogether by using a denormalized schema instead of a normalized schema; this improves analytical query processing times at the cost of increased update overhead, loading cost, and storage requirements. In our work, we show that we can achieve the best of both worlds by leveraging partial, incremental, and dynamic denormalized tables to avoid join operators, resulting in fast query performance while retaining the minimized loading, update, and storage costs of a normalized schema. We introduce adaptive denormalization for modern main memory systems. We replace the traditional join operations with efficient scans over the relevant partial universal tables without incurring the prohibitive cost of full denormalization.

【Keywords】: adaptive denormalization

185. Adaptive Data Skipping in Main-Memory Systems.

Paper Link】 【Pages】:2255-2256

【Authors】: Wilson Qin ; Stratos Idreos

【Abstract】: As modern main-memory optimized data systems increasingly rely on fast scans, lightweight indexes that allow for data skipping play a crucial role in data filtering to reduce system I/O. Scans benefit from data skipping when the data order is sorted, semi-sorted, or comprised of clustered values. However, data skipping loses effectiveness over arbitrary data distributions. Applying data skipping techniques over non-sorted data can significantly decrease query performance, since the extra cost of metadata reads results in no corresponding scan performance gains. We introduce adaptive data skipping as a framework for structures and techniques that respond to a vast array of data distributions and query workloads. We reveal an adaptive zonemaps design and implementation on a main-memory column store prototype to demonstrate that adaptive data skipping has the potential for a 1.4X speedup.

【Keywords】: adaptive; adaptive data skipping; adaptive zonemap; adaptive zonemaps; data skipping; lightweight index; lightweight indexing; zonemap; zonemaps

186. Searching Web Data using MinHash LSH.

Paper Link】 【Pages】:2257-2258

【Authors】: BiChen Rao ; Erkang Zhu

【Abstract】: In this extended abstract, we explore the use of MinHash Locality Sensitive Hashing (MinHash LSH) to address the problem of indexing and searching Web data. We discuss a statistical tuning strategy for MinHash LSH and experimentally evaluate its accuracy and performance compared with an inverted index. In addition, we describe an online demo of the index over real Web data.
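A compact from-scratch sketch of MinHash LSH with standard banding (not the authors' implementation or their statistical tuning strategy): each set is hashed to a short signature, the signature is split into bands, and sets that share any band become candidates. The band/row counts and example sets are invented.

import hashlib
from collections import defaultdict

NUM_PERM, BANDS = 64, 16                       # 16 bands x 4 rows per band
ROWS = NUM_PERM // BANDS

def minhash(tokens):
    sig = []
    for i in range(NUM_PERM):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{i}:{t}".encode(), digest_size=8).digest(), "big")
            for t in tokens
        ))
    return sig

class MinHashLSH:
    def __init__(self):
        self.buckets = defaultdict(set)
    def insert(self, key, sig):
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            self.buckets[(b, band)].add(key)
    def query(self, sig):
        out = set()
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            out |= self.buckets.get((b, band), set())
        return out

lsh = MinHashLSH()
lsh.insert("table_a", minhash({"city", "country", "population"}))
lsh.insert("table_b", minhash({"city", "country", "gdp"}))
print(lsh.query(minhash({"city", "country", "population", "area"})))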

【Keywords】: LSH; MinHash; dataset search; web data

187. Research Contribution as a Measure of Influence.

Paper Link】 【Pages】:2259-2260

【Authors】: Lais M. A. Rocha ; Mirella M. Moro

【Abstract】: We propose the 3c-index, which measures the influence degree of researchers by evaluating the links they establish between communities. We evaluate its performance against well-known metrics. The results show that the 3c-index outperforms them in most cases and can be employed as a complementary metric to assess researchers' productivity.
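The abstract does not give the 3c-index formula, so the following is only an illustrative proxy for the underlying idea: on a made-up coauthorship graph with assigned communities, rank researchers by how many of their links cross community boundaries.

coauthors = {                                  # hypothetical coauthorship graph
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob"},
    "dave": {"alice", "erin"},
    "erin": {"dave"},
}
community = {"alice": 1, "bob": 1, "carol": 1, "dave": 2, "erin": 2}

def cross_community_degree(author):
    # Count coauthorship links that bridge two different communities.
    return sum(1 for c in coauthors[author] if community[c] != community[author])

ranking = sorted(coauthors, key=cross_community_degree, reverse=True)
print([(a, cross_community_degree(a)) for a in ranking])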

【Keywords】: bibliometry; ranking strategy; research performance

188. Vectorizing an In Situ Query Engine.

Paper Link】 【Pages】:2261-2262

【Authors】: Panagiotis Sioulas ; Anastasia Ailamaki

【Abstract】: Database systems serve a wide range of use cases efficiently, but require data to be loaded and adapted to the system's execution engine. This pre-processing step is a bottleneck to the analysis of increasingly large and heterogeneous datasets. Therefore, numerous research efforts advocate querying each dataset in situ, i.e., without pre-loading it into a DBMS. On the other hand, performing analysis over raw data entails numerous overheads because of potentially inefficient data representations. In this paper, we investigate the effect of vector processing on raw data querying. We enhance the operators of a query engine to use SIMD operations. Specifically, we examine the effect of SIMD in two different cases: the scan operators that perform the CPU-intensive task of input parsing, and the part of the query pipeline that performs a selection and computes an aggregate. We show that a vectorized approach has significant potential to improve performance, although it comes with trade-offs.
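A conceptual analogue in NumPy (the paper vectorizes engine operators with explicit SIMD instructions; NumPy's data-parallel kernels merely stand in for them here): raw CSV text is parsed into columns, then a selection plus aggregate runs as vectorized operations. The file contents and query are invented.

import io
import numpy as np

raw = io.StringIO("price,qty\n10.5,3\n7.0,5\n12.25,2\n3.5,10\n")

# "Scan" step: parse the in-situ file into columnar arrays (the CPU-intensive part).
cols = np.genfromtxt(raw, delimiter=",", names=True)
price, qty = cols["price"], cols["qty"]

# Query pipeline: SELECT SUM(price * qty) WHERE price > 5, fully data-parallel.
mask = price > 5.0
result = np.sum(price[mask] * qty[mask])
print(result)                                  # 10.5*3 + 7.0*5 + 12.25*2 = 91.0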

【Keywords】: JIT; SIMD; in situ query engine; raw data; vectorization; vectorized operators

189. Exploring Visualization of Data Transforms.

Paper Link】 【Pages】:2263-2264

【Authors】: Larry Xu

【Abstract】: In the context of data exploration, users often interact with relational database systems in an interactive query session to form useful insights. Each query a user executes can transform a resultset in complex ways. We explore some of the challenges in understanding these transformations and how they can be addressed through more informative visual representations of data transforms. We present the concept of "tweening" of resultsets as a method for incrementally visualizing data transformations, and explore approaches to generating these resultset tweens. Through a series of user studies, we evaluate tweening as an effective method for understanding the changes that result from data transformations.
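A rough sketch of one way resultset tweens could be generated (the paper's actual generation approaches are not described in the abstract): diff two resultsets and emit intermediate frames that retire departed rows and introduce new rows a few at a time.

def tween(before, after, steps=3):
    before, after = list(before), list(after)
    removed = [r for r in before if r not in after]
    added = [r for r in after if r not in before]
    frames = []
    for i in range(1, steps + 1):
        k_out = round(len(removed) * i / steps)    # departures applied so far
        k_in = round(len(added) * i / steps)       # arrivals applied so far
        frame = [r for r in before if r not in removed[:k_out]] + added[:k_in]
        frames.append(frame)
    return frames

q1 = [("ann", 3), ("bob", 5), ("cat", 1)]
q2 = [("bob", 5), ("dan", 7), ("eve", 2)]
for f in tween(q1, q2):
    print(f)                                       # last frame equals the new resultset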

【Keywords】: animation; data transforms; visualization

190. Minimizing Average Regret Ratio in Database.

Paper Link】 【Pages】:2265-2266

【Authors】: Sepanta Zeighami ; Raymond Chi-Wing Wong

【Abstract】: We propose the "average regret ratio" as a metric for measuring users' satisfaction after a user sees k selected points of a database instead of all of the points in the database. We introduce the average regret ratio as another means of multi-criteria decision making. Unlike the original k-regret operator, which uses the maximum regret ratio, the average regret ratio takes into account the satisfaction of a general user. While it assumes the existence of utility functions for the users, in contrast to the top-k query it does not require a user to input his or her utility function; instead, it depends on the probability distribution of the utility functions. We prove that the average regret ratio is a supermodular function and provide a polynomial-time approximation algorithm to find the set that minimizes the average regret ratio for a database.
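A sketch of the objective together with a plain greedy heuristic (the paper's polynomial-time approximation algorithm and its guarantee rest on supermodularity; this code only illustrates the average regret ratio under sampled linear utility functions, and the data and sample sizes are invented).

import random

def avg_regret_ratio(db, subset, utils):
    # Average over utilities of 1 - (best utility in subset) / (best utility in db).
    total = 0.0
    for u in utils:
        best_all = max(sum(ui * pi for ui, pi in zip(u, p)) for p in db)
        best_sub = max(sum(ui * pi for ui, pi in zip(u, p)) for p in subset)
        total += 1.0 - best_sub / best_all
    return total / len(utils)

def greedy_k_points(db, k, n_utils=200, seed=0):
    rng = random.Random(seed)
    dims = len(db[0])
    utils = [[rng.random() for _ in range(dims)] for _ in range(n_utils)]   # sampled linear utilities
    chosen = []
    while len(chosen) < k:
        best = min((p for p in db if p not in chosen),
                   key=lambda p: avg_regret_ratio(db, chosen + [p], utils))
        chosen.append(best)
    return chosen

rng = random.Random(42)
db = [(rng.random(), rng.random()) for _ in range(50)]     # invented 2-d points
S = greedy_k_points(db, k=5)
print(round(avg_regret_ratio(db, S, [[0.5, 0.5]]), 4))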

【Keywords】: k-regret queries; query processing; skyline queries; top-k queries