ACM SIGMOD Conference 2017: Chicago, IL, USA

Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017. ACM. 【DBLP Link】

Paper Num: 149 || Session Num: 34

Keynote Session - Grand Challenges in Data Management: Transactions 2

1. The Next 700 Transaction Processing Engines.

【Paper Link】 【Pages】:1-2

【Authors】: Anastasia Ailamaki

【Abstract】: In this talk, we discuss the implications of these trends on the design of next-generation transaction processing engines. We revisit old designs, examine current designs, and explore new designs with the twin goal of meeting changing application demands and optimizing for newer metrics by exploiting emerging hardware. We also discuss our ongoing projects to address the issues stemming from the changing hardware and software landscape and adapt engine designs to emerging trends, in order to demonstrate that transaction processing is a dynamic research area with a rich history, a vibrant present, and a revolutionary future.

【Keywords】: databases; multi-core systems; transaction processing engines; transactions

2. What Are We Doing With Our Lives?: Nobody Cares About Our Concurrency Control Research.

【Paper Link】 【Pages】:3

【Authors】: Andrew Pavlo

【Abstract】: Most of the academic papers on concurrency control published in the last five years have assumed the following two design decisions: (1) applications execute transactions with serializable isolation and (2) applications execute most (if not all) of their transactions using stored procedures. But results from a recent survey of database administrators indicate that these assumptions are not realistic. This survey includes both legacy deployments where the cost of changing the application to use either serializable isolation or stored procedures is not feasible, as well as new "greenfield" projects that are not encumbered by prior constraints. As such, the research produced by our community is not helping people with their real-world systems and thus is essentially irrelevant. I know this because I am guilty of writing these papers too. In this talk/denouncement, I will descend from my ivory tower and argue that we need to rethink our agenda for concurrency control research. Recent trends focus on asking the wrong questions and solving the wrong problems. I contend that the real issues that will have the most impact are not easily solved by more "clever" algorithms. Instead, in many cases, they can only be solved by hardware improvements and artificial intelligence.

【Keywords】: concurrency control; oltp databases

SIGMOD Session 1. Concurrency (1) 3

3. ACIDRain: Concurrency-Related Attacks on Database-Backed Web Applications.

【Paper Link】 【Pages】:5-20

【Authors】: Todd Warszawski ; Peter Bailis

【Abstract】: In theory, database transactions protect application data from corruption and integrity violations. In practice, database transactions frequently execute under weak isolation that exposes programs to a range of concurrency anomalies, and programmers may fail to correctly employ transactions. While low transaction volumes mask many potential concurrency-related errors under normal operation, determined adversaries can exploit them programmatically for fun and profit. In this paper, we formalize a new kind of attack on database-backed applications called an ACIDRain attack, in which an adversary systematically exploits concurrency-related vulnerabilities via programmatically accessible APIs. These attacks are not theoretical: ACIDRain attacks have already occurred in a handful of applications in the wild, including one attack which bankrupted a popular Bitcoin exchange. To proactively detect the potential for ACIDRain attacks, we extend the theory of weak isolation to analyze latent potential for non-serializable behavior under concurrent web API calls. We introduce a language-agnostic method for detecting potential isolation anomalies in web applications, called Abstract Anomaly Detection (2AD), that uses dynamic traces of database accesses to efficiently reason about the space of possible concurrent interleavings. We apply a prototype 2AD analysis tool to 12 popular self-hosted eCommerce applications written in four languages and deployed on over 2M websites. We identify and verify 22 critical ACIDRain attacks that allow attackers to corrupt store inventory, over-spend gift cards, and steal inventory.

【Keywords】:
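
As a toy illustration of the weak-isolation anomalies such attacks exploit (a deliberately simplified sketch, not the paper's 2AD analysis; all names are hypothetical), two concurrent "API calls" that read and then write a gift-card balance without proper isolation can lose an update and over-spend the card:

    # Toy illustration (not the paper's 2AD tool): two concurrent "API calls"
    # perform an unsynchronized read-modify-write on a gift-card balance,
    # so one redemption is lost and the card is over-spent.
    import threading

    balance = 50                      # hypothetical gift-card value
    barrier = threading.Barrier(2)    # force both calls to read before either writes

    def redeem(amount):
        global balance
        read_value = balance              # 1) read the current balance
        barrier.wait()                    # both calls have now read the same value
        if read_value >= amount:
            balance = read_value - amount # 2) write back, clobbering the other update

    t1 = threading.Thread(target=redeem, args=(30,))
    t2 = threading.Thread(target=redeem, args=(30,))
    t1.start(); t2.start(); t1.join(); t2.join()
    print(balance)   # prints 20: 60 units were redeemed from a 50-unit card (lost update)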

4. Cicada: Dependably Fast Multi-Core In-Memory Transactions.

【Paper Link】 【Pages】:21-35

【Authors】: Hyeontaek Lim ; Michael Kaminsky ; David G. Andersen

【Abstract】: Multi-core in-memory databases promise high-speed online transaction processing. However, the performance of individual designs suffers when the workload characteristics miss their small sweet spot of a desired contention level, read-write ratio, record size, processing rate, and so forth. Cicada is a single-node multi-core in-memory transactional database with serializability. To provide high performance under diverse workloads, Cicada reduces overhead and contention at several levels of the system by leveraging optimistic and multi-version concurrency control schemes and multiple loosely synchronized clocks while mitigating their drawbacks. On the TPC-C and YCSB benchmarks, Cicada outperforms Silo, TicToc, FOEDUS, MOCC, two-phase locking, Hekaton, and ERMIA in most scenarios, achieving up to 3X higher throughput than the next fastest design. It handles up to 2.07 M TPC-C transactions per second and 56.5 M YCSB transactions per second, and scans up to 356 M records per second on a single 28-core machine.

【Keywords】: best-effort inlining; contention regulation; loosely synchronized clocks; multi-version concurrency control; optimistic concurrency control; rapid garbage collection
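
For background on the multi-version reads mentioned above (a generic textbook simplification, not Cicada's actual protocol, clock scheme, or best-effort inlining), a transaction reading at timestamp ts sees the newest version written at or before ts:

    # Generic multi-version read (simplified; not Cicada's design): each record
    # keeps a version chain sorted by write timestamp, and a reader at
    # timestamp `ts` sees the newest version written at or before `ts`.
    versions = {
        "key1": [(10, "a"), (20, "b"), (35, "c")],  # (write_timestamp, value), oldest first
    }

    def read(key, ts):
        visible = None
        for wts, value in versions.get(key, []):
            if wts <= ts:
                visible = value        # keep the newest qualifying version
            else:
                break                  # later versions are not visible at ts
        return visible

    print(read("key1", 25))  # -> "b": the version written at timestamp 20
    print(read("key1", 5))   # -> None: nothing committed yet at ts=5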

5. BatchDB: Efficient Isolated Execution of Hybrid OLTP+OLAP Workloads for Interactive Applications.

【Paper Link】 【Pages】:37-50

【Authors】: Darko Makreshanski ; Jana Giceva ; Claude Barthels ; Gustavo Alonso

【Abstract】: In this paper we present BatchDB, an in-memory database engine designed for hybrid OLTP and OLAP workloads. BatchDB achieves good performance, provides a high level of data freshness, and minimizes load interaction between the transactional and analytical engines, thus enabling real time analysis over fresh data under tight SLAs for both OLTP and OLAP workloads. BatchDB relies on primary-secondary replication with dedicated replicas, each optimized for a particular workload type (OLTP, OLAP), and a light-weight propagation of transactional updates. The evaluation shows that for standard TPC-C and TPC-H benchmarks, BatchDB can achieve competitive performance to specialized engines for the corresponding transactional and analytical workloads, while providing a level of performance isolation and predictable runtime for hybrid workload mixes (OLTP+OLAP) otherwise unmet by existing solutions.

【Keywords】: batchdb; batching; ch-benchmark; databases; htap; numa; olap; oltp; rdbms; rdma; tpc-c; tpc-h

SIGMOD Session 2. Storage and Distribution (1) 3

6. Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics.

【Paper Link】 【Pages】:51-63

【Authors】: Raghu Ramakrishnan ; Baskar Sridharan ; John R. Douceur ; Pavan Kasturi ; Balaji Krishnamachari-Sampath ; Karthick Krishnamoorthy ; Peng Li ; Mitica Manu ; Spiro Michaylov ; Rogério Ramos ; Neil Sharman ; Zee Xu ; Youssef Barakat ; Chris Douglas ; Richard Draves ; Shrikant S. Naidu ; Shankar Shastry ; Atul Sikaria ; Simon Sun ; Ramarathnam Venkatesan

【Abstract】: Azure Data Lake Store (ADLS) is a fully-managed, elastic, scalable, and secure file system that supports Hadoop distributed file system (HDFS) and Cosmos semantics. It is specifically designed and optimized for a broad spectrum of Big Data analytics that depend on a very high degree of parallel reads and writes, as well as collocation of compute and data for high bandwidth and low-latency access. It brings together key components and features of Microsoft's Cosmos file system (long used by internal customers at Microsoft) and HDFS, and is a unified file storage solution for analytics on Azure. Internal and external workloads run on this unified platform. Distinguishing aspects of ADLS include its design for handling multiple storage tiers, exabyte scale, and comprehensive security and data sharing features. We present an overview of ADLS architecture, design points, and performance.

【Keywords】: aws; azure; big data; cloud service; distributed file system; gce; hadoop; hdfs; map-reduce; storage; tiered storage

7. OctopusFS: A Distributed File System with Tiered Storage Management.

【Paper Link】 【Pages】:65-78

【Authors】: Elena Kakoulli ; Herodotos Herodotou

【Abstract】: The ever-growing data storage and I/O demands of modern large-scale data analytics are challenging the current distributed storage systems. A promising trend is to exploit the recent improvements in memory, storage media, and networks for sustaining high performance and low cost. While past work explores using memory or SSDs as local storage or combining local with network-attached storage in cluster computing, this work focuses on managing multiple storage tiers in a distributed setting. We present OctopusFS, a novel distributed file system that is aware of heterogeneous storage media (e.g., memory, SSDs, HDDs, NAS) with different capacities and performance characteristics. The system offers a variety of pluggable policies for automating data management across the storage tiers and cluster nodes. The policies employ multi-objective optimization techniques for making intelligent data management decisions based on the requirements of fault tolerance, data and load balancing, and throughput maximization. At the same time, the storage media are explicitly exposed to users and applications, allowing them to choose the distribution and placement of replicas in the cluster based on their own performance and fault tolerance requirements. Our extensive evaluation shows the immediate benefits of using OctopusFS with data-intensive processing systems, such as Hadoop and Spark, in terms of both increased performance and better cluster utilization.

【Keywords】: distributed file system; tiered storage management

8. Monkey: Optimal Navigable Key-Value Store.

【Paper Link】 【Pages】:79-94

【Authors】: Niv Dayan ; Manos Athanassoulis ; Stratos Idreos

【Abstract】: In this paper, we show that key-value stores backed by an LSM-tree exhibit an intrinsic trade-off between lookup cost, update cost, and main memory footprint, yet all existing designs expose a suboptimal and difficult to tune trade-off among these metrics. We pinpoint the problem to the fact that all modern key-value stores suboptimally co-tune the merge policy, the buffer size, and the Bloom filters' false positive rates in each level. We present Monkey, an LSM-based key-value store that strikes the optimal balance between the costs of updates and lookups with any given main memory budget. The insight is that worst-case lookup cost is proportional to the sum of the false positive rates of the Bloom filters across all levels of the LSM-tree. Contrary to state-of-the-art key-value stores that assign a fixed number of bits-per-element to all Bloom filters, Monkey allocates memory to filters across different levels so as to minimize this sum. We show analytically that Monkey reduces the asymptotic complexity of the worst-case lookup I/O cost, and we verify empirically using an implementation on top of LevelDB that Monkey reduces lookup latency by an increasing margin as the data volume grows (50%-80% for the data sizes we experimented with). Furthermore, we map the LSM-tree design space onto a closed-form model that enables co-tuning the merge policy, the buffer size and the filters' false positive rates to trade among lookup cost, update cost and/or main memory, depending on the workload (proportion of lookups and updates), the dataset (number and size of entries), and the underlying hardware (main memory available, disk vs. flash). We show how to use this model to answer what-if design questions about how changes in environmental parameters impact performance and how to adapt the various LSM-tree design elements accordingly.

【Keywords】: adaptivity; auto-tuning; bloom filters; key-value store; log-structured merge-tree; lsm-tree; memory hierarchy; point lookups; point queries; read/write/memory trade-off
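
A back-of-the-envelope sketch of the stated insight, assuming the standard Bloom-filter false-positive approximation and hypothetical level sizes (not the paper's exact derivation or tuning): with a fixed total memory budget, skewing bits-per-entry toward the smaller levels lowers the sum of false positive rates, and hence the expected wasted I/O of a zero-result lookup.

    # Sketch of the insight above (standard Bloom-filter approximation, hypothetical
    # level sizes; not the paper's derivation): worst-case wasted I/O of a zero-result
    # lookup ~ sum of per-level false positive rates, which a skewed allocation lowers.
    import math

    def fpr(bits_per_entry):
        # false positive rate of a Bloom filter with an optimal number of hash functions
        return math.exp(-bits_per_entry * math.log(2) ** 2)

    entries = [10**5, 10**6, 10**7]          # entries per LSM-tree level (hypothetical)
    uniform = [5.0, 5.0, 5.0]                # same bits/entry everywhere (state of the art)
    skewed  = [12.0, 9.0, 4.53]              # same total memory, skewed toward small levels

    for name, alloc in (("uniform", uniform), ("skewed", skewed)):
        memory_bits = sum(b * n for b, n in zip(alloc, entries))
        cost = sum(fpr(b) for b in alloc)
        print(f"{name}: {memory_bits / 8 / 2**20:.1f} MiB of filters, "
              f"expected wasted I/Os per zero-result lookup ~ {cost:.3f}")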

SIGMOD Session 3. Streams 3

9. Enabling Signal Processing over Data Streams.

【Paper Link】 【Pages】:95-108

【Authors】: Milos Nikolic ; Badrish Chandramouli ; Jonathan Goldstein

【Abstract】: Internet of Things applications analyze the data coming from large networks of sensor devices using relational and signal processing operations and running the same query logic over groups of sensor signals. To support such increasingly important scenarios, many data management systems integrate with numerical frameworks like R. Such solutions, however, incur significant performance penalties as relational data processing engines and numerical tools operate on fundamentally different data models with expensive inter-communication mechanisms. In addition, none of these solutions supports efficient real-time and incremental analysis. In this paper, we advocate a deep integration of signal processing operations and general-purpose query processors. We aim to reconcile the disparate data models and provide a common query language that allows users to seamlessly interleave tempo-relational and signal operations for both online and offline processing. Our approach is extensible and offers frameworks for quick and easy integration of user-defined operations while supporting incremental computation. Our system that deeply integrates relational and signal operations, called TRILLDSP, achieves up to two orders of magnitude better performance than popular loosely-coupled data management systems on grouped signal processing workflows.

【Keywords】: array and relational data; data stream processing; digital signal processing; incremental computation; iot applications; realtime analysis; sensor networks; tight integration; trill; trilldsp

10. Complete Event Trend Detection in High-Rate Event Streams.

【Paper Link】 【Pages】:109-124

【Authors】: Olga Poppe ; Chuan Lei ; Salah Ahmed ; Elke A. Rundensteiner

【Abstract】: Event processing applications from financial fraud detection to health care analytics continuously execute event queries with Kleene closure to extract event sequences of arbitrary, statically unknown length, called Complete Event Trends (CETs). Due to common event sub-sequences in CETs, either the responsiveness is delayed by repeated computations or an exorbitant amount of memory is required to store partial results. To overcome these limitations, we define the CET graph to compactly encode all CETs matched by a query. Based on the graph, we define the spectrum of CET detection algorithms from CPU-optimal to memory-optimal. We find the middle ground between these two extremes by partitioning the graph into time-centric graphlets and caching partial CETs per graphlet to enable effective reuse of these intermediate results. We reveal cost monotonicity properties of the search space of graph partitioning plans. Our CET optimizer leverages these properties to prune significant portions of the search to produce a partitioning plan with minimal CPU costs yet within the given memory limit. Our experimental study demonstrates that our CET detection solution achieves up to 42-fold speed-up even under rigid memory constraints compared to the state-of-the-art techniques in diverse scenarios.

【Keywords】: algorithms; complex event processing; cpu versus memory trade-off; dynamic graph analytics; dynamic graph partitioning; event stream analytics; event trends; kleene closure; query optimization

11. LittleTable: A Time-Series Database and Its Uses.

【Paper Link】 【Pages】:125-138

【Authors】: Sean Rhea ; Eric Wang ; Edmund Wong ; Ethan Atkins ; Nat Storer

【Abstract】: We present LittleTable, a relational database that Cisco Meraki has used since 2008 to store usage statistics, event logs, and other time-series data from our customers' devices. LittleTable optimizes for time-series data by clustering tables in two dimensions. By partitioning rows by timestamp, it allows quick retrieval of recent measurements without imposing any penalty for retaining older history. By further sorting within each partition by a hierarchically-delineated key, LittleTable allows developers to optimize each table for the specific patterns with which they intend to access it. LittleTable further optimizes for time-series data by capitalizing on the reduced consistency and durability needs of our applications, three of which we present here. In particular, our applications are single-writer and append-only. At most one process inserts a given type of data collected from a given device, and applications never update rows written in the past, simplifying both lock management and crash recovery. Our most recently written data is also recoverable, as it can generally be re-read from the devices themselves, allowing LittleTable to safely lose some amount of recently-written data in the event of a crash. As a result of these optimizations, LittleTable is fast and efficient, even on a single processor and spinning disk. Querying an uncached table of 128-byte rows, it returns the first matching row in 31 ms, and it returns 500,000 rows/second thereafter, approximately 50% of the throughput of the disk itself. Today Meraki stores 320 TB of data across several hundred LittleTable servers system-wide.

【Keywords】: cloud computing; clustering; databases; internet of things; partitioning; time-series data

SIGMOD Session 4. Versions and Incremental Maintenance 3

12. Incremental View Maintenance over Array Data.

【Paper Link】 【Pages】:139-154

【Authors】: Weijie Zhao ; Florin Rusu ; Bin Dong ; Kesheng Wu ; Peter Nugent

【Abstract】: Science applications are producing an ever-increasing volume of multi-dimensional data that are mainly processed with distributed array databases. These raw arrays are "cooked" into derived data products using complex pipelines that are time-consuming. As a result, derived data products are released infrequently and become stale soon thereafter. In this paper, we introduce materialized array views as a database construct for scientific data products. We model the "cooking" process as incremental view maintenance with batch updates and give a three-stage heuristic that finds effective update plans. Moreover, the heuristic repartitions the array and the view continuously based on a window of past updates as a side-effect of view maintenance without overhead. We design an analytical cost model for integrating materialized array views in queries. A thorough experimental evaluation confirms that the proposed techniques are able to incrementally maintain a real astronomical data product in a production environment.

【Keywords】: array similarity join; batch updates; greedy heuristics; mixed-integer programming; workload-driven array reorganization

13. Incremental Graph Computations: Doable and Undoable.

【Paper Link】 【Pages】:155-169

【Authors】: Wenfei Fan ; Chunming Hu ; Chao Tian

【Abstract】: The incremental problem for a class Q of graph queries aims to compute, given a query Q in the class Q, a graph G, the output Q(G) and updates Δ G to G as input, changes Δ O to Q(G) such that Q(G ⊕ Δ G) = Q(G) ⊕ Δ O. It is called bounded if its cost can be expressed as a polynomial function in the sizes of Q, Δ G and Δ O. It is to reduce computations on possibly big G to small Δ G and Δ O. No matter how desirable, however, our first results are negative: for common graph queries such as graph traversal, connectivity, keyword search and pattern matching, their incremental problems are unbounded. In light of the negative results, we propose two characterizations for the effectiveness of incremental computation: (a) localizable, if its cost is decided by small neighbors of nodes in Δ G instead of the entire G; and (b) bounded relative to a batch algorithm T, if the cost is determined by the sizes of Δ G and changes to the affected area that is necessarily checked by T. We show that the incremental computations above are either localizable or relatively bounded, by providing corresponding incremental algorithms. That is, we can either reduce the incremental computations on big graphs to small data, or incrementalize batch algorithms by minimizing unnecessary recomputation. Using real-life graphs, we experimentally verify the effectiveness of our algorithms.

【Keywords】: graph data management; incremental computation; query optimization

14. DEX: Query Execution in a Delta-based Storage System.

【Paper Link】 【Pages】:171-186

【Authors】: Amit Chavan ; Amol Deshpande

【Abstract】: The increasing reliance on robust data-driven decision-making across many domains has made it necessary for data management systems to manage many thousands to millions of versions of datasets, acquired or constructed at various stages of analysis pipelines over time. Delta encoding is an effective and widely-used solution to compactly store a large number of datasets, that simultaneously exploits redundancies across them and keeps the average retrieval cost of reconstructing any dataset low. However, supporting any kind of rich retrieval or querying functionality, beyond single dataset checkout, is challenging in such storage engines. In this paper, we initiate a systematic study of this problem, and present DEX, a novel stand-alone delta-oriented execution engine, whose goal is to take advantage of the already computed deltas between the datasets for efficient query processing. In this work, we study how to execute checkout, intersection, union and t-threshold queries over record-based files; we show that processing of even these basic queries leads to many new and unexplored challenges and trade-offs. Starting from a query plan that confines query execution to a small set of deltas, we introduce new transformation rules based on the algebraic properties of the deltas, that allow us to explore the search space of alternative plans. For the case of checkout, we present a dynamic programming algorithm to efficiently select the optimal query plan under our cost model, while we design efficient heuristics to select effective plans that vastly outperform the base checkout-then-query approach for other queries. A key characteristic of our query execution methods is that the computational cost is primarily dependent on the size and the number of deltas in the expression (typically small), and not the input dataset versions (which can be very large). We have implemented DEX prototype on top of git, a widely used version control system. We present an extensive experimental evaluation on synthetic data with diverse characteristics, that shows that our methods perform exceedingly well compared to the baseline.

【Keywords】: delta encoding; query processing; versioning

SIGMOD Session 5. Parallel and Distributed Query Processing (1) 3

15. Massively Parallel Processing of Whole Genome Sequence Data: An In-Depth Performance Study.

【Paper Link】 【Pages】:187-202

【Authors】: Abhishek Roy ; Yanlei Diao ; Uday Evani ; Avinash Abhyankar ; Clinton Howarth ; Rémi Le Priol ; Toby Bloom

【Abstract】: This paper presents a joint effort between a group of computer scientists and bioinformaticians to take an important step towards a general big data platform for genome analysis pipelines. The key goals of this study are to develop a thorough understanding of the strengths and limitations of big data technology for genomic data analysis, and to identify the key questions that the research community could address to realize the vision of personalized genomic medicine. Our platform, called Gesall, is based on the new "Wrapper Technology" that supports existing genomic data analysis programs in their native forms, without having to rewrite them. To do so, our system provides several layers of software, including a new Genome Data Parallel Toolkit (GDPT), which can be used to "wrap" existing data analysis programs. This platform offers a concrete context for evaluating big data technology for genomics: we report on super-linear speedup and sublinear speedup for various tasks, as well as the reasons why a parallel program could produce different results from those of a serial program. These results lead to key research questions that require a synergy between genomics scientists and computer scientists to find solutions.

【Keywords】: benchmarking; big data processing; genomics; performance evaluation

16. Distributed Provenance Compression.

【Paper Link】 【Pages】:203-218

【Authors】: Chen Chen ; Harshal Tushar Lehri ; Lay Kuan Loh ; Anupam Alur ; Limin Jia ; Boon Thau Loo ; Wenchao Zhou

【Abstract】: Network provenance, which records the execution history of network events as meta-data, is becoming increasingly important for network accountability and failure diagnosis. For example, network provenance may be used to trace the path that a message traversed in a network, or to reveal how a particular routing entry was derived and the parties involved in its derivation. A challenge when storing the provenance of a live network is that the large number of the arriving messages may incur substantial storage overhead. In this paper, we explore techniques to dynamically compress distributed provenance stored at scale. Logically, the compression is achieved by grouping equivalent provenance trees and maintaining only one concrete copy for each equivalence class. To efficiently identify equivalent provenance, we (1) introduce distributed event-based linear programs (DELP) to specify distributed network applications, and (2) statically analyze DELPs to allow for quick detection of provenance equivalence at runtime. Our experimental results demonstrate that our approach leads to significant storage reduction and query latency improvement over alternative approaches.

【Keywords】: distributed systems; provenance; static analysis; storage

17. ROBUS: Fair Cache Allocation for Data-parallel Workloads.

【Paper Link】 【Pages】:219-234

【Authors】: Mayuresh Kunjir ; Brandon Fain ; Kamesh Munagala ; Shivnath Babu

【Abstract】: Systems for processing big data---e.g., Hadoop, Spark, and massively parallel databases---need to run workloads on behalf of multiple tenants simultaneously. The abundant disk-based storage in these systems is usually complemented by a smaller, but much faster, cache. Cache is a precious resource: Tenants who get to use the cache can see two orders of magnitude performance improvement. Cache is also a limited and hence shared resource: Unlike a resource like a CPU core which can be used by only one tenant at a time, a cached data item can be accessed by multiple tenants at the same time. Cache, therefore, has to be shared by a multi-tenancy-aware policy across tenants, each having a unique set of priorities and workload characteristics. In this paper, we develop cache allocation strategies that speed up the overall workload while being fair to each tenant. We build a novel fairness model targeted at the shared resource setting that incorporates not only the more standard concepts of Pareto-efficiency and sharing incentive, but we also define envy freeness via the notion of core from cooperative game theory. Our cache management platform, ROBUS, uses randomization over small time batches, and we develop a proportionally fair allocation mechanism that satisfies the core property in expectation. We show that this algorithm and related fair algorithms can be approximated to arbitrary precision in polynomial time. We evaluate these algorithms on a ROBUS prototype implemented on Spark with RDD store used as cache. Our evaluation on an industry-standard workload shows that our algorithms score high on both performance and fairness metrics across a wide variety of practical multi-tenant setups.

【Keywords】: cache management; data analytics; multi-tenancy; proportional fairness

SIGMOD Session 6. Concurrency (2) 4

18. Transaction Repair for Multi-Version Concurrency Control.

【Paper Link】 【Pages】:235-250

【Authors】: Mohammad Dashti ; Sachin Basil John ; Amir Shaikhha ; Christoph Koch

【Abstract】: The optimistic variants of Multi-Version Concurrency Control (MVCC) avoid blocking concurrent transactions at the cost of having a validation phase. Upon failure in the validation phase, the transaction is usually aborted and restarted from scratch. The "abort and restart" approach becomes a performance bottleneck for use cases with high contention objects or long running transactions. In addition, restarting from scratch creates a negative feedback loop in the system, because the system incurs additional overhead that may create even more conflicts. In this paper, we propose a novel approach for conflict resolution in MVCC for in-memory databases. This low overhead approach summarizes the transaction programs in the form of a dependency graph. The dependency graph also contains the constructs used in the validation phase of the MVCC algorithm. Then, when encountering conflicts among transactions, our mechanism quickly detects the conflict locations in the program and partially re-executes the conflicting transactions. This approach maximizes the reuse of the computations done in the initial execution round, and increases the transaction processing throughput.

【Keywords】: conflict resolution; multi-version concurrency control; transaction; transaction repair

19. Concerto: A High Concurrency Key-Value Store with Integrity.

【Paper Link】 【Pages】:251-266

【Authors】: Arvind Arasu ; Ken Eguro ; Raghav Kaushik ; Donald Kossmann ; Pingfan Meng ; Vineet Pandey ; Ravi Ramamurthy

【Abstract】: Verifying the integrity of outsourced data is a classic, well-studied problem. However current techniques have fundamental performance and concurrency limitations for update-heavy workloads. In this paper, we investigate the potential advantages of deferred and batched verification rather than the per-operation verification used in prior work. We present Concerto, a comprehensive key-value store designed around this idea. Using Concerto, we argue that deferred verification preserves the utility of online verification and improves concurrency resulting in orders-of-magnitude performance improvement. On standard benchmarks, the performance of Concerto is within a factor of two when compared to state-of-the-art key-value stores without integrity.

【Keywords】: concurrency; indexing; integrity; key-value stores; main memory; merkle trees; recovery; secure hardware; verification
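
For context on the "merkle trees" keyword (a textbook Merkle-root computation, not Concerto's deferred and batched verification protocol), any modification to a stored key-value pair changes the root hash that a verifier checks:

    # Textbook Merkle-root computation (illustration only; Concerto's deferred,
    # batched verification is more involved). Any modified key-value pair
    # yields a different root hash, which a verifier can detect.
    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def merkle_root(leaves):
        level = [h(leaf) for leaf in leaves]
        if not level:
            return h(b"")
        while len(level) > 1:
            if len(level) % 2 == 1:
                level.append(level[-1])          # duplicate the last node on odd levels
            level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    kv_store = {b"alice": b"100", b"bob": b"42"}
    leaves = [k + b"=" + v for k, v in sorted(kv_store.items())]
    print(merkle_root(leaves).hex())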

20. Fast Failure Recovery for Main-Memory DBMSs on Multicores.

【Paper Link】 【Pages】:267-281

【Authors】: Yingjun Wu ; Wentian Guo ; Chee-Yong Chan ; Kian-Lee Tan

【Abstract】: Main-memory database management systems (DBMS) can achieve excellent performance when processing massive volumes of on-line transactions on modern multi-core machines. But existing durability schemes, namely, tuple-level and transaction-level logging-and-recovery mechanisms, either degrade the performance of transaction processing or slow down the process of failure recovery. In this paper, we show that, by exploiting application semantics, it is possible to achieve speedy failure recovery without introducing any costly logging overhead to the execution of concurrent transactions. We propose PACMAN, a parallel database recovery mechanism that is specifically designed for lightweight, coarse-grained transaction-level logging. PACMAN leverages a combination of static and dynamic analyses to parallelize the log recovery: at compile time, PACMAN decomposes stored procedures by carefully analyzing dependencies within and across programs; at recovery time, PACMAN exploits the availability of the runtime parameter values to attain an execution schedule with a high degree of parallelism. As such, recovery performance is remarkably increased. We evaluated PACMAN in a fully-fledged main-memory DBMS running on a 40-core machine. Compared to several state-of-the-art database recovery mechanisms, PACMAN can significantly reduce recovery time without compromising the efficiency of transaction processing.

【Keywords】: database logging; database recovery; main-memory dbms

21. Bringing Modular Concurrency Control to the Next Level.

【Paper Link】 【Pages】:283-297

【Authors】: Chunzhi Su ; Natacha Crooks ; Cong Ding ; Lorenzo Alvisi ; Chao Xie

【Abstract】: This paper presents Tebaldi, a distributed key-value store that explores new ways to harness the performance opportunity of combining different specialized concurrency control mechanisms (CCs) within the same database. Tebaldi partitions conflicts at a fine granularity and matches them to specialized CCs within a hierarchical framework that is modular, extensible, and able to support a wide variety of concurrency control techniques, from single-version to multiversion and from lock-based to timestamp-based. When running the TPC-C benchmark, Tebaldi yields more than 20× the throughput of the basic two-phase locking protocol, and over 3.7× the throughput of Callas, a recent system that, like Tebaldi, aims to combine different CCs.

【Keywords】: distributed database; modular concurrency control; transaction

SIGMOD Session 7. Storage and Distribution (2) 3

22. Wide Table Layout Optimization based on Column Ordering and Duplication.

【Paper Link】 【Pages】:299-314

【Authors】: Haoqiong Bian ; Ying Yan ; Wenbo Tao ; Liang Jeff Chen ; Yueguo Chen ; Xiaoyong Du ; Thomas Moscibroda

【Abstract】: Modern data analytical tasks often witness very wide tables, from a few hundred columns to a few thousand. While it is commonly agreed that column stores are an appropriate data format for wide tables and analytical workloads, the physical order of columns has not been investigated. Column ordering plays a critical role in I/O performance, because in wide tables accessing the columns in a single horizontal partition may involve multiple disk seeks. An optimal column ordering will incur minimal cumulative disk seek costs for the set of queries applied to the data. In this paper, we aim to find such an optimal column layout to maximize I/O performance. Specifically, we study two problems for column stores on HDFS: column ordering and column duplication. Column ordering seeks an approximately optimal order of columns; column duplication complements column ordering in that some columns may be duplicated multiple times to reduce contention among the queries' diverse requirements on the column order. We consider an actual fine-grained cost model for column accesses and propose algorithms that take a query workload as input and output a column ordering strategy with or without storage redundancy that significantly improves the overall I/O performance. Experimental results over real-life data and production query workloads confirm the effectiveness of the proposed algorithms in diverse settings.

【Keywords】: column store; data analytics; data layout optimization; disk-based storage system; hdfs; wide table

23. Query Centric Partitioning and Allocation for Partially Replicated Database Systems.

【Paper Link】 【Pages】:315-330

【Authors】: Tilmann Rabl ; Hans-Arno Jacobsen

【Abstract】: A key feature of database systems is to provide transparent access to stored data. In distributed database systems, this includes data allocation and fragmentation. Transparent access introduces data dependencies and increases system complexity and inter-process communication. Therefore, many developers are exchanging transparency for better scalability using sharding and similar techniques. However, explicitly managing data distribution and data flow requires a deep understanding of the distributed system and the data access, and it reduces the possibilities for optimizations. To address this problem, we present an approach for efficient data allocation that features good scalability while keeping the data distribution transparent. We propose a workload-aware, query-centric, heterogeneity-aware analytical model. We formalize our approach and present an efficient allocation algorithm. The algorithm optimizes the partitioning and data layout for local query execution and balances the workload on homogeneous and heterogeneous systems according to the query history. In the evaluation, we demonstrate that our approach scales well in performance for OLTP- and OLAP-style workloads and reduces storage requirements significantly over replicated systems while guaranteeing configurable availability.

【Keywords】: data partitioning; database allocation; database data placement; partially replicated database; shared-nothing database system

24. Spanner: Becoming a SQL System.

【Paper Link】 【Pages】:331-343

【Authors】: David F. Bacon ; Nathan Bales ; Nicolas Bruno ; Brian F. Cooper ; Adam Dickinson ; Andrew Fikes ; Campbell Fraser ; Andrey Gubarev ; Milind Joshi ; Eugene Kogan ; Alexander Lloyd ; Sergey Melnik ; Rajesh Rao ; David Shue ; Christopher Taylor ; Marcel van der Holst ; Dale Woodford

【Abstract】: Spanner is a globally-distributed data management system that backs hundreds of mission-critical services at Google. Spanner is built on ideas from both the systems and database communities. The first Spanner paper published at OSDI'12 focused on the systems aspects such as scalability, automatic sharding, fault tolerance, consistent replication, external consistency, and wide-area distribution. This paper highlights the database DNA of Spanner. We describe distributed query execution in the presence of resharding, query restarts upon transient failures, range extraction that drives query routing and index seeks, and the improved blockwise-columnar storage format. We touch upon migrating Spanner to the common SQL dialect shared with other systems at Google.

【Keywords】: cloud computing; distributed systems; sql; transactional databases

SIGMOD Session 8. Tree & Graph Processing (1) 4

25. Landmark Indexing for Evaluation of Label-Constrained Reachability Queries.

【Paper Link】 【Pages】:345-358

【Authors】: Lucien D. J. Valstar ; George H. L. Fletcher ; Yuichi Yoshida

【Abstract】: Consider a directed edge-labeled graph, such as a social network or a citation network. A fundamental query on such data is to determine if there is a path in the graph from a given source vertex to a given target vertex, using only edges with labels in a restricted subset of the edge labels in the graph. Such label-constrained reachability (LCR) queries play an important role in graph analytics, for example, as a core fragment of the so-called regular path queries which are supported in practical graph query languages such as the W3C's SPARQL 1.1, Neo4j's Cypher, and Oracle's PGQL. Current solutions for LCR evaluation, however, do not scale to large graphs which are increasingly common in a broad range of application domains. In this paper we present the first practical solution for efficient LCR evaluation, leveraging landmark-based indexes for large graphs. We show through extensive experiments that our indexes are significantly smaller than state-of-the-art LCR indexing techniques, while supporting up to orders of magnitude faster query evaluation times. Our complete C++ codebase is available as open source for further research.

【Keywords】: label-constrained reachability; labeled graph; path queries
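
The query being indexed can be stated concretely as a naive baseline (this is the traversal the paper's landmark indexes are designed to avoid, not the index itself): a BFS that only follows edges whose labels lie in the allowed subset.

    # Naive label-constrained reachability (LCR) check by BFS over allowed labels.
    # This is the baseline that landmark indexing accelerates, not the index itself.
    from collections import deque

    def lcr_reachable(edges, source, target, allowed_labels):
        """edges: iterable of (u, v, label); follow only edges whose label is allowed."""
        adj = {}
        for u, v, label in edges:
            if label in allowed_labels:
                adj.setdefault(u, []).append(v)
        seen, queue = {source}, deque([source])
        while queue:
            u = queue.popleft()
            if u == target:
                return True
            for v in adj.get(u, []):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        return False

    edges = [("a", "b", "knows"), ("b", "c", "cites"), ("c", "d", "knows")]
    print(lcr_reachable(edges, "a", "d", {"knows"}))           # False: the 'cites' edge is excluded
    print(lcr_reachable(edges, "a", "d", {"knows", "cites"}))  # True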

26. Efficient Ad-Hoc Graph Inference and Matching in Biological Databases.

【Paper Link】 【Pages】:359-373

【Authors】: Xiang Lian ; Dongchul Kim

【Abstract】: In many real applications such as bioinformatics and biological network analysis, it has always been an important, yet challenging, topic to accurately infer/reconstruct gene regulatory networks (GRNs) from microarray data, and efficiently identify those matching GRNs with similar interaction structures for potential disease analysis and treatment tasks. Motivated by this, in this paper, we formalize the problem of ad-hoc inference and matching over gene regulatory networks (IM-GRN), which deciphers ad-hoc GRN graph structures online from gene feature databases (without full GRN materializations), and retrieves the inferred GRNs that are subgraph-isomorphic to a query GRN graph with high confidences. Specifically, we propose a novel probabilistic score to measure the possible interaction between any two genes (inferred from gene feature vectors), and thus model GRNs by probabilistic graphs, containing edge existence probabilities. In order to efficiently process IM-GRN queries, we propose effective reduction, pruning, and embedding strategies to significantly reduce the search space of GRN inference and matching, without materializing all GRNs. We also present an effective indexing mechanism and an efficient IM-GRN query processing algorithm by the index traversal. Finally, extensive experiments have been conducted to verify the efficiency and effectiveness of our proposed IM-GRN query answering approaches over real/synthetic GRN data sets.

【Keywords】: ad-hoc graph inference and matching; gene regulatory networks; im-grn

27. DAG Reduction: Fast Answering Reachability Queries.

【Paper Link】 【Pages】:375-390

【Authors】: Junfeng Zhou ; Shijie Zhou ; Jeffrey Xu Yu ; Hao Wei ; Ziyang Chen ; Xian Tang

【Abstract】: Answering reachability queries is one of the fundamental graph operations. The existing approaches build indexes and answer reachability queries on a directed acyclic graph (DAG) G, which is constructed by coalescing each strongly connected component of the given directed graph G into a node of G. Considering that G can still be large to be processed efficiently, there are studies to further reduce G to a smaller graph. However, these approaches suffer from either inefficiency in answering reachability queries, or cannot scale to large graphs. In this paper, we study DAG reduction to accelerate reachability query processing, which reduces the size of G by computing transitive reduction (TR) followed by computing equivalence reduction (ER). For ER, we propose a divide-and-conquer algorithm, namely linear-ER. Given the result Gt of TR, linear-ER gets a smaller DAG Gε in linear time based on equivalence relationship between nodes in G. Our DAG reduction approaches (TR and ER) significantly improve the cost of time and space, and can be scaled to large graphs. We confirm the efficiency of our approaches by extensive experimental studies for TR, ER, and reachability query processing using 20 real datasets.

【Keywords】: dag reduction; equivalence reduction; reachability query processing; transitive reduction

28. Flexible and Feasible Support Measures for Mining Frequent Patterns in Large Labeled Graphs.

【Paper Link】 【Pages】:391-402

【Authors】: Jinghan Meng ; Yi-Cheng Tu

【Abstract】: In recent years, the popularity of graph databases has grown rapidly. This paper focuses on single-graph as an effective model to represent information and its related graph mining techniques. In frequent pattern mining in a single-graph setting, there are two main problems: support measure and search scheme. In this paper, we propose a novel framework for constructing support measures that brings together existing minimum-image-based and overlap-graph-based support measures. Our framework is built on the concept of occurrence/instance hypergraphs. Based on that, we present two new support measures: minimum instance (MI) measure and minimum vertex cover (MVC) measure, that combine the advantages of existing measures. In particular, we show that the existing minimum-image-based support measure is an upper bound of the MI measure, which is also linear-time computable and results in counts that are close to the number of instances of a pattern. Although the MVC measure is NP-hard, it can be approximated to a constant factor in polynomial time. We also provide polynomial-time relaxations for both measures and bounding theorems for all presented support measures in the hypergraph setting. We further show that the hypergraph-based framework can unify all support measures studied in this paper. This framework is also flexible in that more variants of support measures can be defined and profiled in it.

【Keywords】: data mining; graph mining; hypergraph; support measures

SIGMOD Session 9. New Hardware 4

29. Accelerating Pattern Matching Queries in Hybrid CPU-FPGA Architectures.

【Paper Link】 【Pages】:403-415

【Authors】: David Sidler ; Zsolt István ; Muhsen Owaida ; Gustavo Alonso

【Abstract】: Taking advantage of recently released hybrid multicore architectures, such as Intel's Xeon+FPGA machine, where the FPGA has coherent access to the main memory through the QPI bus, we explore the benefits of specializing operators to hardware. We focus on two commonly used SQL operators for strings: LIKE, and REGEXP_LIKE, and provide a novel and efficient implementation of these operators in reconfigurable hardware. We integrate the hardware accelerator into MonetDB, a main-memory column store, and demonstrate a significant improvement in response time and throughput. Our Hardware User Defined Function (HUDF) can speed up complex pattern matching by an order of magnitude in comparison to the database running on a 10-core CPU. The insights gained from integrating hardware-based string operators into MonetDB should also be useful for future designs combining hardware specialization and databases.

【Keywords】: column store; hardware user defined functions; hybrid fpga-cpu architecture; regular expression; string matching

30. A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs.

【Paper Link】 【Pages】:417-432

【Authors】: Elias Stehle ; Hans-Arno Jacobsen

【Abstract】: Sorting is at the core of many database operations, such as index creation, sort-merge joins, and user-requested output sorting. As GPUs are emerging as a promising platform to accelerate various operations, sorting on GPUs becomes a viable endeavour. Over the past few years, several improvements have been proposed for sorting on GPUs, leading to the first radix sort implementations that achieve a sorting rate of over one billion 32-bit keys per second. Yet, state-of-the-art approaches are heavily memory bandwidth-bound, as they require substantially more memory transfers than their CPU-based counterparts. Our work proposes a novel approach that almost halves the amount of memory transfers and, therefore, considerably lifts the memory bandwidth limitation. Being able to sort two gigabytes of eight-byte records in as little as 50 milliseconds, our approach achieves a 2.32-fold improvement over the state-of-the-art GPU-based radix sort for uniform distributions, sustaining a minimum speed-up of no less than a factor of 1.66 for skewed distributions. To address inputs that either do not reside on the GPU or exceed the available device memory, we build on our efficient GPU sorting approach with a pipelined heterogeneous sorting algorithm that mitigates the overhead associated with PCIe data transfers. Comparing the end-to-end sorting performance to the state-of-the-art CPU-based radix sort running 16 threads, our heterogeneous approach achieves a 2.06-fold and a 1.53-fold improvement for sorting 64 GB key-value pairs with a skewed and a uniform distribution, respectively.

【Keywords】: gpu; heterogeneous computing; parallel sorting; radix sort; sorting
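
As a reference point for the underlying technique (a CPU-side sketch of least-significant-digit radix sort, not the paper's GPU kernels), each pass scatters keys into 256 buckets by one byte; it is the memory traffic of such repeated passes that the proposed approach nearly halves.

    # Least-significant-digit (LSD) radix sort over 32-bit keys, one byte per pass.
    # CPU-side sketch for illustration only; the paper's contribution is reducing
    # the memory traffic of the analogous scatter passes on the GPU.
    def radix_sort_u32(keys):
        for shift in (0, 8, 16, 24):            # four passes, one byte each
            buckets = [[] for _ in range(256)]
            for k in keys:
                buckets[(k >> shift) & 0xFF].append(k)
            keys = [k for bucket in buckets for k in bucket]  # stable concatenation
        return keys

    print(radix_sort_u32([0xCAFE, 7, 0xFFFFFFFF, 42, 0xBEEF]))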

31. FPGA-based Data Partitioning.

【Paper Link】 【Pages】:433-445

【Authors】: Kaan Kara ; Jana Giceva ; Gustavo Alonso

【Abstract】: Implementing parallel operators in multi-core machines often involves a data partitioning step that divides the data into cache-size blocks and arranges them so to allow concurrent threads to process them in parallel. Data partitioning is expensive, in some cases up to 90% of the cost of, e.g., a parallel hash join. In this paper we explore the use of an FPGA to accelerate data partitioning. We do so in the context of new hybrid architectures where the FPGA is located as a co-processor residing on a socket and with coherent access to the same memory as the CPU residing on the other socket. Such an architecture reduces data transfer overheads between the CPU and the FPGA, enabling hybrid operator execution where the partitioning happens on the FPGA and the build and probe phases of a join happen on the CPU. Our experiments demonstrate that FPGA-based partitioning is significantly faster and more robust than CPU-based partitioning. The results open interesting options as FPGAs are gradually integrated tighter with the CPU.

【Keywords】: data partitioning; fpga; hardware acceleration; hash join
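
To make the accelerated operation concrete (a plain software sketch of hash/radix partitioning with hypothetical parameters, not the FPGA design), tuples are scattered into a power-of-two number of partitions by a few bits of the key's hash, so that each partition later fits in cache for the join's build and probe phases:

    # Software hash/radix partitioning sketch (illustration; the paper offloads this
    # scatter step to an FPGA). Tuples are split into 2^bits partitions by key hash,
    # so each partition later fits in cache for the build/probe phases of a hash join.
    def partition(tuples, bits=4):
        fanout = 1 << bits
        parts = [[] for _ in range(fanout)]
        for key, payload in tuples:
            parts[hash(key) & (fanout - 1)].append((key, payload))
        return parts

    rows = [(i, f"payload-{i}") for i in range(20)]
    parts = partition(rows, bits=2)
    print([len(p) for p in parts])  # tuples spread across 4 partitions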

32. Template Skycube Algorithms for Heterogeneous Parallelism on Multicore and GPU Architectures.

【Paper Link】 【Pages】:447-462

【Authors】: Kenneth S. Bøgh ; Sean Chester ; Darius Sidlauskas ; Ira Assent

【Abstract】: Multicore CPUs and cheap co-processors such as GPUs create opportunities for vastly accelerating database queries. However, given the differences in their threading models, expected granularities of parallelism, and memory subsystems, effectively utilising all cores with all co-processors for an intensive query is very difficult. This paper introduces a novel templating methodology to create portable, yet architecture-aware, algorithms. We apply this methodology on the very compute-intensive task of calculating the skycube, a materialisation of exponentially many skyline query results, which finds applications in data exploration and multi-criteria decision making. We define three parallel templates, two that leverage insights from previous skycube research and a third that exploits a novel point-based paradigm to expose more data parallelism. An experimental study shows that, relative to the state-of-the-art that does not parallelise well due to its memory and cache requirements, our algorithms provide an order of magnitude improvement on either architecture and proportionately improve as more GPUs are added.

【Keywords】: cross-device parallelism; gpu; heterogeneous architectures; manycore; multicore; numa; parallel algorithms; skycube; skyline; template algorithms

SIGMOD Session 10. Parallel and Distributed Query Processing (2) 3

33. Heterogeneity-aware Distributed Parameter Servers.

【Paper Link】 【Pages】:463-478

【Authors】: Jiawei Jiang ; Bin Cui ; Ce Zhang ; Lele Yu

【Abstract】: We study distributed machine learning in heterogeneous environments in this work. We first conduct a systematic study of existing systems running distributed stochastic gradient descent; we find that, although these systems work well in homogeneous environments, they can suffer performance degradation, sometimes up to 10x, in heterogeneous environments where stragglers are common because their synchronization protocols cannot fit a heterogeneous setting. Our first contribution is a heterogeneity-aware algorithm that uses a constant learning rate schedule for updates before adding them to the global parameter. This allows us to suppress stragglers' harm on robust convergence. As a further improvement, our second contribution is a more sophisticated learning rate schedule that takes into consideration the delayed information of each update. We theoretically prove the valid convergence of both approaches and implement a prototype system in the production cluster of our industrial partner Tencent Inc. We validate the performance of this prototype using a range of machine-learning workloads. Our prototype is 2-12x faster than other state-of-the-art systems, such as Spark, Petuum, and TensorFlow; and our proposed algorithm takes up to 6x fewer iterations to converge.

【Keywords】: heterogeneity environment; machine learning; parameter server; stochastic gradient descent; straggler; synchronization protocol

34. Distributed Algorithms on Exact Personalized PageRank.

【Paper Link】 【Pages】:479-494

【Authors】: Tao Guo ; Xin Cao ; Gao Cong ; Jiaheng Lu ; Xuemin Lin

【Abstract】: As one of the most well known graph computation problems, Personalized PageRank is an effective approach for computing the similarity score between two nodes, and it has been widely used in various applications, such as link prediction and recommendation. Due to the high computational cost and space cost of computing the exact Personalized PageRank Vector (PPV), most existing studies compute PPV approximately. In this paper, we propose novel and efficient distributed algorithms that compute PPV exactly based on graph partitioning on a general coordinator-based shared-nothing distributed computing platform. Our algorithms take three aspects into account: the load balance, the communication cost, and the computation cost of each machine. The proposed algorithms require only a single round of communication between each machine and the coordinator at query time. The communication cost is bounded, and the workload on each machine is balanced. Comprehensive experiments conducted on five real datasets demonstrate the efficiency and the scalability of our proposed methods.

【Keywords】: personalized pagerank; random walk with restart; random walks
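
For reference, the quantity computed above can be obtained on a single machine by standard power iteration over the random-walk-with-restart model (a sketch of the definition, not the paper's distributed, partition-based algorithms):

    # Single-machine power iteration for the Personalized PageRank Vector (PPV) of a
    # source node (reference sketch; the paper computes the exact PPV distributively).
    # Model: with probability alpha the walk restarts at `source`, otherwise it moves
    # to a uniformly random out-neighbor.
    def ppv(adj, source, alpha=0.15, iters=100):
        nodes = list(adj)
        score = {v: 1.0 if v == source else 0.0 for v in nodes}
        for _ in range(iters):
            nxt = {v: (alpha if v == source else 0.0) for v in nodes}
            for u in nodes:
                out = adj[u]
                for v in out:
                    nxt[v] += (1 - alpha) * score[u] / len(out)
            score = nxt
        return score

    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(ppv(graph, "a"))  # similarity of every node to source "a"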

35. Parallelizing Sequential Graph Computations.

【Paper Link】 【Pages】:495-510

【Authors】: Wenfei Fan ; Jingbo Xu ; Yinghui Wu ; Wenyuan Yu ; Jiaxin Jiang ; Zeyu Zheng ; Bohan Zhang ; Yang Cao ; Chao Tian

【Abstract】: This paper presents GRAPE, a parallel system for graph computations. GRAPE differs from prior systems in its ability to parallelize existing sequential graph algorithms as a whole. Underlying GRAPE are a simple programming model and a principled approach, based on partial evaluation and incremental computation. We show that sequential graph algorithms can be "plugged into" GRAPE with minor changes, and get parallelized. As long as the sequential algorithms are correct, their GRAPE parallelization guarantees to terminate with correct answers under a monotonic condition. Moreover, we show that algorithms in MapReduce, BSP and PRAM can be optimally simulated on GRAPE. In addition to the ease of programming, we experimentally verify that GRAPE achieves comparable performance to the state-of-the-art graph systems, using real-life and synthetic graphs.

【Keywords】: graph computation; incremental evaluation; parallel model; partial evaluation; scalability

Keynote Session - Grand Challenges in Data Management: Approximate Query Processing 3

36. Approximate Query Processing: No Silver Bullet.

【Paper Link】 【Pages】:511-519

【Authors】: Surajit Chaudhuri ; Bolin Ding ; Srikanth Kandula

【Abstract】: In this paper, we reflect on the state of the art of Approximate Query Processing. Although much technical progress has been made in this area of research, we are yet to see its impact on products and services. We discuss two promising avenues to pursue towards integrating Approximate Query Processing into data platforms.

【Keywords】: approximate query processing; big data; error guarantee; olap; pre-computation; query optimization; query processing; sampling; transformation rules

37. Approximate Query Engines: Commercial Challenges and Research Opportunities.

【Paper Link】 【Pages】:521-524

【Authors】: Barzan Mozafari

【Abstract】: Recent years have witnessed a surge of interest in Approximate Query Processing (AQP) solutions, both in academia and the commercial world. In addition to well-known open problems in this area, there are many new research challenges that have surfaced as a result of the first interaction of AQP technology with commercial and real-world customers. We categorize these into deployment, planning, and interface challenges. At the same time, AQP settings introduce many interesting opportunities that would not be possible in a database with precise answers. These opportunities create hopes for overcoming some of the major limitations of traditional database systems. For example, we discuss how a database can reuse its past work in a generic way, and become smarter as it answers new queries. Our goal in this talk is to suggest some of the exciting research directions in this field that are worth pursuing.

【Keywords】: analytics; approximation; interactive response times

38. Approximate Query Processing for Interactive Data Science.

【Paper Link】 【Pages】:525

【Authors】: Tim Kraska

【Abstract】: We are witnessing a shift in the algorithms and tools used to analyze data towards more interactive systems with highly collaborative and visual interfaces. Ideally, a data scientist and a domain expert should be able to make discoveries together by directly manipulating, analyzing, and visualizing data on the spot, for example, using an interactive whiteboard like the recently released Microsoft Surface Hub. While such an interactive pattern would democratize data science and make it more accessible to a wider range of users, it also requires a rethinking of the full analytical stack. Most importantly, it necessitates the next generation of approximate query processing (AQP) techniques to guarantee (visual) results at interactive speeds during the data exploration process. The first generation of AQP focused on online aggregation for simple OLAP queries; a small subset of the functionality needed for data science workflows. The second generation widened the scope to more complex workflows, mainly by taking advantage of pre-computed samples at the cost of assuming that most or all queries are known upfront; again a bad fit, as it is rarely the case that all exploration patterns are known in advance. The next, third, generation of AQP has to give up on the assumption that most queries are known upfront, and instead leverage the fact that data exploration pipelines are created incrementally by the user through a visual interface. In this talk, I will present some of our recent results from building a third-generation AQP system, called IDEA. IDEA is the first Interactive Data Exploration Accelerator and allows data scientists to connect to a data source and immediately start exploring without any preparation time, while still guaranteeing interactive latencies largely independent of the type of operation or the data size. IDEA achieves this through novel AQP and result-reuse techniques, which better leverage the incremental nature of the exploration process. Most importantly, IDEA automatically creates stratified samples based on the user interaction and is able to reuse approximate intermediate results between interactions. The core idea behind our approximation and reuse technique is a reformulation of the AQP model itself, based on the observation that most visualizations convey simple statistics over the data. For example, the commonly used count histograms can be seen as visualizations of the frequency statistic over the value range of an attribute. To leverage this, we propose a new AQP model that treats aggregate query results as random variables. Surprisingly, this new model not only makes it easier to reuse results and to reason formally about error bounds, but also enables a completely new set of query rewrite rules based on probability theory. Finally, it turns out that online aggregation, which is typically used to approximate results without a pre-computed index, stratified samples, or sketches, struggles to provide high-quality results for rare sub-populations. At the same time, as one of our user studies revealed, it is quite common for users to explore rare sub-populations, as they often contain the most interesting insights (e.g., the habits of the few highly valued customers, the suspicious outliers, etc.). We therefore propose a new data structure, called the tail index, which is a low-overhead partial index that is created on the fly based on user interactions. Together with our new AQP model, tail indexes enable us to provide low approximation errors, even on increasingly small sub-populations, at interactive speeds without any pre-computation or a workload known upfront.

【Keywords】: approximate query processing; aqp; data exploration; interactive data exploration
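
A minimal sketch of the random-variable view of aggregates described above, reduced to its simplest form: a count computed on a uniform sample is reported together with a CLT-based confidence interval. This is illustrative only and is not IDEA's engine; the function and data names are hypothetical.

    import math
    import random

    def approximate_count(population, predicate, sample_size, z=1.96):
        """Estimate how many rows satisfy `predicate` from a uniform sample.

        Returns (estimate, half_width) where the true count lies within
        estimate +/- half_width with ~95% confidence (CLT approximation).
        """
        n = len(population)
        sample = random.sample(population, min(sample_size, n))
        hits = sum(1 for row in sample if predicate(row))
        p_hat = hits / len(sample)                       # sampled selectivity
        std_err = math.sqrt(p_hat * (1 - p_hat) / len(sample))
        return n * p_hat, z * n * std_err

    # Usage: estimate the number of orders above 100 from a 1% sample.
    orders = [{"amount": random.expovariate(1 / 80)} for _ in range(100_000)]
    est, err = approximate_count(orders, lambda r: r["amount"] > 100, 1_000)
    print(f"count ~ {est:.0f} +/- {err:.0f}")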

SIGMOD Session 11. Interactive Data Exploration and AQP (1) 3

39. Controlling False Discoveries During Interactive Data Exploration.

Paper Link】 【Pages】:527-540

【Authors】: Zheguang Zhao ; Lorenzo De Stefani ; Emanuel Zgraggen ; Carsten Binnig ; Eli Upfal ; Tim Kraska

【Abstract】: Recent tools for interactive data exploration significantly increase the chance that users make false discoveries. They allow users to (visually) examine many hypotheses and draw inferences with simple interactions, and thus incur the issue commonly known in statistics as the "multiple hypothesis testing error." In this work, we propose a solution to integrate the control of multiple hypothesis testing into interactive data exploration systems. A key insight is that existing methods for controlling the false discovery rate (FDR) are not directly applicable to interactive data exploration. We therefore discuss a set of new control procedures that are better suited for this task and integrate them in our system, QUDE. Via extensive experiments on both real-world and synthetic data sets we demonstrate how QUDE can help experts and novice users alike to efficiently control false discoveries.

【Keywords】: alpha investing; bonferroni; data analytics; false discovery control; false discovery rate; family-wise error rate; hypothesis testing; interactive data exploration; multiple comparisons problem; multiple hypothesis error; visualization
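
QUDE's control procedures are designed specifically for incremental, interactive exploration; as a generic point of reference for the multiple-hypothesis issue the abstract describes, the classic Benjamini-Hochberg FDR step-up procedure is sketched below (a textbook method, not the paper's algorithm).

    def benjamini_hochberg(p_values, alpha=0.05):
        """Return the indices of hypotheses rejected at FDR level alpha."""
        m = len(p_values)
        order = sorted(range(m), key=lambda i: p_values[i])
        k_max = -1
        for rank, i in enumerate(order, start=1):
            if p_values[i] <= rank * alpha / m:    # BH step-up condition
                k_max = rank
        return set(order[:k_max]) if k_max > 0 else set()

    # Ten "discoveries" made while exploring; only the clearly small p-values survive.
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.2, 0.5, 0.7, 0.9]
    print(sorted(benjamini_hochberg(pvals)))    # -> [0, 1]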

40. MacroBase: Prioritizing Attention in Fast Data.

Paper Link】 【Pages】:541-556

【Authors】: Peter Bailis ; Edward Gan ; Samuel Madden ; Deepak Narayanan ; Kexin Rong ; Sahaana Suri

【Abstract】: As data volumes continue to rise, manual inspection is becoming increasingly untenable. In response, we present MacroBase, a data analytics engine that prioritizes end-user attention in high-volume fast data streams. MacroBase enables efficient, accurate, and modular analyses that highlight and aggregate important and unusual behavior, acting as a search engine for fast data. MacroBase is able to deliver order-of-magnitude speedups over alternatives by optimizing the combination of explanation and classification tasks and by leveraging a new reservoir sampler and heavy-hitters sketch specialized for fast data streams. As a result, MacroBase delivers accurate results at speeds of up to 2M events per second per query on a single core. The system has delivered meaningful results in production, including at a telematics company monitoring hundreds of thousands of vehicles.

【Keywords】:
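
MacroBase's sketches are specialized for decaying, high-rate streams; as a rough stand-in for the heavy-hitters component mentioned above, the standard Misra-Gries summary below shows the kind of small-state, one-pass structure involved (illustrative only, not MacroBase's sketch).

    def misra_gries(stream, k):
        """Track up to k-1 candidate heavy hitters in one pass over the stream.

        Any item occurring more than len(stream)/k times is guaranteed to be
        among the returned candidates (the counts are underestimates).
        """
        counters = {}
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < k - 1:
                counters[item] = 1
            else:
                # Decrement every counter; drop the ones that reach zero.
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters

    events = ["ok"] * 900 + ["overheat"] * 80 + ["misc"] * 20
    print(misra_gries(events, k=3))   # -> {'ok': 880, 'overheat': 60}; 'misc' is evicted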

41. Data Canopy: Accelerating Exploratory Statistical Analysis.

Paper Link】 【Pages】:557-572

【Authors】: Abdul Wasay ; Xinding Wei ; Niv Dayan ; Stratos Idreos

【Abstract】: During exploratory statistical analysis, data scientists repeatedly compute statistics on data sets to infer knowledge. Moreover, statistics form the building blocks of core machine learning classification and filtering algorithms. Modern data systems, software libraries, and domain-specific tools provide support to compute statistics but lack a cohesive framework for storing, organizing, and reusing them. This creates a significant problem for exploratory statistical analysis as data grows: Despite existing overlap in exploratory workloads (which are repetitive in nature), statistics are always computed from scratch. This leads to repeated data movement and recomputation, hindering interactive data exploration. We address this challenge in Data Canopy, where descriptive and dependence statistics are synthesized from a library of basic aggregates. These basic aggregates are stored within an in-memory data structure, and are reused for overlapping data parts and for various statistical measures. What this means for exploratory statistical analysis is that repeated requests to compute different statistics do not trigger a full pass over the data. We discuss in detail the basic design elements in Data Canopy, which address multiple challenges: (1) How to decompose statistics into basic aggregates for maximal reuse? (2) How to represent, store, maintain, and access these basic aggregates? (3) Under different scenarios, which basic aggregates to maintain? (4) How to tune Data Canopy in a hardware-conscious way for maximum performance and how to maintain good performance as data grows and memory pressure increases? We demonstrate experimentally that Data Canopy results in an average speed-up of at least 10x after just 100 exploratory queries when compared with state-of-the-art systems used for exploratory statistical analysis.

【Keywords】: caching; data exploration; data science; machine learning; statistical analysis
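
A toy version of the decomposition idea above: cache per-chunk basic aggregates (count, Σx, Σx², Σy, Σy², Σxy) and synthesize statistics from them without rescanning the base data. The chunk size and function names are hypothetical; Data Canopy's actual structure organizes such aggregates in an in-memory hierarchy.

    import math

    CHUNK = 1024

    def build_aggregates(xs, ys):
        """Per-chunk basic aggregates: (n, sum x, sum x^2, sum y, sum y^2, sum xy)."""
        aggs = []
        for i in range(0, len(xs), CHUNK):
            cx, cy = xs[i:i + CHUNK], ys[i:i + CHUNK]
            aggs.append((len(cx), sum(cx), sum(v * v for v in cx),
                         sum(cy), sum(v * v for v in cy),
                         sum(a * b for a, b in zip(cx, cy))))
        return aggs

    def correlation(aggs):
        """Pearson correlation synthesized purely from the cached aggregates."""
        n = sum(a[0] for a in aggs)
        sx, sxx = sum(a[1] for a in aggs), sum(a[2] for a in aggs)
        sy, syy = sum(a[3] for a in aggs), sum(a[4] for a in aggs)
        sxy = sum(a[5] for a in aggs)
        cov = sxy / n - (sx / n) * (sy / n)
        var_x = sxx / n - (sx / n) ** 2
        var_y = syy / n - (sy / n) ** 2
        return cov / math.sqrt(var_x * var_y)

    xs = list(range(1000)); ys = [2 * x + 5 for x in xs]
    print(round(correlation(build_aggregates(xs, ys)), 3))   # 1.0 (perfectly linear)

Variance, covariance, and standard deviation over any chunk-aligned range fall out of the same six cached numbers, which is why repeated exploratory requests need not trigger a full pass over the data.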

SIGMOD Session 12. Beliefs, Conflicts, Knowledge 3

42. Beta Probabilistic Databases: A Scalable Approach to Belief Updating and Parameter Learning.

Paper Link】 【Pages】:573-586

【Authors】: Niccolo' Meneghetti ; Oliver Kennedy ; Wolfgang Gatterbauer

【Abstract】: Tuple-independent probabilistic databases (TI-PDBs) handle uncertainty by annotating each tuple with a probability parameter; when the user submits a query, the database derives the marginal probabilities of each output-tuple, assuming input-tuples are statistically independent. While query processing in TI-PDBs has been studied extensively, limited research has been dedicated to the problems of updating or deriving the parameters from observations of query results. Addressing this problem is the main focus of this paper. We introduce Beta Probabilistic Databases (B-PDBs), a generalization of TI-PDBs designed to support both (i) belief updating and (ii) parameter learning in a principled and scalable way. The key idea of B-PDBs is to treat each parameter as a latent, Beta-distributed random variable. We show how this simple expedient enables both belief updating and parameter learning in a principled way, without imposing any burden on regular query processing. We use this model to provide the following key contributions: (i) we show how to scalably compute the posterior densities of the parameters given new evidence; (ii) we study the complexity of performing Bayesian belief updates, devising efficient algorithms for tractable classes of queries; (iii) we propose a soft-EM algorithm for computing maximum-likelihood estimates of the parameters; (iv) we show how to embed the proposed algorithms into a standard relational engine; (v) we support our conclusions with extensive experimental results.

【Keywords】: bayesian inference; belief updating; expectation maximization; maximum likelihood estimation; parameter learning; probabilistic databases
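
The modeling step at the heart of the abstract, isolated from everything else: a Beta(α, β) belief about a tuple's probability updated with Bernoulli evidence stays Beta. The sketch below shows only this conjugate-update arithmetic, not the paper's query-level algorithms; the class name is hypothetical.

    from dataclasses import dataclass

    @dataclass
    class BetaParam:
        """Beta-distributed belief about one tuple's probability of being true."""
        alpha: float = 1.0   # pseudo-count of "tuple observed true"
        beta: float = 1.0    # pseudo-count of "tuple observed false"

        def update(self, observed_true: bool) -> None:
            # Conjugacy: Beta prior + Bernoulli observation -> Beta posterior.
            if observed_true:
                self.alpha += 1
            else:
                self.beta += 1

        def mean(self) -> float:
            return self.alpha / (self.alpha + self.beta)

    p = BetaParam()
    for obs in [True, True, False, True]:
        p.update(obs)
    print(round(p.mean(), 3))   # 0.667 after 3 positive / 1 negative observations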

43. Database Learning: Toward a Database that Becomes Smarter Every Time.

Paper Link】 【Pages】:587-602

【Authors】: Yongjoo Park ; Ahmad Shahab Tajik ; Michael J. Cafarella ; Barzan Mozafari

【Abstract】: In today's databases, previous query answers rarely benefit answering future queries. For the first time, to the best of our knowledge, we change this paradigm in an approximate query processing (AQP) context. We make the following observation: the answer to each query reveals some degree of knowledge about the answer to another query because their answers stem from the same underlying distribution that has produced the entire dataset. Exploiting and refining this knowledge should allow us to answer queries more analytically, rather than by reading enormous amounts of raw data. Also, processing more queries should continuously enhance our knowledge of the underlying distribution, and hence lead to increasingly faster response times for future queries. We call this novel idea---learning from past query answers---Database Learning. We exploit the principle of maximum entropy to produce answers, which are in expectation guaranteed to be more accurate than existing sample-based approximations. Empowered by this idea, we build a query engine on top of Spark SQL, called Verdict. We conduct extensive experiments on real-world query traces from a large customer of a major database vendor. Our results demonstrate that database learning supports 73.7% of these queries, speeding them up by up to 23.0x for the same accuracy level compared to existing AQP systems.

【Keywords】: approximate query processing; database learning; machine learning; maximum entropy principle; online aggregation
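
Verdict's actual machinery rests on the maximum-entropy principle; the far simpler sketch below conveys only the underlying intuition that a past (noisy) answer can tighten a fresh sample-based estimate, here via inverse-variance weighting of two independent estimates. Entirely illustrative.

    def combine_estimates(est_a, var_a, est_b, var_b):
        """Inverse-variance combination of two independent unbiased estimates.

        The combined variance is never larger than either input variance, which
        is the sense in which past answers make future answers cheaper.
        """
        w_a, w_b = 1.0 / var_a, 1.0 / var_b
        combined = (w_a * est_a + w_b * est_b) / (w_a + w_b)
        return combined, 1.0 / (w_a + w_b)

    # Past query answer: avg ~ 41.0 (variance 4.0); fresh small sample: 43.5 (variance 9.0).
    print(combine_estimates(41.0, 4.0, 43.5, 9.0))   # ~ (41.77, 2.77)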

44. Staging User Feedback toward Rapid Conflict Resolution in Data Fusion.

Paper Link】 【Pages】:603-618

【Authors】: Romila Pradhan ; Siarhei Bykau ; Sunil Prabhakar

【Abstract】: In domains such as the Web, sensor networks and social media, sources often provide conflicting information for the same data item. Several data fusion techniques have been proposed recently to resolve conflicts and identify correct data. The performance of these fusion systems, while quite accurate, is far from perfect. In this paper, we propose to leverage user feedback for validating data conflicts and rapidly improving the performance of fusion. To present the most beneficial data items for the user to validate, we take advantage of the level of consensus among sources, and the output of fusion to generate an effective ordering of items. We first evaluate data items individually, and then define a novel decision-theoretic framework based on the concept of value of perfect information (VPI) to order items by their ability to boost the performance of fusion. We further derive approximate formulae to scale up the decision-theoretic framework to large-scale data. We empirically evaluate our algorithms on three real-world datasets with different characteristics, and show that the accuracy of fusion can be significantly improved even while requesting feedback on a few data items. We also show that the performance of the proposed methods depends on the characteristics of data, and assess the trade-off between the amount of feedback acquired, and the effectiveness and efficiency of the methods.

【Keywords】: data fusion; data integration; user feedback

SIGMOD Session 13. Influence in Social Networks 3

45. Discovering Your Selling Points: Personalized Social Influential Tags Exploration.

Paper Link】 【Pages】:619-634

【Authors】: Yuchen Li ; Ju Fan ; Dongxiang Zhang ; Kian-Lee Tan

【Abstract】: Social influence has attracted significant attention owing to the prevalence of social networks (SNs). In this paper, we study a new social influence problem, called personalized social tags exploration (PITEX), to help any user in the SN explore how she influences the network. Given a target user, it finds a size-k tag set that maximizes this user's social influence. We prove the problem is NP-hard to approximate within any constant ratio. To solve it, we introduce a sampling-based framework, which has an approximation ratio of (1-ε)/(1+ε) with a high-probability guarantee. To speed up the computation, we devise more efficient sampling techniques and propose best-effort exploration to quickly prune tag sets with small influence. To further enable instant exploration, we devise a novel index structure and develop effective pruning and materialization techniques. Experimental results on real large-scale datasets validate our theoretical findings and show the high performance of our proposed methods.

【Keywords】: index; personalization; recommendation; sampling; social influence

46. Coarsening Massive Influence Networks for Scalable Diffusion Analysis.

Paper Link】 【Pages】:635-650

【Authors】: Naoto Ohsaka ; Tomohiro Sonobe ; Sumio Fujita ; Ken-ichi Kawarabayashi

【Abstract】: Fueled by the increasing popularity of online social networks, social influence analysis has attracted a great deal of research attention in the past decade. The diffusion process is often modeled using influence graphs, and there has been a line of research that involves algorithmic problems in influence graphs. However, the vast size of today's real-world networks raises a serious issue with regard to computational efficiency. In this paper, we propose a new algorithm for reducing influence graphs. Given an input influence graph, the proposed algorithm produces a vertex-weighted influence graph, which is compact and approximates the diffusion properties of the input graph. The central strategy of influence graph reduction is coarsening, which has the potential to greatly reduce the number of edges by merging a vertex set into a single weighted vertex. We provide two implementations: a speed-oriented implementation that runs in linear time with linear space, and a scalability-oriented implementation that runs in practically linear time with sublinear space. Further, we present general frameworks using our compact graphs that accelerate existing algorithms for influence maximization and influence estimation problems, which are motivated by practical applications, such as viral marketing. Using these frameworks, we can quickly obtain solutions that have accuracy guarantees under a reasonable assumption. Experiments with real-world networks demonstrate that the proposed algorithm can scale to billion-edge graphs and reduce the graph size to as little as 4% of the original. In addition, our influence maximization framework achieves a four-times speed-up of the state-of-the-art D-SSA algorithm, and our influence estimation framework cuts down the computation time of a simulation-based method to 3.5%.

【Keywords】: graph reduction; influence maximization; social networks

47. Debunking the Myths of Influence Maximization: An In-Depth Benchmarking Study.

Paper Link】 【Pages】:651-666

【Authors】: Akhil Arora ; Sainyam Galhotra ; Sayan Ranu

【Abstract】: Influence maximization (IM) on social networks is one of the most active areas of research in computer science. While various IM techniques proposed over the last decade have definitely enriched the field, unfortunately, experimental reports on existing techniques fall short in validity and integrity, since many comparisons are either not performed on a common platform or are discussed only in theory. In this paper, we perform an in-depth benchmarking study of IM techniques on social networks. Specifically, we design a benchmarking platform, which enables us to evaluate and compare the existing techniques systematically and thoroughly under identical experimental conditions. Our benchmarking results analyze and diagnose the inherent deficiencies of the existing approaches and surface the open challenges in IM even after a decade of research. More fundamentally, we unearth and debunk a series of myths and establish that there is no single state-of-the-art technique in IM. At best, a technique is the state of the art in only one aspect.

【Keywords】: in-depth benchmarking study

SIGMOD Session 14. Mappings, Transformations, Pricing 3

48. Interactive Mapping Specification with Exemplar Tuples.

Paper Link】 【Pages】:667-682

【Authors】: Angela Bonifati ; Ugo Comignani ; Emmanuel Coquery ; Romuald Thion

【Abstract】: While schema mapping specification is a cumbersome task for data curation specialists, it becomes unfeasible for non-expert users, who are unacquainted with the semantics and languages of the involved transformations. In this paper, we present an interactive framework for schema mapping specification suited for non-expert users. The underlying key intuition is to leverage a few exemplar tuples to infer the underlying mappings and to iterate the inference process via simple user interactions, in the form of boolean queries on the validity of the initial exemplar tuples. The approaches available so far mainly assume pairs of complete universal data examples, which can be provided only by data curation experts, or are limited to poorly expressive mappings. We present several exploration strategies of the space of all possible mappings that satisfy arbitrary user exemplar tuples. Along the exploration, we challenge the user to retain the mappings that best fit the user's requirements and to dynamically prune the exploration space, thus reducing the number of user interactions. We prove that after the refinement process, the obtained mappings are correct. We present an extensive experimental analysis devoted to measuring the feasibility of our interactive mapping strategies and the inherent quality of the obtained mappings.

【Keywords】: data exchange; exemplar tuples; mapping specification

49. Foofah: Transforming Data By Example.

Paper Link】 【Pages】:683-698

【Authors】: Zhongjun Jin ; Michael R. Anderson ; Michael J. Cafarella ; H. V. Jagadish

【Abstract】: Data transformation is a critical first step in modern data analysis: before any analysis can be done, data from a variety of sources must be wrangled into a uniform format that is amenable to the intended analysis and analytical software package. This data transformation task is tedious, time-consuming, and often requires programming skills beyond the expertise of data analysts. In this paper, we develop a technique to synthesize data transformation programs by example, reducing this burden by allowing the analyst to describe the transformation with a small input-output example pair, without being concerned with the transformation steps required to get there. We implemented our technique in a system, FOOFAH, that efficiently searches the space of possible data transformation operations to generate a program that will perform the desired transformation. We experimentally show that data transformation programs can be created quickly with FOOFAH for a wide variety of cases, with 60% less user effort than the well-known WRANGLER system.

【Keywords】: a* algorithm; data transformation; heuristic; program synthesis; programming by example

50. QIRANA: A Framework for Scalable Query Pricing.

Paper Link】 【Pages】:699-713

【Authors】: Shaleen Deep ; Paraschos Koutris

【Abstract】: Users are increasingly engaging in buying and selling data over the web. Facilitated by the proliferation of online marketplaces that bring such users together, data brokers need to serve requests where they provide results for user queries over the underlying datasets, and price them fairly according to the information disclosed by the query. In this work, we present a novel pricing system, called QIRANA, that performs query-based data pricing for a large class of SQL queries (including aggregation) in real time. QIRANA provides prices with formal guarantees: for example, it avoids prices that create arbitrage opportunities. Our framework also allows flexible pricing, by allowing the data seller to choose from a variety of pricing functions, as well as specify relation and attribute-level parameters that control the price of queries and assign different value to different portions of the data. We test QIRANA on a variety of real-world datasets and query workloads, and we show that it can efficiently compute the prices for queries over large-scale data.

【Keywords】: arbitrage; data pricing; query determinacy

SIGMOD Session 15. Optimization and Performance (1) 3

51. Access Path Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe?

Paper Link】 【Pages】:715-730

【Authors】: Michael S. Kester ; Manos Athanassoulis ; Stratos Idreos

【Abstract】: The advent of columnar data analytics engines fueled a series of optimizations on the scan operator. New designs include column-group storage, vectorized execution, shared scans, working directly over compressed data, and operating using SIMD and multi-core execution. Larger main memories and deeper cache hierarchies increase the efficiency of modern scans, prompting a revisit of the question of access path selection. In this paper, we compare modern sequential scans and secondary index scans. Through detailed analytical modeling and experimentation we show that while scans have become useful in more cases than before, both access paths are still useful, and so, access path selection (APS) is still required to achieve the best performance when considering variable workloads. We show how to perform access path selection. In particular, contrary to the way traditional systems choose between scans and secondary indexes, we find that in addition to the query selectivity, the underlying hardware, and the system design, modern optimizers also need to take into account query concurrency. We further discuss the implications of integrating access path selection in a modern analytical data system. We demonstrate, both theoretically and experimentally, that using the proposed model a system can quickly perform access path selection, outperforming solutions that rely on a single access path or traditional access path models. We outline a light-weight mechanism to integrate APS into main-memory analytical systems that does not interfere with low latency queries. We also use the APS model to explain how the division between sequential scan and secondary index scan has historically changed due to hardware and workload changes, which allows for future projections based on hardware advancements.

【Keywords】: access path selection; cost based optimization; main memory system
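
A heavily simplified caricature of the decision the paper models: compare a shared sequential scan against per-query secondary-index probes, letting query concurrency amortize the scan. The constants and the cost formula are placeholders, not the paper's calibrated model.

    def choose_access_path(num_rows, selectivity, concurrent_queries,
                           scan_cost_per_row=1.0, probe_cost_per_match=20.0):
        """Return 'scan' or 'index' under a toy cost model.

        One sequential scan can be shared by all concurrent queries, while
        index probes pay a (higher) per-matching-row cost for each query.
        """
        shared_scan = num_rows * scan_cost_per_row             # paid once, shared
        index_probes = (concurrent_queries * selectivity *
                        num_rows * probe_cost_per_match)       # paid per query
        return "scan" if shared_scan <= index_probes else "index"

    print(choose_access_path(10_000_000, 0.001, concurrent_queries=1))    # index
    print(choose_access_path(10_000_000, 0.001, concurrent_queries=64))   # scan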

52. Optimization of Disjunctive Predicates for Main Memory Column Stores.

Paper Link】 【Pages】:731-744

【Authors】: Fisnik Kastrati ; Guido Moerkotte

【Abstract】: Optimization of disjunctive predicates is a very challenging task which has been vastly neglected by the research community and commercial databases. In this work, we focus on the complex problem of optimizing disjunctive predicates by means of the bypass processing technique. In bypass processing, selection operators split the input tuple stream into two disjoint output streams: the true-stream with tuples that satisfy the selection predicate and the false-stream with tuples that do not. Bypass processing is crucial in avoiding expensive predicates whenever the outcome of the query predicate can be determined by evaluating the less expensive ones. In main memory databases, CPU architectural characteristics, such as the branch misprediction penalty, become a prominent cost factor which cannot be ignored. Our algorithm takes into account the branch misprediction penalty, and, in addition, it eliminates common subexpressions. The current literature relies on two assumptions: (1) predicate costs are assumed to be constant, (2) predicate selectivities are assumed to be independent. Since both assumptions do not hold in practice, our approach is not based on any of them.

【Keywords】: algorithms; disjunctive predicates; query optimization

53. A Top-Down Approach to Achieving Performance Predictability in Database Systems.

Paper Link】 【Pages】:745-758

【Authors】: Jiamin Huang ; Barzan Mozafari ; Grant Schoenebeck ; Thomas F. Wenisch

【Abstract】: While much of the research on transaction processing has focused on improving overall performance in terms of throughput and mean latency, surprisingly little attention has been given to performance predictability: how often individual transactions exhibit execution latency far from the mean. Performance predictability is increasingly important when transactions lie on the critical path of latency-sensitive applications, enterprise software, or interactive web services. In this paper, we focus on understanding and mitigating the sources of performance unpredictability in today's transactional databases. We conduct the first quantitative study of major sources of variance in MySQL, Postgres (two of the largest and most popular open-source products on the market), and VoltDB (a non-conventional database). We carry out our study with a tool called TProfiler that, given the source code of a database system and programmer annotations indicating the start and end of a transaction, is able to identify the dominant sources of variance in transaction latency. Based on our findings, we investigate alternative algorithms, implementations, and tuning strategies to reduce latency variance without compromising mean latency or throughput. Most notably, we propose a new lock scheduling algorithm, called Variance-Aware Transaction Scheduling (VATS), and a lazy buffer pool replacement policy. In particular, our modified MySQL exhibits significantly lower variance and 99th percentile latencies, by up to 5.6× and 6.3×, respectively. Our proposal has been welcomed by the open-source community: the VATS algorithm has already been adopted in MySQL as of its 5.7.17 release (and has been made the default scheduling policy in MariaDB).

【Keywords】: predictability; profiling; transaction processing; transactional databases; variance
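
At its core, the VATS idea reported above changes which waiting transaction a released lock is granted to. The sketch below follows the common description of VATS (grant to the eldest waiter instead of first-come-first-served); the real MySQL implementation differs in many details.

    import heapq

    class VarianceAwareLock:
        """Toy lock queue: on release, grant to the eldest waiting transaction."""

        def __init__(self):
            self.holder = None
            self.waiters = []   # min-heap keyed by transaction start time

        def acquire(self, txn_id, start_time):
            if self.holder is None:
                self.holder = txn_id
                return True                        # granted immediately
            heapq.heappush(self.waiters, (start_time, txn_id))
            return False                           # must wait

        def release(self):
            # FIFO would grant in arrival order; granting the eldest waiter
            # instead reduces the latency variance of long-running transactions.
            self.holder = (heapq.heappop(self.waiters)[1] if self.waiters else None)
            return self.holder

    lock = VarianceAwareLock()
    lock.acquire("t3", start_time=30)              # t3 holds the lock
    lock.acquire("t9", start_time=90)
    lock.acquire("t1", start_time=10)
    print(lock.release())                          # 't1' (eldest waiter) is granted next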

SIGMOD Session 16. Interactive Data Exploration and AQP (2) 3

54. Two-Level Sampling for Join Size Estimation.

Paper Link】 【Pages】:759-774

【Authors】: Yu Chen ; Ke Yi

【Abstract】: Join size estimation is a critical step in query optimization, and has been extensively studied in the literature. Among the many techniques, sampling-based approaches are particularly appealing, due to their ability to handle arbitrary selection predicates. In this paper, we propose a new sampling algorithm for join size estimation, called two-level sampling, which combines the advantages of three previous sampling methods while making further improvements. Both analytical and empirical comparisons show that the new algorithm outperforms all the previous algorithms on a variety of joins, including primary key-foreign key joins, many-to-many joins, and multi-table joins. The new sampling algorithm is also very easy to implement, requiring just one pass over the data. It only relies on some basic statistical information about the data, such as the ℓ_k-norms and the heavy hitters.

【Keywords】: joins; sampling
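
The paper's two-level scheme is more refined than this, but the baseline it improves on is easy to show: estimate |R ⋈ S| from independent Bernoulli samples of both tables and scale the sampled join size by 1/p². Table and key names are hypothetical.

    import random
    from collections import Counter

    def estimate_join_size(r, s, key_r, key_s, p=0.1, seed=7):
        """Estimate the equi-join size of R and S from Bernoulli samples.

        Each joining tuple pair survives sampling with probability p*p, so the
        sampled join size divided by p*p is an unbiased estimate.
        """
        rng = random.Random(seed)
        sample_r = [t for t in r if rng.random() < p]
        sample_s = [t for t in s if rng.random() < p]
        freq_s = Counter(key_s(t) for t in sample_s)
        sampled_join = sum(freq_s[key_r(t)] for t in sample_r)
        return sampled_join / (p * p)

    R = [{"k": i % 100} for i in range(20_000)]
    S = [{"k": i % 100} for i in range(5_000)]
    cnt_r, cnt_s = Counter(t["k"] for t in R), Counter(t["k"] for t in S)
    true_size = sum(cnt_r[k] * cnt_s[k] for k in cnt_r)           # 1,000,000
    print(true_size, estimate_join_size(R, S, lambda t: t["k"], lambda t: t["k"]))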

55. A General-Purpose Counting Filter: Making Every Bit Count.

Paper Link】 【Pages】:775-787

【Authors】: Prashant Pandey ; Michael A. Bender ; Rob Johnson ; Robert Patro

【Abstract】: Approximate Membership Query (AMQ) data structures, such as the Bloom filter, quotient filter, and cuckoo filter, have found numerous applications in databases, storage systems, networks, computational biology, and other domains. However, many applications must work around limitations in the capabilities or performance of current AMQs, making these applications more complex and less performant. For example, many current AMQs cannot delete or count the number of occurrences of each input item, take up large amounts of space, are slow, cannot be resized or merged, or have poor locality of reference and hence perform poorly when stored on SSD or disk. This paper proposes a new general-purpose AMQ, the counting quotient filter (CQF). The CQF supports approximate membership testing and counting the occurrences of items in a data set. This general-purpose AMQ is small and fast, has good locality of reference, scales out of RAM to SSD, and supports deletions, counting (even on skewed data sets), resizing, merging, and highly concurrent access. The paper reports on the structure's performance on both manufactured and application-generated data sets. In our experiments, the CQF performs in-memory inserts and queries up to an order of magnitude faster than the original quotient filter, several times faster than a Bloom filter, and similarly to the cuckoo filter, even though none of these other data structures support counting. On SSD, the CQF outperforms all structures by a factor of at least 2 because the CQF has good data locality. The CQF achieves these performance gains by restructuring the metadata bits of the quotient filter to obtain fast lookups at high load factors (i.e., even when the data structure is almost full). As a result, the CQF offers good lookup performance even up to a load factor of 95%. Counting is essentially free in the CQF in the sense that the structure is comparable to or even more space efficient than non-counting data structures (e.g., Bloom, quotient, and cuckoo filters). The paper also shows how to speed up CQF operations by using new x86 bit-manipulation instructions introduced in Intel's Haswell line of processors. The restructured metadata transforms many quotient filter metadata operations into rank-and-select bit-vector operations. Thus, our efficient implementations of rank and select may be useful for other rank-and-select-based data structures.

【Keywords】: bloom filters and hashing; computational biology; network monitoring; sketching and sampling
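
The CQF itself packs counts into restructured quotient-filter metadata bits and leans on rank/select; as a much simpler (and far less space-efficient) relative, the counting Bloom filter below shows what "approximate membership with counts and deletions" means functionally. It is not the CQF.

    import hashlib

    class CountingBloomFilter:
        """Simplistic counting filter: supports insert, delete, and count queries.

        Counts are upper bounds (hash collisions can only inflate them), which is
        the usual one-sided error of approximate membership structures.
        """

        def __init__(self, num_slots=1 << 16, num_hashes=4):
            self.slots = [0] * num_slots
            self.num_hashes = num_hashes

        def _positions(self, item):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "little") % len(self.slots)

        def insert(self, item):
            for pos in self._positions(item):
                self.slots[pos] += 1

        def delete(self, item):
            for pos in self._positions(item):
                self.slots[pos] -= 1

        def count(self, item):
            return min(self.slots[pos] for pos in self._positions(item))

    f = CountingBloomFilter()
    for kmer in ["ACGT", "ACGT", "TTAG"]:
        f.insert(kmer)
    print(f.count("ACGT"), f.count("GGGG"))   # 2 0 (second value may exceed 0 on collision)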

56. BePI: Fast and Memory-Efficient Method for Billion-Scale Random Walk with Restart.

Paper Link】 【Pages】:789-804

【Authors】: Jinhong Jung ; Namyong Park ; Lee Sael ; U. Kang

【Abstract】: How can we measure similarity between nodes quickly and accurately on large graphs? Random walk with restart (RWR) provides a good measure, and has been used in various data mining applications including ranking, recommendation, link prediction and community detection. However, existing methods for computing RWR do not scale to large graphs containing billions of edges; iterative methods are slow in query time, and preprocessing methods require too much memory. In this paper, we propose BePI, a fast, memory-efficient, and scalable method for computing RWR on billion-scale graphs. BePI exploits the best properties from both preprocessing methods and iterative methods. BePI uses a block elimination approach, which is a preprocessing method, to enable fast query time. Also, BePI uses a preconditioned iterative method to decrease memory requirement. The performance of BePI is further improved by decreasing non-zeros of the matrix for the iterative method. Through extensive experiments, we show that BePI processes 100 times larger graphs, and requires up to 130 times less memory space than other preprocessing methods. In the query phase, BePI computes RWR scores up to 9 times faster than existing methods.

【Keywords】: random walk with restart; relevance score in graph
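
BePI's contribution is making RWR affordable at billion-edge scale via block elimination and preconditioning; the definition of RWR itself is compact enough to show directly. The sketch below is the slow power-iteration baseline that such systems accelerate.

    def random_walk_with_restart(adj, seed, c=0.15, iters=100):
        """RWR scores r = c*q + (1-c) * (column-normalized A)^T r, by power iteration.

        adj maps each node to its out-neighbors; scores measure relevance to `seed`.
        """
        nodes = list(adj)
        out_deg = {u: len(adj[u]) or 1 for u in nodes}
        r = {u: (1.0 if u == seed else 0.0) for u in nodes}
        for _ in range(iters):
            nxt = {u: (c if u == seed else 0.0) for u in nodes}
            for u in nodes:
                share = (1 - c) * r[u] / out_deg[u]
                for v in adj[u]:
                    nxt[v] += share             # walk mass flows along out-edges
            r = nxt
        return r

    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
    scores = random_walk_with_restart(graph, seed="a")
    print(max(scores, key=scores.get))   # 'a' (the seed and its close neighbors score high)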

SIGMOD Session 17. User Preferences 4

57. Determining the Impact Regions of Competing Options in Preference Space.

Paper Link】 【Pages】:805-820

【Authors】: Bo Tang ; Kyriakos Mouratidis ; Man Lung Yiu

【Abstract】: In rank-aware processing, user preferences are typically represented by a numeric weight per data attribute, collectively forming a weight vector. The score of an option (data record) is defined as the weighted sum of its individual attributes. The highest-scoring options across a set of alternatives (dataset) are shortlisted for the user as the recommended ones. In that setting, the user input is a vector (equivalently, a point) in a d-dimensional preference space, where d is the number of data attributes. In this paper we study the problem of determining in which regions of the preference space the weight vector should lie so that a given option (focal record) is among the top-k score-wise. In effect, these regions capture all possible user profiles for which the focal record is highly preferable, and are therefore essential in market impact analysis, potential customer identification, profile-based marketing, targeted advertising, etc. We refer to our problem as k-Shortlist Preference Region identification (kSPR), and exploit its computational geometric nature to develop a framework for its efficient (and exact) processing. Using real and synthetic benchmarks, we show that our most optimized algorithm outperforms by three orders of magnitude a competitor we constructed from previous work on a different problem.

【Keywords】: market impact analysis; preference mining; preference query; ranking query; recommendation systems; reverse top-k query; top-k query

58. Efficient Computation of Regret-ratio Minimizing Set: A Compact Maxima Representative.

Paper Link】 【Pages】:821-834

【Authors】: Abolfazl Asudeh ; Azade Nazi ; Nan Zhang ; Gautam Das

【Abstract】: Finding the maxima of a database based on a user preference, especially when the ranking function is a linear combination of the attributes, has been the subject of recent research. A critical observation is that the convex hull is the subset of tuples that can be used to find the maxima of any linear function. However, in real-world applications the convex hull can be a significant portion of the database, and thus its usefulness is greatly reduced. Thus, computing a subset limited to r tuples that minimizes the regret ratio (a measure of the user's dissatisfaction with the result from the limited set versus the one from the entire database) is of interest. In this paper, we make several fundamental theoretical as well as practical advances in developing such a compact set. In the case of two-dimensional databases, we develop an optimal linearithmic-time algorithm by leveraging the ordering of skyline tuples. In the case of higher dimensions, the problem is known to be NP-complete. As one of our main results of this paper, we develop an approximation algorithm that runs in linearithmic time and guarantees a regret ratio within any arbitrarily small, user-controllable distance from the optimal regret ratio. The comprehensive set of experiments on both synthetic and publicly available real datasets confirm the efficiency, quality of output, and scalability of our proposed algorithms.

【Keywords】: compact maxima representative; compact top-k index; convex hull; regret ratio; skyline
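
The quantity being minimized has a short definition, which the Monte Carlo sketch below estimates for a given subset by sampling random nonnegative preference vectors. The paper's contribution is constructing a small subset with a guaranteed ratio, which this sketch does not attempt.

    import random

    def dot(w, t):
        return sum(wi * ti for wi, ti in zip(w, t))

    def estimated_max_regret_ratio(database, subset, dims, trials=10_000, seed=1):
        """Estimate max over linear preferences of 1 - best(subset)/best(database)."""
        rng = random.Random(seed)
        worst = 0.0
        for _ in range(trials):
            w = [rng.random() for _ in range(dims)]      # random nonnegative weights
            best_all = max(dot(w, t) for t in database)
            best_sub = max(dot(w, t) for t in subset)
            worst = max(worst, 1.0 - best_sub / best_all)
        return worst

    db = [(0.9, 0.1), (0.1, 0.9), (0.6, 0.6), (0.5, 0.4)]
    print(estimated_max_regret_ratio(db, subset=[(0.9, 0.1), (0.1, 0.9)], dims=2))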

59. FEXIPRO: Fast and Exact Inner Product Retrieval in Recommender Systems.

Paper Link】 【Pages】:835-850

【Authors】: Hui Li ; Tsz Nam Chan ; Man Lung Yiu ; Nikos Mamoulis

【Abstract】: Recommender systems have many successful applications in e-commerce and social media, including Amazon, Netflix, and Yelp. Matrix Factorization (MF) is one of the most popular recommendation approaches; the original user-product rating matrix R with millions of rows and columns is decomposed into a user matrix Q and an item matrix P, such that the product Q^T P approximates R. Each column q (p) of Q (P) holds the latent factors of the corresponding user (item), and q^T p is a prediction of the rating to item p by user q. Recommender systems based on MF suggest to a user in q the items with the top-k scores in q^T P. For this problem, we propose a Fast and EXact Inner PROduct retrieval (FEXIPRO) framework, based on sequential scan, which includes three elements. First, FEXIPRO applies an SVD transformation to P, after which the first several dimensions capture a large percentage of the inner products. This enables us to prune item vectors by only computing their partial inner products with q. Second, we construct an integer approximation version of P, which can be used to compute fast upper bounds for the inner products that can prune item vectors. Finally, we apply a lossless transformation to P, such that the resulting matrix has only positive values, allowing for the inner products to be monotonically increasing with dimensionality. Experiments on real data demonstrate that our framework outperforms alternative approaches typically by an order of magnitude.

【Keywords】: inner product; recommender systems; similarity search
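
FEXIPRO's pruning relies on an SVD transform, integer approximation, and a positivity transform; the generic Cauchy-Schwarz bound q·p ≤ ||q||·||p|| used below illustrates the simpler principle that an upper bound on the inner product lets a sequential scan stop early. These are not the paper's bounds.

    import heapq
    import math

    def topk_inner_product(q, items, k):
        """Exact top-k by inner product, pruning with the bound q.p <= |q||p|."""
        q_norm = math.sqrt(sum(x * x for x in q))
        # Scan items in decreasing norm order so the bound shrinks monotonically.
        ranked = sorted(items, key=lambda p: -math.sqrt(sum(x * x for x in p)))
        heap = []   # min-heap of (score, index) for the current top-k
        for idx, p in enumerate(ranked):
            p_norm = math.sqrt(sum(x * x for x in p))
            if len(heap) == k and q_norm * p_norm <= heap[0][0]:
                break                                # no remaining item can enter top-k
            score = sum(a * b for a, b in zip(q, p))
            if len(heap) < k:
                heapq.heappush(heap, (score, idx))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, idx))
        return sorted(heap, reverse=True)

    print(topk_inner_product([1.0, 0.5], [[2, 2], [0.1, 0.1], [1, 3], [0.2, 0]], k=2))
    # -> [(3.0, 1), (2.5, 0)]; indices are positions in the norm-sorted list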

60. Feedback-Aware Social Event-Participant Arrangement.

Paper Link】 【Pages】:851-865

【Authors】: Jieying She ; Yongxin Tong ; Lei Chen ; Tianshu Song

【Abstract】: Online event-based social networks (EBSNs) and studies on global event-participant arrangement strategies for EBSNs are becoming popular recently. Existing works measure the satisfaction of an arrangement by a linear combination of a few factors whose weights are predefined and fixed, and do not allow users to provide feedback on whether they accept the arrangement or not. Besides, most of them only consider offline scenarios, where full information about users is known in advance. However, on real-world EBSN platforms, users can dynamically log in to the platform and register for events on a first-come, first-served basis. In other words, online scenarios of event-participant arrangement strategies should be considered. In this work, we study a new event-participant arrangement strategy for online scenarios, the Feedback-Aware Social Event-participant Arrangement (FASEA) problem, where the satisfaction scores of an arrangement are learned adaptively and users can choose to accept or reject the arranged events. Particularly, we model the problem as a contextual combinatorial bandit setting and use efficient and effective algorithms to solve the problem. The effectiveness and efficiency of the solutions are evaluated with extensive experimental studies, and our findings indicate that the state-of-the-art Thompson Sampling approach, which is reported to work well in the basic multi-armed bandit setting, does not perform well under FASEA.

【Keywords】: contextual combinatorial bandit; event-based social networks
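
For readers unfamiliar with the baseline the abstract evaluates, plain Bernoulli-bandit Thompson Sampling is sketched below; FASEA's setting is contextual and combinatorial, so this is the baseline, not the paper's algorithm. The acceptance probabilities are made up.

    import random

    def thompson_sampling(true_accept_probs, rounds=5_000, seed=0):
        """Plain Bernoulli Thompson sampling: pull the arm with the largest Beta sample."""
        rng = random.Random(seed)
        n_arms = len(true_accept_probs)
        alpha = [1.0] * n_arms      # successes + 1 (users accepted the arrangement)
        beta = [1.0] * n_arms       # failures + 1 (users rejected it)
        for _ in range(rounds):
            arm = max(range(n_arms), key=lambda a: rng.betavariate(alpha[a], beta[a]))
            if rng.random() < true_accept_probs[arm]:
                alpha[arm] += 1
            else:
                beta[arm] += 1
        return [a - 1 for a in alpha]   # accepted arrangements per arm

    print(thompson_sampling([0.2, 0.5, 0.65]))   # most pulls concentrate on the last arm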

SIGMOD Session 18. Tree & Graph Processing (2) 3

61. Exploiting Common Patterns for Tree-Structured Data.

Paper Link】 【Pages】:883-896

【Authors】: Zhiyi Wang ; Shimin Chen

【Abstract】: Tree-structured data formats, such as JSON and Protocol Buffers, are capable of expressing sophisticated data types, including nested, repeated, and missing values. While such expressive power contributes to their popularity in real-world applications, it presents a significant challenge for systems supporting tree-structured data. Existing systems have focused on general-purpose solutions, either extending RDBMSs or designing native systems. However, the general-purpose approach often results in sophisticated data structures and algorithms, which may not reflect and optimize for the actual structure patterns in the real world. In this paper, we aim to better understand tree-structured data types in real uses and optimize for the common patterns. We present an in-depth study of five types of real-world use cases of tree-structured data. We find that a majority of the root-to-leaf paths in the tree structures are simple, containing at most one repeated node. Given this insight, we design and implement Steed, a native analytical database system for tree-structured data. Steed implements the baseline general-purpose support for storing and querying data in both row and column layouts. Then we enhance the baseline design with a set of optimizations to simplify and improve the processing of simple paths. Experimental evaluation shows that our optimizations improve the baseline by a factor of up to 1.74x. Compared to three representative state-of-the-art systems (i.e. PostgreSQL, MongoDB, and Hive+Parquet), Steed achieves orders of magnitude better performance in both cold cache and hot cache scenarios.

【Keywords】: flat column assemble algorithm; flat data layout; json; mongodb; parquet; protocol buffers; real world patterns; simple path; steed; tree-structured data

62. Extracting and Analyzing Hidden Graphs from Relational Databases.

Paper Link】 【Pages】:897-912

【Authors】: Konstantinos Xirogiannopoulos ; Amol Deshpande

【Abstract】: Analyzing interconnection structures among underlying entities or objects in a dataset through the use of graph analytics can provide tremendous value in many application domains. However, graphs are not the primary representation choice for storing most data today, and in order to have access to these analyses, users are forced to manually extract data from their data stores, construct the requisite graphs, and then load them into some graph engine in order to execute their graph analysis task. Moreover, in many cases (especially when the graphs are dense), these graphs can be significantly larger than the initial input stored in the database, making it infeasible to construct or analyze such graphs in memory. In this paper we address both of these challenges by building a system that enables users to declaratively specify graph extraction tasks over a relational database schema and then execute graph algorithms on the extracted graphs. We propose a declarative domain-specific language for this purpose, and pair it up with a novel condensed, in-memory representation that significantly reduces the memory footprint of these graphs, permitting analysis of larger-than-memory graphs. We present a general algorithm for creating such a condensed representation for a large class of graph extraction queries against arbitrary schemas. We observe that the condensed representation suffers from a duplication issue that results in inaccuracies for most graph algorithms. We then present a suite of in-memory representations that handle this duplication in different ways and allow trading off the memory required and the computational cost for executing different graph algorithms. We also introduce several novel deduplication algorithms for removing this duplication in the graph, which are of independent interest for graph compression, and provide a comprehensive experimental evaluation over several real-world and synthetic datasets illustrating these trade-offs.

【Keywords】: graph analytics; graph compression; graph extraction

63. TrillionG: A Trillion-scale Synthetic Graph Generator using a Recursive Vector Model.

Paper Link】 【Pages】:913-928

【Authors】: Himchan Park ; Min-Soo Kim

【Abstract】: As many applications encounter exponential growth in graph sizes, a fast and scalable graph generator has become more important than ever before, due to the lack of large-scale realistic graphs for evaluating the performance of graph processing methods. Although a number of methods have been proposed to generate synthetic graphs, they are not very efficient in terms of space and time complexity, and so cannot generate even trillion-scale graphs using a moderately sized cluster of commodity machines. Here, we propose an efficient and scalable disk-based graph generator, TrillionG, that can generate massive graphs in a short time using only a small amount of memory. It can generate a graph of a trillion edges following the RMAT or Kronecker models within two hours using only 10 PCs. We first generalize existing graph generation models to the scope-based generation model, where RMAT and Kronecker correspond to two extremes. Then, we propose a new graph generation model called the recursive vector model, which strikes a compromise between the two extremes, and so solves the space and time complexity problems of RMAT and Kronecker. We also extend the recursive vector model so as to generate a semantically richer graph database. Through extensive experiments, we have demonstrated that TrillionG outperforms the state-of-the-art graph generators by up to orders of magnitude.

【Keywords】: benchmark; distributed; graph generation; ldbc; rdf; realistic; recursive vector model; rmat; scalability; synthetic
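
TrillionG's recursive vector model generalizes RMAT/Kronecker; the reference point, plain RMAT edge generation, is small enough to show. Below is a direct, unoptimized implementation of the standard RMAT recursion (not TrillionG), with the usual a, b, c quadrant probabilities.

    import random

    def rmat_edges(scale, num_edges, a=0.57, b=0.19, c=0.19, seed=42):
        """Generate edges of a 2^scale-node RMAT graph by recursive quadrant choice."""
        rng = random.Random(seed)
        for _ in range(num_edges):
            src = dst = 0
            for _ in range(scale):                  # descend one bit per level
                r = rng.random()
                src <<= 1
                dst <<= 1
                if r < a:
                    pass                            # top-left quadrant
                elif r < a + b:
                    dst |= 1                        # top-right
                elif r < a + b + c:
                    src |= 1                        # bottom-left
                else:
                    src |= 1                        # bottom-right
                    dst |= 1
            yield src, dst

    edges = list(rmat_edges(scale=20, num_edges=100_000))
    print(len(edges), edges[:3])                    # ~1M-node graph with skewed degrees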

SIGMOD Session 19. Machine Learning 4

64. Schema Independent Relational Learning.

Paper Link】 【Pages】:929-944

【Authors】: Jose Picado ; Arash Termehchy ; Alan Fern ; Parisa Ataei

【Abstract】: Learning novel relations from relational databases is an important problem with many applications. Relational learning algorithms learn the definition of a new relation in terms of existing relations in the database. Nevertheless, the same database may be represented under different schemas for various reasons, such as data quality, efficiency and usability. The output of current relational learning algorithms tends to vary quite substantially over the choice of schema. This variation complicates their off-the-shelf application. We introduce and formalize the property of schema independence of relational learning algorithms, and study both the theoretical and empirical dependence of existing algorithms on the common class of (de)composition schema transformations. We show that current algorithms are not schema independent. We propose Castor, a relational learning algorithm that achieves schema independence by leveraging data dependencies.

【Keywords】: learning over relational databases; machine learning; relational learning; schema independence; schema transformations

65. Scalable Kernel Density Classification via Threshold-Based Pruning.

Paper Link】 【Pages】:945-959

【Authors】: Edward Gan ; Peter Bailis

【Abstract】: Density estimation forms a critical component of many analytics tasks including outlier detection, visualization, and statistical testing. These tasks often seek to classify data into high- and low-density regions of a probability distribution. Kernel Density Estimation (KDE) is a powerful technique for computing these densities, offering excellent statistical accuracy but quadratic total runtime. In this paper, we introduce a simple technique for improving the performance of using a KDE to classify points by their density (density classification). Our technique, thresholded kernel density classification (tKDC), applies threshold-based pruning to spatial index traversal to achieve asymptotic speedups over naïve KDE, while maintaining accuracy guarantees. Instead of computing each point's exact density for use in classification, tKDC iteratively computes density bounds and short-circuits density computation as soon as bounds are either higher or lower than the target classification threshold. On a wide range of dataset sizes and dimensions, tKDC demonstrates empirical speedups of up to 1000x over alternatives.

【Keywords】: relational learning
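
tKDC traverses a spatial index while maintaining density bounds; the index-free sketch below keeps only the short-circuiting idea: stop accumulating kernel contributions once the running lower/upper bounds already decide which side of the threshold the density falls on. Illustrative only, without the paper's index-based pruning or guarantees.

    import math

    def classify_density(x, data, bandwidth, threshold):
        """Return 'high' or 'low' density at x, short-circuiting when bounds decide.

        Uses an (unnormalized) Gaussian kernel whose per-point contribution lies
        in [0, 1], so after j of n points the true average contribution is bounded
        by partial/n below and (partial + n - j)/n above.
        """
        n = len(data)
        partial = 0.0
        for j, xi in enumerate(data, start=1):
            d2 = sum((a - b) ** 2 for a, b in zip(x, xi))
            partial += math.exp(-d2 / (2 * bandwidth ** 2))
            lower, upper = partial / n, (partial + n - j) / n
            if lower >= threshold:
                return "high"                      # decided after j of n points
            if upper < threshold:
                return "low"
        return "high" if partial / n >= threshold else "low"

    cluster = [(0.1 * i, 0.1 * i) for i in range(100)]
    print(classify_density((0.5, 0.5), cluster, bandwidth=0.5, threshold=0.05))  # high
    print(classify_density((5.0, 0.0), cluster, bandwidth=0.5, threshold=0.05))  # low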

66. The BUDS Language for Distributed Bayesian Machine Learning.

Paper Link】 【Pages】:961-976

【Authors】: Zekai J. Gao ; Shangyu Luo ; Luis Leopoldo Perez ; Chris Jermaine

【Abstract】: We describe BUDS, a declarative language for succinctly and simply specifying the implementation of large-scale machine learning algorithms on a distributed computing platform. The types supported in BUDS--vectors, arrays, etc.--are simply logical abstractions useful for programming, and do not correspond to the actual implementation. In fact, BUDS automatically chooses the physical realization of these abstractions in a distributed system, by taking into account the characteristics of the data. Likewise, there are many available implementations of the abstract operations offered by BUDS (matrix multiplies, transposes, Hadamard products, etc.). These are tightly coupled with the physical representation. In BUDS, these implementations are co-optimized along with the representation. All of this allows for the BUDS compiler to automatically perform deep optimizations of the user's program, and automatically generate efficient implementations.

【Keywords】: declarative language; distributed system; machine learning

67. A Cost-based Optimizer for Gradient Descent Optimization.

Paper Link】 【Pages】:977-992

【Authors】: Zoi Kaoudi ; Jorge-Arnulfo Quiané-Ruiz ; Saravanan Thirumuruganathan ; Sanjay Chawla ; Divy Agrawal

【Abstract】: As the use of machine learning (ML) permeates into diverse application domains, there is an urgent need to support a declarative framework for ML. Ideally, a user will specify an ML task in a high-level and easy-to-use language and the framework will invoke the appropriate algorithms and system configurations to execute it. An important observation towards designing such a framework is that many ML tasks can be expressed as mathematical optimization problems, which take a specific form. Furthermore, these optimization problems can be efficiently solved using variations of the gradient descent (GD) algorithm. Thus, to decouple a user specification of an ML task from its execution, a key component is a GD optimizer. We propose a cost-based GD optimizer that selects the best GD plan for a given ML task. To build our optimizer, we introduce a set of abstract operators for expressing GD algorithms and propose a novel approach to estimate the number of iterations a GD algorithm requires to converge. Extensive experiments on real and synthetic datasets show that our optimizer not only chooses the best GD plan but also allows for optimizations that achieve orders of magnitude performance speed-up.

【Keywords】: big data; data management systems; gradient descent; machine learning; optimization

SIGMOD Session 20. Optimization and Performance (2) 4

68. An Experimental Study of Bitmap Compression vs. Inverted List Compression.

Paper Link】 【Pages】:993-1008

【Authors】: Jianguo Wang ; Chunbin Lin ; Yannis Papakonstantinou ; Steven Swanson

【Abstract】: Bitmap compression has been studied extensively in the database area and many efficient compression schemes have been proposed, e.g., BBC, WAH, EWAH, and Roaring. Inverted list compression is also a well-studied topic in the information retrieval community and many inverted list compression algorithms have been developed as well, e.g., VB, PforDelta, GroupVB, Simple8b, and SIMDPforDelta. We observe that they essentially solve the same problem, i.e., how to store a collection of sorted integers with as few bits as possible and support query processing as fast as possible. Due to historical reasons, bitmap compression and inverted list compression were developed as two separate lines of research in the database area and the information retrieval area. Thus, a natural question is: Which one is better, bitmap compression or inverted list compression? To answer the question, we present the first comprehensive experimental study to compare a series of 9 bitmap compression methods and 12 inverted list compression methods. We compare these 21 algorithms on synthetic datasets with different distributions (uniform, Zipf, and Markov) as well as 8 real-life datasets in terms of the space overhead, decompression time, intersection time, and union time. Based on the results, we provide many lessons and guidelines that can be used by practitioners to decide which technique to adopt in future systems and by researchers to develop new algorithms.

【Keywords】: benchmarking; bitmap compression; evaluation; inverted list compression; main memory
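
Of the inverted-list codecs compared above, variable-byte (VB) delta coding is the simplest; the sketch below encodes a sorted integer list as gaps packed into 7-bit groups, which is roughly the baseline the more advanced methods improve on (byte layout conventions vary between implementations).

    def vb_encode(sorted_ints):
        """Encode a sorted integer list as variable-byte deltas (gaps)."""
        out = bytearray()
        prev = 0
        for value in sorted_ints:
            gap = value - prev
            prev = value
            while True:
                byte = gap & 0x7F
                gap >>= 7
                if gap:
                    out.append(byte)             # continuation: high bit stays 0
                else:
                    out.append(byte | 0x80)      # last byte of this gap: set high bit
                    break
        return bytes(out)

    def vb_decode(data):
        values, gap, shift, prev = [], 0, 0, 0
        for byte in data:
            gap |= (byte & 0x7F) << shift
            shift += 7
            if byte & 0x80:                      # terminator byte reached
                prev += gap
                values.append(prev)
                gap, shift = 0, 0
        return values

    docs = [3, 7, 11, 120, 2_000_000]
    encoded = vb_encode(docs)
    print(len(encoded), vb_decode(encoded) == docs)   # 7 True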

69. Automatic Database Management System Tuning Through Large-scale Machine Learning.

Paper Link】 【Pages】:1009-1024

【Authors】: Dana Van Aken ; Andrew Pavlo ; Geoffrey J. Gordon ; Bohan Zhang

【Abstract】: Database management system (DBMS) configuration tuning is an essential aspect of any data-intensive application effort. But this is historically a difficult task because DBMSs have hundreds of configuration "knobs" that control everything in the system, such as the amount of memory to use for caches and how often data is written to storage. The problem with these knobs is that they are not standardized (i.e., two DBMSs use a different name for the same knob), not independent (i.e., changing one knob can impact others), and not universal (i.e., what works for one application may be sub-optimal for another). Worse, information about the effects of the knobs typically comes only from (expensive) experience. To overcome these challenges, we present an automated approach that leverages past experience and collects new information to tune DBMS configurations: we use a combination of supervised and unsupervised machine learning methods to (1) select the most impactful knobs, (2) map unseen database workloads to previous workloads from which we can transfer experience, and (3) recommend knob settings. We implemented our techniques in a new tool called OtterTune and tested it on two DBMSs. Our evaluation shows that OtterTune recommends configurations that are as good as or better than ones generated by existing tools or a human expert.

【Keywords】: autonomic computing; database management systems; database tuning; machine learning

70. Solving the Join Ordering Problem via Mixed Integer Linear Programming.

Paper Link】 【Pages】:1025-1040

【Authors】: Immanuel Trummer ; Christoph Koch

【Abstract】: We transform join ordering into a mixed integer linear program (MILP). This allows us to address query optimization with mature MILP solver implementations that have evolved over decades and steadily improved their performance. They offer features such as anytime optimization and parallel search that are highly relevant for query optimization. We present a MILP formulation for searching left-deep query plans. We use sets of binary variables to represent join operands and intermediate results, operator implementation choices, and the presence of interesting orders. Linear constraints restrict value assignments to the ones representing valid query plans. We approximate the cost of scan and join operations via linear functions, allowing us to increase the approximation precision to arbitrary degrees. We integrated a prototypical implementation of our approach into the Postgres optimizer and compare against the original optimizer and several variants. Our experimental results are encouraging: we are able to optimize queries joining 40 tables within less than one minute of optimization time. Such query sizes are far beyond the capabilities of traditional query optimization algorithms with worst-case guarantees on plan quality. Furthermore, as we use an existing solver, our optimizer implementation is small and can be integrated with low overhead.

【Keywords】: integer linear programming; query optimization

71. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases.

Paper Link】 【Pages】:1041-1052

【Authors】: Alexandre Verbitski ; Anurag Gupta ; Debanjan Saha ; Murali Brahmadesam ; Kamal Gupta ; Raman Mittal ; Sailesh Krishnamurthy ; Sandor Maurice ; Tengiz Kharatishvili ; Xiaofeng Bao

【Abstract】: Amazon Aurora is a relational database service for OLTP workloads offered as part of Amazon Web Services (AWS). In this paper, we describe the architecture of Aurora and the design considerations leading to that architecture. We believe the central constraint in high throughput data processing has moved from compute and storage to the network. Aurora brings a novel architecture to the relational database to address this constraint, most notably by pushing redo processing to a multi-tenant scale-out storage service, purpose-built for Aurora. We describe how doing so not only reduces network traffic, but also allows for fast crash recovery, failovers to replicas without loss of data, and fault-tolerant, self-healing storage. We then describe how Aurora achieves consensus on durable state across numerous storage nodes using an efficient asynchronous scheme, avoiding expensive and chatty recovery protocols. Finally, having operated Aurora as a production service for over 18 months, we share the lessons we have learnt from our customers on what modern cloud applications expect from databases.

【Keywords】: databases; distributed systems; log processing; oltp; performance; quorum models; recovery; replication

SIGMOD Session 21. Encryption 3

72. Fast Searchable Encryption With Tunable Locality.

Paper Link】 【Pages】:1053-1067

【Authors】: Ioannis Demertzis ; Charalampos Papamanthou

【Abstract】: Searchable encryption (SE) allows a client to outsource a dataset to an untrusted server while enabling the server to answer keyword queries in a private manner. SE can be used as a building block to support more expressive private queries such as range/point and boolean queries, while providing formal security guarantees. To scale SE to big data using external memory, new schemes with small locality have been proposed, where locality is defined as the number of non-continuous reads that the server makes for each query. Previous space-efficient SE schemes achieve optimal locality by increasing the read efficiency, i.e., the number of additional memory locations (false positives) that the server reads per result item. This can hurt practical performance. In this work, we design, formally prove secure, and evaluate the first SE scheme with tunable locality and linear space. Our first scheme has optimal locality and outperforms existing approaches (that have a slightly different leakage profile) by up to 2.5 orders of magnitude in terms of read efficiency, for all practical database sizes. Another version of our construction with the same leakage as previous works can be tuned to have bounded locality, optimal read efficiency and up to 60x more efficient end-to-end search time. We demonstrate that our schemes are also fast in the in-memory setting, leading to search time savings of up to 1 order of magnitude when compared to the most practical in-memory SE schemes. Finally, our construction can be tuned to achieve trade-offs between space, read efficiency, locality, parallelism and communication overhead.

【Keywords】: encrypted search; private keyword search; private point queries; searchable encryption

73. Cryptanalysis of Comparable Encryption in SIGMOD'16.

Paper Link】 【Pages】:1069-1084

【Authors】: Caleb Horst ; Ryo Kikuchi ; Keita Xagawa

【Abstract】: Comparable Encryption, proposed by Furukawa (ESORICS 2013, CANS 2014), is a variant of order-preserving encryption (OPE) and order-revealing encryption (ORE); we cannot compare a ciphertext of v and another ciphertext of v', but we can compare a ciphertext of v and a token of b, and compare a token of b and another token of b'. Comparable encryption allows us to implement range and point queries while keeping the order of v's as secret as possible. Recently, Karras, Malhotra, Bhatt, Nikitin, Antyukhov, and Idreos independently re-defined comparable encryption and proposed two schemes, a basic one and an "ambiguous" one, based on linear algebra (SIGMOD 2016). The basic scheme is just comparable encryption. To hide the order revealed by tokens, they also proposed an ambiguous scheme where each ciphertext has two interpretations, v and v_dummy. In the context of an indexed database, this means that every encryption has two places in the database corresponding to the two interpretations, masking the correct placement in the database unless the dummy value is detectable. They assessed that their basic scheme (and the ambiguous scheme built upon it) is secure against known-plaintext attacks; the adversary would require O(ℓ) plaintext-ciphertext pairs to recover the secret key, where ℓ is a security parameter. This paper cryptanalyzes their comparable encryption schemes using simple linear algebra. We show that a few tokens and a few plaintext-ciphertext pairs, instead of O(ℓ) pairs, allow us to mount several attacks efficiently. Our attacks are summarized as follows. Against the basic scheme: (i) a ciphertext-only attack using two tokens orders the ciphertexts correctly; (ii) a known-plaintext attack using two tokens and two plaintexts reveals the exact value of v. Against the ambiguous scheme: (i) a ciphertext-only attack using two tokens orders the ciphertexts with constant probability; (ii) a known-plaintext attack using three tokens and three plaintexts reveals the exact value of v.

【Keywords】: comparable encryption; cryptanalysis; encrypted db

74. BLOCKBENCH: A Framework for Analyzing Private Blockchains.

Paper Link】 【Pages】:1085-1100

【Authors】: Tien Tuan Anh Dinh ; Ji Wang ; Gang Chen ; Rui Liu ; Beng Chin Ooi ; Kian-Lee Tan

【Abstract】: Blockchain technologies are taking the world by storm. Public blockchains, such as Bitcoin and Ethereum, enable secure peer-to-peer applications like crypto-currency or smart contracts. Their security and performance are well studied. This paper concerns recent private blockchain systems designed with stronger security (trust) assumptions and performance requirements. These systems target, and aim to disrupt, applications that have so far been implemented on top of database systems, for example banking, finance and trading applications. Multiple platforms for private blockchains are being actively developed and fine-tuned. However, there is a clear lack of a systematic framework with which different systems can be analyzed and compared against each other. Such a framework can be used to assess blockchains' viability as another distributed data processing platform, while helping developers to identify bottlenecks and accordingly improve their platforms. In this paper, we first describe BLOCKBENCH, the first evaluation framework for analyzing private blockchains. It serves as a fair means of comparison for different platforms and enables deeper understanding of different system design choices. Any private blockchain can be integrated into BLOCKBENCH via simple APIs and benchmarked against workloads that are based on real and synthetic smart contracts. BLOCKBENCH measures overall and component-wise performance in terms of throughput, latency, scalability and fault-tolerance. Next, we use BLOCKBENCH to conduct a comprehensive evaluation of three major private blockchains: Ethereum, Parity and Hyperledger Fabric. The results demonstrate that these systems are still far from displacing current database systems in traditional data processing workloads. Furthermore, there are gaps in performance among the three systems, which are attributed to the design choices at different layers of the blockchain's software stack. We have released BLOCKBENCH for public use.
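
The abstract above notes that any private blockchain can be plugged into BLOCKBENCH via simple APIs and driven by smart-contract workloads. As a rough illustration only, the sketch below shows what such a driver interface and a throughput/latency measurement loop might look like; the names (BlockchainAdapter, deploy_contract, invoke, run_workload, InMemoryChain) are hypothetical and are not BLOCKBENCH's actual API.

```python
# Hypothetical benchmark-adapter sketch in the spirit of BLOCKBENCH; class and
# method names are illustrative assumptions, not the framework's real API.
import abc
import time


class BlockchainAdapter(abc.ABC):
    """Minimal driver a blockchain platform would implement to be benchmarked."""

    @abc.abstractmethod
    def deploy_contract(self, source: str) -> str:
        """Deploy a smart contract and return its address/handle."""

    @abc.abstractmethod
    def invoke(self, contract: str, op: str, args: dict) -> None:
        """Submit one transaction invoking `op` on `contract`."""


def run_workload(adapter: BlockchainAdapter, contract: str, requests) -> dict:
    """Replay a workload and report throughput and mean per-request latency."""
    latencies = []
    start = time.perf_counter()
    for op, args in requests:
        t0 = time.perf_counter()
        adapter.invoke(contract, op, args)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "throughput_tps": len(requests) / elapsed,
        "mean_latency_s": sum(latencies) / len(latencies),
    }


class InMemoryChain(BlockchainAdapter):
    """Trivial stand-in so the harness can be exercised without a real chain."""

    def deploy_contract(self, source):
        return "contract-0"

    def invoke(self, contract, op, args):
        pass  # a real adapter would submit a transaction and wait for commit


adapter = InMemoryChain()
contract = adapter.deploy_contract("kv-store source")
print(run_workload(adapter, contract, [("put", {"k": i, "v": i}) for i in range(1000)]))
```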

【Keywords】: blockchains; consensus; performance benchmark; security; smart contracts; transactions

SIGMOD Session 22. Cleaning, Versioning, Fusion (1) 3

75. Living in Parallel Realities: Co-Existing Schema Versions with a Bidirectional Database Evolution Language.

Paper Link】 【Pages】:1101-1116

【Authors】: Kai Herrmann ; Hannes Voigt ; Andreas Behrend ; Jonas Rausch ; Wolfgang Lehner

【Abstract】: We introduce end-to-end support of co-existing schema versions within one database. While it is state of the art to run multiple versions of a continuously developed application concurrently, it is hard to do the same for databases. In order to keep multiple co-existing schema versions alive -- which all access the same data set -- developers usually employ handwritten delta code (e.g. views and triggers in SQL). This delta code is hard to write and hard to maintain: if a database administrator decides to adapt the physical table schema, all handwritten delta code needs to be adapted as well, which is expensive and error-prone in practice. In this paper, we present InVerDa: developers use the simple bidirectional database evolution language BiDEL, which carries enough information to generate all delta code automatically. Without additional effort, new schema versions become immediately accessible and data changes in any version are visible in all schema versions at the same time. InVerDa also allows for easily changing the physical table design without affecting the availability of co-existing schema versions. This greatly increases robustness (orders of magnitude fewer lines of code) and allows for significant performance optimization. A main contribution is the formal evaluation that each schema version acts like a common full-fledged database schema independently of the chosen physical table design.

【Keywords】: co-existing schema versions; database evolution

76. Synthesizing Mapping Relationships Using Table Corpus.

Paper Link】 【Pages】:1117-1132

【Authors】: Yue Wang ; Yeye He

【Abstract】: Mapping relationships, such as (country, country-code) or (company, stock-ticker), are versatile data assets for an array of applications in data cleaning and data integration like auto-correction and auto-join. However, today there are no good repositories of mapping tables that can enable these intelligent applications. Given a corpus of tables such as web tables or spreadsheet tables, we observe that values of these mappings often exist in pairs of columns of the same tables. Motivated by their broad applicability, we study the problem of synthesizing mapping relationships using a large table corpus. Our synthesis process leverages compatibility of tables based on co-occurrence statistics, as well as constraints such as functional dependency. Experimental results using web tables and enterprise spreadsheets suggest that the proposed approach can produce high quality mappings.
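
To make the synthesis idea concrete, here is a small illustrative sketch (not the paper's algorithm): candidate (key, value) pairs are collected from column pairs across a table corpus, weighted by how many tables they co-occur in, and kept only if they respect a functional dependency from key to value. The function name and the min_support threshold are assumptions for illustration.

```python
# Illustrative sketch: aggregate candidate (key, value) pairs from two-column
# tables in a corpus, then keep a mapping only if it is frequent and respects
# the functional dependency key -> value.
from collections import Counter, defaultdict


def synthesize_mapping(tables, min_support=2):
    """tables: iterable of lists of (key, value) rows taken from column pairs."""
    pair_count = Counter()
    for rows in tables:
        for k, v in set(rows):          # count each pair at most once per table
            pair_count[(k, v)] += 1

    values_per_key = defaultdict(set)
    for (k, v), c in pair_count.items():
        if c >= min_support:
            values_per_key[k].add(v)

    # Enforce the functional dependency: keep keys with a single surviving value.
    return {k: next(iter(vs)) for k, vs in values_per_key.items() if len(vs) == 1}


corpus = [
    [("USA", "US"), ("Canada", "CA")],
    [("USA", "US"), ("Germany", "DE")],
    [("USA", "USA!"), ("Canada", "CA")],   # noisy pair with low support
]
print(synthesize_mapping(corpus))  # {'USA': 'US', 'Canada': 'CA'}
```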

【Keywords】: auto-fill; auto-join; data cleaning; data integration; functional dependency; mapping relationships; mapping tables

77. Waldo: An Adaptive Human Interface for Crowd Entity Resolution.

Paper Link】 【Pages】:1133-1148

【Authors】: Vasilis Verroios ; Hector Garcia-Molina ; Yannis Papakonstantinou

【Abstract】: In Entity Resolution, the objective is to find which records of a dataset refer to the same real-world entity. Crowd Entity Resolution uses humans, in addition to machine algorithms, to improve the quality of the outcome. We study a hybrid approach that combines two common interfaces for human tasks in Crowd Entity Resolution, taking into account key observations about the advantages and disadvantages of the two interfaces. We give a formal definition to the problem of human task selection and we derive algorithms with strong optimality guarantees. Our experiments with four real-world datasets show that our hybrid approach gives an improvement of 50% to 300% in the crowd cost to resolve a dataset, compared to using a single interface.

【Keywords】: adaptive human interface; crowd joins; crowdsourcing; data integration; deduplication; entity resolution; multi-item interface; waldo

SIGMOD Session 23. Tree & Graph Processing (3) 3

78. ZipG: A Memory-efficient Graph Store for Interactive Queries.

Paper Link】 【Pages】:1149-1164

【Authors】: Anurag Khandelwal ; Zongheng Yang ; Evan Ye ; Rachit Agarwal ; Ion Stoica

【Abstract】: We present ZipG, a distributed memory-efficient graph store for serving interactive graph queries. ZipG achieves memory efficiency by storing the input graph data using a compressed representation. What differentiates ZipG from other graph stores is its ability to execute a wide range of graph queries directly on this compressed representation. ZipG can thus execute a larger fraction of queries in main memory, achieving query interactivity. ZipG exposes a minimal API that is functionally rich enough to implement published functionalities from several industrial graph stores. We demonstrate this by implementing and evaluating graph queries from Facebook TAO, LinkBench, Graph Search and several other workloads on top of ZipG. On a single server with 244GB memory, ZipG executes tens of thousands of queries from these workloads on raw graph data of over half a TB; this leads to an order of magnitude (sometimes as much as 23×) higher throughput than Neo4j and Titan. We get similar gains in distributed settings compared to Titan.

【Keywords】: graph compression; graph store; interactive graph queries

79. All-in-One: Graph Processing in RDBMSs Revisited.

Paper Link】 【Pages】:1165-1180

【Authors】: Kangfei Zhao ; Jeffrey Xu Yu

【Abstract】: To support analytics on massive graphs such as online social networks, RDF, the Semantic Web, etc., many new graph algorithms are designed to query graphs for a specific problem, and many distributed graph processing systems are developed to support graph querying by programming. In this paper, we focus on the RDBMS, which has been well studied over decades to manage large datasets, and we revisit the issue of how an RDBMS can support graph processing at the SQL level. Our work is motivated by the fact that there are many relations stored in an RDBMS that are closely related to a graph in real applications and need to be used together to query the graph, and that an RDBMS is a system that can query and manage data while the data may be updated over time. To support graph processing, in this work, we propose 4 new relational algebra operations: MM-join, MV-join, anti-join, and union-by-update. Here, MM-join and MV-join are join operations between two matrices and between a matrix and a vector, respectively, followed by aggregation computing over groups, given that a matrix/vector can be represented by a relation. Both deal with the semiring by which many graph algorithms can be supported. The anti-join removes nodes/edges in a graph when they are unnecessary for the following computation. The union-by-update addresses value updates, to compute PageRank, for example. The 4 new relational algebra operations can be defined by the 6 basic relational algebra operations with group-by & aggregation. We revisit SQL recursive queries and show that the 4 operations with others are ensured to have a fixpoint, following the techniques studied in DATALOG, and we enhance the recursive WITH clause in SQL'99. We conduct extensive performance studies to test 10 graph algorithms using 9 large real graphs in 3 major RDBMSs. We show that RDBMSs are capable of dealing with graph processing in reasonable time. The focus of this work is at the SQL level. There is high potential to improve the efficiency by main-memory RDBMSs, efficient join processing in parallel, and new storage management.
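
To illustrate the MM-join semantics described above, the sketch below joins two matrices stored as relations on their shared dimension and aggregates per output cell under a pluggable semiring, so that (+, *) gives a matrix-multiplication step (e.g., one PageRank update) and (min, +) gives a shortest-path relaxation step. This is a plain-Python illustration of the operator's meaning, not the paper's SQL-level implementation.

```python
# Sketch of an MM-join: two matrices stored as relations (row, col, val) are
# joined on the shared dimension, and the joined values are aggregated per
# output cell under a configurable semiring.
from collections import defaultdict


def mm_join(A, B, multiply, add, zero):
    """A: {(i, j): val}, B: {(j, k): val}; returns {(i, k): val}."""
    by_row_B = defaultdict(list)
    for (j, k), v in B.items():
        by_row_B[j].append((k, v))

    C = defaultdict(lambda: zero)
    for (i, j), a in A.items():                  # relational join on j
        for k, b in by_row_B[j]:
            C[(i, k)] = add(C[(i, k)], multiply(a, b))   # group-by + aggregate
    return dict(C)


A = {(0, 1): 1.0, (1, 2): 2.0}
B = {(1, 2): 3.0, (2, 0): 1.0}
# (+, *) semiring: one matrix-multiplication step.
print(mm_join(A, B, lambda x, y: x * y, lambda x, y: x + y, 0.0))
# (min, +) semiring: one all-pairs shortest-path relaxation step.
print(mm_join(A, B, lambda x, y: x + y, min, float("inf")))
```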

【Keywords】: datalog; graph algorithms; mm-join; mv-join; relational database systems; sql recursion; xy-stratified

80. Computing A Near-Maximum Independent Set in Linear Time by Reducing-Peeling.

Paper Link】 【Pages】:1181-1196

【Authors】: Lijun Chang ; Wei Li ; Wenjie Zhang

【Abstract】: This paper studies the problem of efficiently computing a maximum independent set from a large graph, a fundamental problem in graph analysis. Due to the hardness of computing an exact maximum independent set or an approximate maximum independent set with an accuracy guarantee, existing algorithms resort to heuristic techniques that approximately compute a maximum independent set with good performance in practice but no theoretical accuracy guarantee. Observing that the existing techniques have various limits, in this paper, we aim to develop efficient algorithms (with linear or near-linear time complexity) that can generate a high-quality (large-size) independent set from a graph in practice. In particular, firstly we develop a Reducing-Peeling framework which iteratively reduces the graph size by applying reduction rules on vertices with very low degrees (Reducing) and temporarily removing the vertex with the highest degree (Peeling) if the reduction rules cannot be applied. Secondly, based on our framework we design two baseline algorithms, BDOne and BDTwo, by utilizing the existing reduction rules for handling degree-one and degree-two vertices, respectively. Both algorithms can generate higher-quality (larger-size) independent sets than the existing algorithms. Thirdly, we propose a linear-time algorithm, LinearTime, and a near-linear time algorithm, NearLinear, by designing new reduction rules and developing techniques for efficiently and incrementally applying reduction rules. In practice, LinearTime takes similar time and space to BDOne but computes a higher quality independent set, similar in size to that of an independent set generated by BDTwo. Moreover, in practice NearLinear has a good chance of generating a maximum independent set and it often generates near-maximum independent sets. Fourthly, we extend our techniques to accelerate the existing iterated local search algorithms. Extensive empirical studies show that all our algorithms output much larger independent sets than the existing linear-time algorithms while having a similar running time, as well as achieving significant speedups over the existing iterated local search algorithms.
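
A minimal sketch of the Reducing-Peeling framework follows: it applies only the simple degree-zero and degree-one reduction rules, and peels the highest-degree vertex when neither applies. The paper's BDTwo, LinearTime and NearLinear algorithms add degree-two rules, new reductions and incremental bookkeeping that this toy version omits.

```python
# Toy sketch of the Reducing-Peeling idea for near-maximum independent sets.
def near_max_independent_set(adj):
    """adj: {v: set(neighbors)} (undirected, no self-loops). Returns an independent set."""
    adj = {v: set(ns) for v, ns in adj.items()}
    indep = set()

    def remove(v):
        for u in adj.pop(v):
            adj[u].discard(v)

    while adj:
        deg0 = [v for v, ns in adj.items() if not ns]
        deg1 = [v for v, ns in adj.items() if len(ns) == 1]
        if deg0:                       # Reducing: an isolated vertex is always safe
            v = deg0[0]
            indep.add(v)
            remove(v)
        elif deg1:                     # Reducing: take a degree-one vertex,
            v = deg1[0]                # then drop its only neighbor
            u = next(iter(adj[v]))
            indep.add(v)
            remove(v)
            remove(u)
        else:                          # Peeling: discard the highest-degree vertex
            v = max(adj, key=lambda x: len(adj[x]))
            remove(v)
    return indep


graph = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}   # path 1-2-3-4
print(near_max_independent_set(graph))           # e.g. {1, 3}
```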

【Keywords】: linear time; maximum independent set; minimum vertex cover; power-law graphs; reducing-peeling; reduction rules

SIGMOD Session 24. Spatial and Multidimensional Data (1) 3

81. Utility-Aware Ridesharing on Road Networks.

Paper Link】 【Pages】:1197-1210

【Authors】: Peng Cheng ; Hao Xin ; Lei Chen

【Abstract】: Ridesharing enables drivers to share any empty seats in their vehicles with riders to improve the efficiency of transportation for the benefit of both drivers and riders. Different from existing studies in ridesharing that focus on minimizing the travel costs of vehicles, we consider that the satisfaction of riders (the utility values) is more important nowadays. Thus, we formulate the problem of utility-aware ridesharing on road networks (URR) with the goal of providing the optimal rider schedules for vehicles to maximize the overall utility, subject to spatial-temporal and capacity constraints. To assign a new rider to a given vehicle, we propose an efficient algorithm that achieves a minimum increase in travel cost without reordering the existing schedule of the vehicle. We prove that the URR problem is NP-hard by reducing it from the 0-1 Knapsack problem, and that it is unlikely to be approximated within any constant factor in polynomial time, through a reduction from the DENSEST k-SUBGRAPH problem. Therefore, we propose three efficient approximate algorithms, including a bilateral arrangement algorithm, an efficient greedy algorithm and a grouping-based scheduling algorithm, to assign riders to suitable vehicles with a high overall utility. Through extensive experiments, we demonstrate the efficiency and effectiveness of our URR approaches on both real and synthetic data sets.
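
As an illustration of the insertion step mentioned above (assigning a new rider to a vehicle without reordering its existing schedule), the sketch below enumerates all pickup/drop-off insertion positions and keeps the cheapest one. It uses plain Euclidean distances and ignores the spatio-temporal and capacity constraints that the actual URR algorithms enforce.

```python
# Illustrative sketch of rider insertion into an existing vehicle schedule.
from itertools import combinations
from math import dist, inf


def route_cost(stops):
    return sum(dist(a, b) for a, b in zip(stops, stops[1:]))


def best_insertion(schedule, pickup, dropoff):
    """schedule: list of (x, y) stops; the first entry is the vehicle position."""
    base = route_cost(schedule)
    best, best_increase = None, inf
    n = len(schedule)
    for i, j in combinations(range(1, n + 2), 2):   # pickup always before drop-off
        cand = list(schedule)
        cand.insert(i, pickup)
        cand.insert(j, dropoff)
        increase = route_cost(cand) - base
        if increase < best_increase:
            best, best_increase = cand, increase
    return best, best_increase


schedule = [(0, 0), (4, 0), (8, 0)]
print(best_insertion(schedule, pickup=(2, 1), dropoff=(6, 1)))
```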

【Keywords】: ridesharing; task scheduling

82. Distance Oracle on Terrain Surface.

Paper Link】 【Pages】:1211-1226

【Authors】: Victor Junqiu Wei ; Raymond Chi-Wing Wong ; Cheng Long ; David M. Mount

【Abstract】: With the advances in geo-spatial positioning and computer graphics technology, digital terrain data has become increasingly common. Query processing on terrain data has attracted considerable attention from both academia and industry. One fundamental and important query is the shortest distance query, and many other applications such as proximity queries (including nearest neighbor queries and range queries), 3D object feature vector construction and 3D object data mining are built on the result of the shortest distance query. In this paper, motivated by a variety of applications, we study the shortest distance query, which finds the shortest distance between one point-of-interest and another on the surface of the terrain. As observed by existing studies, computing the exact shortest distance is very expensive. Some existing studies proposed ε-approximate distance oracles, where ε is a non-negative real number serving as an error parameter. However, the best-known algorithm has a large oracle construction time, a large oracle size and a large distance query time. Motivated by this, we propose a novel ε-approximate distance oracle called the Space Efficient distance oracle (SE), which has a small oracle construction time, a small oracle size and a small distance query time due to its compactness in storing concise information about pairwise distances between any two points-of-interest. Our experimental results show that the oracle construction time, the oracle size and the distance query time of SE improve on those of the best-known algorithm by up to two orders of magnitude, up to 3 orders of magnitude and up to 5 orders of magnitude, respectively.

【Keywords】: shortest distance queries; spatial database; terrain

83. Efficient Computation of Top-k Frequent Terms over Spatio-temporal Ranges.

Paper Link】 【Pages】:1227-1241

【Authors】: Pritom Ahmed ; Mahbub Hasan ; Abhijith Kashyap ; Vagelis Hristidis ; Vassilis J. Tsotras

【Abstract】: The wide availability of tracking devices has drastically increased the role of geolocation in social networks, resulting in new commercial applications; for example, marketers can identify current trending topics within a region of interest and focus their products accordingly. In this paper we study a basic analytics query on geotagged data, namely: given a spatiotemporal region, find the most frequent terms among the social posts in that region. While there has been prior work on keyword search on spatial data (find the objects nearest to the query point that contain the query keywords), and on group keyword search on spatial data (retrieving groups of objects), our problem is different in that it returns keywords and aggregated frequencies as output, instead of having the keyword as input. Moreover, we differ from works addressing the streamed version of this query in that we operate on large, disk-resident data and we provide exact answers. We propose an index structure and algorithms to efficiently answer such top-k spatiotemporal range queries, which we refer to as Top-k Frequent Spatiotemporal Terms (kFST) queries. Our index structure employs an R-tree augmented by top-k sorted term lists (STLs), where a key challenge is to balance the size of the index to achieve faster execution and smaller space requirements. We theoretically study and experimentally validate the ideal length of the stored term lists, and perform detailed experiments to evaluate the performance of the proposed methods compared to baselines on real datasets.

【Keywords】: social networks; spatio-temporal databases; top-k

SIGMOD Session 25. Optimization and Main Memory (1) 3

84. Optimizing Iceberg Queries with Complex Joins.

Paper Link】 【Pages】:1243-1258

【Authors】: Brett Walenz ; Sudeepa Roy ; Jun Yang

【Abstract】: Iceberg queries, commonly used for decision support, find groups whose aggregate values are above or below a threshold. In practice, iceberg queries are often posed over complex joins that are expensive to evaluate. This paper proposes a framework for combining a number of techniques---a-priori, memoization, and pruning---to optimize iceberg queries with complex joins. A-priori pushes partial GROUP BY and HAVING conditions before a join to reduce its input size. Memoization caches and reuses join computation results. Pruning uses cached results to infer that certain tuples cannot contribute to the final query result, and short-circuits join computation. We formally derive conditions for correctly applying these techniques. Our practical rewrite algorithm produces highly efficient SQL that can exploit combinations of optimization opportunities in ways previously not possible. We evaluate our PostgreSQL-based implementation experimentally and show that it outperforms both baseline PostgreSQL and a commercial database system.

【Keywords】: databases; iceberg; iceberg queries; postgresql; query optimization

85. The Dynamic Yannakakis Algorithm: Compact and Efficient Query Processing Under Updates.

Paper Link】 【Pages】:1259-1274

【Authors】: Muhammad Idris ; Martín Ugarte ; Stijn Vansummeren

【Abstract】: Modern computing tasks such as real-time analytics require refresh of query results under high update rates. Incremental View Maintenance (IVM) approaches this problem by materializing results in order to avoid recomputation. IVM naturally induces a trade-off between the space needed to maintain the materialized results and the time used to process updates. In this paper, we show that the full materialization of results is a barrier for more general optimization strategies. In particular, we present a new approach for evaluating queries under updates. Instead of the materialization of results, we require a data structure that allows: (1) linear time maintenance under updates, (2) constant-delay enumeration of the output, (3) constant-time lookups in the output, while (4) using only linear space in the size of the database. We call such a structure a Dynamic Constant-delay Linear Representation (DCLR) for the query. We show that DYN, a dynamic version of the Yannakakis algorithm, yields DCLRs for the class of free-connex acyclic CQs. We show that this is optimal in the sense that no DCLR can exist for CQs that are not free-connex acyclic. Moreover, we identify a sub-class of queries for which DYN features constant-time update per tuple and show that this class is maximal. Finally, using the TPC-H and TPC-DS benchmarks, we experimentally compare DYN and a higher-order IVM (HIVM) engine. Our approach is not only more efficient in terms of memory consumption (as expected), but is also consistently faster in processing updates.

【Keywords】: acyclic joins; dynamic query processing; incremental view maintenance

86. Revisiting Reuse in Main Memory Database Systems.

Paper Link】 【Pages】:1275-1289

【Authors】: Kayhan Dursun ; Carsten Binnig ; Ugur Çetintemel ; Tim Kraska

【Abstract】: Reusing intermediates in databases to speed up analytical query processing has been studied in prior work. Existing solutions require intermediate results of individual operators to be materialized using materialization operators. However, inserting such materialization operations into a query plan not only incurs additional execution costs but also often eliminates important cache- and register-locality opportunities, resulting in even higher performance penalties. This paper studies a novel reuse model for intermediates, which caches internal physical data structures materialized during query processing (due to pipeline breakers) and externalizes them so that they become reusable for upcoming operations. We focus on hash tables, the most commonly used internal data structure in main memory databases to perform join and aggregation operations. As queries arrive, our reuse-aware optimizer reasons about the reuse opportunities for hash tables, employing cost models that take into account hash table statistics together with the CPU and data movement costs within the cache hierarchy. Experimental results, based on our prototype implementation, demonstrate performance gains of 2x for typical analytical workloads with no additional overhead for materializing intermediates.
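
A toy sketch of the reuse model described above: the hash table built for a hash join is cached under a (table, key column) identifier and reused by later operations instead of being rebuilt. The caching policy here is trivial and always reuses; the paper's contribution is the reuse-aware optimizer and cost models that decide when such reuse actually pays off.

```python
# Minimal sketch of hash-table reuse across hash joins.
from collections import defaultdict

_hash_table_cache = {}


def build_or_reuse_hash_table(name, rows, key_col):
    cache_key = (name, key_col)
    if cache_key not in _hash_table_cache:
        table = defaultdict(list)
        for row in rows:
            table[row[key_col]].append(row)
        _hash_table_cache[cache_key] = table      # externalize for later reuse
    return _hash_table_cache[cache_key]


def hash_join(build_name, build_rows, build_key, probe_rows, probe_key):
    ht = build_or_reuse_hash_table(build_name, build_rows, build_key)
    for row in probe_rows:
        for match in ht.get(row[probe_key], []):
            yield {**match, **row}


customers = [{"cid": 1, "name": "Ada"}, {"cid": 2, "name": "Bob"}]
orders = [{"cid": 1, "item": "disk"}, {"cid": 2, "item": "ram"}]
list(hash_join("customers", customers, "cid", orders, "cid"))   # builds the table
list(hash_join("customers", customers, "cid", orders, "cid"))   # reuses it
```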

【Keywords】: hash tables; main-memory database systems; query optimization; reuse

SIGMOD Session 26. Privacy 4

87. Pufferfish Privacy Mechanisms for Correlated Data.

Paper Link】 【Pages】:1291-1306

【Authors】: Shuang Song ; Yizhen Wang ; Kamalika Chaudhuri

【Abstract】: Many modern databases include personal and sensitive correlated data, such as private information on users connected together in a social network, and measurements of physical activity of single subjects across time. However, differential privacy, the current gold standard in data privacy, does not adequately address privacy issues in this kind of data. This work looks at a recent generalization of differential privacy, called Pufferfish, that can be used to address privacy in correlated data. The main challenge in applying Pufferfish is a lack of suitable mechanisms. We provide the first mechanism -- the Wasserstein Mechanism -- which applies to any general Pufferfish framework. Since this mechanism may be computationally inefficient, we provide an additional mechanism that applies to some practical cases such as physical activity measurements across time, and is computationally efficient. Our experimental evaluations indicate that this mechanism provides privacy and utility for synthetic as well as real data in two separate domains.
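
For orientation only, here is a hedged sketch of the general shape such a mechanism can take, written in generic Pufferfish notation; the symbols below (Θ for the set of data-generating distributions, S_pairs for the discriminative secret pairs, F for the query) are our own shorthand, and the exact form should be checked against the paper.

```latex
% Hedged sketch in generic notation; not a verbatim statement of the paper's mechanism.
W \;=\; \sup_{(s_i, s_j) \in S_{\mathrm{pairs}},\; \theta \in \Theta}
        W_{\infty}\!\bigl( P(F(X) \mid s_i, \theta),\; P(F(X) \mid s_j, \theta) \bigr),
\qquad
M(D) \;=\; F(D) + \mathrm{Lap}\!\left(\tfrac{W}{\epsilon}\right).
```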

【Keywords】: differential privacy; privacy; pufferfish privacy

88. Bolt-on Differential Privacy for Scalable Stochastic Gradient Descent-based Analytics.

Paper Link】 【Pages】:1307-1322

【Authors】: Xi Wu ; Fengan Li ; Arun Kumar ; Kamalika Chaudhuri ; Somesh Jha ; Jeffrey F. Naughton

【Abstract】: While significant progress has been made separately on analytics systems for scalable stochastic gradient descent (SGD) and private SGD, none of the major scalable analytics frameworks have incorporated differentially private SGD. There are two inter-related issues behind this disconnect between research and practice: (1) low model accuracy due to the noise added to guarantee privacy, and (2) the high development and runtime overhead of the private algorithms. This paper takes a first step to remedy this disconnect and proposes a private SGD algorithm to address both issues in an integrated manner. In contrast to the white-box approach adopted by previous work, we revisit and use the classical technique of output perturbation to devise a novel "bolt-on" approach to private SGD. While our approach trivially addresses (2), it makes (1) even more challenging. We address this challenge by providing a novel analysis of the L2-sensitivity of SGD, which allows, under the same privacy guarantees, better convergence of SGD when only a constant number of passes can be made over the data. We integrate our algorithm, as well as other state-of-the-art differentially private SGD algorithms, into Bismarck, a popular scalable SGD-based analytics system on top of an RDBMS. Extensive experiments show that our algorithm can be easily integrated, incurs virtually no overhead, scales well, and most importantly, yields substantially better (up to 4X) test accuracy than the state-of-the-art algorithms on many real datasets.
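
A minimal sketch of the output-perturbation ("bolt-on") idea: run ordinary SGD and perturb only the final model, with noise calibrated to an L2-sensitivity bound. The value of delta_sensitivity is assumed to come from an analysis like the paper's, and the gamma-norm/uniform-direction noise shown is one standard recipe for ε-differential privacy, not necessarily the paper's exact construction.

```python
# Sketch of output perturbation for SGD; the sensitivity bound is assumed given.
import numpy as np


def private_sgd(grad_fn, data, dim, epochs, lr, delta_sensitivity, epsilon,
                rng=np.random.default_rng(0)):
    w = np.zeros(dim)
    for _ in range(epochs):                 # plain (non-private) SGD passes
        for x, y in data:
            w -= lr * grad_fn(w, x, y)
    # Bolt-on step: perturb only the final model with high-dimensional
    # Laplace-style noise (gamma-distributed norm, uniform direction).
    noise_norm = rng.gamma(shape=dim, scale=delta_sensitivity / epsilon)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    return w + noise_norm * direction


def logistic_grad(w, x, y):
    """Logistic-regression gradient for labels in {-1, +1}."""
    return -y * x / (1.0 + np.exp(y * np.dot(w, x)))


data = [(np.array([1.0, 0.5]), 1), (np.array([-1.0, -0.3]), -1)]
print(private_sgd(logistic_grad, data, dim=2, epochs=10, lr=0.1,
                  delta_sensitivity=0.05, epsilon=1.0))
```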

【Keywords】: differential privacy; optimization; scalable data analytics; stochastic gradient descent

89. Pythia: Data Dependent Differentially Private Algorithm Selection.

Paper Link】 【Pages】:1323-1337

【Authors】: Ios Kotsogiannis ; Ashwin Machanavajjhala ; Michael Hay ; Gerome Miklau

【Abstract】: Differential privacy has emerged as a preferred standard for ensuring privacy in analysis tasks on sensitive datasets. Recent algorithms have allowed for significantly lower error by adapting to properties of the input data. These so-called data-dependent algorithms have different error rates for different inputs. There is now a complex and growing landscape of algorithms without a clear winner that can offer low error over all datasets. As a result, the best possible error rates are not attainable in practice, because the data curator cannot know which algorithm to select prior to actually running the algorithm. We address this challenge by proposing a novel meta-algorithm designed to relieve the data curator of the burden of algorithm selection. It works by learning (from non-sensitive data) the association between dataset properties and the best-performing algorithm. The meta-algorithm is deployed by first testing the input for low-sensitivity properties and then using the results to select a good algorithm. The result is an end-to-end differentially private system: Pythia, which we show offers improvements over using any single algorithm alone. We empirically demonstrate the benefit of Pythia for the tasks of releasing histograms, answering 1- and 2-dimensional range queries, as well as for constructing private Naive Bayes classifiers.

【Keywords】: differential privacy; end-to-end privacy; regret learning

90. Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics.

Paper Link】 【Pages】:1339-1354

【Authors】: Samuel Haney ; Ashwin Machanavajjhala ; John M. Abowd ; Matthew Graham ; Mark Kutzbach ; Lars Vilhuber

【Abstract】: National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data is protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures. In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter ε ≥ 1, the additive error introduced by our provably private algorithms is comparable to, and in some cases better than, the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional SDL algorithms. Those queries are fodder for future research.

【Keywords】: differential privacy; pufferfish privacy; u.s. census bureau

SIGMOD Session 27. Cleaning, Versioning, Fusion (2) 4

91. Online Deduplication for Databases.

Paper Link】 【Pages】:1355-1368

【Authors】: Lianghong Xu ; Andrew Pavlo ; Sudipta Sengupta ; Gregory R. Ganger

【Abstract】: dbDedup is a similarity-based deduplication scheme for on-line database management systems (DBMSs). Beyond block-level compression of individual database pages or operation log (oplog) messages, as used in today's DBMSs, dbDedup uses byte-level delta encoding of individual records within the database to achieve greater savings. dbDedup's single-pass encoding method can be integrated into the storage and logging components of a DBMS to provide two benefits: (1) reduced size of data stored on disk beyond what traditional compression schemes provide, and (2) reduced amount of data transmitted over the network for replication services. To evaluate our work, we implemented dbDedup in a distributed NoSQL DBMS and analyzed its properties using four real datasets. Our results show that dbDedup achieves up to 37x reduction in the storage size and replication traffic of the database on its own and up to 61x reduction when paired with the DBMS's block-level compression. dbDedup provides both benefits with negligible effect on DBMS throughput or client latency (average and tail).
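
To illustrate the byte-level delta encoding that dbDedup relies on, the sketch below stores a record as an edit script against a similar, previously stored record and reconstructs it on read. It uses Python's difflib for brevity; dbDedup's similarity-based candidate selection and its single-pass encoder are more elaborate.

```python
# Toy byte-level delta encoding against a similar base record.
import difflib


def delta_encode(base: bytes, record: bytes):
    ops = difflib.SequenceMatcher(a=base, b=record).get_opcodes()
    # For "equal" regions only the base offsets are kept; payload bytes are
    # stored only for inserted or replaced regions.
    return [(tag, i1, i2, b"" if tag == "equal" else record[j1:j2])
            for tag, i1, i2, j1, j2 in ops]


def delta_decode(base: bytes, delta) -> bytes:
    out = bytearray()
    for tag, i1, i2, payload in delta:
        out += base[i1:i2] if tag == "equal" else payload
    return bytes(out)


old = b'{"id": 17, "name": "Ada Lovelace", "city": "London"}'
new = b'{"id": 17, "name": "Ada Lovelace", "city": "Leeds"}'
delta = delta_encode(old, new)
assert delta_decode(old, delta) == new
print(sum(len(p) for *_, p in delta), "payload bytes vs", len(new), "raw bytes")
```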

【Keywords】: databases; deduplication; delta compression; replication

92. QFix: Diagnosing Errors through Query Histories.

Paper Link】 【Pages】:1369-1384

【Authors】: Xiaolan Wang ; Alexandra Meliou ; Eugene Wu

【Abstract】: Data-driven applications rely on the correctness of their data to function properly and effectively. Errors in data can be incredibly costly and disruptive, leading to loss of revenue, incorrect conclusions, and misguided policy decisions. While data cleaning tools can purge datasets of many errors before the data is used, applications and users interacting with the data can introduce new errors. Subsequent valid updates can obscure these errors and propagate them through the dataset causing more discrepancies. Even when some of these discrepancies are discovered, they are often corrected superficially, on a case-by-case basis, further obscuring the true underlying cause, and making detection of the remaining errors harder. In this paper, we propose QFix, a framework that derives explanations and repairs for discrepancies in relational data, by analyzing the effect of queries that operated on the data and identifying potential mistakes in those queries. QFix is flexible, handling scenarios where only a subset of the true discrepancies is known, and robust to different types of update workloads. We make four important contributions: (a) we formalize the problem of diagnosing the causes of data errors based on the queries that operated on and introduced errors to a dataset; (b) we develop exact methods for deriving diagnoses and fixes for identified errors using state-of-the-art tools; (c) we present several optimization techniques that improve our basic approach without compromising accuracy, and (d) we leverage a tradeoff between accuracy and performance to scale diagnosis to large datasets and query logs, while achieving near-optimal results. We demonstrate the effectiveness of QFix through extensive evaluation over benchmark and synthetic data.

【Keywords】: data errors; data repair; explanation; qfix; query errors; query history; query log; query provenance

93. UGuide: User-Guided Discovery of FD-Detectable Errors.

Paper Link】 【Pages】:1385-1397

【Authors】: Saravanan Thirumuruganathan ; Laure Berti-Equille ; Mourad Ouzzani ; Jorge-Arnulfo Quiané-Ruiz ; Nan Tang

【Abstract】: Error detection is the process of identifying problematic data cells that are different from their ground truth. Functional dependencies (FDs) have been widely studied in support of this process. Oftentimes, it is assumed that FDs are given by experts. Unfortunately, it is usually hard and expensive for the experts to define such FDs. In addition, automatic data profiling over dirty data in order to find correct FDs is known to be a hard problem. In this paper, we propose an end-to-end solution to detect FD-detectable errors from dirty data. The broad intuition is that given a dirty dataset, it is feasible to automatically find approximate FDs, as well as data that is possibly erroneous. Arguably, at this point, only experts can confirm true FDs or true errors. However, in practice, experts never have enough budget to find all errors. Hence, our problem is: given a limited budget of expert time, which questions (about FDs, cells, or tuples) should we ask so that we can find as many data errors as possible? We present efficient algorithms to interact with the user. Extensive experiments demonstrate that our proposed framework is effective in detecting errors from dirty data.

【Keywords】: budgeted error detection; functional dependencies; interactive error detection

94. SLiMFast: Guaranteed Results for Data Fusion and Source Reliability.

Paper Link】 【Pages】:1399-1414

【Authors】: Theodoros Rekatsinas ; Manas Joglekar ; Hector Garcia-Molina ; Aditya G. Parameswaran ; Christopher Ré

【Abstract】: We focus on data fusion, i.e., the problem of unifying conflicting data from data sources into a single representation by estimating the source accuracies. We propose SLiMFast, a framework that expresses data fusion as a statistical learning problem over discriminative probabilistic models, which in many cases correspond to logistic regression. In contrast to previous approaches that use complex generative models, discriminative models make fewer distributional assumptions over data sources and allow us to obtain rigorous theoretical guarantees. Furthermore, we show how SLiMFast enables incorporating domain knowledge into data fusion, yielding accuracy improvements of up to 50% over state-of-the-art baselines. Building upon our theoretical results, we design an optimizer that obviates the need for users to manually select an algorithm for learning SLiMFast's parameters. We validate our optimizer on multiple real-world datasets and show that it can accurately predict the learning algorithm that yields the best data fusion results.

【Keywords】: data conflict resolution; data explanation; data fusion; data integration; data quality; truth discovery; unreliable sources; valuable sources

SIGMOD Session 28. Crowdsourcing 4

95. Crowdsourced Top-k Queries by Confidence-Aware Pairwise Judgments.

Paper Link】 【Pages】:1415-1430

【Authors】: Ngai Meng Kou ; Yan Li ; Hao Wang ; Leong Hou U ; Zhiguo Gong

【Abstract】: Crowdsourced query processing is an emerging technique that tackles computationally challenging problems using human intelligence. The basic idea is to decompose a computationally challenging problem into a set of human-friendly microtasks (e.g., pairwise comparisons) that are distributed to and answered by the crowd. The solution of the problem is then computed (e.g., by aggregation) based on the crowdsourced answers to the microtasks. In this work, we revisit the crowdsourced processing of top-k queries, aiming at (1) securing the quality of crowdsourced comparisons by a certain confidence level and (2) minimizing the total monetary cost. To secure the quality of each paired comparison, we employ two statistical tools, Student's t-distribution estimation and Stein's estimation, to estimate the confidence interval of the underlying mean value, which is then used to draw a conclusion for the comparison. Based on the pairwise comparison process, we attempt to minimize the monetary cost of the top-k processing within a Select-Partition-Rank framework. Our experiments, conducted on four real datasets, demonstrate that our stochastic method outperforms other existing top-k processing techniques by a clear margin.
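
As a rough illustration of the confidence-aware pairwise judgment described above, the sketch below decides a single comparison once the Student's-t confidence interval of the workers' scores excludes the indifference point, and otherwise asks for more votes. Stein's estimation and the Select-Partition-Rank cost optimization from the paper are not modeled; the score scale and thresholds are assumptions.

```python
# Confidence-interval-based decision for one crowdsourced pairwise comparison.
from statistics import mean, stdev
from scipy.stats import t


def compare_with_confidence(scores, confidence=0.95, indifference=0.0):
    """scores: worker judgments in [-1, 1]; >0 favors A, <0 favors B."""
    n = len(scores)
    if n < 2:
        return "need more votes"
    m, s = mean(scores), stdev(scores)
    half_width = t.ppf((1 + confidence) / 2, df=n - 1) * s / n ** 0.5
    lo, hi = m - half_width, m + half_width
    if lo > indifference:
        return "A wins"
    if hi < indifference:
        return "B wins"
    return "need more votes"


print(compare_with_confidence([0.6, 0.8, 0.7, 0.9, 0.65]))   # A wins
print(compare_with_confidence([0.2, -0.3, 0.1]))             # need more votes
```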

【Keywords】: confidence-aware; crowdsourcing; pairwise judgments; top-k

96. Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services.

Paper Link】 【Pages】:1431-1446

【Authors】: Sanjib Das ; Paul Suganthan G. C. ; AnHai Doan ; Jeffrey F. Naughton ; Ganesh Krishnan ; Rohit Deep ; Esteban Arcaute ; Vijay Raghavendra ; Youngchoon Park

【Abstract】: Many works have applied crowdsourcing to entity matching (EM). While promising, these approaches are limited in that they often require a developer to be in the loop. As such, it is difficult for an organization to deploy multiple crowdsourced EM solutions, because there are simply not enough developers. To address this problem, a recent work has proposed Corleone, a solution that crowdsources the entire EM workflow, requiring no developers. While promising, Corleone is severely limited in that it does not scale to large tables. We propose Falcon, a solution that scales up the hands-off crowdsourced EM approach of Corleone, using RDBMS-style query execution and optimization over a Hadoop cluster. Specifically, we define a set of operators and develop efficient implementations. We translate a hands-off crowdsourced EM workflow into a plan consisting of these operators, optimize, then execute the plan. These plans involve both machine and crowd activities, giving rise to novel optimization techniques such as using crowd time to mask machine time. Extensive experiments show that Falcon can scale up to tables of millions of tuples, thus providing a practical solution for hands-off crowdsourced EM, to build cloud-based EM services.

【Keywords】: cloud services; crowdsourcing; entity matching; hands-off

97. CrowdDQS: Dynamic Question Selection in Crowdsourcing Systems.

Paper Link】 【Pages】:1447-1462

【Authors】: Asif R. Khan ; Hector Garcia-Molina

【Abstract】: In this paper, we present CrowdDQS, a system that uses the most recent set of crowdsourced voting evidence to dynamically issue questions to workers on Amazon Mechanical Turk (AMT). CrowdDQS posts all questions to AMT in a single batch, but delays the decision of the exact question to issue a worker until the last moment, concentrating votes on uncertain questions to maximize accuracy. Unlike previous works, CrowdDQS also (1) optionally can decide when it is more beneficial to issue gold standard questions with known answers than to solicit new votes (both can help us estimate worker accuracy, but gold standard questions provide a less noisy estimate of worker accuracy at the expense of not obtaining new votes), (2) estimates worker accuracies in real-time even with limited evidence (with or without gold standard questions), and (3) infers the distribution of worker skill levels to actively block poor workers. We deploy our system live on AMT to over 1000 crowdworkers, and find that CrowdDQS can accurately answer questions using up to 6x fewer votes than standard approaches. We also find there are many non-obvious practical challenges involved in deploying such a system seamlessly to crowdworkers, and discuss techniques to overcome these challenges.

【Keywords】: crowdsourcing; error estimation; quality control; task assignment

98. CDB: Optimizing Queries with Crowd-Based Selections and Joins.

Paper Link】 【Pages】:1463-1478

【Authors】: Guoliang Li ; Chengliang Chai ; Ju Fan ; Xueping Weng ; Jian Li ; Yudian Zheng ; Yuanbing Li ; Xiang Yu ; Xiaohang Zhang ; Haitao Yuan

【Abstract】: Crowdsourcing database systems have been proposed to leverage crowd-powered operations to encapsulate the complexities of interacting with the crowd. Existing systems suffer from two major limitations. Firstly, in order to optimize a query, they often adopt the traditional tree model to select an optimized table-level join order. However, the tree model provides a coarse-grained optimization, which generates the same order for different joined tuples and misses the optimization potential of evaluating different joined tuples with different orders. Secondly, they mainly focus on optimizing the monetary cost. In fact, there are three optimization goals (i.e., smaller monetary cost, lower latency, and higher quality) in crowdsourcing, and this calls for a system that enables multi-goal optimization. To address these limitations, we develop a crowd-powered database system CDB that supports crowd-based query optimizations, with a focus on join and selection. CDB has fundamental differences from existing systems. First, CDB employs a graph-based query model that provides more fine-grained query optimization. Second, CDB adopts a unified framework to perform the multi-goal optimization based on the graph model. We have implemented our system and deployed it on AMT, CrowdFlower and ChinaCrowd. We have also created a benchmark for evaluating crowd-powered databases. We have conducted both simulated and real experiments, and the experimental results demonstrate the performance superiority of CDB on cost, latency and quality.

【Keywords】: crowd-based join; crowd-based selection; crowdsourcing; crowdsourcing optimization

SIGMOD Session 29. Spatial and Multidimensional Data (2) 4

99. Scaling Locally Linear Embedding.

Paper Link】 【Pages】:1479-1492

【Authors】: Yasuhiro Fujiwara ; Naoki Marumo ; Mathieu Blondel ; Koh Takeuchi ; Hideaki Kim ; Tomoharu Iwata ; Naonori Ueda

【Abstract】: Locally Linear Embedding (LLE) is a popular approach to dimensionality reduction as it can effectively represent nonlinear structures of high-dimensional data. For dimensionality reduction, it computes a nearest neighbor graph from a given dataset where edge weights are obtained by applying the Lagrange multiplier method, and it then computes eigenvectors of the LLE kernel where the edge weights are used to obtain the kernel. Although LLE is used in many applications, its computation cost is significantly high. This is because, in obtaining edge weights, its computation cost is cubic in the number of edges incident to each data point. In addition, the cost of obtaining the eigenvectors of the LLE kernel is cubic in the number of data points. Our approach, Ripple, is based on two ideas: (1) it incrementally updates the edge weights by exploiting the Woodbury formula and (2) it efficiently computes eigenvectors of the LLE kernel by exploiting the LU decomposition-based inverse power method. Experiments show that Ripple is significantly faster than the original LLE approach while guaranteeing the same dimensionality reduction results.
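
For reference, the Woodbury matrix identity that the incremental update relies on is, in its standard form (the variable names here are generic, not the paper's):

```latex
(A + UCV)^{-1} \;=\; A^{-1} \;-\; A^{-1}U\left(C^{-1} + V A^{-1} U\right)^{-1} V A^{-1}.
```

Applying this identity after a low-rank change to A updates the inverse without recomputing it from scratch, which is the source of the speedup when edge weights change incrementally.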

【Keywords】: dimensionality reduction; efficient; locally linear embedding

100. Dynamic Density Based Clustering.

Paper Link】 【Pages】:1493-1507

【Authors】: Junhao Gan ; Yufei Tao

【Abstract】: Dynamic clustering---how to efficiently maintain data clusters along with updates in the underlying dataset---is a difficult topic. This is especially true for density-based clustering, where objects are aggregated based on transitivity of proximity, under which deciding the cluster(s) of an object may require the inspection of numerous other objects. The phenomenon is unfortunate, given the popular usage of this clustering approach in many applications demanding data updates. Motivated by the above, we investigate the algorithmic principles for dynamic clustering by DBSCAN, a successful representative of density-based clustering, and ρ-approximate DBSCAN, proposed to bring down the computational hardness of the former on static data. Surprisingly, we prove that the ρ-approximate version suffers from the very same hardness when the dataset is fully dynamic, namely, when both insertions and deletions are allowed. We also show that this issue goes away as soon as a tiny further relaxation is applied, while still ensuring the same quality---known as the "sandwich guarantee"---of ρ-approximate DBSCAN. Our algorithms guarantee near-constant update processing, and outperform existing approaches by over two orders of magnitude.

【Keywords】: algorithms; approximate dbscan; dynamic clustering

101. Extracting Top-K Insights from Multi-dimensional Data.

Paper Link】 【Pages】:1509-1524

【Authors】: Bo Tang ; Shi Han ; Man Lung Yiu ; Rui Ding ; Dongmei Zhang

【Abstract】: OLAP tools have been extensively used by enterprises to make better and faster decisions. Nevertheless, they require users to specify group-by attributes and know precisely what they are looking for. This paper makes a first attempt at automatically extracting top-k insights from multi-dimensional data. This is useful not only for non-expert users, but also reduces the manual effort of data analysts. In particular, we propose the concept of an insight, which captures an interesting observation derived from aggregation results in multiple steps (e.g., rank by a dimension, compute the percentage of measure by a dimension). An example insight is: "Brand B's rank (across brands) falls over the years, in terms of the increase in sales". Our problem is to compute the top-k insights by a score function. This poses challenges in (i) the effectiveness of the result and (ii) the efficiency of computation. We propose a meaningful scoring function for insights to address (i). Then, we contribute a computation framework for top-k insights, together with a suite of optimization techniques (i.e., pruning, ordering, specialized cube, and computation sharing) to address (ii). Our experimental study on both real data and synthetic data verifies the effectiveness and efficiency of our proposed solution.

【Keywords】: data exploration; insight extraction; olap

102. QUILTS: Multidimensional Data Partitioning Framework Based on Query-Aware and Skew-Tolerant Space-Filling Curves.

Paper Link】 【Pages】:1525-1537

【Authors】: Shoji Nishimura ; Haruo Yokota

【Abstract】: Massive data management plays an increasingly important role in data analytics because data access is a major bottleneck. Data skipping is a promising technique to reduce the number of data accesses. Data skipping partitions data into pages and accesses only pages that contain data to be retrieved by a query. Therefore, effective data partitioning is required to minimize the number of page accesses. However, obtaining optimal data partitioning for a given query pattern and data distribution is an NP-hard problem. We propose a framework that involves a multidimensional indexing technique based on a space-filling curve. A space-filling curve defines which portions of the data are stored in the same page. Therefore, the problem can be interpreted as selecting a curve that clusters the data accessed by a query so as to minimize the number of page accesses. To solve this problem, we analyzed how different space-filling curves affect the number of page accesses. We found that it is critical for a curve to fit the query pattern and be robust against any data distribution. We propose a cost model for measuring how well a space-filling curve fits a given query pattern and tolerates data skew. We also propose a method for designing a query-aware and skew-tolerant curve for a given query pattern. We prototyped our framework using the designed query-aware and skew-tolerant curve. We conducted experiments using a skewed data set, and confirmed that our framework can reduce the number of page accesses by an order of magnitude for data warehousing (DWH) and geographic information systems (GIS) applications with real-world data.
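
For background, the sketch below shows the classic Z-order (Morton) curve, the kind of fixed space-filling curve that such partitioning frameworks start from; QUILTS's contribution is choosing query-aware, skew-tolerant curves rather than a fixed one. Points are ordered by their interleaved-bit Morton codes and cut into fixed-size pages, so nearby cells tend to land on the same page and a range query touches fewer pages.

```python
# Baseline Z-order (Morton) curve used to assign multidimensional points to pages.
def morton_code(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single curve position."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return code


def pages_for_points(points, page_size=4):
    """Sort points along the curve and cut them into fixed-size pages."""
    ordered = sorted(points, key=lambda p: morton_code(*p))
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]


points = [(x, y) for x in range(4) for y in range(4)]
for page in pages_for_points(points):
    print(page)      # each page covers one 2x2 quadrant of the 4x4 grid
```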

【Keywords】: multidimensional data partitioning; multidimensional index; space-filling curve

SIGMOD Session 30. Optimization and Main Memory (2) 3

103. Leveraging Re-costing for Online Optimization of Parameterized Queries with Guarantees.

Paper Link】 【Pages】:1539-1554

【Authors】: Anshuman Dutt ; Vivek R. Narasayya ; Surajit Chaudhuri

【Abstract】: Parametric query optimization (PQO) deals with the problem of finding and reusing a relatively small number of plans that can achieve good plan quality across multiple instances of a parameterized query. An ideal solution to PQO would process query instances online and ensure (a) tight, bounded cost sub-optimality for each instance, (b) low optimization overheads, and (c) only a small number of plans need to be stored. Existing solutions to online PQO however, fall short on at least one of the above metrics. We propose a plan re-costing based approach that enables us to perform well on all three metrics. We empirically show the effectiveness of our technique on industry benchmark and real-world query workloads with our modified version of the Microsoft SQL Server query optimizer.

【Keywords】: cost sub-optimality; online; parameterized queries; workload

104. Handling Environments in a Nested Relational Algebra with Combinators and an Implementation in a Verified Query Compiler.

Paper Link】 【Pages】:1555-1569

【Authors】: Joshua S. Auerbach ; Martin Hirzel ; Louis Mandel ; Avraham Shinnar ; Jérôme Siméon

【Abstract】: Algebras based on combinators, i.e., variable-free, have been proposed as a better representation for query compilation and optimization. A key benefit of combinators is that they avoid the need to handle variable shadowing or accidental capture during rewrites. This simplifies both the optimizer specification and its correctness analysis, but the environment from the source language has to be reified as records, which can lead to more complex query plans. This paper proposes NRAe, an extension of a combinators-based nested relational algebra (NRA) with built-in support for environments. We show that it can naturally encode an equivalent NRA with lambda terms and that all optimizations on NRA carry over to NRAe. This extension provides an elegant way to represent views in query plans, and can radically simplify compilation and optimization for source languages with rich environment manipulations. We have specified a query compiler using the Coq proof assistant with NRAe at its heart. Most of the compiler, including the query optimizer, is accompanied by a (machine-checked) correctness proof. The implementation is automatically extracted from the specification, resulting in a query compiler with a verified core.

【Keywords】: coq; environments; formal verification; nested queries; optimization; query compilers; relational algebra; variables handling

105. From In-Place Updates to In-Place Appends: Revisiting Out-of-Place Updates on Flash.

Paper Link】 【Pages】:1571-1586

【Authors】: Sergey Hardock ; Ilia Petrov ; Robert Gottstein ; Alejandro P. Buchmann

【Abstract】: Under update-intensive workloads (TPC, LinkBench) small updates dominate the write behavior, e.g. 70% of all updates change less than 10 bytes across all TPC OLTP workloads. These are typically performed as in-place updates and result in random writes in page-granularity, causing major write overhead on Flash storage, a write amplification of several hundred times and lower device longevity. In this paper we propose an approach that transforms those small in-place updates into small update deltas that are appended to the original page. We utilize the commonly ignored fact that modern Flash memories (SLC, MLC, 3D NAND) can handle appends to already programmed physical pages by using various low-level techniques such as ISPP to avoid expensive erases and page migrations. Furthermore, we extend the traditional NSM page layout with a delta-record area that can absorb those small updates. We propose a scheme to control the write behavior as well as the space allocation and sizing of database pages. The proposed approach has been implemented under Shore-MT and evaluated on real Flash hardware (OpenSSD) and a Flash emulator. Compared to In-Page Logging [21] it performs up to 62% fewer reads and writes and up to 74% fewer erases on a range of workloads. The experimental evaluation indicates: (i) significant reduction of erase operations resulting in twice the longevity of Flash devices under update-intensive workloads; (ii) 15%-60% lower read/write I/O latencies; (iii) up to 45% higher transactional throughput; (iv) 2x to 3x reduction in overall write amplification.
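
A toy sketch of the page-layout idea: a page keeps full records in its main area and absorbs small updates as deltas appended to a reserved delta-record area, replaying them on read. The delta-area sizing policy and the Flash-level append mechanics (ISPP, NoFTL integration) that the paper contributes are not modeled; all names here are illustrative.

```python
# Toy NSM-style page with a delta-record area that absorbs small updates.
class PageWithDeltaArea:
    def __init__(self, delta_capacity=8):
        self.records = {}          # slot -> dict of field values
        self.deltas = []           # appended (slot, field, new_value) triples
        self.delta_capacity = delta_capacity

    def insert(self, slot, record):
        self.records[slot] = dict(record)

    def update(self, slot, field, value):
        if len(self.deltas) < self.delta_capacity:
            self.deltas.append((slot, field, value))   # in-page append
            return True
        return False   # delta area full: caller falls back to a page rewrite

    def read(self, slot):
        record = dict(self.records[slot])
        for s, f, v in self.deltas:        # replay deltas in append order
            if s == slot:
                record[f] = v
        return record


page = PageWithDeltaArea()
page.insert(0, {"id": 42, "balance": 100})
page.update(0, "balance", 90)
print(page.read(0))    # {'id': 42, 'balance': 90}
```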

【Keywords】: dbms; delta-record; in-page appends; native flash; noftl; oltp; page-layout; write-amplification; write_delta

Demonstrations 31

106. Visual Graph Query Construction and Refinement.

Paper Link】 【Pages】:1587-1590

【Authors】: Robert Pienta ; Fred Hohman ; Acar Tamersoy ; Alex Endert ; Shamkant B. Navathe ; Hanghang Tong ; Duen Horng Chau

【Abstract】: Locating and extracting subgraphs from large network datasets is a challenge in many domains, one that often requires learning new querying languages. We will present the first demonstration of VISAGE, an interactive visual graph querying approach that empowers analysts to construct expressive queries, without writing complex code (see our video: https://youtu.be/l2L7Y5mCh1s). VISAGE guides the construction of graph queries using a data-driven approach, enabling analysts to specify queries with varying levels of specificity, by sampling matches to a query during the analyst's interaction. We will demonstrate and invite the audience to try VISAGE on a popular film-actor-director graph from Rotten Tomatoes.

【Keywords】: graph querying; interactive querying; query construction

107. Demonstration of the Cosette Automated SQL Prover.

Paper Link】 【Pages】:1591-1594

【Authors】: Shumo Chu ; Daniel Li ; Chenglong Wang ; Alvin Cheung ; Dan Suciu

【Abstract】: In this demonstration, we showcase COSETTE, the first automated prover for determining the equivalence of SQL queries. Despite theoretical limitations, COSETTE leverages recent advances in both automated constraint solving and interactive theorem proving to decide the equivalence of a wide range of real-world queries, including complex rewrite rules from the database literature. COSETTE can also validate the inequivalence of queries by finding counterexamples, i.e., database instances on which the two queries return different results. COSETTE can find counterexamples for many real-world inequivalent queries, including a number of real-world optimizer bugs. We showcase three representative applications of COSETTE: proving a query rewrite rule from magic set rewriting, finding counterexamples for an infamous optimizer bug, and an interactive visualization of automated grading results powered by COSETTE, where COSETTE is used to check the equivalence of students' answers against the standard solution. For the demo, the audience can walk through the three applications and explore COSETTE by interacting with the tool using an easy-to-use web interface.

【Keywords】: correctness; education; query processing; sql

108. Interactive Time Series Analytics Powered by ONEX.

Paper Link】 【Pages】:1595-1598

【Authors】: Rodica Neamtu ; Ramoza Ahsan ; Charles Lovering ; Cuong Nguyen ; Elke A. Rundensteiner ; Gábor N. Sárközy

【Abstract】: Modern applications in this digital age collect a staggering amount of time series data from economic growth rates to electrical household consumption habits. To make sense of it, domain analysts interactively sift through these time series collections in search of critical relationships between and recurring patterns within these time series. The ONEX (Online Exploration of Time Series) system supports effective exploratory analysis of time series collections composed of heterogeneous, variable-length and misaligned time series using robust alignment dynamic time warping (DTW) methods. To assure real-time responsiveness even for these complex and compute-intensive analytics, ONEX precomputes and then encodes time series relationships based on the inexpensive-to-compute Euclidean distance into the ONEX base. Thereafter, based on a solid formal foundation, ONEX uses DTW-enhanced analytics to correctly extract relevant time series matches on this Euclidean-prepared ONEX base. Our live interactive demonstration shows how our ONEX exploratory tool, supported by a rich array of visual interactions and expressive visualizations, enables efficient mining and interpretation of the MATTERS real data collection composed of economic, social, and education data trends across the fifty American states.

【Keywords】: data mining; dynamic time warping; time series analytics; visual analytics
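
Dynamic time warping itself is standard; the sketch below shows the classic O(nm) recurrence that an ONEX-style system pays for only on candidates surviving the cheap Euclidean pre-grouping. The toy series are invented for illustration.

```python
import math

def euclidean(x, y):
    """Cheap lock-step distance used for pre-grouping (requires equal lengths)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dtw(x, y):
    """Classic dynamic-time-warping distance via an O(len(x) * len(y)) table fill."""
    n, m = len(x), len(y)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return math.sqrt(cost[n][m])

# Misaligned but similar series: the Euclidean distance is large, DTW stays small.
a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]
print(euclidean(a, b), dtw(a, b))
```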

109. The VADA Architecture for Cost-Effective Data Wrangling.

Paper Link】 【Pages】:1599-1602

【Authors】: Nikolaos Konstantinou ; Martin Koehler ; Edward Abel ; Cristina Civili ; Bernd Neumayr ; Emanuel Sallinger ; Alvaro A. A. Fernandes ; Georg Gottlob ; John A. Keane ; Leonid Libkin ; Norman W. Paton

【Abstract】: Data wrangling, the multi-faceted process by which the data required by an application is identified, extracted, cleaned, and integrated, is often cumbersome and labor-intensive. In this paper, we present an architecture that supports a complete data wrangling lifecycle, orchestrates components dynamically, builds on automation wherever possible, is informed by whatever data is available, refines automatically produced results in the light of feedback, takes into account the user's priorities, and supports data scientists with diverse skill sets. The architecture is demonstrated in practice for wrangling property sales and open government data.

【Keywords】: data wrangling

110. A Demonstration of Lusail: Querying Linked Data at Scale.

Paper Link】 【Pages】:1603-1606

【Authors】: Essam Mansour ; Ibrahim Abdelaziz ; Mourad Ouzzani ; Ashraf Aboulnaga ; Panos Kalnis

【Abstract】: There has been a proliferation of datasets available as interlinked RDF data accessible through SPARQL endpoints. This has led to the emergence of various applications in life sciences, distributed social networks, and the Internet of Things that need to integrate data from multiple endpoints. We will demonstrate Lusail, a system that supports the need of emerging applications to access tens to hundreds of geo-distributed datasets. Lusail is a geo-distributed graph engine for querying linked RDF data. Lusail delivers outstanding performance using (i) a novel locality-aware query decomposition technique that minimizes the intermediate data to be accessed by the subqueries, and (ii) selectivity-aware and parallel query execution to reduce network latency and to increase parallelism. During the demo, the audience will be able to query actually deployed RDF endpoints as well as large synthetic and real benchmarks that we have deployed in the public cloud. The demo will also show that Lusail outperforms state-of-the-art systems by orders of magnitude in terms of scalability and response time.

【Keywords】: decentralized rdf graphs; federated sparql queries; linked data; parallel query processing; query processing; rdf data; sparql

111. Foofah: A Programming-By-Example System for Synthesizing Data Transformation Programs.

Paper Link】 【Pages】:1607-1610

【Authors】: Zhongjun Jin ; Michael R. Anderson ; Michael J. Cafarella ; H. V. Jagadish

【Abstract】: Advancements in data analysis and visualization technologies have resulted in wide applicability of data-driven decision making. However, raw data from various sources must be wrangled into a suitable form before it can be processed by downstream data tools. People traditionally write data transformation programs to automate this process, but such work is cumbersome and tedious. We built a system called FOOFAH to help the user easily synthesize a desired data transformation program. Our system minimizes the user's effort by asking only for a small illustrative example comprising the raw input data and the target transformed output; FOOFAH then synthesizes a program that can perform the desired data transformation. This demonstration showcases how the user can apply FOOFAH to real-world data transformation tasks.

【Keywords】: a* algorithm; data transformation; heuristic; program synthesis; programming by example
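
Program synthesis over table transforms can be pictured as a search: start from the raw grid, apply a small library of operators, and stop when the target grid is reached. FOOFAH uses A* with a heuristic over a richer operator library; the breadth-first search, the two operators, and the toy grids below are simplifications invented for illustration.

```python
from collections import deque

def drop_row(grid, i):
    return tuple(r for k, r in enumerate(grid) if k != i)

def drop_col(grid, j):
    return tuple(tuple(v for k, v in enumerate(r) if k != j) for r in grid)

def candidates(grid):
    """All grids reachable by one operator application (tiny operator library)."""
    for i in range(len(grid)):
        yield ("drop_row", i), drop_row(grid, i)
    for j in range(len(grid[0]) if grid else 0):
        yield ("drop_col", j), drop_col(grid, j)

def synthesize(raw, target, max_depth=4):
    """Breadth-first search for an operator sequence mapping raw -> target."""
    queue, seen = deque([(raw, [])]), {raw}
    while queue:
        grid, program = queue.popleft()
        if grid == target:
            return program
        if len(program) == max_depth:
            continue
        for op, nxt in candidates(grid):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, program + [op]))
    return None

raw = (("name", "city", "id"), ("ann", "nyc", "7"), ("bob", "sfo", "9"))
target = (("ann", "nyc"), ("bob", "sfo"))
print(synthesize(raw, target))   # e.g. [('drop_row', 0), ('drop_col', 2)]
```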

112. Virtualized Network Service Topology Exploration Using Nepal.

Paper Link】 【Pages】:1611-1614

【Authors】: Pramod A. Jamkhedkar ; Theodore Johnson ; Yaron Kanza ; Aman Shaikh ; N. K. Shankarnarayanan ; Vladislav Shkapenyuk ; Gordon Woodhull

【Abstract】: Modern communication networks are large, dynamic, and complex. To deploy, maintain, and troubleshoot such networks, it is essential to understand how network elements such as servers, switches, virtual machines, and virtual network functions are connected to one another, and to be able to discover communication paths between them. For network maintenance applications such as troubleshooting and service quality management, it is also essential to understand how connections change over time, and to be able to pose time-travel queries to retrieve information about past network states. With the industry-wide move to SDNs and virtualized network functions [13], maintaining these inventory databases becomes a critical issue. We represent a communication network inventory as a graph where the nodes are network entities and the edges represent relationships between them, e.g., hosted-on, communicates-with, and so on. We have found that querying such a graph, e.g., for troubleshooting, using a typical graph query language is too cumbersome for network analysts. In this demonstration we present Nepal -- a network path query language designed to effectively retrieve desired paths from a network graph. Nepal treats paths as first-class citizens of the language, which achieves closure under composition while maintaining simplicity. The Nepal schema system allows the complexities of items in the inventory database to be abstracted away when desired, while still providing strongly-typed access. We demonstrate how Nepal path queries can simplify the extraction of information from a dynamic inventory of a multi-layer network and can be used for troubleshooting.

【Keywords】: graph database; graph schema; network inventory; network management; temporal database

113. VisualCloud Demonstration: A DBMS for Virtual Reality.

Paper Link】 【Pages】:1615-1618

【Authors】: Brandon Haynes ; Artem Minyaylov ; Magdalena Balazinska ; Luis Ceze ; Alvin Cheung

【Abstract】: We demonstrate VisualCloud, a database management system designed to efficiently ingest, store, and deliver virtual reality (VR) content at scale. VisualCloud targets both live and prerecorded spherical panoramic (a.k.a. 360°) VR videos. It persists content as a multidimensional array that utilizes both dense (e.g., space and time) and sparse (e.g., bitrate) dimensions. VisualCloud uses orientation prediction to reduce data transfer by degrading out-of-view portions of the video. Content delivered through VisualCloud requires up to 60% less bandwidth than existing methods and scales to many concurrent connections. This demonstration will allow attendees to view both live and prerecorded VR video content served through VisualCloud. Viewers will be able to dynamically adjust tuning parameters (e.g., bitrates and path prediction) and observe changes in visual fidelity.

【Keywords】: array database; in-memory database; virtual reality

114. The Best of Both Worlds: Big Data Programming with Both Productivity and Performance.

Paper Link】 【Pages】:1619-1622

【Authors】: Fan Yang ; Yuzhen Huang ; Yunjian Zhao ; Jinfeng Li ; Guanxian Jiang ; James Cheng

【Abstract】: Coarse-grained operators such as map and reduce have been widely used for large-scale data processing. While they are easy to master, over-simplified APIs sometimes prevent programmers from exercising fine-grained control over how computation is performed, and hence from designing more efficient algorithms. On the other hand, resorting to domain-specific languages (DSLs) is also not a practical solution, since programmers may need to learn how to use many systems that can be very different from each other, and the use of low-level tools may even result in bug-prone programming. In our prior work [7], we proposed Husky, which provides a highly expressive API to resolve the above dilemma. It allows developers to program in a variety of patterns, such as MapReduce, GAS, vertex-centric programs, and even asynchronous machine learning. While the Husky C++ engine provides great performance, in this demo proposal we introduce PyHusky and ScHusky, which allow users (e.g., data scientists) without system knowledge or low-level programming skills to leverage the performance of Husky and build high-level applications with ease using Python and Scala.

【Keywords】: data-parallel processing; distributed system; programming model

115. In-Browser Interactive SQL Analytics with Afterburner.

Paper Link】 【Pages】:1623-1626

【Authors】: Kareem El Gebaly ; Jimmy Lin

【Abstract】: This demonstration explores the novel and unconventional idea of implementing an analytical RDBMS in pure JavaScript so that it runs completely inside a browser with no external dependencies. Our prototype, called Afterburner, generates compiled query plans that exploit two JavaScript features: typed arrays and asm.js. On the TPC-H benchmark, we show that Afterburner achieves comparable performance to MonetDB running natively on the same machine. This is an interesting finding in that it shows how far JavaScript has come as an efficient execution platform. Beyond a mere technical curiosity, we demonstrate how our techniques can support interactive data exploration by automatically generating materialized views from a backend that is then shipped to the browser to facilitate subsequent interactions seamlessly and efficiently.

【Keywords】: data exploration; interactive analytics; query compilation; split execution

116. Debugging Big Data Analytics in Spark with BigDebug.

Paper Link】 【Pages】:1627-1630

【Authors】: Muhammad Ali Gulzar ; Matteo Interlandi ; Tyson Condie ; Miryung Kim

【Abstract】: To process massive quantities of data, developers leverage Data-Intensive Scalable Computing (DISC) systems such as Apache Spark. In terms of debugging, DISC systems support only post-mortem log analysis and do not provide any interactive debugging functionality. This demonstration paper showcases BigDebug, a tool that enhances Apache Spark with a set of interactive debugging features to help users debug their big data applications.

【Keywords】: automatic fault localization; big data analytics; data-intensive scalable computing; debugging; disc; interactive tools

117. Interactive Query Synthesis from Input-Output Examples.

Paper Link】 【Pages】:1631-1634

【Authors】: Chenglong Wang ; Alvin Cheung ; Rastislav Bodík

【Abstract】: This demo showcases Scythe, a novel query-by-example system that can synthesize expressive SQL queries from input-output examples. Scythe is designed to help end users write SQL and explore data simply by providing input-output examples. From a web browser, users can obtain SQL queries with Scythe in an automated, interactive fashion: from a provided example, Scythe synthesizes SQL queries and resolves ambiguities via conversations with the user. In this demo, we first show how end users can formulate queries using Scythe; we then switch to the perspective of an algorithm designer to show how Scythe can scale up to handle complex SQL features, such as outer joins and subqueries.

【Keywords】: database interface; program synthesis; sql

118. Generating Concise Entity Matching Rules.

Paper Link】 【Pages】:1635-1638

【Authors】: Rohit Singh ; Vamsi Meduri ; Ahmed K. Elmagarmid ; Samuel Madden ; Paolo Papotti ; Jorge-Arnulfo Quiané-Ruiz ; Armando Solar-Lezama ; Nan Tang

【Abstract】: Entity matching (EM) is a critical part of data integration and cleaning. In many applications, users need to understand why two entities are considered a match, which reveals the need for interpretable and concise EM rules. We model EM rules in the form of General Boolean Formulas (GBFs), which allow arbitrary attribute-matching predicates combined by conjunctions (∧), disjunctions (∨), and negations (¬). GBFs can express more concise rules than traditional EM rules represented in disjunctive normal forms (DNFs). We use program synthesis, a powerful tool for automatically generating rules (or programs) that provably satisfy a high-level specification, to automatically synthesize EM rules in GBF format, given only positive and negative matching examples. In this demo, attendees will experience the following features: (1) Interpretability -- they can see and measure the conciseness of EM rules defined using GBFs; (2) Easy customization -- they can provide custom experiment parameters for various datasets and easily modify a rich predefined (default) synthesis grammar using a Web interface; and (3) High performance -- they will be able to compare the generated concise rules, in terms of accuracy, with probabilistic models (e.g., machine learning methods) and hand-written EM rules provided by experts. Moreover, this system will serve as a general platform for evaluating different methods that discover EM rules, and will be released as an open-source tool on GitHub.

【Keywords】: disjunctive normal forms; entity matching; general boolean formulas; program synthesis
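
The conciseness argument is easy to see with a toy rule: the same matching logic needs several DNF clauses but only one nested boolean formula. The attribute predicates, thresholds, and records below are invented for illustration; they are not the paper's actual rules.

```python
# Toy similarity predicates over a pair of records r and s (all invented).
def name_sim(r, s):   return r["name"] == s["name"]
def addr_sim(r, s):   return r["addr"] == s["addr"]
def phone_sim(r, s):  return r["phone"] == s["phone"]

# A concise General Boolean Formula: the names must match, plus at least one of
# address or phone, but phone matches on a known placeholder value do not count.
def gbf_rule(r, s):
    return name_sim(r, s) and (addr_sim(r, s) or
                               (phone_sim(r, s) and not r["phone"] == "000"))

# The same logic written in disjunctive normal form needs one clause per case.
def dnf_rule(r, s):
    return ((name_sim(r, s) and addr_sim(r, s)) or
            (name_sim(r, s) and phone_sim(r, s) and not r["phone"] == "000"))

r = {"name": "ACME Corp", "addr": "1 Main St", "phone": "555-1212"}
s = {"name": "ACME Corp", "addr": "1 Main Street", "phone": "555-1212"}
print(gbf_rule(r, s), dnf_rule(r, s))   # both True: matched via the phone clause
```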

119. A Demo of the Data Civilizer System.

Paper Link】 【Pages】:1639-1642

【Authors】: Raul Castro Fernandez ; Dong Deng ; Essam Mansour ; Abdulhakim Ali Qahtan ; Wenbo Tao ; Ziawasch Abedjan ; Ahmed K. Elmagarmid ; Ihab F. Ilyas ; Samuel Madden ; Mourad Ouzzani ; Michael Stonebraker ; Nan Tang

【Abstract】: Finding relevant data for a specific task from the numerous data sources available in any organization is a daunting task. This is not only because of the number of possible data sources where the data of interest resides, but also due to the data being scattered all over the enterprise and being typically dirty and inconsistent. In practice, data scientists are routinely reporting that the majority (more than 80%) of their effort is spent finding, cleaning, integrating, and accessing data of interest to a task at hand. We propose to demonstrate DATA CIVILIZER to ease the pain faced in analyzing data "in the wild". DATA CIVILIZER is an end-to-end big data management system with components for data discovery, data integration and stitching, data cleaning, and querying data from a large variety of storage engines, running in large enterprises.

【Keywords】: data cleaning; data discovery; data integration; data stitching; join path discovery; polystore queries

120. Querying and Exploring Polygamous Relationships in Urban Spatio-Temporal Data Sets.

Paper Link】 【Pages】:1643-1646

【Authors】: Yeukyin Chan ; Fernando Chirigati ; Harish Doraiswamy ; Cláudio T. Silva ; Juliana Freire

【Abstract】: The Data Polygamy framework allows users to uncover interesting patterns and interactions in the data exhaust from different components of an urban environment. But analyzing the plethora of relationships derived by the framework is challenging. In this demo, we show how visualization can help in the discovery of relationships that are potentially interesting by allowing users to query and explore the relationship set in an intuitive way. We will demonstrate the effectiveness of the visual interface through case studies, and demo visitors will also interact with the polygamous relationships.

【Keywords】: data polygamy; data set relationships; urban data

121. Graph Data Mining with Arabesque.

Paper Link】 【Pages】:1647-1650

【Authors】: Eslam Hussein ; Abdurrahman Ghanem ; Vinícius Vitor dos Santos Dias ; Carlos H. C. Teixeira ; Ghadeer AbuOda ; Marco Serafini ; Georgos Siganos ; Gianmarco De Francisci Morales ; Ashraf Aboulnaga ; Mohammed J. Zaki

【Abstract】: Graph data mining is defined as searching in an input graph for all subgraphs that satisfy some property that makes them interesting to the user. Examples of graph data mining problems include frequent subgraph mining, counting motifs, and enumerating cliques. These problems differ from other graph processing problems such as PageRank or shortest path in that graph data mining requires searching through an exponential number of subgraphs. Most current parallel graph analytics systems do not provide good support for graph data mining. One notable exception is Arabesque, a system that was built specifically to support graph data mining. Arabesque provides a simple programming model to express graph data mining computations, and a highly scalable and efficient implementation of this model, scaling to billions of subgraphs on hundreds of cores. This demonstration will showcase the Arabesque system, focusing on the end-user experience and showing how Arabesque can be used to simply and efficiently solve practical graph data mining problems that would be difficult with other systems.

【Keywords】: filter-process; graph exploration; think like an embedding

122. Alpine: Efficient In-Situ Data Exploration in the Presence of Updates.

Paper Link】 【Pages】:1651-1654

【Authors】: Antonios Anagnostou ; Matthaios Olma ; Anastasia Ailamaki

【Abstract】: Ever-growing data collections create the need for brief explorations of the available data to extract relevant information before decision making becomes necessary. In this context of data exploration, current data analysis solutions struggle to quickly pinpoint useful information in data collections. One major reason is that loading data into a DBMS without knowing which part of it will actually be useful is a major bottleneck. To remove this bottleneck, state-of-the-art approaches execute queries in situ, thus avoiding the loading overhead. In situ query engines, however, are index-oblivious and lack sophisticated techniques to reduce the amount of data to be accessed. Furthermore, applications constantly generate fresh data and update the existing raw data files, whereas state-of-the-art in situ approaches support only append-like workloads. In this demonstration, we showcase the efficiency of adaptive indexing and partitioning techniques for analytical queries in the presence of updates. We demonstrate an online partitioning and indexing tuner for in situ querying which plugs into a query engine and offers support for fast queries over raw data files. We present Alpine, our prototype implementation, which combines the tuner with a query executor incorporating in situ query techniques to provide efficient raw data access. We will visually demonstrate how Alpine incrementally and adaptively builds auxiliary data structures and indexes over raw data files and how it adapts its behavior as a side effect of updates to the raw data files.

【Keywords】: in-situ querying; indexes; partitions; updates

123. OrpheusDB: A Lightweight Approach to Relational Dataset Versioning.

Paper Link】 【Pages】:1655-1658

【Authors】: Liqi Xu ; Silu Huang ; SiLi Hui ; Aaron J. Elmore ; Aditya G. Parameswaran

【Abstract】: We demonstrate OrpheusDB, a lightweight approach to versioning of relational datasets. OrpheusDB is built as a thin layer on top of standard relational databases, and therefore inherits many of their benefits while also compactly storing, tracking, and recreating dataset versions on demand. OrpheusDB also supports a range of querying modalities spanning both SQL and git-style version commands. Conference attendees will be able to interact with OrpheusDB via an interactive version browser interface. The demo will highlight underlying design decisions of OrpheusDB, and provide an understanding of how OrpheusDB translates versioning commands into commands understood by a database system that is unaware of the presence of versions. OrpheusDB has been developed as open-source software; code is available at http://orpheus-db.github.io.

【Keywords】: data model; optimization; query translation; version control system
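
The version-to-SQL translation can be pictured with a minimal two-table layout: one table stores every record exactly once and another maps each version to the records it contains, so a git-style checkout becomes a plain join. The schema and tables below are an illustrative sketch, not OrpheusDB's actual storage format.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- every record is stored once, identified by a surrogate rid
    CREATE TABLE data    (rid INTEGER PRIMARY KEY, name TEXT, score REAL);
    -- which records belong to which dataset version
    CREATE TABLE version (vid INTEGER, rid INTEGER);
""")
con.executemany("INSERT INTO data VALUES (?, ?, ?)",
                [(1, "ann", 0.7), (2, "bob", 0.4), (3, "bob", 0.9)])
# v1 contains records 1 and 2; v2 updated bob's row, so it contains 1 and 3.
con.executemany("INSERT INTO version VALUES (?, ?)",
                [(1, 1), (1, 2), (2, 1), (2, 3)])

def checkout(vid):
    """A 'checkout <vid>' command translated to SQL a version-unaware RDBMS runs."""
    return con.execute("""
        SELECT d.name, d.score
        FROM data d JOIN version v ON d.rid = v.rid
        WHERE v.vid = ?
        ORDER BY d.rid""", (vid,)).fetchall()

print(checkout(1))   # [('ann', 0.7), ('bob', 0.4)]
print(checkout(2))   # [('ann', 0.7), ('bob', 0.9)]
```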

124. doppioDB: A Hardware Accelerated Database.

Paper Link】 【Pages】:1659-1662

【Authors】: David Sidler ; Zsolt István ; Muhsen Owaida ; Kaan Kara ; Gustavo Alonso

【Abstract】: Relational databases provide a wealth of functionality to a wide range of applications. Yet, there are tasks for which they are less than optimal, for instance when processing becomes more complex (e.g., matching regular expressions) or the data is less structured (e.g., text or long strings). In this demonstration we show the benefit of using specialized hardware for such tasks and highlight the importance of a flexible, reusable mechanism for extending database engines with hardware-based operators. We present doppioDB which consists of MonetDB, a main-memory column store, extended with Hardware User Defined Functions (HUDFs). In our demonstration the HUDFs are used to provide seamless acceleration of two string operators, LIKE and REGEXP_LIKE, and two analytics operators, SKYLINE and SGD (stochastic gradient descent). We evaluate doppioDB on an emerging hybrid multicore architecture, the Intel Xeon+FPGA platform, where the CPU and FPGA have cache-coherent access to the same memory, such that the hardware operators can directly access the database tables. For integration we rely on HUDFs as a unit of scheduling and management on the FPGA. In the demonstration we show the acceleration benefits of hardware operators, as well as their flexibility in accommodating changing workloads.

【Keywords】: hardware acceleration; hybrid architectures; regular expression; skyline; stochastic gradient descent

125. DBridge: Translating Imperative Code to SQL.

Paper Link】 【Pages】:1663-1666

【Authors】: K. Venkatesh Emani ; Tejas Deshpande ; Karthik Ramachandra ; S. Sudarshan

【Abstract】: Application programs that access data located remotely (such as in a database) often perform poorly due to multiple network round trips and transfer of unused data. This situation is exacerbated in applications that use object-relational mapping (ORM) frameworks such as Hibernate, as developers tend to express complex query logic using imperative code, resulting in poor performance. DBridge is a system for optimizing data access in database applications by using static program analysis and program transformations. Recently, we incorporated a new suite of optimization techniques into DBridge. These techniques optimize database application programs by identifying relational operations expressed in imperative code, and translating them into SQL. In this demonstration, we showcase these techniques using a plugin for the IntelliJ IDEA Java IDE as the front end. We show the performance gains achieved by employing our system on real world applications that use JDBC or Hibernate.

【Keywords】: database applications; hibernate; java; orm; program regions; program rewriting; rule based transformations; sql; static program analysis
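
The kind of rewrite DBridge performs can be illustrated language-neutrally: an imperative loop that pulls every row to the application and filters and aggregates there, versus the single SQL statement it is equivalent to. The schema and data below are invented, and the Python sketch only mirrors the idea; DBridge itself analyzes Java/Hibernate code.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("ann", 10.0), ("bob", 5.0), ("ann", 7.5)])

# Imperative style: fetch the whole table and aggregate in application code.
def total_imperative(customer):
    total = 0.0
    for cust, amount in con.execute("SELECT customer, amount FROM orders"):
        if cust == customer:          # filtering in the application
            total += amount
    return total

# What a DBridge-style rewrite would push back into the database as one query.
def total_declarative(customer):
    row = con.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer = ?",
        (customer,)).fetchone()
    return row[0]

print(total_imperative("ann"), total_declarative("ann"))   # 17.5 17.5
```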

126. BEAS: Bounded Evaluation of SQL Queries.

Paper Link】 【Pages】:1667-1670

【Authors】: Yang Cao ; Wenfei Fan ; Yanghao Wang ; Tengfei Yuan ; Yanchao Li ; Laura Yu Chen

【Abstract】: We demonstrate BEAS, a prototype system for querying relations with bounded resources. BEAS advocates an unconventional query evaluation paradigm under an access schema A, which is a combination of cardinality constraints and associated indices. Given an SQL query Q and a dataset D, BEAS computes Q(D) by accessing a bounded fraction DQ of D, such that Q(DQ) = Q(D) and DQ is determined by A and Q only, no matter how big D grows. It identifies DQ by reasoning about the cardinality constraints of A, and fetches DQ using the indices of A. We demonstrate the feasibility of bounded evaluation by walking through each functional component of BEAS. As a proof of concept, we demonstrate how BEAS conducts CDR analyses in the telecommunications industry, compared with commercial database systems.

【Keywords】: bounded evaluation; resource bounded sql evaluation
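
Bounded evaluation is easiest to see on a toy access schema: a cardinality constraint says each account determines at most one country tuple, and an index lets us fetch exactly that tuple, so a query over a fixed set of accounts touches a bounded number of tuples however large the relation grows. The relation, constraint, and index below are invented for illustration and do not reflect BEAS's internals.

```python
# Illustrative access schema: constraint "account -> country" with cardinality 1,
# backed by an index from account id to its (single) country value.
RESIDENCE_INDEX = {"a1": "FR", "a2": "DE", "a3": "JP"}

def bounded_countries(accounts):
    """Answer SELECT country FROM Residence WHERE account IN (...) by probing the
    index once per account: data accessed is bounded by len(accounts), not by
    the size of Residence."""
    fetched = []
    for acc in accounts:
        if acc in RESIDENCE_INDEX:       # one bounded index probe per account
            fetched.append(RESIDENCE_INDEX[acc])
    return sorted(set(fetched))

print(bounded_countries(["a1", "a3"]))   # ['FR', 'JP']
```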

127. Safe Visual Data Exploration.

Paper Link】 【Pages】:1671-1674

【Authors】: Zheguang Zhao ; Emanuel Zgraggen ; Lorenzo De Stefani ; Carsten Binnig ; Eli Upfal ; Tim Kraska

【Abstract】: Exploring data via visualization has become a popular way to understand complex data. Features or patterns in a visualization can be perceived as relevant insights by users, even though they may actually arise from random noise. Moreover, interactive data exploration and visualization recommendation tools can examine a large number of observations, further increasing the chance of spurious insights. Thus, without proper statistical control, the risk of false discovery renders visual data exploration unsafe and makes users susceptible to questionable inference. To address these problems, we present QUDE, a visual data exploration system that interacts with users to formulate hypotheses based on visualizations and provides interactive control of false discoveries.

【Keywords】: false discovery control; hypothesis testing; interactive data exploration; multiple comparison problem; progressive computation; visual analytics; visualization
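
The underlying statistical issue is the multiple comparison problem: if an exploration session effectively tests many visual "insights", some will look significant by chance. A minimal illustration is a Benjamini-Hochberg correction over the p-values of all insights examined in a session; QUDE's actual control procedure is interactive and more sophisticated, and the p-values below are made up.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return indices of hypotheses accepted as discoveries while controlling
    the false discovery rate at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = -1
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    return sorted(order[:cutoff]) if cutoff > 0 else []

# p-values of six 'insights' spotted during one exploration session (invented).
pvals = [0.001, 0.20, 0.03, 0.04, 0.45, 0.009]
print(benjamini_hochberg(pvals))   # [0, 5]: only the strongest insights survive
```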

128. Optimizing Data-Intensive Applications Automatically By Leveraging Parallel Data Processing Frameworks.

Paper Link】 【Pages】:1675-1678

【Authors】: Maaz Bin Safeer Ahmad ; Alvin Cheung

【Abstract】: In this demonstration we will showcase CASPER, a novel tool that enables sequential data-intensive programs to automatically leverage the optimizations provided by parallel data processing frameworks. The goal of CASPER is to reduce the inertia against adoption of new data processing frameworks---particularly for non-expert users---by automatically rewriting sequential programs written in general-purpose languages into the high-level DSLs or APIs of these frameworks. Through CASPER's browser-based interface, users can enter the source code of their Java applications and have it automatically retargeted to execute on Apache Spark. In our interactive presentation, we will use CASPER to optimize sequential implementations of data visualization programs as well as image processing kernels. The optimized Spark implementations, along with the original sequential implementations, will then be executed simultaneously on the cloud to allow the demo audience to compare the runtime performance and outputs in real time.

【Keywords】: data-parallelism; program synthesis; verification; verified lifting
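
The flavor of the rewrite can be shown with a toy pair: a sequential aggregation loop and the data-parallel Spark equivalent a CASPER-like tool would target. The PySpark version below is a hand-written sketch of that output, not code generated by CASPER, and it assumes a local Spark/pyspark installation.

```python
# Sequential version: sums squared values one element at a time.
def sum_of_squares_sequential(values):
    total = 0
    for v in values:
        total += v * v
    return total

# Data-parallel version expressed against Spark's RDD API (hand-written sketch).
def sum_of_squares_spark(values):
    from pyspark import SparkContext            # requires a local Spark installation
    sc = SparkContext("local[*]", "casper-sketch")
    try:
        return sc.parallelize(values).map(lambda v: v * v).reduce(lambda a, b: a + b)
    finally:
        sc.stop()

data = list(range(1, 1001))
print(sum_of_squares_sequential(data))          # 333833500
# print(sum_of_squares_spark(data))             # same result, computed on Spark
```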

129. DIAS: Differentially Private Interactive Algorithm Selection using Pythia.

Paper Link】 【Pages】:1679-1682

【Authors】: Ios Kotsogiannis ; Michael Hay ; Ashwin Machanavajjhala ; Gerome Miklau ; Margaret Orr

【Abstract】: Differential privacy has emerged as the dominant privacy standard for data analysis. Its wide acceptance has led to significant development of algorithms that meet this rigorous standard. For some tasks, such as the task of answering low dimensional counting queries, dozens of algorithms have been proposed. However, no single algorithm has emerged as the dominant performer, and in fact, algorithm performance varies drastically across inputs. Thus, it's not clear how to select an algorithm for a particular task, and choosing the wrong algorithm might lead to significant degradation in terms of analysis accuracy. We believe that the difficulty of algorithm selection is one factor limiting the adoption of differential privacy in real systems. In this demonstration we present DIAS (Differentially-private Interactive Algorithm Selection), an educational privacy game. Users are asked to perform algorithm selection for a variety of inputs and compare the performance of their choices against that of Pythia, an automated algorithm selection framework. Our hope is that by the end of the game users will understand the importance of algorithm selection and most importantly will have a good grasp on how to use differentially private algorithms for their own applications.

【Keywords】: classification trees; differential privacy; end-to-end privacy

130. Snorkel: Fast Training Set Generation for Information Extraction.

Paper Link】 【Pages】:1683-1686

【Authors】: Alexander J. Ratner ; Stephen H. Bach ; Henry R. Ehrenberg ; Christopher Ré

【Abstract】: State-of-the-art machine learning methods such as deep learning rely on large sets of hand-labeled training data. Collecting training data is prohibitively slow and expensive, especially when technical domain expertise is required; even the largest technology companies struggle with this challenge. We address this critical bottleneck with Snorkel, a new system for quickly creating, managing, and modeling training sets. Snorkel enables users to generate large volumes of training data by writing labeling functions, which are simple functions that express heuristics and other weak supervision strategies. These user-authored labeling functions may have low accuracies and may overlap and conflict, but Snorkel automatically learns their accuracies and synthesizes their output labels. Experiments and theory show that, surprisingly, by modeling the labeling process in this way, we can train high-accuracy machine learning models even using potentially lower-accuracy inputs. Snorkel is currently used in production at top technology and consulting companies, and is used by researchers to extract information from electronic health records, after-action combat reports, and the scientific literature. In this demonstration, we focus on the challenging task of information extraction, a common application of Snorkel in practice. Using the task of extracting corporate employment relationships from news articles, we will demonstrate and build intuition for a radically different way of developing machine learning systems which allows us to effectively bypass the bottleneck of hand-labeling training data.

【Keywords】: structured information extraction; weak supervision
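
The core idea, writing several noisy labeling functions and combining their votes, can be sketched without the Snorkel library itself. The sketch below uses plain functions and a simple majority vote; Snorkel instead learns per-function accuracies with a generative model, and its real API differs from these invented helpers and example sentences.

```python
ABSTAIN, NEGATIVE, POSITIVE = None, 0, 1

# Noisy, possibly conflicting heuristics for an employment relation (invented).
def lf_keyword(sentence):
    return POSITIVE if "employed by" in sentence or "works at" in sentence else ABSTAIN

def lf_title(sentence):
    return POSITIVE if "CEO of" in sentence else ABSTAIN

def lf_negation(sentence):
    return NEGATIVE if "left" in sentence or "fired" in sentence else ABSTAIN

LABELING_FUNCTIONS = [lf_keyword, lf_title, lf_negation]

def weak_label(sentence):
    """Majority vote over non-abstaining labeling functions (Snorkel instead
    learns each function's accuracy and emits probabilistic labels)."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return POSITIVE if sum(votes) * 2 > len(votes) else NEGATIVE

print(weak_label("Jane Doe works at Initech as an engineer."))   # 1 (POSITIVE)
print(weak_label("John Smith left Initech last year."))          # 0 (NEGATIVE)
```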

131. Synthesizing Extraction Rules from User Examples with SEER.

Paper Link】 【Pages】:1687-1690

【Authors】: Maeda F. Hanafi ; Azza Abouzied ; Laura Chiticariu ; Yunyao Li

【Abstract】: Our demonstration showcases SEER's end-to-end Information Extraction (IE) workflow where users highlight texts they wish to extract. Given a small set of user-specified example extractions, SEER synthesizes easy-to-understand IE rules and suggests them to the user. In addition to rule suggestions, users can quickly pick the desired rule by filtering the rule suggestion by accepting or rejecting proposed extractions. SEER's workflow allows users to jump start the IE rule development cycle; it is a less time-consuming alternative to machine learning methods that require large labeled datasets or rule-based approaches that are labor-intensive. SEER's design principles and learning algorithm are motivated by how rule developers naturally construct data extraction rules.

【Keywords】: data extraction; example-driven learning; information extraction

132. Scout: A GPU-Aware System for Interactive Spatio-temporal Data Visualization.

Paper Link】 【Pages】:1691-1694

【Authors】: Harshada Chavan ; Mohamed F. Mokbel

【Abstract】: This demo presents Scout, a full-fledged interactive data visualization system with native support for spatio-temporal data. Scout utilizes the computing power of GPUs to achieve real-time query performance. The key idea behind Scout is a GPU-aware multi-version spatio-temporal index. The indexing and query processing modules of Scout are designed to complement the GPU hardware characteristics. The front end of Scout provides a user interface to submit queries and view results. Scout supports a variety of spatio-temporal queries: range, k-NN, and join. We use real data sets to demonstrate the scalability and important features of Scout.

【Keywords】: data visualization; gpus; spatial data processing

133. Graphflow: An Active Graph Database.

Paper Link】 【Pages】:1695-1698

【Authors】: Chathura Kankanamge ; Siddhartha Sahu ; Amine Mhedbhi ; Jeremy Chen ; Semih Salihoglu

【Abstract】: Many applications detect the emergence or deletion of certain subgraphs in their input graphs continuously. In order to evaluate such continuous subgraph queries, these applications resort to inefficient or highly specialized solutions because existing graph databases are passive systems that only support one-time subgraph queries. We demonstrate Graphflow, a prototype active graph database that evaluates general one-time and continuous subgraph queries. Graphflow supports the property graph data model and the Cypher++ query language, which extends Neo4j's declarative Cypher language with subgraph-condition-action triggers. At the core of Graphflow's query processor are two worst-case optimal join algorithms called Generic Join and our new Delta Generic Join algorithm for one-time and continuous subgraph queries, respectively.

【Keywords】: active queries; graph databases; incremental view maintenance; worst-case optimal joins
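
Worst-case optimal joins process one query variable at a time by intersecting adjacency lists, rather than joining one edge relation at a time. Below is a minimal Generic Join sketch for the triangle query over an invented undirected edge list; Graphflow's engine, and its Delta variant for continuous queries, are of course far more general.

```python
from collections import defaultdict

edges = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]

# Adjacency index (undirected): the only access path this sketch needs.
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def triangles():
    """Generic Join for Q(a,b,c) = E(a,b), E(b,c), E(a,c): bind one variable at
    a time, intersecting the adjacency sets that constrain it."""
    out = []
    for a in adj:                               # bind a
        for b in adj[a]:                        # bind b from E(a,b)
            if b <= a:
                continue                        # avoid duplicate orderings
            for c in adj[a] & adj[b]:           # bind c by set intersection
                if c > b:
                    out.append((a, b, c))
    return out

print(triangles())   # the two triangles: (1, 2, 3) and (2, 3, 4)
```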

134. Demonstration: MacroBase, A Fast Data Analysis Engine.

Paper Link】 【Pages】:1699-1702

【Authors】: Peter Bailis ; Edward Gan ; Kexin Rong ; Sahaana Suri

【Abstract】: Data volumes are rising at an increasing rate, stressing the limits of human attention. Current techniques for prioritizing user attention in this fast data are characterized by either cumbersome, ad-hoc analysis pipelines comprised of a diverse set of analytics tools, or brittle, static rule-based engines. To address this gap, we have developed MacroBase, a fast data analytics engine that acts as a search engine over fast data streams. MacroBase provides a set of highly-optimized, modular operators for streaming feature transformation, classification, and explanation. Users can leverage these optimized operators to construct efficient pipelines tailored for their use case. In this demonstration, SIGMOD attendees will have the opportunity to interactively answer and refine queries using MacroBase and discover the potential benefits of an advanced engine for prioritizing attention in high-volume, real-world data streams.

【Keywords】: attention; classification; fast data; scale; streaming

135. Q*cert: A Platform for Implementing and Verifying Query Compilers.

Paper Link】 【Pages】:1703-1706

【Authors】: Joshua S. Auerbach ; Martin Hirzel ; Louis Mandel ; Avraham Shinnar ; Jérôme Siméon

【Abstract】: We present Qcert, a platform for the specification, verification, and implementation of query compilers written using the Coq proof assistant. The Qcert platform is open source and includes some support for SQL and OQL, and for code generation to Spark and Cloudant. It internally relies on familiar database intermediate representations, notably the nested relational algebra and calculus and a novel extension of the nested relational algebra that eases the handling of environments. The platform also comes with simple but functional and extensible query optimizers. We demonstrate how the platform can be used to implement a compiler for a new input language or develop new optimizations that can be formally verified. We also demonstrate a web-based interface that allows the developer to explore various compilation and optimization strategies.

【Keywords】: certification; demonstration; formal methods; nested relational algebra; oql; query compilers; query optimization; sql; theorem proving

136. A Demonstration of Interactive Analysis of Performance Measurements with Viska.

Paper Link】 【Pages】:1707-1710

【Authors】: Helga Gudmundsdottir ; Babak Salimi ; Magdalena Balazinska ; Dan R. K. Ports ; Dan Suciu

【Abstract】: The ultimate goal of system performance analysis is to identify the underlying causes for performance differences between different systems and different workloads. We make this goal easier to achieve with Viska, a new tool for generating and interpreting performance measurement results. Viska leverages cutting-edge techniques from big data analytics and data visualization to aid and automate this analysis, and helps users derive meaningful and statistically sound conclusions using state-of-the-art causal inference and hypothesis testing techniques.

【Keywords】: causal inference; data analytics; performance debugging

Tutorials 13

137. Crowdsourced Data Management: Overview and Challenges.

Paper Link】 【Pages】:1711-1716

【Authors】: Guoliang Li ; Yudian Zheng ; Ju Fan ; Jiannan Wang ; Reynold Cheng

【Abstract】: Many important data management and analytics tasks cannot be completely addressed by automated processes. Crowdsourcing is an effective way to harness human cognitive abilities to process these computer-hard tasks, such as entity resolution, sentiment analysis, and image recognition. Crowdsourced data management has been extensively studied in research and industry recently. In this tutorial, we will survey and synthesize a wide spectrum of existing studies on crowdsourced data management. We first give an overview of crowdsourcing, and then summarize the fundamental techniques, including quality control, cost control, and latency control, which must be considered in crowdsourced data management. Next, we review crowdsourced operators, including selection, collection, join, top-k, sort, categorize, aggregation, skyline, planning, schema matching, mining, and spatial crowdsourcing. We also discuss crowdsourcing optimization techniques and systems. Finally, we discuss emerging challenges.

【Keywords】: crowdsourcing; data management; optimization

138. Data Management in Machine Learning: Challenges, Techniques, and Systems.

Paper Link】 【Pages】:1717-1722

【Authors】: Arun Kumar ; Matthias Boehm ; Jun Yang

【Abstract】: Large-scale data analytics using statistical machine learning (ML), popularly called advanced analytics, underpins many modern data-driven applications. The data management community has been working for over a decade on tackling data management-related challenges that arise in ML workloads, and has built several systems for advanced analytics. This tutorial provides a comprehensive review of such systems and analyzes key data management challenges and techniques. We focus on three complementary lines of work: (1) integrating ML algorithms and languages with existing data systems such as RDBMSs, (2) adapting data management-inspired techniques such as query optimization, partitioning, and compression to new systems that target ML workloads, and (3) combining data management and ML ideas to build systems that improve ML lifecycle-related tasks. Finally, we identify key open data management challenges for future research in this important area.

【Keywords】: data management; machine learning

139. Data Management Challenges in Production Machine Learning.

Paper Link】 【Pages】:1723-1726

【Authors】: Neoklis Polyzotis ; Sudip Roy ; Steven Euijong Whang ; Martin Zinkevich

【Abstract】: The tutorial discusses data-management issues that arise in the context of machine learning pipelines deployed in production. Informed by our own experience with such large-scale pipelines, we focus on issues related to understanding, validating, cleaning, and enriching training data. The goal of the tutorial is to bring forth these issues, draw connections to prior work in the database literature, and outline the open research questions that are not addressed by prior art.

【Keywords】: data enrichment; data management; data understanding; data validation; machine learning; production

140. Differential Privacy in the Wild: A Tutorial on Current Practices & Open Challenges.

Paper Link】 【Pages】:1727-1730

【Authors】: Ashwin Machanavajjhala ; Xi He ; Michael Hay

【Abstract】: Differential privacy has emerged as an important standard for privacy-preserving computation over databases containing sensitive information about individuals. Research on differential privacy over the last decade, spanning a number of research areas including theory, security, databases, networking, machine learning, and statistics, has resulted in a variety of privacy-preserving algorithms for a number of analysis tasks. Despite maturing research efforts, the adoption of differential privacy by practitioners in industry, academia, or government agencies has so far been rare. Hence, in this tutorial, we will first describe the foundations of differentially private algorithm design that cover the state of the art in private computation on tabular data. In the second half of the tutorial we will highlight real-world applications on complex data types, and identify research challenges in applying differential privacy to real-world applications.

【Keywords】: differential privacy
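
For counting queries, the most basic differentially private algorithm covered in such foundations is the Laplace mechanism: add noise calibrated to the query's sensitivity. A minimal sketch with an invented dataset follows; the tutorial's point is precisely that many more refined algorithms exist beyond this baseline.

```python
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(dataset, predicate, epsilon):
    """epsilon-differentially private count: a counting query has sensitivity 1,
    so Laplace noise with scale 1/epsilon suffices."""
    true_count = sum(1 for x in dataset if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 35, 41, 29, 62, 55, 38, 47]                       # toy sensitive data
print(private_count(ages, lambda a: a >= 40, epsilon=0.5))    # noisy answer centred at 4
```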

141. Graph Querying Meets HCI: State of the Art and Future Directions.

Paper Link】 【Pages】:1731-1736

【Authors】: Sourav S. Bhowmick ; Byron Choi ; Chengkai Li

【Abstract】: Querying graph databases has emerged as an important research problem for real-world applications that center on large graph data. Given the syntactic complexity of graph query languages (e.g., SPARQL, Cypher), visual graph query interfaces make it easy for non-expert users to query such graph data repositories. In this tutorial, we survey recent developments in the emerging area of the visual graph querying paradigm, which bridges traditional graph querying with human computer interaction (HCI). We discuss manual and data-driven visual graph query interfaces, various strategies and guidance for constructing graph queries visually, interleaved processing of graph queries and visual actions, and visual exploration of graph query results. In addition, the tutorial suggests open problems and new research directions. In summary, this tutorial reviews and synthesizes the research thus far on HCI and graph querying in the database community, giving researchers a snapshot of the current state of the art in this topic along with future research directions.

【Keywords】: graph database; human computer interaction; query feedback; query formulation; query processing; results exploration; visual graph query; visual query interface

142. Graph Exploration: From Users to Large Graphs.

Paper Link】 【Pages】:1737-1740

【Authors】: Davide Mottin ; Emmanuel Müller

【Abstract】: The increasing interest in social networks, knowledge graphs, protein-interaction networks, and many other types of networks has raised the question of how users can explore such large and complex graph structures easily. Current tools focus on graph management, graph mining, or graph visualization but lack user-driven methods for graph exploration. In many cases graph methods try to scale to the size and complexity of a real network. However, these methods miss user requirements such as exploratory graph query processing, intuitive graph explanation, and interactivity in graph exploration. While there is consensus in the database and data mining communities on the definition of data exploration practices for relational and semi-structured data, graph exploration practices are still indeterminate. In this tutorial, we will discuss a set of techniques, which have been developed in the last few years for independent purposes, within a unified graph exploration taxonomy. The tutorial will provide a generalized definition of graph exploration in which the user interacts directly with the system, either providing feedback or a partial query. We will discuss common, diverse, and missing properties of graph exploration techniques based on this definition, our taxonomy, and multiple applications for graph exploration. Concluding this discussion we will highlight interesting and relevant challenges for data scientists in graph exploration.

【Keywords】: data exploration; graph analysis; graph mining; user feedback

143. Building Structured Databases of Factual Knowledge from Massive Text Corpora.

Paper Link】 【Pages】:1741-1745

【Authors】: Xiang Ren ; Meng Jiang ; Jingbo Shang ; Jiawei Han

【Abstract】: In today's computerized and information-based society, people are inundated with vast amounts of text data, ranging from news articles, social media post, scientific publications, to a wide range of textual information from various domains (corporate reports, advertisements, legal acts, medical reports). To turn such massive unstructured text data into structured, actionable knowledge, one of the grand challenges is to gain an understanding of the factual information (e.g., entities, attributes, relations) in the text. In this tutorial, we introduce data-driven methods on mining structured facts (i.e., entities and their relations/attributes for types of interest) from massive text corpora, to construct structured databases of factual knowledge (called StructDBs). State-of-the-art information extraction systems have strong reliance on large amounts of task/corpus-specific labeled data (usually created by domain experts). In practice, the scale and efficiency of such a manual annotation process are rather limited, especially when dealing with text corpora of various kinds (domains, languages, genres). We focus on methods that are minimally-supervised, domain-independent, and language-independent for timely StructDB construction across various application domains (news, social media, biomedical, business), and demonstrate on real datasets how these StructDBs aid in data exploration and knowledge discovery.

【Keywords】: attribute discovery; entity recognition and typing; massive text corpora; quality phrase mining; relation extraction

144. Data Profiling: A Tutorial.

Paper Link】 【Pages】:1747-1751

【Authors】: Ziawasch Abedjan ; Lukasz Golab ; Felix Naumann

【Abstract】: A crucial prerequisite for any data-centric use case is to understand the dataset at hand and its metadata. The process of metadata discovery is known as data profiling. Profiling activities range from ad-hoc approaches, such as eye-balling random subsets of the data or formulating aggregation queries, to systematic inference of structural information and statistics of a dataset using dedicated profiling tools. In this tutorial, we highlight the importance of data profiling as part of any data-related use case, and we discuss the area of data profiling by classifying data profiling tasks and reviewing the state-of-the-art data profiling systems and techniques. In particular, we discuss hard problems in data profiling, such as algorithms for dependency discovery and profiling algorithms for dynamic data and streams. We also pay special attention to visualizing and interpreting the results of data profiling. We conclude with directions for future research in the area of data profiling. This tutorial is based on our survey on profiling relational data [2].

【Keywords】: data exploration; data profiling; dependency discovery
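
Dependency discovery, one of the hard profiling problems mentioned above, can be illustrated with a naive check: a functional dependency X -> A holds if no two rows agree on X but disagree on A. The toy relation below is invented, and real profiling algorithms prune the exponential lattice of candidate dependencies far more cleverly than this exhaustive test.

```python
from itertools import combinations

rows = [  # toy relation with columns (zip, city, street)
    ("10115", "Berlin",  "Invalidenstr."),
    ("10115", "Berlin",  "Chausseestr."),
    ("14482", "Potsdam", "August-Bebel-Str."),
]
columns = ("zip", "city", "street")

def holds(lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in `rows`."""
    seen = {}
    for row in rows:
        key = tuple(row[columns.index(c)] for c in lhs)
        val = row[columns.index(rhs)]
        if key in seen and seen[key] != val:
            return False
        seen[key] = val
    return True

# Naively test every single-column and two-column left-hand side.
for size in (1, 2):
    for lhs in combinations(columns, size):
        for rhs in columns:
            if rhs not in lhs and holds(lhs, rhs):
                print(f"{{{', '.join(lhs)}}} -> {rhs}")
# prints, among others: {zip} -> city
```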

145. How to Build a Non-Volatile Memory Database Management System.

Paper Link】 【Pages】:1753-1758

【Authors】: Joy Arulraj ; Andrew Pavlo

【Abstract】: The difference in the performance characteristics of volatile (DRAM) and non-volatile storage devices (HDD/SSDs) influences the design of database management systems (DBMSs). The key assumption has always been that the latter is much slower than the former. This affects all aspects of a DBMS's runtime architecture. But the arrival of new non-volatile memory (NVM) storage that is almost as fast as DRAM with fine-grained read/writes invalidates these previous design choices. In this tutorial, we provide an outline of how to build a new DBMS given the changes to the hardware landscape due to NVM. We survey recent developments in this area, and discuss the lessons learned from prior research on designing NVM database systems. We highlight a set of open research problems, and present ideas for solving some of them.

【Keywords】: database management systems; non-volatile memory

146. Data Structure Engineering For Byte-Addressable Non-Volatile Memory.

Paper Link】 【Pages】:1759-1764

【Authors】: Ismail Oukid ; Wolfgang Lehner

【Abstract】: Storage Class Memory (SCM) is emerging as a viable alternative to traditional DRAM, alleviating its scalability limits, both in terms of capacity and energy consumption, while being non-volatile. Hence, SCM has the potential to become a universal memory, blurring well-known storage hierarchies. However, along with opportunities, SCM brings many challenges. In this tutorial we will dissect SCM challenges and provide an in-depth view of existing programming models that circumvent them, as well as novel data structures that stem from these models. We will also elaborate on fail-safety testing challenges -- an often overlooked, yet important topic. Finally, we will discuss SCM emulation techniques for end-to-end testing of SCM-based software components. In contrast to surveys investigating the use of SCM in database systems, this tutorial is designed as a programming guide for researchers and professionals interested in leveraging SCM in database systems.

【Keywords】: crash simulation; data structures; dram; emulation; main memory; memory management; non-volatile memory; programming model; storage-class memory; testing

147. Natural Language Data Management and Interfaces: Recent Development and Open Challenges.

Paper Link】 【Pages】:1765-1770

【Authors】: Yunyao Li ; Davood Rafiei

【Abstract】: The volume of natural language text data has been rapidly increasing over the past two decades, due to factors such as the growth of the Web, the low cost associated with publishing, and the progress on the digitization of printed texts. This growth, combined with the proliferation of natural language systems for searching and retrieving information, provides tremendous opportunities for studying some of the areas where database systems and natural language processing systems overlap. This tutorial explores the two areas of overlap that are most relevant to the database community: (1) managing natural language text data in a relational database, and (2) developing natural language interfaces to databases. The tutorial presents state-of-the-art methods, related systems, and research opportunities and challenges covering both areas.

【Keywords】: database interfaces; natural language text; qa; text data

148. Hybrid Transactional/Analytical Processing: A Survey.

Paper Link】 【Pages】:1771-1775

【Authors】: Fatma Özcan ; Yuanyuan Tian ; Pinar Tözün

【Abstract】: The popularity of large-scale real-time analytics applications (real-time inventory/pricing, recommendations from mobile apps, fraud detection, risk analysis, IoT, etc.) keeps rising. These applications require distributed data management systems that can handle fast concurrent transactions (OLTP) and analytics on the recent data. Some of them even need to run analytical queries (OLAP) as part of transactions. Efficient processing of individual transactional and analytical requests, however, leads to different optimizations and architectural decisions when building a data management system. For the kind of data processing that requires both analytics and transactions, Gartner recently coined the term Hybrid Transactional/Analytical Processing (HTAP). Many HTAP solutions are emerging from both industry and academia to target these new applications. While some of these are single-system solutions, others are a looser coupling of OLTP databases or NoSQL systems with analytical big data platforms, such as Spark. The goal of this tutorial is to (1) quickly review the historical progression of OLTP and OLAP systems, (2) discuss the driving factors for HTAP, and finally (3) provide a deep technical analysis of existing and emerging HTAP solutions, detailing their key architectural differences and trade-offs.

【Keywords】: analytics; htap; hybrid transaction and analytics processing; olap; oltp; transactions

149. Query Processing Techniques for Big Spatial-Keyword Data.

Paper Link】 【Pages】:1777-1782

【Authors】: Ahmed R. Mahmood ; Walid G. Aref

【Abstract】: The widespread use of GPS-enabled cellular devices, i.e., smart phones, has led to the popularity of numerous mobile applications, e.g., social networks, micro-blogs, mobile web search, and crowd-powered reviews. These applications generate large amounts of geo-tagged textual data, i.e., spatial-keyword data. This data needs to be processed and queried at an unprecedented scale. The management of spatial-keyword data at this scale goes beyond the capabilities of centralized systems. We live in the era of big data, and the big data model is currently being used to address scalability issues in various application domains. This has led to the development of various big spatial-keyword processing systems. These systems are designed to ingest, store, index, and query huge amounts of spatial-keyword data. In this 1.5-hour tutorial, we explore recent research efforts in the area of big spatial-keyword processing. First, we give the main motivations behind big spatial-keyword systems along with real-life applications. We describe the main models for big spatial-keyword processing and list the popular spatial-keyword queries. Then, we present the approaches that have been adopted in big spatial-keyword processing systems, with special attention to data indexing and spatial and keyword data partitioning. Finally, we conclude this tutorial with a discussion of some of the open problems and research directions in the area of big spatial-keyword query processing.

【Keywords】: big data; indexing; query processing; spatial-keyword; systems