ACM SIGMOD Conference 2020: online [Portland, OR, USA]

Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020. ACM 【DBLP Link】

Paper Num: 234 || Session Num: 40

SIGMOD Keynote 1 (1 paper)

1. Systems and ML: When the Sum is Greater than Its Parts.

【Paper Link】 【Pages】:1

【Authors】: Ion Stoica

【Abstract】: BIOGRAPHY: Ion Stoica is a Professor in the EECS Department at the University of California at Berkeley, and the Director of RISELab (https://rise.cs.berkeley.edu/). He is currently doing research on cloud computing and AI systems. Past work includes Apache Spark, Apache Mesos, Tachyon, Chord DHT, and Dynamic Packet State (DPS). He is an ACM Fellow and has received numerous awards, including the Mark Weiser Award (2019), the SIGOPS Hall of Fame Award (2015), the SIGCOMM Test of Time Award (2011), and the ACM Doctoral Dissertation Award (2001). He also co-founded three companies: Anyscale (2019), Databricks (2013), and Conviva (2006).

【Keywords】:

Research 1: Crowdsourcing and Visualization (5 papers)

2. Recommending Deployment Strategies for Collaborative Tasks.

【Paper Link】 【Pages】:3-17

【Authors】: Dong Wei ; Senjuti Basu Roy ; Sihem Amer-Yahia

【Abstract】: Our work contributes to aiding requesters in deploying collaborative tasks in crowdsourcing. We initiate the study of recommending deployment strategies for collaborative tasks to requesters that are consistent with deployment parameters they desire: a lower-bound on the quality of the crowd contribution, an upper-bound on the latency of task completion, and an upper-bound on the cost incurred by paying workers. A deployment strategy is a choice of value for three dimensions: Structure (whether to solicit the workforce sequentially or simultaneously), Organization (to organize it collaboratively or independently), and Style (to rely solely on the crowd or to combine it with machine algorithms). We propose StratRec, an optimization-driven middle layer that recommends deployment strategies and alternative deployment parameters to requesters by accounting for worker availability. Our solutions are grounded in discrete optimization and computational geometry techniques that produce results with theoretical guarantees. We present extensive experiments on Amazon Mechanical Turk, and conduct synthetic experiments to validate the qualitative and scalability aspects of StratRec.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query planning; Information retrieval; Information retrieval query processing; Query suggestion; World Wide Web; Web applications; Crowdsourcing; Theory of computation; Design and analysis of algorithms; Approximation algorithms analysis; Randomness, geometry and discrete structures; Computational geometry

3. Human-in-the-loop Outlier Detection.

【Paper Link】 【Pages】:19-33

【Authors】: Chengliang Chai ; Lei Cao ; Guoliang Li ; Jian Li ; Yuyu Luo ; Samuel Madden

【Abstract】: Outlier detection is critical to a large number of applications, from financial fraud detection to health care. Although numerous approaches have been proposed to automatically detect outliers, outliers detected based on statistical rarity do not necessarily correspond to the true outliers of interest to applications. In this work, we propose a human-in-the-loop outlier detection approach, HOD, that effectively leverages human intelligence to discover the true outliers. There are two main challenges in HOD. The first is to design human-friendly questions such that humans can easily understand the questions even if they know nothing about outlier detection techniques. The second is to minimize the number of questions. To address the first challenge, we design a clustering-based method to effectively discover a small number of objects that are unlikely to be outliers (a.k.a. inliers) and yet effectively represent the typical characteristics of the given dataset. HOD then leverages this set of inliers (called context inliers) to help humans understand the context in which the outliers occur. This ensures humans are able to easily identify the true outliers from the outlier candidates produced by machine-based outlier detection techniques. To address the second challenge, we propose a bipartite graph-based question selection strategy that is theoretically proven to minimize the number of questions needed to cover all outlier candidates. Our experimental results on real data sets show that HOD significantly outperforms the state-of-the-art methods in both human effort and the quality of the discovered outliers.

【Keywords】: Computing methodologies; Machine learning; Learning paradigms; Unsupervised learning; Anomaly detection; Information systems; World Wide Web; Web applications; Crowdsourcing

4. QUAD: Quadratic-Bound-based Kernel Density Visualization.

【Paper Link】 【Pages】:35-50

【Authors】: Tsz Nam Chan ; Reynold Cheng ; Man Lung Yiu

【Abstract】: Kernel density visualization, or KDV, is used to view and understand data points in various domains, including traffic or crime hotspot detection, ecological modeling, chemical geology, and physical modeling. Existing solutions, which are based on computing kernel density estimation (KDE) functions, are computationally expensive. Our goal is to improve the performance of KDV, in order to support large datasets (e.g., one million points) and high screen resolutions (e.g., 1280 x 960 pixels). We examine two widely-used variants of KDV, namely approximate kernel density visualization (EKDV) and thresholded kernel density visualization (TKDV). For these two operations, we develop a fast solution, called QUAD, by deriving quadratic bounds of KDE functions for different types of kernel functions, including Gaussian, triangular, etc. We further adopt a progressive visualization framework for KDV, in order to stream partial visualization results to users continuously. Extensive experimental results show that our new KDV techniques can provide at least one order of magnitude speedup over existing methods, without degrading visualization quality. We further show that QUAD can produce reasonable visualization results in real time (0.5 sec), in combination with the progressive visualization framework, on a single machine without GPUs or parallel computation.
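
【Example】: The following is a minimal, illustrative Python sketch (not the paper's algorithm) of thresholded KDV: classifying a pixel as above or below a density threshold, stopping early once a cheap bound on the remaining kernel contributions settles the answer. The running-sum bound used here is a simple stand-in for QUAD's quadratic bounds; the data, bandwidth, and threshold values are made up.

    import math

    def gaussian_kernel(dist2, bandwidth):
        # Unnormalized Gaussian kernel on a squared distance (value in (0, 1]).
        return math.exp(-dist2 / (2.0 * bandwidth ** 2))

    def pixel_above_threshold(pixel, points, bandwidth, threshold):
        # Decide whether the kernel density at `pixel` exceeds `threshold`, stopping early
        # once a bound settles it: each remaining term lies in [0, 1/n].
        n = len(points)
        partial = 0.0
        for i, (px, py) in enumerate(points):
            dist2 = (pixel[0] - px) ** 2 + (pixel[1] - py) ** 2
            partial += gaussian_kernel(dist2, bandwidth) / n
            remaining = n - i - 1
            if partial > threshold:                   # lower bound already above the threshold
                return True
            if partial + remaining / n <= threshold:  # even the upper bound cannot reach it
                return False
        return partial > threshold

    points = [(0.2, 0.2), (0.25, 0.3), (0.8, 0.7)]
    grid = [(x / 3.0, y / 2.0) for y in range(3) for x in range(4)]   # a tiny 4 x 3 "screen"
    print([pixel_above_threshold(p, points, bandwidth=0.1, threshold=0.3) for p in grid])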

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Kernel methods; Mathematics of computing; Probability and statistics; Probabilistic representations; Nonparametric representations; Kernel density estimators; Theory of computation; Theory and algorithms for application domains; Machine learning theory; Kernel methods

5. ShapeSearch: A Flexible and Efficient System for Shape-based Exploration of Trendlines.

【Paper Link】 【Pages】:51-65

【Authors】: Tarique Siddiqui ; Paul Luh ; Zesheng Wang ; Karrie Karahalios ; Aditya G. Parameswaran

【Abstract】: Identifying trendline visualizations with desired patterns is a common task during data exploration. Existing visual analytics tools offer limited flexibility, expressiveness, and scalability for such tasks, especially when the pattern of interest is under-specified and approximate. We propose ShapeSearch, an efficient and flexible pattern-searching tool, that enables the search for desired patterns via multiple mechanisms: sketch, natural language, and visual regular expressions. We develop a novel shape querying algebra, with a minimal set of primitives and operators that can express a wide variety of shape search queries, and design a natural-language and regex-based parser to translate user queries to the algebraic representation. To execute these queries within interactive response times, ShapeSearch uses a fast shape algebra execution engine with query-aware optimizations, and perceptually-aware scoring methodologies. We present a thorough evaluation of the system, including a user study, a case study involving genomics data analysis, as well as performance experiments comparing against state-of-the-art trendline shape matching approaches---that together demonstrate the usability and scalability of ShapeSearch.

【Keywords】: Information systems; Information systems applications; Decision support systems; Data analytics

6. Marviq: Quality-Aware Geospatial Visualization of Range-Selection Queries Using Materialization.

【Paper Link】 【Pages】:67-82

【Authors】: Liming Dong ; Qiushi Bai ; Taewoo Kim ; Taiji Chen ; Weidong Liu ; Chen Li

【Abstract】: We study the problem of efficient spatial visualization on a large data set stored in a database using SQL queries with ad-hoc range conditions on numerical attributes, for example, a spatial scatterplot of taxi pickup events in New York between 1/1/2015 and 3/10/2015. We present a novel middleware-based technique called Marviq. It divides the selection-attribute domain into intervals, and precomputes and stores a visualization for each interval. These precomputed results, called MVS, are stored as tables in the database. We can compute an exact visualization for a request by accessing MVS and retrieving additional records from the base table. To further reduce the time of this retrieval step, we present algorithms for using MVS to compute an approximate visualization that satisfies a user-specified similarity threshold. We show a family of functions with certain properties that can use this technique. We present a further improvement that divides the MVS intervals into smaller intervals and materializes low-resolution visualizations for these intervals. We report the results of an extensive evaluation of Marviq, including a user study, and show its high performance in both space and time.
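
【Example】: A minimal Python sketch of the materialization idea (not Marviq itself): precompute a per-pixel count visualization for each fixed-width interval of the selection attribute, then answer an ad-hoc range by reusing fully covered intervals and scanning the base table only for the partial intervals at the ends. The interval width, table, and pixel ids are made-up placeholders.

    from collections import Counter

    # Base table: (selection_attribute_value, pixel_id) pairs, e.g. (pickup_time, screen cell).
    base = [(3, "p1"), (7, "p2"), (12, "p1"), (18, "p3"), (25, "p2"), (33, "p1")]
    WIDTH = 10   # width of each materialized interval (an illustrative choice)

    # Materialized visualizations (MVS): one per-pixel Counter per interval.
    mvs = {}
    for value, pixel in base:
        mvs.setdefault(value // WIDTH, Counter())[pixel] += 1

    def visualize(lo, hi):
        # Exact per-pixel counts for records with lo <= value < hi.
        result = Counter()
        for k in range(lo // WIDTH, (hi - 1) // WIDTH + 1):
            k_lo, k_hi = k * WIDTH, (k + 1) * WIDTH
            if lo <= k_lo and k_hi <= hi:
                result += mvs.get(k, Counter())        # fully covered: reuse the MVS entry
            else:
                for value, pixel in base:              # partial end interval: touch the base table
                    if k_lo <= value < k_hi and lo <= value < hi:
                        result[pixel] += 1
        return result

    print(visualize(5, 30))   # reuses the MVS entries for [10, 20) and [20, 30), scans only the [0, 10) end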

【Keywords】: Information systems; Data management systems; Middleware for databases

Research 2: Serverless and Cloud Data Management (5 papers)

7. Transactional Causal Consistency for Serverless Computing.

【Paper Link】 【Pages】:83-97

【Authors】: Chenggang Wu ; Vikram Sreekanti ; Joseph M. Hellerstein

【Abstract】: We consider the setting of serverless Function-as-a-Service (FaaS) platforms, where storage services are disaggregated from the machines that support function execution. FaaS applications consist of compositions of functions, each of which may run on a separate machine and access remote storage. The challenge we address is improving I/O latency in this setting while also providing application-wide consistency. Previous work has explored providing causal consistency for individual I/Os by carefully managing the versions stored in a client-side data cache. In our setting, a single application may execute multiple functions across different nodes, and therefore issue interrelated I/Os to multiple distinct caches. This raises the challenge of Multisite Transactional Causal Consistency (MTCC): the ability to provide causal consistency for all I/Os within a given transaction even if it runs across multiple physical sites. We present protocols for MTCC implemented in a system called HYDROCACHE. Our evaluation demonstrates orders-of-magnitude performance improvements due to caching, while also protecting against consistency anomalies that otherwise arise frequently.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Cloud computing; Information systems; Data management systems; Database management system engines; Distributed database transactions; Parallel and distributed DBMSs; Information storage systems; Storage architectures; Cloud based storage

8. Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings.

【Paper Link】 【Pages】:99-113

【Authors】: Tarique Siddiqui ; Alekh Jindal ; Shi Qiao ; Hiren Patel ; Wangchao Le

【Abstract】: Query processing over big data is ubiquitous in modern clouds, where the system takes care of picking both the physical query execution plans and the resources needed to run those plans, using a cost-based query optimizer. A good cost model, therefore, translates to better resource efficiency and lower operational costs. Unfortunately, the production workloads at Microsoft show that costs are very complex to model for big data systems. In this work, we investigate two key questions: (i) can we learn accurate cost models for big data systems, and (ii) can we integrate the learned models within the query optimizer. To answer these, we make three core contributions. First, we exploit workload patterns to learn a large number of individual cost models and combine them to achieve high accuracy and coverage over a long period. Second, we propose extensions to the Cascades framework to pick optimal resources, i.e., the number of containers, during query planning. And third, we integrate the learned cost models within the Cascades-style query optimizer of SCOPE at Microsoft. We evaluate the resulting system, Cleo, in a production environment using both production and TPC-H workloads. Our results show that the learned cost models are 2 to 3 orders of magnitude more accurate, and 20X more correlated with the actual runtimes, with a large majority (70%) of the plan changes leading to substantial improvements in latency as well as resource usage.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization

9. Lambada: Interactive Data Analytics on Cold Data Using Serverless Cloud Infrastructure.

【Paper Link】 【Pages】:115-130

【Authors】: Ingo Müller ; Renato Marroquin ; Gustavo Alonso

【Abstract】: Serverless computing has recently attracted a lot of attention from research and industry due to its promise of ultimate elasticity and operational simplicity. However, there is no consensus yet on whether or not the approach is suitable for data processing. In this paper, we present Lambada, a serverless distributed data processing framework designed to explore how to perform data analytics on serverless computing. In our analysis, supported with extensive experiments, we show in which scenarios serverless makes sense from an economic and performance perspective. We address several important technical questions that need to be solved to support data analytics and present examples from several domains where serverless offers a cost and performance advantage over existing solutions.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Online analytical processing engines; Parallel and distributed DBMSs

10. Starling: A Scalable Query Engine on Cloud Functions.

【Paper Link】 【Pages】:131-141

【Authors】: Matthew Perron ; Raul Castro Fernandez ; David J. DeWitt ; Samuel Madden

【Abstract】: Much like on-premises systems, the natural choice for running database analytics workloads in the cloud is to provision a cluster of nodes to run a database instance. However, analytics workloads are often bursty or low volume, leaving clusters idle much of the time, meaning customers pay for compute resources even when underutilized. The ability of cloud function services, such as AWS Lambda or Azure Functions, to run small, fine-grained tasks makes them appear to be a natural choice for query processing in such settings. But implementing an analytics system on cloud functions comes with its own set of challenges. These include managing hundreds of tiny stateless resource-constrained workers, handling stragglers, and shuffling data through opaque cloud services. In this paper we present Starling, a query execution engine built on cloud function services that employs a number of techniques to mitigate these challenges, providing interactive query latency at a lower total cost than provisioned systems with low-to-moderate utilization. In particular, on a 1TB TPC-H dataset in cloud storage, Starling is less expensive than the best provisioned systems for workloads in which queries arrive 1 minute apart or more. Starling also has lower latency than competing systems reading from cloud object stores and can scale to larger datasets.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Cloud computing; Information systems; Data management systems; Database management system engines; Database query processing; DBMS engine architectures; Online analytical processing engines; Information systems applications; Decision support systems; Online analytical processing

11. Learning a Partitioning Advisor for Cloud Databases.

【Paper Link】 【Pages】:143-157

【Authors】: Benjamin Hilprecht ; Carsten Binnig ; Uwe Röhm

【Abstract】: Cloud vendors provide ready-to-use distributed DBMS solutions as a service. While the provisioning of a DBMS is usually fully automated, customers typically still have to make important design decisions which were traditionally made by the database administrator such as finding an optimal partitioning scheme for a given database schema and workload. In this paper, we introduce a new learned partitioning advisor based on Deep Reinforcement Learning (DRL) for OLAP-style workloads. The main idea is that a DRL agent learns the cost tradeoffs of different partitioning schemes and can thus automate the partitioning decision. In the evaluation, we show that our advisor is able to find non-trivial partitionings for a wide range of workloads and outperforms more classical approaches for automated partitioning design.

【Keywords】: Information systems; Data management systems; Database administration; Autonomous database administration; Database management system engines; Parallel and distributed DBMSs

Research 3: Machine Learning for Databases I (5 papers)

12. DB4ML - An In-Memory Database Kernel with Machine Learning Support.

【Paper Link】 【Pages】:159-173

【Authors】: Matthias Jasny ; Tobias Ziegler ; Tim Kraska ; Uwe Röhm ; Carsten Binnig

【Abstract】: In this paper, we revisit the question of how ML algorithms can be best integrated into existing DBMSs, not only to avoid expensive data copies to external ML tools but also to satisfy regulatory requirements. The key observation is that database transactions already provide an execution model that allows DBMSs to efficiently mimic the execution model of modern parallel ML algorithms. As a main contribution, this paper presents DB4ML, an in-memory database kernel that allows applications to implement user-defined ML algorithms and efficiently run them inside a DBMS. Thereby, the ML algorithms are implemented using a programming model based on the idea of so-called iterative transactions. Our experimental evaluation shows that DB4ML can support user-defined ML algorithms inside a DBMS with the efficiency of modern specialized ML engines. In contrast to DB4ML, these engines not only need to transfer data out of the DBMS but also hardcode the ML algorithms and thus are not extensible.

【Keywords】: Computing methodologies; Machine learning; Information systems; Data management systems; Database management system engines; Parallel and distributed DBMSs

13. Active Learning for ML Enhanced Database Systems.

【Paper Link】 【Pages】:175-191

【Authors】: Lin Ma ; Bailu Ding ; Sudipto Das ; Adith Swaminathan

【Abstract】: Recent research has shown promising results by using machine learning (ML) techniques to improve the performance of database systems, e.g., in query optimization or index recommendation. However, in many production deployments, the ML models' performance degrades significantly when the test data diverges from the data used to train these models. In this paper, we address this performance degradation by using B-instances to collect additional data during deployment. We propose an active data collection platform, ADCP, that employs active learning (AL) to gather relevant data cost-effectively. We develop a novel AL technique, Holistic Active Learner (HAL), that robustly combines multiple noisy signals for data gathering in the context of database applications. HAL applies to various ML tasks, budget sizes, cost types, and budgeting interfaces for database applications. We evaluate ADCP on both industry-standard benchmarks and real customer workloads. Our evaluation shows that, compared with other baselines, our technique improves ML models' prediction performance by up to 2x with the same cost budget. In particular, on production workloads, our technique reduces the prediction error of ML models by 75% using about 100 additionally collected queries.
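
【Example】: A generic uncertainty-sampling loop in Python, shown only to illustrate how an active learner spends a labeling budget; it is not HAL, which robustly combines multiple noisy signals and supports different budget and cost types. The synthetic data, model choice, and budget below are placeholders.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    # Placeholder data: query features with a binary label (e.g. "plan regressed or not").
    X_pool = rng.normal(size=(500, 5))
    y_pool = (X_pool[:, 0] + 0.5 * X_pool[:, 1] > 0).astype(int)

    # Seed set with both classes present; the rest is the unlabeled pool.
    labeled = list(np.flatnonzero(y_pool == 1)[:5]) + list(np.flatnonzero(y_pool == 0)[:5])
    unlabeled = [i for i in range(500) if i not in set(labeled)]
    BUDGET = 50   # how many additional labels we can afford to collect

    model = LogisticRegression()
    for _ in range(BUDGET):
        model.fit(X_pool[labeled], y_pool[labeled])
        # Uncertainty signal: pick the unlabeled point whose prediction is closest to 0.5.
        probs = model.predict_proba(X_pool[unlabeled])[:, 1]
        pick = unlabeled[int(np.argmin(np.abs(probs - 0.5)))]
        labeled.append(pick)       # "collect" its label (here we simply look it up)
        unlabeled.remove(pick)

    model.fit(X_pool[labeled], y_pool[labeled])
    print("accuracy after active collection:", model.score(X_pool, y_pool))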

【Keywords】: Information systems; Data management systems; Database administration; Autonomous database administration; Database management system engines

14. Qd-tree: Learning Data Layouts for Big Data Analytics.

【Paper Link】 【Pages】:193-208

【Authors】: Zongheng Yang ; Badrish Chandramouli ; Chi Wang ; Johannes Gehrke ; Yinan Li ; Umar Farooq Minhas ; Per-Åke Larson ; Donald Kossmann ; Rajeev Acharya

【Abstract】: Corporations today collect data at an unprecedented and accelerating scale, making the need to run queries on large datasets increasingly important. Technologies such as columnar block-based data organization and compression have become standard practice in most commercial database systems. However, the problem of best assigning records to data blocks on storage is still open. For example, today's systems usually partition data by arrival time into row groups, or range/hash partition the data based on selected fields. For a given workload, however, such techniques are unable to optimize for the important metric of the number of blocks accessed by a query. This metric directly relates to the I/O cost, and therefore performance, of most analytical queries. Further, they are unable to exploit additional available storage to drive this metric down further. In this paper, we propose a new framework called a query-data routing tree, or qd-tree, to address this problem, and propose two algorithms for their construction based on greedy and deep reinforcement learning techniques. Experiments over benchmark and real workloads show that a qd-tree can provide physical speedups of more than an order of magnitude compared to current blocking schemes, and can reach within 2X of the lower bound for data skipping based on selectivity, while providing complete semantic descriptions of created blocks.

【Keywords】: Computing methodologies; Machine learning; Learning paradigms; Reinforcement learning; Information systems; Data management systems; Data structures; Data access methods; Data layout; Database management system engines; Database query processing; Information storage systems; Record storage systems; Relational storage

15. Facilitating SQL Query Composition and Analysis.

【Paper Link】 【Pages】:209-224

【Authors】: Zainab Zolaktaf ; Mostafa Milani ; Rachel Pottinger

【Abstract】: Formulating efficient SQL queries requires several cycles of tuning and execution. We examine methods that can accelerate and improve this interaction by providing insights about SQL queries prior to execution. We achieve this by predicting properties such as the query answer size, its run-time, and error class. Unlike existing approaches, our approach does not rely on any statistics from the database instance or query execution plans. Our approach is based on using data-driven machine learning techniques that rely on large query workloads to model SQL queries and their properties. Empirical results show that the neural network models are more accurate in predicting several query properties.

【Keywords】: Computing methodologies; Machine learning; Information systems; Information systems applications; Data mining

16. MONSOON: Multi-Step Optimization and Execution of Queries with Partially Obscured Predicates.

【Paper Link】 【Pages】:225-240

【Authors】: Sourav Sikdar ; Chris Jermaine

【Abstract】: User-defined functions (UDFs) in modern SQL database systems and Big Data processing systems such as Spark---which offer API bindings in high-level languages such as Python or Scala---make automatic optimization challenging. The foundation of modern database query optimization is the collection of statistics describing the data to be processed, but when a database or Big Data computation is partially obscured by UDFs, good statistics are often unavailable. In this paper, we describe a query optimizer called the Monsoon optimizer. In the presence of UDFs, the Monsoon optimizer may choose to collect statistics on the UDFs, and then run the computation. Or, it may optimize and execute part of the plan, collecting statistics on the result of the partial plan, followed by a re-optimization step, with the process repeated as needed. Monsoon decides how to interleave execution and statistics collection in a principled fashion by formalizing the problem as a Markov decision process.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization

Research 4: Uncertain, Probabilistic, and Approximate Data (5 papers)

17. Causal Relational Learning.

【Paper Link】 【Pages】:241-256

【Authors】: Babak Salimi ; Harsh Parikh ; Moe Kayali ; Lise Getoor ; Sudeepa Roy ; Dan Suciu

【Abstract】: Causal inference is at the heart of empirical research in natural and social sciences and is critical for scientific discovery and informed decision making. The gold standard in causal inference is performing randomized controlled trials ; unfortunately these are not always feasible due to ethical, legal, or cost constraints. As an alternative, methodologies for causal inference from observational data have been developed in statistical studies and social sciences. However, existing methods critically rely on restrictive assumptions such as the study population consisting of homogeneous elements that can be represented in a single flat table, where each row is referred to as a unit. In contrast, in many real-world settings, the study domain naturally consists of heterogeneous elements with complex relational structure, where the data is naturally represented in multiple related tables. In this paper, we present a formal framework for causal inference from such relational data. We propose a declarative language called CARL for capturing causal background knowledge and assumptions, and specifying causal queries using simple Datalog-like rules. CARL provides a foundation for inferring causality and reasoning about the effect of complex interventions in relational domains. We present an extensive experimental evaluation on real relational data to illustrate the applicability of CARL in social sciences and healthcare.

【Keywords】: Computing methodologies; Artificial intelligence; Knowledge representation and reasoning; Causal reasoning and diagnostics; Information systems; Data management systems; Database design and models; Data model extensions; Uncertainty; Mathematics of computing; Probability and statistics; Probabilistic representations; Causal networks; Theory of computation; Theory and algorithms for application domains; Database theory; Database query languages (principles); Incomplete, inconsistent, and uncertain databases; Machine learning theory

18. Sample Debiasing in the Themis Open World Database System.

【Paper Link】 【Pages】:257-268

【Authors】: Laurel J. Orr ; Magdalena Balazinska ; Dan Suciu

【Abstract】: Open world database management systems, which assume that tuples not in the database may still exist, are becoming an increasingly important area of research. We present Themis, the first open world database that automatically rebalances arbitrarily biased samples to approximately answer queries as if they were issued over the entire population. We leverage a priori population aggregate information to develop and combine two different approaches for automatic debiasing: sample reweighting and Bayesian network probabilistic modeling. We build a prototype of Themis and demonstrate that Themis achieves higher query accuracy than the default AQP approach, an alternative sample reweighting technique, and a variety of Bayesian network models while maintaining interactive query response times. We also show that Themis is robust to differences in the support between the sample and population, a key use case when using social media samples.
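
【Example】: A toy Python illustration of the sample-reweighting half of the idea (the Bayesian-network modeling is omitted): a biased sample is reweighted using known population counts per group so that an aggregate query is answered approximately as if it ran over the whole population. All numbers are made up.

    from collections import Counter

    # Biased sample of people: (region, owns_car). Urban users are over-represented.
    sample = [("urban", 1)] * 60 + [("urban", 0)] * 20 + [("rural", 1)] * 5 + [("rural", 0)] * 15

    # A priori population aggregate (e.g. from a census): people per region.
    population_by_region = {"urban": 1_000, "rural": 3_000}

    # Weight of a sampled tuple = population count of its group / sample count of its group.
    sample_counts = Counter(region for region, _ in sample)
    weight = {r: population_by_region[r] / sample_counts[r] for r in population_by_region}

    # Debiased answer to: SELECT AVG(owns_car) FROM population
    naive = sum(owns for _, owns in sample) / len(sample)
    debiased = (sum(weight[r] * owns for r, owns in sample)
                / sum(weight[r] for r, _ in sample))
    print(f"naive estimate: {naive:.2f}, reweighted estimate: {debiased:.2f}")   # 0.65 vs 0.38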

【Keywords】: Information systems; Data management systems; Database design and models; Data model extensions; Incomplete data; Uncertainty; Information integration; Data cleaning

19. Stochastic Package Queries in Probabilistic Databases.

【Paper Link】 【Pages】:269-283

【Authors】: Matteo Brucato ; Nishant Yadav ; Azza Abouzied ; Peter J. Haas ; Alexandra Meliou

【Abstract】: We provide methods for in-database support of decision making under uncertainty. Many important decision problems correspond to selecting a "package" (a bag of tuples in a relational database) that jointly satisfies a set of constraints while minimizing some overall "cost" function; in most real-world problems, the data is uncertain. We provide methods for specifying---via a SQL extension---and processing stochastic package queries (SPQs), in order to solve optimization problems over uncertain data, right where the data resides. Prior work in stochastic programming uses Monte Carlo methods where the original stochastic optimization problem is approximated by a large deterministic optimization problem that incorporates many "scenarios", i.e., sample realizations of the uncertain data values. For large database tables, however, a huge number of scenarios is required, leading to poor performance and, often, failure of the solver software. We therefore provide a novel SummarySearch algorithm that, instead of trying to solve a large deterministic problem, seamlessly approximates it via a sequence of smaller problems defined over carefully crafted "summaries" of the scenarios that accelerate convergence to a feasible and near-optimal solution. Experimental results on our prototype system show that SummarySearch can be orders of magnitude faster than prior methods at finding feasible and high-quality packages.
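
【Example】: A toy Python sketch of the naive Monte Carlo formulation that the paper improves on (not SummarySearch): sample scenarios of the uncertain attribute, then pick, by brute force over small candidate packages, the one that maximizes expected gain while keeping the estimated probability of a loss below 5%. The tuples, distributions, and limits are invented for illustration.

    import itertools
    import random

    random.seed(0)
    # Three uncertain tuples (e.g. trades) with assumed gain distributions: (mean, std) per tuple.
    gain_dist = {"t1": (5, 1), "t2": (6, 4), "t3": (1, 0.1)}

    NUM_SCENARIOS = 1_000
    scenarios = [{t: random.gauss(m, s) for t, (m, s) in gain_dist.items()}
                 for _ in range(NUM_SCENARIOS)]

    def expected_gain(package):
        return sum(sum(s[t] for t in package) for s in scenarios) / NUM_SCENARIOS

    def loss_probability(package):
        return sum(1 for s in scenarios if sum(s[t] for t in package) < 0) / NUM_SCENARIOS

    # Package query, informally: maximize expected total gain subject to
    # P(total gain < 0) <= 5% and package size <= 2, evaluated over the sampled scenarios.
    candidates = (pkg for r in (1, 2) for pkg in itertools.combinations(gain_dist, r)
                  if loss_probability(pkg) <= 0.05)
    best = max(candidates, key=expected_gain)
    print("chosen package:", best, "expected gain:", round(expected_gain(best), 2))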

【Keywords】:

20. Fast and Reliable Missing Data Contingency Analysis with Predicate-Constraints.

【Paper Link】 【Pages】:285-295

【Authors】: Xi Liang ; Zechao Shang ; Sanjay Krishnan ; Aaron J. Elmore ; Michael J. Franklin

【Abstract】: Today, data analysts largely rely on intuition to determine whether missing or withheld rows of a dataset significantly affect their analyses. We propose a framework that can produce automatic contingency analysis, i.e., the range of values an aggregate SQL query could take, under formal constraints describing the variation and frequency of missing data tuples. We describe how to process SUM, COUNT, AVG, MIN, and MAX queries under these conditions, resulting in hard error bounds with testable constraints. We propose an optimization algorithm based on an integer program that reconciles a set of such constraints, even if they are overlapping, conflicting, or unsatisfiable, into such bounds. Our experiments on real-world datasets against several statistical imputation and inference baselines show that statistical techniques can have a deceptively high error rate that is often unpredictable. In contrast, our framework offers hard bounds that are guaranteed to hold if the constraints are not violated. In spite of these hard bounds, we show competitive accuracy to statistical baselines.
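
【Example】: A minimal Python illustration of hard bounds under one simple constraint family (at most k rows are missing, and each missing value lies in [lo, hi]); the paper's framework handles richer, possibly conflicting predicate-constraints via an integer program. The values below are made up.

    def sum_bounds(observed, k, lo, hi):
        # Range of SELECT SUM(x) given up to k missing rows, each with x in [lo, hi].
        lower = sum(observed) + min(0, k * lo)   # missing rows can lower the sum only if lo < 0
        upper = sum(observed) + max(0, k * hi)   # ... and raise it only if hi > 0
        return lower, upper

    def count_bounds(observed, k):
        # Range of SELECT COUNT(*) given up to k missing rows.
        return len(observed), len(observed) + k

    observed = [10, 25, 7, 18]
    print(sum_bounds(observed, k=3, lo=-5, hi=40))   # (45, 180)
    print(count_bounds(observed, k=3))               # (4, 7)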

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Theory of computation; Design and analysis of algorithms; Streaming, sublinear and near linear time algorithms; Sketching and sampling

21. Mining Approximate Acyclic Schemes from Relations.

【Paper Link】 【Pages】:297-312

【Authors】: Batya Kenig ; Pranay Mundra ; Guna Prasaad ; Babak Salimi ; Dan Suciu

【Abstract】: Acyclic schemes have numerous applications in databases and in machine learning, such as improved design, more efficient storage, and increased performance for queries and machine learning algorithms. Multivalued dependencies (MVDs) are the building blocks of acyclic schemes. The discovery from data of both MVDs and acyclic schemes is more challenging than other forms of data dependencies, such as Functional Dependencies, because these dependencies do not hold on subsets of data, and because they are very sensitive to noise in the data; for example a single wrong or missing tuple may invalidate the schema. In this paper we present Maimon, a system for discovering approximate acyclic schemes and MVDs from data. We give a principled definition of approximation, by using notions from information theory, then describe the two components of Maimon: mining for approximate MVDs, then reconstructing acyclic schemes from approximate MVDs. We conduct an experimental evaluation of Maimon on 20 real-world datasets, and show that it can scale up to 1M rows, and up to 30 columns.

【Keywords】: Information systems; Data management systems; Database design and models; Relational database model; Theory of computation; Theory and algorithms for application domains; Database theory; Data structures and algorithms for data management

Industry 1: Graph Databases and Knowledge Bases (6 papers)

22. AliCoCo: Alibaba E-commerce Cognitive Concept Net.

【Paper Link】 【Pages】:313-327

【Authors】: Xusheng Luo ; Luxin Liu ; Yonghua Yang ; Le Bo ; Yuanpeng Cao ; Jinghang Wu ; Qiang Li ; Keping Yang ; Kenny Q. Zhu

【Abstract】: One of the ultimate goals of e-commerce platforms is to satisfy the various shopping needs of their customers. Much effort has been devoted to creating taxonomies or ontologies in e-commerce towards this goal. However, user needs in e-commerce are still not well defined, and none of the existing ontologies has enough depth and breadth for universal user-needs understanding. The resulting semantic gap prevents the shopping experience from becoming more intelligent. In this paper, we propose to construct a large-scale E-commerce Cognitive Concept Net named "AliCoCo", which is deployed in practice at Alibaba, the largest Chinese e-commerce platform in the world. We formally define user needs in e-commerce, then conceptualize them as nodes in the net. We present details on how AliCoCo is constructed semi-automatically, and its successful, ongoing, and potential applications in e-commerce.

【Keywords】: Applied computing; Electronic commerce; E-commerce infrastructure; Computing methodologies; Artificial intelligence; Knowledge representation and reasoning; Ontology engineering; Semantic networks; Natural language processing; Information extraction

23. A1: A Distributed In-Memory Graph Database.

【Paper Link】 【Pages】:329-344

【Authors】: Chiranjeeb Buragohain ; Knut Magne Risvik ; Paul Brett ; Miguel Castro ; Wonhee Cho ; Joshua Cowhig ; Nikolas Gloy ; Karthik Kalyanaraman ; Richendra Khanna ; John Pao ; Matthew Renzelmann ; Alex Shamis ; Timothy Tan ; Shuheng Zheng

【Abstract】: A1 is an in-memory distributed database used by the Bing search engine to support complex queries over structured data. The key enablers for A1 are the availability of cheap DRAM and high-speed RDMA (Remote Direct Memory Access) networking in commodity hardware. A1 uses FaRM [11,12] as its underlying storage layer and builds the graph abstraction and query engine on top. The combination of in-memory storage and RDMA access requires rethinking how data is allocated, organized and queried in a large distributed system. A single A1 cluster can store tens of billions of vertices and edges and support a throughput of 350+ million vertex reads per second with end-to-end query latency in single-digit milliseconds. In this paper we describe the A1 data model, RDMA-optimized data structures, and query execution.

【Keywords】: Information systems; Data management systems; Data structures; Data layout; Record and block layout; Database design and models; Graph-based database models; Network data models; Database management system engines; Database query processing; Main memory engines; Parallel and distributed DBMSs; Query languages; Query languages for non-relational engines

24. IBM Db2 Graph: Supporting Synergistic and Retrofittable Graph Queries Inside IBM Db2.

【Paper Link】 【Pages】:345-359

【Authors】: Yuanyuan Tian ; En Liang Xu ; Wei Zhao ; Mir Hamid Pirahesh ; Suijun Tong ; Wen Sun ; Thomas Kolanko ; Md. Shahidul Haque Apu ; Huijuan Peng

【Abstract】: To meet the challenge of analyzing rapidly growing graph and network data created by modern applications, a large number of graph databases have emerged, such as Neo4j and JanusGraph. They mainly target low-latency graph queries, such as finding the neighbors of a vertex with certain properties, and retrieving the shortest path between two vertices. Although many of the graph databases handle the graph-only queries very well, they fall short for real life applications involving graph analysis. This is because graph queries are not all that one does in an analytics workload of a real life application. They are often only a part of an integrated heterogeneous analytics pipeline, which may include SQL, machine learning, graph, and other analytics. This means graph queries need to be synergistic with other analytics. Unfortunately, most existing graph databases are standalone and cannot easily integrate with other analytics systems. In addition, many graph data (data about relationships between objects or people) are already prevalent in existing non-graph databases, although they are not explicitly stored as graphs. None of existing graph databases can retrofit graph queries onto these existing data without transferring or transforming data. In this paper, we propose an in-DBMS graph query approach, IBM Db2 Graph, to support synergistic and retrofittable graph queries inside the IBM Db2 relational database. It is implemented as a layer inside Db2, and thus can support integrated graph and SQL analytics efficiently. Db2 Graph employs a novel graph overlay approach to expose a graph view of the relational data. This approach flexibly retrofits graph queries to existing graph data stored in relational tables, without expensive data transfer or transformation. In addition, it enables efficient execution of graph queries with the help of Db2 relational engine, through sophisticated compile-time and runtime optimization strategies. Our experimental study, as well as our experience with real customers using Db2 Graph, showed that Db2 Graph can provide very competitive and sometimes even better performance on graph-only queries, compared to existing graph databases. Moreover, it optimizes the overall performance of complex analytics workloads.

【Keywords】: Information systems; Data management systems; Database design and models; Graph-based database models; Database management system engines; Database query processing; Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms; Theory of computation; Design and analysis of algorithms; Graph algorithms analysis

25. An Ontology-Based Conversation System for Knowledge Bases.

【Paper Link】 【Pages】:361-376

【Authors】: Abdul Quamar ; Chuan Lei ; Dorian Miller ; Fatma Ozcan ; Jeffrey Kreulen ; Robert J. Moore ; Vasilis Efthymiou

【Abstract】: Domain-specific knowledge bases (KB), carefully curated from various data sources, provide an invaluable reference for professionals. Conversation systems make these KBs easily accessible to professionals and are gaining popularity due to recent advances in natural language understanding and AI. Despite the increasing use of various conversation systems in open-domain applications, the requirements of a domain-specific conversation system are quite different and challenging. In this paper, we propose an ontology-based conversation system for domain-specific KBs. In particular, we exploit the domain knowledge inherent in the domain ontology to identify user intents, and the corresponding entities to bootstrap the conversation space. We incorporate the feedback from domain experts to further refine these patterns, and use them to generate training samples for the conversation model, lifting the heavy burden from the conversation designers. We have incorporated our innovations into a conversation agent focused on healthcare as a feature of the IBM Micromedex product.

【Keywords】: Human-centered computing; Human computer interaction (HCI); Interaction paradigms; Natural language interfaces; Information systems; Information systems applications

26. Aggregation Support for Modern Graph Analytics in TigerGraph.

【Paper Link】 【Pages】:377-392

【Authors】: Alin Deutsch ; Yu Xu ; Mingxi Wu ; Victor E. Lee

【Abstract】: We describe how GSQL, TigerGraph's graph query language, supports the specification of aggregation in graph analytics. GSQL makes several unique design decisions with respect to both the expressive power and the evaluation complexity of the specified aggregation. We detail our design showing how our ideas transcend GSQL and are eminently portable to the upcoming graph query language standards as well as the existing pattern-based declarative query languages.

【Keywords】: Information systems; Data management systems; Database design and models; Graph-based database models; Query languages; Query languages for non-relational engines

27. GIANT: Scalable Creation of a Web-scale Ontology.

【Paper Link】 【Pages】:393-409

【Authors】: Bang Liu ; Weidong Guo ; Di Niu ; Jinwen Luo ; Chaoyue Wang ; Zhen Wen ; Yu Xu

【Abstract】: Understanding what online users may pay attention to on the web is key to content recommendation and search services. These services will benefit from a highly structured and web-scale ontology of entities, concepts, events, topics and categories. While existing knowledge bases and taxonomies embody a large volume of entities and categories, we argue that they fail to discover properly grained concepts, events and topics in the language style of online users. Neither is a logically structured ontology maintained among these notions. In this paper, we present GIANT, a mechanism to construct a user-centered, web-scale, structured ontology, containing a large number of natural language phrases conforming to user attentions at various granularities, mined from the vast volume of web documents and search click logs. Various types of edges are also constructed to maintain a hierarchy in the ontology. We present our detailed techniques used in GIANT, and evaluate the proposed models and methods as compared to a variety of baselines, as well as deploy the resulting Attention Ontology in real-world applications, involving over a billion users, to observe its effect on content recommendation. The online performance of the ontology built by GIANT proves that it can significantly improve the click-through rate in news feed recommendation.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing; Information extraction; Information systems; Information retrieval; Document representation; Information systems applications; Data mining

SIGMOD Panel (1 paper)

28. The Next 5 Years: What Opportunities Should the Database Community Seize to Maximize its Impact?

【Paper Link】 【Pages】:411-414

【Authors】: Magda Balazinska ; Surajit Chaudhuri ; Anastasia Ailamaki ; Juliana Freire ; Sailesh Krishnamurthy ; Michael Stonebraker

【Abstract】: The database research community has been spectacularly successful in impacting the industry and academia since the invention of the relational model. Examples of innovation in the last decade include columnar storage for data analytic platforms, cloud data services, HTAP systems, and a new generation of data wrangling systems. Despite this success, critical self-assessment by the community and identifying key opportunities for the future is essential if we are to continue the tradition of impactful research. In the Fall of 2018, following a long tradition that dates back to 1988 [1], and five years after the last such meeting [2], a group of approximately thirty database researchers gathered at the University of Washington, Seattle for two days to discuss the opportunities we have as a community for impactful research. A report from that meeting is now available [3]. The discussions in the Seattle meeting focused not just on technical challenges and opportunities but also on topics related to how we organize ourselves as a community. This SIGMOD panel will provide a forum for the broader database community to review and debate the findings from the Seattle Report on Database Research [3] as well as to identify other challenges, and opportunities that need to be taken into account.

【Keywords】: Information systems; Data management systems

Research 5: Data Provenance (5 papers)

29. Equivalence-Invariant Algebraic Provenance for Hyperplane Update Queries.

【Paper Link】 【Pages】:415-429

【Authors】: Pierre Bourhis ; Daniel Deutch ; Yuval Moskovitch

【Abstract】: The algebraic approach to provenance tracking, originating in the semiring model of Green et al., has proven useful as an abstract way of handling metadata. Commutative semirings were shown to be the "correct" algebraic structure for Unions of Conjunctive Queries, in the sense that their use allows provenance to be invariant under certain expected query equivalence axioms.
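
【Example】: A small Python sketch of the classical N[X] semiring provenance model that this line of work builds on (not the paper's contribution on hyperplane update queries): each input tuple carries a token, joins multiply annotations, and projection/union adds them, yielding a provenance polynomial per output tuple. The relations and tokens are invented.

    from sympy import symbols, expand

    # Annotate each tuple of R(a, b) and S(b, c) with a provenance token.
    r1, r2, s1, s2 = symbols("r1 r2 s1 s2")
    R = [((1, 2), r1), ((1, 3), r2)]
    S = [((2, 9), s1), ((3, 9), s2)]

    # Q(a, c) :- R(a, b), S(b, c)  -- join multiplies annotations, projection adds them.
    result = {}
    for (a, b), ann_r in R:
        for (b2, c), ann_s in S:
            if b == b2:
                result[(a, c)] = result.get((a, c), 0) + ann_r * ann_s

    for tup, provenance in result.items():
        print(tup, "->", expand(provenance))   # (1, 9) -> r1*s1 + r2*s2: two alternative derivations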

【Keywords】: Information systems; Data management systems; Database design and models; Data model extensions; Data provenance; Database management system engines; Database transaction processing; Transaction logging; Theory of computation; Theory and algorithms for application domains; Database theory; Data provenance

30. Causality-Guided Adaptive Interventional Debugging.

【Paper Link】 【Pages】:431-446

【Authors】: Anna Fariha ; Suman Nath ; Alexandra Meliou

【Abstract】: Runtime nondeterminism is a fact of life in modern database applications. Previous research has shown that nondeterminism can cause applications to intermittently crash, become unresponsive, or experience data corruption. We propose Adaptive Interventional Debugging (AID) for debugging such intermittent failures. AID combines existing statistical debugging, causal analysis, fault injection, and group testing techniques in a novel way to (1) pinpoint the root cause of an application's intermittent failure and (2) generate an explanation of how the root cause triggers the failure. AID works by first identifying a set of runtime behaviors (called predicates) that are strongly correlated to the failure. It then utilizes temporal properties of the predicates to (over)-approximate their causal relationships. Finally, it uses fault injection to execute a sequence of interventions on the predicates and discover their true causal relationships. This enables AID to identify the true root cause and its causal relationship to the failure. We theoretically analyze how fast AID can converge to the identification. We evaluate AID with six real-world applications that intermittently fail under specific inputs. In each case, AID was able to identify the root cause and explain how the root cause triggered the failure, much faster than group testing and more precisely than statistical debugging. We also evaluate AID with many synthetically generated applications with known root causes and confirm that the benefits also hold for them.

【Keywords】: Computing methodologies; Artificial intelligence; Knowledge representation and reasoning; Causal reasoning and diagnostics; Software and its engineering; Software creation and management; Software verification and validation; Software defect analysis; Software testing and debugging; Software organization and properties; Contextual software domains; Operating systems; Process management; Concurrency control; Theory of computation; Design and analysis of algorithms; Algorithm design techniques; Divide and conquer

31. PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models.

【Paper Link】 【Pages】:447-462

【Authors】: Yinjun Wu ; Val Tannen ; Susan B. Davidson

【Abstract】: The ubiquitous use of machine learning algorithms brings new challenges to traditional database problems such as incremental view update. Much effort is being put in better understanding and debugging machine learning models, as well as in identifying and repairing errors in training datasets. Our focus is on how to assist these activities when they have to retrain the machine learning model after removing problematic training samples in cleaning or selecting different subsets of training data for interpretability. This paper presents an efficient provenance-based approach, PrIU, and its optimized version, PrIU-opt, for incrementally updating model parameters without sacrificing prediction accuracy. We prove the correctness and convergence of the incrementally updated model parameters, and validate it experimentally. Experimental results show that up to two orders of magnitude speed-ups can be achieved by PrIU-opt compared to simply retraining the model from scratch, yet obtaining highly similar models.
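
【Example】: A simplified Python stand-in for the incremental-update idea, for ordinary least squares only: keep the sufficient statistics XᵀX and Xᵀy from the initial training, subtract the removed samples' contributions, and re-solve, instead of retraining from scratch. PrIU itself targets iterative/gradient-based regression training and proves convergence; the data and removed rows below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 4))
    y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=1000)

    # State kept from the initial training: the OLS sufficient statistics.
    XtX, Xty = X.T @ X, X.T @ y

    def remove_samples(bad_rows):
        # Update the model after deleting the given training rows, without full retraining.
        global XtX, Xty
        Xb, yb = X[bad_rows], y[bad_rows]
        XtX -= Xb.T @ Xb
        Xty -= Xb.T @ yb
        return np.linalg.solve(XtX, Xty)

    theta_incremental = remove_samples([3, 17, 256])
    keep = np.setdiff1d(np.arange(1000), [3, 17, 256])
    theta_retrained = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    print(np.allclose(theta_incremental, theta_retrained))   # True: same model, far less work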

【Keywords】: Information systems; Data management systems; Database design and models; Data model extensions; Data provenance; Mathematics of computing; Probability and statistics; Statistical paradigms; Exploratory data analysis; Theory of computation; Design and analysis of algorithms; Mathematical optimization; Continuous optimization; Convex optimization

32. BugDoc: Algorithms to Debug Computational Processes.

【Paper Link】 【Pages】:463-478

【Authors】: Raoni Lourenço ; Juliana Freire ; Dennis E. Shasha

【Abstract】: Data analysis for scientific experiments and enterprises, large-scale simulations, and machine learning tasks all entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous outputs, the pipeline may fail to execute or produce incorrect results. Inferring the root cause(s) of such failures is challenging, usually requiring time and much human thought, while still being error-prone. We propose a new approach that makes use of iteration and provenance to automatically infer the root causes and derive succinct explanations of failures. Through a detailed experimental evaluation, we assess the cost, precision, and recall of our approach compared to the state of the art. Our experimental data and processing software is available for use, reproducibility, and enhancement.

【Keywords】: Information systems; Data management systems; Database design and models; Data model extensions; Data provenance

33. Computing Local Sensitivities of Counting Queries with Joins.

【Paper Link】 【Pages】:479-494

【Authors】: Yuchao Tao ; Xi He ; Ashwin Machanavajjhala ; Sudeepa Roy

【Abstract】: Local sensitivity of a query Q given a database instance D, i.e. how much the output Q(D) changes when a tuple is added to D or deleted from D, has many applications including query analysis, outlier detection, and differential privacy. However, it is NP-hard to find local sensitivity of a conjunctive query in terms of the size of the query, even for the class of acyclic queries. Although the complexity is polynomial when the query size is fixed, the naive algorithms are not efficient for large databases and queries involving multiple joins. In this paper, we present a novel approach to compute local sensitivity of counting queries involving join operations by tracking and summarizing tuple sensitivities. We give algorithms for the sensitivity problem for full acyclic join queries using join trees, that run in polynomial time in both the size of the database and query for an interesting sub-class of queries, which we call 'doubly acyclic queries' that include path queries, and in polynomial time in combined complexity when the maximum degree in the join tree is bounded. Our algorithms can be extended to certain non-acyclic queries using generalized hypertree decompositions. We evaluate our approach and show applications of our algorithms to obtain better results for differential privacy by orders of magnitude.
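
【Example】: A brute-force Python illustration of the deletion side of local sensitivity for a two-table counting join on a tiny instance: each tuple's sensitivity is the number of join results it participates in, and the local sensitivity is the maximum over tuples. The paper's algorithms avoid this enumeration and also handle insertions and larger acyclic joins; the relations here are invented.

    from collections import Counter

    # Q = SELECT COUNT(*) FROM R JOIN S ON R.b = S.b
    R = [(1, "x"), (2, "x"), (3, "y")]                 # tuples (a, b)
    S = [("x", 10), ("x", 11), ("y", 12), ("y", 13)]   # tuples (b, c)

    def join_count(R, S):
        s_degree = Counter(b for b, _ in S)
        return sum(s_degree[b] for _, b in R)

    def local_sensitivity_delete(R, S):
        # Max change in the count when one tuple is deleted from either relation.
        base = join_count(R, S)
        changes = [base - join_count(R[:i] + R[i + 1:], S) for i in range(len(R))]
        changes += [base - join_count(R, S[:i] + S[i + 1:]) for i in range(len(S))]
        return max(changes)

    print(join_count(R, S))                 # 6 join results
    print(local_sensitivity_delete(R, S))   # 2: e.g. deleting an R tuple with b = "x" removes 2 of them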

【Keywords】: Security and privacy; Database and storage security; Theory of computation; Theory and algorithms for application domains; Database theory; Theory of database privacy and security

Research 6: Transaction Processing and Query Optimization (5 papers)

34. Long-lived Transactions Made Less Harmful.

【Paper Link】 【Pages】:495-510

【Authors】: Jong-Bin Kim ; Hyunsoo Cho ; Kihwang Kim ; Jaeseon Yu ; Sooyong Kang ; Hyungsoo Jung

【Abstract】: Many systems use snapshot isolation, or something similar, as their default, and multi-version concurrency control (MVCC) remains essential to offering such point-in-time consistency. One major issue in MVCC is the timely removal of unnecessary versions of data items, especially in the presence of long-lived transactions (LLTs). We have observed that the latest versions of MySQL and PostgreSQL are still vulnerable to LLTs. Our analysis of existing proposals suggests that new solutions to this matter must provide rigorous rules for completely identifying unnecessary versions, and elaborate designs for version cleaning lest old versions required for LLTs should suspend garbage collection. In this paper, we formalize such rules into our version pruning theorem and version classification, all of which form the theoretical foundations for our new version management system, vDriver, which bases its record versioning on a new principle: Single In-row Remaining Off-row (SIRO) versioning. We implemented a prototype of vDriver and integrated it with MySQL-8.0 and PostgreSQL-12.0. The experimental evaluation demonstrated that the engines with vDriver continue to perform the reclamation of dead versions in the face of LLTs, while retaining transaction throughput with reduced space consumption.

【Keywords】: Information systems; Data management systems; Database management system engines; DBMS engine architectures

35. Chiller: Contention-centric Transaction Execution and Data Partitioning for Modern Networks.

【Paper Link】 【Pages】:511-526

【Authors】: Erfan Zamanian ; Julian Shun ; Carsten Binnig ; Tim Kraska

【Abstract】: Distributed transactions on high-overhead TCP/IP-based networks were conventionally considered to be prohibitively expensive and thus were avoided at all costs. To that end, the primary goal of almost any existing partitioning scheme is to minimize the number of cross-partition transactions. However, with the new generation of fast RDMA-enabled networks, this assumption is no longer valid. In fact, recent work has shown that distributed databases can scale even when the majority of transactions are cross-partition. In this paper, we first make the case that the new bottleneck which hinders truly scalable transaction processing in modern RDMA-enabled databases is data contention, and that optimizing for data contention leads to different partitioning layouts than optimizing for the number of distributed transactions. We then present Chiller, a new approach to data partitioning and transaction execution, which aims to minimize data contention for both local and distributed transactions. Finally, we evaluate Chiller using various workloads, and show that our partitioning and execution strategy outperforms traditional partitioning techniques which try to avoid distributed transactions, by up to a factor of 2.

【Keywords】: Information systems; Data management systems; Database management system engines; Distributed database transactions

36. Handling Highly Contended OLTP Workloads Using Fast Dynamic Partitioning.

【Paper Link】 【Pages】:527-542

【Authors】: Guna Prasaad ; Alvin Cheung ; Dan Suciu

【Abstract】: Research on transaction processing has made significant progress towards improving performance of main memory multicore OLTP systems under low contention. However, these systems struggle on workloads with lots of conflicts. Partitioned databases (and variants) perform well on high contention workloads that are statically partitionable, but time-varying workloads often make them impractical. Towards addressing this, we propose Strife---a novel transaction processing scheme that clusters transactions together dynamically and executes most of them without any concurrency control. Strife executes transactions in batches, where each batch is partitioned into disjoint clusters without any cross-cluster conflicts and a small set of residuals. The clusters are then executed in parallel with no concurrency control, followed by residuals separately executed with concurrency control. Strife uses a fast dynamic clustering algorithm that exploits a combination of random sampling and concurrent union-find data structure to partition the batch online, before executing it. Strife outperforms lock-based and optimistic protocols by up to 2x on high contention workloads. While Strife incurs about 50% overhead relative to partitioned systems in the statically partitionable case, it performs 2x better when such static partitioning is not possible and adapts to dynamically varying workloads.
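
【Example】: A Python sketch of the clustering step using a union-find structure (a simplified, non-sampled version of what Strife does per batch): transactions that transitively share a record key land in the same cluster, small clusters run in parallel with no concurrency control, and oversized clusters are treated as residuals. The batch and size limit are invented.

    from collections import defaultdict

    # Each transaction is the set of record keys it reads or writes.
    batch = {
        "T1": {"a", "b"}, "T2": {"b", "c"}, "T3": {"d"},
        "T4": {"e", "f"}, "T5": {"f"}, "T6": {"c", "g"},
    }

    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Two transactions end up in the same cluster iff they (transitively) share a key.
    for keys in batch.values():
        first, *rest = keys
        for k in rest:
            union(first, k)

    clusters = defaultdict(list)
    for txn, keys in batch.items():
        clusters[find(next(iter(keys)))].append(txn)

    RESIDUAL_LIMIT = 2   # clusters larger than this fall back to concurrency control
    for txns in clusters.values():
        mode = "residual (with concurrency control)" if len(txns) > RESIDUAL_LIMIT else "parallel, no locking"
        print(txns, "->", mode)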

【Keywords】: Information systems; Data management systems; Database management system engines; Database transaction processing; Main memory engines; Theory of computation; Design and analysis of algorithms; Concurrent algorithms

37. A Transactional Perspective on Execute-order-validate Blockchains.

Paper Link】 【Pages】:543-557

【Authors】: Pingcheng Ruan ; Dumitrel Loghin ; Quang-Trung Ta ; Meihui Zhang ; Gang Chen ; Beng Chin Ooi

【Abstract】: Smart contracts have enabled blockchain systems to evolve from simple cryptocurrency platforms into general transactional systems. A new architecture called execute-order-validate has been proposed in Hyperledger Fabric to support parallel transactions. However, this architecture may produce many invalid transactions when serializing them. The problem is further exacerbated because the block formation rate is inherently limited by factors other than data processing, such as cryptography and consensus. Inspired by optimistic concurrency control in modern databases, we propose a novel method to enhance the execute-order-validate architecture by reordering transactions to reduce the abort rate. In contrast to existing blockchains that adopt databases' preventive approaches, which may over-abort serializable transactions, our method is theoretically more fine-grained: unserializable transactions are aborted before reordering and the rest are guaranteed to be serializable. We implement our method in two blockchains: FabricSharp on top of Hyperledger Fabric, and FastFabricSharp on top of FastFabric. We compare the performance of FabricSharp with vanilla Fabric and three related systems, two of which are implemented with a standard and a state-of-the-art concurrency control technique from databases, respectively. The results demonstrate that FabricSharp achieves 25% higher throughput compared to the other systems in nearly all experimental scenarios. Moreover, FastFabricSharp's improvement over FastFabric is up to 66%.
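
The reordering idea can be illustrated with a small block scheduler: a transaction that reads key k should be ordered before any transaction in the same block that writes k, and transactions caught in a dependency cycle cannot all be kept, so some are aborted. The sketch below uses Kahn's topological sort with a naive cycle-breaking heuristic; it is not the paper's algorithm, and the (txn_id, read_set, write_set) shape is an assumption.

    from collections import defaultdict, deque

    def reorder_block(txns):
        """txns: list of (txn_id, read_set, write_set).
        Returns (ordered_txn_ids, aborted_txn_ids)."""
        ids = [t[0] for t in txns]
        reads = {t[0]: set(t[1]) for t in txns}
        writes = {t[0]: set(t[2]) for t in txns}
        alive = set(ids)
        while True:
            # Edge r -> w: r reads a key that w writes, so r must be scheduled first.
            succ = defaultdict(set)
            indeg = {t: 0 for t in alive}
            for r in alive:
                for w in alive:
                    if r != w and reads[r] & writes[w] and w not in succ[r]:
                        succ[r].add(w)
                        indeg[w] += 1
            order = []
            queue = deque(t for t in alive if indeg[t] == 0)
            while queue:
                t = queue.popleft()
                order.append(t)
                for v in succ[t]:
                    indeg[v] -= 1
                    if indeg[v] == 0:
                        queue.append(v)
            if len(order) == len(alive):                # acyclic: every survivor is schedulable
                return order, [t for t in ids if t not in alive]
            # A cycle remains: abort the unscheduled transaction with the most conflicts.
            stuck = [t for t in alive if t not in order]
            alive.remove(max(stuck, key=lambda t: len(succ[t])))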

【Keywords】: Information systems; Data management systems; Database management system engines; Database transaction processing; Security and privacy; Systems security; Distributed systems security

38. Aggify: Lifting the Curse of Cursor Loops using Custom Aggregates.

Paper Link】 【Pages】:559-573

【Authors】: Surabhi Gupta ; Sanket Purandare ; Karthik Ramachandra

【Abstract】: Loops that iterate over SQL query results are quite common, both in application programs that run outside the DBMS, as well as User Defined Functions (UDFs) and stored procedures that run within the DBMS. It can be argued that set-oriented operations are more efficient and should be preferred over iteration; but from real world use cases, it is clear that loops over query results are inevitable in many situations, and are preferred by many users. Such loops, known as cursor loops, come with huge trade-offs and overheads w.r.t. performance, resource consumption and concurrency. We present Aggify, a technique for optimizing loops over query results that overcomes these overheads. It achieves this by automatically generating custom aggregates that are equivalent in semantics to the loop. Thereby, Aggify completely eliminates the loop by rewriting the query to use this generated aggregate. This technique has several advantages such as: (i) pipelining of entire cursor loop operations instead of materialization, (ii) pushing down loop computation from the application layer into the DBMS, closer to the data, (iii) leveraging existing work on optimization of aggregate functions, resulting in efficient query plans. We describe the technique underlying Aggify, and present our experimental evaluation over benchmarks as well as real workloads that demonstrate the significant benefits of this technique.
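
The rewrite can be pictured with the init/accumulate/terminate contract that most DBMSs expose for user-defined aggregates: the loop-carried state becomes the aggregate state, and each loop iteration becomes one accumulate call. The example below is an illustrative Python rendering of that contract for a hypothetical loop that totals discounted order prices; the names and the query are assumptions, not code from the paper.

    # Cursor-style loop a developer might originally write (runs row by row):
    #   total = 0
    #   for (price, qty, discount) in cursor("SELECT price, qty, discount FROM orders"):
    #       if qty > 10:
    #           total += price * qty * (1 - discount)

    class DiscountedTotal:
        """Custom aggregate equivalent to the loop above; the engine can evaluate it
        inside the query pipeline instead of materializing a cursor."""
        def __init__(self):                           # init: the loop-carried state
            self.total = 0.0

        def accumulate(self, price, qty, discount):   # one loop iteration per row
            if qty > 10:
                self.total += price * qty * (1 - discount)

        def terminate(self):                          # value of the loop variable at exit
            return self.total

    # Conceptually, the rewritten query becomes:
    #   SELECT DiscountedTotal(price, qty, discount) FROM orders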

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Query languages; Relational database query languages; Structured Query Language; Theory of computation; Theory and algorithms for application domains; Database theory; Database query processing and optimization (theory)

Research 7: Security, Privacy, and Blockchain 5

39. Querying Shared Data with Security Heterogeneity.

Paper Link】 【Pages】:575-585

【Authors】: Yang Cao ; Wenfei Fan ; Yanghao Wang ; Ke Yi

【Abstract】: There has been an increasing need for secure data sharing. In practice, a group of data owners often adopts a heterogeneous security scheme under which each pair of parties decides its own protocol to share data with diverse levels of trust. The scheme also keeps track of how the data is used. This paper studies distributed SQL query answering in the heterogeneous security setting. We define query plans by incorporating toll functions determined by data sharing agreements and reflected in the use of various security facilities. We formalize query answering as a bi-criteria optimization problem, to minimize both data sharing toll and parallel query evaluation cost. We show that this problem is PSPACE-hard for SQL and Σ_3^p-hard for SPC, and that it is in NEXPTIME. Despite the hardness, we develop a set of approximate algorithms to generate distributed query plans that minimize data sharing toll and reduce parallel evaluation cost. Using real-life and synthetic data, we empirically verify the effectiveness, scalability and efficiency of our algorithms.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query planning; Parallel and distributed DBMSs; Relational parallel and distributed DBMSs; Security and privacy; Database and storage security; Management and querying of encrypted data

40. SAGMA: Secure Aggregation Grouped by Multiple Attributes.

Paper Link】 【Pages】:587-601

【Authors】: Timon Hackenjos ; Florian Hahn ; Florian Kerschbaum

【Abstract】: Encryption can protect data in outsourced databases -- in the cloud -- while still enabling query processing over the encrypted data. However, the processing leaks information about the data specific to the type of query, e.g., aggregation queries. Aggregation over user-defined groups using SQL's GROUP BY clause is extensively used in data analytics, e.g., to calculate the total number of visitors each month or the average salary in each department. The information leaked, e.g., the access pattern to a group, may reveal the group's frequency enabling simple, yet detrimental leakage-abuse attacks. In this work we present SAGMA -- an encryption scheme for performing secure aggregation grouped by multiple attributes. The querier can choose any combination of one or multiple attributes in the GROUP BY clause among the set of all grouping attributes. The encryption scheme only stores semantically secure ciphertexts at the cloud and query processing hides the access pattern, i.e., the frequency of each group. We implemented our scheme and our evaluation results underpin its practical feasibility.

【Keywords】: Security and privacy; Database and storage security; Management and querying of encrypted data; Security services; Privacy-preserving protocols

41. Cryptε: Crypto-Assisted Differential Privacy on Untrusted Servers.

Paper Link】 【Pages】:603-619

【Authors】: Amrita Roy Chowdhury ; Chenghong Wang ; Xi He ; Ashwin Machanavajjhala ; Somesh Jha

【Abstract】: Differential privacy (DP) is currently the de-facto standard for achieving privacy in data analysis, which is typically implemented either in the "central" or "local" model. The local model has been more popular for commercial deployments as it does not require a trusted data collector. This increased privacy, however, comes at the cost of utility and algorithmic expressibility as compared to the central model. In this work, we propose Cryptε, a system and programming framework that (1) achieves the accuracy guarantees and algorithmic expressibility of the central model and (2), like the local model, requires no trusted data collector. Cryptε achieves the "best of both worlds" by employing two non-colluding untrusted servers that run DP programs on encrypted data from the data owners. In theory, straightforward implementations of DP programs using off-the-shelf secure multi-party computation tools can achieve the above goal. However, in practice, they are beset with many challenges like poor performance and tricky security proofs. To this end, Cryptε allows data analysts to author logical DP programs that are automatically translated to secure protocols that work on encrypted data. These protocols ensure that the untrusted servers learn nothing more than the noisy outputs, thereby guaranteeing DP (for computationally bounded adversaries) for all Cryptε programs. Cryptε supports a rich class of DP programs that can be expressed via a small set of transformation and measurement operators followed by arbitrary post-processing. Further, we propose performance optimizations leveraging the fact that the output is noisy. We demonstrate Cryptε's practical feasibility with extensive empirical evaluations on real world datasets.

【Keywords】: Security and privacy; Cryptography; Public key (asymmetric) techniques; Public key encryption; Database and storage security; Management and querying of encrypted data; Security services; Privacy-preserving protocols

42. Estimating Numerical Distributions under Local Differential Privacy.

Paper Link】 【Pages】:621-635

【Authors】: Zitao Li ; Tianhao Wang ; Milan Lopuhaä-Zwakenberg ; Ninghui Li ; Boris Skoric

【Abstract】: When collecting information, local differential privacy (LDP) relieves users' concern about privacy leakage, as each user's private information is randomized before being sent to the aggregator. We study the problem of recovering the distribution over a numerical domain while satisfying LDP. While one can discretize a numerical domain and then apply the protocols developed for categorical domains, we show that taking advantage of the numerical nature of the domain results in a better trade-off of privacy and utility. We introduce a new reporting mechanism, called the square wave (SW) mechanism, which exploits the numerical nature in reporting. We also develop an Expectation Maximization with Smoothing (EMS) algorithm, which is applied to aggregated histograms from the SW mechanism to estimate the original distributions. Extensive experiments demonstrate that our proposed approach, SW with EMS, consistently outperforms other methods in a variety of utility metrics.
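
The EMS step can be sketched independently of the square wave mechanism's exact perturbation. Assume any LDP mechanism with known conditional reporting probabilities M[j, i] = Pr(report bucket j | true bucket i) and an observed histogram of reports; an EM loop then re-estimates the true distribution, with a smoothing pass after each update. This is a generic sketch under those assumptions, not the authors' implementation.

    import numpy as np

    def ems(report_counts, M, iters=200, smooth=(0.25, 0.5, 0.25)):
        """report_counts: length-J array of observed report counts.
        M: J x I matrix, M[j, i] = Pr(report j | true bucket i).
        Returns an estimated length-I probability vector over true buckets."""
        report_counts = np.asarray(report_counts, dtype=float)
        J, I = M.shape
        p = np.full(I, 1.0 / I)                        # uniform initialization
        for _ in range(iters):
            # E-step: posterior over true buckets for each observed report bucket.
            joint = M * p                              # shape (J, I)
            post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
            # M-step: expected counts per true bucket, renormalized.
            p = (report_counts[:, None] * post).sum(axis=0)
            p /= p.sum()
            # Smoothing: small moving average over adjacent numerical buckets.
            p = np.convolve(p, smooth, mode="same")
            p /= p.sum()
        return p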

【Keywords】: Security and privacy; Security services; Privacy-preserving protocols

43. FalconDB: Blockchain-based Collaborative Database.

Paper Link】 【Pages】:637-652

【Authors】: Yanqing Peng ; Min Du ; Feifei Li ; Raymond Cheng ; Dawn Song

【Abstract】: Nowadays, an emerging class of applications is based on collaboration over a shared database among different entities. However, existing solutions for shared databases may require trust in other parties, have hardware demands that are unaffordable for individual users, or have relatively low performance. In other words, there is a trilemma among security, compatibility and efficiency. In this paper, we present FalconDB, which enables different parties with limited hardware resources to efficiently and securely collaborate on a database. FalconDB adopts database servers with verification interfaces accessible to clients and stores the digests for query/update authentications on a blockchain. Using blockchain as a consensus platform and a distributed ledger, FalconDB is able to work without requiring the parties to trust each other. Meanwhile, FalconDB requires only minimal storage cost on each client, and provides anywhere-available, real-time and concurrent access to the database. As a result, FalconDB overcomes the disadvantages of previous solutions, and enables individual users to participate in the collaboration with high efficiency, low storage cost and blockchain-level security guarantees.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Client-server architectures; Information systems; Data management systems; Database design and models; Database management system engines; Integrity checking

Research 8: Graph Query Processing 5

44. Exact Single-Source SimRank Computation on Large Graphs.

Paper Link】 【Pages】:653-663

【Authors】: Hanzhi Wang ; Zhewei Wei ; Ye Yuan ; Xiaoyong Du ; Ji-Rong Wen

【Abstract】: SimRank is a popular measure for evaluating node-to-node similarities based on the graph topology. In recent years, single-source and top-k SimRank queries have received increasing attention due to their applications in web mining, social network analysis, and spam detection. However, a fundamental obstacle in studying SimRank has been the lack of ground truths. The only exact algorithm, the Power Method, is computationally infeasible on graphs with more than 10^6 nodes. Consequently, no existing work has evaluated the actual trade-offs between query time and accuracy on large real-world graphs. In this paper, we present ExactSim, the first algorithm that computes the exact single-source and top-k SimRank results on large graphs. With high probability, this algorithm produces ground truths with a rigorous theoretical guarantee. We conduct extensive experiments on real-world datasets to demonstrate the efficiency of ExactSim. The results show that ExactSim provides the ground truth for any single-source SimRank query with a precision up to 7 decimal places within a reasonable query time.
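
For orientation, what a single-source SimRank query computes can be illustrated with the textbook Monte Carlo estimator: s(u, v) is the expected value of c^t, where t is the first step at which two backward random walks started from u and v meet. The sketch below is that estimator with decay c and a walk-length cap; it is an approximate baseline, not ExactSim, and the adjacency representation is an assumption.

    import random

    def mc_single_source_simrank(in_neighbors, u, c=0.6, samples=10000, max_len=20):
        """in_neighbors: dict node -> list of in-neighbors.
        Returns dict v -> Monte Carlo estimate of SimRank s(u, v)."""
        est = {v: 0.0 for v in in_neighbors}
        for v in in_neighbors:
            if v == u:
                est[v] = 1.0
                continue
            hits = 0.0
            for _ in range(samples):
                a, b = u, v
                for step in range(1, max_len + 1):
                    if not in_neighbors[a] or not in_neighbors[b]:
                        break                          # a walk stops at a node with no in-edges
                    a = random.choice(in_neighbors[a])
                    b = random.choice(in_neighbors[b])
                    if a == b:                         # first meeting at step t contributes c**t
                        hits += c ** step
                        break
            est[v] = hits / samples
        return est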

【Keywords】: Information systems; Information systems applications; Data mining; Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms

45. Distributed Processing of k Shortest Path Queries over Dynamic Road Networks.

Paper Link】 【Pages】:665-679

【Authors】: Ziqiang Yu ; Xiaohui Yu ; Nick Koudas ; Yang Liu ; Yifan Li ; Yueting Chen ; Dingyu Yang

【Abstract】: The problem of identifying the k-shortest paths (KSPs for short) in a dynamic road network is essential to many location-based services. Road networks are dynamic in the sense that the weights of the edges in the corresponding graph constantly change over time, representing evolving traffic conditions. Very often such services have to process numerous KSP queries over large road networks at the same time, thus there is a pressing need to identify distributed solutions for this problem. However, most existing approaches are designed to identify KSPs on a static graph in a sequential manner (i.e., the (i+1)-th shortest path is generated based on the i-th shortest path), restricting their scalability and applicability in a distributed setting. We therefore propose KSP-DG, a distributed algorithm for identifying k-shortest paths in a dynamic graph. It is based on partitioning the entire graph into smaller subgraphs, and reduces the problem of determining KSPs into the computation of partial KSPs in relevant subgraphs, which can execute in parallel on a cluster of servers. A distributed two-level index called DTLP is developed to facilitate the efficient identification of relevant subgraphs. A salient feature of DTLP is that it indexes a set of virtual paths that are insensitive to varying traffic conditions, leading to very low maintenance cost in dynamic road networks. This is the first treatment of the problem of processing KSP queries over dynamic road networks. Extensive experiments conducted on real road networks confirm the superiority of our proposal over baseline methods.
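
As a point of reference for what a KSP query returns, here is a compact centralized baseline that enumerates up to k shortest (possibly non-simple) paths over non-negative edge weights using best-first search over partial paths. It does not attempt to reproduce the paper's distributed, index-assisted method for dynamic graphs; the adjacency-list shape is an assumption.

    import heapq

    def k_shortest_paths(adj, src, dst, k):
        """adj: dict node -> list of (neighbor, weight), weights >= 0.
        Returns up to k (cost, path) pairs in non-decreasing cost order.
        Paths may revisit nodes; a per-node pop cap keeps the search finite."""
        results, pops = [], {}
        heap = [(0.0, src, [src])]
        while heap and len(results) < k:
            cost, node, path = heapq.heappop(heap)
            pops[node] = pops.get(node, 0) + 1
            if node == dst:
                results.append((cost, path))
                continue
            if pops[node] > k:                  # a node never lies on more than k answers
                continue
            for nxt, w in adj.get(node, []):
                heapq.heappush(heap, (cost + w, nxt, path + [nxt]))
        return results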

【Keywords】: Theory of computation; Design and analysis of algorithms; Algorithm design techniques; Branch-and-bound; Divide and conquer

46. On the Optimization of Recursive Relational Queries: Application to Graph Queries.

Paper Link】 【Pages】:681-697

【Authors】: Louis Jachiet ; Pierre Genevès ; Nils Gesbert ; Nabil Layaïda

【Abstract】: Graph databases have received a lot of attention as they are particularly useful in many applications such as social networks, life sciences and the semantic web. Various languages have emerged to query graph databases, many of which embed forms of recursion that prove essential for navigating graphs. The relational model has benefited from a huge body of research in the last half century, and that is why many graph databases rely on techniques of relational query engines. Since its introduction, the relational model has seen various attempts to extend it with recursion, and it is now possible to use recursion in several SQL or Datalog based database systems. The optimization of recursive queries remains, however, a challenge. We propose mu-RA, a variation of the Relational Algebra equipped with a fixpoint operator for expressing recursive relational queries. mu-RA can notably express unions of conjunctive regular path queries. Leveraging the fact that this fixpoint operator makes recursive terms more amenable to algebraic transformations, we propose new rewrite rules. These rules make it possible to generate new query execution plans that cannot be obtained with previous approaches. We present the syntax and semantics of mu-RA, and the rewriting rules that we specifically devised to tackle the optimization of recursive queries. We report on practical experiments showing that the newly generated plans can provide significant performance improvements for evaluating recursive queries over graphs.
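
To ground the fixpoint operator, the classic recursive relational query, transitive closure of an edge relation, can be evaluated by semi-naive iteration: repeatedly join only the newly derived tuples with the base relation until nothing new appears. The sketch below shows that evaluation strategy in plain Python; it illustrates fixpoint evaluation in general, not mu-RA's algebra or its rewrite rules.

    def transitive_closure(edges):
        """edges: set of (src, dst) pairs. Returns the reachability relation,
        computed as the least fixpoint of  TC = E  UNION  (TC JOIN E)."""
        tc = set(edges)
        delta = set(edges)                      # tuples derived in the last round
        while delta:
            # Semi-naive step: only the new tuples are joined against the base relation.
            new = {(a, c) for (a, b) in delta for (b2, c) in edges if b == b2}
            delta = new - tc
            tc |= delta
        return tc

    # Example: reachability in a small chain graph.
    print(sorted(transitive_closure({(1, 2), (2, 3), (3, 4)})))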

【Keywords】: Theory of computation; Theory and algorithms for application domains; Database theory; Database query languages (principles); Database query processing and optimization (theory)

47. Pensieve: Skewness-Aware Version Switching for Efficient Graph Processing.

Paper Link】 【Pages】:699-713

【Authors】: Tangwei Ying ; Hanhua Chen ; Hai Jin

【Abstract】: Multi-version graph processing has recently attracted much research effort. Existing multi-version graph storage designs use either copy-based schemes or delta-based schemes. A copy-based scheme stores every version separately and may lead to expensive space cost due to high storage redundancy. In contrast, a delta-based scheme only stores incremental deltas between different versions and relies on delta computation for version switching. In this work, we observe: 1) high-degree vertices incur much more significant storage overheads as graph versions evolve compared to low-degree vertices; 2) the skewed access frequency among graph versions greatly influences system performance when reproducing versions. Based on these observations, we propose Pensieve, a skewness-aware multi-version graph processing system. Two factors contribute to the efficiency of Pensieve. First, Pensieve leverages a differentiated graph storage strategy that stores low-degree vertices using a copy-based scheme while storing high-degree ones using a delta-based scheme. Such a design achieves a good trade-off between storage cost and version switching time for multi-version graph processing. Second, the Pensieve graph storage exploits the temporal locality of graph version accesses and designs a novel last-root version switching scheme, which significantly improves the access efficiency for recent versions. We implement Pensieve on top of Ligra, and conduct comprehensive experiments to evaluate the performance of this design using large-scale datasets collected from real-world systems. The results show that Pensieve substantially outperforms state-of-the-art designs in terms of memory consumption and version switching time.

【Keywords】: Information systems; Information storage systems; Storage architectures; Distributed storage; Storage management; Version management

48. Extending Graph Patterns with Conditions.

Paper Link】 【Pages】:715-729

【Authors】: Grace Fan ; Wenfei Fan ; Yuanhao Li ; Ping Lu ; Chao Tian ; Jingren Zhou

【Abstract】: We propose an extension of graph patterns, referred to as conditional graph patterns and denoted as CGPs. In a CGP, one can specify a simple condition on each edge such that the edge exists if and only if the condition is satisfied. We show that CGPs allow us to catch missing links, increase the expressivity of graph functional dependencies, and provide a succinct representation of graph patterns. We settle the complexity of their consistency, matching, incremental matching and containment problems, which are in linear time, NP-complete, NP-complete and Π_2^p-complete, respectively. These results tell us that despite the increased expressive power of CGPs, the matching and incremental matching problems for CGPs are no harder than their counterparts for conventional patterns. We develop algorithms for matching and incremental matching of CGPs, and for (incremental) multi-CGP matching and optimization. Using real-life and synthetic graphs, we empirically verify the efficiency and effectiveness of our algorithms.

【Keywords】: Theory of computation; Design and analysis of algorithms; Data structures design and analysis; Pattern matching

Industry 2: Machine Learning and Analytics 5

49. Elastic Machine Learning Algorithms in Amazon SageMaker.

Paper Link】 【Pages】:731-737

【Authors】: Edo Liberty ; Zohar S. Karnin ; Bing Xiang ; Laurence Rouesnel ; Baris Coskun ; Ramesh Nallapati ; Julio Delgado ; Amir Sadoughi ; Yury Astashonok ; Piali Das ; Can Balioglu ; Saswata Chakravarty ; Madhav Jha ; Philip Gautier ; David Arpin ; Tim Januschowski ; Valentin Flunkert ; Yuyang Wang ; Jan Gasthaus ; Lorenzo Stella ; Syama Sundar Rangapuram ; David Salinas ; Sebastian Schelter ; Alex Smola

【Abstract】: There is a large body of research on scalable machine learning (ML). Nevertheless, training ML models on large, continuously evolving datasets is still a difficult and costly undertaking for many companies and institutions. We discuss such challenges and derive requirements for an industrial-scale ML platform. Next, we describe the computational model behind Amazon SageMaker, which is designed to meet such challenges. SageMaker is an ML platform provided as part of Amazon Web Services (AWS), and supports incremental training, resumable and elastic learning as well as automatic hyperparameter optimization. We detail how to adapt several popular ML algorithms to its computational model. Finally, we present an experimental evaluation on large datasets, comparing SageMaker to several scalable, JVM-based implementations of ML algorithms, which we significantly outperform with regard to computation time and cost.

【Keywords】: Information systems; Information systems applications; Data mining; Data stream mining

50. Timon: A Timestamped Event Database for Efficient Telemetry Data Processing and Analytics.

Paper Link】 【Pages】:739-753

【Authors】: Wei Cao ; Yusong Gao ; Feifei Li ; Sheng Wang ; Bingchen Lin ; Ke Xu ; Xiaojie Feng ; Yucong Wang ; Zhenjun Liu ; Gejin Zhang

【Abstract】: With the increasing demand for real-time system monitoring and tracking in various contexts, the amount of time-stamped event data grows at an astonishing rate. Analytics on time-stamped events must be real-time, and the aggregated results need to be accurate even when data arrives out of order. Unfortunately, frequent occurrences of out-of-order data will significantly slow down the processing and cause a large delay in the query response. Timon is a timestamped event database that aims to support aggregations and handle late arrivals both correctly (i.e., upholding the exactly-once semantics) and efficiently. Our insight is that a broad range of applications can be implemented with data structures and corresponding operators that satisfy associative and commutative properties. Records arriving after the low watermark are appended to Timon directly, allowing aggregations to be performed lazily. To improve query efficiency, Timon maintains a TS-LSM-Tree, which keeps the most recent data in memory and contains a time-partitioning tree on disk for high-volume data accumulated over a long time span. In addition, Timon supports materialized aggregation views and correlation analysis across multiple streams. Timon has been successfully deployed at Alibaba Cloud and is a critical building block for Alibaba Cloud's continuous monitoring and anomaly analysis infrastructure.
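
The requirement that operators be associative and commutative is what makes lazy handling of late arrivals safe: each time window keeps a partial state that can be merged in any order without changing the final answer. Below is a minimal sketch of such a mergeable per-window aggregate (count, sum, min, max); it is illustrative only and not Timon's data structure.

    import math
    from dataclasses import dataclass

    @dataclass
    class WindowAgg:
        count: int = 0
        total: float = 0.0
        lo: float = math.inf
        hi: float = -math.inf

        def add(self, value: float) -> None:          # fold in one event (possibly late)
            self.count += 1
            self.total += value
            self.lo = min(self.lo, value)
            self.hi = max(self.hi, value)

        def merge(self, other: "WindowAgg") -> "WindowAgg":
            # Associative and commutative: merging partials in any order gives the
            # same result, which is what makes lazy handling of late data safe.
            return WindowAgg(self.count + other.count, self.total + other.total,
                             min(self.lo, other.lo), max(self.hi, other.hi))

    windows = {}                                       # window_start_ts -> WindowAgg
    def ingest(event_ts: int, value: float, width: int = 60) -> None:
        w = (event_ts // width) * width                # assign by event time, not arrival time
        windows.setdefault(w, WindowAgg()).add(value)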

【Keywords】: Information systems; Data management systems; Networks; Network services; Cloud computing

51. Vertica-ML: Distributed Machine Learning in Vertica Database.

Paper Link】 【Pages】:755-768

【Authors】: Arash Fard ; Anh Le ; George Larionov ; Waqas Dhillon ; Chuck Bear

【Abstract】: A growing number of companies rely on machine learning as a key element for gaining a competitive edge from their collected Big Data. An in-database machine learning system can provide many advantages in this scenario, e.g., eliminating the overhead of data transfer, avoiding the maintenance costs of a separate analytical system, and addressing data security and provenance concerns. In this paper, we present our distributed machine learning subsystem within the Vertica database. This subsystem, Vertica-ML, includes machine learning functionalities with SQL API which cover a complete data science workflow as well as model management. We treat machine learning models in Vertica as first-class database objects like tables and views; therefore, they enjoy a similar mechanism for archiving and managing. We explain the architecture of the subsystem, and present a set of experiments to evaluate the performance of the machine learning algorithms implemented on top of it.

【Keywords】: Computing methodologies; Distributed computing methodologies; Distributed algorithms; Machine learning; Information systems; Information systems applications; Data mining; Clustering; Decision support systems; Data analytics; Data warehouses

52. Database Workload Capacity Planning using Time Series Analysis and Machine Learning.

Paper Link】 【Pages】:769-783

【Authors】: Antony S. Higginson ; Mihaela Dediu ; Octavian Arsene ; Norman W. Paton ; Suzanne M. Embury

【Abstract】: When procuring or administering any I.T. system or a component of an I.T. system, it is crucial to understand the computational resources required to run the critical business functions that are governed by any Service Level Agreements. Predicting the resources needed for future consumption is like looking into the proverbial crystal ball. In this paper we look at the forecasting techniques in use today and evaluate whether those techniques are applicable to the deeper layers of the technological stack, such as clustered database instances, applications and groups of transactions that make up the database workload. The approach uses supervised machine learning to identify traits such as recurring patterns, shocks and trends that the workloads exhibit, and accounts for those traits in the forecast. An experimental evaluation shows that the approach we propose reduces the complexity of performing a forecast, and accurate predictions have been produced for complex workloads.

【Keywords】: Applied computing; Operations research; Forecasting; Computer systems organization; Architectures; Distributed architectures; Cloud computing; n-tier architectures; Computing methodologies; Machine learning; Learning paradigms; Supervised learning; Supervised learning by regression; Information systems; Data management systems; Database administration; Database performance evaluation; Mathematics of computing; Probability and statistics; Statistical paradigms; Time series analysis

53. The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development.

Paper Link】 【Pages】:785-800

【Authors】: Micah J. Smith ; Carles Sala ; James Max Kanter ; Kalyan Veeramachaneni

【Abstract】: As machine learning is applied more widely, data scientists often struggle to find or create end-to-end machine learning systems for specific tasks. The proliferation of libraries and frameworks and the complexity of the tasks have led to the emergence of "pipeline jungles" - brittle, ad hoc ML systems. To address these problems, we introduce the Machine Learning Bazaar, a new framework for developing machine learning and automated machine learning software systems. First, we introduce ML primitives, a unified API and specification for data processing and ML components from different software libraries. Next, we compose primitives into usable ML pipelines, abstracting away glue code, data flow, and data storage. We further pair these pipelines with a hierarchy of AutoML strategies - Bayesian optimization and bandit learning. We use these components to create a general-purpose, multi-task, end-to-end AutoML system that provides solutions to a variety of data modalities (image, text, graph, tabular, relational, etc.) and problem types (classification, regression, anomaly detection, graph matching, etc.). We demonstrate 5 real-world use cases and 2 case studies of our approach. Finally, we present an evaluation suite of 456 real-world ML tasks and describe the characteristics of 2.5 million pipelines searched over this task suite.

【Keywords】: Computing methodologies; Machine learning; Software and its engineering; Software creation and management; Software development techniques; Software organization and properties; Software system structures; Abstraction, modeling and modularity

SIGMOD Keynote 2 2

Paper Link】 【Pages】:801

【Authors】: Natasha F. Noy

【Abstract】: There are thousands of data repositories on the Web, providing access to millions of datasets. National and regional governments, scientific publishers and consortia, commercial data providers, and others publish data for fields ranging from social science to life science to high-energy physics to climate science and more. Access to this data is critical to facilitating reproducibility of research results, enabling scientists to build on others' work, and providing data journalists easier access to information and its provenance. In this talk, I will discuss our work on Dataset Search, which provides search capabilities over potentially all dataset repositories on the Web. I will talk about the open ecosystem for describing and citing datasets that we hope to encourage and the technical details on how we went about building Dataset Search. Finally, I will highlight research challenges in building a vibrant, heterogeneous, and open ecosystem where data becomes a first-class citizen.

【Keywords】: General and reference; Document types; General literature

55. The Challenge of Building Effective, Enterprise-scale Data Lakes.

Paper Link】 【Pages】:803

【Authors】: Awez Syed

【Abstract】: There has been a rapid rise in the popularity of data lakes as the data infrastructure for modern analytics and data science. The combination of cloud storage and fast, elastic processing provides an inexpensive and scalable solution for building analytical applications. While data lakes make it easy to ingest and store vast amounts of data, the ability to effectively make use of that data is still limited. This data often lacks context, doesn't meet the quality required for applications, and is not easily understandable or discoverable by users. Problems of data consistency and accuracy make it hard to derive value from data lakes and to trust the analytics based on this data. The traditional methods of manually documenting, classifying and assessing the data don't scale to the volume of cloud-based data lakes. New automated, learning-based approaches are required to discover, curate and make the data usable for a wide variety of users. In this talk, we describe the real-world implementation patterns of data lakes and give an overview of the many open challenges in deploying successful, enterprise-scale data lakes.

【Keywords】: Information systems; Data management systems; Database administration; Data dictionaries; Database design and models; Entity relationship models; Information integration; Data cleaning; Deduplication; Extraction, transformation and loading; Theory of computation; Theory and algorithms for application domains; Database theory; Data provenance

Research 9: Data Cleaning 5

56. Cleaning Denial Constraint Violations through Relaxation.

Paper Link】 【Pages】:805-815

【Authors】: Stella Giannakopoulou ; Manos Karpathiotakis ; Anastasia Ailamaki

【Abstract】: Data cleaning is a time-consuming process that depends on the data analysis that users perform. Existing solutions treat data cleaning as a separate offline process that takes place before analysis begins. Applying data cleaning before analysis assumes a priori knowledge of the inconsistencies and the query workload, thereby requiring effort on understanding and cleaning the data that is unnecessary for the analysis. We propose an approach that performs probabilistic repair of denial constraint violations on-demand, driven by the exploratory analysis that users perform. We introduce Daisy, a system that seamlessly integrates data cleaning into the analysis by relaxing query results. Daisy executes analytical query-workloads over dirty data by weaving cleaning operators into the query plan. Our evaluation shows that Daisy adapts to the workload and outperforms traditional offline cleaning on both synthetic and real-world workloads.

【Keywords】: Information systems; Data management systems; Database design and models; Data model extensions; Data provenance; Incomplete data; Inconsistent data; Uncertainty; Information integration; Data cleaning

57. On Multiple Semantics for Declarative Database Repairs.

Paper Link】 【Pages】:817-831

【Authors】: Amir Gilad ; Daniel Deutch ; Sudeepa Roy

【Abstract】: We study the problem of database repairs through a rule-based framework that we refer to as Delta Rules. Delta rules are highly expressive, allowing the specification of complex, cross-relation repair logic associated with Denial Constraints and Causal Rules, and capturing Database Triggers of interest. We show that there are no one-size-fits-all semantics for repairs in this inclusive setting, and we consequently introduce multiple alternative semantics, presenting the case for using each of them. We then study the relationships between the semantics in terms of their output and the complexity of computation. Our results formally establish the tradeoff between the permissiveness of the semantics and its computational complexity. We demonstrate the usefulness of the framework in capturing multiple data repair scenarios for an academic search database and the TPC-H databases, showing how using different semantics affects the repair in terms of size and runtime, and examining the relationships between the repairs. We also compare our approach with SQL triggers and a state-of-the-art data repair system.

【Keywords】: Theory of computation; Theory and algorithms for application domains; Database theory; Data provenance; Database constraints theory

58. Discovery Algorithms for Embedded Functional Dependencies.

Paper Link】 【Pages】:833-843

【Authors】: Ziheng Wei ; Sven Hartmann ; Sebastian Link

【Abstract】: Embedded functional dependencies (eFDs) advance data management applications by expressing both data completeness and data integrity requirements. We show that the discovery problem of eFDs is NP-complete, W[2]-complete in the output, and has a minimum solution space that is larger than the maximum solution space for functional dependencies. Nevertheless, we use novel data structures and search strategies to develop row-efficient, column-efficient, and hybrid algorithms for eFD discovery. Our experiments demonstrate that the algorithms scale well in terms of their design targets, and that ranking the eFDs by the number of redundant data values they cause can provide useful guidance in identifying meaningful eFDs for applications. Finally, we demonstrate the benefits of introducing completeness requirements and ranking by the number of redundant data values for approximate and genuine functional dependencies.

【Keywords】: Information systems; Data management systems; Database design and models; Information systems applications; Data mining

59. SCODED: Statistical Constraint Oriented Data Error Detection.

Paper Link】 【Pages】:845-860

【Authors】: Jing Nathan Yan ; Oliver Schulte ; Mohan Zhang ; Jiannan Wang ; Reynold Cheng

【Abstract】: Statistical Constraints (SCs) play an important role in statistical modeling and analysis. This paper brings the concept to data cleaning and studies how to leverage SCs for error detection. SCs provide a novel approach that has various application scenarios and works harmoniously with downstream statistical modeling. Entailment relationships between SCs and integrity constraints provide analytical insight into SCs. We develop SCODED, an SC-Oriented Data Error Detection system, comprising two key components: (1) SC Violation Detection : checks whether an SC is violated on a given dataset, and (2) Error Drill Down : identifies the top-k records that contribute most to the violation of an SC. Experiments on synthetic and real-world data show that SCs are effective in detecting data errors that violate them, compared to state-of-the-art approaches.

【Keywords】: Information systems; Data management systems; Information integration; Data cleaning

60. A Statistical Perspective on Discovering Functional Dependencies in Noisy Data.

Paper Link】 【Pages】:861-876

【Authors】: Yunjia Zhang ; Zhihan Guo ; Theodoros Rekatsinas

【Abstract】: We study the problem of discovering functional dependencies (FDs) from a noisy data set. We adopt a statistical perspective and draw connections between FD discovery and structure learning in probabilistic graphical models. We show that discovering FDs from a noisy data set is equivalent to learning the structure of a model over binary random variables, where each random variable corresponds to a functional of the data set attributes. We build upon this observation to introduce FDX, a conceptually simple framework in which learning functional dependencies corresponds to solving a sparse regression problem. We show that FDX can recover true functional dependencies across a diverse array of real-world and synthetic data sets, even in the presence of noisy or missing data. We find that FDX scales to large data instances with millions of tuples and hundreds of attributes while it yields an average F1 improvement of 2x against state-of-the-art FD discovery methods.

【Keywords】: Information systems; Data management systems; Database design and models; Database management system engines; Integrity checking; Information integration; Data cleaning

Research 10: Storage and Indexing 5

61. Rethinking Logging, Checkpoints, and Recovery for High-Performance Storage Engines.

Paper Link】 【Pages】:877-892

【Authors】: Michael Haubenschild ; Caetano Sauer ; Thomas Neumann ; Viktor Leis

【Abstract】: For decades, ARIES has been the standard for logging and recovery in database systems. ARIES offers important features like support for arbitrary workloads, fuzzy checkpoints, and transparent index recovery. Nevertheless, many modern in-memory database systems use more lightweight approaches that have less overhead and better multi-core scalability but only work well for the in-memory setting. Recently, a new class of high-performance storage engines has emerged, which exploit fast SSDs to achieve performance close to pure in-memory systems but also allow out-of-memory workloads. For these systems, ARIES is too slow whereas in-memory logging proposals are not applicable. In this work, we propose a new logging and recovery design that supports incremental and fuzzy checkpointing, index recovery, out-of-memory workloads, and low-latency transaction commits. Our continuous checkpointing algorithm guarantees bounded recovery time. Using per-thread logging with minimal synchronization, our implementation achieves near-linear scalability on multi-core CPUs. We implemented and evaluated these techniques in our LeanStore storage engine. For working sets that fit in main memory, we achieve performance close to that of an in-memory approach, even with logging, checkpointing, and dirty page writing enabled. For the out-of-memory scenario, we outperform a state-of-the-art ARIES implementation by a factor of two.

【Keywords】: Information systems; Data management systems; Database management system engines; Database transaction processing; Database recovery; DBMS engine architectures; Record and buffer management; Information storage systems; Information storage technologies; Storage class memory; Phase change memory; Software and its engineering; Software creation and management; Software development techniques; Error handling and recovery; Software organization and properties; Contextual software domains; Operating systems; Memory management; Secondary storage; Extra-functional properties; Software fault tolerance; Checkpoint / restart

62. Lethe: A Tunable Delete-Aware LSM Engine.

Paper Link】 【Pages】:893-908

【Authors】: Subhadeep Sarkar ; Tarikul Islam Papon ; Dimitris Staratzis ; Manos Athanassoulis

【Abstract】: Data-intensive applications fueled the evolution of log structured merge (LSM) based key-value engines that employ the out-of-place paradigm to support high ingestion rates with low read/write interference. These benefits, however, come at the cost of treating deletes as a second-class citizen. A delete inserts a tombstone that invalidates older instances of the deleted key. State-of-the-art LSM engines do not provide guarantees as to how fast a tombstone will propagate to persist the deletion. Further, LSM engines only support deletion on the sort key. To delete on another attribute (e.g., timestamp), the entire tree is read and re-written. We highlight that fast persistent deletion without affecting read performance is key to support: (i) streaming systems operating on a window of data, (ii) privacy with latency guarantees on the right-to-be-forgotten, and (iii) en masse cloud deployment of data systems that makes storage a precious resource. To address these challenges, in this paper, we build a new key-value storage engine, Lethe, that uses a very small amount of additional metadata, a set of new delete-aware compaction policies, and a new physical data layout that weaves the sort and the delete key order. We show that Lethe supports any user-defined threshold for the delete persistence latency offering higher read throughput (1.17-1.4x) and lower space amplification (2.1-9.8x), with a modest increase in write amplification (between 4% and 25%). In addition, Lethe supports efficient range deletes on a secondary delete key by dropping entire data pages without sacrificing read performance nor employing a costly full tree merge.
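
The delete-persistence guarantee can be pictured as a compaction trigger: each file remembers the age of its oldest tombstone, and any file whose oldest tombstone is older than the user-supplied threshold is forced into the next compaction so the delete reaches the last level in time. The sketch below shows only that trigger logic under assumed structures; it does not reproduce Lethe's compaction policies or its weaved sort/delete-key layout.

    import time
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class SSTable:
        name: str
        oldest_tombstone_ts: Optional[float] = None   # None if the file holds no tombstones

    @dataclass
    class DeleteAwarePolicy:
        persistence_threshold_s: float                # user-defined delete persistence latency
        files: List[SSTable] = field(default_factory=list)

        def files_due_for_compaction(self, now: Optional[float] = None) -> List[SSTable]:
            """Files whose oldest tombstone would otherwise miss the persistence deadline."""
            now = time.time() if now is None else now
            return [f for f in self.files
                    if f.oldest_tombstone_ts is not None
                    and now - f.oldest_tombstone_ts >= self.persistence_threshold_s]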

【Keywords】: Information systems; Data management systems; Data structures; Data access methods; Point lookups; Unidimensional range search; Data layout; Record and block layout; Database management system engines; Parallel and distributed DBMSs; Key-value stores; Security and privacy; Human and societal aspects of security and privacy; Privacy protections; Social aspects of security and privacy

63. BinDex: A Two-Layered Index for Fast and Robust Scans.

Paper Link】 【Pages】:909-923

【Authors】: Linwei Li ; Kai Zhang ; Jiading Guo ; Wen He ; Zhenying He ; Yinan Jing ; Weili Han ; X. Sean Wang

【Abstract】: In modern analytical database systems, the performance of the data scan operation is of key importance to the performance of query execution. Existing approaches may be categorized into index scan and sequential scan. However, both approaches have inherent inefficiencies. Indeed, sequential scan may need to access a large amount of unneeded data, especially for queries with low selectivity. Instead, index scan may involve a large number of expensive random memory accesses when the query selectivity is high. Moreover, with the growing complexities in database query workloads, it has become hard to predict which approach is better for a particular query. In order to obtain fast and robust scans under all selectivities, this paper proposes BinDex, a two-layered index structure based on binned bitmaps that can be used to significantly accelerate the scan operations for in-memory column stores. The first layer of BinDex consists of a set of binned bitmaps which filter out most unneeded values in a column. The second layer provides some auxiliary information to correct the bits that have incorrect values. By varying the number of bit vectors in the first layer, BinDex can make a tradeoff between memory space and performance. Experimental results show that BinDex outperforms the state-of-the-art approaches with less memory than a B+-tree would use. And by enlarging the memory space, BinDex can achieve up to 2.9 times higher performance, eliminating the need for making a choice between sequential or index scans.
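
The two-layer idea can be sketched as follows: the first layer keeps, for a handful of split values, a bit vector marking rows whose value falls below that split; the second layer keeps, per bin, the rows sorted by value so the nearest bit vector can be corrected to the exact predicate boundary. The class below is a simplified illustration under those assumptions (Python lists instead of packed bitmaps, no tuning of layer sizes), not BinDex itself.

    import bisect
    import math

    class BinDexLite:
        """Simplified two-layer structure over a numeric column (illustrative only)."""
        def __init__(self, column, boundaries):
            # Append +inf so every value falls into some bin and any 'value < x'
            # predicate has a covering bit vector.
            self.boundaries = sorted(boundaries) + [math.inf]
            # Layer 1: for each boundary b, a bit vector marking rows with value < b.
            self.bitmaps = [[v < b for v in column] for b in self.boundaries]
            # Layer 2: per bin, (value, row) pairs sorted by value, used to correct
            # the nearest bit vector down to the exact predicate boundary.
            self.bins = [[] for _ in self.boundaries]
            for row, v in enumerate(column):
                self.bins[bisect.bisect_right(self.boundaries, v)].append((v, row))
            for b in self.bins:
                b.sort()

        def less_than(self, x):
            """Bit vector for the predicate 'value < x'."""
            i = bisect.bisect_left(self.boundaries, x)   # nearest boundary >= x
            result = list(self.bitmaps[i])               # over-approximation: value < b_i
            cut = bisect.bisect_left(self.bins[i], (x, -1))
            for _, row in self.bins[i][cut:]:            # clear rows with value in [x, b_i)
                result[row] = False
            return result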

【Keywords】: Information systems; Data management systems; Data structures; Data access methods; Data scans

64. Analysis of Indexing Structures for Immutable Data.

Paper Link】 【Pages】:925-935

【Authors】: Cong Yue ; Zhongle Xie ; Meihui Zhang ; Gang Chen ; Beng Chin Ooi ; Sheng Wang ; Xiaokui Xiao

【Abstract】: In emerging applications such as blockchains and collaborative data analytics, there are strong demands for data immutability, multi-version accesses, and tamper-evident controls. To provide efficient support for lookup and merge operations, three new index structures for immutable data, namely Merkle Patricia Trie (MPT), Merkle Bucket Tree (MBT), and Pattern-Oriented-Split Tree (POS-Tree), have been proposed. Although these structures have been adopted in real applications, there is no systematic evaluation of their pros and cons in the literature, making it difficult for practitioners to choose the right index structure for their applications. To alleviate the above problem, we present a comprehensive analysis of the existing index structures for immutable data, and evaluate both their asymptotic and empirical performance. Specifically, we show that MPT, MBT, and POS-Tree are all instances of a recently proposed framework, dubbed Structurally Invariant and Reusable Indexes (SIRI). We propose to evaluate the SIRI instances on their index performance and deduplication capability. We establish the worst-case guarantees of each index, and experimentally evaluate all indexes in a wide variety of settings. Based on our theoretical and empirical analysis, we conclude that POS-Tree is a favorable choice for indexing immutable data.

【Keywords】: Information systems; Data management systems; Database administration; Database performance evaluation; Information integration; Deduplication; Information storage systems; Record storage systems; Directory structures; B-trees; Storage management; Version management

65. Tree-Encoded Bitmaps.

Paper Link】 【Pages】:937-967

【Authors】: Harald Lang ; Alexander Beischl ; Viktor Leis ; Peter A. Boncz ; Thomas Neumann ; Alfons Kemper

【Abstract】: We propose a novel method to represent compressed bitmaps. Similarly to existing bitmap compression schemes, we exploit the compression potential of bitmaps populated with consecutive identical bits, i.e., 0-runs and 1-runs. But in contrast to prior work, our approach employs a binary tree structure to represent runs of various lengths. Leaf nodes in the upper tree levels thereby represent longer runs, and vice versa. The tree-based representation results in high compression ratios and enables efficient random access, which in turn allows for the fast intersection of bitmaps. Our experimental analysis with randomly generated bitmaps shows that our approach significantly improves over state-of-the-art compression techniques when bitmaps are dense and/or only barely clustered. Further, we evaluate our approach with real-world data sets, showing that our tree-encoded bitmaps can save up to one third of the space over existing techniques.

【Keywords】: Information systems; Data management systems; Data structures; Data layout; Data compression; Database management system engines; Database query processing

Research 11: Machine Learning for Databases II 5

66. ALEX: An Updatable Adaptive Learned Index.

Paper Link】 【Pages】:969-984

【Authors】: Jialin Ding ; Umar Farooq Minhas ; Jia Yu ; Chi Wang ; Jaeyoung Do ; Yinan Li ; Hantian Zhang ; Badrish Chandramouli ; Johannes Gehrke ; Donald Kossmann ; David B. Lomet ; Tim Kraska

【Abstract】: Recent work on "learned indexes" has changed the way we look at the decades-old field of DBMS indexing. The key idea is that indexes can be thought of as "models" that predict the position of a key in a dataset. Indexes can, thus, be learned. The original work by Kraska et al. shows that a learned index beats a B+ tree by a factor of up to three in search time and by an order of magnitude in memory footprint. However, it is limited to static, read-only workloads. In this paper, we present a new learned index called ALEX which addresses practical issues that arise when implementing learned indexes for workloads that contain a mix of point lookups, short range queries, inserts, updates, and deletes. ALEX effectively combines the core insights from learned indexes with proven storage and indexing techniques to achieve high performance and low memory footprint. On read-only workloads, ALEX beats the learned index from Kraska et al. by up to 2.2X on performance with up to 15X smaller index size. Across the spectrum of read-write workloads, ALEX beats B+ trees by up to 4.1X while never performing worse, with up to 2000X smaller index size. We believe ALEX presents a key step towards making learned indexes practical for a broader class of database workloads with dynamic updates.
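
The "index as a model" view can be made concrete with a tiny read-only example: fit a linear model from key to position over a sorted array, record the model's worst error during a build pass, and answer lookups by searching only within that error window around the prediction. This is a plain sketch of the learned-index idea, not ALEX's adaptive, updatable structure; it assumes a non-empty sorted key list.

    import bisect

    class LinearLearnedIndex:
        """Read-only learned index over a sorted list of numeric keys."""
        def __init__(self, keys):
            self.keys = keys
            n = len(keys)
            # Least-squares fit of position ~ slope * key + intercept.
            mean_k = sum(keys) / n
            mean_p = (n - 1) / 2
            cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
            var = sum((k - mean_k) ** 2 for k in keys) or 1.0
            self.slope = cov / var
            self.intercept = mean_p - self.slope * mean_k
            # Build pass: remember the worst prediction error to bound the search window.
            self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(keys))

        def _predict(self, key):
            return int(round(self.slope * key + self.intercept))

        def lookup(self, key):
            """Return the position of key, or None if absent."""
            guess = self._predict(key)
            lo = max(0, guess - self.max_err)
            hi = min(len(self.keys), guess + self.max_err + 1)
            pos = bisect.bisect_left(self.keys, key, lo, hi)
            return pos if pos < len(self.keys) and self.keys[pos] == key else None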

【Keywords】: Information systems; Data management systems; Data structures; Data access methods; Point lookups; Unidimensional range search; Theory of computation; Theory and algorithms for application domains; Database theory; Data structures and algorithms for data management

67. Learning Multi-Dimensional Indexes.

Paper Link】 【Pages】:985-1000

【Authors】: Vikram Nathan ; Jialin Ding ; Mohammad Alizadeh ; Tim Kraska

【Abstract】: Scanning and filtering over multi-dimensional tables are key operations in modern analytical database engines. To optimize the performance of these operations, databases often create clustered indexes over a single dimension or multi-dimensional indexes such as R-Trees, or use complex sort orders (e.g., Z-ordering). However, these schemes are often hard to tune and their performance is inconsistent across different datasets and queries. In this paper, we introduce Flood, a multi-dimensional in-memory read-optimized index that automatically adapts itself to a particular dataset and workload by jointly optimizing the index structure and data storage layout. Flood achieves up to three orders of magnitude faster performance for range scans with predicates than state-of-the-art multi-dimensional indexes or sort orders on real-world datasets and workloads. Our work serves as a building block towards an end-to-end learned database system.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Information systems applications; Decision support systems; Data analytics; Theory of computation; Theory and algorithms for application domains; Database theory; Data structures and algorithms for data management

68. The Case for a Learned Sorting Algorithm.

Paper Link】 【Pages】:1001-1016

【Authors】: Ani Kristo ; Kapil Vaidya ; Ugur Çetintemel ; Sanchit Misra ; Tim Kraska

【Abstract】: Sorting is one of the most fundamental algorithms in Computer Science and a common operation in databases not just for sorting query results but also as part of joins (i.e., sort-merge-join) or indexing. In this work, we introduce a new type of distribution sort that leverages a learned model of the empirical CDF of the data. Our algorithm uses a model to efficiently get an approximation of the scaled empirical CDF for each record key and map it to the corresponding position in the output array. We then apply a deterministic sorting algorithm that works well on nearly-sorted arrays (e.g., Insertion Sort) to establish a totally sorted order. We compared this algorithm against common sorting approaches and measured its performance for up to 1 billion normally-distributed double-precision keys. The results show that our approach yields an average 3.38x performance improvement over C++ STL sort, which is an optimized Quicksort hybrid, 1.49x improvement over sequential Radix Sort, and 5.54x improvement over a C++ implementation of Timsort, which is the default sorting function for Java and Python.
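
The recipe in the abstract can be sketched in its simplest form: estimate the empirical CDF from a sample with an equi-width histogram, use it to scatter keys near their predicted output positions, then run one insertion-sort pass over the nearly sorted result (the final pass guarantees correctness regardless of model quality). It is a toy illustration, not the paper's tuned algorithm.

    import random

    def learned_sort(keys, sample_size=1000, bins=256):
        if len(keys) < 2:
            return list(keys)
        # 1. Model the empirical CDF with an equi-width histogram over a sample.
        sample = random.sample(keys, min(sample_size, len(keys)))
        lo, hi = min(sample), max(sample)
        width = (hi - lo) / bins or 1.0
        counts = [0] * bins
        for k in sample:
            counts[min(bins - 1, int((k - lo) / width))] += 1
        cum, cdf = 0, []
        for c in counts:
            cum += c
            cdf.append(cum / len(sample))

        def predicted_pos(k):
            b = min(bins - 1, max(0, int((k - lo) / width)))
            return int(cdf[b] * (len(keys) - 1))

        # 2. Scatter keys into buckets keyed by their predicted output position.
        buckets = [[] for _ in range(len(keys))]
        for k in keys:
            buckets[predicted_pos(k)].append(k)
        out = [k for b in buckets for k in b]

        # 3. Insertion sort fixes residual disorder; fast because input is nearly sorted.
        for i in range(1, len(out)):
            k, j = out[i], i - 1
            while j >= 0 and out[j] > k:
                out[j + 1] = out[j]
                j -= 1
            out[j + 1] = k
        return out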

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Learning linear models; Information systems; Data management systems; Database management system engines; Database query processing; Query operators; Theory of computation; Design and analysis of algorithms; Data structures design and analysis; Sorting and searching

69. QuickSel: Quick Selectivity Learning with Mixture Models.

Paper Link】 【Pages】:1017-1033

【Authors】: Yongjoo Park ; Shucheng Zhong ; Barzan Mozafari

【Abstract】: Estimating the selectivity of a query is a key step in almost any cost-based query optimizer. Most of today's databases rely on histograms or samples that are periodically refreshed by re-scanning the data as the underlying data changes. Since frequent scans are costly, these statistics are often stale and lead to poor selectivity estimates. As an alternative to scans, query-driven histograms have been proposed, which refine the histograms based on the actual selectivities of the observed queries. Unfortunately, these approaches are either too costly to use in practice---i.e., require an exponential number of buckets---or quickly lose their advantage as they observe more queries. In this paper, we propose a selectivity learning framework, called QuickSel, which falls into the query-driven paradigm but does not use histograms. Instead, it builds an internal model of the underlying data, which can be refined significantly faster (e.g., only 1.9 milliseconds for 300 queries). This fast refinement allows QuickSel to continuously learn from each query and yield increasingly more accurate selectivity estimates over time. Unlike query-driven histograms, QuickSel relies on a mixture model and a new optimization algorithm for training its model. Our extensive experiments on two real-world datasets confirm that, given the same target accuracy, QuickSel is 34.0x--179.4x faster than state-of-the-art query-driven histograms, including ISOMER and STHoles. Further, given the same space budget, QuickSel is 26.8%--91.8% more accurate than periodically-updated histograms and samples, respectively.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization

70. Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries.

Paper Link】 【Pages】:1035-1050

【Authors】: Shohedul Hasan ; Saravanan Thirumuruganathan ; Jees Augustine ; Nick Koudas ; Gautam Das

【Abstract】: Selectivity estimation - the problem of estimating the result size of queries - is a fundamental problem in databases. Accurate estimation of query selectivity involving multiple correlated attributes is especially challenging. Poor cardinality estimates could result in the selection of bad plans by the query optimizer. Recently, deep learning has been applied to this problem with promising results. However, many of the proposed approaches often struggle to provide accurate results for multi-attribute queries involving a large number of predicates and with low selectivity. In this paper, we propose two complementary approaches that are effective for this scenario. Our first approach models selectivity estimation as a density estimation problem where one seeks to estimate the joint probability distribution from a finite number of samples. We leverage techniques from neural density estimation to build an accurate selectivity estimator. The key idea is to decompose the joint distribution into a set of tractable conditional probability distributions such that they satisfy the autoregressive property. Our second approach formulates selectivity estimation as a supervised deep learning problem that predicts the selectivity of a given query. We describe how to extend our algorithms for range queries. We also introduce and address a number of practical challenges arising when adapting deep learning for relational data. These include query/data featurization, incorporating query workload information in a deep learning framework, and the dynamic scenario where both data and workload queries could be updated. Our extensive experiments, with a special emphasis on queries with a large number of predicates and/or small result sizes, demonstrate that our proposed techniques provide fast and accurate selectivity estimates with minimal space overhead.

【Keywords】: Computer systems organization; Architectures; Other architectures; Neural networks; Computing methodologies; Machine learning; Learning paradigms; Supervised learning; Unsupervised learning; Information systems; Data management systems; Database management system engines; Database query processing
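
The neural density estimation route described above rests on the chain-rule factorization of the joint attribute distribution (stated here as background; the notation is ours, not the paper's):

    p(x_1, …, x_n) = Π_{i=1..n} p(x_i | x_1, …, x_{i-1}),

where each conditional is modeled by a tractable estimator satisfying the autoregressive property; the selectivity of a conjunctive predicate q is then the probability mass of the tuples satisfying it, sel(q) = N · P(x ∈ R_q) for a relation with N tuples.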

Research 12: Graph Matching and Discovery 5

71. Efficient Algorithms for Densest Subgraph Discovery on Large Directed Graphs.

Paper Link】 【Pages】:1051-1066

【Authors】: Chenhao Ma ; Yixiang Fang ; Reynold Cheng ; Laks V. S. Lakshmanan ; Wenjie Zhang ; Xuemin Lin

【Abstract】: Given a directed graph G, the directed densest subgraph (DDS) problem is to find a subgraph of G whose density is the highest among all subgraphs of G. The DDS problem is fundamental to a wide range of applications, such as fraud detection, community mining, and graph compression. However, existing DDS solutions suffer from efficiency and scalability problems: on a three-thousand-edge graph, it takes three days for one of the best exact algorithms to complete. In this paper, we develop an efficient and scalable DDS solution. We introduce the notion of [x, y]-core, which is a dense subgraph of G, and show that the densest subgraph can be accurately located through the [x, y]-core with theoretical guarantees. Based on the [x, y]-core, we develop exact and approximation algorithms. We have performed an extensive evaluation of our approaches on eight real large datasets. The results show that our proposed solutions are up to six orders of magnitude faster than the state-of-the-art.

【Keywords】: Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms; Network flows; Theory of computation; Design and analysis of algorithms; Graph algorithms analysis
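
For reference, the density notion standard in the directed densest subgraph literature (background knowledge, not quoted from the abstract): for vertex sets S and T of G, the density of the subgraph they induce is

    ρ(S, T) = |E(S, T)| / √(|S| · |T|),

where E(S, T) is the set of edges from S to T, and the DDS problem asks for the pair (S, T) that maximizes ρ.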

72. GPU-Accelerated Subgraph Enumeration on Partitioned Graphs.

Paper Link】 【Pages】:1067-1082

【Authors】: Wentian Guo ; Yuchen Li ; Mo Sha ; Bingsheng He ; Xiaokui Xiao ; Kian-Lee Tan

【Abstract】: Subgraph enumeration is important for many applications such as network motif discovery and community detection. Recent works utilize graphics processing units (GPUs) to parallelize subgraph enumeration, but they can only handle graphs that fit into the GPU memory. In this paper, we propose a new approach for GPU-accelerated subgraph enumeration that can efficiently scale to large graphs beyond the GPU memory. Our approach divides the graph into partitions, each of which fits into the GPU memory. The GPU processes one partition at a time and searches for the matched subgraphs of a given pattern (i.e., instances) within the partition, as it would in a small graph. The key challenge is enumerating the instances across different partitions, because this search would enumerate a considerable number of redundant subgraphs and incur expensive data transfer costs over the PCI-e bus. Therefore, we propose a novel shared execution approach to eliminate the redundant subgraph searches and correctly generate all the instances across different partitions. The experimental evaluation shows that our approach can scale to large graphs and achieve significantly better performance than the existing single-machine solutions.

【Keywords】: Computing methodologies; Computer graphics; Graphics systems and interfaces; Graphics processors; Parallel computing methodologies; Parallel algorithms; Massively parallel algorithms; Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms; Graph enumeration

73. In-Memory Subgraph Matching: An In-depth Study.

Paper Link】 【Pages】:1083-1098

【Authors】: Shixuan Sun ; Qiong Luo

【Abstract】: We study the performance of eight representative in-memory subgraph matching algorithms. Specifically, we put QuickSI, GraphQL, CFL, CECI, DP-iso, RI and VF2++ in a common framework to compare them on the following four aspects: (1) method of filtering candidate vertices in the data graph; (2) method of ordering query vertices; (3) method of enumerating partial results; and (4) other optimization techniques. Then, we compare the overall performance of these algorithms with Glasgow, an algorithm based on constraint programming. Through experiments, we find that (1) the filtering method of GraphQL is competitive with that of the latest algorithms CFL, CECI and DP-iso in terms of pruning power; (2) the ordering methods in GraphQL and RI are usually the most effective; (3) the set intersection based local candidate computation in CECI and DP-iso performs the best in the enumeration; and (4) the failing sets pruning in DP-iso can significantly improve the performance when queries become large. Our source code is publicly available at https://github.com/RapidsAtHKUST/SubgraphMatching.

【Keywords】: Information systems; Information retrieval; Information retrieval query processing
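
The four aspects compared in the study (candidate filtering, query-vertex ordering, enumeration of partial results, and additional optimizations) plug into a common backtracking skeleton. The sketch below is a deliberately simplified illustration of that skeleton in Python; the names and interfaces are hypothetical and it corresponds to none of the surveyed algorithms in particular.

    from typing import Dict, List, Set

    def subgraph_match(query_adj: Dict[int, Set[int]],
                       data_adj: Dict[int, Set[int]],
                       candidates: Dict[int, Set[int]],
                       order: List[int]) -> List[Dict[int, int]]:
        """Enumerate embeddings of a query graph in a data graph.

        query_adj / data_adj : adjacency sets of the query and data graphs
        candidates           : filtered candidate data vertices per query vertex
        order                : matching order over the query vertices
        """
        results: List[Dict[int, int]] = []
        embedding: Dict[int, int] = {}           # query vertex -> data vertex

        def extend(depth: int) -> None:
            if depth == len(order):
                results.append(dict(embedding))  # complete embedding found
                return
            u = order[depth]
            used = set(embedding.values())
            for v in candidates[u]:
                if v in used:
                    continue                     # enforce injectivity
                # every already-matched neighbor of u must map to a neighbor of v
                if all(embedding[w] in data_adj[v]
                       for w in query_adj[u] if w in embedding):
                    embedding[u] = v
                    extend(depth + 1)
                    del embedding[u]

        extend(0)
        return results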

74. G-CARE: A Framework for Performance Benchmarking of Cardinality Estimation Techniques for Subgraph Matching.

Paper Link】 【Pages】:1099-1114

【Authors】: Yeonsu Park ; Seongyun Ko ; Sourav S. Bhowmick ; Kyoungmin Kim ; Kijae Hong ; Wook-Shin Han

【Abstract】: Despite the crucial role of cardinality estimation in query optimization, there has been no systematic and in-depth study of the existing cardinality estimation techniques for subgraph matching queries. In this paper, for the first time, we present a comprehensive study of the existing cardinality estimation techniques for subgraph matching queries, scaling far beyond the original experiments. We first introduce a novel framework called G-CARE that enables us to realize all existing techniques on top of it and that provides insights on their performance. By using G-CARE, we then reimplement representative cardinality estimation techniques for graph databases as well as relational databases. We next evaluate these techniques w.r.t. accuracy on RDF and non-RDF graphs from different domains with subgraph matching queries of the various topologies considered so far. Surprisingly, our results reveal that all existing techniques have serious problems in accuracy for various scenarios and datasets. Intriguingly, a simple sampling method based on an online aggregation technique designed for relational data consistently outperforms all existing techniques.

【Keywords】: Information systems; Data management systems; Database management system engines; Information systems applications

75. Approximate Pattern Matching in Massive Graphs with Precision and Recall Guarantees.

Paper Link】 【Pages】:1115-1131

【Authors】: Tahsin Reza ; Matei Ripeanu ; Geoffrey Sanders ; Roger Pearce

【Abstract】: There are multiple situations where supporting approximation in graph pattern matching tasks is highly desirable: (i) the data acquisition process can be noisy; (ii) a user may only have an imprecise idea of the search query; and (iii) approximation can be used for high volume vertex labeling when extracting machine learning features from graph data. We present a new algorithmic pipeline for approximate matching that combines edit-distance based matching with systematic graph pruning. We formalize the problem as identifying all exact matches for up to k edit-distance subgraphs of a user-supplied template. We design a solution which exploits unique optimization opportunities within the design space, not explored previously. Our solution is (i) highly scalable, (ii) supports arbitrary patterns and edit-distance, (iii) offers 100% precision and 100% recall guarantees, and (iv) supports a set of popular data analysis scenarios. We demonstrate its advantages through an implementation that offers good strong and weak scaling on massive real-world (257 billion edges) and synthetic (1.1 trillion edges) labeled graphs, respectively, and when operating on a massive cluster (256 nodes/9,216 cores), orders of magnitude larger than previously used for similar problems. Empirical comparison with the state-of-the-art highlights the advantages of our solution when handling massive graphs and complex patterns.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Information systems; Information systems applications; Data mining; Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms

Research 13: Data Matching 5

76. A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching.

Paper Link】 【Pages】:1133-1147

【Authors】: Venkata Vamsikrishna Meduri ; Lucian Popa ; Prithviraj Sen ; Mohamed Sarwat

【Abstract】: Entity Matching (EM) is a core data cleaning task, aiming to identify different mentions of the same real-world entity. Active learning is one way to address the challenge of scarce labeled data in practice, by dynamically collecting the necessary examples to be labeled by an Oracle and refining the learned model (classifier) on them. In this paper, we build a unified active learning benchmark framework for EM that allows users to easily combine different learning algorithms with applicable example selection algorithms. The goal of the framework is to enable concrete guidelines for practitioners as to what active learning combinations will work well for EM. Towards this, we perform comprehensive experiments on publicly available EM datasets from product and publication domains to evaluate active learning methods, using a variety of metrics including EM quality, #labels and example selection latencies. Our most surprising result is that active learning with fewer labels can learn a classifier of quality comparable to supervised learning. In fact, for several of the datasets, we show that there is an active learning combination that beats the state-of-the-art supervised learning result. Our framework also includes novel optimizations that improve the quality of the learned model by roughly 9% in terms of F1-score and reduce example selection latencies by up to 10× without affecting the quality of the model.

【Keywords】: Computing methodologies; Machine learning; Learning paradigms; Supervised learning; Supervised learning by classification; Machine learning algorithms; Ensemble methods; Machine learning approaches; Classification and regression trees; Kernel methods; Support vector machines; Neural networks; Rule learning; Information systems; Data management systems; Information integration; Data cleaning; Deduplication; Entity resolution; Theory of computation; Theory and algorithms for application domains; Machine learning theory; Active learning
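
For readers unfamiliar with the paradigm being benchmarked, active learning for EM follows a select-label-retrain loop. The sketch below shows one such loop with uncertainty-based example selection; the function names, the learner, and the selection rule are illustrative assumptions, not the benchmark framework's API.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def active_learning_em(features, oracle_labels, seed_idx, budget, batch=10):
        """Toy active-learning loop for classifying candidate record pairs.

        features      : numpy array of shape (n_pairs, n_features) with similarity vectors
        oracle_labels : numpy array of ground-truth labels, read lazily to simulate an Oracle
        seed_idx      : indices of the initially labeled pairs
        budget        : total number of labels the Oracle may provide
        """
        labeled = list(seed_idx)
        unlabeled = [i for i in range(len(features)) if i not in set(labeled)]
        model = RandomForestClassifier(n_estimators=100, random_state=0)

        while len(labeled) < budget and unlabeled:
            model.fit(features[labeled], oracle_labels[labeled])
            # uncertainty sampling: pick pairs whose match probability is closest to 0.5
            proba = model.predict_proba(features[unlabeled])[:, 1]
            picked = np.argsort(np.abs(proba - 0.5))[:batch]
            newly = [unlabeled[i] for i in picked]
            labeled.extend(newly)                # the Oracle labels these pairs
            unlabeled = [i for i in unlabeled if i not in set(newly)]

        model.fit(features[labeled], oracle_labels[labeled])
        return model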

77. ZeroER: Entity Resolution using Zero Labeled Examples.

Paper Link】 【Pages】:1149-1164

【Authors】: Renzhi Wu ; Sanya Chaba ; Saurabh Sawlani ; Xu Chu ; Saravanan Thirumuruganathan

【Abstract】: Entity resolution (ER) refers to the problem of matching records in one or more relations that refer to the same real-world entity. While supervised machine learning (ML) approaches achieve the state-of-the-art results, they require a large number of labeled examples that are expensive and oftentimes infeasible to obtain. We investigate an important problem that vexes practitioners: is it possible to design an effective algorithm for ER that requires Zero labeled examples, yet can achieve performance comparable to supervised approaches? In this paper, we answer in the affirmative through our proposed approach dubbed ZeroER. Our approach is based on a simple observation --- the similarity vectors for matches should look different from those of unmatches. Operationalizing this insight requires a number of technical innovations. First, we propose a simple yet powerful generative model based on Gaussian Mixture Models for learning the match and unmatch distributions. Second, we propose an adaptive regularization technique customized for ER that ameliorates the issue of feature overfitting. Finally, we incorporate the transitivity property into the generative model in a novel way resulting in improved accuracy. On five benchmark ER datasets, we show that ZeroER greatly outperforms existing unsupervised approaches and achieves comparable performance to supervised approaches.

【Keywords】: Computing methodologies; Machine learning; Learning paradigms; Unsupervised learning; Information systems; Data management systems; Information integration; Entity resolution
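
A minimal, hypothetical illustration of the core observation above: fit a two-component Gaussian mixture over the pairwise similarity vectors and treat the component with the higher mean similarity as the match distribution. This omits ZeroER's adaptive regularization and its transitivity-aware extension.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def zero_label_matches(sim_vectors: np.ndarray, threshold: float = 0.5) -> np.ndarray:
        """sim_vectors: (n_pairs, n_features) similarity vectors of candidate pairs."""
        gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
        gmm.fit(sim_vectors)
        # the component with the larger mean similarity is taken to be the match class
        match_comp = int(np.argmax(gmm.means_.mean(axis=1)))
        posterior = gmm.predict_proba(sim_vectors)[:, match_comp]
        return posterior >= threshold            # boolean mask of predicted matches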

78. Towards Interpretable and Learnable Risk Analysis for Entity Resolution.

Paper Link】 【Pages】:1165-1180

【Authors】: Zhaoqiang Chen ; Qun Chen ; Boyi Hou ; Zhanhuai Li ; Guoliang Li

【Abstract】: Machine-learning-based entity resolution has been widely studied. However, some entity pairs may be mislabeled by machine learning models, and existing studies have not addressed the risk analysis problem -- predicting and interpreting which entity pairs are mislabeled. In this paper, we propose an interpretable and learnable framework for risk analysis, which aims to rank the labeled pairs based on their risks of being mislabeled. We first describe how to automatically generate interpretable risk features, and then present a learnable risk model and its training technique. Finally, we empirically evaluate the performance of the proposed approach on real data. Our extensive experiments have shown that the learned risk model can identify the mislabeled pairs with considerably higher accuracy than the existing alternatives.

【Keywords】: Information systems; Data management systems; Information integration; Entity resolution

79. SLIM: Scalable Linkage of Mobility Data.

Paper Link】 【Pages】:1181-1196

【Authors】: Fuat Basik ; Hakan Ferhatosmanoglu ; Bugra Gedik

【Abstract】: We present a scalable solution to link entities across mobility datasets using their spatio-temporal information. This is a fundamental problem in many applications such as linking user identities for security, understanding privacy limitations of location based services, or producing a unified dataset from multiple sources for urban planning. Such integrated datasets are also essential for service providers to optimise their services and improve business intelligence. In this paper, we first propose a mobility based representation and similarity computation for entities. An efficient matching process is then developed to identify the final linked pairs, with an automated mechanism to decide when to stop the linkage. We scale the process with a locality-sensitive hashing (LSH) based approach that significantly reduces candidate pairs for matching. To realize the effectiveness and efficiency of our techniques in practice, we introduce an algorithm called SLIM. In the experimental evaluation, SLIM outperforms the two existing state-of-the-art approaches in terms of precision and recall. Moreover, the LSH-based approach brings two to four orders of magnitude speedup.

【Keywords】: Information systems; Data management systems; Information integration; Entity resolution; Information systems applications; Spatial-temporal systems; Theory of computation; Theory and algorithms for application domains; Database theory; Data integration

80. Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning Approach.

Paper Link】 【Pages】:1197-1212

【Authors】: Yaoshu Wang ; Chuan Xiao ; Jianbin Qin ; Xin Cao ; Yifang Sun ; Wei Wang ; Makoto Onizuka

【Abstract】: In this paper, we investigate the possibilities of utilizing deep learning for cardinality estimation of similarity selection. Answering this problem accurately and efficiently is essential to many data management applications, especially for query optimization. Moreover, in some applications the estimated cardinality is supposed to be consistent and interpretable. Hence a monotonic estimation w.r.t. the query threshold is preferred. We propose a novel and generic method that can be applied to any data type and distance function. Our method consists of a feature extraction model and a regression model. The feature extraction model transforms original data and threshold to a Hamming space, in which a deep learning-based regression model is utilized to exploit the incremental property of cardinality w.r.t. the threshold for both accuracy and monotonicity. We develop a training strategy tailored to our model as well as techniques for fast estimation. We also discuss how to handle updates. We demonstrate the accuracy and the efficiency of our method through experiments, and show how it improves the performance of a query optimizer.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks; Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Information integration; Entity resolution

Research 14: Query Optimization and Execution 5

81. Fast Join Project Query Evaluation using Matrix Multiplication.

Paper Link】 【Pages】:1213-1223

【Authors】: Shaleen Deep ; Xiao Hu ; Paraschos Koutris

【Abstract】: In the last few years, much effort has been devoted to developing join algorithms to achieve worst-case optimality for join queries over relational databases. Towards this end, the database community has had considerable success in developing efficient algorithms that achieve worst-case optimal runtime for full join queries, i.e., joins without projections. However, not much is known about join evaluation with projections beyond some simple techniques of pushing down the projection operator in the query execution plan. Such queries have a large number of applications in entity matching, graph analytics and searching over compressed graphs. In this paper, we study how a class of join queries with projections can be evaluated faster using worst-case optimal algorithms together with matrix multiplication. Crucially, our algorithms are parameterized by the output size of the final result, allowing for choosing the best execution strategy. We implement our algorithms as a subroutine and compare the performance with state-of-the-art techniques to show they can be improved upon by as much as 50x. More importantly, our experiments indicate that matrix multiplication is a useful operation that can help speed up join processing owing to highly optimized open source libraries that are also highly parallelizable.

【Keywords】: Computing methodologies; Symbolic and algebraic manipulation; Symbolic and algebraic algorithms; Information systems; Data management systems; Database management system engines; Database query processing; Join algorithms; Query optimization; Theory of computation; Theory and algorithms for application domains; Database theory; Data structures and algorithms for data management
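
The matrix-multiplication idea above can be made concrete on the query π_{a,c}(R(a,b) ⋈ S(b,c)): encode R and S as Boolean matrices and multiply them; a nonzero entry (a,c) of the product witnesses at least one joining b. The sketch below is an illustrative toy, not the paper's algorithm, which additionally selects an execution strategy based on the output size.

    import numpy as np

    def join_project(R, S, n_a, n_b, n_c):
        """Compute pi_{a,c}( R(a,b) join S(b,c) ) via Boolean matrix multiplication.

        R : iterable of (a, b) pairs with 0 <= a < n_a and 0 <= b < n_b
        S : iterable of (b, c) pairs with 0 <= b < n_b and 0 <= c < n_c
        """
        A = np.zeros((n_a, n_b), dtype=np.int64)
        B = np.zeros((n_b, n_c), dtype=np.int64)
        for a, b in R:
            A[a, b] = 1
        for b, c in S:
            B[b, c] = 1
        C = A @ B                       # C[a, c] = number of join witnesses b
        out_a, out_c = np.nonzero(C)    # the projection keeps only distinct (a, c) pairs
        return list(zip(out_a.tolist(), out_c.tolist()))

    # Example: R = {(0, 1), (1, 1)}, S = {(1, 0)}  ->  [(0, 0), (1, 0)]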

82. Maintaining Acyclic Foreign-Key Joins under Updates.

Paper Link】 【Pages】:1225-1239

【Authors】: Qichen Wang ; Ke Yi

【Abstract】: A large number of analytical queries (e.g., all the 22 queries in the TPC-H benchmark) are based on acyclic foreign-key joins. In this paper, we study the problem of incrementally maintaining the query results of these joins under updates, i.e., insertion and deletion of tuples to any of the relations. Prior work has shown that this problem is inherently hard, requiring at least Ω(|db|^(1/2-ε)) time per update, where |db| is the size of the database, and ε > 0 can be any small constant. However, this negative result holds only on adversarially constructed update sequences; on the other hand, most real-world update sequences are "nice", nowhere near these worst-case scenarios. We introduce a measure λ, which we call the enclosureness of the update sequence, to more precisely characterize its intrinsic difficulty. We present an algorithm to maintain the query results of any acyclic foreign-key join in O(λ) time amortized, on any update sequence whose enclosureness is λ. This is complemented with a lower bound of Ω(λ^(1-ε)), showing that our algorithm is essentially optimal with respect to λ. Next, using this algorithm as the core component, we show how all the 22 queries in the TPC-H benchmark can be supported in Õ(λ) time. Finally, based on the algorithms developed, we built a continuous query processing system on top of Flink, and experimental results show that our system outperforms previous ones significantly.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Join algorithms; Database views; Stream management

83. Thrifty Query Execution via Incrementability.

Paper Link】 【Pages】:1241-1256

【Authors】: Dixin Tang ; Zechao Shang ; Aaron J. Elmore ; Sanjay Krishnan ; Michael J. Franklin

【Abstract】: Many applications schedule queries before all data is ready. To return fast query results, database systems can eagerly process existing data and incrementally incorporate new data into prior intermediate results, which often relies on incremental view maintenance (IVM) techniques. However, incrementally maintaining a query result can increase the total amount of work, mainly because some early work is not useful for computing the final query result. In this paper, we propose a new metric, incrementability, to quantify the cost-effectiveness of IVM and to decide how eagerly or lazily databases should incrementally execute a query. We further observe that different parts of a query have different levels of incrementability and the query execution should have a decomposed control flow based on the difference. Therefore, to address these needs, we propose a new query processing method, Incrementability-aware Query Processing (InQP). We build a prototype InQP system based on Spark and show that InQP significantly reduces resource consumption with a similar latency compared with incrementability-oblivious approaches.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Query planning

84. A Method for Optimizing Opaque Filter Queries.

Paper Link】 【Pages】:1257-1272

【Authors】: Wenjia He ; Michael R. Anderson ; Maxwell Strome ; Michael J. Cafarella

【Abstract】: An important class of database queries in machine learning and data science workloads is the opaque filter query: a query with a selection predicate that is implemented with a UDF, with semantics that are unknown to the query optimizer. Some typical examples would include a CNN-style trained image classifier, or a textual sentiment classifier. Because the optimizer does not know the predicate's semantics, it cannot employ standard optimizations, yielding long query times. We propose voodoo indexing, a two-phase method for optimizing opaque filter queries. Before any query arrives, the method builds a hierarchical "query-independent" index of the database contents, which groups together similar objects. At query-time, the method builds a map of how much each group satisfies the predicate, while also exploiting the map to accelerate execution. Unlike past methods, voodoo indexing does not require insight into predicate semantics, works on any data type, and does not require in-query model training. We describe both standalone and SparkSQL-specific implementations, plus experiments on both image and text data, on more than 100 distinct opaque predicates. We show voodoo indexing can yield up to an 88% improvement over standard scan behavior, and a 79% improvement over the previous best method adapted from research literature.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Information retrieval; Specialized information retrieval; Multimedia and multimodal retrieval

85. Functional-Style SQL UDFs With a Capital 'F'.

Paper Link】 【Pages】:1273-1287

【Authors】: Christian Duta ; Torsten Grust

【Abstract】: We advocate expressing complex in-database computation using a functional style in which SQL UDFs use plain self-invocation to recurse. The resulting UDFs are concise and readable, but their run-time performance on contemporary RDBMSs is sobering. This paper describes how to compile such functional-style UDFs into SQL:1999 recursive common table expressions. We rely on function call graphs to build the compiler's core and to realize a series of optimizations (reference counting, memoization, exploitation of linear and tail recursion). The compiled UDFs evaluate efficiently, challenging the performance of manually tweaked (but often convoluted) SQL code. SQL UDFs can indeed be functional and fast.

【Keywords】: Information systems; Data management systems; Query languages; Relational database query languages; Structured Query Language; Software and its engineering; Software notations and tools; Compilers; Source code generation; General programming languages; Language types; Functional languages

86. Learning to Validate the Predictions of Black Box Classifiers on Unseen Data.

Paper Link】 【Pages】:1289-1299

【Authors】: Sebastian Schelter ; Tammo Rukat ; Felix Bießmann

【Abstract】: Machine Learning (ML) models are difficult to maintain in production settings. In particular, deviations of the unseen serving data (for which we want to compute predictions) from the source data (on which the model was trained) pose a central challenge, especially when model training and prediction are outsourced via cloud services. Errors or shifts in the serving data can affect the predictive quality of a model, but are hard to detect for engineers operating ML deployments.

【Keywords】: Information systems; Data management systems; Database management system engines; Integrity checking

87. Learning Over Dirty Data Without Cleaning.

Paper Link】 【Pages】:1301-1316

【Authors】: Jose Picado ; John Davis ; Arash Termehchy ; Ga Young Lee

【Abstract】: Real-world datasets are dirty and contain many errors, such as violations of integrity constraints and entity duplicates. Learning over dirty databases may result in inaccurate models. Data scientists spend most of their time on preparing and repairing data errors to create clean databases for learning. Moreover, as the information required to repair these errors is not often available, there may be numerous possible clean versions for a dirty database. We propose Dirty Learn, DLearn, a novel learning system that learns directly over dirty databases effectively and efficiently without any preprocessing. DLearn leverages database constraints to learn accurate relational models over inconsistent and heterogeneous data. Its learned models represent patterns over all possible clean versions of the data in a usable form. Our empirical study indicates that DLearn learns accurate models over large real-world databases efficiently.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Logical and relational learning; Information systems; Data management systems; Information integration; Data cleaning; Theory of computation; Theory and algorithms for application domains; Database theory; Data integration

88. Complaint-driven Training Data Debugging for Query 2.0.

Paper Link】 【Pages】:1317-1334

【Authors】: Weiyuan Wu ; Lampros Flokas ; Eugene Wu ; Jiannan Wang

【Abstract】: As the need for machine learning (ML) increases rapidly across all industry sectors, there is a significant interest among commercial database providers to support "Query 2.0", which integrates model inference into SQL queries. Debugging Query 2.0 is very challenging since an unexpected query result may be caused by bugs in the training data (e.g., wrong labels, corrupted features). In response, we propose Rain, a complaint-driven training data debugging system. Rain allows users to specify complaints over the query's intermediate or final output, and aims to return a minimum set of training examples so that if they were removed, the complaints would be resolved. To the best of our knowledge, we are the first to study this problem. A naive solution requires retraining an exponential number of ML models. We propose two novel heuristic approaches based on influence functions which both require linear retraining steps. We provide an in-depth analytical and empirical analysis of the two approaches and conduct extensive experiments to evaluate their effectiveness using four real-world datasets. Results show that Rain achieves the highest recall@k among all the baselines while still returning results interactively.

【Keywords】: Computing methodologies; Machine learning; Information systems; Data management systems; Database design and models; Data model extensions; Data provenance; Information integration; Data cleaning

89. Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks.

Paper Link】 【Pages】:1335-1349

【Authors】: Riccardo Cappuzzo ; Paolo Papotti ; Saravanan Thirumuruganathan

【Abstract】: Deep learning based techniques have been recently used with promising results for data integration problems. Some methods directly use pre-trained embeddings that were trained on a large corpus such as Wikipedia. However, they may not always be an appropriate choice for enterprise datasets with custom vocabulary. Other methods adapt techniques from natural language processing to obtain embeddings for the enterprise's relational data. However, this approach blindly treats a tuple as a sentence, thus losing a large amount of contextual information present in the tuple. We propose algorithms for obtaining local embeddings that are effective for data integration tasks on relational databases. We make four major contributions. First, we describe a compact graph-based representation that allows the specification of a rich set of relationships inherent in the relational world. Second, we propose how to derive sentences from such a graph that effectively "describe" the similarity across elements (tokens, attributes, rows) in the two datasets. The embeddings are learned based on such sentences. Third, we propose effective optimization to improve the quality of the learned embeddings and the performance of integration tasks. Finally, we propose a diverse collection of criteria to evaluate relational embeddings and perform an extensive set of experiments validating them against multiple baseline methods. Our experiments show that our framework, EmbDI, produces meaningful results for data integration tasks such as schema matching and entity resolution both in supervised and unsupervised settings.

【Keywords】: Theory of computation; Theory and algorithms for application domains; Database theory; Data integration

Paper Link】 【Pages】:1351-1365

【Authors】: Shay Gershtein ; Tova Milo ; Gefen Morami ; Slava Novgorodov

【Abstract】: Search over massive sets of items is the cornerstone of many modern applications. Users express a set of properties and expect the system to retrieve qualifying items. A common difficulty, however, is that the information on whether an item satisfies the search criteria is not explicitly recorded in the repository. Instead, it may be general knowledge or "hidden" in a picture/description, leading to incomplete search results. To overcome this problem, companies build dedicated classifiers that determine which items satisfy the given criteria. However, building classifiers requires volumes of high-quality labeled training data. Since the costs of training classifiers for different subsets of properties can vastly differ, the choice of which classifiers to train has great monetary significance. The goal of our research is to devise effective algorithms to choose which classifiers one should train to address a given query load while minimizing the cost. Previous work considered a simplified model with uniform classifier costs, and queries with two properties. We remove these restrictions in our model. We prove NP-hard inapproximability bounds and devise several algorithms with approximation guarantees. Moreover, we identify a common special case for which we provide an exact algorithm. Our experiments, performed over real-life datasets, demonstrate the effectiveness and efficiency of our algorithms.

【Keywords】: Information systems; World Wide Web; Web applications; Electronic commerce

Research 16: Graph and Stream Processing 5

91. Scaling Up Distance Labeling on Graphs with Core-Periphery Properties.

Paper Link】 【Pages】:1367-1381

【Authors】: Wentao Li ; Miao Qiao ; Lu Qin ; Ying Zhang ; Lijun Chang ; Xuemin Lin

【Abstract】: In indexing a graph for distance queries, distance labeling is a common practice; in particular, 2-hop labeling which guarantees the exactness of the query results is widely adopted. When it comes to a massive real graph with a relatively large treewidth such as social networks and web graphs, however, 2-hop labeling can hardly be constructed due to the oversized index. This paper discloses the theoretical relationships between the graph treewidth and 2-hop labeling's index size and query time. To scale up distance labeling, this paper proposes Core-Tree (CT) Index to facilitate a critical and effective trade-off between the index size and query time. The reduced index size enables CT-Index to handle massive graphs that no existing approaches can process while the cost in the query time is negligible: the query time is below 0.4 milliseconds on all tested graphs including one graph with 5.5 billion edges.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing
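
For context on the 2-hop labeling that CT-Index builds on: every vertex stores a label of (hub, distance) pairs, and an exact distance query is answered by meeting at a common hub. A minimal sketch of the query step is shown below (constructing a correct and compact label set is the hard part and is omitted); all names are illustrative.

    from typing import Dict, List, Tuple

    INF = float("inf")

    def hop2_distance(label_u: List[Tuple[int, int]],
                      label_v: List[Tuple[int, int]]) -> float:
        """Distance from 2-hop labels: min over common hubs h of d(u,h) + d(h,v).

        Each label is a list of (hub, distance) pairs; exactness requires that
        for every vertex pair some shortest path passes through a shared hub.
        """
        dist_u: Dict[int, int] = dict(label_u)
        best = INF
        for hub, d_hv in label_v:
            d_uh = dist_u.get(hub)
            if d_uh is not None:
                best = min(best, d_uh + d_hv)
        return best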

92. Factorized Graph Representations for Semi-Supervised Learning from Sparse Data.

Paper Link】 【Pages】:1383-1398

【Authors】: Krishna Kumar P. ; Paul Langton ; Wolfgang Gatterbauer

【Abstract】: Node classification is an important problem in graph data management. It is commonly solved by various label propagation methods that iteratively pass messages along edges, starting from a few labeled seed nodes. For graphs with arbitrary compatibilities between classes, these methods crucially depend on knowing the compatibility matrix, which must thus be provided by either domain experts or heuristics. We instead suggest a principled and scalable method for directly estimating the compatibilities from a sparsely labeled graph. This method, which we call distant compatibility estimation, works even on extremely sparsely labeled graphs (e.g., 1 in 10,000 nodes is labeled) in a fraction of the time it later takes to label the remaining nodes. Our approach first creates multiple factorized graph representations (with size independent of the graph) and then performs estimation on these smaller graph sketches. We refer to algebraic amplification as the underlying idea of leveraging algebraic properties of an algorithm's update equations to amplify sparse signals in data. We show that our estimator is by orders of magnitude faster than alternative approaches and that the end-to-end classification accuracy is comparable to using gold standard compatibilities. This makes it a cheap pre-processing step for any existing label propagation method and removes the current dependence on heuristics.

【Keywords】: Computing methodologies; Machine learning; Learning settings; Semi-supervised learning settings; Machine learning approaches; Factorization methods; Learning linear models; Symbolic and algebraic manipulation; Symbolic and algebraic algorithms; Linear algebra algorithms; Optimization algorithms; Information systems; Data management systems; Information systems applications; Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms; Paths and connectivity problems; Probability and statistics; Probabilistic reasoning algorithms; Loopy belief propagation; Probabilistic representations; Markov networks; Theory of computation; Theory and algorithms for application domains; Machine learning theory; Semi-supervised learning

93. Reliable Data Distillation on Graph Convolutional Network.

Paper Link】 【Pages】:1399-1414

【Authors】: Wentao Zhang ; Xupeng Miao ; Yingxia Shao ; Jiawei Jiang ; Lei Chen ; Olivier Ruas ; Bin Cui

【Abstract】: Graph Convolutional Network (GCN) is a widely used method for learning from graph-based data. However, it fails to use the unlabeled data to its full potential, thereby limiting its performance. Given some pseudo labels of the unlabeled data, the GCN can benefit from this extra supervision. Based on Knowledge Distillation and Ensemble Learning, many methods use a teacher-student architecture to make better use of the unlabeled data and then make a better prediction. However, these methods introduce unnecessary training costs and a high bias in the student model if the teacher's predictions are unreliable. Besides, the final ensemble gains are limited due to limited diversity in the combined models. Therefore, we propose Reliable Data Distillation, a reliable-data-driven semi-supervised GCN training method. By defining the node reliability and edge reliability in a graph, we can make better use of high-quality data and improve the graph representation learning. Furthermore, considering the data reliability and data importance, we propose a new ensemble learning method for GCN and a novel Self-Boosting SSL Framework to combine the above optimizations. Finally, our extensive evaluation of Reliable Data Distillation on real-world datasets shows that our approach outperforms the state-of-the-art methods on semi-supervised node classification tasks.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

94. Regular Path Query Evaluation on Streaming Graphs.

Paper Link】 【Pages】:1415-1430

【Authors】: Anil Pacaci ; Angela Bonifati ; M. Tamer Özsu

【Abstract】: We study persistent query evaluation over streaming graphs, which is becoming increasingly important. We focus on navigational queries that determine if there exists a path between two entities that satisfies a user-specified constraint. We adopt the Regular Path Query (RPQ) model that specifies navigational patterns with labeled constraints. We propose deterministic algorithms to efficiently evaluate persistent RPQs under both arbitrary and simple path semantics in a uniform manner. Experimental analysis on real and synthetic streaming graphs shows that the proposed algorithms can process up to tens of thousands of edges per second and efficiently answer RPQs that are commonly used in real-world workloads.

【Keywords】: Information systems; Data management systems; Database design and models; Graph-based database models; Database management system engines; Stream management; Query languages; Query languages for non-relational engines

95. Timely Reporting of Heavy Hitters using External Memory.

Paper Link】 【Pages】:1431-1446

【Authors】: Prashant Pandey ; Shikha Singh ; Michael A. Bender ; Jonathan W. Berry ; Martin Farach-Colton ; Rob Johnson ; Thomas M. Kroeger ; Cynthia A. Phillips

【Abstract】: Given an input stream S of size N, a φ-heavy hitter is an item that occurs at least φN times in S. The problem of finding heavy-hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = φN-th occurrence (and hence becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams, and with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false-positives requires large space (Ω(N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O-bandwidth (not latency) and support a tunable trade-off between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ~100K observations per second.

【Keywords】: Theory of computation; Design and analysis of algorithms; Data structures design and analysis; Streaming, sublinear and near linear time algorithms
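
To make the problem statement concrete, the toy sketch below reports an item the moment its exact count reaches the threshold T = φN. It keeps all counts in RAM and therefore needs Ω(N) words in the worst case; the paper's contribution is obtaining the same accuracy, sensitivity, and timeliness guarantees with external-memory data structures under an I/O budget.

    import math
    from collections import defaultdict
    from typing import Hashable, Iterable, Iterator, Tuple

    def timely_heavy_hitters(stream: Iterable[Hashable],
                             phi: float,
                             n: int) -> Iterator[Tuple[int, Hashable]]:
        """Yield (position, item) as soon as an item reaches phi * n occurrences."""
        threshold = max(1, math.ceil(phi * n))
        counts = defaultdict(int)
        for pos, item in enumerate(stream):
            counts[item] += 1
            if counts[item] == threshold:   # exactly the T-th occurrence: report once
                yield pos, item

    # Example: list(timely_heavy_hitters("aabacada", phi=0.5, n=8)) -> [(5, 'a')]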

Industry 3: Cloud and Distributed Databases 5

96. A Framework for Emulating Database Operations in Cloud Data Warehouses.

Paper Link】 【Pages】:1447-1461

【Authors】: Mohamed A. Soliman ; Lyublena Antova ; Marc Sugiyama ; Michael Duller ; Amirhossein Aleyasen ; Gourab Mitra ; Ehab Abdelhamid ; Mark Morcos ; Michele Gage ; Dmitri Korablev ; Florian M. Waas

【Abstract】: In recent years, increased interest in cloud-based data warehousing technologies has emerged, with many enterprises moving away from on-premise data warehousing solutions. The incentives for adopting cloud data warehousing technologies are many: cost-cutting, on-demand pricing, offloading data centers, unlimited hardware resources, built-in disaster recovery, to name a few. There are inherent differences in the language surface and feature sets of on-premise and cloud data warehousing solutions. These could range from subtle syntactic and semantic differences, with potentially big impact on result correctness, to complete features that exist in one system but are missing in others. While there have been some efforts to help automate the migration of on-premise applications to new cloud environments, a major challenge that slows down the migration pace is the handling of features not yet supported, or partially supported, by the cloud technologies. In this paper we build on our earlier work in adaptive data virtualization and present novel techniques that allow running applications utilizing sophisticated database features within foreign query engines lacking the native support of such features. In particular, we introduce a framework to manage discrepancy of metadata across heterogeneous query engines, and various mechanisms to emulate database applications code in cloud environments without any need to rewrite or change the application code.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing

97. Taurus Database: How to be Fast, Available, and Frugal in the Cloud.

Paper Link】 【Pages】:1463-1478

【Authors】: Alex Depoutovitch ; Chong Chen ; Jin Chen ; Paul Larson ; Shu Lin ; Jack Ng ; Wenlin Cui ; Qiang Liu ; Wei Huang ; Yong Xiao ; Yongjun He

【Abstract】: Using cloud Database as a Service (DBaaS) offerings instead of on-premise deployments is increasingly common. Key advantages include improved availability and scalability at a lower cost than on-premise alternatives. In this paper, we describe the design of Taurus, a new multi-tenant cloud database system. Taurus separates the compute and storage layers in a similar manner to Amazon Aurora and Microsoft Socrates and provides similar benefits, such as read replica support, low network utilization, hardware sharing and scalability. However, the Taurus architecture has several unique advantages. Taurus offers novel replication and recovery algorithms providing better availability than existing approaches using the same or fewer replicas. Also, Taurus is highly optimized for performance, using no more than one network hop on critical paths and exclusively using append-only storage, delivering faster writes, reduced device wear, and constant-time snapshots. This paper describes Taurus and provides a detailed description and analysis of the storage node architecture, which has not been previously available from the published literature.

【Keywords】: Computer systems organization; Dependable and fault-tolerant systems and networks; Availability; Reliability; Information systems; Data management systems; Database management system engines; DBMS engine architectures; Parallel and distributed DBMSs; Relational parallel and distributed DBMSs

98. Reliability Analytics for Cloud Based Distributed Databases.

Paper Link】 【Pages】:1479-1492

【Authors】: Mathieu B. Demarne ; Jim Gramling ; Tomer Verona ; Miso Cilimdzic

【Abstract】: We present RADD, an innovative analytic pipeline used to measure reliability and availability for cloud-based distributed databases by leveraging the vast amount of telemetry present in the cloud. RADD can perform root cause analysis (RCA) to provide a minute-by-minute summary of the availability of a database close to real-time. On top of this data, RADD can raise alerts, analyze the stability of new versions during their deployment, and provide Key Performance Indicators (KPIs) that allow us to understand the stability of our system across all deployed databases. RADD implements an event correlation framework that puts the emphasis on data compliance and uses information entropy to measure causality and reduce noisy signals. It also uses statistical modelling to analyze new versions of the product and detect potential regressions early in our software development lifecycle. We demonstrate the application of RADD on top of Azure Synapse Analytics, where the system has helped us identify top-hitting and new issues and support on-call teams regarding every aspect of database health.

【Keywords】: Information systems; Data management systems; Database administration; Database utilities and tools; Database management system engines; Parallel and distributed DBMSs; Relational parallel and distributed DBMSs

99. CockroachDB: The Resilient Geo-Distributed SQL Database.

Paper Link】 【Pages】:1493-1509

【Authors】: Rebecca Taft ; Irfan Sharif ; Andrei Matei ; Nathan VanBenschoten ; Jordan Lewis ; Tobias Grieger ; Kai Niemi ; Andy Woods ; Anne Birzin ; Raphael Poss ; Paul Bardea ; Amruta Ranade ; Ben Darnell ; Bram Gruneir ; Justin Jaffray ; Lucy Zhang ; Peter Mattis

【Abstract】: We live in an increasingly interconnected world, with many organizations operating across countries or even continents. To serve their global user base, organizations are replacing their legacy DBMSs with cloud-based systems capable of scaling OLTP workloads to millions of users. CockroachDB is a scalable SQL DBMS that was built from the ground up to support these global OLTP workloads while maintaining high availability and strong consistency. Just like its namesake, CockroachDB is resilient to disasters through replication and automatic recovery mechanisms. This paper presents the design of CockroachDB and its novel transaction model that supports consistent geo-distributed transactions on commodity hardware. We describe how CockroachDB replicates and distributes data to achieve fault tolerance and high performance, as well as how its distributed SQL layer automatically scales with the size of the database cluster while providing the standard SQL interface that users expect. Finally, we present a comprehensive performance evaluation and share a couple of case studies of CockroachDB users. We conclude by describing lessons learned while building CockroachDB over the last five years.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Database transaction processing; Distributed database transactions; Parallel and distributed DBMSs; Relational parallel and distributed DBMSs

100. Azure SQL Database Always Encrypted.

Paper Link】 【Pages】:1511-1525

【Authors】: Panagiotis Antonopoulos ; Arvind Arasu ; Kunal D. Singh ; Ken Eguro ; Nitish Gupta ; Rajat Jain ; Raghav Kaushik ; Hanuma Kodavalla ; Donald Kossmann ; Nikolas Ogg ; Ravi Ramamurthy ; Jakub Szymaszek ; Jeffrey Trimmer ; Kapil Vaswani ; Ramarathnam Venkatesan ; Mike Zwilling

【Abstract】: This paper presents Always Encrypted, a recently released feature of Microsoft SQL Server that uses column granularity encryption to provide cryptographic data protection guarantees. Always Encrypted can be used to outsource database administration while keeping the data confidential from an administrator, including cloud operators. The first version of Always Encrypted was released in Azure SQL Database and as part of SQL Server 2016, and supported equality operations over deterministically encrypted columns. The second version, released as part of SQL Server 2019, uses an enclave running within a trusted execution environment to provide richer functionality that includes comparison and string pattern matching for an IND-CPA-secure (randomized) encryption scheme. We present the security, functionality, and design of Always Encrypted, and provide a performance evaluation using the TPC-C benchmark.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Cloud computing; Information systems; Data management systems; Database management system engines; Database query processing; Database transaction processing; Security and privacy; Database and storage security; Security in hardware

Research 17: Data Exploration and Preparation 5

101. Automatically Generating Data Exploration Sessions Using Deep Reinforcement Learning.

Paper Link】 【Pages】:1527-1537

【Authors】: Ori Bar El ; Tova Milo ; Amit Somech

【Abstract】: Exploratory Data Analysis (EDA) is an essential yet highly demanding task. To get a head start before exploring a new dataset, data scientists often prefer to view existing EDA notebooks -- illustrative, curated exploratory sessions, on the same dataset, that were created by fellow data scientists who shared them online. Unfortunately, such notebooks are not always available (e.g., if the dataset is new or confidential). To address this, we present ATENA, a system that takes an input dataset and auto-generates a compelling exploratory session, presented in an EDA notebook. We shape EDA into a control problem, and devise a novel Deep Reinforcement Learning (DRL) architecture to effectively optimize the notebook generation. Though ATENA uses a limited set of EDA operations, our experiments show that it generates useful EDA notebooks, allowing users to gain actual insights.

【Keywords】: Mathematics of computing; Probability and statistics; Statistical paradigms; Exploratory data analysis; Theory of computation; Theory and algorithms for application domains; Machine learning theory; Reinforcement learning

102. Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks.

Paper Link】 【Pages】:1539-1554

【Authors】: Cong Yan ; Yeye He

【Abstract】: Data preparation is widely recognized as the most time-consuming process in modern business intelligence (BI) and machine learning (ML) projects. Automating complex data preparation steps (e.g., Pivot, Unpivot, Normalize-JSON, etc.) holds the potential to greatly improve user productivity, and has therefore become a central focus of research. We propose a novel approach to "auto-suggest" contextualized data preparation steps, by "learning" from how data scientists would manipulate data, as documented in data science notebooks widely available today. Specifically, we crawled over 4M Jupyter notebooks on GitHub, and replayed them step-by-step, to observe not only full input/output tables (data-frames) at each step, but also the exact data-preparation choices data scientists make that they believe are best suited to the input data (e.g., how input tables are Joined/Pivoted/Unpivoted, etc.). By essentially "logging" how data scientists interact with diverse tables, and using the resulting logs as a proxy of "ground truth", we can learn-to-recommend data preparation steps best suited to given user data, just like how search engines (Google or Bing) leverage their click-through logs to learn-to-rank documents. This data-driven and log-driven approach leverages the "collective wisdom" of data scientists embodied in the notebooks, and is shown to significantly outperform strong baselines including commercial systems in terms of accuracy.

【Keywords】: Information systems; Data management systems; Information integration; Data cleaning; Extraction, transformation and loading; Theory of computation; Theory and algorithms for application domains; Database theory; Data integration

103. IDEBench: A Benchmark for Interactive Data Exploration.

Paper Link】 【Pages】:1555-1569

【Authors】: Philipp Eichmann ; Emanuel Zgraggen ; Carsten Binnig ; Tim Kraska

【Abstract】: In recent years, many query processing techniques have been developed to better support interactive data exploration (IDE) of large structured datasets. To evaluate and compare database engines in terms of how well they support such workloads, experimenters have mostly used self-designed evaluation procedures rather than established benchmarks. In this paper we argue that this is due to the fact that the workloads and metrics of popular analytical benchmarks such as TPC-H or TPC-DS were designed for traditional performance reporting scenarios, and do not capture distinctive IDE characteristics. Guided by the findings of several user studies we present a new benchmark called IDEBench, designed to evaluate database engines based on common IDE workflows and metrics that matter to the end-user. We demonstrate the applicability of IDEBench through a number of experiments with five different database engines, and present and discuss our findings.

【Keywords】: Information systems; Information systems applications; Decision support systems; Data analytics; Mathematics of computing; Probability and statistics; Statistical paradigms; Exploratory data analysis

104. Database Benchmarking for Supporting Real-Time Interactive Querying of Large Data.

Paper Link】 【Pages】:1571-1587

【Authors】: Leilani Battle ; Philipp Eichmann ; Marco Angelini ; Tiziana Catarci ; Giuseppe Santucci ; Yukun Zheng ; Carsten Binnig ; Jean-Daniel Fekete ; Dominik Moritz

【Abstract】: In this paper, we present a new benchmark to validate the suitability of database systems for interactive visualization workloads. While there exist proposals for evaluating database systems on interactive data exploration workloads, none rely on real user traces for database benchmarking. To this end, our long term goal is to collect user traces that represent workloads with different exploration characteristics. In this paper, we present an initial benchmark that focuses on "crossfilter"-style applications, which are a popular interaction type for data exploration and a particularly demanding scenario for testing database system performance. We make our benchmark materials, including input datasets, interaction sequences, corresponding SQL queries, and analysis code, freely available as a community resource, to foster further research in this area: https://osf.io/9xerb/?view_only=81de1a3f99d04529b6b173a3bd5b4d23.

【Keywords】:

105. Benchmarking Spreadsheet Systems.

Paper Link】 【Pages】:1589-1599

【Authors】: Sajjadur Rahman ; Kelly Mack ; Mangesh Bendre ; Ruilin Zhang ; Karrie Karahalios ; Aditya G. Parameswaran

【Abstract】: Spreadsheet systems are used for storing and analyzing data across domains by programmers and non-programmers alike. While spreadsheet systems have continued to support increasingly large datasets, they are prone to hanging and freezing while performing computations even on much smaller ones. We present a benchmarking study that evaluates and compares the performance of three popular systems, Microsoft Excel, LibreOffice Calc, and Google Sheets, on a range of canonical spreadsheet computation operations. We find that spreadsheet systems lack interactivity for several operations, on datasets well below their advertised scalability limits. We further evaluate whether spreadsheet systems adopt database optimization techniques such as indexing, intelligent data layout, and incremental and shared computation, to efficiently execute computation operations. We outline several ways future spreadsheet systems can be redesigned to offer interactive response times on large datasets.

【Keywords】: Information systems; Data management systems

Research 18: Main Memory Databases and Modern Hardware 5

Paper Link】 【Pages】:1601-1615

【Authors】: Huanchen Zhang ; Xiaoxuan Liu ; David G. Andersen ; Michael Kaminsky ; Kimberly Keeton ; Andrew Pavlo

【Abstract】: We present the High-speed Order-Preserving Encoder (HOPE) for in-memory search trees. HOPE is a fast dictionary-based compressor that encodes arbitrary keys while preserving their order. HOPE's approach is to identify common key patterns at a fine granularity and exploit the entropy to achieve high compression rates with a small dictionary. We first develop a theoretical model to reason about order-preserving dictionary designs. We then select six representative compression schemes using this model and implement them in HOPE. These schemes make different trade-offs between compression rate and encoding speed. We evaluate HOPE on five data structures used in databases: SuRF, ART, HOT, B+tree, and Prefix B+tree. Our experiments show that using HOPE allows the search trees to achieve lower query latency (up to 40% lower) and better memory efficiency (up to 30% smaller) simultaneously for most string key workloads.

【Keywords】: Information systems; Data management systems; Data structures; Data layout; Data compression

107. A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics.

Paper Link】 【Pages】:1617-1632

【Authors】: Anil Shanbhag ; Samuel Madden ; Xiangyao Yu

【Abstract】: There has been a significant amount of excitement and recent work on GPU-based database systems. Previous work has claimed that these systems can perform orders of magnitude better than CPU-based database systems on analytical workloads such as those found in decision support and business intelligence applications. A hardware expert would view these claims with suspicion. Given the general notion that database operators are memory-bandwidth bound, one would expect the maximum gain to be roughly equal to the ratio of the memory bandwidth of the GPU to that of the CPU. In this paper, we adopt a model-based approach to understand when and why the performance gains of running queries on GPUs vs on CPUs vary from the bandwidth ratio (which is roughly 16× on modern hardware). We propose Crystal, a library of parallel routines that can be combined together to run full SQL queries on a GPU with minimal materialization overhead. We implement individual query operators to show that while the speedups for selection, projection, and sorts are near the bandwidth ratio, joins achieve less speedup due to differences in hardware capabilities. Interestingly, we show on a popular analytical workload that the full query performance gain from running on the GPU exceeds the bandwidth ratio despite individual operators having speedups below the bandwidth ratio, as a result of limitations of vectorizing chained operators on CPUs, resulting in a 25× speedup for GPUs over CPUs on the benchmark.
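A back-of-the-envelope version of the bandwidth-ratio argument in the abstract can be written down directly; the bandwidth numbers and scan size below are illustrative assumptions, not the paper's measurements.

```python
# For memory-bandwidth-bound operators, the expected GPU-over-CPU speedup is
# roughly the ratio of memory bandwidths. All numbers below are assumptions
# for illustration.

gpu_bandwidth_gbs = 880.0   # e.g., a modern discrete GPU with HBM2
cpu_bandwidth_gbs = 55.0    # e.g., one multi-core server CPU socket

bytes_scanned = 32 * 10**9  # a 32 GB scan-heavy query

expected_speedup = gpu_bandwidth_gbs / cpu_bandwidth_gbs          # ~16x
cpu_time_s = bytes_scanned / (cpu_bandwidth_gbs * 10**9)
gpu_time_s = bytes_scanned / (gpu_bandwidth_gbs * 10**9)

print(f"bandwidth ratio ~ {expected_speedup:.1f}x")
print(f"scan time: CPU {cpu_time_s:.2f}s vs GPU {gpu_time_s:.3f}s")
```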

【Keywords】: Computer systems organization; Architectures; Other architectures; Heterogeneous (hybrid) systems; Information systems; Data management systems; Database management system engines; Database query processing; Query operators

108. Pump Up the Volume: Processing Large Data on GPUs with Fast Interconnects.

Paper Link】 【Pages】:1633-1649

【Authors】: Clemens Lutz ; Sebastian Breß ; Steffen Zeuch ; Tilmann Rabl ; Volker Markl

【Abstract】: GPUs have long been discussed as accelerators for database query processing because of their high processing power and memory bandwidth. However, two main challenges limit the utility of GPUs for large-scale data processing: (1) the on-board memory capacity is too small to store large data sets, yet (2) the interconnect bandwidth to CPU main-memory is insufficient for ad hoc data transfers. As a result, GPU-based systems and algorithms run into a transfer bottleneck and do not scale to large data sets. In practice, CPUs process large-scale data faster than GPUs with current technology. In this paper, we investigate how a fast interconnect can resolve these scalability limitations using the example of NVLink 2.0. NVLink 2.0 is a new interconnect technology that links dedicated GPUs to a CPU. The high bandwidth of NVLink 2.0 enables us to overcome the transfer bottleneck and to efficiently process large data sets stored in main-memory on GPUs. We perform an in-depth analysis of NVLink 2.0 and show how we can scale a no-partitioning hash join beyond the limits of GPU memory. Our evaluation shows speed-ups of up to 18x over PCI-e 3.0 and up to 7.3x over an optimized CPU implementation. Fast GPU interconnects thus enable GPUs to efficiently accelerate query processing.

【Keywords】: Information systems; Data management systems; Database management system engines; Main memory engines

109. Robust Performance of Main Memory Data Structures by Configuration.

Paper Link】 【Pages】:1651-1666

【Authors】: Tiemo Bang ; Ismail Oukid ; Norman May ; Ilia Petrov ; Carsten Binnig

【Abstract】: In this paper, we present a new approach for achieving robust performance of data structures, making it easier to reuse the same design not only across hardware generations but also across workloads. To achieve robust performance, the main idea is to strictly separate the data structure design from the actual strategies to execute access operations, and to adjust the execution strategies by means of so-called configurations instead of hard-wiring the execution strategy into the data structure. In our evaluation we demonstrate the benefits of this configuration approach for individual data structures as well as complex OLTP workloads.

【Keywords】:

110. Black or White? How to Develop an AutoTuner for Memory-based Analytics.

Paper Link】 【Pages】:1667-1683

【Authors】: Mayuresh Kunjir ; Shivnath Babu

【Abstract】: There is a lot of interest today in building autonomous (or, self-driving) data processing systems. An emerging school of thought is to leverage AI-driven "black box" algorithms for this purpose. In this paper, we present a contrarian view. We study the problem of autotuning the memory allocation for applications running on modern distributed data processing systems. We show that an empirically-driven "white-box" algorithm, called RelM, that we have developed provides a close-to-optimal tuning at a fraction of the overheads compared to state-of-the-art AI-driven "black box" algorithms, namely, Bayesian Optimization (BO) and Deep Distributed Policy Gradient (DDPG). The main reason for RelM's superior performance is that the memory management in modern memory-based data analytics systems is an interplay of algorithms at multiple levels: (i) at the resource-management level across various containers allocated by resource managers like Kubernetes and YARN, (ii) at the container level among the OS, pods, and processes such as the Java Virtual Machine (JVM), (iii) at the application level for caching, aggregation, data shuffles, and application data structures, and (iv) at the JVM level across various pools such as the Young and Old Generation. RelM understands these interactions and uses them in building an analytical solution to autotune the memory management knobs. In another contribution, called Guided-BO (GBO), we use RelM's analytical models to speed up BO. Through an evaluation based on Apache Spark, we showcase that RelM's recommendations are significantly better than what commonly-used Spark deployments provide, and are close to the ones obtained by brute-force exploration, while GBO provides optimality guarantees at a higher cost overhead than RelM, though still significantly lower than that of the state-of-the-art AI-driven policies.

【Keywords】: Information systems; Data management systems; Database administration; Autonomous database administration

Research 19: Machine Learning Systems and Applications 5

111. Vista: Optimized System for Declarative Feature Transfer from Deep CNNs at Scale.

Paper Link】 【Pages】:1685-1700

【Authors】: Supun Nakandala ; Arun Kumar

【Abstract】: Scalable systems for machine learning (ML) are largely siloed into dataflow systems for structured data and deep learning systems for unstructured data. This gap has left workloads that jointly analyze both forms of data with poor systems support, leading to both low system efficiency and grunt work for users. We bridge this gap for an important class of such workloads: feature transfer from deep convolutional neural networks (CNNs) for analyzing images along with structured data. Executing feature transfer on scalable dataflow and deep learning systems today faces two key systems issues: inefficiency due to redundant computations and crash-proneness due to mismanaged memory. We present Vista, a new data system that resolves these issues by elevating this workload to a declarative level on top of dataflow and deep learning systems. Vista automatically optimizes the configuration and execution of this workload to reduce both computational redundancy and the potential for workload crashes. Experiments on real datasets show that apart from making feature transfer easier, Vista avoids workload crashes and reduces runtimes by 58% to 92% compared to baselines.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks; Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Parallel and distributed DBMSs; Information systems applications; Decision support systems; Data analytics

112. Optimizing Machine Learning Workloads in Collaborative Environments.

Paper Link】 【Pages】:1701-1716

【Authors】: Behrouz Derakhshan ; Alireza Rezaei Mahdiraji ; Ziawasch Abedjan ; Tilmann Rabl ; Volker Markl

【Abstract】: Effective collaboration among data scientists results in high-quality and efficient machine learning (ML) workloads. In a collaborative environment, such as Kaggle or Google Colaboratory, users typically re-execute or modify published scripts to recreate or improve the result. This introduces many redundant data processing and model training operations. Reusing the data generated by the redundant operations leads to more efficient execution of future workloads. However, existing collaborative environments lack a data management component for storing and reusing the result of previously executed operations. In this paper, we present a system to optimize the execution of ML workloads in collaborative environments by reusing previously performed operations and their results. We utilize a so-called Experiment Graph (EG) to store the artifacts, i.e., raw and intermediate data or ML models, as vertices and operations of ML workloads as edges. In theory, the size of EG can become unnecessarily large, while the storage budget might be limited. At the same time, for some artifacts, the overall storage and retrieval cost might outweigh the recomputation cost. To address this issue, we propose two algorithms for materializing artifacts based on their likelihood of future reuse. Given the materialized artifacts inside EG, we devise a linear-time reuse algorithm to find the optimal execution plan for incoming ML workloads. Our reuse algorithm only incurs a negligible overhead and scales to the high number of incoming ML workloads in collaborative environments. Our experiments show that we improve the run-time by one order of magnitude for repeated execution of the workloads and by 50% for the execution of modified workloads in collaborative environments.
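The reuse idea can be illustrated with a small sketch: identify each operation's output by a hash of the operation and its inputs, and return a previously materialized artifact when the same identifier reappears. Names like artifact_id and run_or_reuse are invented here for illustration; the paper's Experiment Graph additionally decides which artifacts to materialize under a storage budget and computes optimal reuse plans.

```python
# Minimal sketch of operation-level reuse in a collaborative workload.
import hashlib

materialized = {}   # artifact id -> stored result

def artifact_id(op_name, input_ids, params=()):
    """Deterministic id for the output of an operation on given inputs."""
    payload = repr((op_name, tuple(input_ids), tuple(params))).encode()
    return hashlib.sha256(payload).hexdigest()

def run_or_reuse(op_name, inputs_with_ids, fn, params=()):
    out_id = artifact_id(op_name, [i for _, i in inputs_with_ids], params)
    if out_id in materialized:                      # reuse a prior result
        return materialized[out_id], out_id
    result = fn(*[v for v, _ in inputs_with_ids])   # recompute otherwise
    materialized[out_id] = result
    return result, out_id

# Two "users" running the same preprocessing step share the computation.
raw = list(range(10))
raw_id = artifact_id("load", ["dataset-v1"])
norm1, _ = run_or_reuse("normalize", [(raw, raw_id)], lambda xs: [x / 9 for x in xs])
norm2, _ = run_or_reuse("normalize", [(raw, raw_id)], lambda xs: [x / 9 for x in xs])
assert norm1 is norm2   # the second call hit the materialized artifact
```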

【Keywords】: Computing methodologies; Machine learning; Information systems; Data management systems; Database management system engines; Information systems applications; Collaborative and social computing systems and tools

113. GOGGLES: Automatic Image Labeling with Affinity Coding.

Paper Link】 【Pages】:1717-1732

【Authors】: Nilaksh Das ; Sanya Chaba ; Renzhi Wu ; Sakshi Gandhi ; Duen Horng Chau ; Xu Chu

【Abstract】: Generating large labeled training data is becoming the biggest bottleneck in building and deploying supervised machine learning models. Recently, the data programming paradigm has been proposed to reduce the human cost in labeling training data. However, data programming relies on designing labeling functions which still requires significant domain expertise. Also, it is prohibitively difficult to write labeling functions for image datasets as it is hard to express domain knowledge using raw features for images (pixels). We propose affinity coding, a new domain-agnostic paradigm for automated training data labeling. The core premise of affinity coding is that the affinity scores of instance pairs belonging to the same class on average should be higher than those of pairs belonging to different classes, according to some affinity functions. We build the GOGGLES system that implements affinity coding for labeling image datasets by designing a novel set of reusable affinity functions for images, and propose a novel hierarchical generative model for class inference using a small development set. We compare GOGGLES with existing data programming systems on 5 image labeling tasks from diverse domains. GOGGLES achieves labeling accuracies ranging from a minimum of 71% to a maximum of 98% without requiring any extensive human annotation. In terms of end-to-end performance, GOGGLES outperforms the state-of-the-art data programming system Snuba by 21% and a state-of-the-art few-shot learning technique by 5%, and is only 7% away from the fully supervised upper bound.

【Keywords】: Computing methodologies; Artificial intelligence; Computer vision; Computer vision representations; Machine learning; Learning paradigms; Unsupervised learning; Cluster analysis; Learning settings; Mathematics of computing; Probability and statistics; Probabilistic inference problems

114. DeepSqueeze: Deep Semantic Compression for Tabular Data.

Paper Link】 【Pages】:1733-1746

【Authors】: Amir Ilkhechi ; Andrew Crotty ; Alex Galakatos ; Yicong Mao ; Grace Fan ; Xiran Shi ; Ugur Çetintemel

【Abstract】: With the rapid proliferation of large datasets, efficient data compression has become more important than ever. Columnar compression techniques (e.g., dictionary encoding, run-length encoding, delta encoding) have proved highly effective for tabular data, but they typically compress individual columns without considering potential relationships among columns, such as functional dependencies and correlations. Semantic compression techniques, on the other hand, are designed to leverage such relationships to store only a subset of the columns necessary to infer the others, but existing approaches cannot effectively identify complex relationships across more than a few columns at a time. We propose DeepSqueeze, a novel semantic compression framework that can efficiently capture these complex relationships within tabular data by using autoencoders to map tuples to a lower-dimensional representation. DeepSqueeze also supports guaranteed error bounds for lossy compression of numerical data and works in conjunction with common columnar compression formats. Our experimental evaluation uses real-world datasets to demonstrate that DeepSqueeze can achieve over a 4x size reduction compared to state-of-the-art alternatives.
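The guaranteed error bound for lossy numeric compression can be sketched independently of the autoencoder: store only a coarse code of the residual between each value and the model's reconstruction, quantized so that decoding stays within a chosen epsilon. In the sketch below a trivial column-mean "model" stands in for the learned autoencoder, and all values are synthetic.

```python
# Sketch of an error-bounded residual-coding step for lossy numeric columns.
import numpy as np

EPSILON = 0.5
values = np.array([10.2, 9.8, 10.1, 30.4, 29.9, 10.0])
reconstruction = np.full_like(values, values.mean())   # stand-in for a decoder output

# Residuals are quantized to multiples of 2*EPSILON, so every decoded value is
# within EPSILON of the original; the small integer codes compress well.
codes = np.round((values - reconstruction) / (2 * EPSILON)).astype(int)
decoded = reconstruction + codes * 2 * EPSILON

assert np.all(np.abs(decoded - values) <= EPSILON)
print(codes)   # small integers clustered around a few distinct levels
```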

【Keywords】: Information systems; Data management systems; Data structures; Data layout; Data compression; Information storage systems; Record storage systems; Relational storage; Compression strategies

115. TRACER: A Framework for Facilitating Accurate and Interpretable Analytics for High Stakes Applications.

Paper Link】 【Pages】:1747-1763

【Authors】: Kaiping Zheng ; Shaofeng Cai ; Horng Ruey Chua ; Wei Wang ; Kee Yuan Ngiam ; Beng Chin Ooi

【Abstract】: In high stakes applications such as healthcare and finance analytics, the interpretability of predictive models is required and necessary for domain practitioners to trust the predictions. Traditional machine learning models, e.g., logistic regression (LR), are easy to interpret in nature. However, many of these models aggregate time-series data without considering the temporal correlations and variations. Therefore, their performance cannot match up to recurrent neural network (RNN) based models, which are nonetheless difficult to interpret. In this paper, we propose a general framework TRACER to facilitate accurate and interpretable predictions, with a novel model TITV devised for healthcare analytics and other high stakes applications such as financial investment and risk management. Different from LR and other existing RNN-based models, TITV is designed to capture both the time-invariant and the time-variant feature importance using a feature-wise transformation subnetwork and a self-attention subnetwork, for the feature influence shared over the entire time series and the time-related importance respectively. Healthcare analytics is adopted as a driving use case, and we note that the proposed TRACER is also applicable to other domains, e.g., fintech. We evaluate the accuracy of TRACER extensively on two real-world hospital datasets, and our doctors/clinicians further validate the interpretability of TRACER at both the patient level and the feature level. Besides, TRACER is also validated in a critical financial application. The experimental results confirm that TRACER facilitates both accurate and interpretable analytics for high stakes applications.

【Keywords】: Applied computing; Life and medical sciences; Health informatics; Mathematics of computing; Probability and statistics; Statistical paradigms; Time series analysis

Research 20: Graph Data Management and Analysis 5

116. Application Driven Graph Partitioning.

Paper Link】 【Pages】:1765-1779

【Authors】: Wenfei Fan ; Ruochun Jin ; Muyang Liu ; Ping Lu ; Xiaojian Luo ; Ruiqi Xu ; Qiang Yin ; Wenyuan Yu ; Jingren Zhou

【Abstract】: Graph partitioning is crucial to parallel computations on large graphs. The choice of partitioning strategies has a strong impact not only on the performance of graph algorithms, but also on the design of the algorithms. For an algorithm of interest, what partitioning strategy fits it best and improves its parallel execution? Is it possible to develop graph algorithms with partition transparency, such that the algorithms work under different partitions without changes? This paper aims to answer these questions. We propose an application-driven hybrid partitioning strategy that, given a graph algorithm A, learns a cost model for A as polynomial regression. We develop partitioners that, given the learned cost model, refine an edge-cut or vertex-cut partition to a hybrid partition and reduce the parallel cost of A. Moreover, we identify a general condition under which graph-centric algorithms are partition transparent. We show that a number of graph algorithms can be made partition transparent. Using real-life and synthetic graphs, we experimentally verify that our partitioning strategy improves the performance of a variety of graph computations, up to 22.5 times.
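The idea of learning a per-algorithm cost model as polynomial regression can be sketched with ordinary least squares over polynomial features of partition statistics; the chosen features (edge-cut size, maximum fragment size) and the training pairs below are invented stand-ins for profiled runs, not the paper's feature set.

```python
# Sketch: fit a degree-2 polynomial cost model from partition statistics to runtime.
import numpy as np

# (cut_edges, max_fragment_size) -> measured runtime, all synthetic
X = np.array([[1e5, 2e6], [2e5, 2e6], [1e5, 4e6], [3e5, 5e6], [4e5, 3e6], [2e5, 5e6]])
y = np.array([1.8, 2.9, 3.1, 5.6, 4.9, 4.4])

def poly_features(X):
    """Degree-2 polynomial features: 1, x1, x2, x1^2, x1*x2, x2^2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

coef, *_ = np.linalg.lstsq(poly_features(X), y, rcond=None)

def predicted_cost(cut_edges, max_fragment):
    return float((poly_features(np.array([[cut_edges, max_fragment]])) @ coef)[0])

# Compare two candidate partitions for the same algorithm.
print(predicted_cost(1.5e5, 3e6), predicted_cost(3.5e5, 2.5e6))
```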

【Keywords】: Theory of computation; Design and analysis of algorithms; Graph algorithms analysis

Paper Link】 【Pages】:1781-1795

【Authors】: Dian Ouyang ; Dong Wen ; Lu Qin ; Lijun Chang ; Ying Zhang ; Xuemin Lin

【Abstract】: Computing top-k nearest neighbors (kNN) is a fundamental problem in road networks. Existing solutions either need a complicated parameter configuration in index construction or incur high costs when scanning an unbounded number of vertices in query processing. In this paper, we propose a novel parameter-free index-based solution for the kNN query based on the concept of tree decomposition in large road networks. Based on our index structure, we propose an efficient and progressive algorithm that returns each result in a bounded delay. We also optimize the index structure, which improves the efficiency of both index construction and index maintenance in large road networks. We conduct extensive experiments to show the efficiency of our proposed algorithms and the effectiveness of our optimization techniques in real-world road networks from ten regions.

【Keywords】: Information systems; Information systems applications; Data mining; Nearest-neighbor search; Theory of computation; Design and analysis of algorithms; Graph algorithms analysis; Dynamic graph algorithms; Shortest paths; Streaming, sublinear and near linear time algorithms; Nearest neighbor algorithms

118. Memory-Aware Framework for Efficient Second-Order Random Walk on Large Graphs.

Paper Link】 【Pages】:1797-1812

【Authors】: Yingxia Shao ; Shiyue Huang ; Xupeng Miao ; Bin Cui ; Lei Chen

【Abstract】: Second-order random walk is an important technique for graph analysis. Many applications use it to capture higher-order patterns in the graph, thus improving the model accuracy. However, the memory explosion problem of this technique hinders it from analyzing large graphs. When processing a billion-edge graph like Twitter, existing solutions (e.g., the alias method) for the second-order random walk may take up to 1796TB of memory. Such high memory overhead comes from the memory-unaware strategies for node sampling across the graph. In this paper, to clearly study the efficiency of various node sampling methods in the context of second-order random walk, we design a cost model, and then propose a new node sampling method following the acceptance-rejection paradigm to achieve a better balance between memory and time cost. Further, to guarantee the efficiency of the second-order random walk within arbitrary memory budgets, we propose a memory-aware framework on the basis of the cost model. The framework applies a cost-based optimizer to assign a desirable node sampling method for each node in the graph within a memory budget while minimizing the time cost. Finally, we provide general programming interfaces for users to benefit from the memory-aware framework easily. The empirical studies demonstrate that our memory-aware framework is robust with respect to memory and is able to achieve considerable efficiency by reducing 90% of the memory cost.
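The acceptance-rejection idea can be illustrated on a node2vec-style second-order walk: propose a neighbor cheaply and accept it with probability proportional to the second-order weight, so no per-(previous, current) alias table has to be materialized. The toy graph and the p, q parameters below are illustrative assumptions, and this sketch does not model the paper's cost-based method selection.

```python
# Sketch of acceptance-rejection sampling for one step of a second-order
# (node2vec-style) random walk; only the plain adjacency lists are stored.
import random

graph = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}   # adjacency lists
p, q = 4.0, 0.5   # return / in-out parameters (illustrative)

def second_order_weight(prev, cur, nxt):
    if nxt == prev:                 # returning to the previous node
        return 1.0 / p
    if nxt in graph[prev]:          # next node is a neighbor of prev
        return 1.0
    return 1.0 / q                  # next node is two hops from prev

def next_node(prev, cur):
    neighbors = graph[cur]
    w_max = max(1.0, 1.0 / p, 1.0 / q)        # upper bound on any weight
    while True:
        candidate = random.choice(neighbors)  # cheap uniform proposal
        if random.random() < second_order_weight(prev, cur, candidate) / w_max:
            return candidate                  # accepted in proportion to weight

walk = [1, 2]
for _ in range(5):
    walk.append(next_node(walk[-2], walk[-1]))
print(walk)
```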

【Keywords】: Theory of computation; Design and analysis of algorithms; Graph algorithms analysis; Parallel algorithms

119. Hub Labeling for Shortest Path Counting.

Paper Link】 【Pages】:1813-1828

【Authors】: Yikai Zhang ; Jeffrey Xu Yu

【Abstract】: The notion of shortest path is fundamental in graph analytics. While many works have been devoted to devising efficient distance oracles to compute the shortest distance between any vertices s and t, we study the problem of efficiently counting the number of shortest paths between s and t in light of its applications in tasks such as betweenness-related analysis. Specifically, we propose a hub labeling scheme based on hub pushing and discuss several graph reduction techniques to reduce the index size. Furthermore, we prove several theoretical results on the performance of the scheme for some special graph classes. Our empirical study verifies the efficiency and effectiveness of the algorithms. In particular, a query evaluation takes only hundreds of microseconds on average for graphs with up to hundreds of millions of edges. We report our findings in this paper.
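The query side of a counting-oriented hub labeling can be sketched as follows: each label entry stores a hub, the distance to it, and the number of shortest paths to it, and a query combines the labels of s and t. The toy labels are hand-made, and the sketch assumes the labels were constructed (e.g., by hub pushing) so that every shortest path is counted through exactly one hub; it is not the paper's construction algorithm.

```python
# Query-side sketch of a hub labeling for shortest-path counting.
def query(label_s, label_t):
    """Return (shortest distance, number of shortest paths) from s to t."""
    best_dist, best_count = float("inf"), 0
    dt = {hub: (d, c) for hub, d, c in label_t}
    for hub, d_s, c_s in label_s:
        if hub not in dt:
            continue
        d_t, c_t = dt[hub]
        if d_s + d_t < best_dist:
            best_dist, best_count = d_s + d_t, c_s * c_t
        elif d_s + d_t == best_dist:
            best_count += c_s * c_t
    return best_dist, best_count

# Toy labels: two shortest s-t paths of length 3, one through hub 'a',
# one through hub 'b'.
label_s = [("a", 1, 1), ("b", 2, 1)]
label_t = [("a", 2, 1), ("b", 1, 1)]
print(query(label_s, label_t))   # -> (3, 2)
```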

【Keywords】: Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms

120. CHASSIS: Conformity Meets Online Information Diffusion.

Paper Link】 【Pages】:1829-1840

【Authors】: Hui Li ; Hui Li ; Sourav S. Bhowmick

【Abstract】: Online information diffusion generates huge volumes of social activities (e.g., tweets, retweets, posts, comments, likes) among individuals. Existing information diffusion modeling techniques are oblivious to the conformity of individuals during the diffusion process, a fundamental human trait according to social psychology theories. Intuitively, conformity captures the extent to which an individual complies with social norms or expectations. In this paper, we present a novel framework called CHASSIS to characterize online information diffusion by bridging a classical information diffusion model with conformity from social psychology. To this end, we first extend the "Hawkes Process", a well-known statistical technique utilized to model information diffusion, to quantitatively capture two flavors of conformity, informational conformity and normative conformity, hidden in activity sequences. Next, we present a novel semi-parametric inference approach to learn the proposed model. Experimental study with real-world datasets demonstrates the superiority of CHASSIS over state-of-the-art conformity-unaware information diffusion models.
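For reference, the classical Hawkes-process conditional intensity that the abstract says is extended with conformity has the standard exponential-kernel form below; this is the textbook formula, not the paper's extended model.

```latex
% Classical Hawkes-process conditional intensity with an exponential kernel:
% \mu is the base rate, t_i ranges over past activity timestamps, and
% \alpha, \beta control the strength and decay of self-excitation.
\lambda(t) \;=\; \mu + \sum_{t_i < t} \alpha \, e^{-\beta (t - t_i)}
```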

【Keywords】: Information systems; Information systems applications; Collaborative and social computing systems and tools

Research 21: Spatial, Temporal, and Multimedia Data I 5

121. Architecture-Intact Oracle for Fastest Path and Time Queries on Dynamic Spatial Networks.

Paper Link】 【Pages】:1841-1856

【Authors】: Victor Junqiu Wei ; Raymond Chi-Wing Wong ; Cheng Long

【Abstract】: Given two vertices of interest (POIs) s and t on a spatial network, a distance (path) query returns the shortest network distance (shortest path) from s to t. This query has a variety of applications in practice and is a fundamental operation for many database and data mining algorithms. In this paper, we propose an efficient distance and path oracle on dynamic road networks using the randomization technique. Our oracle performs well in practice and, remarkably, also has a favorable theoretical bound. Specifically, it has O(n log² n) preprocessing time and O(n log² n) space, O(log⁴ n · log log n) distance query time and O(log⁴ n · log log n + l) shortest path query time, as well as O(log³ n) update time, all with high probability (w.h.p.), where n is the number of vertices in the spatial network and l is the number of edges on the shortest path. Our experiments show that the existing oracles suffer from a huge updating time that renders them impractical, whereas our oracle enjoys a negligible updating time while having query time and indexing cost comparable to the best existing oracle.

【Keywords】: Information systems; Data management systems; Database design and models; Graph-based database models; Network data models; Information systems applications; Spatial-temporal systems; Geographic information systems; Location based services

Paper Link】 【Pages】:1857-1873

【Authors】: Anna Gogolou ; Theophanis Tsandilas ; Karima Echihabi ; Anastasia Bezerianos ; Themis Palpanas

【Abstract】: Existing systems dealing with the increasing volume of data series cannot guarantee interactive response times, even for fundamental tasks such as similarity search. Therefore, it is necessary to develop analytic approaches that support exploration and decision making by providing progressive results, before the final and exact ones have been computed. Prior works lack both efficiency and accuracy when applied to large-scale data series collections. We present and experimentally evaluate a new probabilistic learning-based method that provides quality guarantees for progressive Nearest Neighbor (NN) query answering. We provide both initial and progressive estimates of the final answer that keep improving during the similarity search, as well as suitable stopping criteria for the progressive queries. Experiments with synthetic and diverse real datasets demonstrate that our prediction methods constitute the first practical solution to the problem, significantly outperforming competing approaches.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Information retrieval; Retrieval models and ranking; Top-k retrieval in databases; Information systems applications; Data mining; Nearest-neighbor search

123. A GPU-friendly Geometric Data Model and Algebra for Spatial Queries.

Paper Link】 【Pages】:1875-1885

【Authors】: Harish Doraiswamy ; Juliana Freire

【Abstract】: The availability of low cost sensors has led to an unprecedented growth in the volume of spatial data. Unfortunately, the time required to evaluate even simple spatial queries over large data sets greatly hampers our ability to interactively explore these data sets and extract actionable insights. While Graphics Processing Units (GPUs) are increasingly being used to speed up spatial queries, existing solutions have two important drawbacks: they are often tightly coupled to the specific query types they target, making it hard to adapt them for other queries; and since their design is based on CPU-based approaches, it can be difficult to effectively utilize all the benefits provided by the GPU. As a first step towards making GPU spatial query processing mainstream, we propose a new model that represents spatial data as geometric objects and define an algebra consisting of GPU-friendly composable operators that operate over these objects. We demonstrate the expressiveness of the proposed algebra and present a proof-of-concept prototype that supports a subset of the operators, which shows that it is orders of magnitude faster than a CPU-based implementation and outperforms custom GPU-based approaches.

【Keywords】: Information systems; Data management systems; Database design and models; Information systems applications; Spatial-temporal systems

124. Debunking Four Long-Standing Misconceptions of Time-Series Distance Measures.

Paper Link】 【Pages】:1887-1905

【Authors】: John Paparrizos ; Chunwei Liu ; Aaron J. Elmore ; Michael J. Franklin

【Abstract】: Distance measures are core building blocks in time-series analysis and the subject of active research for decades. Unfortunately, the most detailed experimental study in this area is outdated (over a decade old) and, naturally, does not reflect recent progress. Importantly, this study (i) omitted multiple distance measures, including a classic measure in the time-series literature; (ii) considered only a single time-series normalization method; and (iii) reported only raw classification error rates without statistically validating the findings, resulting in or fueling four misconceptions in the time-series literature. Motivated by the aforementioned drawbacks and our curiosity to shed some light on these misconceptions, we comprehensively evaluate 71 time-series distance measures. Specifically, our study includes (i) 8 normalization methods; (ii) 52 lock-step measures; (iii) 4 sliding measures; (iv) 7 elastic measures; (v) 4 kernel functions; and (vi) 4 embedding measures. We extensively evaluate these measures across 128 time-series datasets using rigorous statistical analysis. Our findings debunk four long-standing misconceptions that significantly alter the landscape of what is known about existing distance measures. With the new foundations in place, we discuss open challenges and promising directions.
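Two of the ingredients named in the study, a normalization method and a lock-step measure, are easy to sketch; the series below are synthetic, and the choice of z-normalization plus Euclidean distance is just one of the many combinations the paper evaluates.

```python
# Sketch of z-normalization followed by a lock-step distance measure.
import numpy as np

def z_normalize(x):
    """Shift and scale a series to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else x - x.mean()

def euclidean(x, y):
    """Lock-step measure: compares the i-th point of x with the i-th point of y."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

a = z_normalize([1, 2, 3, 4, 5])
b = z_normalize([10, 20, 30, 40, 50])   # same shape, different scale
print(euclidean(a, b))                  # ~0: normalization removes offset and scale
```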

【Keywords】: Computing methodologies; Machine learning; Learning paradigms; Supervised learning; Supervised learning by classification; Unsupervised learning; Dimensionality reduction and manifold learning; Machine learning approaches; Kernel methods; Information systems; Data management systems; Database design and models; Data model extensions; Data streams; Temporal data; Information systems applications; Data mining; Data stream mining; Nearest-neighbor search; Mathematics of computing; Probability and statistics; Statistical paradigms; Dimensionality reduction; Exploratory data analysis; Time series analysis

125. MIRIS: Fast Object Track Queries in Video.

Paper Link】 【Pages】:1907-1921

【Authors】: Favyen Bastani ; Songtao He ; Arjun Balasingam ; Karthik Gopalakrishnan ; Mohammad Alizadeh ; Hari Balakrishnan ; Michael J. Cafarella ; Tim Kraska ; Sam Madden

【Abstract】: Video databases that enable queries with object-track predicates are useful in many applications. Such queries include selecting objects that move from one region of the camera frame to another (e.g., finding cars that turn right through a junction) and selecting objects with certain speeds (e.g., finding animals that stop to drink water from a lake). Processing such predicates efficiently is challenging because they involve the movement of an object over several video frames. We propose a novel query-driven tracking approach that integrates query processing with object tracking to efficiently process object track queries and address the computational complexity of object detection methods. By processing video at low framerates when possible, but increasing the framerate when needed to ensure high-accuracy on a query, our approach substantially speeds up query execution. We have implemented query-driven tracking in MIRIS, a video query processor, and compare MIRIS against four baselines on a diverse dataset consisting of five sources of video and nine distinct queries. We find that, at the same accuracy, MIRIS accelerates video query processing by 9x on average over the IOU tracker, an overlap-based tracking-by-detection method used in existing video database systems.

【Keywords】: Computing methodologies; Artificial intelligence; Computer vision; Computer vision problems; Tracking; Computer vision tasks; Visual content-based indexing and retrieval; Information systems; Data management systems; Database management system engines; Database query processing; Query optimization

Award Talks 2

126. ACM SIGMOD Jim Gray Dissertation Award Winner Talk.

Paper Link】 【Pages】:1923

【Authors】: Jose M. Faleiro

【Abstract】: The increasing democratization of server hardware with multi-core CPUs and large main memories has been one of the dominant hardware trends of the last decade. "Bare metal" servers with tens of CPU cores and over 100 gigabytes of main memory have been available for several years now. Recently, this large scale hardware has also been available via the cloud. Database systems, with their roots in uniprocessors and paucity of main memory, have unsurprisingly been found wanting on modern hardware. In addition to changes in hardware, database systems have had to contend with changing application requirements and deployment environments. Database systems have long provided applications with an interactive interface, in which an application can communicate with the database over several round-trips in the course of a single request. A large class of applications, however, does not require interactive interfaces, and is unwilling to pay the performance cost associated with overly flexible interfaces. Some of these applications have eschewed database systems altogether in favor of high-performance key-value stores. Finally, modern applications are increasingly deployed at ever increasing scales, often serving hundreds of thousands to millions of simultaneous clients. These large scale deployments are more prone to errors due to consistency issues in their underlying database systems. Ever since their inception, database systems have allowed applications to trade off consistency for performance, and often nudge applications towards weak consistency. When deployed at scale, weak consistency exposes latent consistency-related bugs, in the same way that failures are more likely to occur at scale. Nearly every widely deployed database system provides applications with weak consistency by default, and its widespread use in practice significantly complicates application development, leading to latent Heisenbugs that are only exposed in production. This dissertation proposes and explores the use of deterministic execution to address these concerns. Database systems have traditionally been non-deterministic; given an input list of transactions, the final state of the database, which corresponds to some totally ordered execution of transactions, is dependent on non-deterministic factors such as thread scheduling decisions made by the operating system and failures. Deterministic execution, on the other hand, ensures that the database's final state is always determined by its input list of transactions; in other words, the input list of transactions is the same as the total order of transactions that determines the database's state. While non-deterministic database systems expend significant resources in determining valid total orders of transactions, we show that deterministic systems can exploit simple and low-cost up-front total ordering of transactions to execute and schedule transactions much more efficiently. We show that deterministic execution enables low-overhead, highly-parallel scheduling mechanisms that can address the performance limitations of existing database systems on modern hardware. Deterministic database systems are designed based on the assumption that applications can submit their transactions in one-shot prepared transactions, instead of multiple round-trips.
Finally, we attempt to understand the fundamental reason for the observed performance differences between various consistency levels in database systems, and based on this understanding, show that we can exploit deterministic execution to provide strong consistency at a cost that is competitive with that offered by weak consistency levels.

【Keywords】: Information systems; Data management systems; Database management system engines; Database transaction processing; DBMS engine architectures; Main memory engines

127. Effective Data Versioning for Collaborative Data Analytics.

Paper Link】 【Pages】:1925-1938

【Authors】: Silu Huang

【Abstract】: With the massive proliferation of datasets in a variety of sectors, data science teams in these sectors spend vast amounts of time collaboratively constructing, curating, and analyzing these datasets. Versions of datasets are routinely generated during this data science process, via various data processing operations like data transformation and cleaning, feature engineering and normalization, among others. However, no existing systems enable us to effectively store, track, and query these versioned datasets, leading to massive redundancy in versioned data storage and making true collaboration and sharing impossible. In my PhD thesis, we develop solutions for versioned data management for collaborative data analytics. In the first part of my dissertation, we extend a relational database to support versioning of structured data. Specifically, we build a system, OrpheusDB, on top of a relational database with a carefully designed data representation and an intelligent partitioning algorithm for fast version control operations. OrpheusDB inherits much of the same benefits of relational databases, while also compactly storing, keeping track of, and recreating versions on demand. However, OrpheusDB implicitly makes a few assumptions, namely that: (a) the SQL assumption: a SQL-like language is the best fit for querying data and versioning information; (b) the structural assumption: the data is in a relational format with a regular structure; (c) the from-scratch assumption: users adopt OrpheusDB from the very beginning of their project and register each data version along with full metadata in the system. In the second part of my dissertation, we remove each of these assumptions, one at a time. First, we remove the SQL assumption and propose a generalized query language for querying data along with versioning and provenance information. Second, we remove the structural assumption and develop solutions for compact storage and fast retrieval of arbitrary data representations [4]. Finally, we remove the "from-scratch" assumption, by developing techniques to infer lineage relationships among versions residing in an existing data repository.

【Keywords】: Information systems; Data management systems; Data structures; Data access methods; Data layout; Database design and models; Data model extensions; Database management system engines; Database query processing; Query languages

Research 22: Data Lakes, Web, and Knowledge Graph 5

Paper Link】 【Pages】:1939-1950

【Authors】: Fatemeh Nargesian ; Ken Q. Pu ; Erkang Zhu ; Bahar Ghadiri Bashardoost ; Renée J. Miller

【Abstract】: We consider the problem of creating an effective navigation structure over a data lake. We define an organization as a navigation graph that contains nodes representing sets of attributes within a data lake and edges indicating subset relationships among nodes. We propose the data lake organization problem as the problem of finding an organization that allows a user to most effectively navigate a data lake. We present a new probabilistic model of how users interact with an organization and propose an approximate algorithm for the data lake organization problem. We show the effectiveness of the algorithm on both a real data lake containing data from open data portals and on a benchmark that contains rich metadata emulating the observed characteristics of real data lakes. Through a formal user study, we show that navigation can help users find relevant tables that cannot be found by keyword search.

【Keywords】: Information systems; Data management systems; Information retrieval

Paper Link】 【Pages】:1951-1966

【Authors】: Yi Zhang ; Zachary G. Ives

【Abstract】: Many modern data science applications build on data lakes, schema-agnostic repositories of data files and data products that offer limited organization and management capabilities. There is a need to build data lake search capabilities into data science environments, so scientists and analysts can find tables, schemas, workflows, and datasets useful to their task at hand. We develop search and management solutions for the Jupyter Notebook data science platform, to enable scientists to augment training data, find potential features to extract, clean data, and find joinable or linkable tables. Our core methods also generalize to other settings where computational tasks involve execution of programs or scripts.

【Keywords】: Information systems; Data management systems; Information integration; Federated databases

130. Web Data Extraction using Hybrid Program Synthesis: A Combination of Top-down and Bottom-up Inference.

Paper Link】 【Pages】:1967-1978

【Authors】: Mohammad Raza ; Sumit Gulwani

【Abstract】: Automatic synthesis of web data extraction programs has been explored in a variety of settings, but in practice there remain various robustness and usability challenges. In this work we present a novel program synthesis approach which combines the benefits of deductive and enumerative synthesis strategies, yielding a semi-supervised technique with which concise programs expressible in standard languages can be synthesized from very few examples. We demonstrate improvement over existing techniques in terms of overall accuracy, number of examples required, and program complexity. Our method has been deployed as a web extraction feature in the mass market Microsoft Power BI product.

【Keywords】: Information systems; World Wide Web; Web data description languages; Markup languages; Web mining; Software and its engineering; Software creation and management; Software development techniques; Automatic programming

131. SPARQL Rewriting: Towards Desired Results.

Paper Link】 【Pages】:1979-1993

【Authors】: Xun Jian ; Yue Wang ; Xiayu Lei ; Libin Zheng ; Lei Chen

【Abstract】: Recent years witnessed the emergence of various applications on knowledge graphs, which are often represented as RDF graphs. However, due to the lack of data schema and the complexity of the SPARQL language, there is usually a gap between the user's real desire and the actual meaning of a SPARQL query, especially when the query itself is complicated. In this paper, we try to narrow this gap by modifying a given query with a set of modifiers, so that its result approaches a user-provided example set. Specifically, we model this problem as two individual sub-problems, query-restricting and query-relaxing, both of which are shown to be NP-hard. We further prove that unless P=NP, query-restricting has no polynomial-time approximation scheme (PTAS), and query-relaxing has no polynomial-time constant-factor approximation algorithm. Despite their hardness, we propose a (1-1/e)-approximation method for query-restricting and two heuristics for query-relaxing. Extensive experiments have been conducted on real-world knowledge graphs to evaluate the effectiveness and efficiency of our proposed solutions.

【Keywords】: Information systems; Information retrieval; Information retrieval query processing; Query reformulation; Query suggestion

132. Realistic Re-evaluation of Knowledge Graph Completion Methods: An Experimental Study.

Paper Link】 【Pages】:1995-2010

【Authors】: Farahnaz Akrami ; Mohammed Samiul Saeef ; Qingheng Zhang ; Wei Hu ; Chengkai Li

【Abstract】: In the active research area of employing embedding models for knowledge graph completion, particularly for the task of link prediction, most prior studies used two benchmark datasets FB15k and WN18 in evaluating such models. Most triples in these and other datasets in such studies belong to reverse and duplicate relations which exhibit high data redundancy due to semantic duplication, correlation or data incompleteness. This is a case of excessive data leakage---a model is trained using features that otherwise would not be available when the model needs to be applied for real prediction. There are also Cartesian product relations for which every triple formed by the Cartesian product of applicable subjects and objects is a true fact. Link prediction on the aforementioned relations is easy and can be achieved with even better accuracy using straightforward rules instead of sophisticated embedding models. A more fundamental defect of these models is that the link prediction scenario, given such data, is non-existent in the real world. This paper is the first systematic study with the main objective of assessing the true effectiveness of embedding models when the unrealistic triples are removed. Our experiment results show these models are much less accurate than what we used to perceive. Their poor accuracy renders link prediction a task without a truly effective automated solution. Hence, we call for re-investigation of possible effective approaches.

【Keywords】: Computing methodologies; Artificial intelligence; Knowledge representation and reasoning; Machine learning; Machine learning approaches; Neural networks; Information systems; Data management systems; Database design and models; Graph-based database models

Research 23: OLAP, Data Warehouses, and Key-Value Stores 5

133. Bitvector-aware Query Optimization for Decision Support Queries.

Paper Link】 【Pages】:2011-2026

【Authors】: Bailu Ding ; Surajit Chaudhuri ; Vivek R. Narasayya

【Abstract】: Bitvector filtering is an important query processing technique that can significantly reduce the cost of execution, especially for complex decision support queries with multiple joins. Despite its wide application, however, its implication to query optimization is not well understood. In this work, we study how bitvector filters impact query optimization. We show that incorporating bitvector filters into query optimization straightforwardly can increase the plan space complexity by an exponential factor in the number of relations in the query. We analyze the plans with bitvector filters for star and snowflake queries in the plan space of right deep trees without cross products. Surprisingly, with some simplifying assumptions, we prove that, the plan of the minimal cost with bitvector filters can be found from a linear number of plans in the number of relations in the query. This greatly reduces the plan space complexity for such queries from exponential to linear. Motivated by our analysis, we propose an algorithm that accounts for the impact of bitvector filters in query optimization. Our algorithm optimizes the join order for an arbitrary decision support query by choosing from a linear number of candidate plans in the number of relations in the query. We implement our algorithm in a commercial database DBMS-X as a transformation rule. Our evaluation on both industry standard benchmarks and customer workload shows that, compared with DBMS-X, our technique reduces the total CPU execution time by 22%-64% for the workloads, with up to two orders of magnitude reduction in CPU execution time for individual queries.
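The basic mechanism being optimized for can be sketched as a bitmap built from the build side's join keys and pushed to the probe side, so most non-matching rows are discarded before the hash table probe; the bitmap size, hash function, and tables below are illustrative assumptions, and the sketch says nothing about the paper's join-order search.

```python
# Sketch of bitvector filtering on a hash join: a small bitmap derived from the
# dimension table's join keys filters the fact table before the probe.
BITS = 1 << 12

def make_bitvector(keys):
    bv = 0
    for k in keys:
        bv |= 1 << (hash(k) % BITS)
    return bv

def maybe_contains(bv, key):
    return bool(bv >> (hash(key) % BITS) & 1)   # may false-positive, never false-negative

build_rows = [(k, f"dim-{k}") for k in range(0, 1000, 10)]   # dimension table
probe_rows = [(k, f"fact-{k}") for k in range(1000)]          # fact table

bv = make_bitvector(k for k, _ in build_rows)
ht = {k: v for k, v in build_rows}

survivors = [r for r in probe_rows if maybe_contains(bv, r[0])]   # cheap pre-filter
result = [(r, ht[r[0]]) for r in survivors if r[0] in ht]          # actual probe
print(len(probe_rows), "probe rows ->", len(survivors), "after filter,",
      len(result), "join results")
```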

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Query planning

134. Efficient Join Synopsis Maintenance for Data Warehouse.

Paper Link】 【Pages】:2027-2042

【Authors】: Zhuoyue Zhao ; Feifei Li ; Yuxi Liu

【Abstract】: Various sources such as daily business operations and sensors from different IoT applications constantly generate a lot of data. These data are often loaded into a data warehouse system to perform complex analysis. Such analysis, however, can be extremely costly if the query involves joins, especially many-to-many joins over multiple large tables. A join synopsis, i.e., a small uniform random sample over the join result, often suffices as a representative alternative to the full join result for many applications such as histogram construction, model training, etc. Towards that end, we propose a novel algorithm SJoin that can maintain a join synopsis over a pre-specified general θ-join query in a dynamic database with continuous inflows of updates. Central to SJoin is maintaining a weighted join graph index, which assists in efficiently replacing join results in the synopsis upon update. We conduct extensive experiments using TPC-DS and simulated road sensor data over several complex join queries, and they demonstrate the clear advantage of SJoin over the best available baseline.
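For the insert-only, equi-join special case, a uniform join synopsis can be maintained with plain reservoir sampling over the newly produced join results, as sketched below; handling deletions and general θ-joins efficiently is the harder problem SJoin addresses, and the tables and synopsis size here are made up.

```python
# Sketch: keep a uniform sample of an equi-join's results while tuples stream in.
import random
from collections import defaultdict

K = 4                       # synopsis size
reservoir, seen = [], 0
R_index, S_index = defaultdict(list), defaultdict(list)

def offer(result):
    """Standard reservoir-sampling step over the stream of join results."""
    global seen
    seen += 1
    if len(reservoir) < K:
        reservoir.append(result)
    else:
        j = random.randrange(seen)
        if j < K:
            reservoir[j] = result

def insert_R(key, payload):
    R_index[key].append(payload)
    for s in S_index[key]:          # join results newly produced by this insert
        offer((payload, s))

def insert_S(key, payload):
    S_index[key].append(payload)
    for r in R_index[key]:
        offer((r, payload))

for i in range(100):
    insert_R(i % 10, f"r{i}")
    insert_S(i % 10, f"s{i}")
print(reservoir)   # a uniform sample over all join results produced so far
```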

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Join algorithms; Information systems applications; Decision support systems; Data warehouses; Theory of computation; Design and analysis of algorithms; Streaming, sublinear and near linear time algorithms; Sketching and sampling

135. Adaptive HTAP through Elastic Resource Scheduling.

Paper Link】 【Pages】:2043-2054

【Authors】: Aunn Raza ; Periklis Chrysogelos ; Angelos-Christos G. Anadiotis ; Anastasia Ailamaki

【Abstract】: Modern Hybrid Transactional/Analytical Processing (HTAP) systems use an integrated data processing engine that performs analytics on fresh data, which are ingested from a transactional engine. HTAP systems typically consider data freshness at design time, and are optimized for a fixed range of freshness requirements, addressed at a performance cost for either OLTP or OLAP. The data freshness and the performance requirements of both engines, however, may vary with the workload. We approach HTAP as a scheduling problem, addressed at runtime through elastic resource management. We model an HTAP system as a set of three individual engines: an OLTP, an OLAP and a Resource and Data Exchange (RDE) engine. We devise a scheduling algorithm which traverses the HTAP design spectrum through elastic resource management, to meet the workload data freshness requirements. We propose an in-memory system design which is non-intrusive to the current state-of-art OLTP and OLAP engines, and we use it to evaluate the performance of our approach. Our evaluation shows that the performance benefit of our system for OLAP queries increases over time, reaching up to 50% compared to static schedules for 100 query sequences, while maintaining a small, and controlled, drop in the OLTP throughput.

【Keywords】: Information systems; Data management systems; Database management system engines; Database transaction processing; DBMS engine architectures; Main memory engines; Online analytical processing engines

136. SPRINTER: A Fast n-ary Join Query Processing Method for Complex OLAP Queries.

Paper Link】 【Pages】:2055-2070

【Authors】: Yoon-Min Nam ; Donghyoung Han ; Min-Soo Kim

【Abstract】: The concept of OLAP query processing is now being widely adopted in various applications. The number of complex queries containing the joins between non-unique keys (called FK-FK joins) increases in those applications. However, the existing in-memory OLAP systems tend not to handle such complex queries efficiently since they generate a large amount of intermediate results or incur a huge amount of probe cost. In this paper, we propose an effective query planning method for complex OLAP queries. It generates a query plan containing n-ary join operators based on a cost model. The plan does not generate intermediate results for processing FK-FK joins and significantly reduces the probe cost. We also propose an efficient processing method for n-ary join operators. We implement the prototype system SPRINTER by integrating our proposed methods into an open-source in-memory OLAP system. Through experiments using the TPC-DS benchmark, we have shown that SPRINTER outperforms the state-of-the-art OLAP systems for complex queries.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Query planning

137. Rosetta: A Robust Space-Time Optimized Range Filter for Key-Value Stores.

Paper Link】 【Pages】:2071-2086

【Authors】: Siqiang Luo ; Subarna Chatterjee ; Rafael Ketsetsidis ; Niv Dayan ; Wilson Qin ; Stratos Idreos

【Abstract】: We introduce Rosetta, a probabilistic range filter designed specifically for LSM-tree based key-value stores. The core intuition is that we can sacrifice filter probe time because it is not visible in end-to-end key-value store performance, which in turn allows us to significantly reduce the filter false positive rate for every level of the tree. Rosetta indexes all binary prefixes of a key using a hierarchically arranged set of Bloom filters. It then converts each range query into multiple probes, one for each non-overlapping binary prefix. Rosetta has the ability to track workload patterns and adopt a beneficial tuning for each individual LSM-tree run by adjusting the number of Bloom filters it uses and how memory is spread among them to optimize the FPR/CPU cost balance. We show how to integrate Rosetta in a full system, RocksDB, and we demonstrate that it brings as much as a 40x improvement compared to default RocksDB and 2-5x improvement compared to state-of-the-art range filters in a variety of workloads and across different levels of the memory hierarchy (memory, SSD, hard disk). We also show that, unlike state-of-the-art filters, Rosetta brings a net benefit in RocksDB's overall performance, i.e., it improves range queries without losing any performance for point queries.
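The prefix-filter idea can be sketched in a few lines: store every binary prefix of each inserted key in a per-length filter, and answer a range query by probing one filter per dyadic (binary-prefix) sub-range of the query interval. Python sets stand in for Bloom filters here, so this sketch has no false positives, and Rosetta's tuning of memory across filter levels is not modeled.

```python
# Sketch of Rosetta's core mechanism with sets standing in for Bloom filters.
WIDTH = 16   # key width in bits (illustrative)
filters = {length: set() for length in range(1, WIDTH + 1)}

def insert_key(key):
    for length in range(1, WIDTH + 1):
        filters[length].add(key >> (WIDTH - length))    # every binary prefix of key

def dyadic_prefixes(lo, hi, width=WIDTH):
    """Cover [lo, hi] with non-overlapping binary prefixes (dyadic ranges)."""
    out = []
    while lo <= hi:
        b = 0
        while b < width and lo % (1 << (b + 1)) == 0 and lo + (1 << (b + 1)) - 1 <= hi:
            b += 1
        out.append((lo >> b, width - b))    # (prefix value, prefix length)
        lo += 1 << b
    return out

def range_may_contain(lo, hi):
    return any(prefix in filters[length] for prefix, length in dyadic_prefixes(lo, hi))

for k in (17, 123, 900):
    insert_key(k)
print(range_may_contain(100, 130))   # True  (123 falls inside the range)
print(range_may_contain(200, 300))   # False (empty range -> the I/O can be skipped)
```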

【Keywords】: Information systems; Data management systems; Data structures; Database management system engines; Parallel and distributed DBMSs; Key-value stores

Research 24: Spatial, Temporal, and Multimedia Data II 4

138. RID: Deduplicating Snapshot Computations.

Paper Link】 【Pages】:2087-2101

【Authors】: Nikos Tsikoudis ; Liuba Shrira

【Abstract】: One can audit SQL applications by running SQL programs over sequences of persistent snapshots, but care is needed to avoid wasteful duplicate computation. This paper describes the design, implementation, and performance of RID, the first language-independent optimization framework that eliminates duplicate computations in SQL programs running over low-level snapshots by exploiting snapshot metadata efficiently.

【Keywords】: Information systems; Data management systems; Database design and models; Data model extensions; Temporal data; Database management system engines; Database query processing; Query optimization

139. Architecting a Query Compiler for Spatial Workloads.

Paper Link】 【Pages】:2103-2118

【Authors】: Ruby Y. Tahboub ; Tiark Rompf

【Abstract】: Modern location-based applications rely extensively on the efficient processing of spatial data and queries. Spatial query engines are commonly engineered as an extension to a relational database or a cluster-computing framework. A large part of the spatial processing runtime is spent on evaluating spatial predicates and traversing spatial indexing structures. Typical high-level implementations of these spatial structures incur significant interpretive overhead, which increases latency and lowers throughput. A promising idea to improve the performance of spatial workloads is to leverage native code generation techniques that have become popular in relational query engines. However, architecting a spatial query compiler is challenging since spatial processing has fundamentally different execution characteristics from relational workloads in terms of data dimensionality, indexing structures, and predicate evaluation. In this paper, we discuss the underlying reasons why standard query compilation techniques are not fully effective when applied to spatial workloads, and we demonstrate how a particular style of query compilation based on techniques borrowed from partial evaluation and generative programming manages to avoid most of these difficulties by extending the scope of custom code generation into the data structures layer. We extend the LB2 main-memory query compiler, a relational engine developed in this style, with spatial data types, predicates, indexing structures, and operators. We show that the spatial extension matches the performance of specialized library code and outperforms relational and map-reduce extensions.

【Keywords】: Information systems; Data management systems; Database management system engines

140. LISA: A Learned Index Structure for Spatial Data.

Paper Link】 【Pages】:2119-2133

【Authors】: Pengfei Li ; Hua Lu ; Qian Zheng ; Long Yang ; Gang Pan

【Abstract】: In spatial query processing, the popular index R-tree may incur large storage consumption and high IO cost. Inspired by the recent learned index [17] that replaces the B-tree with machine learning models, we study an analogous problem for spatial data. We propose a novel Learned Index structure for Spatial dAta (LISA for short). Its core idea is to use machine learning models, through several steps, to generate a searchable data layout in disk pages for an arbitrary spatial dataset. In particular, LISA consists of a mapping function that maps spatial keys (points) into 1-dimensional mapped values, a learned shard prediction function that partitions the mapped space into shards, and a series of local models that organize shards into pages. Based on LISA, a range query algorithm is designed, followed by a lattice regression model that enables us to convert a KNN query to range queries. Algorithms are also designed for LISA to handle data updates. Extensive experiments demonstrate that LISA clearly outperforms the R-tree and other alternatives in terms of storage consumption and IO cost for queries. Moreover, LISA can handle data insertions and deletions efficiently.
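The staged layout described in the abstract can be mimicked on toy 2-D points: a mapping function turns each point into one number, the mapped domain is cut into shards, and each shard holds a group of points. In the sketch below, a simple grid-based mapping and equal-frequency quantiles stand in for LISA's learned mapping and shard-prediction functions; none of the constants come from the paper.

```python
# Toy pipeline: map 2-D points to 1-D values, partition into shards, look up a point.
import bisect
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0, 100, size=(10_000, 2))

GRID = 64
def mapped_value(p):
    """Map a 2-D point to one number: grid cell id first, offset within the cell second."""
    cx, cy = int(p[0] * GRID / 100), int(p[1] * GRID / 100)
    cell = cx * GRID + cy
    return cell + (p[0] % (100 / GRID)) / (100 / GRID)

mvals = np.array([mapped_value(p) for p in points])

NUM_SHARDS = 32
boundaries = np.quantile(mvals, np.linspace(0, 1, NUM_SHARDS + 1)[1:-1])
shards = [[] for _ in range(NUM_SHARDS)]
for p, m in zip(points, mvals):
    shards[bisect.bisect_right(boundaries, m)].append(p)

def point_shard(p):
    return bisect.bisect_right(boundaries, mapped_value(p))

q = points[42]
assert any(np.array_equal(q, s) for s in shards[point_shard(q)])
print([len(s) for s in shards[:5]])   # roughly balanced shard sizes
```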

【Keywords】: Computing methodologies; Machine learning; Learning paradigms; Supervised learning; Supervised learning by regression; Theory of computation; Theory and algorithms for application domains; Database theory; Data structures and algorithms for data management

141. Effective Travel Time Estimation: When Historical Trajectories over Road Networks Matter.

Paper Link】 【Pages】:2135-2149

【Authors】: Haitao Yuan ; Guoliang Li ; Zhifeng Bao ; Ling Feng

【Abstract】: In this paper, we study the problem of origin-destination (OD) travel time estimation where the OD input consists of an OD pair and a departure time. We propose a novel neural-network-based prediction model that fully exploits an important fact neglected by the literature: the travel time of a past OD trip is usually affiliated with the trajectory it traveled along, whereas no such trajectory is available at prediction time. In the training phase, our goal is to design novel representations for the OD input and its affiliated trajectory, such that they are close to each other in the latent space. First, we match the OD pairs and their affiliated (historical) trajectories to road networks, and utilize road segment embeddings to represent their spatial properties. Later, we match the timestamps associated with trajectories to time slots and utilize time slot embeddings to represent the temporal properties. Next, we build a temporal graph to capture the weekly and daily periodicity of time slot embeddings. Last, we design an effective encoding to represent the spatial and temporal properties of trajectories. To bind each OD input to its affiliated trajectory, we also encode the OD input into a hidden representation, and make the hidden representation close to the spatio-temporal representation of the trajectory. In the prediction phase, we only use the OD input, get its hidden representation, and use it to generate the travel time. Extensive experiments on real datasets show that our method achieves high effectiveness and outperforms existing methods.

【Keywords】: Information systems; Information systems applications; Spatial-temporal systems

Research 25: Social Network Analysis 5

142. The Solution Distribution of Influence Maximization: A High-level Experimental Study on Three Algorithmic Approaches.

Paper Link】 【Pages】:2151-2166

【Authors】: Naoto Ohsaka

【Abstract】: Influence maximization is among the most fundamental algorithmic problems in social influence analysis. Over the last decade, a great effort has been devoted to developing efficient algorithms for influence maximization, to the point that identifying the "best" algorithm has become a demanding task. In SIGMOD'17, Arora, Galhotra, and Ranu reported benchmark results on eleven existing algorithms and demonstrated that there is no single state-of-the-art offering the best trade-off between computational efficiency and solution quality. In this paper, we report a high-level experimental study on three well-established algorithmic approaches for influence maximization, referred to as Oneshot, Snapshot, and Reverse Influence Sampling (RIS). Unlike Arora et al., our experimental methodology is designed to examine the distribution of random solutions, to characterize the relation between the sample number and the actual solution quality, and to avoid implementation dependencies. Our main findings are as follows: 1. For a sufficiently large sample number, we obtain a unique solution regardless of the algorithm. 2. The average solution quality of Oneshot, Snapshot, and RIS improves at the same rate, up to a scaling of the sample number. 3. Oneshot requires more samples than Snapshot, and Snapshot requires fewer but larger samples than RIS. We discuss the time efficiency when conditioning Oneshot, Snapshot, and RIS to be of identical accuracy. Our conclusion is that Oneshot is suitable only if the size of available memory is limited, and RIS is more efficient than Snapshot for large networks; Snapshot is preferable for small, low-probability networks.

【Keywords】: Information systems; Data management systems; Database administration; Database performance evaluation; World Wide Web; Web applications; Social networks; Theory of computation; Design and analysis of algorithms; Graph algorithms analysis

143. Influence Maximization Revisited: Efficient Reverse Reachable Set Generation with Bound Tightened.

Paper Link】 【Pages】:2167-2181

【Authors】: Qintian Guo ; Sibo Wang ; Zhewei Wei ; Ming Chen

【Abstract】: Given a social network G with n nodes and m edges, a positive integer k, and a cascade model C, the influence maximization (IM) problem asks for k nodes in G such that the expected number of nodes influenced by the k nodes under cascade model C is maximized. The state-of-the-art approximate solutions run in O(k(n+m)log(n)/ε^2) expected time while returning a (1-1/e-ε)-approximate solution with at least 1-1/n probability. A key phase of these IM algorithms is the random reverse reachable (RR) set generation, and this phase significantly affects the efficiency and scalability of the state-of-the-art IM algorithms. In this paper, we present a study on this key phase and propose an efficient random RR set generation algorithm under the IC model. With the new algorithm, we show that the expected running time of existing IM algorithms under the IC model can be improved to O(k·n·log(n)/ε^2), when for any node v, the total weight of its incoming edges is no larger than a constant. Moreover, existing approximate IM algorithms suffer from scalability issues in high influence networks where the size of random RR sets is usually quite large. We tackle this challenging issue by reducing the average size of random RR sets without sacrificing the approximation guarantee. The proposed solution is orders of magnitude faster than the state of the art, as shown in our experiments.
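
For readers unfamiliar with RR sets, the following Python sketch shows the standard construction of one random RR set under the IC model (the baseline the paper speeds up, not the paper's optimized algorithm); the toy graph and probabilities are made up.

    import random
    from collections import deque

    def random_rr_set(n, in_edges):
        # in_edges[v] lists (u, p) pairs: edge u -> v with propagation probability p
        root = random.randrange(n)
        rr, queue = {root}, deque([root])
        while queue:
            v = queue.popleft()
            for u, p in in_edges.get(v, []):
                if u not in rr and random.random() < p:
                    rr.add(u)          # u can influence the root through live edges
                    queue.append(u)
        return rr

    # toy graph: 0 -> 2, 1 -> 2, 2 -> 3, each edge live with probability 0.5
    in_edges = {2: [(0, 0.5), (1, 0.5)], 3: [(2, 0.5)]}
    print(random_rr_set(4, in_edges))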

【Keywords】: Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms

144. Truss-based Community Search over Large Directed Graphs.

Paper Link】 【Pages】:2183-2197

【Authors】: Qing Liu ; Minjun Zhao ; Xin Huang ; Jianliang Xu ; Yunjun Gao

【Abstract】: Community search enables personalized community discovery and has wide applications in large real-world graphs. While community search has been extensively studied for undirected graphs, the problem for directed graphs has received attention only recently. However, existing studies suffer from several drawbacks, e.g., the vertices with varied in-degrees and out-degrees cannot be included in a community at the same time. To address the limitations, in this paper, we systematically study the problem of community search over large directed graphs. We start by presenting a novel community model, called D-truss, based on two distinct types of directed triangles, i.e., flow triangle and cycle triangle. The D-truss model brings nice structural and computational properties and has many advantages in comparison with the existing models. With this new model, we then formulate the D-truss community search problem, which is proved to be NP-hard. In view of its hardness, we propose two efficient 2-approximation algorithms, named Global and Local, that run in polynomial time yet with quality guarantee. To further improve the efficiency of the algorithms, we devise an indexing method based on D-truss decomposition. Consequently, the D-truss community search can be solved upon the D-truss index without time-consuming accesses to the original graph. Experimental studies on real-world graphs with ground-truth communities validate the quality of the solutions we obtain and the efficiency of the proposed algorithms.
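
The distinction between the two triangle types can be made concrete with a small Python sketch (an assumed simplification, not the paper's definitions or code): a directed triangle is a "cycle" triangle if its edges can be ordered into a directed 3-cycle, and a "flow" triangle otherwise.

    from itertools import permutations

    def triangle_type(edges, a, b, c):
        # edges is a set of directed pairs; every unordered pair must be connected
        def connected(u, v):
            return (u, v) in edges or (v, u) in edges
        if not (connected(a, b) and connected(b, c) and connected(a, c)):
            return None   # not a triangle in the underlying undirected graph
        for x, y, z in permutations((a, b, c)):
            if (x, y) in edges and (y, z) in edges and (z, x) in edges:
                return "cycle"
        return "flow"

    edges = {(1, 2), (2, 3), (3, 1), (4, 5), (4, 6), (5, 6)}
    print(triangle_type(edges, 1, 2, 3))   # cycle
    print(triangle_type(edges, 4, 5, 6))   # flow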

【Keywords】: Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms; Theory of computation; Design and analysis of algorithms; Graph algorithms analysis

145. Densely Connected User Community and Location Cluster Search in Location-Based Social Networks.

Paper Link】 【Pages】:2199-2209

【Authors】: Junghoon Kim ; Tao Guo ; Kaiyu Feng ; Gao Cong ; Arijit Khan ; Farhana Murtaza Choudhury

【Abstract】: Searching for a community based on query nodes in a graph is a fundamental problem and has been extensively investigated. Most of the existing approaches focus on finding a community in a social network, and very few studies consider location-based social networks where users can check in locations. In this paper we propose the GeoSocial Community Search problem (GCS) which aims to find a social community and a cluster of spatial locations that are densely connected in a location-based social network simultaneously. The GCS can be useful for marketing and user/location recommendation. To the best of our knowledge, this is the first work to find a social community and a cluster of spatial locations that are densely connected from location-based social networks. We prove that the problem is NP-hard, and is not in APX, unless P = NP. To solve this problem, we propose three algorithms: core-based basic algorithm, top-down greedy removing algorithm, and an expansion algorithm. Finally, we report extensive experimental studies that offer insights into the efficiency and effectiveness of the proposed solutions.

【Keywords】: Information systems; Information systems applications; Data mining; Clustering

146. Global Reinforcement of Social Networks: The Anchored Coreness Problem.

Paper Link】 【Pages】:2211-2226

【Authors】: Qingyuan Linghu ; Fan Zhang ; Xuemin Lin ; Wenjie Zhang ; Ying Zhang

【Abstract】: The stability of a social network has been widely studied as an important indicator for both the network holders and the participants. Existing works on reinforcing networks focus on a local view, e.g., the anchored k-core problem aims to enlarge the size of the k-core with a fixed input k. Nevertheless, it is more promising to reinforce a social network in a global manner: considering the engagement of every user (vertex) in the network. Since the coreness of a user has been validated as the "best practice" for capturing user engagement, we propose and study the anchored coreness problem in this paper: anchoring a small number of vertices to maximize the coreness gain (the total increment of coreness) of all the vertices in the network. We prove the problem is NP-hard and show it is more challenging than the existing local-view problems. An efficient heuristic algorithm is proposed with novel techniques on pruning search space and reusing the intermediate results. Extensive experiments on real-life data demonstrate that our model is effective for reinforcing social networks and our algorithm is efficient.
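
For context, coreness is the quantity the problem tries to raise: the following Python sketch computes it with the standard peeling procedure (a generic textbook algorithm, not the paper's anchoring heuristic; the toy graph is made up).

    import heapq

    def coreness(adj):
        # adj: dict vertex -> set of neighbours (simple undirected graph)
        deg = {v: len(ns) for v, ns in adj.items()}
        heap = [(d, v) for v, d in deg.items()]
        heapq.heapify(heap)
        core, removed, k = {}, set(), 0
        while heap:
            d, v = heapq.heappop(heap)
            if v in removed or d != deg[v]:
                continue                  # stale heap entry
            k = max(k, d)                 # coreness is non-decreasing during peeling
            core[v] = k
            removed.add(v)
            for u in adj[v]:
                if u not in removed:
                    deg[u] -= 1
                    heapq.heappush(heap, (deg[u], u))
        return core

    # path 1-2-3 attached to triangle 3-4-5
    adj = {1: {2}, 2: {1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
    print(coreness(adj))   # {1: 1, 2: 1, 3: 2, 4: 2, 5: 2}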

【Keywords】: Human-centered computing; Collaborative and social computing; Collaborative and social computing theory, concepts and paradigms; Social recommendation; Information systems; Data management systems; Database design and models; Graph-based database models; Network data models; Information systems applications; Data mining; World Wide Web; Web applications; Social networks; Web mining; Mathematics of computing; Discrete mathematics; Graph theory; Approximation algorithms; Graph algorithms; Trees; Networks; Network types; Overlay and other logical network structures; Social media networks; Theory of computation; Design and analysis of algorithms; Graph algorithms analysis

Industry 4: Advanced Functionality 5

147. Confidentiality Support over Financial Grade Consortium Blockchain.

Paper Link】 【Pages】:2227-2240

【Authors】: Ying Yan ; Changzheng Wei ; Xuepeng Guo ; Xuming Lu ; Xiaofu Zheng ; Qi Liu ; Chenhui Zhou ; Xuyang Song ; Boran Zhao ; Hui Zhang ; Guofei Jiang

【Abstract】: Confidentiality is an indispensable requirement in financial applications of blockchain technology, and supporting it along with high performance and friendly programmability is technically challenging. In this paper, we present a system design called CONFIDE to support on-chain confidentiality by leveraging a Trusted Execution Environment (TEE). CONFIDE's secure data transmission protocol and data encryption protocol, together with a highly efficient virtual machine running in the TEE, guarantee confidentiality throughout the life cycle of a transaction, end to end. CONFIDE proposes a secure data model along with an application-driven secure protocol to guarantee data confidentiality and integrity. Its smart contract language extension offers users the flexibility to define complex confidentiality models. CONFIDE is implemented as a plugin module to Antfin Blockchain's proprietary platform, and can be plugged into other blockchain platforms as well with its universal interface design. Today, CONFIDE supports millions of commercial transactions daily on a consortium blockchain running financial applications including supply chain finance, ABS, commodity provenance, and cold-chain logistics.

【Keywords】: Information systems; Data management systems; Database management system engines; Distributed database transactions; Security and privacy; Security services; Access control

148. PASE: PostgreSQL Ultra-High-Dimensional Approximate Nearest Neighbor Search Extension.

Paper Link】 【Pages】:2241-2253

【Authors】: Wen Yang ; Tao Li ; Gai Fang ; Hong Wei

【Abstract】: Similarity search has been widely used in various fields, particularly in the Alibaba ecosystem. The open-source solutions to similarity search over vectors can only support queries with a single vector, whereas real-life scenarios generally require the processing of compound queries. Moreover, existing open-source implementations only provide runtime libraries, which have difficulty meeting the requirements of industrial applications. To address these issues, we designed a novel scheme for extending the index types of PostgreSQL (PG), which enables similarity search over vectors while retaining the high performance and strong reliability of PG. Two representative types of nearest neighbor search (NNS) algorithms are presented herein. These algorithms achieve high performance and afford advantages such as support for composite queries and seamless integration with existing business data. Other NNS algorithms can be easily implemented under the proposed framework. Experiments were conducted on large datasets to illustrate the efficiency of the proposed retrieval mechanism.

【Keywords】:

149. Making Search Engines Faster by Lowering the Cost of Querying Business Rules Through FPGAs.

Paper Link】 【Pages】:2255-2270

【Authors】: Fabio Maschi ; Muhsen Owaida ; Gustavo Alonso ; Matteo Casalino ; Anthony Hock-Koon

【Abstract】: Business Rule Management Systems (BRMSs) are widely used in industry for a variety of tasks. Their main advantage is to codify in a succinct and queryable manner vast amounts of constantly evolving logic. In BRMSs, rules are typically captured as facts (tuples) over a collection of criteria, and checking them involves querying the collection of rules to find the best match. In this paper, we focus on a real-world use case from the airline industry: determining the minimum connection time (MCT) between flights. The MCT module is part of the flight search engine, and captures the ever changing constraints at each airport that determine the time to allocate between an arriving and a departing flight for a connection to be feasible. We explore how to use hardware acceleration to (i) improve the performance of the MCT module (lower latency, higher throughput); and (ii) reduce the amount of computing resources needed. A key aspect of the solution is the transformation of a collection of rules into a Non-deterministic Finite state Automaton efficiently implemented on FPGA. Experiments performed on-premises and in the cloud show several orders of magnitude improvement over the existing solution, and the potential to reduce by 40% the number of machines needed for the flight search engine.
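
A minimal Python sketch conveys the rule-matching idea (the criteria, rules, and values are invented; the actual system compiles the rules into an NFA on an FPGA): rules are facts over an ordered list of criteria with wildcards, and matching a query narrows the set of viable rules one criterion at a time, which is exactly the state a non-deterministic automaton would track.

    RULES = [
        # (airport, arrival_terminal, departure_terminal) -> minimum connection time
        (("LHR", "T5", "T5"), 60),
        (("LHR", "T5", "*"), 90),
        (("LHR", "*", "*"), 120),
        (("*", "*", "*"), 45),
    ]

    def best_match(query):
        active = RULES
        for pos, value in enumerate(query):
            active = [r for r in active if r[0][pos] in (value, "*")]
        # among surviving rules, prefer the most specific one (fewest wildcards)
        return min(active, key=lambda r: r[0].count("*"))[1] if active else None

    print(best_match(("LHR", "T5", "T3")))   # 90
    print(best_match(("CDG", "2E", "2F")))   # 45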

【Keywords】: Applied computing; Enterprise computing; Business rules; Hardware; Integrated circuits; Reconfigurable logic and FPGAs; Hardware accelerators; Information systems; Information retrieval; Search engine architectures and scalability

150. Spur: Mitigating Slow Instances in Large-Scale Streaming Pipelines.

Paper Link】 【Pages】:2271-2285

【Authors】: Ke Wang ; Avrilia Floratou ; Ashvin Agrawal ; Daniel Musgrave

【Abstract】: Bing's monetization pipeline is one of the largest and most critical streaming workloads deployed in Microsoft's internal data lake. The pipeline runs 24/7 at a scale of 3500 YARN containers and is required to meet a Service Level Objective (SLO) of low tail latency. In this paper, we highlight some of the unique challenges imposed by this large scale of operation: other concurrent workloads sharing the cluster may cause random performance deterioration; unavailability of external dependencies may cause temporary stalls in the pipeline; resource scarcity in the underlying resource manager may cause arbitrarily long delays or rejection of container allocation requests. Weathering these challenges requires specially tailored dynamic control policies that react to these issues as and when they arise. We focus on the problem of reducing the latency in the tail, i.e., the 99th percentile (p99), by detecting and mitigating slow instances through speculative replication. We show that widely used approaches do not satisfactorily solve this issue at our scale. A conservative approach is hesitant to acquire additional resources, reacts too slowly to changes in the environment, and therefore achieves little improvement in p99 latency. On the other hand, an aggressive approach overwhelms the underlying resource manager with unnecessary resource requests and paradoxically worsens the p99 latency. Our proposed approach, Spur, is designed for this challenging environment. It combines aggressive detection of slow instances with smart pruning of false positives to achieve a far better trade-off between these conflicting objectives. Using only 0.5% additional resources (similar to the conservative approach), we demonstrate a 10%-38% improvement in tail latency compared to both conservative and aggressive approaches.
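
The detection-plus-pruning idea can be sketched in a few lines of Python; the thresholds, the progress metric, and the persistence rule are assumptions for illustration, not Spur's actual policy.

    from collections import defaultdict

    LAG_FACTOR = 0.5      # "slow" = under 50% of the median peer progress
    PERSISTENCE = 3       # must look slow in 3 consecutive checks (prunes false positives)

    suspect_counts = defaultdict(int)

    def instances_to_replicate(progress):
        # progress: dict instance_id -> records processed in the current window
        median = sorted(progress.values())[len(progress) // 2]
        slow = []
        for inst, done in progress.items():
            if done < LAG_FACTOR * median:
                suspect_counts[inst] += 1
                if suspect_counts[inst] >= PERSISTENCE:
                    slow.append(inst)     # request a speculative replica
            else:
                suspect_counts[inst] = 0
        return slow

    for _ in range(3):
        print(instances_to_replicate({"a": 100, "b": 95, "c": 30}))
    # [], [], ['c'] -- 'c' is replicated only once its lag has persisted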

【Keywords】: Computer systems organization; Dependable and fault-tolerant systems and networks; Availability; Real-time systems; Real-time system architecture; Information systems; Data management systems; Database management system engines; Stream management

151. Entity Matching in the Wild: A Consistent and Versatile Framework to Unify Data in Industrial Applications.

Paper Link】 【Pages】:2287-2301

【Authors】: Yan Yan ; Stephen Meyles ; Aria Haghighi ; Dan Suciu

【Abstract】: Entity matching -- the task of clustering duplicated database records to underlying entities -- has become an increasingly critical component in modern data integration management. Amperity provides a platform for businesses to manage customer data that utilizes a machine-learning approach to entity matching, resolving billions of customer records on a daily basis. We face several challenges in deploying entity matching to industrial applications at scale, challenges that are less prominent in the literature. They include: (1) Providing not just a single entity clustering, but supporting clusterings at multiple confidence levels to enable downstream applications with varying precision/recall trade-off needs. (2) Many customer record attributes may be systematically missing from different sources of data, creating many pairs of records in a cluster that appear not to match due to incomplete, rather than conflicting, information. Allowing these records to connect transitively without introducing conflicts is invaluable to businesses because they can acquire a more comprehensive profile of their customers without incorrect entity merges. (3) Clustering records over time and assigning persistent cluster IDs that can be used for downstream use cases such as A/B tests or predictive model training; this is made more challenging by the fact that we receive new customer data every day, and clusters that naturally evolve over time still require persistent IDs that refer to the same entity. In this work, we describe Amperity's entity matching framework, Fusion, and how its design provides solutions to these challenges. In particular, we describe our pairwise matching model based on ordinal regression that permits a well-defined way to produce entity clusterings at different confidence levels, a novel clustering algorithm that separates conflicting record pairs in clusters while allowing for pairs that may appear dissimilar due to missing data, and a persistent ID generation algorithm which balances stability of the identifier with ever-evolving entities.
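
The first challenge (clusterings at multiple confidence levels) has a simple core that can be sketched in Python: threshold the pairwise match scores at each confidence level and take connected components. The scores, thresholds, and record ids below are invented, and Fusion's actual clustering additionally separates conflicting pairs.

    def clusters(records, pair_scores, threshold):
        # union-find over record ids; connect pairs whose score clears the threshold
        parent = {r: r for r in records}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x

        for (a, b), score in pair_scores.items():
            if score >= threshold:
                parent[find(a)] = find(b)
        groups = {}
        for r in records:
            groups.setdefault(find(r), []).append(r)
        return list(groups.values())

    records = ["r1", "r2", "r3", "r4"]
    pair_scores = {("r1", "r2"): 0.95, ("r2", "r3"): 0.70, ("r3", "r4"): 0.40}
    for level, t in [("high-confidence", 0.9), ("balanced", 0.6)]:
        print(level, clusters(records, pair_scores, t))
    # the high-confidence level keeps only near-certain merges; the balanced
    # level also merges r1-r2-r3, trading precision for recall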

【Keywords】: Computing methodologies; Machine learning; Learning paradigms; Supervised learning; Supervised learning by classification; Information systems; Data management systems; Information integration; Data cleaning; Deduplication; Entity resolution; Information systems applications; Data mining; Clustering

Research 26: Usability and Natural Language User Interfaces 5

152. QueryVis: Logic-based Diagrams help Users Understand Complicated SQL Queries Faster.

Paper Link】 【Pages】:2303-2318

【Authors】: Aristotelis Leventidis ; Jiahui Zhang ; Cody Dunne ; Wolfgang Gatterbauer ; H. V. Jagadish ; Mirek Riedewald

【Abstract】: Understanding the meaning of existing SQL queries is critical for code maintenance and reuse. Yet SQL can be hard to read, even for expert users or the original creator of a query. We conjecture that it is possible to capture the logical intent of queries in automatically-generated visual diagrams that can help users understand the meaning of queries faster and more accurately than SQL text alone. We present initial steps in that direction with visual diagrams that are based on the first-order logic foundation of SQL and can capture the meaning of deeply nested queries. Our diagrams build upon a rich history of diagrammatic reasoning systems in logic and were designed using a large body of human-computer interaction best practices: they are minimal in that no visual element is superfluous; they are unambiguous in that no two queries with different semantics map to the same visualization; and they extend previously existing visual representations of relational schemata and conjunctive queries in a natural way. An experimental evaluation involving 42 users on Amazon Mechanical Turk shows that with only a 2--3 minute static tutorial, participants could interpret queries meaningfully faster with our diagrams than when reading SQL alone. Moreover, we have evidence that our visual diagrams result in participants making fewer errors than with SQL. We believe that more regular exposure to diagrammatic representations of SQL can give rise to a pattern-based and thus more intuitive use and re-use of SQL. A full version of this paper with all appendices and supplemental material for the experimental study (stimuli, raw data, and analysis code) is available at https://osf.io/btszh.

【Keywords】: General and reference; Cross-computing tools and techniques; Empirical studies; Human-centered computing; Human computer interaction (HCI); Empirical studies in HCI; HCI design and evaluation methods; Usability testing; User studies; Interaction paradigms; Graphical user interfaces; Visualization; Empirical studies in visualization; Visualization application domains; Information visualization; Visualization design and evaluation methods; Visualization systems and tools; Visualization toolkits; Visualization techniques; Graph drawings; Visualization theory, concepts and paradigms; Information systems; Data management systems; Database administration; Database utilities and tools; Query languages; Relational database query languages; Theory of computation; Logic; Abstraction; Semantics and reasoning; Program reasoning; Program analysis; Theory and algorithms for application domains; Database theory; Database query languages (principles); Logic and databases

153. Duoquest: A Dual-Specification System for Expressive SQL Queries.

Paper Link】 【Pages】:2319-2329

【Authors】: Christopher Baik ; Zhongjun Jin ; Michael J. Cafarella ; H. V. Jagadish

【Abstract】: Querying a relational database is difficult because it requires users to be familiar with both the SQL language and the schema. However, many users possess enough domain expertise to describe their desired queries by alternative means. For such users, two major alternatives to writing SQL are natural language interfaces (NLIs) and programming-by-example (PBE). Both of these alternatives face certain pitfalls: natural language queries (NLQs) are often ambiguous, even for human interpreters, while current PBE approaches limit functionality to be tractable. Consequently, we propose dual-specification query synthesis, which consumes both a NLQ and an optional PBE-like table sketch query that enables users to express varied levels of domain knowledge. We introduce the novel dual-specification Duoquest system, which leverages guided partial query enumeration to efficiently explore the space of possible queries. We present results from user studies in which Duoquest demonstrates a 62.5% absolute increase in query construction accuracy over a state-of-the-art NLI and comparable accuracy to a PBE system on a limited workload supported by the PBE system. In a simulation study on the Spider benchmark, Duoquest demonstrates a >2x increase in top-1 accuracy over both NLI and PBE.

【Keywords】: Human-centered computing; Human computer interaction (HCI); Interaction paradigms; Natural language interfaces; Information systems; Data management systems; Query languages; Relational database query languages; Structured Query Language; Software and its engineering; Software notations and tools; Context specific languages; Programming by example

154. SQLCheck: Automated Detection and Diagnosis of SQL Anti-Patterns.

Paper Link】 【Pages】:2331-2345

【Authors】: Prashanth Dintyala ; Arpit Narechania ; Joy Arulraj

【Abstract】: The emergence of database-as-a-service platforms has made deploying database applications easier than before. Now, developers can quickly create scalable applications. However, designing performant, maintainable, and accurate applications is challenging. Developers may unknowingly introduce anti-patterns in the application's SQL statements. These anti-patterns are design decisions that are intended to solve a problem but often lead to other problems by violating fundamental design principles. In this paper, we present SQLCheck, a holistic toolchain for automatically finding and fixing anti-patterns in database applications. We introduce techniques for automatically (1) detecting anti-patterns with high precision and recall, (2) ranking the anti-patterns based on their impact on performance, maintainability, and accuracy of applications, and (3) suggesting alternative queries and changes to the database design to fix these anti-patterns. We demonstrate the prevalence of these anti-patterns in a large collection of queries and databases collected from open-source repositories. We introduce an anti-pattern detection algorithm that augments query analysis with data analysis. We present a ranking model for characterizing the impact of frequently occurring anti-patterns. We discuss how SQLCheck suggests fixes for high-impact anti-patterns using rule-based query refactoring techniques. Our experiments demonstrate that SQLCheck enables developers to create more performant, maintainable, and accurate applications.
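
To fix ideas, here is a deliberately tiny Python sketch of purely syntactic anti-pattern checks (SQLCheck itself combines query analysis with data analysis and ranking; the two rules below are just classic examples).

    import re

    CHECKS = [
        (re.compile(r"select\s+\*", re.I),
         "implicit columns: list the columns you need instead of SELECT *"),
        (re.compile(r"=\s*null\b", re.I),
         "NULL comparison: use IS NULL instead of = NULL"),
    ]

    def lint(query):
        return [msg for pattern, msg in CHECKS if pattern.search(query)]

    print(lint("SELECT * FROM users WHERE deleted_at = NULL"))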

【Keywords】: Information systems; Data management systems; Database administration; Database utilities and tools

155. DBPal: A Fully Pluggable NL2SQL Training Pipeline.

Paper Link】 【Pages】:2347-2361

【Authors】: Nathaniel Weir ; Prasetya Utama ; Alex Galakatos ; Andrew Crotty ; Amir Ilkhechi ; Shekar Ramaswamy ; Rohin Bhushan ; Nadja Geisler ; Benjamin Hättasch ; Steffen Eger ; Ugur Çetintemel ; Carsten Binnig

【Abstract】: Natural language is a promising alternative interface to DBMSs because it enables non-technical users to formulate complex questions in a more concise manner than SQL. Recently, deep learning has gained traction for translating natural language to SQL, since similar ideas have been successful in the related domain of machine translation. However, the core problem with existing deep learning approaches is that they require an enormous amount of training data in order to provide accurate translations. This training data is extremely expensive to curate, since it generally requires humans to manually annotate natural language examples with the corresponding SQL queries (or vice versa). Based on these observations, we propose DBPal, a new approach that augments existing deep learning techniques in order to improve the performance of models for natural language to SQL translation. More specifically, we present a novel training pipeline that automatically generates synthetic training data in order to (1) improve overall translation accuracy, (2) increase robustness to linguistic variation, and (3) specialize the model for the target database. As we show, our DBPal training pipeline is able to improve both the accuracy and linguistic robustness of state-of-the-art natural language to SQL translation models.
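
A minimal Python sketch of template-driven synthetic data generation follows; the schema, templates, and phrasing are invented, and the real pipeline adds paraphrasing and other augmentation for linguistic robustness.

    SCHEMA = {"patients": ["name", "age", "diagnosis"]}

    TEMPLATES = [
        ("show the {col} of all {table}", "SELECT {col} FROM {table}"),
        ("how many {table} are there", "SELECT COUNT(*) FROM {table}"),
    ]

    def generate_pairs(schema, templates):
        # instantiate NL/SQL template pairs with the target database's schema
        pairs = []
        for table, cols in schema.items():
            for nl_t, sql_t in templates:
                if "{col}" in nl_t:
                    for col in cols:
                        pairs.append((nl_t.format(col=col, table=table),
                                      sql_t.format(col=col, table=table)))
                else:
                    pairs.append((nl_t.format(table=table),
                                  sql_t.format(table=table)))
        return pairs

    for nl, sql in generate_pairs(SCHEMA, TEMPLATES):
        print(nl, "->", sql)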

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing; Information systems; Data management systems

156. SpeakQL: Towards Speech-driven Multimodal Querying of Structured Data.

Paper Link】 【Pages】:2363-2374

【Authors】: Vraj Shah ; Side Li ; Arun Kumar ; Lawrence Saul

【Abstract】: Speech-driven querying is becoming popular in new device environments such as smartphones, tablets, and even conversational assistants. However, such querying is largely restricted to natural language. Typed SQL remains the gold standard for sophisticated structured querying although it is painful in many environments, which restricts when and how users consume their data. In this work, we propose to bridge this gap by designing a speech-driven querying system and interface for structured data we call SpeakQL. We support a practically useful subset of regular SQL and allow users to query in any domain with novel touch/speech based human-in-the-loop correction mechanisms. Automatic speech recognition (ASR) introduces myriad forms of errors in transcriptions, presenting us with a technical challenge. We exploit our observations of SQL's properties, its grammar, and the queried database to build a modular architecture. We present the first dataset of spoken SQL queries and a generic approach to generate them for any arbitrary schema. Our experiments show that SpeakQL can automatically correct a large fraction of errors in ASR transcriptions. User studies show that SpeakQL can help users specify SQL queries significantly faster, with an average speedup of 2.7x and up to 6.7x compared to typing on a tablet device. SpeakQL also reduces the user effort in specifying queries by an average factor of 10x and up to 60x compared to raw typing.

【Keywords】: Information systems; Data management systems; Query languages; Relational database query languages; Structured Query Language

Research 27: Distributed and Parallel Processing 5

157. Near-Optimal Distributed Band-Joins through Recursive Partitioning.

Paper Link】 【Pages】:2375-2390

【Authors】: Rundong Li ; Wolfgang Gatterbauer ; Mirek Riedewald

【Abstract】: We consider running-time optimization for band-joins in a distributed system, e.g., the cloud. To balance load across worker machines, input has to be partitioned, which causes duplication. We explore how to resolve this tension between maximum load per worker and input duplication for band-joins between two relations. Previous work suffered from high optimization cost or considered partitionings that were too restricted (resulting in suboptimal join performance). Our main insight is that recursive partitioning of the join-attribute space with the appropriate split scoring measure can achieve both low optimization cost and low join cost. It is the first approach that is not only effective for one-dimensional band-joins but also for joins on multiple attributes. Experiments indicate that our method is able to find partitionings that are within 10% of the lower bound for both maximum load per worker and input duplication for a broad range of settings, significantly improving over previous work.
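
The recursive-partitioning idea can be sketched for a one-dimensional band-join |r.a - s.b| <= w; the split rule below (split while it lowers the maximum per-region load) is an assumed simplification of the paper's split scoring measure, and the data is made up.

    def region_cost(lo, hi, R, S, w):
        r = sum(lo <= x < hi for x in R)            # R tuples assigned to the region
        s = sum(lo - w <= y < hi + w for y in S)    # S tuples near a boundary are duplicated
        return r + s

    def split(lo, hi, R, S, w, max_depth=4):
        whole = region_cost(lo, hi, R, S, w)
        mid = (lo + hi) / 2
        left = region_cost(lo, mid, R, S, w)
        right = region_cost(mid, hi, R, S, w)
        # split only if it lowers the maximum load per region despite duplication
        if max_depth == 0 or max(left, right) >= whole:
            return [(lo, hi)]
        return (split(lo, mid, R, S, w, max_depth - 1) +
                split(mid, hi, R, S, w, max_depth - 1))

    R = [1, 2, 3, 10, 11, 12, 20, 21]
    S = [1.5, 2.5, 10.5, 11.5, 20.5]
    print(split(0, 25, R, S, w=1.0))
    # each region receives the R tuples inside it and the S tuples within w of it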

【Keywords】: Information systems; Data management systems; Database management system engines; Parallel and distributed DBMSs; MapReduce-based systems; Relational parallel and distributed DBMSs

158. ChronoCache: Predictive and Adaptive Mid-Tier Query Result Caching.

Paper Link】 【Pages】:2391-2406

【Authors】: Brad Glasbergen ; Kyle Langendoen ; Michael Abebe ; Khuzaima Daudjee

【Abstract】: The performance of data-driven, web-scale client applications is sensitive to access latency. To address this concern, enterprises strive to cache data on edge nodes that are closer to users, thereby avoiding expensive round-trips to remote data centers. However, these geo-distributed approaches are limited to caching static data. In this paper we present ChronoCache, a mid-tier caching system that exploits the presence of geo-distributed edge nodes to cache database query results closer to users. ChronoCache transparently learns and leverages client application access patterns to predictively combine query requests and cache their results ahead of time, thereby reducing costly round-trips to the remote database. We show that ChronoCache reduces query response times by up to 2/3 over prior approaches on multiple representative benchmark workloads.
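
A toy Python sketch of predictive result caching follows; the first-order "which query follows which" statistics and the prefetch threshold are assumptions for illustration, not ChronoCache's actual learner.

    from collections import defaultdict

    follows = defaultdict(lambda: defaultdict(int))   # learned access patterns
    cache = {}

    def observe(prev_q, next_q):
        follows[prev_q][next_q] += 1

    def maybe_prefetch(current_q, run_query, threshold=0.6):
        successors = follows[current_q]
        total = sum(successors.values())
        if not total:
            return
        next_q, count = max(successors.items(), key=lambda kv: kv[1])
        if count / total >= threshold:
            cache[next_q] = run_query(next_q)   # the round-trip happens ahead of time

    # toy usage: after "get_order" the application almost always issues "get_customer"
    for _ in range(9):
        observe("get_order", "get_customer")
    observe("get_order", "get_invoice")
    maybe_prefetch("get_order", run_query=lambda q: f"result of {q}")
    print(cache)   # {'get_customer': 'result of get_customer'}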

【Keywords】: Information systems; Data management systems; Middleware for databases

159. Cheetah: Accelerating Database Queries with Switch Pruning.

Paper Link】 【Pages】:2407-2422

【Authors】: Muhammad Tirmazi ; Ran Ben Basat ; Jiaqi Gao ; Minlan Yu

【Abstract】: Modern database systems are growing increasingly distributed and struggle to reduce query completion time with a large volume of data. In this paper, we leverage programmable switches in the network to partially offload query computation to the switch. While switches provide high performance, they have resource and programming constraints that make implementing diverse queries difficult. To fit within these constraints, we introduce the concept of data pruning -- filtering out entries that are guaranteed not to affect output. The database system then runs the same query but on the pruned data, which significantly reduces processing time. We propose pruning algorithms for a variety of queries. We implement our system, Cheetah, on a Barefoot Tofino switch and Spark. Our evaluation on multiple workloads shows a 40-200% improvement in query completion time compared to Spark.
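
The pruning concept can be illustrated with a generic example (not Cheetah's switch program): for a per-group MAX query, an in-network element can keep the best value seen per group and forward only tuples that could still change the answer, and the database then runs the unmodified query on the pruned stream.

    best = {}

    def prune(tuples):
        kept = []
        for group, value in tuples:
            if value > best.get(group, float("-inf")):
                best[group] = value
                kept.append((group, value))   # might still affect the final MAX
            # otherwise the tuple is guaranteed not to affect the output: drop it
        return kept

    stream = [("a", 3), ("a", 1), ("b", 7), ("a", 5), ("b", 2)]
    print(prune(stream))   # [('a', 3), ('b', 7), ('a', 5)] -- the MAX per group is unchanged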

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Networks; Network algorithms; Data path algorithms; Network services; In-network processing

160. External Merge Sort for Top-K Queries: Eager input filtering guided by histograms.

Paper Link】 【Pages】:2423-2437

【Authors】: Yannis Chronis ; Thanh Do ; Goetz Graefe ; Keith Peters

【Abstract】: Business intelligence and web log analysis workloads often use queries with top-k clauses to produce the most relevant results. Values of k range from small to rather large, and sometimes the requested output exceeds the capacity of the available main memory. When the requested output fits in the available memory, existing top-k algorithms are efficient, as they can eliminate almost all but the top-k results before sorting them. When the requested output exceeds the main memory capacity, existing algorithms externally sort the entire input, which can be very expensive. Furthermore, the drastic difference in execution cost when the memory capacity is exceeded results in an unpleasant user experience. Every day, tens of thousands of production top-k queries executed on F1 Query resort to an external sort of the input. To address these challenges, we introduce a new top-k algorithm that is able to eliminate parts of the input before sorting or writing them to secondary storage, regardless of whether the requested output fits in the available memory. To achieve this, at execution time our algorithm creates a concise model of the input using histograms. The proposed algorithm is implemented as part of F1 Query and is used in production, where it significantly accelerates top-k queries with outputs larger than the available memory. We evaluate our algorithm against existing top-k algorithms and show that it reduces I/O traffic and can be up to 11 times faster.
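
A minimal Python sketch of histogram-guided eager filtering follows; the equi-width histogram, the slack factor, and the input-size estimate are assumptions for illustration, not the F1 Query implementation.

    import random

    def histogram_cutoff(sample, k, est_total, buckets=64, slack=2.0):
        lo, hi = min(sample), max(sample)
        width = (hi - lo) / buckets or 1.0
        counts = [0] * buckets
        for key in sample:
            counts[min(int((key - lo) / width), buckets - 1)] += 1
        # keep the smallest keys: walk buckets until the expected number of
        # surviving records over the whole input reaches slack * k
        scale = est_total / len(sample)
        survivors = 0
        for i, c in enumerate(counts):
            survivors += c * scale
            if survivors >= slack * k:
                return lo + (i + 1) * width   # upper edge of this bucket
        return hi

    random.seed(0)
    stream = random.sample(range(100_000), 1000)      # keys arrive in arbitrary order
    cut = histogram_cutoff(stream[:200], k=10, est_total=1000)
    kept = [x for x in stream if x <= cut]             # eager filter before the external sort
    top_k = sorted(kept)[:10]
    print(f"cutoff={cut:.0f}, kept {len(kept)} of {len(stream)} records")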

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query operators

161. Automating Incremental and Asynchronous Evaluation for Recursive Aggregate Data Processing.

Paper Link】 【Pages】:2439-2454

【Authors】: Qiange Wang ; Yanfeng Zhang ; Hao Wang ; Liang Geng ; Rubao Lee ; Xiaodong Zhang ; Ge Yu

【Abstract】: In database and large-scale data analytics, recursive aggregate processing plays an important role; it is generally implemented under a framework of incremental computing and executed synchronously and/or asynchronously. We identify three barriers in existing recursive aggregate data processing. First, the processing scope is largely limited to monotonic programs. Second, checking the conditions for monotonicity and correctness of async processing is sophisticated and done manually. Third, execution engines may be suboptimal due to the separation of sync and async execution. In this paper, we lay an analytical foundation of conditions for checking whether a recursive aggregate program, monotonic or even non-monotonic, can be executed incrementally and asynchronously while still producing the correct result. We design and implement a condition verification tool that can automatically check if a given program satisfies the conditions. We further propose a unified sync-async engine to execute these programs for high performance. To integrate all these effective methods together, we have developed a distributed Datalog system, called PowerLog. Our evaluation shows that PowerLog can outperform three representative Datalog systems on both monotonic and non-monotonic recursive programs.
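
The kind of program in question can be sketched in Python: single-source shortest paths is a recursive MIN aggregate, and the generic delta-propagation loop below evaluates it incrementally and asynchronously (this is the textbook pattern, not PowerLog's engine).

    import heapq

    def sssp(edges, source):
        # dist(v) = MIN over edges u -> v of dist(u) + w, with dist(source) = 0
        dist = {source: 0.0}
        frontier = [(0.0, source)]         # asynchronous: always propagate the
        while frontier:                    # smallest pending delta first
            d, u = heapq.heappop(frontier)
            if d > dist.get(u, float("inf")):
                continue                   # stale delta, already improved
            for v, w in edges.get(u, []):
                nd = d + w
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd           # incremental: only improvements propagate
                    heapq.heappush(frontier, (nd, v))
        return dist

    edges = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 1.0)], "c": []}
    print(sssp(edges, "a"))                # {'a': 0.0, 'b': 1.0, 'c': 2.0}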

【Keywords】: Computing methodologies; Distributed computing methodologies; Information systems; Data management systems

Research 28: Stream Processing 5

162. Prompt: Dynamic Data-Partitioning for Distributed Micro-batch Stream Processing Systems.

Paper Link】 【Pages】:2455-2469

【Authors】: Ahmed S. Abdelhamid ; Ahmed R. Mahmood ; Anas Daghistani ; Walid G. Aref

【Abstract】: Advances in real-world applications require high-throughput processing over large data streams. Micro-batching has been proposed to support the needs of these applications. In micro-batching, the processing and batching of the data are interleaved, where the incoming data tuples are first buffered as data blocks, and then are processed collectively using parallel function constructs (e.g., Map-Reduce). The size of a micro-batch is set to guarantee a certain response-time latency that is to conform to the application's service-level agreement. In contrast to tuple-at-a-time data stream processing, micro-batching has the potential to sustain higher data rates. However, existing micro-batch stream processing systems use basic data-partitioning techniques that do not account for data skew and variable data rates. Load-awareness is necessary to maintain performance and to enhance resource utilization. A new data partitioning scheme termed Prompt is presented that leverages the characteristics of the micro-batch processing model. In the batching phase, a frequency-aware buffering mechanism is introduced that progressively maintains run-time statistics, and provides online key-based sorting as data tuples arrive. Because achieving optimal data partitioning is NP-Hard in this context, a workload-aware greedy algorithm is introduced that partitions the buffered data tuples efficiently for the Map stage. In the processing phase, a load-aware distribution mechanism is presented that balances the size of the input to the Reduce stage without incurring inter-task communication overhead. Moreover, Prompt elastically adapts resource consumption according to workload changes. Experimental results using real and synthetic data sets demonstrate that Prompt is robust against fluctuations in data distribution and arrival rates. Furthermore, Prompt achieves up to 200% improvement in system throughput over state-of-the-art techniques without degradation in latency.
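
The load-balancing core of such a scheme can be sketched with the classic greedy heuristic in Python (run-time key frequencies feed a least-loaded assignment); the data and reducer count are made up, and this is not Prompt's full algorithm.

    import heapq
    from collections import Counter

    def partition(keys, num_reducers):
        freq = Counter(keys)                              # run-time statistics
        loads = [(0, r) for r in range(num_reducers)]     # (current load, reducer id)
        heapq.heapify(loads)
        assignment = {}
        for key, count in freq.most_common():             # heaviest keys first
            load, r = heapq.heappop(loads)
            assignment[key] = r
            heapq.heappush(loads, (load + count, r))
        return assignment

    batch = ["x"] * 50 + ["y"] * 30 + ["z"] * 20 + ["q"] * 20
    print(partition(batch, num_reducers=2))   # e.g. {'x': 0, 'y': 1, 'z': 1, 'q': 0}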

【Keywords】: Information systems; Data management systems; Data structures; Data layout; Record and block layout; Database design and models; Data model extensions; Data streams; Database management system engines; Database query processing; Query optimization; Main memory engines; Parallel and distributed DBMSs; MapReduce-based systems; Stream management

163. Rhino: Efficient Management of Very Large Distributed State for Stream Processing Engines.

Paper Link】 【Pages】:2471-2486

【Authors】: Bonaventura Del Monte ; Steffen Zeuch ; Tilmann Rabl ; Volker Markl

【Abstract】: Scale-out stream processing engines (SPEs) are powering large big data applications on high velocity data streams. Industrial setups require SPEs to sustain outages, varying data rates, and low-latency processing. SPEs need to transparently reconfigure stateful queries during runtime. However, state-of-the-art SPEs are not ready yet to handle on-the-fly reconfigurations of queries with terabytes of state due to three problems. These are network overhead for state migration, consistency, and overhead on data processing. In this paper, we propose Rhino, a library for efficient reconfigurations of running queries in the presence of very large distributed state. Rhino provides a handover protocol and a state migration protocol to consistently and efficiently migrate stream processing among servers. Overall, our evaluation shows that Rhino scales with state sizes of up to TBs, reconfigures a running query 15 times faster than the state-of-the-art, and reduces latency by three orders of magnitude upon a reconfiguration.

【Keywords】: Information systems; Data management systems; Database management system engines; Parallel and distributed DBMSs; MapReduce-based systems; Stream management

164. Grizzly: Efficient Stream Processing Through Adaptive Query Compilation.

Paper Link】 【Pages】:2487-2503

【Authors】: Philipp M. Grulich ; Sebastian Breß ; Steffen Zeuch ; Jonas Traub ; Janis von Bleichert ; Zongxiong Chen ; Tilmann Rabl ; Volker Markl

【Abstract】: Stream Processing Engines (SPEs) execute long-running queries on unbounded data streams. They follow an interpretation-based processing model and do not perform runtime optimizations. This limits the utilization of modern hardware and neglects changing data characteristics at runtime. In this paper, we present Grizzly, a novel adaptive query compilation-based SPE, to enable highly efficient query execution. We extend query compilation and task-based parallelization for the unique requirements of stream processing and apply adaptive compilation to enable runtime re-optimizations. The combination of light-weight statistic gathering with just-in-time compilation enables Grizzly to adjust to changing data-characteristics dynamically at runtime. Our experiments show that Grizzly outperforms state-of-the-art SPEs by up to an order of magnitude in throughput.

【Keywords】: Information systems; Data management systems; Database design and models; Data model extensions; Data streams; Database management system engines; DBMS engine architectures; Stream management

165. LightSaber: Efficient Window Aggregation on Multi-core Processors.

Paper Link】 【Pages】:2505-2521

【Authors】: Georgios Theodorakis ; Alexandros Koliousis ; Peter R. Pietzuch ; Holger Pirk

【Abstract】: Window aggregation queries are a core part of streaming applications. To support window aggregation efficiently, stream processing engines face a trade-off between exploiting parallelism (at the instruction/multi-core levels) and incremental computation (across overlapping windows and queries). Existing engines implement ad-hoc aggregation and parallelization strategies. As a result, they only achieve high performance for specific queries depending on the window definition and the type of aggregation function. We describe a general model for the design space of window aggregation strategies. Based on this, we introduce LightSaber, a new stream processing engine that balances parallelism and incremental processing when executing window aggregation queries on multi-core CPUs. Its design generalizes existing approaches: (i) for parallel processing, LightSaber constructs a parallel aggregation tree (PAT) that exploits the parallelism of modern processors. The PAT divides window aggregation into intermediate steps that enable the efficient use of both instruction-level (i.e., SIMD) and task-level (i.e., multi-core) parallelism; and (ii) to generate efficient incremental code from the PAT, LightSaber uses a generalized aggregation graph (GAG), which encodes the low-level data dependencies required to produce aggregates over the stream. A GAG thus generalizes state-of-the-art approaches for incremental window aggregation and supports work-sharing between overlapping windows. LightSaber achieves up to an order of magnitude higher throughput compared to existing systems: on a 16-core server, it processes 470 million records/s with 132 μs average latency.

【Keywords】: Information systems; Data management systems; Database design and models; Data model extensions; Data streams

166. Parallel Index-based Stream Join on a Multicore CPU.

Paper Link】 【Pages】:2523-2537

【Authors】: Amirhesam Shahvarani ; Hans-Arno Jacobsen

【Abstract】: Indexing sliding window content to enhance the performance of streaming queries can be greatly improved by utilizing the computational capabilities of a multicore processor. Conventional indexing data structures optimized for frequent search queries on a prestored dataset do not meet the demands of indexing highly dynamic data as in streaming environments. In this paper, we introduce an index data structure, called the partitioned in-memory merge tree, to address the challenges that arise when indexing highly dynamic data, which are common in streaming settings. Utilizing the specific pattern of streaming data and the distribution of queries, we propose a low-cost and effective concurrency control mechanism to meet the demands of high-rate update queries. To complement the index, we design an algorithm to realize a parallel index-based stream join that exploits the computational power of multicore processors. Our experiments using an octa-core processor show that our parallel stream join achieves up to 5.5 times higher throughput than a single-threaded approach.

【Keywords】: Theory of computation; Design and analysis of algorithms; Data structures design and analysis; Parallel algorithms

167. Improving Approximate Nearest Neighbor Search through Learned Adaptive Early Termination.

Paper Link】 【Pages】:2539-2554

【Authors】: Conglong Li ; Minjia Zhang ; David G. Andersen ; Yuxiong He

【Abstract】: In applications ranging from image search to recommendation systems, the problem of identifying a set of "similar" real-valued vectors to a query vector plays a critical role. However, retrieving these vectors and computing the corresponding similarity scores from a large database is computationally challenging. Approximate nearest neighbor (ANN) search relaxes the guarantee of exactness for efficiency by vector compression and/or by only searching a subset of database vectors for each query. Searching a larger subset increases both accuracy and latency. State-of-the-art ANN approaches use fixed configurations that apply the same termination condition (the size of subset to search) for all queries, which leads to undesirably high latency when trying to achieve the last few percent of accuracy. We find that due to the index structures and the vector distributions, the number of database vectors that must be searched to find the ground-truth nearest neighbor varies widely among queries. Critically, we further identify that the intermediate search result after a certain amount of search is an important runtime feature that indicates how much more search should be performed. To achieve a better tradeoff between latency and accuracy, we propose a novel approach that adaptively determines search termination conditions for individual queries. To do so, we build and train gradient boosting decision tree models to learn and predict when to stop searching for a certain query. These models enable us to achieve the same accuracy with a smaller total amount of search compared to fixed configurations. We apply the learned adaptive early termination to state-of-the-art ANN approaches, and evaluate the end-to-end performance on three datasets ranging from million to billion scale. Compared with fixed configurations, our approach consistently improves the average end-to-end latency, being up to 7.1 times faster under the same high accuracy targets. Our approach is open source at github.com/efficient/faiss-learned-termination.
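
A minimal Python sketch of the learned termination idea follows, using scikit-learn's gradient boosting regressor; the features, synthetic labels, and budget floor are assumptions for illustration rather than the paper's feature set or training data.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    # assumed features of the intermediate result after a fixed initial search,
    # e.g. distance to the best candidate so far and the gap to the runner-up
    X = rng.random((1000, 2))
    # label: how much additional search was needed to reach the true neighbor
    y = 100 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 5, 1000)

    model = GradientBoostingRegressor().fit(X, y)

    def search_budget(intermediate_features, floor=10):
        # a per-query termination condition instead of one fixed configuration
        return max(int(model.predict([intermediate_features])[0]), floor)

    print(search_budget([0.2, 0.9]))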

【Keywords】: Information systems; Information retrieval; Search engine architectures and scalability; Search engine indexing

168. Theoretically-Efficient and Practical Parallel DBSCAN.

Paper Link】 【Pages】:2555-2571

【Authors】: Yiqiu Wang ; Yan Gu ; Julian Shun

【Abstract】: The DBSCAN method for spatial clustering has received significant attention due to its applicability in a variety of data analysis tasks. There are fast sequential algorithms for DBSCAN in Euclidean space that take O(n log n) work for two dimensions, sub-quadratic work for three or more dimensions, and can be computed approximately in linear work for any constant number of dimensions. However, existing parallel DBSCAN algorithms require quadratic work in the worst case. This paper bridges the gap between theory and practice of parallel DBSCAN by presenting new parallel algorithms for Euclidean exact DBSCAN and approximate DBSCAN that match the work bounds of their sequential counterparts, and are highly parallel (polylogarithmic depth). We present implementations of our algorithms along with optimizations that improve their practical performance. We perform a comprehensive experimental evaluation of our algorithms on a variety of datasets and parameter settings. Our experiments on a 36-core machine with two-way hyper-threading show that our implementations outperform existing parallel implementations by up to several orders of magnitude, and achieve speedups of up to 33x over the best sequential algorithms.

【Keywords】: Computing methodologies; Parallel computing methodologies; Parallel algorithms; Shared memory algorithms; Information systems; Information systems applications; Data mining; Clustering; Theory of computation; Theory and algorithms for application domains; Machine learning theory; Unsupervised learning and clustering

169. A Relational Matrix Algebra and its Implementation in a Column Store.

Paper Link】 【Pages】:2573-2587

【Authors】: Oksana Dolmatova ; Nikolaus Augsten ; Michael H. Böhlen

【Abstract】: Analytical queries often require a mixture of relational and linear algebra operations applied to the same data. This poses a challenge to analytic systems that must bridge the gap between relations and matrices. Previous work has mainly strived to fix the problem at the implementation level. This paper proposes a principled solution at the logical level. We introduce the relational matrix algebra (RMA), which seamlessly integrates linear algebra operations into the relational model and eliminates the dichotomy between matrices and relations. RMA is closed: All our relational matrix operations are performed on relations and result in relations; no additional data structure is required. Our implementation in MonetDB shows the feasibility of our approach, and empirical evaluations suggest that in-database analytics performs well for mixed workloads.

【Keywords】: Information systems; Data management systems; Database design and models; Relational database model; Database management system engines; Online analytical processing engines; Mathematics of computing; Mathematical analysis; Numerical analysis; Computations on matrices; Mathematical software; Statistical software

170. Locality-Sensitive Hashing Scheme based on Longest Circular Co-Substring.

Paper Link】 【Pages】:2589-2599

【Authors】: Yifan Lei ; Qiang Huang ; Mohan S. Kankanhalli ; Anthony K. H. Tung

【Abstract】: Locality-Sensitive Hashing (LSH) is one of the most popular methods for c-Approximate Nearest Neighbor Search (c-ANNS) in high-dimensional spaces. In this paper, we propose a novel LSH scheme based on the Longest Circular Co-Substring (LCCS) search framework (LCCS-LSH) with a theoretical guarantee. We introduce a novel concept of LCCS and a new data structure named Circular Shift Array (CSA) for k-LCCS search. The insight of LCCS search framework is that close data objects will have a longer LCCS than the far-apart ones with high probability. LCCS-LSH is LSH-family-independent, and it supports c-ANNS with different kinds of distance metrics. We also introduce a multi-probe version of LCCS-LSH and conduct extensive experiments over five real-life datasets. The experimental results demonstrate that LCCS-LSH outperforms state-of-the-art LSH schemes.

【Keywords】: Information systems; Information systems applications; Data mining; Nearest-neighbor search; Theory of computation; Design and analysis of algorithms; Streaming, sublinear and near linear time algorithms; Bloom filters and hashing; Nearest neighbor algorithms

171. Continuously Adaptive Similarity Search.

Paper Link】 【Pages】:2601-2616

【Authors】: Huayi Zhang ; Lei Cao ; Yizhou Yan ; Samuel Madden ; Elke A. Rundensteiner

【Abstract】: Similarity search is the basis for many data analytics techniques, including k-nearest neighbor classification and outlier detection. Similarity search over large data sets relies on i) a distance metric learned from input examples and ii) an index to speed up search based on the learned distance metric. In interactive systems, input to guide the learning of the distance metric may be provided over time. As this new input changes the learned distance metric, a naive approach would adopt the costly process of re-indexing all items after each metric change. In this paper, we propose the first solution, called OASIS, to instantaneously adapt the index to conform to a changing distance metric without this prohibitive re-indexing process. To achieve this, we prove that locality-sensitive hashing (LSH) provides an invariance property, meaning that an LSH index built on the original distance metric is equally effective at supporting similarity search using an updated distance metric, as long as the transform matrix learned for the new distance metric satisfies certain properties. This observation allows OASIS to avoid recomputing the index from scratch in most cases. Further, for the rare cases when an adaptation of the LSH index is shown to be necessary, we design an efficient incremental LSH update strategy that re-hashes only a small subset of the items in the index. In addition, we develop an efficient distance metric learning strategy that incrementally learns the new metric as inputs are received. Our experimental study using real world public datasets confirms the effectiveness of OASIS at improving the accuracy of various similarity search-based data analytics tasks by instantaneously adapting the distance metric and its associated index in tandem, while achieving a speedup of up to 3 orders of magnitude over state-of-the-art techniques.

【Keywords】: Information systems; Data management systems

Tutorials 8

172. Automating Exploratory Data Analysis via Machine Learning: An Overview.

Paper Link】 【Pages】:2617-2622

【Authors】: Tova Milo ; Amit Somech

【Abstract】: Exploratory Data Analysis (EDA) is an important initial step for any knowledge discovery process, in which data scientists interactively explore unfamiliar datasets by issuing a sequence of analysis operations (e.g. filter, aggregation, and visualization). Since EDA has long been known to be a difficult task, requiring profound analytical skills, experience, and domain knowledge, a plethora of systems have been devised over the last decade in order to facilitate EDA.

【Keywords】: Mathematics of computing; Probability and statistics; Statistical paradigms; Exploratory data analysis

173. Crowdsourcing Practice for Efficient Data Labeling: Aggregation, Incremental Relabeling, and Pricing.

Paper Link】 【Pages】:2623-2627

【Authors】: Alexey Drutsa ; Valentina Fedorova ; Dmitry Ustalov ; Olga Megorskaya ; Evfrosiniya Zerminova ; Daria Baidakova

【Abstract】: In this tutorial, we present a portion of unique industry experience in efficient data labeling via crowdsourcing shared by both leading researchers and engineers from Yandex. We will give an introduction to data labeling via public crowdsourcing marketplaces and will present the key components of efficient label collection. This will be followed by a practice session, where participants will choose one of the real label collection tasks, experiment with selecting settings for the labeling process, and launch their label collection project on one of the largest crowdsourcing marketplaces. The projects will be run on real crowds within the tutorial session. While the crowd performers are annotating the project set up by the attendees, we will present the major theoretical results in efficient aggregation, incremental relabeling, and dynamic pricing. We will also discuss their strengths and weaknesses as well as applicability to real-world tasks, summarizing our five-year-long research and industrial expertise in crowdsourcing. Finally, participants will receive feedback about their projects and practical advice on how to make them more efficient. We invite beginners, advanced specialists, and researchers to learn how to collect high quality labeled data and do it efficiently.

【Keywords】: General and reference; Cross-computing tools and techniques; Reliability; Information systems; World Wide Web; Web applications; Crowdsourcing; Social and professional topics; Professional topics; Computing industry; Sustainability

174. State of the Art and Open Challenges in Natural Language Interfaces to Data.

Paper Link】 【Pages】:2629-2636

【Authors】: Fatma Ozcan ; Abdul Quamar ; Jaydeep Sen ; Chuan Lei ; Vasilis Efthymiou

【Abstract】: Recent advances in natural language understanding and processing have resulted in renewed interest in natural language interfaces to data, which provide an easy mechanism for non-technical users to access and query the data. While early systems only allowed simple selection queries over a single table, some recent work supports complex BI queries, with many joins and aggregations, and even nested queries. There are various approaches in the literature for interpreting a user's natural language query. Rule-based systems try to identify the entities in the query and understand the intended relationships between those entities. Recent years have seen the emergence and popularity of neural-network-based approaches, which try to interpret the query holistically by learning patterns. In this tutorial, we will review these natural language interface solutions in terms of their interpretation approach, as well as the complexity of the queries they can generate. We will also discuss open research challenges.

【Keywords】: Human-centered computing; Human computer interaction (HCI); Interaction paradigms; Natural language interfaces

175. SIGMOD 2020 Tutorial on Fairness and Bias in Peer Review and Other Sociotechnical Intelligent Systems.

Paper Link】 【Pages】:2637-2640

【Authors】: Nihar B. Shah ; Zachary Lipton

【Abstract】: Questions of fairness and bias abound in all socially-consequential decisions pertaining to the collection and management of data. Whether designing protocols for peer review of research papers, setting hiring policies, or framing research questions in genetics, any data-management decision with the potential to allocate benefits or confer harms raises concerns about who gains or loses that may fail to surface in naively-chosen performance measures. Data science interacts with these questions in two fundamentally different ways: (i) as the technology driving the very systems responsible for certain social impacts, posing new questions about what it means for such systems to accord with ethical norms and the law; and (ii) as a set of powerful tools for analyzing existing data management systems, e.g., for auditing existing systems for various biases. This tutorial will tackle both angles on the interaction between technology and society vis-a-vis concerns over fairness and bias, particularly focusing on the collection and management of data. Our presentation will cover a wide range of disciplinary perspectives, with the first part focusing on the social impacts of technology and the formulations of fairness and bias defined via protected characteristics, and the second part taking a deep dive into peer review and distributed human evaluations to explore other forms of bias, such as those due to subjectivity, miscalibration, and dishonest behavior.

【Keywords】: Computing methodologies; Artificial intelligence; Machine learning; Human-centered computing; Collaborative and social computing; Information systems

176. Le Taureau: Deconstructing the Serverless Landscape & A Look Forward.

Paper Link】 【Pages】:2641-2650

【Authors】: Anurag Khandelwal ; Arun Kejariwal ; Karthikeyan Ramasamy

【Abstract】: Akin to the natural evolution of programming from assembly language to high-level languages, serverless computing represents the next frontier in the evolution of cloud computing: bare metal -> virtual machines -> containers -> serverless. The genesis of serverless computing can be traced back to the fundamental need to let a programmer focus solely on writing application code in a high-level language while isolating all facets of system management (for example, instance selection, scaling, deployment, logging, monitoring, and fault tolerance). This is particularly critical in light of today's increasingly tight time-to-market constraints. Currently, serverless computing is supported by leading public cloud vendors through offerings such as AWS Lambda, Google Cloud Functions, Azure Cloud Functions, and others. While this is an important step in the right direction, there are many challenges going forward: for instance, how to enable support for dynamic optimization, how to extend support for stateful computation, how to efficiently bin-pack applications, and how to support hardware heterogeneity (this will be key especially in light of the emergence of hardware accelerators for deep learning workloads). Inspired by Picasso's Le Taureau, in the tutorial proposed herein, we shall deconstruct the evolution of serverless, the overarching intent being to facilitate a better understanding of the serverless landscape. This, we hope, will help push the innovation frontier on both fronts: the paradigm itself and the applications built atop it.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Cloud computing; Computing methodologies; Parallel computing methodologies; Software and its engineering; Software notations and tools; General programming languages

177. Beyond Analytics: The Evolution of Stream Processing Systems.

Paper Link】 【Pages】:2651-2658

【Authors】: Paris Carbone ; Marios Fragkoulis ; Vasiliki Kalavri ; Asterios Katsifodimos

【Abstract】: Stream processing has been an active research field for more than 20 years, but it is now in its prime, thanks to recent successful efforts by the research community and numerous worldwide open-source communities. The goal of this tutorial is threefold. First, we aim to review and highlight noteworthy past research findings, which were largely ignored until very recently. Second, we intend to underline the differences between early ('00-'10) and modern ('11-'18) streaming systems, and how those systems have evolved through the years. Most importantly, we wish to turn the attention of the database community to recent trends: streaming systems are no longer used only for classic stream processing workloads, namely window aggregates and joins. Instead, modern streaming systems are being increasingly used to deploy general event-driven applications in a scalable fashion, challenging the design decisions, architecture, and intended use of existing stream processing systems.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Cloud computing; Computing methodologies; Distributed computing methodologies; Parallel computing methodologies; Information systems; Data management systems; Database design and models; Data model extensions; Data streams; Database management system engines; Distributed database transactions; Distributed database recovery; Parallel and distributed DBMSs; Stream management; Theory of computation; Design and analysis of algorithms; Distributed algorithms; MapReduce algorithms; Models of computation; Concurrency; Distributed computing models

178. Optimal Join Algorithms Meet Top-k.

Paper Link】 【Pages】:2659-2665

【Authors】: Nikolaos Tziavelis ; Wolfgang Gatterbauer ; Mirek Riedewald

【Abstract】: Top-k queries have been studied intensively in the database community and they are an important means to reduce query cost when only the "best" or "most interesting" results are needed instead of the full output. While some optimality results exist, e.g., the famous Threshold Algorithm, they hold only in a fairly limited model of computation that does not account for the cost incurred by large intermediate results and hence is not aligned with typical database-optimizer cost models. On the other hand, the idea of avoiding large intermediate results is arguably the main goal of recent work on optimal join algorithms, which uses the standard RAM model of computation to determine algorithm complexity. This research has created a lot of excitement due to its promise of reducing the time complexity of join queries with cycles, but it has mostly focused on full-output computation. We argue that the two areas can and should be studied from a unified point of view in order to achieve optimality in the common model of computation for a very general class of top-k-style join queries. This tutorial has two main objectives. First, we will explore and contrast the main assumptions, concepts, and algorithmic achievements of the two research areas. Second, we will cover recent, as well as some older, approaches that emerged at the intersection to support efficient ranked enumeration of join-query results. These are related to classic work on k-shortest path algorithms and more general optimization problems, some of which dates back to the 1950s. We demonstrate that this line of research warrants renewed attention in the challenging context of ranked enumeration for general join queries.

【Keywords】: General and reference; Document types; Surveys and overviews; Information systems; Information retrieval; Retrieval models and ranking; Top-k retrieval in databases; Mathematics of computing; Discrete mathematics; Combinatorics; Enumeration; Graph theory; Hypergraphs; Theory of computation; Design and analysis of algorithms; Algorithm design techniques; Dynamic programming; Graph algorithms analysis; Shortest paths; Theory and algorithms for application domains; Database theory; Data structures and algorithms for data management; Database query processing and optimization (theory)

179. Key-Value Storage Engines.

Paper Link】 【Pages】:2667-2672

【Authors】: Stratos Idreos ; Mark Callaghan

【Abstract】: Key-value stores are everywhere. They power a diverse set of data-driven applications across both industry and science. Key-value stores are used as stand-alone NoSQL systems but they are also used as a part of more complex pipelines and systems such as machine learning and relational systems. In this tutorial, we survey the state-of-the-art approaches on how the core storage engine of a key-value store system is designed. We focus on several critical components of the engine, starting with the core data structures to lay out data across the memory hierarchy. We also discuss design issues related to caching, timestamps, concurrency control, updates, shifting workloads, as well as mixed workloads with both analytical and transactional characteristics. We cover designs that are read-optimized, write-optimized, as well as hybrids. We draw examples from several state-of-the-art systems but we also put everything together in a general framework which allows us to model storage engine designs under a single unified model and reason about the expected behavior of diverse designs. In addition, we show that, given the vast number of possible storage engine designs and their complexity, there is a need to be able to describe and communicate design decisions in a high-level descriptive language, and we present a first version of such a language. We then use that framework to present several open challenges in the field, especially in terms of supporting increasingly more diverse and dynamic applications in the era of data science and AI, including neural networks, graphs, and data versioning.

【Keywords】: Information systems; Data management systems; Data structures; Data layout; Database design and models; Physical data models; Database management system engines; DBMS engine architectures

Demonstrations 36

180. RASQL: A Powerful Language and its System for Big Data Applications.

Paper Link】 【Pages】:2673-2676

【Authors】: Jin Wang ; Guorui Xiao ; Jiaqi Gu ; Jiacheng Wu ; Carlo Zaniolo

【Abstract】: There is a growing interest in supporting advanced Big Data applications on distributed data processing platforms. Most of these systems support SQL or a dialect of it as the query interface due to its portability and declarative nature. However, the current SQL standard cannot effectively express advanced analytical queries due to its limited support for recursive queries. In this demonstration, we show that this problem can be resolved via a simple SQL extension that delivers greater expressive power by allowing aggregates in recursion. To this end, we propose the Recursive-aggregate-SQL (RASQL) language and its system on top of Apache Spark to express and execute complex queries and declarative algorithms in many applications, such as graph search and machine learning. With a variety of examples, we will (i) show how complicated analytic queries can be expressed with RASQL; (ii) illustrate the formal semantics of the powerful new constructs; and (iii) present a user-friendly interface to interact with the RASQL system and monitor the query results.

【Keywords】: Information systems; Data management systems
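
【Code Sketch】: The expressive power the demo highlights is aggregation inside recursion (e.g., computing shortest paths without materializing all paths first). The Python fixpoint below is only an illustration of those semantics, not RASQL syntax or its Spark execution.

    # Single-source shortest paths as a recursion with a min-aggregate applied
    # at every step, iterated to a fixpoint.
    edges = [("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 5.0), ("c", "d", 1.0)]

    def shortest_paths(edges, source):
        dist = {source: 0.0}                  # base case of the recursion
        changed = True
        while changed:                        # recursive step, repeated to fixpoint
            changed = False
            for u, v, w in edges:
                if u in dist and dist[u] + w < dist.get(v, float("inf")):
                    dist[v] = dist[u] + w     # min-aggregate inside the recursion
                    changed = True
        return dist

    print(shortest_paths(edges, "a"))   # {'a': 0.0, 'b': 1.0, 'c': 3.0, 'd': 4.0}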

181. PL/SQL Without the PL.

Paper Link】 【Pages】:2677-2680

【Authors】: Denis Hirn ; Torsten Grust

【Abstract】: We demonstrate a source-to-source compilation technique that can translate user-defined PL/SQL functions into plain SQL queries. These PL/SQL functions may feature arbitrarily complex control flow (iteration, in particular). From this imperative-style input code we derive equivalent recursive common table expressions, ready for execution on any relational SQL:1999 back-end. Principally due to the absence of PL/SQL-to-SQL context switches, the output plain SQL code comes with substantial runtime savings. The demonstration embeds the compiler into an interactive setup that admits function editing while live re-compilation and visualization of intermediate code forms happen in the background.

【Keywords】: Information systems; Data management systems; Query languages; Relational database query languages; Structured Query Language; Software and its engineering; Software notations and tools; Compilers; General programming languages; Language features; Recursion

Paper Link】 【Pages】:2681-2684

【Authors】: Theofilos Belmpas ; Orest Gkini ; Georgia Koutrika

【Abstract】: Numerous search systems have been implemented that allow users to pose unstructured queries over databases without the need to use a query language, such as SQL. Unfortunately, the landscape of efforts is fragmented, with no clear view of which system is best and which open challenges we should pursue in our research. To make progress in this direction, we present THOR, which makes four important contributions: a query benchmark, a framework for comparing different systems, several search system implementations, and a highly interactive tool for comparing different search systems.

【Keywords】: Information systems; Information retrieval; Evaluation of retrieval results; Users and interactive retrieval; Search interfaces

183. BOOMER: A Tool for Blending Visual P-Homomorphic Queries on Large Networks.

Paper Link】 【Pages】:2685-2688

【Authors】: Yinglong Song ; Huey-Eng Chua ; Sourav S. Bhowmick ; Byron Choi ; Shuigeng Zhou

【Abstract】: The paradigm of interleaving (i.e. blending) visual subgraph query formulation and processing by exploiting the latency offered by the GUI brings in several potential benefits such as superior system response time (SRT) and opportunities to enhance usability of graph databases. Recent efforts at implementing this paradigm are focused on subgraph isomorphism-based queries, which are often restrictive in many real-world graph applications. In this demonstration, we present a novel system called BOOMER to realize this paradigm on more generic but complex bounded 1-1 p-homomorphic (BPH) queries on large networks. Intuitively, a BPH query maps an edge of the query to bounded paths in the data graph. We demonstrate various innovative features of BOOMER, its flexibility, and its promising performance.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing

184. AURORA: Data-driven Construction of Visual Graph Query Interfaces for Graph Databases.

Paper Link】 【Pages】:2689-2692

【Authors】: Sourav S. Bhowmick ; Kai Huang ; Huey-Eng Chua ; Zifeng Yuan ; Byron Choi ; Shuigeng Zhou

【Abstract】: Several commercial and academic frameworks for querying a large collection of small- or medium-sized data graphs (e.g., chemical compounds) provide visual graph query interfaces (a.k.a. GUIs) to help non-programmers query these sources. However, construction of these visual interfaces is not data-driven. That is, it does not exploit the underlying data graphs to automatically generate the contents of various panels in a GUI. Such data-driven construction has several benefits, such as facilitating efficient subgraph query formulation and portability of the interface across different application domains and sources. In this demonstration, we present a novel data-driven visual subgraph query interface construction engine called AURORA. Specifically, given a graph repository D containing a collection of small- or medium-sized data graphs, it automatically generates the GUI for D by populating various components of the interface. We demonstrate various innovative features of AURORA.

【Keywords】: Information systems; Data management systems

185. vChain: A Blockchain System Ensuring Query Integrity.

Paper Link】 【Pages】:2693-2696

【Authors】: Haixin Wang ; Cheng Xu ; Ce Zhang ; Jianliang Xu

【Abstract】: This demonstration presents vChain, a blockchain system that ensures query integrity. With the proliferation of blockchain applications and services, there has been an increasing demand for querying the data stored in a blockchain database. However, existing solutions either risk losing query integrity or require users to maintain a full copy of the blockchain database. In comparison, by employing a novel verifiable query processing framework, vChain enables a lightweight user to authenticate the query results returned from a potentially untrusted service provider. We demonstrate its verifiable query operations, usability, and performance with visualization for better insights. We also showcase how users can detect falsified results in the case that the service provider is compromised.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Security and privacy; Systems security; Distributed systems security
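
【Code Sketch】: The core idea of verifiable queries, i.e., that a lightweight user can check results from an untrusted provider against a small digest stored in the block header, can be illustrated with a plain Merkle proof. This is a simplified stand-in; vChain's actual authenticated data structure supports richer verifiable boolean range queries.

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def merkle_root(leaves):
        level = [h(x) for x in leaves]
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])
            level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    def merkle_proof(leaves, index):
        proof, level = [], [h(x) for x in leaves]
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])
            sibling = index ^ 1
            proof.append((level[sibling], sibling < index))
            level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
            index //= 2
        return proof

    def verify(leaf, proof, root):
        node = h(leaf)
        for sibling, is_left in proof:
            node = h(sibling + node) if is_left else h(node + sibling)
        return node == root

    objects = [b"obj1", b"obj2", b"obj3", b"obj4"]
    root = merkle_root(objects)            # digest kept in the block header
    proof = merkle_proof(objects, 2)       # returned by the (untrusted) provider
    print(verify(b"obj3", proof, root))    # True: result authenticated
    print(verify(b"forged", proof, root))  # False: falsified result detected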

186. AUDITOR: A System Designed for Automatic Discovery of Complex Integrity Constraints in Relational Databases.

Paper Link】 【Pages】:2697-2700

【Authors】: Wentao Hu ; Dongxiang Zhang ; Dawei Jiang ; Sai Wu ; Ke Chen ; Kian-Lee Tan ; Gang Chen

【Abstract】: In this demonstration, we present a new definition of integrity constraint that is more powerful for anomalous data discovery. In our definition, a constraint is defined over both categorical and numerical attributes in relational tables, as well as their derivative attributes, leading to a huge search space. Furthermore, we are the first to take into account attribute value distribution as part of a constraint. Based on the proposed integrity constraint, we build AUDITOR on top of relational tables from the healthcare auditing industry and demonstrate its effectiveness and ease of use for domain experts to discover anomalous data.

【Keywords】: Information systems; Data management systems; Information integration; Data cleaning

187. SHARQL: Shape Analysis of Recursive SPARQL Queries.

Paper Link】 【Pages】:2701-2704

【Authors】: Angela Bonifati ; Wim Martens ; Thomas Timm

【Abstract】: We showcase SHARQL, a system that allows users to navigate SPARQL query logs and inspect complex queries by visualizing their shape, and that can serve as a back-end to flexibly produce statistics about the logs. Even though SPARQL query logs are increasingly available and have become public recently, their navigation and analysis is hampered by the lack of appropriate tools. SPARQL queries are sometimes hard to understand, and their inherent properties, such as their shape, their hypertree properties, and their property paths, are even more difficult to identify and render properly. In SHARQL, we show how the analysis and exploration of several hundred million queries is possible. We offer edge rendering which works with complex hyperedges, regular edges, and property paths of SPARQL queries. The underlying database stores more than one hundred attributes per query and is therefore extremely flexible for exploring the query logs and as a back-end to compute and display analytical properties of the entire logs or parts thereof.

【Keywords】: Information systems; Data management systems; Query languages

188. High Performance Distributed OLAP on Property Graphs with Grasper.

Paper Link】 【Pages】:2705-2708

【Authors】: Hongzhi Chen ; Bowen Wu ; Shiyuan Deng ; Chenghuan Huang ; Changji Li ; Yichao Li ; James Cheng

【Abstract】: Achieving high performance OLAP over large graphs is a challenging problem and has received great attention recently because of its broad spectrum of applications. Existing systems have various performance bottlenecks due to limitations such as low parallelism and high network overheads. This demo presents Grasper, an RDMA-enabled distributed graph OLAP system, which adopts a series of new system designs to overcome the challenges of OLAP on graphs. The take-aways for demo attendees are: (1) a good understanding of the challenges of processing graph OLAP queries; (2) useful insights about where Grasper's good performance comes from; (3) inspirations about how to design an efficient graph OLAP system by comparing Grasper with existing systems.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Cloud computing

189. ProcAnalyzer: Effective Code Analyzer for Tuning Imperative Programs in SAP HANA.

Paper Link】 【Pages】:2709-2712

【Authors】: Kisung Park ; Taeyoung Jeong ; Chanho Jeong ; Jaeha Lee ; Dong-Hun Lee ; Young-Koo Lee

【Abstract】: Troubleshooting imperative programs at runtime is very challenging because the final optimized plan is quite different from the original design-time model. In this demonstration, we present ProcAnalyzer, an expressive and intuitive tool for troubleshooting issues related to performance, code quality, and security. We propose an end-to-end graph (E2EGraph) that provides a holistic view of design-time, compile-time, and runtime behavior so that end users and engine developers can easily find the correlations between design time and runtime. ProcAnalyzer provides suggestions and visualization to find problematic statements through the E2EGraph.

【Keywords】: Information systems; Data management systems; Database administration; Database utilities and tools

190. LATTE: Visual Construction of Smart Contracts.

Paper Link】 【Pages】:2713-2716

【Authors】: Sean Tan ; Sourav S. Bhowmick ; Huey-Eng Chua ; Xiaokui Xiao

【Abstract】: Smart contracts enable developers to run instructions on blockchains (e.g., Ethereum) and have a broad range of real-world applications. Solidity is the most popular high-level smart contract programming language on Ethereum. Coding in such a language, however, demands that a user be proficient in contract programming and debugging to construct smart contracts correctly. In practice, such an expectation makes it harder for non-programmers to take advantage of smart contracts. In this demonstration, we present a novel visual smart contract construction system on Ethereum called LATTE to make smart contract development accessible to non-programmers. Specifically, it allows a user to construct a contract without writing Solidity code by manipulating visual objects in a direct manipulation-based interface. Furthermore, LATTE interactively guides users and makes them aware of the cost (in units of Gas) of visual actions undertaken by them during contract construction.

【Keywords】: Security and privacy; Human and societal aspects of security and privacy; Usability in security and privacy

191. PROUD: PaRallel OUtlier Detection for Streams.

Paper Link】 【Pages】:2717-2720

【Authors】: Theodoros Toliopoulos ; Christos Bellas ; Anastasios Gounaris ; Apostolos Papadopoulos

【Abstract】: We introduce PROUD, standing for PaRallel OUtlier Detection for streams, which is an extensible engine for continuous multi-parameter parallel distance-based outlier (or anomaly) detection tailored to big data streams. PROUD is built on top of Flink. It defines a simple API for data ingestion. It supports a variety of parallel techniques, including novel ones, for continuous outlier detection that can be easily configured. In addition, it graphically reports metrics of interest and stores main results into a permanent store to enable future analysis. It can be easily extended to support additional techniques. Finally, it is publicly available as open source.

【Keywords】: Computing methodologies; Machine learning; Learning paradigms; Unsupervised learning; Anomaly detection; Information systems; Information systems applications; Data mining; Data stream mining; Software and its engineering; Software organization and properties; Software system structures; Software system models; Massively parallel systems

192. MithraCoverage: A System for Investigating Population Bias for Intersectional Fairness.

Paper Link】 【Pages】:2721-2724

【Authors】: Zhongjun Jin ; Mengjing Xu ; Chenkai Sun ; Abolfazl Asudeh ; H. V. Jagadish

【Abstract】: Data-driven technologies are only as good as the data they work with. On the other hand, data scientists often have limited control over how the data is collected. Failing to contain an adequate number of instances from minority (sub)groups, known as population bias, is a major reason for model unfairness and disparate performance across different groups. We demonstrate MithraCoverage, a system for investigating population bias over the intersection of multiple attributes. We use the concept of coverage for identifying intersectional subgroups with inadequate representation in the dataset. MithraCoverage is a web application with an interactive visual interface that allows data scientists to explore the dataset and identify subgroups with poor coverage.

【Keywords】: Information systems; Data management systems; Information integration; Data cleaning; Information systems applications; Data mining; Mathematics of computing; Probability and statistics; Statistical paradigms; Exploratory data analysis; Theory of computation; Theory and algorithms for application domains; Database theory; Incomplete, inconsistent, and uncertain databases
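
【Code Sketch】: Coverage over intersections of protected attributes can be computed directly by counting value combinations and flagging those below a threshold. The sketch below is an illustration of the concept only, not MithraCoverage's algorithms or the pruning strategies that make this tractable for many attributes.

    from itertools import combinations, product
    from collections import Counter

    def uncovered_subgroups(rows, attributes, threshold=2, max_arity=2):
        flagged = []
        for arity in range(1, max_arity + 1):
            for attrs in combinations(attributes, arity):
                counts = Counter(tuple(r[a] for a in attrs) for r in rows)
                domains = [sorted({r[a] for r in rows}) for a in attrs]
                for combo in product(*domains):
                    if counts.get(combo, 0) < threshold:
                        flagged.append((dict(zip(attrs, combo)), counts.get(combo, 0)))
        return flagged

    rows = [
        {"gender": "F", "race": "A"}, {"gender": "F", "race": "A"},
        {"gender": "M", "race": "A"}, {"gender": "M", "race": "B"},
        {"gender": "M", "race": "B"},
    ]
    for subgroup, count in uncovered_subgroups(rows, ["gender", "race"]):
        print(subgroup, count)   # e.g., {'gender': 'F', 'race': 'B'} has count 0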

193. MC3: A System for Minimization of Classifier Construction Cost.

Paper Link】 【Pages】:2725-2728

【Authors】: Shay Gershtein ; Tova Milo ; Gefen Morami ; Slava Novgorodov

【Abstract】: Search mechanisms over massive sets of items are the cornerstone of many modern applications, particularly in e-commerce websites. Consumers express in search queries a set of properties, and expect the system to retrieve qualifying items. A common difficulty, however, is that the information on whether or not an item satisfies the search criteria is sometimes not explicitly recorded in the repository. Instead, it may be considered as general knowledge or "hidden" in a picture/description, thereby leading to incomplete search results. To overcome these problems, companies invest in building dedicated classifiers that determine whether an item satisfies the given search criteria. However, building classifiers typically incurs non-trivial costs due to the required volumes of high-quality labeled training data. In this demo, we introduce MC3, a real-time system that helps data analysts decide which classifiers to construct to minimize the costs of answering a set of search queries. MC3 is interactive and facilitates real-time analysis, by providing detailed classifier impact information. We demonstrate the effectiveness of MC3 on real-world data and scenarios taken from a large e-commerce system, by interacting with the SIGMOD'20 audience members who act as analysts.

【Keywords】: Information systems; World Wide Web; Web applications; Electronic commerce

194. Sentinel: Understanding Data Systems.

Paper Link】 【Pages】:2729-2732

【Authors】: Brad Glasbergen ; Michael Abebe ; Khuzaima Daudjee ; Daniel Vogel ; Jian Zhao

【Abstract】: The complexity of modern data systems and applications greatly increases the challenge in understanding system behaviour and diagnosing performance problems. When these problems arise, system administrators are left with the difficult task of remedying them by relying on large debug log files, vast numbers of metrics, and system-specific tooling. We demonstrate the Sentinel system, which enables administrators to analyze systems and applications by building models of system execution and comparing them to derive key differences in behaviour. The resulting analyses are then presented as system reports to administrators and developers in an intuitive fashion. Users of Sentinel can locate, identify and take steps to resolve the reported performance issues. As Sentinel's models are constructed online by intercepting debug logging library calls, Sentinel's functionality incurs little overhead and works with all systems that use standard debug logging libraries.

【Keywords】: Information systems; Data management systems; Database administration; Database utilities and tools

195. BugDoc: A System for Debugging Computational Pipelines.

Paper Link】 【Pages】:2733-2736

【Authors】: Raoni Lourenço ; Juliana Freire ; Dennis E. Shasha

【Abstract】: Data analysis for scientific experiments and enterprises, large-scale simulations, and machine learning tasks all entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous outputs, the pipeline may fail to execute or produce incorrect results. Inferring the root cause(s) of such failures is challenging, usually requiring time and much human thought, while still being error-prone. We recently proposed a new approach that uses provenance to automatically and iteratively infer root causes and derive succinct explanations of failures; this approach is implemented in our prototype, BugDoc. In this demonstration, we will illustrate BugDoc's capabilities to debug pipelines using few configuration instances.

【Keywords】: Information systems; Data management systems; Database design and models; Data model extensions; Data provenance

196. TQVS: Temporal Queries over Video Streams in Action.

Paper Link】 【Pages】:2737-2740

【Authors】: Yueting Chen ; Xiaohui Yu ; Nick Koudas

【Abstract】: We present TQVS, a system capable of conducting efficient evaluation of declarative temporal queries over real-time video streams. Users may issue queries to identify video clips in which the same two cars and the same three persons appear jointly in the frames for, say, 30 seconds. In real-world videos, some of the objects may disappear from frames due to reasons such as occlusion, which introduces challenges to query evaluation. Our system, aiming to address such challenges, consists of two main components: the Object Detection and Tracking (ODT) module and the Query Evaluation module. The ODT module utilizes state-of-the-art object detection and tracking algorithms to produce a list of identified objects for each frame. Based on these results, we maintain selected object combinations through the current window during query evaluation. Those object combinations contain sufficient information to evaluate queries correctly. Since the number of possible combinations could be very large, we introduce a novel technique to structure the possible combinations and facilitate query evaluation. We demonstrate that our approach offers significant performance benefits compared to alternate approaches and constitutes a fundamental building block of the TQVS system.

【Keywords】: Information systems; Data management systems

197. ExTuNe: Explaining Tuple Non-conformance.

Paper Link】 【Pages】:2741-2744

【Authors】: Anna Fariha ; Ashish Tiwari ; Arjun Radhakrishna ; Sumit Gulwani

【Abstract】: In data-driven systems, we often encounter tuples on which the predictions of a machine-learned model are untrustworthy. A key cause of such untrustworthiness is non-conformance of a new tuple with respect to the training dataset. To check conformance, we introduce a novel concept of data invariant, which captures a set of implicit constraints that all tuples of a dataset satisfy: a test tuple is non-conforming if it violates the data invariants. Data invariants model complex relationships among multiple attributes; but do not provide interpretable explanations of non-conformance. We present ExTuNe, a system for Explaining causes of Tuple Non-conformance. Based on the principles of causality, ExTuNe assigns responsibility to the attributes for causing non-conformance. The key idea is to observe change in invariant violation under intervention on attribute-values. Through a simple interface, ExTuNe produces a ranked list of the test tuples based on their degree of non-conformance and visualizes tuple-level attribute responsibility for non-conformance through heat maps. ExTuNe further visualizes attribute responsibility, aggregated over the test tuples. We demonstrate how ExTuNe can detect and explain tuple non-conformance and assist the users to make careful decisions towards achieving trusted machine learning.

【Keywords】: Computing methodologies; Artificial intelligence; Knowledge representation and reasoning; Causal reasoning and diagnostics; Machine learning; Learning paradigms; Unsupervised learning; Anomaly detection; Machine learning approaches; Factorization methods; Principal component analysis; Human-centered computing; Visualization; Visualization application domains; Information visualization; Visualization techniques; Heat maps; Security and privacy; Formal methods and theory of security; Trust frameworks; Theory of computation; Semantics and reasoning; Program reasoning; Assertions; Invariants

198. Interactively Discovering and Ranking Desired Tuples without Writing SQL Queries.

Paper Link】 【Pages】:2745-2748

【Authors】: Xuedi Qin ; Chengliang Chai ; Yuyu Luo ; Nan Tang ; Guoliang Li

【Abstract】: The very first step of many data analytics tasks is to find and (possibly) rank desired tuples, typically by writing SQL queries - this is feasible only for data experts who can write SQL queries and know the data very well. Unfortunately, in practice, the queries might be complicated (for example, "find and rank good off-road cars based on a combination of Price, Make, Model, Age, Mileage, and so on" is complicated because it involves many if-then-else conditions and and/or/not logic), such that even data experts cannot precisely specify SQL queries; and the data might be unknown, which is common in data discovery when one tries to discover desired data from a data lake. Naturally, a system that can help users discover and rank desired tuples without writing SQL queries is needed. We propose to demonstrate such a system, namely DExPlorer. To use DExPlorer for data exploration, the user only needs to interactively perform two simple operations over a set of system-provided tuples: (1) annotate which tuples are desired (i.e., true labels) or not (i.e., false labels), and (2) annotate whether a tuple is more preferred than another one (i.e., partial orders or ranked lists). We will show that DExPlorer can find a user's desired tuples and rank them in a few interactions, even for complicated queries.

【Keywords】: Human-centered computing; Human computer interaction (HCI); Interactive systems and tools; Information systems; Data management systems; Information integration; Mediators and data integration

199. Synner: Generating Realistic Synthetic Data.

Paper Link】 【Pages】:2749-2752

【Authors】: Miro Mannino ; Azza Abouzied

【Abstract】: Synner allows users to generate realistic-looking data. With Synner, users can visually and declaratively specify properties of the dataset they wish to generate. Such properties include the domain and statistical distribution of each field, as well as relationships between fields. Users can also sketch custom distributions and relationships. Synner provides instant feedback on every user interaction by visualizing a preview of the generated data. It also suggests generation specifications from a few user-provided examples of the data to generate, column labels, and other user interactions. In this demonstration, we showcase Synner and summarize results from our evaluation of Synner's effectiveness at generating realistic-looking data.

【Keywords】: Human-centered computing; Human computer interaction (HCI); Interactive systems and tools

200. InCognitoMatch: Cognitive-aware Matching via Crowdsourcing.

Paper Link】 【Pages】:2753-2756

【Authors】: Roee Shraga ; Coral Scharf ; Rakefet Ackerman ; Avigdor Gal

【Abstract】: We present InCognitoMatch, the first cognitive-aware crowdsourcing application for matching tasks. InCognitoMatch provides a handy tool to validate, annotate, and correct correspondences using the crowd whilst accounting for human matching biases. In addition, InCognitoMatch enables system administrators to control the context information visible to workers and analyze their performance accordingly. For crowd workers, InCognitoMatch is an easy-to-use application that may be accessed from multiple crowdsourcing platforms. In addition, workers completing a task are offered suggestions for follow-up sessions according to their performance in the current session. For this demo, the audience will be able to experience InCognitoMatch through three use cases, interacting with the system as workers and as administrators.

【Keywords】: Information systems; Data management systems; Information integration; Mediators and data integration; World Wide Web; Web applications; Crowdsourcing

201. CoClean: Collaborative Data Cleaning.

Paper Link】 【Pages】:2757-2760

【Authors】: Mashaal Musleh ; Mourad Ouzzani ; Nan Tang ; AnHai Doan

【Abstract】: High-quality data is crucial for many applications, but real-life data is often dirty. Unfortunately, automated solutions are often not trustworthy and are thus seldom employed in practice. In real-world scenarios, it is often necessary to resort to manual cleaning to obtain pristine data. Existing human-in-the-loop solutions, such as Trifacta and OpenRefine, typically involve a single user. This is often error-prone, limited to a single person's expertise, and cannot scale with the ever-growing volume, variety, and veracity of data.

【Keywords】: Information systems; Data management systems; Information integration; Data cleaning

202. STAR: A Distributed Stream Warehouse System for Spatial Data.

Paper Link】 【Pages】:2761-2764

【Authors】: Zhida Chen ; Gao Cong ; Walid G. Aref

【Abstract】: The proliferation of mobile phones and location-based services gives rise to an explosive growth of spatial data. This spatial data contains valuable information, and calls for data stream warehouse systems that can provide real-time analytical results with the latest integrated spatial data. In this demonstration, we present the STAR (Spatial Data Stream Warehouse) system. STAR is a distributed in-memory spatial data stream warehouse system that provides low-latency and up-to-date analytical results over a fast spatial data stream. STAR supports a rich set of aggregate queries for spatial data analytics, e.g., contrasting the frequencies of spatial objects that appear in different spatial regions, or showing the most frequently mentioned topics being tweeted in different cities. STAR processes aggregate queries by maintaining distributed materialized views. Additionally, STAR supports dynamic load adjustment that makes STAR scalable and adaptive. We demonstrate STAR on top of Amazon EC2 clusters using real data sets.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Online analytical processing engines; Stream management

203. T-REx: Table Repair Explanations.

Paper Link】 【Pages】:2765-2768

【Authors】: Daniel Deutch ; Nave Frost ; Amir Gilad ; Oren Sheffer

【Abstract】: Data repair is a common and crucial step in many frameworks today, as applications may use data from different sources and of different levels of credibility. Thus, this step has been the focus of many works, proposing diverse approaches. To assist users in understanding the output of such data repair algorithms, we propose T-REx, a system for providing data repair explanations through Shapley values. The system is generic and not specific to a given repair algorithm or approach: it treats the algorithm as a black box. Given a specific table cell selected by the user, T-REx employs Shapley values to explain the significance of each constraint and each table cell in the repair of the cell of interest. T-REx then ranks the constraints and table cells according to their importance in the repair of this cell. This explanation allows users to understand the repair process, as well as to act based on this knowledge, for example to modify the most influential constraints or the original database.

【Keywords】: Information systems; Data management systems; Information integration; Data cleaning
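
【Code Sketch】: Shapley values treat the repair algorithm as a black-box set function: the value of a coalition of inputs (constraints or cells) is whether running the repair with only those inputs fixes the cell of interest. A minimal exact computation over a hypothetical toy value function is sketched below; T-REx's actual value function and approximation strategies are defined in the paper.

    from itertools import permutations

    def shapley(players, value):
        """Exact Shapley values of a set function `value` over `players`."""
        phi = {p: 0.0 for p in players}
        perms = list(permutations(players))
        for order in perms:
            coalition = set()
            for p in order:
                before = value(frozenset(coalition))
                coalition.add(p)
                phi[p] += value(frozenset(coalition)) - before
        return {p: v / len(perms) for p, v in phi.items()}

    # Hypothetical toy value function: the target cell is repaired (value 1)
    # only when constraint c1 is present together with c2 or c3.
    def repairs_cell(constraints):
        return 1.0 if "c1" in constraints and ({"c2", "c3"} & constraints) else 0.0

    print(shapley(["c1", "c2", "c3"], repairs_cell))
    # c1 carries most of the responsibility; c2 and c3 share the remainder.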

204. SVQ++: Querying for Object Interactions in Video Streams.

Paper Link】 【Pages】:2769-2772

【Authors】: Daren Chao ; Nick Koudas ; Ioannis Xarchakos

【Abstract】: Deep neural nets have enabled sophisticated information extraction from images, including video frames. Recently, there has been interest in techniques and algorithms that enable interactive declarative query processing over objects appearing in video frames and their associated interactions on the video feed. SVQ++ is a system for declarative querying on real-time video streams involving objects and their interactions. The system utilizes a sequence of inexpensive and less accurate models (filters), called Progressive Filters (PF), to detect the presence of the query-specified objects on frames, and a filtering approach, called Interaction Sheave (IS), to effectively prune frames that are not likely to contain interactions. We demonstrate that this system can efficiently identify frames in a streaming video in which an object is interacting with another in a specific way, dramatically increasing the frame processing rate and speeding up query processing by at least two orders of magnitude, depending on the query.

【Keywords】: Information systems; Data management systems; Database management system engines
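
【Code Sketch】: The Progressive Filter idea, i.e., cheap and less accurate models pruning frames before the expensive detector runs, can be sketched as a predicate cascade. The filters and frame fields below are hypothetical placeholders, not the SVQ++ models.

    def cheap_color_filter(frame):
        # e.g., a tiny model that rules out frames without car-like pixels
        return frame.get("car_likelihood", 0.0) > 0.2

    def small_cnn_filter(frame):
        # e.g., a low-resolution detector counting candidate objects
        return frame.get("objects_small_model", 0) >= 2

    def full_detector(frame):
        # stand-in for the expensive detector plus the interaction check
        return frame.get("interaction", False)

    PROGRESSIVE_FILTERS = [cheap_color_filter, small_cnn_filter]

    def matching_frames(stream):
        for frame in stream:
            if all(f(frame) for f in PROGRESSIVE_FILTERS):   # prune early and cheaply
                if full_detector(frame):
                    yield frame["id"]

    stream = [
        {"id": 1, "car_likelihood": 0.9, "objects_small_model": 3, "interaction": True},
        {"id": 2, "car_likelihood": 0.1},                            # dropped by filter 1
        {"id": 3, "car_likelihood": 0.8, "objects_small_model": 1},  # dropped by filter 2
    ]
    print(list(matching_frames(stream)))   # [1]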

205. F-IVM: Learning over Fast-Evolving Relational Data.

Paper Link】 【Pages】:2773-2776

【Authors】: Milos Nikolic ; Haozhe Zhang ; Ahmet Kara ; Dan Olteanu

【Abstract】: F-IVM is a system for real-time analytics such as machine learning applications over training datasets defined by queries over fast-evolving relational databases. We will demonstrate F-IVM for three such applications: model selection, Chow-Liu trees, and ridge linear regression.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Online analytical processing engines; Stream management
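
【Code Sketch】: The reason incremental maintenance helps such workloads is that models like ridge regression depend on the data only through aggregates (X^T X and X^T y) that can be updated tuple by tuple. The single-table sketch below illustrates that idea only; F-IVM itself maintains such aggregates over join queries with factorized incremental view maintenance.

    import numpy as np

    class IncrementalRidge:
        def __init__(self, n_features, lam=1.0):
            self.xtx = lam * np.eye(n_features)   # X^T X + lambda * I
            self.xty = np.zeros(n_features)       # X^T y

        def insert(self, x, y, weight=1):         # weight=-1 would model a delete
            x = np.asarray(x, dtype=float)
            self.xtx += weight * np.outer(x, x)
            self.xty += weight * y * x

        def model(self):
            return np.linalg.solve(self.xtx, self.xty)

    rng = np.random.default_rng(3)
    true_w = np.array([2.0, -1.0, 0.5])
    reg = IncrementalRidge(3, lam=0.1)
    for _ in range(5000):
        x = rng.normal(size=3)
        reg.insert(x, x @ true_w + rng.normal(scale=0.01))
    print(np.round(reg.model(), 2))   # close to [2.0, -1.0, 0.5]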

206. CoMing: A Real-time Co-Movement Mining System for Streaming Trajectories.

Paper Link】 【Pages】:2777-2780

【Authors】: Ziquan Fang ; Yunjun Gao ; Lu Pan ; Lu Chen ; Xiaoye Miao ; Christian S. Jensen

【Abstract】: The aim of real-time co-movement pattern mining for streaming trajectories is to discover co-moving objects that satisfy specific spatio-temporal constraints in real time. This functionality serves a range of real-world applications, such as traffic monitoring and management. However, little work targets visualization of and interaction with such co-movement detection over streaming trajectories. To this end, we develop CoMing, a real-time co-movement pattern mining system, to handle streaming trajectories. CoMing leverages ICPE, a real-time distributed co-movement pattern detection framework, and thus inherits its good performance. This demonstration offers hands-on experience with CoMing's visual and user-friendly interface. Moreover, several applications in the traffic domain, including object monitoring and traffic statistics visualization, are also provided to users.

【Keywords】: Information systems; Information systems applications; Data mining; Data stream mining

207. Unified Spatial Analytics from Heterogeneous Sources with Amazon Redshift.

Paper Link】 【Pages】:2781-2784

【Authors】: Nemanja Boric ; Hinnerk Gildhoff ; Menelaos Karavelas ; Ippokratis Pandis ; Ioanna Tsalouchidou

【Abstract】: Enterprise companies use spatial data for decision optimization and to gain new insights into the locality of their business and services. Industries rely on efficiently combining spatial and business data from different sources, such as data warehouses, geospatial information systems, transactional systems, and data lakes, where spatial data can be found in structured or unstructured form. In this demonstration we present the spatial functionality of Amazon Redshift and its integration with other Amazon services, such as Amazon Aurora PostgreSQL and Amazon S3. We focus on the design and functionality of the feature, including the extensions in Redshift's state-of-the-art optimizer to push spatial processing close to where the data is stored.

【Keywords】: Information systems; Data management systems; Database management system engines; Parallel and distributed DBMSs; Relational parallel and distributed DBMSs; Information integration; Data warehouses; Information systems applications; Decision support systems; Data analytics

208. Big Data Series Analytics Using TARDIS and its Exploitation in Geospatial Applications.

Paper Link】 【Pages】:2785-2788

【Authors】: Liang Zhang ; Noura Alghamdi ; Mohamed Y. Eltabakh ; Elke A. Rundensteiner

【Abstract】: The massive amounts of data series continuously generated and collected by applications require new indices to speed up the data series similarity queries on which various data mining techniques rely. However, the state-of-the-art iSAX-based indexing techniques do not scale well, due to a binary fanout that leads to a very deep index tree, and suffer from accuracy degradation, due to character-level cardinality that preserves proximity poorly. To address this problem, we recently proposed TARDIS to support indexing and querying billion-scale data series datasets. It introduces new iSAX-T signatures to reduce the cardinality conversion cost and a corresponding sigTree to construct a compact index structure that better preserves similarity. The framework consists of one centralized index and local distributed indices to efficiently re-partition and index high-dimensional datasets. In addition, effective query strategies based on the sigTree structure are proposed to greatly improve accuracy. In this demonstration, we present GENET, a new interactive exploration demonstration that supports Big Data Series Approximate Retrieval and Recursive Interactive Clustering over large-scale geospatial datasets using TARDIS index techniques.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization

209. CDFShop: Exploring and Optimizing Learned Index Structures.

Paper Link】 【Pages】:2789-2792

【Authors】: Ryan Marcus ; Emily Zhang ; Tim Kraska

【Abstract】: Indexes are a critical component of data management applications. While tree-like structures (e.g., B-Trees) have been employed to great success, recent work suggests that index structures powered by machine learning models (learned index structures) can achieve low lookup times with a reduced memory footprint. This demonstration showcases CDFShop, a tool to explore and optimize recursive model indexes (RMIs), a type of learned index structure. This demonstration allows audience members to (1) gain an intuition about various tuning parameters of RMIs and why learned index structures can greatly accelerate search, and (2) understand how automatic optimization techniques can be used to explore space/time tradeoffs within the space of RMIs.

【Keywords】: Information systems; Data management systems
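
【Code Sketch】: A recursive model index replaces tree traversal with model evaluation: a root model routes a key to a leaf model that predicts the key's position in the sorted array, and a bounded local search corrects the prediction. The two-stage sketch below uses simple linear fits and illustrative parameters; it is not CDFShop's implementation.

    import numpy as np

    class TwoStageRMI:
        def __init__(self, keys, n_leaves=16):
            self.keys = np.sort(np.asarray(keys, dtype=float))
            n = len(self.keys)
            self.root = np.polyfit(self.keys, np.arange(n, dtype=float), 1)  # key -> position
            self.n_leaves = n_leaves
            self.leaves, self.errors = [], []
            routed = np.clip((np.polyval(self.root, self.keys) / n * n_leaves).astype(int),
                             0, n_leaves - 1)
            for leaf in range(n_leaves):
                idx = np.where(routed == leaf)[0]
                if len(idx) == 0:
                    self.leaves.append((0.0, 0.0))
                    self.errors.append(0)
                    continue
                coef = np.polyfit(self.keys[idx], idx, 1) if len(idx) > 1 else (0.0, float(idx[0]))
                pred = np.polyval(coef, self.keys[idx])
                self.leaves.append(tuple(coef))
                self.errors.append(int(np.ceil(np.max(np.abs(pred - idx)))))

        def lookup(self, key):
            n = len(self.keys)
            leaf = int(np.clip(np.polyval(self.root, key) / n * self.n_leaves, 0, self.n_leaves - 1))
            guess = int(np.clip(np.polyval(self.leaves[leaf], key), 0, n - 1))
            err = self.errors[leaf]
            lo, hi = max(0, guess - err), min(n, guess + err + 1)
            # Final correction: binary search only within the model's error bound.
            return lo + int(np.searchsorted(self.keys[lo:hi], key))

    keys = np.cumsum(np.random.default_rng(1).integers(1, 10, size=10_000))
    rmi = TwoStageRMI(keys)
    k = float(keys[1234])
    print(rmi.lookup(k), int(np.searchsorted(np.sort(keys), k)))   # same position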

210. TensorFlow Data Validation: Data Analysis and Validation in Continuous ML Pipelines.

Paper Link】 【Pages】:2793-2796

【Authors】: Emily Caveness ; Paul Suganthan G. C. ; Zhuo Peng ; Neoklis Polyzotis ; Sudip Roy ; Martin Zinkevich

【Abstract】: Machine Learning (ML) research has primarily focused on improving the accuracy and efficiency of the training algorithms while paying much less attention to the equally important problem of understanding, validating, and monitoring the data fed to ML. Irrespective of the ML algorithms used, data errors can adversely affect the quality of the generated model. This indicates that we need to adopt a data-centric approach to ML that treats data as a first-class citizen, on par with algorithms and infrastructure which are the typical building blocks of ML pipelines. In this demonstration we showcase TensorFlow Data Validation (TFDV), a scalable data analysis and validation system for ML that we have developed at Google and recently open-sourced. This system is deployed in production as an integral part of TFX - an end-to-end machine learning platform at Google. It is used by hundreds of product teams at Google and has received significant attention from the open-source community as well.

【Keywords】: Information systems; Data management systems

211. Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental Computing.

Paper Link】 【Pages】:2797-2800

【Authors】: Zuozhi Wang ; Kai Zeng ; Botong Huang ; Wei Chen ; Xiaozong Cui ; Bo Wang ; Ji Liu ; Liya Fan ; Dachuan Qu ; Zhenyu Hou ; Tao Guan ; Chen Li ; Jingren Zhou

【Abstract】: As the primary approach to deriving decision-support insights, automated recurring routine analytic jobs account for a major part of cluster resource usage in modern enterprise data warehouses. These recurring routine jobs usually have stringent schedules and deadlines determined by external business logic, and thus cause dreadful resource skew and severe resource over-provisioning in the cluster. In this paper, we present Grosbeak, a novel data warehouse that supports resource-aware incremental computing to process recurring routine jobs, smooths the resource skew, and optimizes resource usage. Unlike batch processing in traditional data warehouses, Grosbeak leverages the fact that data is continuously ingested. It breaks an analysis job into small batches that incrementally process the progressively available data, and schedules these small-batch jobs intelligently when the cluster has free resources. In this demonstration, we showcase Grosbeak using real-world analysis pipelines. Users can interact with the data warehouse by registering recurring queries and observing the incremental scheduling behavior and smoothed resource usage pattern.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Query planning; Database views; Online analytical processing engines; Parallel and distributed DBMSs; Relational parallel and distributed DBMSs; Information integration; Data warehouses; Theory of computation; Theory and algorithms for application domains; Database theory; Database query processing and optimization (theory)

212. Demonstration of BitGourmet: Data Analysis via Deterministic Approximation.

Paper Link】 【Pages】:2801-2804

【Authors】: Saehan Jo ; Immanuel Trummer

【Abstract】: We demonstrate BitGourmet, a novel data analysis system that supports deterministic approximate query processing (DAQ). The system executes aggregation queries and produces deterministic bounds that are guaranteed to contain the true value. The system allows users to set a precision constraint on query results. Given a user-defined target precision, we operate on a carefully selected data subset to satisfy the precision constraint. More precisely, we divide each column vertically, bit-by-bit. Our specialized query processing engine evaluates queries on subsets of these bit vectors. This involves a scenario-specific query optimizer which relies on quality and cost models to decide the optimal bit selection and execution plan. In our demonstration, we show that DAQ realizes an interesting trade-off between result quality and execution time, making data analysis more interactive. We also offer manual control over the query plan, i.e., the bit selection and the execution plan, so that users can gain more insights into our system and DAQ in general.

【Keywords】: Information systems; Data management systems; Data structures; Data layout; Database management system engines; Database query processing; Query optimization; Online analytical processing engines
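
【Code Sketch】: Deterministic approximation via vertical bit slicing can be illustrated on a SUM aggregate: evaluating only the k most significant bits of a non-negative column yields a lower and an upper bound that are guaranteed to contain the true sum, and the gap shrinks as more bits are read. This sketch shows only the bounding idea; bit selection and the cost/quality optimizer are what BitGourmet adds.

    import numpy as np

    def sum_bounds(column, k_bits, total_bits=32):
        column = np.asarray(column, dtype=np.int64)           # non-negative values assumed
        dropped = total_bits - k_bits
        high = (column >> dropped) << dropped                 # low-order bits zeroed
        lower = int(high.sum())                               # every dropped suffix >= 0
        upper = lower + len(column) * ((1 << dropped) - 1)    # every suffix <= 2^dropped - 1
        return lower, upper

    rng = np.random.default_rng(7)
    col = rng.integers(0, 2**32, size=100_000, dtype=np.int64)
    lo, hi = sum_bounds(col, k_bits=8)
    print(lo <= int(col.sum()) <= hi, hi - lo)   # True, and the gap shrinks as k_bits grows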

213. Bring Your Own Data to X-PLAIN.

Paper Link】 【Pages】:2805-2808

【Authors】: Eliana Pastor ; Elena Baralis

【Abstract】: Exploring and understanding the motivations behind black-box model predictions is becoming essential in many different applications. X-PLAIN is an interactive tool that allows human-in-the-loop inspection of the reasons behind model predictions. Its support for the local analysis of individual predictions enables users to inspect the local behavior of different classifiers and compare the knowledge different classifiers are exploiting for their prediction. The interactive exploration of prediction explanation provides actionable insights for both trusting and validating model predictions and, in case of unexpected behaviors, for debugging and improving the model itself.

【Keywords】: Computing methodologies; Machine learning; Human-centered computing; Human computer interaction (HCI); Information systems; Information systems applications; Data mining

214. Physical Visualization Design.

Paper Link】 【Pages】:2809-2812

【Authors】: Lana Ramjit ; Zhaoning Kong ; Ravi Netravali ; Eugene Wu

【Abstract】: We demonstrate PVD, a system that visualization designers can use to co-design the interface and system architecture of scalable and expressive visualizations.

【Keywords】: Human-centered computing; Visualization; Visualization application domains; Visual analytics; Visualization design and evaluation methods; Visualization systems and tools; Information systems; Data management systems; Database management system engines; Database query processing; Query optimization

215. Demonstration of Chestnut: An In-memory Data Layout Designer for Database Applications.

Paper Link】 【Pages】:2813-2816

【Authors】: Mingwei Samuel ; Cong Yan ; Alvin Cheung

【Abstract】: This demonstration showcases Chestnut, a data layout generator for in-memory object-oriented database applications. Given an application and a memory budget, Chestnut generates a customized in-memory data layout and the corresponding query plans that are specialized for the application queries. Our demo will let users design and improve simple web applications using Chestnut. Users can view the Chestnut-generated data layouts using a custom visualization system, which will allow users to see how the application parameters affect Chestnut's design. Finally, users will be able to run queries generated by the application via the customized query plans generated by Chestnut or traditional relational query engines, and can compare the results and observe the speedup achieved by the Chestnut-generated query plans.

【Keywords】: Information systems; Data management systems; Data structures; Data layout; Record and block layout; Database design and models; Data model extensions; Semi-structured data; Middleware for databases; Application servers

Student Abstracts 19

216. Breaking Down Memory Walls in LSM-based Storage Systems.

Paper Link】 【Pages】:2817-2819

【Authors】: Chen Luo

【Abstract】: The log-structured merge-tree (LSM-tree) [16, 18] is widely used in modern NoSQL systems. Different from traditional update-in-place structures, an LSM-tree first buffers all writes in memory, which are subsequently flushed to disk when memory is full. The on-disk components are usually organized into levels of exponentially increasing sizes, where a smaller level is merged into the adjacent larger level when it fills up. To bound the temporary disk space occupied by merges, modern LSM-tree implementations often range-partition a disk component into many fixed-size SSTables.

【Keywords】: Information systems; Data management systems; Database management system engines; Record and buffer management
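
【Code Sketch】: The buffered write path described above can be sketched in a few lines: writes go to an in-memory table, full buffers are flushed as immutable sorted components, and smaller components are merged into larger ones. Real LSM engines add exponentially sized levels, SSTable partitioning, and compaction policies; this is only a toy illustration.

    class TinyLSM:
        def __init__(self, memtable_limit=4):
            self.memtable = {}                 # in-memory write buffer
            self.components = []               # on-disk components, newest first
            self.memtable_limit = memtable_limit

        def put(self, key, value):
            self.memtable[key] = value
            if len(self.memtable) >= self.memtable_limit:
                self._flush()

        def _flush(self):
            # Flush the buffer as a new sorted component; merge when components pile up.
            self.components.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}
            if len(self.components) > 2:
                newer = self.components.pop(0)
                older = self.components.pop(0)
                older.update(newer)            # newer values win on overlapping keys
                self.components.insert(0, dict(sorted(older.items())))

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for component in self.components:  # search newest to oldest
                if key in component:
                    return component[key]
            return None

    db = TinyLSM()
    for i in range(10):
        db.put(f"k{i}", i)
    print(db.get("k3"), db.get("k9"))   # 3 9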

217. Context-Free Path Querying via Matrix Equations.

Paper Link】 【Pages】:2821-2823

【Authors】: Yuliya Susanina

【Abstract】: Context-free path querying (CFPQ) is widely used for graph-structured data analysis in different areas. Since the input data is typically large, it is crucial to develop highly efficient algorithms for CFPQ. We show how to reduce CFPQ evaluation to solving systems of matrix equations over R --- a problem for which high-performance solutions exist. We also demonstrate the applicability of our approach to real-world data analysis.
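
For intuition about matrix-based CFPQ, the sketch below shows the related Boolean-matrix formulation (repeated Boolean matrix products iterated to a fixpoint), not the paper's reduction to matrix equations over the reals; the grammar, graph, and function names are illustrative only.

```python
import numpy as np

def cfpq_bool(n, a_edges, b_edges):
    # Grammar in CNF for S -> a S b | a b (balanced a..b paths):
    #   S -> A B | A B1,  B1 -> S B,  A -> 'a',  B -> 'b'
    M = {nt: np.zeros((n, n), dtype=bool) for nt in ("S", "A", "B", "B1")}
    for i, j in a_edges:
        M["A"][i, j] = True
    for i, j in b_edges:
        M["B"][i, j] = True
    rules = [("S", "A", "B"), ("S", "A", "B1"), ("B1", "S", "B")]
    changed = True
    while changed:                                    # iterate to a fixpoint
        changed = False
        for head, x, y in rules:
            prod = (M[x].astype(int) @ M[y].astype(int)) > 0
            new = M[head] | prod
            if not np.array_equal(new, M[head]):
                M[head], changed = new, True
    return M["S"]          # S[i, j] is True iff an S-derivable path i -> j exists

# Path 0 -a-> 1 -a-> 2 -b-> 3 -b-> 4 matches S between (0, 4) and (1, 3).
reach = cfpq_bool(5, a_edges=[(0, 1), (1, 2)], b_edges=[(2, 3), (3, 4)])
assert reach[0, 4] and reach[1, 3]
```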

【Keywords】: Information systems; Data management systems; Query languages; Query languages for non-relational engines; Mathematics of computing; Mathematical analysis; Numerical analysis; Computations on matrices; Theory of computation; Formal languages and automata theory

218. Simulation-based Approximate Graph Pattern Matching.

Paper Link】 【Pages】:2825-2827

【Authors】: Xiaoshuang Chen

【Abstract】: Graph pattern matching, which searches for the matches of a given query graph in a large data graph, is a fundamental problem in analyzing attributed graphs. However, existing algorithms either suffer from performance issues or fail to capture reasonable matches. In this paper, we propose a simulation-based approximate pattern matching algorithm that is not only efficient to compute but also able to capture reasonable matches missed by existing algorithms.
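
As background for the simulation-based direction, the sketch below implements classical graph simulation (a polynomial-time relaxation of subgraph isomorphism): a data vertex can match a query vertex if labels agree and every query out-edge can be simulated by some data out-edge. This illustrates the general technique only, not the paper's algorithm; all names are hypothetical.

```python
def graph_simulation(query_adj, query_label, data_adj, data_label):
    # Initialize candidate sets by label, then prune to a fixpoint.
    sim = {q: {u for u in data_label if data_label[u] == lbl}
           for q, lbl in query_label.items()}
    changed = True
    while changed:
        changed = False
        for q, q_succs in query_adj.items():
            for u in list(sim[q]):
                # u must have, for each query successor q', a data successor in sim[q'].
                if any(not (data_adj.get(u, set()) & sim[qp]) for qp in q_succs):
                    sim[q].discard(u)
                    changed = True
    return sim

# Tiny example: query a -> b; data graph 1(a) -> 2(b), plus 3(a) with no out-edge.
query_adj = {"qa": {"qb"}, "qb": set()}
query_label = {"qa": "a", "qb": "b"}
data_adj = {1: {2}}
data_label = {1: "a", 2: "b", 3: "a"}
print(graph_simulation(query_adj, query_label, data_adj, data_label))
# -> {'qa': {1}, 'qb': {2}}   (vertex 3 is pruned: it cannot simulate qa -> qb)
```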

【Keywords】: Theory of computation; Design and analysis of algorithms; Data structures design and analysis; Pattern matching

219. High-Dimensional Vector Similarity Search: From Time Series to Deep Network Embeddings.

Paper Link】 【Pages】:2829-2832

【Authors】: Karima Echihabi

【Abstract】: Similarity search is an important and challenging problem that is typically modeled as nearest neighbor search in high-dimensional space, where objects are represented as high-dimensional vectors and their (dis)similarity is evaluated using a distance measure such as the Euclidean distance.
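
The brute-force baseline implied by this formulation, which specialized high-dimensional indexes aim to approximate or outperform, looks as follows; the sketch is illustrative and assumes the vectors sit in a NumPy array.

```python
import numpy as np

def nearest_neighbor(dataset, query):
    # dataset: (n, d) array of vectors; query: (d,) vector
    dists = np.linalg.norm(dataset - query, axis=1)   # Euclidean distances
    idx = int(np.argmin(dists))
    return idx, float(dists[idx])

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 128))      # e.g., deep network embeddings
q = rng.standard_normal(128)
print(nearest_neighbor(X, q))               # index and distance of the closest vector
```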

【Keywords】: Information systems; Information systems applications; Data mining; Nearest-neighbor search; Theory of computation; Theory and algorithms for application domains; Machine learning theory; Unsupervised learning and clustering

220. Rethinking Message Brokers on RDMA and NVM.

Paper Link】 【Pages】:2833-2835

【Authors】: Hendrik Makait

【Abstract】: Over the last years, message brokers have become an important part of enterprise systems. As microservice architectures gain popularity and the need to analyze data produced by these services grows, companies increasingly rely on message brokers to orchestrate the flow of events between different applications as well as between data-producing services and streaming engines that analyze the data in real-time.

【Keywords】: Information systems; Data management systems; Middleware for databases; Message queues; Software and its engineering; Software organization and properties; Contextual software domains; Software infrastructure; Middleware; Message oriented middleware

Paper Link】 【Pages】:2837-2839

【Authors】: Yiru Chen

【Abstract】: Interactive tools like user interfaces help democratize data access for end-users by hiding underlying programming details and exposing the necessary widget interface to users. Since customized interfaces are costly to build, automated interface generation is desirable. SQL is the dominant way to analyze data, and logs of analysis queries already exist. Previous work proposed a syntactic approach that analyzes structural changes in SQL query logs and automatically generates a set of widgets to express those changes. However, it does not consider layout usability or the sequential order of queries in the log. We propose to adopt Monte Carlo Tree Search (MCTS) to search for the optimal interface, accounting for the hierarchical layout as well as usability in terms of how easily the query log can be expressed.
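
For readers unfamiliar with MCTS, the generic skeleton below shows its four phases (selection with UCT, expansion, simulation, backpropagation). The state interface (actions, apply, is_terminal, reward) is a hypothetical stand-in for an interface-design state and is not the authors' system.

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}            # action -> Node
        self.visits, self.value = 0, 0.0

    def uct_child(self, c=1.4):
        # Upper Confidence bound for Trees: exploit mean value, explore rarely tried children.
        return max(self.children.values(),
                   key=lambda n: n.value / n.visits
                   + c * math.sqrt(math.log(self.visits) / n.visits))

def mcts(root_state, iterations=1000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while node.children and len(node.children) == len(node.state.actions()):
            node = node.uct_child()
        # 2. Expansion: try one untried action.
        untried = [a for a in node.state.actions() if a not in node.children]
        if untried and not node.state.is_terminal():
            a = random.choice(untried)
            node.children[a] = Node(node.state.apply(a), parent=node)
            node = node.children[a]
        # 3. Simulation: random rollout to a terminal state.
        state = node.state
        while not state.is_terminal():
            state = state.apply(random.choice(state.actions()))
        reward = state.reward()
        # 4. Backpropagation: update statistics along the path back to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```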

【Keywords】: Human-centered computing; Human computer interaction (HCI); Interactive systems and tools; Interaction design; Interaction design process and methods; User interface design; Information systems; Data management systems; Database management system engines; Query languages; Relational database query languages

222. Continuous Prefetch for Interactive Data Applications.

Paper Link】 【Pages】:2841-2843

【Authors】: Haneen Mohammed

【Abstract】: Interactive data visualization and exploration (DVE) applications, such as the one in Figure 1, have rapidly grown in popularity with use cases in numerous sectors [2, 4, 9, 11, 15]. Like typical web services, DVE applications may be run on heterogeneous client devices and networks, with users expecting fast response times under 100 ms [12]. However, the resource demands of DVE applications are magnified and highly unpredictable, making it difficult to achieve such interactivity.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Client-server architectures; Human-centered computing; Human computer interaction (HCI); Interaction paradigms; Web-based interaction; Visualization; Visualization application domains; Information systems; Data management systems; Database design and models; Data model extensions; Data streams; Theory of computation; Design and analysis of algorithms; Approximation algorithms analysis; Scheduling algorithms

223. Re-evaluating the Performance Trade-offs for Hash-Based Multi-Join Queries.

Paper Link】 【Pages】:2845-2847

【Authors】: Shiva Jahangiri

【Abstract】: No abstract available.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Theory of computation; Theory and algorithms for application domains; Database theory; Database query processing and optimization (theory)

224. Interactive View Recommendation.

Paper Link】 【Pages】:2849-2851

【Authors】: Xiaozhong Zhang

【Abstract】: Existing view recommendation approaches have proposed a variety of utility functions for selecting useful views. Even though each utility function might be suitable for specific scenarios, identifying the most appropriate one, along with its tunable parameters, that represents the user's intention during an exploration is a challenge for both expert and non-expert users. This paper presents a step toward interactive view recommendation, which automatically discovers, during an exploration, the utility function composition that best matches the user's intentions and exploration task.

【Keywords】: Human-centered computing; Visualization; Information systems; World Wide Web; Web searching and information discovery; Personalization; Mathematics of computing; Probability and statistics; Statistical paradigms; Exploratory data analysis

225. From Worst-Case to Average-Case Analysis: Accurate Latency Predictions for Key-Value Storage Engines.

Paper Link】 【Pages】:2853-2855

【Authors】: Meena Jagadeesan ; Garrett Tanzer

【Abstract】: Selecting the optimal storage engine and tuning for an application requires comparing the latency of diverse workloads executed on different data structures. In this work, we start to develop an average-case analysis of the performance of storage engines that can achieve significantly more accurate predictions than existing worst-case models. We propose a distribution-aware framework to predict the latency of diverse workloads executed on a vast number of data structures. As a case study, we use our framework to produce cost models for a diverse family of key-value storage engine tunings, and verify our models on RocksDB and WiredTiger.

【Keywords】: Information systems; Data management systems; Data structures; Theory of computation; Theory and algorithms for application domains; Database theory; Data structures and algorithms for data management

226. Towards the Scheduling of Vertex-constrained Multi Subgraph Matching Query.

Paper Link】 【Pages】:2857-2859

【Authors】: Kongzhang Hao ; Longbin Lai

【Abstract】: Subgraph matching is one of the most fundamental problems in graph databases and is associated with a wide spectrum of applications. Researchers have primarily devoted their efforts to improving the performance of individual queries, while in practice we often need to compute multiple queries at once. In this paper, we study the problem of vertex-constrained multi subgraph matching query (vMSQ), for which we propose a novel scheduling algorithm that processes multiple queries in parallel while taking into consideration load balance and the maximum possible sharing of computation.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Query planning

227. Serverless Query Processing on a Budget.

Paper Link】 【Pages】:2861-2863

【Authors】: William W. Ma

【Abstract】: Relational query processing is an ideal candidate for serverless computation given its stateless, idempotent, and short-lived properties. However, current serverless offerings for query processing neither provide millisecond-based pricing nor allow users to optimize the cost of their queries. To trade off the cost and performance of their queries, users are limited to traditional serverful approaches, which we demonstrate to be 50% slower at approximately the same cost as serverless approaches. We propose a model that allows service providers to dynamically provision clusters to achieve their users' desired time-cost tradeoffs.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query planning; Parallel and distributed DBMSs; MapReduce-based systems

228. Workload-Aware Column Imprints.

Paper Link】 【Pages】:2865-2867

【Authors】: Noah Slavitch

【Abstract】: In-memory columnar databases use indexes to accelerate highly selective queries. The additional storage required by indexes becomes prohibitive when they are kept in memory; an inverted index, for example, requires as much space as the column itself. Column Imprints (CI) have been proposed as a space-efficient structure that supports range queries. We examine the limitations of CI and suggest three enhancements for in-memory databases: a workload-aware approach that considers recent data access patterns when constructing CI, a histogram optimized to reduce false positives and cache misses for highly selective queries, and efficient algorithms to construct our data structures. Preliminary experiments confirm that our workload-aware imprints 1) scan 30% to 50% fewer cache lines than the original CI, and 2) have significantly smaller storage requirements.
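
To illustrate the baseline structure being enhanced, here is a simplified column-imprint sketch in the spirit of the original proposal: one bit mask per cache line records which histogram bins its values fall into, and a range query scans only the cache lines whose mask intersects the query's bins. The function names and parameters are illustrative; the workload-aware extensions are not shown.

```python
import bisect

def build_imprints(column, bin_edges, values_per_cacheline=8):
    imprints = []
    for start in range(0, len(column), values_per_cacheline):
        mask = 0
        for v in column[start:start + values_per_cacheline]:
            mask |= 1 << bisect.bisect_right(bin_edges, v)   # set the bit of v's bin
        imprints.append(mask)
    return imprints

def range_query(column, imprints, bin_edges, lo, hi, values_per_cacheline=8):
    # Bits of all histogram bins the range [lo, hi] can touch.
    lo_bin = bisect.bisect_right(bin_edges, lo)
    hi_bin = bisect.bisect_right(bin_edges, hi)
    query_mask = sum(1 << b for b in range(lo_bin, hi_bin + 1))
    result = []
    for i, mask in enumerate(imprints):
        if mask & query_mask:                                # skip cache lines that cannot match
            start = i * values_per_cacheline
            result.extend(v for v in column[start:start + values_per_cacheline]
                          if lo <= v <= hi)
    return result
```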

【Keywords】: Information systems; Data management systems; Data structures; Data access methods; Database management system engines; Database query processing; Main memory engines; Theory of computation; Theory and algorithms for application domains; Database theory; Database query processing and optimization (theory)

229. Towards Scalable UDTFs in Noria.

Paper Link】 【Pages】:2869-2871

【Authors】: Justus Adam

【Abstract】: User Defined Functions (UDFs) are an important and powerful extension point for database queries. Systems using incremental materialized views largely do not support UDFs because they cannot easily be incrementalized. In this work we design single-tuple UDF and User Defined Aggregate (UDA) interfaces for Noria, a state-of-the-art dataflow system with incremental materialized views. We also add limited support for User Defined Table Functions (UDTFs) by compiling them to query fragments. We show that our UDTFs scale by implementing a motivating example used by Friedman et al.
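
To see why incremental views constrain the aggregate interface, the hypothetical sketch below shows a user-defined aggregate that can process both inserted and deleted rows instead of only folding forward; it mirrors the general idea only and is not Noria's actual API.

```python
class IncrementalSum:
    """A UDA usable in an incrementally maintained view: it must apply
    positive deltas (inserts) and negative deltas (deletes) to its state."""
    def __init__(self):
        self.state = 0

    def apply(self, value, sign):
        # sign = +1 for an inserted row, -1 for a deleted row
        self.state += sign * value

    def output(self):
        return self.state

# The dataflow engine feeds the UDA row deltas as upstream data changes.
agg = IncrementalSum()
agg.apply(10, +1)   # row inserted
agg.apply(4, +1)    # row inserted
agg.apply(10, -1)   # row deleted
print(agg.output()) # -> 4
```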

【Keywords】: Information systems; Data management systems; Database management system engines; Parallel and distributed DBMSs; Relational parallel and distributed DBMSs; Query languages; Relational database query languages; Software and its engineering; Software notations and tools; General programming languages; Language types; Imperative languages; Parallel programming languages

230. Column Partition and Permutation for Run Length Encoding in Columnar Databases.

Paper Link】 【Pages】:2873-2874

【Authors】: Jia Shi

【Abstract】: Effective compression is essential when databases are used in Big Data applications. For in-memory columnar databases, compression can help to load the columns faster and speed up query evaluation. In this paper, we consider compressing columns using Run Length Encoding (RLE). This algorithm encodes each maximal region of identical values as a single run. The question we study in this paper is: how should table columns be rearranged for better compression? We observe that not every column of a table benefits from compression under an ideal column arrangement. Because finding the optimal column arrangement is NP-hard, we propose an incremental heuristic that identifies the set of columns to be compressed and a row order that offers a better compression ratio. Our preliminary experiments confirm that our algorithm improves the compression rate by up to 25% on test data, compared with compressing all columns of a table.
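
A minimal sketch of RLE itself (illustrative, not the paper's implementation) makes the role of row order apparent: each maximal run of identical values becomes a (value, length) pair, so a permutation that groups equal values together shrinks the encoding.

```python
def rle_encode(column):
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [tuple(r) for r in runs]

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

col = ["US", "US", "US", "DE", "DE", "US"]
print(rle_encode(col))      # -> [('US', 3), ('DE', 2), ('US', 1)]
# A row permutation that groups the trailing 'US' with the first run would
# reduce this to two runs, which is the kind of gain the paper optimizes for.
```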

【Keywords】: Information systems; Data management systems; Data structures; Data layout; Data compression

231. Supporting Database Constraints in Synthetic Data Generation based on Generative Adversarial Networks.

Paper Link】 【Pages】:2875-2877

【Authors】: Wanxin Li

【Abstract】: With the unprecedented development of machine learning algorithms, it is crucial to have large amounts of data available to verify the correctness and efficiency of these algorithms. Due to privacy concerns, we may not always have enough real data to use. In our research, we focus on data synthesization for relational databases, where the database constraints of the original data must be imposed on the generated data. To the best of our knowledge, no study has been conducted on supporting database constraints in synthetic data generation. We offer solutions by designing extensions to the Tabular Generative Adversarial Network algorithm. We implemented a prototype of our approach and compared the performance of the different extensions experimentally. Related work on synthetic data generation includes classical statistical methods and neural network approaches. Synthetic Data Vault is built on classical statistical methods; it uses the Kolmogorov-Smirnov test to select the best statistical distribution to describe columnar data. TableGAN and Tabular GAN use neural networks to minimize cross entropy or Kullback-Leibler divergence on marginal distributions. The main challenges for our research problem are that classical statistical distributions cannot describe the complex, mixed distributions found in relational databases, and that database constraints are non-differentiable, whereas neural networks require loss functions to be differentiable.
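
To make the differentiability issue concrete, the hypothetical sketch below contrasts a hard CHECK-style constraint with a soft, differentiable penalty that could be added to a generator's loss. This is a generic illustration of the tension described above, not the extensions proposed in the paper.

```python
import torch

def check_constraint_violated(price):          # hard constraint: not differentiable
    return bool(price < 0)

def soft_constraint_penalty(prices):           # differentiable relaxation
    # Penalize only the violating (negative) values; zero penalty when satisfied.
    return torch.relu(-prices).mean()

generated_prices = torch.tensor([3.0, -0.5, 1.2], requires_grad=True)
penalty = soft_constraint_penalty(generated_prices)
penalty.backward()                             # gradients flow back to the generator's outputs
print(penalty.item(), generated_prices.grad)
```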

【Keywords】: Computing methodologies; Machine learning; Learning paradigms; Supervised learning; Learning settings; Batch learning; Machine learning approaches; Learning in probabilistic graphical models; Maximum entropy modeling; Learning linear models; Perceptron algorithm; Mathematics of computing; Probability and statistics; Distribution functions; Probabilistic inference problems; Max marginal computation

232. An Evaluation of Methods of Compressing Doubles.

Paper Link】 【Pages】:2879-2881

【Authors】: Jacob Spiegel

【Abstract】: Data compression is a problem with far-reaching implications across science and industry. In the era of big data, methods for efficient compression are crucial to achieve compact data representation, low-latency data transfers, and high throughput during query execution. Due to the explosion of Internet-of-Things applications, a large portion of this data is in the form of double-precision floating-point numbers. Despite the plethora of compression methods, a comprehensive evaluation across real-world data and applications is still missing. In this paper, we perform such a comparison of methods and evaluate their performance in terms of compression ratio and throughput across two dataset repositories of time series and featurized machine-learning problems, as well as on a dataset of machine logs.
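
One widely used family of such methods is XOR-based floating-point compression (e.g., Gorilla-style): consecutive doubles often share sign, exponent, and high mantissa bits, so their bit-wise XOR has long zero runs that encode cheaply. The sketch below shows only this core XOR-delta step, as an assumed example of the kind of method such evaluations compare, not the paper's code.

```python
import struct

def double_to_bits(x):
    # Reinterpret a 64-bit float as an unsigned 64-bit integer.
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def xor_deltas(values):
    bits = [double_to_bits(v) for v in values]
    return [bits[0]] + [bits[i] ^ bits[i - 1] for i in range(1, len(bits))]

series = [21.50, 21.50, 21.51, 21.52]
for d in xor_deltas(series):
    print(f"{d:064b}")
# The repeated value yields an all-zero delta; the small changes leave most
# leading and trailing bits zero, which a real encoder stores compactly.
```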

【Keywords】: Information systems; Data management systems; Data structures; Data layout; Data compression; Information storage systems; Record storage systems; Relational storage; Compression strategies; Theory of computation; Design and analysis of algorithms; Data structures design and analysis; Data compression

233. MemFlow: Memory-Aware Distributed Deep Learning.

Paper Link】 【Pages】:2883-2885

【Authors】: Neil Band

【Abstract】: As the number of layers and the amount of training data increase, the trend is to train deep neural networks in parallel across devices. In such scenarios, neural network training is increasingly bottlenecked by the high memory requirements of intermediate results, or feature maps, that are produced during the forward pass and consumed during the backward pass. We recognize that the best-performing device parallelization configurations should consider memory usage in addition to the canonical metric of computation time. To this end, we introduce MemFlow, an optimization framework for distributed deep learning that jointly optimizes memory usage and computation time when searching for a parallelization strategy. MemFlow consists of: (i) a task graph with memory usage estimates; (ii) a memory-aware execution simulator; and (iii) a Markov Chain Monte Carlo search algorithm that considers various degrees of recomputation, i.e., discarding feature maps during the forward pass and recomputing them during the backward pass. Our experiments demonstrate that, under memory constraints, MemFlow readily locates valid and superior parallelization strategies unattainable with previous frameworks.
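
As a hedged illustration of the search component, the generic Metropolis-style loop below mutates a parallelization strategy, rejects candidates that exceed the memory budget, and otherwise accepts them probabilistically based on simulated runtime. The simulator and neighbor functions are placeholders, not MemFlow's implementation; in MemFlow the mutations would also cover recomputation decisions.

```python
import math, random

def mcmc_search(initial, neighbors, simulate_time, simulate_memory,
                memory_budget, steps=10_000, temperature=1.0):
    current = initial
    best, best_time = current, simulate_time(current)
    for _ in range(steps):
        candidate = neighbors(current)                 # random local mutation
        if simulate_memory(candidate) > memory_budget: # infeasible: reject outright
            continue
        t_cur, t_cand = simulate_time(current), simulate_time(candidate)
        # Metropolis acceptance: always accept improvements, sometimes accept regressions.
        if t_cand <= t_cur or random.random() < math.exp((t_cur - t_cand) / temperature):
            current = candidate
            if t_cand < best_time:
                best, best_time = candidate, t_cand
    return best, best_time
```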

【Keywords】: Computer systems organization; Architectures; Parallel architectures; Computing methodologies; Machine learning; Networks; Network performance evaluation; Network simulations

234. JSON Schema Matching: Empirical Observations.

Paper Link】 【Pages】:2887-2889

【Authors】: Kunal Waghray

【Abstract】: A database schema specifies the desired logical organization of the data it stores. A major challenge in database integration is schema matching [10, 12], which seeks to determine schema elements in different databases that correspond to the same real-world entity. For example, a simple matcher may determine that the attribute ID in one schema is semantically equivalent to Identification in another.
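
The kind of simple name-based matcher alluded to above can be sketched in a few lines (illustrative only; real matchers also exploit types, instance data, and structural context):

```python
from difflib import SequenceMatcher

def match_attributes(schema_a, schema_b, threshold=0.2):
    # Propose attribute pairs whose normalized names are sufficiently similar.
    matches = []
    for a in schema_a:
        for b in schema_b:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                matches.append((a, b, round(score, 2)))
    return sorted(matches, key=lambda m: -m[2])

print(match_attributes(["ID", "FullName"], ["Identification", "Name"]))
# -> [('FullName', 'Name', 0.67), ('ID', 'Identification', 0.25)]
```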

【Keywords】: Information systems; Data management systems; Information integration; World Wide Web; Web mining; Data extraction and integration