ACM SIGMOD Conference 2019: Amsterdam, The Netherlands

Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019. ACM 【DBLP Link】

Paper Num: 179 || Session Num: 28

SIGMOD Keynote 1 1

1. Responsible Data Science.

【Paper Link】 【Pages】:1

【Authors】: Lise Getoor

【Abstract】: Data science is an emerging discipline that offers both promise and peril. Responsible data science refers to efforts that address both the technical and societal issues in emerging data-driven technologies. How can machine learning and database systems reason effectively about complex dependencies and uncertainty? Furthermore, how do we understand the ethical and societal issues involved in data-driven decision-making? There is a pressing need to integrate algorithmic and statistical principles, social science theories, and basic humanist concepts so that we can think critically and constructively about the socio-technical systems we are building. In this talk, I will overview this emerging area, with an emphasis on relational learning.

【Keywords】: General and reference; Document types; Surveys and overviews

2. Exact Cardinality Query Optimization with Bounded Execution Cost.

【Paper Link】 【Pages】:2-17

【Authors】: Immanuel Trummer

【Abstract】: Query optimizers often produce sub-optimal query plans due to inaccurate cardinality estimates. The goal of exact cardinality query optimization (ECQO) is to produce optimal plans based on guaranteed exact cardinality values, obtained via probe executions. We propose a novel algorithm for ECQO. It improves over prior work by limiting the overheads of probe executions. Instead of fully generating each relation in the optimizer's plan space, it calculates cardinality bounds based on partially generated relations. Thereby, it is often able to prune relations early if they are too large to participate in any optimal plan. Our algorithm exploits dependencies between relations to propagate exclusions, it selects probing targets for maximum information gain, and it caps probing overheads by conservative bounds on execution cost. Those bounds are iteratively increased once lower bounds on optimal query execution cost have been established. The algorithm is non-intrusive and can be used on top of any database management system supporting SQL limit clauses. We formally prove that probing costs are bounded as a function of the optimal execution cost and evaluate our algorithm experimentally. Our algorithm is on average six times and up to 69 times faster than a state-of-the-art baseline for ECQO on the recently proposed join order benchmark.
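
【Code Sketch】: The probing primitive the abstract relies on can be built from plain SQL LIMIT clauses. Below is a minimal Python/sqlite3 sketch, assuming an illustrative table; the paper's algorithm additionally chooses probe targets and cost caps adaptively, which is not shown.

    import sqlite3

    def exceeds(conn, subquery, bound):
        """Probe whether a relation's cardinality exceeds `bound` without
        materializing it fully: bound + 1 rows suffice to decide."""
        sql = f"SELECT COUNT(*) FROM ({subquery} LIMIT {bound + 1})"
        (n,) = conn.execute(sql).fetchone()
        return n > bound

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE r(a INT)")
    conn.executemany("INSERT INTO r VALUES (?)", [(i,) for i in range(100)])
    # prune r from the plan space early if it is too large to be useful
    print(exceeds(conn, "SELECT a FROM r", 50))   # True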

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Query planning

3. Pessimistic Cardinality Estimation: Tighter Upper Bounds for Intermediate Join Cardinalities.

【Paper Link】 【Pages】:18-35

【Authors】: Walter Cai ; Magdalena Balazinska ; Dan Suciu

【Abstract】: In this work we introduce a novel approach to the problem of cardinality estimation over multijoin queries. Our approach leverages randomized hashing and data sketching to tighten upper bounds on intermediate join cardinalities beyond the current state of the art. We demonstrate that the bounds can be injected directly into the cost-based query optimizer framework, enabling it to avoid expensive physical join plans. We outline our base data structures and methodology, and how these bounds may be introduced to the optimizer's parameterized cost function as a new statistic for physical join plan selection. We demonstrate a complex tradeoff space between the tightness of our bounds and the size and complexity of our data structures. This space is not always monotonic as one might expect. In order to combat this non-monotonicity, we introduce a partition budgeting scheme that guarantees monotonic behavior. We evaluate our methods on GooglePlus community graphs and the Join Order Benchmark (JOB). In the presence of foreign key indexes, we demonstrate a 1.7x improvement in aggregate (time summed over all queries in the benchmark) physical query plan runtime compared to plans chosen by Postgres using the default cardinality estimation methods. When foreign key indexes are absent, this advantage improves to over 10x.
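
【Code Sketch】: A minimal sketch of the basic degree-based upper bound on join cardinality that this line of work starts from; the paper's contribution, tightening such bounds with randomized hashing, sketching, and partition budgeting, is not shown here.

    from collections import Counter

    def join_upper_bound(R, S, key_r, key_s):
        """Upper-bound |R JOIN S| without executing the join: every R-tuple
        matches at most max-degree(S) partners, and vice versa."""
        deg_r = Counter(t[key_r] for t in R)
        deg_s = Counter(t[key_s] for t in S)
        return min(len(R) * max(deg_s.values(), default=0),
                   len(S) * max(deg_r.values(), default=0))

    R = [{"k": 1}, {"k": 1}, {"k": 2}]
    S = [{"k": 1}, {"k": 2}, {"k": 2}, {"k": 3}]
    print(join_upper_bound(R, S, "k", "k"))  # bound 6; the true size is 4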

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Theory of computation; Design and analysis of algorithms; Streaming, sublinear and near linear time algorithms; Bloom filters and hashing; Sketching and sampling; Theory and algorithms for application domains; Database theory; Data modeling; Database query processing and optimization (theory)

4. Efficiently Searching In-Memory Sorted Arrays: Revenge of the Interpolation Search?

【Paper Link】 【Pages】:36-53

【Authors】: Peter Van Sandt ; Yannis Chronis ; Jignesh M. Patel

【Abstract】: In this paper, we focus on the problem of searching sorted, in-memory datasets. This is a key data operation, and Binary Search is the de facto algorithm that is used in practice. We consider an alternative, namely Interpolation Search, which can take advantage of hardware trends by using complex calculations to save memory accesses. Historically, Interpolation Search was found to underperform compared to other search algorithms in this setting, despite its superior asymptotic complexity. Also, Interpolation Search is known to perform poorly on non-uniform data. To address these issues, we introduce SIP (Slope reuse Interpolation), an optimized implementation of Interpolation Search, and TIP (Three point Interpolation), a new search algorithm that uses linear fractions to interpolate on non-uniform distributions. We evaluate these two algorithms against a similarly optimized Binary Search method using a variety of real and synthetic datasets. We show that SIP is up to 4 times faster on uniformly distributed data and TIP is 2-3 times faster on non-uniformly distributed data in some cases. We also design a meta-algorithm to switch between these different methods to automatically pick the higher-performing search algorithm, which depends on factors like data distribution.
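
【Code Sketch】: For reference, textbook Interpolation Search, the algorithm that SIP and TIP build on; slope reuse and three-point interpolation are not shown, so this is a baseline sketch rather than the paper's method.

    def interpolation_search(a, key):
        """Search a sorted list by estimating the key's position from the
        value range, instead of always probing the middle as binary search does."""
        lo, hi = 0, len(a) - 1
        while lo <= hi and a[lo] <= key <= a[hi]:
            if a[hi] == a[lo]:                            # avoid division by zero
                pos = lo
            else:                                         # linear interpolation
                pos = lo + (key - a[lo]) * (hi - lo) // (a[hi] - a[lo])
            if a[pos] == key:
                return pos
            if a[pos] < key:
                lo = pos + 1
            else:
                hi = pos - 1
        return -1

    print(interpolation_search(list(range(0, 1000, 7)), 700))  # 100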

【Keywords】: Information systems; Data management systems; Data structures; Data access methods; Point lookups; Database management system engines; Main memory engines

5. Iterative Query Processing based on Unified Optimization Techniques.

【Paper Link】 【Pages】:54-68

【Authors】: Kisung Park ; Hojin Seo ; Mostofa Kamal Rasel ; Young-Koo Lee ; Chanho Jeong ; Sung Yeol Lee ; Chungmin Lee ; Dong-Hun Lee

【Abstract】: Hybrid transactional and analytical processing (HTAP) systems like SAP HANA make it much simpler to manage both operational load and analytical queries without ETL, separate data warehouses, and the like. To represent both transactional and analytical business logic in a single database system, stored procedures are often used to express analytical queries using control flow logic and DMLs. Optimizing these complex procedures requires a fair knowledge of imperative programming languages as well as the declarative query language. Therefore, unified optimization techniques considering both program and query optimization techniques are essential for achieving optimal query performance. In this paper, we propose a novel unified optimization technique for efficient iterative query processing. We present a notion of query motion that allows the movement of SQL queries in and out of a loop. Additionally, we exploit a new cost model that measures the quality of the execution plan with consideration for queries and loop iterations. We describe our experimental evaluation that demonstrates the benefit of our technique using both a standard decision support benchmark and real-world workloads. An extensive evaluation shows that our unified optimization technique enumerates plans that run up to an order of magnitude faster than plans generated by the existing loop-invariant code motion technique.
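
【Code Sketch】: A minimal sketch of query motion in its simplest, loop-invariant form; run_sql, the tables, and the queries are hypothetical stand-ins, and the paper's cost-based technique can also move queries into loops when that is cheaper.

    def run_sql(query, *params):
        return 1.0   # hypothetical stand-in for a database call

    customer_ids = [1, 2, 3]

    # Before: the FX-rate query does not depend on the loop variable,
    # so it is executed redundantly on every iteration.
    for cid in customer_ids:
        rate = run_sql("SELECT rate FROM fx WHERE cur = 'EUR'")
        bal = run_sql("SELECT bal FROM acct WHERE id = ?", cid)
        print(bal * rate)

    # After query motion: the invariant query is hoisted out of the loop.
    rate = run_sql("SELECT rate FROM fx WHERE cur = 'EUR'")
    for cid in customer_ids:
        bal = run_sql("SELECT bal FROM acct WHERE id = ?", cid)
        print(bal * rate)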

【Keywords】: Theory of computation; Theory and algorithms for application domains; Database theory; Database query processing and optimization (theory)

6. Approximate Distinct Counts for Billions of Datasets.

【Paper Link】 【Pages】:69-86

【Authors】: Daniel Ting

【Abstract】: Cardinality estimation plays an important role in processing big data. We consider the challenging problem of computing millions or more distinct count aggregations in a single pass and allowing these aggregations to be further combined into coarser aggregations. These arise naturally in many applications including networking, databases, and real-time business reporting. We demonstrate existing approaches to solve this problem are inherently flawed, exhibiting bias that can be arbitrarily large, and propose new methods for solving this problem that have theoretical guarantees of correctness and tight, practical error estimates. This is achieved by carefully combining CountMin and HyperLogLog sketches and a theoretical analysis using statistical estimation techniques. These methods also advance cardinality estimation for individual multisets, as they provide a provably consistent estimator and tight confidence intervals that have exactly the correct asymptotic coverage.
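
【Code Sketch】: A bare-bones HyperLogLog sketch, one of the two building blocks the abstract mentions; the CountMin combination, bias corrections, and the paper's estimators are not shown, so treat this as background rather than the proposed method.

    import hashlib

    class HLL:
        """Minimal HyperLogLog distinct-count sketch (no range corrections)."""
        def __init__(self, p=12):
            self.p, self.m = p, 1 << p
            self.reg = [0] * self.m

        def add(self, item):
            h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
            idx = h >> (64 - self.p)                       # first p bits pick a register
            rest = h & ((1 << (64 - self.p)) - 1)
            rank = (64 - self.p) - rest.bit_length() + 1   # leading-zero rank
            self.reg[idx] = max(self.reg[idx], rank)

        def estimate(self):
            alpha = 0.7213 / (1 + 1.079 / self.m)
            return alpha * self.m ** 2 / sum(2.0 ** -r for r in self.reg)

    h = HLL()
    for i in range(100000):
        h.add(i)
    print(round(h.estimate()))   # close to 100000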

【Keywords】: Information systems; Data management systems; Mathematics of computing; Probability and statistics; Probabilistic algorithms

7. Cache-oblivious High-performance Similarity Join.

【Paper Link】 【Pages】:87-104

【Authors】: Martin Perdacher ; Claudia Plant ; Christian Böhm

【Abstract】: A similarity join combines vectors based on a distance condition. Typically, such algorithms apply a filter step (by indexing or sorting) and then refine pairs of candidate vectors. In this paper, we propose to refine the pairs in an order defined by a space-filling curve, which dramatically improves data locality. Modern multi-core microprocessors are supported by a deep memory hierarchy including RAM, various levels of cache, and registers. The space-filling curve makes our proposed algorithm cache-oblivious to fully exploit the memory hierarchy and to reach the possible peak performance of a multi-core processor. Our novel space-filling curve called Fast General Form (FGF) Hilbert solves a number of limitations of well-known approaches: it is non-recursive, it is not restricted to traverse squares, and it has a constant time and space complexity. As we demonstrate the easy transformation from conventional into cache-oblivious loops, we believe that many algorithms for complex joins and other database operators could be transformed systematically into cache-oblivious SIMD and MIMD parallel algorithms.
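
【Code Sketch】: The standard iterative index-to-coordinate mapping for the Hilbert curve, to illustrate the traversal order behind the cache benefit; the paper's FGF-Hilbert generalizes this to non-square regions, which is not shown.

    def d2xy(n, d):
        """Map index d on the Hilbert curve to (x, y) in an n x n grid
        (n a power of two); visiting pairs in this order keeps successive
        pairs close together in memory."""
        x = y = 0
        s, t = 1, d
        while s < n:
            rx = 1 & (t // 2)
            ry = 1 & (t ^ rx)
            if ry == 0:                     # rotate the quadrant
                if rx == 1:
                    x, y = s - 1 - x, s - 1 - y
                x, y = y, x
            x += s * rx
            y += s * ry
            t //= 4
            s *= 2
        return x, y

    print([d2xy(4, d) for d in range(16)])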

【Keywords】: Computing methodologies; Parallel computing methodologies; Parallel algorithms; Shared memory algorithms; Information systems; Data management systems; Database management system engines; Database query processing; Join algorithms; Information systems applications; Data mining; Clustering; Nearest-neighbor search

Research 2: Privacy/Blockchain 6

8. Blurring the Lines between Blockchains and Database Systems: the Case of Hyperledger Fabric.

【Paper Link】 【Pages】:105-122

【Authors】: Ankur Sharma ; Felix Martin Schuhknecht ; Divya Agrawal ; Jens Dittrich

【Abstract】: Within the last few years, a countless number of blockchain systems have emerged on the market, each one claiming to revolutionize the way of distributed transaction processing in one way or the other. Many blockchain features, such as Byzantine fault tolerance, are indeed valuable additions in modern environments. However, despite all the hype around the technology, many of the challenges that blockchain systems have to face are fundamental transaction management problems. These are largely shared with traditional database systems, which have been around for decades already. These similarities become especially visible for systems that blur the lines between blockchain systems and classical database systems. A great example of this is Hyperledger Fabric, an open-source permissioned blockchain system under development by IBM. By implementing parallel transaction processing, Fabric's workflow is highly motivated by optimistic concurrency control mechanisms in classical database systems. This raises two questions: (1) Which conceptual similarities and differences actually exist between a system such as Fabric and a classical distributed database system? (2) Is it possible to improve on the performance of Fabric by transitioning technology from the database world to blockchains and thus blurring the lines between these two types of systems even further? To tackle these questions, we first explore Fabric from the perspective of database research, where we observe weaknesses in the transaction pipeline. We then solve these issues by transitioning well-understood database concepts to Fabric, namely transaction reordering as well as early transaction abort. Our experimental evaluation under the Smallbank benchmark as well as under a custom workload shows that our improved version Fabric++ significantly increases the throughput of successful transactions over the vanilla version by up to a factor of 12, while decreasing the average latency to almost half.
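
【Code Sketch】: A much-simplified sketch of within-block validation with early abort, the mechanism Fabric++ improves by reordering; real Fabric validates read-set versions against the ledger state rather than bare keys, so this is illustrative only.

    def validate_block(txns):
        """Serially validate an ordered block: a transaction aborts if a key
        it read was overwritten by an earlier successful transaction in the
        same block. Reordering the block to minimize such conflicts is the
        idea behind Fabric++."""
        dirty = set()
        outcome = []
        for read_set, write_set in txns:
            ok = dirty.isdisjoint(read_set)
            outcome.append(ok)
            if ok:
                dirty |= write_set
        return outcome

    t1 = ({"b"}, {"a"})   # reads b, writes a
    t2 = ({"a"}, {"c"})   # reads a, writes c
    print(validate_block([t1, t2]))  # [True, False]: t2 read what t1 wrote
    print(validate_block([t2, t1]))  # [True, True]:  reordering saves t2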

【Keywords】: Information systems; Data management systems; Database management system engines; Distributed database transactions

9. Towards Scaling Blockchain Systems via Sharding.

【Paper Link】 【Pages】:123-140

【Authors】: Hung Dang ; Tien Tuan Anh Dinh ; Dumitrel Loghin ; Ee-Chien Chang ; Qian Lin ; Beng Chin Ooi

【Abstract】: Existing blockchain systems scale poorly because of their distributed consensus protocols. Current attempts at improving blockchain scalability are limited to cryptocurrency. Scaling blockchain systems under general workloads (i.e., non-cryptocurrency applications) remains an open question. This work takes a principled approach to apply sharding to blockchain systems in order to improve their transaction throughput at scale. This is challenging, however, due to the fundamental difference in failure models between databases and blockchain. To achieve our goal, we first enhance the performance of Byzantine consensus protocols, improving individual shards' throughput. Next, we design an efficient shard formation protocol that securely assigns nodes into shards. We rely on trusted hardware, namely Intel SGX, to achieve high performance for both consensus and shard formation protocol. Third, we design a general distributed transaction protocol that ensures safety and liveness even when transaction coordinators are malicious. Finally, we conduct an extensive evaluation of our design both on a local cluster and on Google Cloud Platform. The results show that our consensus and shard formation protocols outperform state-of-the-art solutions at scale. More importantly, our sharded blockchain reaches a high throughput that can handle Visa-level workloads, and is the largest ever reported in a realistic environment.

【Keywords】: Hardware; Robustness; Fault tolerance; Information systems; Data management systems; Database management system engines; Distributed database transactions; Parallel and distributed DBMSs; Security and privacy; Security in hardware; Hardware security implementation; Hardware-based security protocols

10. vChain: Enabling Verifiable Boolean Range Queries over Blockchain Databases.

【Paper Link】 【Pages】:141-158

【Authors】: Cheng Xu ; Ce Zhang ; Jianliang Xu

【Abstract】: Blockchains have recently been under the spotlight due to the boom of cryptocurrencies and decentralized applications. There is an increasing demand for querying the data stored in a blockchain database. To ensure query integrity, the user can maintain the entire blockchain database and query the data locally. However, this approach is not economical, if not infeasible, because of the blockchain's huge data size and considerable maintenance costs. In this paper, we take the first step toward investigating the problem of verifiable query processing over blockchain databases. We propose a novel framework, called vChain, that alleviates the storage and computing costs of the user and employs verifiable queries to guarantee the results' integrity. To support verifiable Boolean range queries, we propose an accumulator-based authenticated data structure that enables dynamic aggregation over arbitrary query attributes. Two new indexes are further developed to aggregate intra-block and inter-block data records for efficient query verification. We also propose an inverted prefix tree structure to accelerate the processing of a large number of subscription queries simultaneously. Security analysis and empirical study validate the robustness and practicality of the proposed techniques.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Security and privacy; Systems security; Distributed systems security

11. Answering Multi-Dimensional Analytical Queries under Local Differential Privacy.

【Paper Link】 【Pages】:159-176

【Authors】: Tianhao Wang ; Bolin Ding ; Jingren Zhou ; Cheng Hong ; Zhicong Huang ; Ninghui Li ; Somesh Jha

【Abstract】: Multi-dimensional analytical (MDA) queries are often issued against a fact table with predicates on (categorical or ordinal) dimensions and aggregations on one or more measures. In this paper, we study the problem of answering MDA queries under local differential privacy (LDP). In the absence of a trusted agent, sensitive dimensions are encoded in a privacy-preserving (LDP) way locally before being sent to the data collector. The data collector estimates the answers to MDA queries, based on the encoded dimensions. We propose several LDP encoders and estimation algorithms, to handle a large class of MDA queries with different types of predicates and aggregation functions. Our techniques are able to answer these queries with tight error bounds and scale well in high-dimensional settings (i.e., error is polylogarithmic in dimension sizes). We conduct experiments on real and synthetic data to verify our theoretical results, and compare our solution with marginal-estimation based solutions.
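
【Code Sketch】: A classic LDP frequency oracle, k-ary randomized response with its unbiased estimator, sketched for one categorical dimension; the paper's encoders and estimators for MDA predicates and aggregates go well beyond this baseline.

    import math, random
    from collections import Counter

    def grr_perturb(value, domain, eps):
        """Report the true value with probability p, otherwise a uniformly
        random different value; this satisfies eps-LDP."""
        k = len(domain)
        p = math.exp(eps) / (math.exp(eps) + k - 1)
        if random.random() < p:
            return value
        return random.choice([v for v in domain if v != value])

    def grr_frequencies(reports, domain, eps):
        """Debias the observed counts into unbiased frequency estimates."""
        k, n = len(domain), len(reports)
        p = math.exp(eps) / (math.exp(eps) + k - 1)
        q = 1.0 / (math.exp(eps) + k - 1)
        counts = Counter(reports)
        return {v: (counts[v] - n * q) / (p - q) for v in domain}

    domain = ["a", "b", "c", "d"]
    data = ["a"] * 6000 + ["b"] * 3000 + ["c"] * 1000
    reports = [grr_perturb(v, domain, eps=2.0) for v in data]
    print(grr_frequencies(reports, domain, eps=2.0))  # roughly 6000/3000/1000/0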

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Mathematics of computing; Probability and statistics; Probabilistic algorithms; Security and privacy; Human and societal aspects of security and privacy; Privacy protections; Security services; Privacy-preserving protocols; Theory of computation; Design and analysis of algorithms; Theory and algorithms for application domains; Database theory; Data structures and algorithms for data management; Theory of database privacy and security

12. APEx: Accuracy-Aware Differentially Private Data Exploration.

【Paper Link】 【Pages】:177-194

【Authors】: Chang Ge ; Xi He ; Ihab F. Ilyas ; Ashwin Machanavajjhala

【Abstract】: Organizations are increasingly interested in allowing external data scientists to explore their sensitive datasets. Due to the popularity of differential privacy, data owners want the data exploration to ensure provable privacy guarantees. However, current systems for answering queries with differential privacy place an inordinate burden on the data analysts to understand differential privacy, manage their privacy budget, and even implement new algorithms for noisy query answering. Moreover, current systems do not provide any guarantees to the data analyst on the quality they care about, namely accuracy of query answers. We present APEx, a novel system that allows data analysts to pose adaptively chosen sequences of queries along with required accuracy bounds. By translating queries and accuracy bounds into differentially private algorithms with the least privacy loss, APEx returns query answers to the data analyst that meet the accuracy bounds, and proves to the data owner that the entire data exploration process is differentially private. Our comprehensive experimental study on real datasets demonstrates that APEx can answer a variety of queries accurately with moderate to small privacy loss, and can support data exploration for entity resolution with high accuracy under reasonable privacy settings.
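
【Code Sketch】: The core accuracy-to-privacy translation for a single count query under the Laplace mechanism, assuming sensitivity 1: the smallest eps with P(|noise| > alpha) <= beta is ln(1/beta)/alpha. APEx generalizes this kind of translation to adaptively chosen query sequences and multiple mechanisms, which is not shown.

    import math, random

    def min_epsilon(alpha, beta, sensitivity=1.0):
        """Least privacy loss meeting an (alpha, beta) accuracy bound,
        since P(|Laplace(1/eps)| > alpha) = exp(-alpha * eps)."""
        return sensitivity * math.log(1.0 / beta) / alpha

    def noisy_count(true_count, alpha, beta):
        eps = min_epsilon(alpha, beta)
        b = 1.0 / eps                        # Laplace scale for sensitivity 1
        u = random.random() - 0.5            # inverse-CDF Laplace sampling
        noise = -b * math.copysign(math.log(1 - 2 * abs(u)), u)
        return true_count + noise, eps

    print(noisy_count(1000, alpha=10, beta=0.05))  # within +/-10 w.p. >= 0.95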

【Keywords】: Information systems; Data management systems; Information integration; Entity resolution; Security and privacy; Database and storage security; Data anonymization and sanitization

13. Active Sparse Mobile Crowd Sensing Based on Matrix Completion.

【Paper Link】 【Pages】:195-210

【Authors】: Kun Xie ; Xiaocan Li ; Xin Wang ; Gaogang Xie ; Jigang Wen ; Dafang Zhang

【Abstract】: A major factor that prevents the large scale deployment of Mobile Crowd Sensing (MCS) is its sensing and communication cost. Given the spatio-temporal correlation among environment monitoring data, matrix completion (MC) can be exploited to monitor only a small subset of locations and times, and infer the remaining data. Rather than only taking random measurements following the basic MC theory, to further reduce the cost of MCS while ensuring the quality of missing data inference, we propose an Active Sparse MCS (AS-MCS) scheme which includes a bipartite-graph-based sensing scheduling scheme to actively determine the sampling positions in each upcoming time slot, and a bipartite-graph-based matrix completion algorithm to robustly and accurately recover the un-sampled data in the presence of sensing and communication errors. We also incorporate the sensing cost into the bipartite graph to facilitate low-cost sample selection and consider the incentives for MCS. We have conducted extensive performance studies using data sets from the monitoring of PM 2.5 air quality and road traffic speed, respectively. Our results demonstrate that our AS-MCS scheme can recover the missing data at very high accuracy with a sampling ratio of only around 11%, while peer matrix completion algorithms with similar recovery performance require up to 4-9 times as many samples on both data sets.
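
【Code Sketch】: A minimal matrix-completion baseline that fits a low-rank factorization to the observed entries by gradient descent; AS-MCS's active, bipartite-graph-based sampling and error-robust recovery are not shown, and all parameters below are illustrative.

    import numpy as np

    def complete(M, mask, rank=3, lr=0.02, iters=3000, reg=0.05):
        """Recover missing entries of M (mask == 1 where observed) via U V^T."""
        m, n = M.shape
        rng = np.random.default_rng(0)
        U = rng.normal(scale=0.1, size=(m, rank))
        V = rng.normal(scale=0.1, size=(n, rank))
        for _ in range(iters):
            E = mask * (U @ V.T - M)          # residual on sampled positions only
            U -= lr * (E @ V + reg * U)
            V -= lr * (E.T @ U + reg * V)
        return U @ V.T

    rng = np.random.default_rng(1)
    truth = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 40))   # rank-2 field
    mask = (rng.random(truth.shape) < 0.3).astype(float)          # ~30% sampled
    est = complete(truth * mask, mask)
    print(np.abs(est - truth).mean())   # small reconstruction error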

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Factorization methods; Non-negative matrix factorization

Research 3: Information Extraction 6

14. Autocompletion for Prefix-Abbreviated Input.

【Paper Link】 【Pages】:211-228

【Authors】: Sheng Hu ; Chuan Xiao ; Jianbin Qin ; Yoshiharu Ishikawa ; Qiang Ma

【Abstract】: Query autocompletion (QAC) is an important interactive feature that assists users in formulating queries and saving keystrokes. Due to the convenience it brings to users, QAC has been adopted in many applications, including Web search engines, integrated development environments (IDEs), and mobile devices. For existing QAC methods, users have to manually type delimiters to separate keywords in their inputs. In this paper, we propose a novel QAC paradigm through which users may abbreviate keywords by prefixes and do not have to explicitly separate them. Such a paradigm is useful for applications where it is inconvenient to specify delimiters, such as desktop search, text editors, and input method editors. E.g., in an IDE, users may input getnev and we suggest GetNextValue. We show that the query processing method for traditional QAC, which utilizes a trie index, is inefficient under the new problem setting. A novel indexing and query processing scheme is hence proposed to efficiently complete queries. To suggest meaningful results, we devise a ranking method based on a Gaussian mixture model, taking into consideration the way in which users abbreviate keywords, as opposed to the traditional ranking method that merely considers popularity. Efficient top-k query processing techniques are developed on top of the new index structure. Experiments demonstrate the effectiveness of the new QAC paradigm and the efficiency of the proposed query processing method.
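
【Code Sketch】: A minimal sketch of the matching problem behind the new paradigm: decide whether an undelimited input like "getnev" abbreviates a candidate such as GetNextValue by segmenting camelCase words and greedily consuming word prefixes. The paper's index structures and Gaussian-mixture ranking are not shown.

    import re

    def words(identifier):
        """Split camelCase: 'GetNextValue' -> ['get', 'next', 'value']."""
        return [w.lower() for w in re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", identifier)]

    def matches(abbrev, candidate):
        """True if abbrev is a concatenation of non-empty prefixes of the
        candidate's leading words, tried in order."""
        ws = words(candidate)
        def rec(i, j):
            if i == len(abbrev):
                return True
            if j == len(ws):
                return False
            w = ws[j]
            for l in range(1, min(len(w), len(abbrev) - i) + 1):
                if abbrev[i:i + l] == w[:l] and rec(i + l, j + 1):
                    return True
            return False
        return rec(0, 0)

    print(matches("getnev", "GetNextValue"))  # True: "get" + "ne" + "v"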

【Keywords】: Information systems; Information retrieval; Information retrieval query processing; Query suggestion

15. Progressive Deep Web Crawling Through Keyword Queries For Data Enrichment.

【Paper Link】 【Pages】:229-246

【Authors】: Pei Wang ; Ryan Shea ; Jiannan Wang ; Eugene Wu

【Abstract】: Data enrichment is the act of extending a local database with new attributes from external data sources. In this paper, we study a novel problem: how to progressively crawl the deep web (i.e., a hidden database) through a keyword-search API to enrich a local database in an effective way. This is challenging because these interfaces often limit the data access by enforcing the top-k constraint or limiting the number of queries that can be issued within a time window. In response, we propose SmartCrawl, a new framework to collect results effectively. Given a query budget b, SmartCrawl first constructs a query pool based on the local database, and then iteratively issues a set of most beneficial queries to the hidden database such that the union of the query results can cover the maximum number of local records. The key technical challenge is how to estimate query benefit, i.e., the number of local records that can be covered by a given query. A simple approach is to estimate it as the query frequency in the local database. We find that this is ineffective due to i) the impact of |ΔD|, where |ΔD| represents the number of local records that cannot be found in the hidden database, and ii) the top-k constraint enforced by the hidden database. We study how to mitigate the negative impacts of the two factors and propose effective optimization techniques to improve performance. The experimental results show that on both simulated and real-world hidden databases, SmartCrawl significantly increases coverage over the local database as compared to the baselines.

【Keywords】: Information systems; Data management systems; Information integration; Extraction, transformation and loading

16. Visual Segmentation for Information Extraction from Heterogeneous Visually Rich Documents.

【Paper Link】 【Pages】:247-262

【Authors】: Ritesh Sarkhel ; Arnab Nandi

【Abstract】: Physical and digital documents often contain visually rich information. With such information, there is no strict ordering or positioning in the document where the data values must appear. Along with textual cues, these documents often also rely on salient visual features to define distinct semantic boundaries and augment the information they disseminate. When performing information extraction (IE), traditional techniques fall short, as they use a text-only representation and do not consider the visual cues inherent to the layout of these documents. We propose VS2, a generalized approach for information extraction from heterogeneous visually rich documents. There are two major contributions of this work. First, we propose a robust segmentation algorithm that decomposes a visually rich document into a bag of visually isolated but semantically coherent areas, called logical blocks. Document type agnostic low-level visual and semantic features are used in this process. Our second contribution is a distantly supervised search-and-select method for identifying the named entities within these documents by utilizing the context boundaries defined by these logical blocks. Experimental results on three heterogeneous datasets suggest that the proposed approach significantly outperforms its text-only counterparts on all datasets. Comparison against state-of-the-art methods also reveals that VS2 performs comparably or better on all datasets.

【Keywords】: Information systems; Data management systems; Information integration; Entity resolution; Information retrieval; Document representation; Content analysis and feature selection; Document structure; Retrieval tasks and goals; Information extraction

17. RRR: Rank-Regret Representative.

【Paper Link】 【Pages】:263-280

【Authors】: Abolfazl Asudeh ; Azade Nazi ; Nan Zhang ; Gautam Das ; H. V. Jagadish

【Abstract】: Selecting the best items in a dataset is a common task in data exploration. However, the concept of "best" lies in the eyes of the beholder: different users may consider different attributes more important, and hence arrive at different rankings. Nevertheless, one can remove "dominated" items and create a "representative" subset of the data, comprising the "best items" in it. A Pareto-optimal representative is guaranteed to contain the best item of each possible ranking, but it can be a large portion of the data. A much smaller representative can be found if we relax the requirement to include the best item for each user, and instead just limit the users' "regret". Existing work defines regret as the loss in score by limiting consideration to the representative instead of the full data set, for any chosen ranking function. However, the score is often not a meaningful number and users may not understand its absolute value. Sometimes small ranges in score can include large fractions of the data set. In contrast, users do understand the notion of rank ordering. Therefore, we consider the position of the items in the ranked list for defining the regret and propose the rank-regret representative as the minimal subset of the data containing at least one of the top-k of any possible ranking function. This problem is NP-complete. We use a geometric interpretation of items to bound their ranks on ranges of functions and to utilize combinatorial geometry notions for developing effective and efficient approximation algorithms for the problem. Experiments on real datasets demonstrate that we can efficiently find small subsets with small rank-regrets.

【Keywords】: Information systems; Information retrieval; Retrieval models and ranking; Top-k retrieval in databases; Theory of computation; Randomness, geometry and discrete structures; Computational geometry

18. Strongly Truthful Interactive Regret Minimization.

【Paper Link】 【Pages】:281-298

【Authors】: Min Xie ; Raymond Chi-Wing Wong ; Ashwin Lall

【Abstract】: When faced with a database containing millions of tuples, an end user might be only interested in finding his/her (close to) favorite tuple in the database. Recently, a regret minimization query was proposed to obtain a small subset from the database that fits the user's needs, which are expressed through an unknown utility function. Specifically, it minimizes the "regret" level of a user, which we quantify as the regret ratio if s/he gets the best tuple in the selected subset but not the best tuple among all tuples in the database. We study how to enhance the regret minimization query with user interactions: when presented with a small number of tuples (which can be artificial tuples or true tuples inside the database), a user is asked to indicate the tuple s/he favors the most among them. In particular, we are also interested in the special case of determining the favorite tuple for a user in the entire database with a small amount of interaction, measured by the number of questions we ask the user. Different from previous work, which displays artificial tuples to users, we achieve a stronger result in this paper by always displaying true tuples in the database. Specifically, we present a generic framework for interactive regret minimization, under which we propose algorithms that ask an asymptotically optimal number of questions in 2-dimensional spaces and algorithms with provable performance guarantees in d-dimensional spaces (d ≥ 2), where each dimension corresponds to a description of a tuple. Experiments on real and synthetic datasets show that our algorithms outperform the existing one by locating the favorite tuple and guaranteeing a small regret ratio with far fewer questions.

【Keywords】: Information systems; Information systems applications; Decision support systems; Data analytics

19. Verifying Text Summaries of Relational Data Sets.

【Paper Link】 【Pages】:299-316

【Authors】: Saehan Jo ; Immanuel Trummer ; Weicheng Yu ; Xuezhi Wang ; Cong Yu ; Daniel Liu ; Niyati Mehta

【Abstract】: We present a novel natural language query interface, the AggChecker, aimed at text summaries of relational data sets. The tool focuses on natural language claims that translate into an SQL query and a claimed query result. Similar in spirit to a spell checker, the AggChecker marks up text passages that seem to be inconsistent with the actual data. At the heart of the system is a probabilistic model that reasons about the input document in a holistic fashion. Based on claim keywords and the document structure, it maps each text claim to a probability distribution over associated query translations. By efficiently executing tens to hundreds of thousands of candidate translations for a typical input document, the system maps text claims to correctness probabilities. This process becomes practical via a specialized processing backend, avoiding redundant work via query merging and result caching. Verification is an interactive process in which users are shown tentative results, enabling them to take corrective actions if necessary. We tested our system on 53 publicly available articles containing 392 claims. Our tool revealed erroneous claims in roughly a third of test cases. Also, AggChecker compares favorably against several automated and semi-automated fact checking baselines.

【Keywords】: Applied computing; Document management and text processing; Human-centered computing; Human computer interaction (HCI); Interaction paradigms; Natural language interfaces; Information systems; Data management systems; Information integration; Query languages

Industry 1: Data Applications 6

20. QuickInsights: Quick and Automatic Discovery of Insights from Multi-Dimensional Data.

【Paper Link】 【Pages】:317-332

【Authors】: Rui Ding ; Shi Han ; Yong Xu ; Haidong Zhang ; Dongmei Zhang

【Abstract】: Discovering interesting data patterns is a common and important analytical need in data analysis, with increasing user demand for automated discovery abilities. However, automatically discovering interesting patterns from multi-dimensional data remains challenging. Existing techniques focus on mining individual types of patterns. There is a lack of unified formulation for different pattern types, as well as general mining frameworks to derive them effectively and efficiently. We present a novel technique QuickInsights, which quickly and automatically discovers interesting patterns from multi-dimensional data. QuickInsights proposes a unified formulation of interesting patterns, called insights, and designs a systematic mining framework to discover high-quality insights efficiently. We demonstrate the effectiveness and efficiency of QuickInsights through our evaluation on 447 real datasets as well as user studies on both expert users and non-expert users. QuickInsights is released in Microsoft Power BI.

【Keywords】: Computing methodologies; Artificial intelligence; Information systems; Information systems applications; Data mining

21. ExplainIt! - A Declarative Root-cause Analysis Engine for Time Series Data.

【Paper Link】 【Pages】:333-348

【Authors】: Vimalkumar Jeyakumar ; Omid Madani ; Ali Parandeh ; Ashutosh Kulshreshtha ; Weifei Zeng ; Navindra Yadav

【Abstract】: We present ExplainIt!, a declarative, unsupervised root-cause analysis engine that uses time series monitoring data from large complex systems such as data centres. ExplainIt! empowers operators to succinctly specify a large number of causal hypotheses to search for causes of interesting events. ExplainIt! then ranks these hypotheses, reducing the number of causal dependencies from hundreds of thousands to a handful for human understanding. We show how a declarative language, such as SQL, can be effective in declaratively enumerating hypotheses that probe the structure of an unknown probabilistic graphical causal model of the underlying system. Our thesis is that databases are in a unique position to enable users to rapidly explore the possible causal mechanisms in data collected from diverse sources. We empirically demonstrate how ExplainIt! has helped us resolve over 30 performance issues in a commercial product since late 2014, of which we discuss a few cases in detail.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Cloud computing; Computing methodologies; Artificial intelligence; Knowledge representation and reasoning; Causal reasoning and diagnostics; Probabilistic reasoning; Temporal reasoning; Machine learning; Machine learning approaches; Learning in probabilistic graphical models; Information systems; Data management systems; Mathematics of computing; Probability and statistics; Multivariate statistics; Probabilistic inference problems; Computing most probable explanation; Probabilistic reasoning algorithms; Networks; Network services

22. Automatically Generating Interesting Facts from Wikipedia Tables.

【Paper Link】 【Pages】:349-361

【Authors】: Flip Korn ; Xuezhi Wang ; You Wu ; Cong Yu

【Abstract】: Modern search engines provide contextual information surrounding query entities beyond ten blue links in the form of information cards. Among the various attributes displayed about entities there has been recent interest in providing fun facts. Obtaining such trivia at a large scale is, however, non-trivial: hiring professional content creators is expensive and extracting statements from the Web is prone to uninteresting, out-of-context and/or unreliable facts.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing; Natural language generation; Information systems; World Wide Web; Web searching and information discovery

23. Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale.

【Paper Link】 【Pages】:362-375

【Authors】: Stephen H. Bach ; Daniel Rodriguez ; Yintao Liu ; Chong Luo ; Haidong Shao ; Cassandra Xia ; Souvik Sen ; Alexander Ratner ; Braden Hancock ; Houman Alborzi ; Rahul Kuchhal ; Christopher Ré ; Rob Malkin

【Abstract】: Labeling training data is one of the most costly bottlenecks in developing machine learning-based applications. We present a first-of-its-kind study showing how existing knowledge resources from across an organization can be used as weak supervision in order to bring development time and cost down by an order of magnitude, and introduce Snorkel DryBell, a new weak supervision management system for this setting. Snorkel DryBell builds on the Snorkel framework, extending it in three critical aspects: flexible, template-based ingestion of diverse organizational knowledge, cross-feature production serving, and scalable, sampling-free execution. On three classification tasks at Google, we find that Snorkel DryBell creates classifiers of comparable quality to ones trained with tens of thousands of hand-labeled examples, converts non-servable organizational resources to servable models for an average 52% performance improvement, and executes over millions of data points in tens of minutes.
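
【Code Sketch】: A toy illustration of weak supervision with labeling functions, using a plain majority vote in place of the learned generative model that Snorkel-style systems actually fit; the functions, labels, and examples are invented for illustration.

    # Each labeling function votes spam (1), not spam (0), or abstains (None).
    def lf_keyword(x):  return 1 if "free money" in x else None
    def lf_shouting(x): return 1 if x.isupper() else None
    def lf_greeting(x): return 0 if x.startswith("hi ") else None

    LFS = [lf_keyword, lf_shouting, lf_greeting]

    def weak_label(x):
        votes = [v for lf in LFS if (v := lf(x)) is not None]
        if not votes:
            return None            # no source fired; leave unlabeled
        return max(set(votes), key=votes.count)

    print(weak_label("free money CLICK NOW"))   # 1
    print(weak_label("hi bob, lunch today?"))   # 0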

【Keywords】: Computing methodologies; Machine learning; Learning settings; Semi-supervised learning settings

24. PS2: Parameter Server on Spark.

【Paper Link】 【Pages】:376-388

【Authors】: Zhipeng Zhang ; Bin Cui ; Yingxia Shao ; Lele Yu ; Jiawei Jiang ; Xupeng Miao

【Abstract】: Most of the data in the Tencent Machine Learning Platform is extracted and processed by Spark. However, few workloads use Spark MLlib, the official machine learning (ML) library on top of Spark, due to its inefficiency. In contrast, systems like parameter servers, XGBoost and TensorFlow are used more often, which incurs the expensive cost of transferring data in and out of the Spark ecosystem. In this paper, we identify the causes of inefficiency in Spark MLlib and solve the problem by building parameter servers on top of Spark. We propose PS2, a parameter server architecture that integrates with Spark without hacking the core of Spark. With PS2, we leverage the power of Spark for data processing and ML training, and parameter servers for maintaining ML models. By carefully analyzing Tencent ML workloads, we identify a widespread computation pattern for ML models: element-wise operations among multiple high-dimensional vectors. Based on this observation, we propose a new data abstraction, called Dimension Co-located Vector (DCV), for efficient model management in PS2. A DCV is a distributed vector that considers locality in parameter servers and enables efficient computation with multiple co-located distributed vectors. For ease of use, we also design a wide variety of advanced operators for operating DCVs. Finally, we carefully implement the PS2 system and evaluate it against existing systems on both public and Tencent workloads. Empirical results demonstrate that PS2 can outperform Spark MLlib by up to 55.6X and specialized ML systems like Petuum by up to 3.7X.

【Keywords】: Computer systems organization; Architectures; Distributed architectures

25. Entity Matching Meets Data Science: A Progress Report from the Magellan Project.

【Paper Link】 【Pages】:389-403

【Authors】: Yash Govind ; Pradap Konda ; Paul Suganthan G. C. ; Philip Martinkus ; Palaniappan Nagarajan ; Han Li ; Aravind Soundararajan ; Sidharth Mudgal ; Jeffrey R. Ballard ; Haojun Zhang ; Adel Ardalan ; Sanjib Das ; Derek Paulsen ; Amanpreet Singh Saini ; Erik Paulson ; Youngchoon Park ; Marshall Carter ; Mingju Sun ; Glenn M. Fung ; AnHai Doan

【Abstract】: Entity matching (EM) finds data instances that refer to the same real-world entity. In 2015, we started the Magellan project at UW-Madison, joint with industrial partners, to build EM systems. Most current EM systems are stand-alone monoliths. In contrast, Magellan borrows ideas from the field of data science (DS) to build a new kind of EM system: an ecosystem of interoperable tools. This paper provides a progress report on the past 3.5 years of Magellan, focusing on the system aspects and on how ideas from the field of data science have been adapted to the EM context. We argue why EM can be viewed as a special class of DS problems, and thus can benefit from system building ideas in DS. We discuss how these ideas have been adapted to build PyMatcher and CloudMatcher, EM tools for power users and lay users. These tools have been successfully used in 21 EM tasks at 12 companies and domain science groups, and have been pushed into production for many customers. We report on the lessons learned, and outline a new envisioned Magellan ecosystem, which consists of not just on-premise Python tools, but also interoperable microservices deployed, executed, and scaled out on the cloud, using tools such as Docker and Kubernetes.

【Keywords】: Information systems; Data management systems; Information integration; Deduplication; Entity resolution

SIGMOD Keynote 2 1

26. State of Public and Private Blockchains: Myths and Reality.

【Paper Link】 【Pages】:404-411

【Authors】: C. Mohan

【Abstract】: It has been a decade since the concept of blockchain was invented as the underlying core data structure of the permissionless or public Bitcoin cryptocurrency network. Since then, several cryptocurrencies, tokens and ICOs have emerged. After much speculation and hype, a significant number of them have become problematic or worthless! The public blockchain system Ethereum emerged by generalizing the use of blockchains to manage any kind of asset, be it physical or purely digital, with the introduction of Smart Contracts. Over the years, numerous myths have developed with respect to the purported utility and the need for public blockchains. The adoption and further adaptation of blockchains and smart contracts for use in the permissioned or private environments is what I consider to be useful and of practical consequence. Hence, the technical aspects of only private blockchain systems will be the focus of my SIGMOD 2019 keynote. Along the way, I will bust many myths associated with public blockchains. I will also compare traditional database technologies with blockchain systems' features and identify desirable future research topics.

【Keywords】: Applied computing; Enterprise computing; Enterprise information systems; Enterprise applications; Computer systems organization; Architectures; Distributed architectures; Peer-to-peer architectures; Information systems; Data management systems; Database management system engines; DBMS engine architectures; Distributed database transactions; Middleware for databases; Middleware business process managers; Information systems applications; Collaborative and social computing systems and tools; Open source software; World Wide Web; Web applications; Electronic commerce; Digital cash; Secure online transactions; Security and privacy; Security in hardware; Tamper-proof and tamper-resistant designs; Security services; Pseudonymity, anonymity and untraceability; Systems security; Operating systems security; Trusted computing; Social and professional topics; Computing / technology policy; Privacy policies; Software and its engineering; Software organization and properties; Software system structures; Distributed systems organizing principles; Theory of computation; Theory and algorithms for application domains; Database theory; Data provenance

Panel 1

27. The Responsibility Challenge for Data.

【Paper Link】 【Pages】:412-414

【Authors】: H. V. Jagadish ; Francesco Bonchi ; Tina Eliassi-Rad ; Lise Getoor ; Krishna P. Gummadi ; Julia Stoyanovich

【Abstract】: As data science and artificial intelligence become ubiquitous, they have an increasing impact on society. While many of these impacts are beneficial, others may not be. So understanding and managing these impacts is required of every responsible data scientist. Nevertheless, most human decision-makers use algorithms for efficiency purposes and not to make better (i.e., fairer) decisions. Even the task of risk assessment in the criminal justice system enables efficiency instead of (and often at the expense of) fairness. So we need to frame the problem with fairness, and other societal impacts, as primary objectives. In this context, most attention has been paid to the machine learning of a model for a task, such as recognition, prediction, or classification. However, issues arise in all parts of the data eco-system, from data acquisition to data presentation. For example, the majority of the population is not white and male, yet this demographic is over-represented in the training data. It is challenging for a data scientist to satisfactorily discharge this broad responsibility.

【Keywords】: Information systems; Data management systems; Social and professional topics; Computing / technology policy

Research 4: Distributed Data Management 4

28. An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning.

【Paper Link】 【Pages】:415-432

【Authors】: Ji Zhang ; Yu Liu ; Ke Zhou ; Guoliang Li ; Zhili Xiao ; Bin Cheng ; Jiashu Xing ; Yangtao Wang ; Tianheng Cheng ; Li Liu ; Minwei Ran ; Zekang Li

【Abstract】: Configuration tuning is vital to optimize the performance of a database management system (DBMS). It becomes more tedious and urgent for cloud databases (CDB) due to the diverse database instances and query workloads, which make the database administrator (DBA) incompetent. Although there are some studies on automatic DBMS configuration tuning, they have several limitations. Firstly, they adopt a pipelined learning model but cannot optimize the overall performance in an end-to-end manner. Secondly, they rely on large-scale high-quality training samples which are hard to obtain. Thirdly, there are a large number of knobs that are in continuous space and have unseen dependencies, and they cannot recommend reasonable configurations in such high-dimensional continuous space. Lastly, in a cloud environment, they can hardly cope with the changes of hardware configurations and workloads, and have poor adaptability. To address these challenges, we design an end-to-end automatic CDB tuning system, CDBTune, using deep reinforcement learning (RL). CDBTune utilizes the deep deterministic policy gradient method to find the optimal configurations in high-dimensional continuous space. CDBTune adopts a trial-and-error strategy to learn knob settings with a limited number of samples to accomplish the initial training, which alleviates the difficulty of collecting massive high-quality samples. CDBTune adopts the reward-feedback mechanism in RL instead of traditional regression, which enables end-to-end learning, accelerates the convergence speed of our model, and improves the efficiency of online tuning. We conducted extensive experiments under 6 different workloads on real cloud databases to demonstrate the superiority of CDBTune. Experimental results showed that CDBTune had good adaptability and significantly outperformed the state-of-the-art tuning tools and DBA experts.

【Keywords】: Information systems; Data management systems; Database administration; Autonomous database administration; Database performance evaluation; Database utilities and tools

29. Fast General Distributed Transactions with Opacity.

【Paper Link】 【Pages】:433-448

【Authors】: Alex Shamis ; Matthew Renzelmann ; Stanko Novakovic ; Georgios Chatzopoulos ; Aleksandar Dragojevic ; Dushyanth Narayanan ; Miguel Castro

【Abstract】: Transactions can simplify distributed applications by hiding data distribution, concurrency, and failures from the application developer. Ideally the developer would see the abstraction of a single large machine that runs transactions sequentially and never fails. This requires the transactional subsystem to provide opacity (strict serializability for both committed and aborted transactions), as well as transparent fault tolerance with high availability. As even the best abstractions are unlikely to be used if they perform poorly, the system must also provide high performance. Existing distributed transactional designs either weaken this abstraction or are not designed for the best performance within a data center. This paper extends the design of FaRM, which provides strict serializability only for committed transactions, to provide opacity while maintaining FaRM's high throughput, low latency, and high availability within a modern data center. It uses timestamp ordering based on real time with clocks synchronized to within tens of microseconds across a cluster, and a failover protocol to ensure correctness across clock master failures. FaRM with opacity can commit 5.4 million new-order transactions per second when running the TPC-C transaction mix on 90 machines with 3-way replication.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Information systems; Data management systems; Database management system engines; Database transaction processing; Parallel and distributed DBMSs

30. The Log-Structured Merge-Bush & the Wacky Continuum.

【Paper Link】 【Pages】:449-466

【Authors】: Niv Dayan ; Stratos Idreos

【Abstract】: Data-intensive key-value stores based on the Log-Structured Merge-Tree are used in numerous modern applications ranging from social media and data science to cloud infrastructure. We show that such designs exhibit an intrinsic contention between the costs of point reads, writes and memory, and that this trade-off deteriorates as the data size grows. The root of the problem is that in all existing designs, the capacity ratio between any pair of levels is fixed. This causes write cost to increase with the data size while yielding exponentially diminishing returns for point reads and memory. We introduce the Log-Structured Merge-Bush (LSM-Bush), a new data structure that sets increasing capacity ratios between adjacent pairs of smaller levels. As a result, smaller levels get lazier by gathering more runs before merging them. By using a doubly-exponential ratio growth rate, LSM-Bush brings write cost down from O(log N) to O(log log N), and it can trade this gain to either improve point reads or memory. Thus, it enables more scalable trade-offs all around. We further introduce Wacky, a design continuum that includes LSM-Bush as well as all state-of-the-art merge policies, from laziest to greediest, and can assume any of them within a single implementation. Wacky encompasses a vast space of performance properties, including ones that favor range reads, and it can be searched analytically to find the design that performs best for a given workload in practice.
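
【Code Sketch】: A worked sketch of the doubly-exponential ratio arithmetic: with capacity ratios r_i = 2^(2^i), total capacity after L levels is the buffer size times 2^(2^L - 1), so the level count, and with it write cost, grows as O(log log N). The buffer size below is illustrative.

    def levels_needed(n_entries, buffer_entries=1024):
        """Count levels until capacity covers n_entries, with ratio 2**(2**i)
        between levels i and i + 1."""
        cap, lvl = buffer_entries, 0
        while cap < n_entries:
            cap *= 2 ** (2 ** lvl)
            lvl += 1
        return lvl

    for n in (10**6, 10**9, 10**12):
        print(n, levels_needed(n))   # the level count grows extremely slowly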

【Keywords】: Information systems; Data management systems; Data structures; Data access methods; Point lookups; Database management system engines; Parallel and distributed DBMSs; Key-value stores; Information storage systems; Record storage systems; Directory structures; Record storage alternatives; Indexed file organization; Theory of computation; Design and analysis of algorithms; Data structures design and analysis; Sorting and searching; Streaming, sublinear and near linear time algorithms; Bloom filters and hashing; Theory and algorithms for application domains; Database theory; Data structures and algorithms for data management

31. RaSQL: Greater Power and Performance for Big Data Analytics with Recursive-aggregate-SQL on Spark.

【Paper Link】 【Pages】:467-484

【Authors】: Jiaqi Gu ; Yugo H. Watanabe ; William A. Mazza ; Alexander Shkapsky ; Mohan Yang ; Ling Ding ; Carlo Zaniolo

【Abstract】: Thanks to a simple SQL extension, Recursive-aggregate-SQL (RaSQL) can express very powerful queries and declarative algorithms, such as classical graph algorithms and data mining algorithms. A novel compiler implementation allows RaSQL to map declarative queries into one basic fixpoint operator supporting aggregates in recursive queries. A fully optimized implementation of this fixpoint operator leads to superior performance, scalability and portability. Thus, our RaSQL system, which extends Spark SQL with the aforementioned new constructs and implementation techniques, matches and often surpasses the performance of other systems, including Apache Giraph, GraphX and Myria.
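
【Code Sketch】: A naive Python rendition of a fixpoint with a min-aggregate (single-source shortest paths), the semantics behind RaSQL's recursive-aggregate queries; the actual system expresses this declaratively in SQL and evaluates it with an optimized distributed fixpoint operator.

    def shortest_paths(edges, src):
        """Keep relaxing distances until no tuple improves (the fixpoint)."""
        dist = {src: 0}
        changed = True
        while changed:
            changed = False
            for u, v, w in edges:
                if u in dist and dist[u] + w < dist.get(v, float("inf")):
                    dist[v] = dist[u] + w
                    changed = True
        return dist

    edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 9), ("c", "d", 1)]
    print(shortest_paths(edges, "a"))  # {'a': 0, 'b': 1, 'c': 3, 'd': 4}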

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query languages; Relational database query languages

Research 5: Provenance 4

32. Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances.

【Paper Link】 【Pages】:485-502

【Authors】: Zhengjie Miao ; Qitian Zeng ; Boris Glavic ; Sudeepa Roy

【Abstract】: Provenance and intervention-based techniques have been used to explain surprisingly high or low outcomes of aggregation queries. However, such techniques may miss interesting explanations emerging from data that is not in the provenance. For instance, an unusually low number of publications of a prolific researcher in a certain venue and year can be explained by an increased number of publications in another venue in the same year. We present a novel approach for explaining outliers in aggregation queries through counterbalancing. That is, explanations are outliers in the opposite direction of the outlier of interest. Outliers are defined w.r.t. patterns that hold over the data in aggregate. We present efficient methods for mining such aggregate regression patterns (ARPs), discuss how to use ARPs to generate and rank explanations, and experimentally demonstrate the efficiency and effectiveness of our approach.

【Keywords】: Theory of computation; Theory and algorithms for application domains; Database theory; Data provenance

33. Explaining Wrong Queries Using Small Examples.

【Paper Link】 【Pages】:503-520

【Authors】: Zhengjie Miao ; Sudeepa Roy ; Jun Yang

【Abstract】: For testing the correctness of SQL queries, a standard practice is to execute the query in question on some test database instance and compare its result with that of the correct query. Given two queries $Q_1$ and $Q_2$, we say that a database instance D is a counterexample (for $Q_1$ and $Q_2$) if $Q_1(D)$ differs from $Q_2(D)$; such a counterexample can serve as an explanation of why $Q_1$ and $Q_2$ are not equivalent. While the test database instance may serve as a counterexample, it may be too large or complex to understand where the inequivalence arises. Therefore, in this paper, given a known counterexample D for $Q_1$ and $Q_2$, we aim to find the smallest counterexample $D' \subseteq D$ where $Q_1(D') \neq Q_2(D')$. The problem in general is NP-hard. Drawing techniques from provenance and constraint solving, we develop a suite of algorithms for finding small counterexamples for different classes of queries, including those involving negation and aggregation. We evaluate the effectiveness and scalability of our algorithms on student queries from an undergraduate database course, and on queries from the TPC-H benchmark. We also report a user study from the course where we deployed our tool to help students with an assignment on relational algebra.
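
【Code Sketch】: A brute-force rendition of the problem statement, with queries modeled as Python functions: search subsets of the test instance, smallest first, for one on which the two queries disagree. The paper replaces this exponential search with provenance and constraint-solving techniques.

    from itertools import combinations

    def smallest_counterexample(D, q1, q2):
        """Return a minimum-size subset D' of D with q1(D') != q2(D')."""
        for size in range(1, len(D) + 1):
            for sub in map(list, combinations(D, size)):
                if q1(sub) != q2(sub):
                    return sub
        return None

    # Q1: names of people over 30; Q2 (buggy) uses >= 30 instead.
    q1 = lambda rows: {name for name, age in rows if age > 30}
    q2 = lambda rows: {name for name, age in rows if age >= 30}
    D = [("ann", 25), ("bob", 30), ("eve", 41)]
    print(smallest_counterexample(D, q1, q2))  # [('bob', 30)]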

【Keywords】: Information systems; Data management systems; Database administration; Database utilities and tools

34. Ariadne: Online Provenance for Big Graph Analytics.

Paper Link】 【Pages】:521-536

【Authors】: Vicky Papavasileiou ; Ken Yocum ; Alin Deutsch

【Abstract】: Data provenance is a powerful tool for debugging large-scale analytics on batch processing systems. This paper presents Ariadne, a system for capturing and querying provenance from Vertex-Centric graph processing systems. While the size of provenance from map-reduce-style workflows is often a fraction of the input data size, graph algorithms iterate over the input graph many times, producing provenance much larger than the input graph. And though current provenance tracing procedures support explicit debugging scenarios, like crash-culprit determination, developers are increasingly interested in the behavior of analytics when a crash or exception does not occur. To address this challenge, Ariadne offers developers a concise declarative query language to capture and query graph analytics provenance. Exploiting the formal semantics of this datalog-based language, we identify useful query classes that can run while an analytic computes. Experiments with various analytics and real-world datasets show the overhead of online querying is 1.3x over the baseline vs. 8x for the traditional approach. These experiments also illustrate how Ariadne's query language supports execution monitoring and performance optimization for graph analytics.

【Keywords】: Information systems; Data management systems; Query languages; Query languages for non-relational engines; Theory of computation; Design and analysis of algorithms; Graph algorithms analysis; Theory and algorithms for application domains; Database theory; Data provenance

35. Hypothetical Reasoning via Provenance Abstraction.

Paper Link】 【Pages】:537-554

【Authors】: Daniel Deutch ; Yuval Moskovitch ; Noam Rinetzky

【Abstract】: Data analytics often involves hypothetical reasoning: repeatedly modifying the data and observing the induced effect on the computation result of a data-centric application. Previous work has shown that fine-grained data provenance can help make such an analysis more efficient: instead of a costly re-execution of the underlying application, hypothetical scenarios are applied to a pre-computed provenance expression. However, storing provenance for complex queries and large-scale data leads to a significant overhead, which is often a barrier to the incorporation of provenance-based solutions. To this end, we present a framework that allows the provenance size to be reduced. Our approach is based on reducing the provenance granularity using user-defined abstraction trees over the provenance variables; the granularity is based on the anticipated hypothetical scenarios. We formalize the tradeoff between provenance size and the supported granularity of hypothetical reasoning, study the complexity of the resulting optimization problem, and provide efficient algorithms for tractable cases and heuristics for others. We experimentally study the performance of our solution for various queries and abstraction trees. Our study shows that the algorithms generally lead to a substantial speedup of hypothetical reasoning, with a reasonable loss of accuracy.
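
The size/granularity trade-off can be made concrete with a toy sketch under assumed names: provenance is kept as a sum of products over tuple-level variables, and a user-defined abstraction maps those variables to coarser group variables, shrinking the stored expression while coarsening the hypothetical scenarios it can still answer exactly:

    from collections import Counter

    # Hypothetical provenance of one query result: a sum of products
    # over tuple-level variables t1..t4 (semiring-style provenance).
    provenance = [("t1", "t3"), ("t2", "t3"), ("t1", "t4")]

    # User-defined abstraction: t1, t2 belong to group g1; t3, t4 to g2.
    abstraction = {"t1": "g1", "t2": "g1", "t3": "g2", "t4": "g2"}

    def abstract(prov, mapping):
        # Rewrite each monomial over group variables and merge duplicates;
        # the count records how many tuple-level monomials collapsed.
        return Counter(tuple(sorted(mapping[v] for v in mono)) for mono in prov)

    print(abstract(provenance, abstraction))
    # Counter({('g1', 'g2'): 3}): three monomials collapse into one, so the
    # expression shrinks, but a scenario deleting t1 alone can no longer be
    # evaluated exactly -- only one deleting all of g1.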

【Keywords】: Theory of computation; Theory and algorithms for application domains; Database theory; Data provenance

Research 6: Streams 4

36. Event Trend Aggregation Under Rich Event Matching Semantics.

Paper Link】 【Pages】:555-572

【Authors】: Olga Poppe ; Chuan Lei ; Elke A. Rundensteiner ; David Maier

【Abstract】: Streaming applications from cluster monitoring to algorithmic trading deploy Kleene queries to detect and aggregate event trends. Rich event matching semantics determine how to compose events into trends. The expressive power of state-of-the-art streaming systems remains limited since they do not support many of these semantics. Worse yet, they suffer from long delays and high memory costs because they maintain aggregates at a fine granularity. To overcome these limitations, our Coarse-Grained Event Trend Aggregation (Cogra) approach supports a rich variety of event matching semantics within one system. Better yet, Cogra incrementally maintains aggregates at the coarsest granularity possible for each of these semantics. In this way, Cogra minimizes the number of aggregates -- reducing both time and space complexity. Our experiments demonstrate that Cogra achieves up to six orders of magnitude speed-up and up to seven orders of magnitude memory reduction compared to state-of-the-art approaches.

【Keywords】: Computing methodologies; Symbolic and algebraic manipulation; Symbolic and algebraic algorithms; Optimization algorithms; Information systems; Information systems applications; Spatial-temporal systems; Data streaming; Theory of computation; Design and analysis of algorithms; Online algorithms

37. Elasticutor: Rapid Elasticity for Realtime Stateful Stream Processing.

Paper Link】 【Pages】:573-588

【Authors】: Li Wang ; Tom Z. J. Fu ; Richard T. B. Ma ; Marianne Winslett ; Zhenjie Zhang

【Abstract】: Elasticity is highly desirable for stream systems to guarantee low latency against workload dynamics, such as surges in arrival rate and fluctuations in data distribution. Existing systems achieve elasticity using a resource-centric approach that repartitions keys across the parallel instances, i.e., executors, to balance the workload and scale operators. However, such operator-level repartitioning requires global synchronization and prohibits rapid elasticity. We propose an executor-centric approach that avoids operator-level key repartitioning and implements executors as the building blocks of elasticity. Based on this new approach, we design the Elasticutor framework with two levels of optimization: i) a novel implementation of executors, i.e., elastic executors, that perform elastic multi-core execution via efficient intra-executor load balancing and executor scaling, and ii) a global model-based scheduler that dynamically allocates CPU cores to executors based on the instantaneous workloads. We implemented a prototype of Elasticutor and conducted extensive experiments. We show that Elasticutor doubles the throughput and achieves up to two orders of magnitude lower latency than previous methods for dynamic workloads of real-world applications.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Cloud computing; Information systems; Data management systems; Database design and models; Data model extensions; Data streams; Database management system engines; Stream management

38. Real-Time Multi-Pattern Detection over Event Streams.

Paper Link】 【Pages】:589-606

【Authors】: Ilya Kolchinsky ; Assaf Schuster

【Abstract】: Rapid advances in data-driven applications over recent years have intensified the need for efficient mechanisms capable of monitoring and detecting arbitrarily complex patterns in massive data streams. This task is usually performed by complex event processing (CEP) systems. CEP engines are required to process hundreds or even thousands of user-defined patterns in parallel under tight real-time constraints. To enhance the performance of this crucial operation, multiple techniques have been developed, utilizing well-known optimization approaches such as pattern rewriting and sharing common subexpressions. However, the scalability of these methods is limited by the high computation overhead, and the quality of the produced plans is compromised by ignoring significant parts of the solution space. In this paper, we present a novel framework for real-time multi-pattern complex event processing. Our approach is based on formulating the above task as a global optimization problem and applying a combination of sharing and pattern reordering techniques to construct an optimal plan satisfying the problem constraints. To the best of our knowledge, no such fusion was previously attempted in the field of CEP optimization. To locate the best possible evaluation plan in the resulting hyperexponential solution space, we design efficient local search algorithms that utilize the unique problem structure. An extensive theoretical and empirical analysis of our system demonstrates its superiority over state-of-the-art solutions.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Join algorithms; Stream management

39. AStream: Ad-hoc Shared Stream Processing.

Paper Link】 【Pages】:607-622

【Authors】: Jeyhun Karimov ; Tilmann Rabl ; Volker Markl

【Abstract】: In the last decade, many distributed stream processing engines (SPEs) were developed to perform continuous queries on massive online data. The central design principle of these engines is to handle queries that potentially run forever on data streams with a query-at-a-time model, i.e., each query is optimized and executed separately. In many real applications, streams are not only processed with long-running queries, but also with thousands of short-running ad-hoc queries. To support this efficiently, it is essential to share resources and computation for stream ad-hoc queries in a multi-user environment. The goal of this paper is to bridge the gap between stream processing and ad-hoc queries in SPEs by sharing computation and resources. We define three main requirements for ad-hoc shared stream processing: (1) Integration: Ad-hoc query processing should be a composable layer which can extend stream operators, such as join, aggregation, and window operators; (2) Consistency: Ad-hoc query creation and deletion must be performed in a consistent manner and ensure exactly-once semantics and correctness; (3) Performance: In contrast to state-of-the-art SPEs, an ad-hoc SPE should maximize not only data throughput but also query throughput via incremental computation and resource sharing. Based on these requirements, we have developed AStream, an ad-hoc, shared computation stream processing framework. To the best of our knowledge, AStream is the first system that supports distributed ad-hoc stream processing. AStream is built on top of Apache Flink. Our experiments show that AStream achieves results comparable to Flink's for single-query deployments and outperforms it by orders of magnitude with multiple queries.

【Keywords】: Information systems; Data management systems; Database management system engines; Stream management

Industry 2: Storage & Indexing 4

40. Nanosecond Indexing of Graph Data With Hash Maps and VLists.

Paper Link】 【Pages】:623-635

【Authors】: Andrew Carter ; Andrew Rodriguez ; Yiming Yang ; Scott Meyer

【Abstract】: We introduce a wait-free, multi-reader, single-writer, "kill -9" durable, indexing structure for in-memory social graph databases. This structure requires no communication from the readers back to the writer, allowing for trivial read scalability and isolation. We support online updates without compromising availability or read performance. Our structure supports looking up small subgraphs in 80 nanoseconds and a materialization rate of 12 nanoseconds per edge. Storage takes 7 bytes per edge per index and supports almost 1 million online writes per second.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Information retrieval; Document representation

41. Implementation of Cluster-wide Logical Clock and Causal Consistency in MongoDB.

Paper Link】 【Pages】:636-650

【Authors】: Misha Tyulenev ; Andy Schwerin ; Asya Kamsky ; Randolph Tan ; Alyson Cabral ; Jack Mulrow

【Abstract】: MongoDB is a distributed database that supports replication and horizontal partitioning (sharding). MongoDB replica sets consist of a primary that accepts all client writes and then propagates those writes to the secondaries. Each member of the replica set contains the same set of data. For horizontal partitioning, each shard (or partition) is a replica set. This paper discusses the design and rationale behind MongoDB's implementation of a cluster-wide logical clock and causal consistency. The design leveraged ideas from across the research community to ensure that the implementation adds minimal processing overhead, tolerates possible operator errors, and gives protection against non-trusted client attacks. While the goal of the team was not to discover or test new algorithms, the practical implementation necessitated a novel combination of ideas from the research community on causal consistency, security, and minimal performance overhead at scale. This paper describes a large scale, practical implementation of causal consistency using a hybrid logical clock, adding the signing of logical time ranges to the protocol, and introducing performance optimizations necessary for systems at scale. The implementation seeks to define an event as a state change and as such must make forward progress guarantees even during periods of no state changes for a partition of data.
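
The cluster-wide logical clock combines a physical timestamp with a logical counter. The sketch below shows the standard hybrid logical clock tick/merge rules that this design builds on; it is an illustration of the general technique, not MongoDB's implementation, which additionally signs ranges of logical time and tolerates operator errors:

    import time

    class HybridLogicalClock:
        """Minimal HLC: timestamps are (physical, logical) pairs that
        respect causality even when physical clocks drift."""

        def __init__(self):
            self.pt = 0  # highest physical component seen so far
            self.l = 0   # logical counter breaking ties within one pt

        def now(self):
            return int(time.time() * 1000)  # wall clock in milliseconds

        def tick(self):
            # Local event or message send.
            wall = self.now()
            if wall > self.pt:
                self.pt, self.l = wall, 0
            else:
                self.l += 1
            return (self.pt, self.l)

        def update(self, remote):
            # Message receive: merge a remote timestamp into the clock.
            rpt, rl = remote
            wall = self.now()
            if wall > self.pt and wall > rpt:
                self.pt, self.l = wall, 0
            elif rpt > self.pt:
                self.pt, self.l = rpt, rl + 1
            elif self.pt > rpt:
                self.l += 1
            else:  # equal physical components
                self.l = max(self.l, rl) + 1
            return (self.pt, self.l)

    clock = HybridLogicalClock()
    print(clock.tick())
    # A timestamp received from a node whose clock runs 5s ahead still
    # advances the local clock past it, preserving causal order.
    print(clock.update((clock.now() + 5000, 3)))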

【Keywords】: Information systems; Data management systems; Database management system engines; DBMS engine architectures

42. X-Engine: An Optimized Storage Engine for Large-scale E-commerce Transaction Processing.

Paper Link】 【Pages】:651-665

【Authors】: Gui Huang ; Xuntao Cheng ; Jianying Wang ; Yujie Wang ; Dengcheng He ; Tieying Zhang ; Feifei Li ; Sheng Wang ; Wei Cao ; Qiang Li

【Abstract】: Alibaba runs the largest e-commerce platform in the world, serving more than 600 million customers, with a GMV (gross merchandise value) exceeding USD 768 billion in FY2018. Online e-commerce transactions have three notable characteristics: (1) a drastic increase of transactions per second with the kickoff of major sales and promotion events, (2) a large number of hot records that can easily overwhelm system buffers, and (3) quick shifts of the "temperature" (hot vs. warm vs. cold) of different records due to the availability of promotions on different categories over different short time periods. For example, Alibaba's OLTP database clusters experienced a 122-fold increase of transactions at the start of the Singles' Day Global Shopping Festival in 2018, processing up to 491,000 sales transactions per second, which translates to more than 70 million database transactions per second. To address these challenges, we introduce X-Engine, a write-optimized storage engine of POLARDB built at Alibaba, which utilizes a tiered storage architecture with an LSM-tree (log-structured merge tree) to leverage hardware acceleration such as FPGA-accelerated compactions, and a suite of optimizations including asynchronous writes in transactions, multi-staged pipelines, and incremental cache replacement during compactions. Evaluation results show that X-Engine outperforms other storage engines under such transactional workloads.

【Keywords】: Information systems; Data management systems; Data structures; Data access methods; Database management system engines; Database transaction processing; DBMS engine architectures

43. Automatically Indexing Millions of Databases in Microsoft Azure SQL Database.

Paper Link】 【Pages】:666-679

【Authors】: Sudipto Das ; Miroslav Grbic ; Igor Ilic ; Isidora Jovandic ; Andrija Jovanovic ; Vivek R. Narasayya ; Miodrag Radulovic ; Maja Stikic ; Gaoxiang Xu ; Surajit Chaudhuri

【Abstract】: An appropriate set of indexes can result in orders of magnitude better query performance. Index management is a challenging task even for expert human administrators. Fully automating this process is of significant value. We describe the challenges, architecture, design choices, implementation, and learnings from building an industrial-strength auto-indexing service for Microsoft Azure SQL Database, a relational database service. Our service has been generally available for more than two years, generating index recommendations for every database in Azure SQL Database, automatically implementing them for a large fraction, and significantly improving performance of hundreds of thousands of databases. We also share our experience from experimentation at scale with production databases which gives us confidence in our index recommendation quality for complex real applications.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Cloud computing; Information systems; Data management systems; Data structures; Data layout; Database administration; Autonomous database administration; Database design and models; Physical data models

Research 7: Modern Hardware 6

44. Concurrent Prefix Recovery: Performing CPR on a Database.

Paper Link】 【Pages】:687-704

【Authors】: Guna Prasaad ; Badrish Chandramouli ; Donald Kossmann

【Abstract】: With increasing multi-core parallelism, modern databases and key-value stores are designed for scalability and presently yield very high throughput for the in-memory working set. These systems typically depend on group commit using a write-ahead log (WAL) to provide durability and crash recovery. However, a WAL is expensive, particularly for update-intensive workloads, where, besides log creation and I/O overheads, it also introduces a concurrency bottleneck (the log itself). In this paper, we propose a new recovery model based on group commit, called concurrent prefix recovery (CPR). CPR differs from traditional group commit implementations in two ways: (1) it provides a semantic description of committed operations, of the form "all operations until time $T_i$ from session $i$"; and (2) it uses asynchronous incremental checkpointing instead of a WAL to implement group commit in a scalable bottleneck-free manner. CPR provides the same consistency as a point-in-time commit, but allows a scalable concurrent implementation. We used CPR to make two systems durable: (1) a custom in-memory transactional database; and (2) Faster, our state-of-the-art, scalable, larger-than-memory key-value store. Our detailed evaluation of these modified systems shows that CPR is highly scalable and supports concurrent performance reaching hundreds of millions of operations per second on a multi-core machine.

【Keywords】: Information systems; Data management systems; Database management system engines; Database transaction processing; Database recovery; Parallel and distributed DBMSs; Key-value stores

45. BriskStream: Scaling Data Stream Processing on Shared-Memory Multicore Architectures.

Paper Link】 【Pages】:705-722

【Authors】: Shuhao Zhang ; Jiong He ; Amelie Chi Zhou ; Bingsheng He

【Abstract】: We introduce BriskStream, an in-memory data stream processing system (DSPS) specifically designed for modern shared-memory multicore architectures. BriskStream's key contribution is an execution plan optimization paradigm, namely RLAS, which takes the relative location (i.e., NUMA distance) of each pair of producer-consumer operators into consideration. We propose a branch-and-bound based approach with three heuristics to solve the resulting nontrivial optimization problem. The experimental evaluations demonstrate that BriskStream yields much higher throughput and better scalability than existing DSPSs on multi-core architectures when processing different types of workloads.

【Keywords】: Computer systems organization; Architectures; Parallel architectures; Multicore architectures; Information systems; Data management systems; Database management system engines; Stream management

46. Border-Collie: A Wait-free, Read-optimal Algorithm for Database Logging on Multicore Hardware.

Paper Link】 【Pages】:723-740

【Authors】: Jong-Bin Kim ; Hyeongwon Jang ; Seohui Son ; Hyuck Han ; Sooyong Kang ; Hyungsoo Jung

【Abstract】: Actions that change the state of a database are all logged, with proper ordering imposed on them. Database engines obeying this golden rule of logging enforce total ordering on all events, and this poses challenges in addressing the scalability bottlenecks of database logging on multicore hardware. We reexamined the problem of database logging and realized that in any given log history, obtaining an upper bound on the size of a set that preserves the happen-before relation is the essence of the matter. Based on this understanding, we propose Border-Collie, a wait-free and read-optimal algorithm for database logging that finds such an upper bound even when some worker threads are often idle. We show that (1) Border-Collie always finds the largest set of logged events satisfying the condition in a finite number of steps (i.e., it is wait-free), (2) the number of logged events to be read is also minimal (i.e., it is read-optimal), and (3) both properties hold even when threads work only intermittently. Experimental results under various workloads support these claims: Border-Collie outperforms the state-of-the-art centralized logging techniques (i.e., Eleda and ERMIA) by up to ~2X and exhibits almost the same throughput as the state-of-the-art decentralized logging techniques (i.e., Silo and FOEDUS) with much shorter commit latency.
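
The flushable upper bound can be illustrated with a toy, single-threaded sketch: each worker publishes a per-thread cutoff, and the bound is their minimum. All names here are ours; the paper's actual contribution is maintaining these cutoffs wait-free and keeping the number of events read minimal, even with intermittently idle threads:

    class LogBoundary:
        """Each worker publishes a cutoff: the LSN of its in-flight
        record, or, when idle, the next LSN it would use. Every record
        below min(cutoffs) is complete and safe to flush."""

        def __init__(self, n_workers, tail=0):
            self.cutoffs = [tail] * n_workers

        def begin_write(self, worker, lsn):
            self.cutoffs[worker] = lsn       # record lsn is in flight

        def end_write(self, worker, lsn):
            self.cutoffs[worker] = lsn + 1   # done through lsn

        def boundary(self):
            return min(self.cutoffs)

    log = LogBoundary(n_workers=3)
    log.end_write(0, 4)      # worker 0 completed LSNs up to 4
    log.begin_write(1, 3)    # worker 1 is still writing LSN 3
    log.end_write(2, 9)      # worker 2 completed LSNs up to 9
    print(log.boundary())    # 3 -> LSNs 0..2 form the durable prefix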

【Keywords】: Information systems; Data management systems; Database management system engines; Database transaction processing; Database recovery; Transaction logging; DBMS engine architectures

47. Designing Distributed Tree-based Index Structures for Fast RDMA-capable Networks.

Paper Link】 【Pages】:741-758

【Authors】: Tobias Ziegler ; Sumukha Tumkur Vani ; Carsten Binnig ; Rodrigo Fonseca ; Tim Kraska

【Abstract】: Over the past decade, in-memory database systems have become prevalent in academia and industry. However, large data sets often do not fit into the memory of a single machine and therefore need to be stored distributed across the memory of several nodes in a cluster. A database architecture that has recently been proposed for building distributed in-memory databases for fast RDMA-capable networks is the Network-Attached-Memory (NAM) architecture. The NAM architecture logically separates compute and memory servers and thus provides independent scalability of both resources. One key challenge in the NAM architecture is to provide efficient remote access methods for compute nodes to access data residing in memory nodes. In this paper, we therefore discuss design alternatives for distributed tree-based index structures in the NAM architecture. The two main aspects that we focus on are: (1) how the index itself should be distributed across several memory servers and (2) which RDMA primitives should be used by compute servers to access the distributed index structure in the most efficient manner. Our experimental evaluation shows the trade-offs of different distributed index design alternatives using a variety of workloads. While the focus of this paper is on the NAM architecture, we believe that the findings can also help to understand the design space of distributed tree-based indexes for other RDMA-based distributed database architectures in general.

【Keywords】: Hardware; Communication hardware, interfaces and storage; Networking hardware; Information systems; Data management systems; Data structures; Data access methods; Software and its engineering; Software organization and properties; Contextual software domains; Operating systems; Memory management; Main memory

48. DistME: A Fast and Elastic Distributed Matrix Computation Engine using GPUs.

Paper Link】 【Pages】:759-774

【Authors】: Donghyoung Han ; Yoon-Min Nam ; Jihye Lee ; Kyongseok Park ; Hyunwoo Kim ; Min-Soo Kim

【Abstract】: Matrix computation, in particular matrix multiplication, is time-consuming, but essential and widely used in a large number of applications in science and industry. Existing distributed matrix multiplication methods focus either on low communication cost (i.e., high performance) at the risk of running out of memory, or on large-scale processing with high communication overhead. We propose a distributed elastic matrix multiplication method called CuboidMM that achieves both high performance and large-scale processing. We also propose a GPU acceleration method that can be combined with CuboidMM. CuboidMM partitions matrices into cuboids to optimize the network communication cost while considering memory usage per task, and the GPU acceleration method partitions a cuboid into subcuboids to optimize the PCI-E communication cost while considering GPU memory usage. We implement a fast and elastic matrix computation engine called DistME by integrating CuboidMM with GPU acceleration on top of Apache Spark. Through extensive experiments, we demonstrate that CuboidMM and DistME significantly outperform the state-of-the-art methods and systems, respectively, in terms of both performance and data size.
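
The trade-off CuboidMM navigates can be sketched with the textbook cost model for 3D-partitioned matrix multiplication: splitting an m x k by k x n multiplication into a p x q x r grid of tasks trades replication traffic against per-task memory. The formulas and the search below are standard illustrative estimates, not the paper's exact cost function:

    def divisors(x):
        return [d for d in range(1, x + 1) if x % d == 0]

    def plan_cuboids(m, k, n, tasks, mem_budget):
        """Pick a p*q*r = tasks grid minimizing communication, subject
        to each task's three blocks fitting in memory (element counts)."""
        best = None
        for p in divisors(tasks):
            for q in divisors(tasks // p):
                r = tasks // (p * q)
                # Per-task blocks: A is (m/p)x(k/r), B is (k/r)x(n/q),
                # and the partial C is (m/p)x(n/q).
                mem = (m/p)*(k/r) + (k/r)*(n/q) + (m/p)*(n/q)
                if mem > mem_budget:
                    continue
                # Shipped elements: A replicated q times, B p times,
                # plus r partial copies of C to be reduced.
                comm = q*m*k + p*k*n + r*m*n
                if best is None or comm < best[0]:
                    best = (comm, (p, q, r))
        return best

    print(plan_cuboids(m=10**5, k=10**5, n=10**5, tasks=64,
                       mem_budget=2 * 10**9))
    # A balanced 4x4x4 grid wins once skewed grids blow the memory budget.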

【Keywords】: Computing methodologies; Distributed computing methodologies; Distributed algorithms; Information systems; Data management systems; Database management system engines; Parallel and distributed DBMSs; MapReduce-based systems

49. GPU-based Graph Traversal on Compressed Graphs.

Paper Link】 【Pages】:775-792

【Authors】: Mo Sha ; Yuchen Li ; Kian-Lee Tan

【Abstract】: Graph processing on GPUs has recently received much attention in industry and academia, as the hardware accelerator offers attractive potential for performance boosts. However, the high-bandwidth device memory on GPUs has limited capacity, which constrains the size of the graph that can be loaded on chip. In this paper, we introduce GPU-based graph traversal on compressed graphs, so as to enable the processing of graphs larger than the device memory. Designed for the GPU's SIMT architecture, we propose two novel parallel scheduling strategies, Two-Phase Traversal and Task-Stealing, to handle thread divergence and workload imbalance when decoding the compressed graph. We further optimize our solution for power-law graphs by proposing Warp-centric Decoding and Residual Segmentation to facilitate parallelism when processing skewed out-degree distributions. Extensive experiments show that with compression ratios of 2x-18x, our proposed GPU-based graph traversal on compressed graphs (GCGT) achieves efficiency competitive with state-of-the-art graph traversal approaches on non-compressed graphs.

【Keywords】: Theory of computation; Design and analysis of algorithms; Data structures design and analysis; Data compression; Graph algorithms analysis; Parallel algorithms; Shared memory algorithms

Research 8: Data Integration/Cleaning 6

50. Interventional Fairness: Causal Database Repair for Algorithmic Fairness.

Paper Link】 【Pages】:793-810

【Authors】: Babak Salimi ; Luke Rodriguez ; Bill Howe ; Dan Suciu

【Abstract】: Fairness is increasingly recognized as a critical component of machine learning systems. However, it is the underlying data on which these systems are trained that often reflect discrimination, suggesting a database repair problem. Existing treatments of fairness rely on statistical correlations that can be fooled by statistical anomalies, such as Simpson's paradox. Proposals for causality-based definitions of fairness can correctly model some of these situations, but they require specification of the underlying causal models. In this paper, we formalize the situation as a database repair problem, proving sufficient conditions for fair classifiers in terms of admissible variables as opposed to a complete causal model. We show that these conditions correctly capture subtle fairness violations. We then use these conditions as the basis for database repair algorithms that provide provable fairness guarantees about classifiers trained on their training labels. We evaluate our algorithms on real data, demonstrating improvement over the state of the art on multiple fairness metrics proposed in the literature while retaining high utility.

【Keywords】: Theory of computation; Theory and algorithms for application domains; Database theory; Incomplete, inconsistent, and uncertain databases; Machine learning theory

51. Uni-Detect: A Unified Approach to Automated Error Detection in Tables.

Paper Link】 【Pages】:811-828

【Authors】: Pei Wang ; Yeye He

【Abstract】: Data errors are ubiquitous in tables. Extensive research in this area has resulted in a rich variety of techniques, each often targeting a specific type of error, e.g., numeric outliers, constraint violations, etc. While these diverse techniques clearly improve data quality, they place a significant burden on humans to configure them with suitable rules and parameters for each data set. For example, an expert is expected to define suitable functional dependencies between column pairs, or tune appropriate thresholds for outlier-detection algorithms, all of which are specific to one individual data set. As a result, users today often hire experts to cleanse only their high-value data sets. We propose Uni-Detect, a unified framework to automatically detect diverse types of errors. Our approach employs a novel "what-if" analysis that performs local data perturbations to reason about data abnormality, leveraging classical hypothesis tests on a large corpus of tables. We test Uni-Detect on a wide variety of tables including Wikipedia tables, and make surprising discoveries of thousands of FD violations, numeric outliers, spelling mistakes, etc., with better accuracy than existing algorithms specifically designed for each type of error. For example, for spelling mistakes, Uni-Detect outperforms the state-of-the-art spell-checker of a commercial search engine.
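
A caricature of the "what-if" analysis for one error type (numeric outliers): perturb the data by removing a single value and test how much more "normal" the column becomes without it. Uni-Detect calibrates such hypothesis tests over a large corpus of tables; the single-threshold version below only conveys the leave-one-out flavor and is entirely our own construction:

    import statistics

    def whatif_outliers(column, z_cut=3.0):
        """Flag a value if, once it is removed, the rest of the column
        is tightly distributed and the value sits far outside it."""
        flags = []
        for i, v in enumerate(column):
            rest = column[:i] + column[i + 1:]
            mu, sigma = statistics.mean(rest), statistics.pstdev(rest)
            if sigma > 0 and abs(v - mu) / sigma > z_cut:
                flags.append(v)
        return flags

    print(whatif_outliers([10.2, 9.8, 10.1, 9.9, 87.0]))  # [87.0]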

【Keywords】: Information systems; Data management systems; Information integration; Data cleaning

52. HoloDetect: Few-Shot Learning for Error Detection.

Paper Link】 【Pages】:829-846

【Authors】: Alireza Heidari ; Joshua McGrath ; Ihab F. Ilyas ; Theodoros Rekatsinas

【Abstract】: We introduce a few-shot learning framework for error detection. We show that data augmentation (a form of weak supervision) is key to training high-quality, ML-based error detection models that require minimal human involvement. Our framework consists of two parts: (1) an expressive model to learn rich representations that capture the inherent syntactic and semantic heterogeneity of errors; and (2) a data augmentation model that, given a small seed of clean records, uses dataset-specific transformations to automatically generate additional training data. Our key insight is to learn data augmentation policies from the noisy input dataset in a weakly supervised manner. We show that our framework detects errors with an average precision of ~94% and an average recall of ~93% across a diverse array of datasets that exhibit different types and amounts of errors. We compare our approach to a comprehensive collection of error detection methods, ranging from traditional rule-based methods to ensemble-based and active learning approaches. We show that data augmentation yields an average improvement of 20 F1 points while it requires access to 3x fewer labeled examples compared to other ML approaches.

【Keywords】: Computing methodologies; Machine learning; Learning settings; Semi-supervised learning settings; Hardware; Robustness; Fault tolerance; Error detection and error correction; Information systems; Data management systems; Information integration; Data cleaning; Theory of computation; Theory and algorithms for application domains; Database theory; Incomplete, inconsistent, and uncertain databases

53. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes.

Paper Link】 【Pages】:847-864

【Authors】: Erkang Zhu ; Dong Deng ; Fatemeh Nargesian ; Renée J. Miller

【Abstract】: We present a new solution for finding joinable tables in massive data lakes: given a table and one join column, find tables that can be joined with the given table on the largest number of distinct values. The problem can be formulated as an overlap set similarity search problem by considering columns as sets and matching values as the intersection between sets. Although set similarity search is well-studied in the field of approximate string search (e.g., fuzzy keyword search), the solutions are designed for and evaluated over sets of relatively small size (average set size rarely much over 100 and maximum set size in the low thousands) with modest dictionary sizes (the total number of distinct values in all sets is only a few million). We observe that modern data lakes typically have massive set sizes (with maximum set sizes that may be tens of millions) and dictionaries that include hundreds of millions of distinct values. Our new algorithm, JOSIE (Joining Search using Intersection Estimation), minimizes the cost of set reads and inverted index probes used in finding the top-k sets. We show that JOSIE completely outperforms the state-of-the-art overlap set similarity search techniques on data lakes. More surprisingly, we also consider a state-of-the-art approximate algorithm and show that our new exact search algorithm performs almost as well, and in some cases even better, on real data lakes.
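
For intuition, here is the baseline primitive that JOSIE optimizes: exact top-k overlap search over an inverted index. The sketch does a full counting pass over the query column's posting lists; JOSIE's contribution is reading only the cheapest mix of posting lists and candidate sets needed to confirm the top-k. Data and names are hypothetical:

    import heapq
    from collections import defaultdict

    # Hypothetical data lake: column id -> set of distinct values.
    columns = {
        "t1.city":  {"berlin", "paris", "rome", "oslo"},
        "t2.place": {"paris", "rome", "lima"},
        "t3.name":  {"oslo", "kiev"},
    }

    # Inverted index: value -> ids of the columns containing it.
    index = defaultdict(set)
    for cid, values in columns.items():
        for v in values:
            index[v].add(cid)

    def topk_joinable(query_values, k):
        """Exact top-k columns by overlap with the query column."""
        overlap = defaultdict(int)
        for v in query_values:
            for cid in index.get(v, ()):
                overlap[cid] += 1
        return heapq.nlargest(k, overlap.items(), key=lambda kv: kv[1])

    print(topk_joinable({"paris", "rome", "oslo"}, k=2))
    # [('t1.city', 3), ('t2.place', 2)]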

【Keywords】: Information systems; Data management systems; Information integration; Information retrieval; Evaluation of retrieval results; Retrieval efficiency; Retrieval models and ranking; Top-k retrieval in databases; Retrieval tasks and goals; Business intelligence

54. Raha: A Configuration-Free Error Detection System.

Paper Link】 【Pages】:865-882

【Authors】: Mohammad Mahdavi ; Ziawasch Abedjan ; Raul Castro Fernandez ; Samuel Madden ; Mourad Ouzzani ; Michael Stonebraker ; Nan Tang

【Abstract】: Detecting erroneous values is a key step in data cleaning. Error detection algorithms usually require a user to provide input configurations in the form of rules or statistical parameters. However, providing a complete, yet correct, set of configurations for each new dataset is not trivial, as the user has to know about both the dataset and the error detection algorithms upfront. In this paper, we present Raha, a new configuration-free error detection system. By generating a limited number of configurations for error detection algorithms that cover various types of data errors, we can generate an expressive feature vector for each tuple value. Leveraging these feature vectors, we propose a novel sampling and classification scheme that effectively chooses the most representative values for training. Furthermore, our system can exploit historical data to filter out irrelevant error detection algorithms and configurations. In our experiments, Raha outperforms the state-of-the-art error detection techniques with no more than 20 labeled tuples on each dataset.

【Keywords】: Computing methodologies; Machine learning; Learning paradigms; Multi-task learning; Transfer learning; Information systems; Data management systems; Information integration; Data cleaning; Information retrieval; Retrieval tasks and goals; Clustering and classification; Theory of computation; Theory and algorithms for application domains; Machine learning theory; Semi-supervised learning

55. Speculative Distributed CSV Data Parsing for Big Data Analytics.

Paper Link】 【Pages】:883-899

【Authors】: Chang Ge ; Yinan Li ; Eric Eilebrecht ; Badrish Chandramouli ; Donald Kossmann

【Abstract】: There has been a recent flurry of interest in providing query capability on raw data in today's big data systems. These raw data must be parsed before processing or use in analytics. Thus, a fundamental challenge in distributed big data systems is that of efficient parallel parsing of raw data. The difficulties come from the inherent ambiguity while independently parsing chunks of raw data without knowing the context of these chunks. Specifically, it can be difficult to find the beginnings and ends of fields and records in these chunks of raw data. To parallelize parsing, this paper proposes a speculation-based approach for the CSV format, arguably the most commonly used raw data format. Due to the syntactic and statistical properties of the format, speculative parsing rarely fails and therefore parsing is efficiently parallelized in a distributed setting. Our speculative approach is also robust, meaning that it can reliably detect syntax errors in CSV data. We experimentally evaluate the speculative, distributed parsing approach in Apache Spark using more than 11,000 real-world datasets, and show that our parser produces significant performance benefits over existing methods.
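
The speculation can be illustrated on the classic ambiguity: a newline inside a quoted field. In the sketch below (a simplification that ignores escaped quotes and runs sequentially rather than distributed), every chunk is parsed under the guess that it starts outside any quoted field, and a cheap fix-up pass re-parses only the chunks whose guess turns out wrong:

    def split_records(chunk, start_inside):
        """Split a CSV chunk at newlines outside quoted fields; returns
        (records, trailing partial record, quote state at chunk end)."""
        records, cur, inside = [], [], start_inside
        for ch in chunk:
            if ch == '"':
                inside = not inside
            if ch == '\n' and not inside:
                records.append(''.join(cur))
                cur = []
            else:
                cur.append(ch)
        return records, ''.join(cur), inside

    data = 'a,b\n1,"x\ny"\n2,z\n'
    chunks = [data[:8], data[8:]]   # this split lands inside a quoted field

    # Speculation: parse every chunk independently (in parallel, in the
    # real system), guessing it starts outside any quoted field.
    speculative = [split_records(c, start_inside=False) for c in chunks]

    # Sequential fix-up: keep a chunk's result if the previous chunk truly
    # ended outside quotes; otherwise re-parse just that chunk.
    inside, out, carry = False, [], ''
    for chunk, (recs, rest, end_inside) in zip(chunks, speculative):
        if inside:  # speculation failed (rare): redo this one chunk
            recs, rest, end_inside = split_records(chunk, start_inside=True)
        if recs:
            recs[0] = carry + recs[0]
            carry = rest
        else:
            carry += rest
        out.extend(recs)
        inside = end_inside
    if carry:
        out.append(carry)
    print(out)   # ['a,b', '1,"x\ny"', '2,z']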

【Keywords】: Computing methodologies; Distributed computing methodologies; Distributed algorithms; Information systems; Data management systems; Database management system engines; Parallel and distributed DBMSs

Research 9: Query Processing & Optimization 2 6

56. CATAPULT: Data-driven Selection of Canned Patterns for Efficient Visual Graph Query Formulation.

Paper Link】 【Pages】:900-917

【Authors】: Kai Huang ; Huey-Eng Chua ; Sourav S. Bhowmick ; Byron Choi ; Shuigeng Zhou

【Abstract】: Visual graph query interfaces (a.k.a. GUIs) widen the reach of graph querying frameworks by enabling non-programmers to use them. Consequently, several commercial and academic frameworks for querying a large collection of small- or medium-sized data graphs (e.g., chemical compounds) provide such visual interfaces. The majority of these interfaces expose a fixed set of canned patterns (i.e., small subgraph patterns) to expedite query formulation by enabling pattern-at-a-time in lieu of edge-at-a-time construction mode. Canned patterns to be displayed on a GUI are typically selected manually based on domain knowledge. However, manual generation of canned patterns is labour intensive. Furthermore, these patterns may not sufficiently cover the underlying data graphs to expedite visual formulation of a wide range of subgraph queries. In this paper, we present a generic and extensible framework called Catapult to address these limitations. Catapult takes a data-driven approach to automatically select canned patterns, thereby taking a concrete step towards the vision of data-driven construction of visual query interfaces. Specifically, it first clusters the underlying data graphs based on their topological similarities and then summarizes each cluster to create a cluster summary graph (CSG). The canned patterns within a user-specified pattern budget are then generated from these CSGs by maximizing coverage and diversity, and minimizing the cognitive load of the patterns. Experimental study with real-world datasets and visual graph interfaces demonstrates the superiority of Catapult compared to traditional techniques.

【Keywords】: Theory of computation; Theory and algorithms for application domains; Database theory; Data structures and algorithms for data management

57. iQCAR: inter-Query Contention Analyzer for Data Analytics Frameworks.

Paper Link】 【Pages】:918-935

【Authors】: Prajakta Kalmegh ; Shivnath Babu ; Sudeepa Roy

【Abstract】: Resource interference caused by concurrent queries is one of the key reasons for unpredictable performance and missed workload SLAs in cluster computing systems. Analyzing these inter-query resource interactions is critical in order to answer time-sensitive questions like 'who is creating resource conflicts for my query'. More importantly, diagnosing whether the resource blocked times of a 'victim' query are caused by other queries or by some other external factor can help the database administrator narrow down the many possible causes of query performance degradation. We introduce iQCAR, an inter-Query Contention Analyzer, that attributes blame for the slowdown of a query to concurrent queries. iQCAR models the resource conflicts using a multi-level directed acyclic graph that can help administrators compare impacts from concurrent queries, and identify the most contentious queries, resources, and hosts in an online execution for a selected time window. Our experiments using TPC-DS queries on Apache Spark show that our approach is substantially more accurate than other methods based on overlap time between concurrent queries.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Cloud computing; Information systems; Data management systems; Database administration; Autonomous database administration; Database utilities and tools; Database management system engines; Parallel and distributed DBMSs; MapReduce-based systems; Relational parallel and distributed DBMSs

58. A Holistic Approach for Query Evaluation and Result Vocalization in Voice-Based OLAP.

Paper Link】 【Pages】:936-953

【Authors】: Immanuel Trummer ; Yicheng Wang ; Saketh Mahankali

【Abstract】: We focus on the problem of answering OLAP queries via voice output. We present a holistic approach that combines query processing and result vocalization. We use the following key ideas to minimize processing overheads and maximize answer quality. First, our approach samples from the database to evaluate alternative speech fragments. OLAP queries are not fully evaluated. Instead, sampling focuses on result aspects that are relevant for voice output. To guide sampling, we rely on methods from the area of Monte-Carlo Tree Search. Second, we use pipelining to interleave query processing and voice output. The system starts providing the user with high-level insights while generating more fine-grained results in the background. Third, we optimize speech output to maximize the user's information gain under speaking-time constraints. We use a maximum-entropy model to predict the user's belief about OLAP results after listening to voice output. Based on that model, we select the most informative speech fragments (i.e., the ones minimizing the distance between user belief and actual data). We analyze formal properties of the proposed speech structure as well as the complexity of our algorithm. Also, we compare alternative vocalization approaches in an extensive user study.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing; Natural language generation; Human-centered computing; Human computer interaction (HCI); Information systems; Data management systems; Database management system engines; Database query processing; Online analytical processing engines; Information systems applications; Decision support systems; Online analytical processing

59. Top-k Queries over Digital Traces.

Paper Link】 【Pages】:954-971

【Authors】: Yifan Li ; Xiaohui Yu ; Nick Koudas

【Abstract】: Recent advances in social and mobile technology have enabled an abundance of digital traces (in the form of mobile check-ins, association of mobile devices to specific WiFi hotspots, etc.) revealing the physical presence history of diverse sets of entities (e.g., humans, devices, and vehicles). One challenging yet important task is to identify k entities that are most closely associated with a given query entity based on their digital traces. We propose a suite of indexing techniques and algorithms to enable fast query processing for this problem at scale. We first define a generic family of functions measuring the association between entities, and then propose algorithms to transform digital traces into a lower-dimensional space for more efficient computation. We subsequently design a hierarchical indexing structure to organize entities in a way that closely associated entities tend to appear together. We then develop algorithms to process top-k queries utilizing the index. We theoretically analyze the pruning effectiveness of the proposed methods based on a mobility model which we propose and validate in real life situations. Finally, we conduct extensive experiments on both synthetic and real datasets at scale, evaluating the performance of our techniques both analytically and experimentally, confirming the effectiveness and superiority of our approach over other applicable approaches across a variety of parameter settings and datasets.

【Keywords】: Information systems; Information retrieval; Retrieval models and ranking; Top-k retrieval in databases; Information systems applications; Spatial-temporal systems; Location based services

60. Visual Road: A Video Data Management Benchmark.

Paper Link】 【Pages】:972-987

【Authors】: Brandon Haynes ; Amrita Mazumdar ; Magdalena Balazinska ; Luis Ceze ; Alvin Cheung

【Abstract】: Recently, video database management systems (VDBMSs) have re-emerged as an active area of research and development. To accelerate innovation in this area, we present Visual Road, a benchmark that evaluates the performance of these systems. Visual Road comes with a data generator and a suite of queries over cameras positioned within a simulated metropolitan environment. Visual Road's video data is automatically generated with a high degree of realism, and annotated using a modern simulation and visualization engine. This allows for VDBMS performance evaluation while scaling up the size of the input data. Visual Road is designed to evaluate a broad variety of VDBMSs: real-time systems, systems for longitudinal analytical queries, systems processing traditional videos, and systems designed for 360° videos. We use the benchmark to evaluate three recent VDBMSs in both capabilities and performance.

【Keywords】: Information systems; Data management systems; Database administration; Database performance evaluation; Information systems applications; Multimedia information systems; Multimedia databases

61. Mining Precision Interfaces From Query Logs.

Paper Link】 【Pages】:988-1005

【Authors】: Qianrui Zhang ; Haoci Zhang ; Thibault Sellam ; Eugene Wu

【Abstract】: Interactive tools make data analysis more efficient and more accessible to end-users by hiding the underlying query complexity and exposing interactive widgets for the parts of the query that matter to the analysis. However, creating custom-tailored (i.e., precise) interfaces is very costly, and automated approaches are desirable. We propose a syntactic approach that uses queries from an analysis to generate a tailored interface. We model interface widgets as functions I(q) -> q' that modify the current analysis query q, and interfaces as the set of queries that its widgets can express. Our system, Precision Interfaces, analyzes structural changes between input queries from an analysis, and generates an output interface with widgets to express those changes. Our experiments on the Sloan Digital Sky Survey query log suggest that Precision Interfaces can generate useful interfaces for simple unanticipated tasks, and that our optimizations can generate interfaces from logs of up to 10,000 queries in under 10 seconds.
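
The widget model I(q) -> q' is easy to mimic: a widget is a closure that applies one structural change to a query representation, and an interface is a set of such closures. The query encoding and the slider below are our own hypothetical illustration, not the system's internals:

    import copy

    # A query as a simple AST; a widget is a function I(q) -> q'.
    base_query = {
        "select": ["hour", "count(*)"],
        "from": "taxi_trips",
        "where": [("fare", ">", 10)],
        "groupby": ["hour"],
    }

    def slider(column):
        """Widget factory: a slider that rewrites the literal compared
        against `column` -- the kind of structural diff mined from
        consecutive queries in a log."""
        def widget(q, value):
            q2 = copy.deepcopy(q)
            q2["where"] = [(c, op, value if c == column else lit)
                           for c, op, lit in q["where"]]
            return q2
        return widget

    fare_slider = slider("fare")
    print(fare_slider(base_query, 25)["where"])   # [('fare', '>', 25)]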

【Keywords】:

Research 10: Graphs 1 6

62. Distance-generalized Core Decomposition.

Paper Link】 【Pages】:1006-1023

【Authors】: Francesco Bonchi ; Arijit Khan ; Lorenzo Severini

【Abstract】: The k-core of a graph is defined as the maximal subgraph in which every vertex is connected to at least k other vertices within that subgraph. In this work we introduce a distance-based generalization of the notion of k-core, which we refer to as the $(k,h)$-core, i.e., the maximal subgraph in which every vertex has at least k other vertices at distance $\leq h$ within that subgraph. We study the properties of the $(k,h)$-core, showing that it preserves many of the nice features of the classic core decomposition (e.g., its connection with the notion of distance-generalized chromatic number) and that it remains useful for speeding up or approximating distance-generalized notions of dense structures, such as the h-club. Computing the distance-generalized core decomposition over large networks is intrinsically complex. However, by exploiting clever upper and lower bounds we can partition the computation into a set of totally independent subcomputations, opening the door to top-down exploration and to multithreading, and thus achieving an efficient algorithm.
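
A direct, unoptimized rendering of the definition helps fix ideas: repeatedly peel vertices with fewer than k others within distance h among the surviving vertices. The paper's algorithms avoid this sketch's repeated BFS work through lower and upper bounds; the example graph is ours:

    from collections import deque

    def h_neighbors(adj, v, h, alive):
        """Vertices within distance h of v, restricted to alive ones."""
        seen, frontier = {v}, deque([(v, 0)])
        while frontier:
            u, d = frontier.popleft()
            if d == h:
                continue
            for w in adj[u]:
                if w in alive and w not in seen:
                    seen.add(w)
                    frontier.append((w, d + 1))
        seen.discard(v)
        return seen

    def kh_core(adj, k, h):
        alive = set(adj)
        changed = True
        while changed:
            changed = False
            for v in list(alive):
                if v in alive and len(h_neighbors(adj, v, h, alive)) < k:
                    alive.discard(v)
                    changed = True
        return alive

    # A 4-spoke star with a pendant p: every star vertex sees at least
    # 4 others within 2 hops, while p reaches only x1 and c.
    adj = {"c": {"x1", "x2", "x3", "x4"}, "x1": {"c", "p"},
           "x2": {"c"}, "x3": {"c"}, "x4": {"c"}, "p": {"x1"}}
    print(sorted(kh_core(adj, k=4, h=2)))   # ['c', 'x1', 'x2', 'x3', 'x4']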

【Keywords】: Information systems; Information systems applications; Data mining

63. Unboundedness and Efficiency of Truss Maintenance in Evolving Graphs.

Paper Link】 【Pages】:1024-1041

【Authors】: Yikai Zhang ; Jeffrey Xu Yu

【Abstract】: Due to the ubiquity of graphs, graph analytics has attracted much attention from both research and industry communities. The notion of k-truss is widely used in graph analytics. Since graphs are continuously evolving in real applications and it is costly to compute trusses from scratch, we study the problem of truss maintenance which aims at designing efficient incremental algorithms to update trusses when graphs are updated with changes. An incremental algorithm is desired to be bounded; that is, its cost is of $O(f(|\texttt{CHANGED}|_c))$ for some polynomial function f and some positive integer c, where $\texttt{CHANGED}$ comprises the changes to both the graph and the result and $|\texttt{CHANGED}|_c$ is the size of the c-hop neighborhood of $\texttt{CHANGED}$. An incremental problem is bounded if it has a bounded incremental algorithm and is unbounded otherwise. Under the model of locally persistent algorithms, we prove that truss maintenance is bounded under edge removals but is unbounded even for unit edge insertions. To address the unboundedness, we formulate a new notion $\texttt{AFF}^{\preceq}$ which, as a practically effective alternative to $\texttt{CHANGED}$, represents a set of edges affected by the changes to the graph, and devise an insertion algorithm that is bounded with respect to $\texttt{AFF}^{\preceq}$, while retaining the boundedness for edge removals. More specifically, our insertion algorithm runs in $O(f(|\texttt{AFF}^{\preceq}|_c))$ time for some polynomial function f and some positive integer c, with $|\texttt{AFF}^{\preceq}|_c$ being the size of the c-hop neighborhood of $\texttt{AFF}^{\preceq}$. Our extensive performance studies show that our new algorithms can significantly outperform the state-of-the-art by up to 3 orders of magnitude on the 12 large real graphs tested and are more efficient than computing trusses from scratch even for changes of non-trivial size. We report our findings in this paper.

【Keywords】: Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms; Theory of computation; Design and analysis of algorithms; Graph algorithms analysis

64. PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs.

Paper Link】 【Pages】:1042-1059

【Authors】: Zhewei Wei ; Xiaodong He ; Xiaokui Xiao ; Sibo Wang ; Yu Liu ; Xiaoyong Du ; Ji-Rong Wen

【Abstract】: SimRank is a classic measure of the similarities of nodes in a graph. Given a node u in graph $G = (V, E)$, a single-source SimRank query returns the SimRank similarities $s(u, v)$ between node u and each node $v \in V$. This type of query has numerous applications in web search and social network analysis, such as link prediction, web mining, and spam detection. Existing methods for single-source SimRank queries, however, incur query cost at least linear in the number of nodes n, which renders them inapplicable for real-time and interactive analysis. This paper proposes PRSim, an algorithm that exploits the structure of graphs to efficiently answer single-source SimRank queries. PRSim uses an index of size $O(m)$, where m is the number of edges in the graph, and guarantees a query time that depends on the reverse PageRank distribution of the input graph. In particular, we prove that PRSim runs in sub-linear time if the degree distribution of the input graph follows the power-law distribution, a property possessed by many real-world graphs. Based on the theoretical analysis, we show that the empirical query time of all existing SimRank algorithms also depends on the reverse PageRank distribution of the graph. Finally, we present the first experimental study that evaluates the absolute errors of various SimRank algorithms on large graphs, and we show that PRSim outperforms the state of the art in terms of query time, accuracy, index size, and scalability.

【Keywords】: Information systems; Information systems applications; Data mining; Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms

65. Scaling Distance Labeling on Small-World Networks.

Paper Link】 【Pages】:1060-1077

【Authors】: Wentao Li ; Miao Qiao ; Lu Qin ; Ying Zhang ; Lijun Chang ; Xuemin Lin

【Abstract】: Distance labeling approaches are widely adopted to speed up the online performance of shortest-distance queries. The construction of the distance labeling, however, can be exhaustive, especially on big graphs. For a major category of large graphs, small-world networks, the state-of-the-art approach is Pruned Landmark Labeling (PLL). PLL prunes distance labels based on a node order and directly constructs the pruned labels by performing breadth-first searches in the node order. The pruning technique, as well as the index construction, has a strongly sequential nature, which hinders PLL from being parallelized. This becomes an urgent issue on massive small-world networks whose index can hardly be constructed by a single thread within a reasonable time. This paper scales distance labeling on small-world networks by proposing a Parallel Shortest-distance Labeling (PSL) scheme and further reduces the index size by exploiting graph and label properties. PSL insightfully converts PLL's node-order dependency to a shortest-distance dependency, which leads to a propagation-based parallel labeling in D rounds, where D denotes the diameter of the graph. Extensive experimental results verify our efficiency on billion-scale graphs and near-linear speedup in a multi-core environment.
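
For reference, the sequential PLL scheme that PSL parallelizes can be sketched compactly: a 2-hop label maps each vertex to (hub, distance) pairs, a query takes the minimum over common hubs, and each BFS prunes vertices already covered by labels of earlier (higher-ranked) roots. This is a textbook rendering of PLL, not of PSL's distance-based parallel rounds:

    from collections import deque

    def pruned_landmark_labeling(adj, order):
        labels = {v: {} for v in adj}

        def query(u, v):
            common = labels[u].keys() & labels[v].keys()
            return min((labels[u][h] + labels[v][h] for h in common),
                       default=float("inf"))

        for root in order:              # highest-ranked roots first
            frontier, dist = deque([(root, 0)]), {root: 0}
            while frontier:
                u, d = frontier.popleft()
                if query(root, u) <= d:
                    continue            # already covered: prune subtree
                labels[u][root] = d
                for w in adj[u]:
                    if w not in dist:
                        dist[w] = d + 1
                        frontier.append((w, d + 1))
        return labels, query

    adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2, 4}, 4: {3}}
    labels, query = pruned_landmark_labeling(adj, order=[3, 0, 1, 2, 4])
    print(query(0, 4))   # 3, via the shared hub 3 (path 0-1-3-4)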

【Keywords】: Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms; Theory of computation; Design and analysis of algorithms; Graph algorithms analysis; Shortest paths; Parallel algorithms

66. Maximizing Welfare in Social Networks under A Utility Driven Influence Diffusion model.

Paper Link】 【Pages】:1078-1095

【Authors】: Prithu Banerjee ; Wei Chen ; Laks V. S. Lakshmanan

【Abstract】: Motivated by applications such as viral marketing, the problem of influence maximization (IM) has been extensively studied in the literature. The goal is to select a small number of users to adopt an item such that it results in a large cascade of adoptions by others. Existing works have three key limitations. (1) They do not account for economic considerations of a user in buying/adopting items. (2) Most studies on multiple items focus on competition, with complementary items receiving limited attention. (3) For the network owner, maximizing social welfare is important to ensure customer loyalty, which is not addressed in prior work in the IM literature. In this paper, we address all three limitations and propose a novel model called UIC that combines utility-driven item adoption with influence propagation over networks. Focusing on the mutually complementary setting, we formulate the problem of social welfare maximization in this novel setting. We show that while the objective function is neither submodular nor supermodular, surprisingly a simple greedy allocation algorithm achieves a factor of (1-1/e-ε) of the optimum expected social welfare. We develop bundleGRD, a scalable version of this approximation algorithm, and demonstrate, with comprehensive experiments on real and synthetic datasets, that it significantly outperforms all baselines.

【Keywords】: Information systems; Information systems applications; Collaborative and social computing systems and tools; Social networking sites; Computational advertising; Data mining

67. Efficient Approximation Algorithms for Adaptive Seed Minimization.

Paper Link】 【Pages】:1096-1113

【Authors】: Jing Tang ; Keke Huang ; Xiaokui Xiao ; Laks V. S. Lakshmanan ; Xueyan Tang ; Aixin Sun ; Andrew Lim

【Abstract】: As a dual problem of influence maximization, the seed minimization problem asks for the minimum number of seed nodes needed to influence a required number η of users in a given social network G. Existing algorithms for seed minimization mostly consider the non-adaptive setting, where all seed nodes are selected in one batch without observing how they may influence other users. In this paper, we study seed minimization in the adaptive setting, where the seed nodes are selected in several batches, such that the choice of a batch may exploit information about the actual influence of the previous batches. We propose a novel algorithm, ASTI, which addresses the adaptive seed minimization problem in $O\left(\frac{\eta \cdot (m+n)}{\varepsilon^2} \ln n\right)$ expected time and offers an approximation guarantee of $\frac{(\ln \eta + 1)^2}{(1 - (1-1/b)^b)(1 - 1/e)(1 - \varepsilon)}$ in expectation, where η is the targeted number of influenced nodes, b is the size of each seed node batch, and $\varepsilon \in (0, 1)$ is a user-specified parameter. To the best of our knowledge, ASTI is the first algorithm that provides such an approximation guarantee without incurring prohibitive computation overhead. With extensive experiments on a variety of datasets, we demonstrate the effectiveness and efficiency of ASTI over competing methods.

【Keywords】: Information systems; Information systems applications; Data mining; World Wide Web; Online advertising; Social advertising; Web applications; Social networks; Theory of computation; Design and analysis of algorithms; Mathematical optimization; Mixed discrete-continuous optimization; Submodular optimization and polymatroids; Models of computation; Probabilistic computation

Award Talks 2

68. Data Management on Non-Volatile Memory.

Paper Link】 【Pages】:1114

【Authors】: Joy Arulraj

【Abstract】: We are at an exciting point in the evolution of memory technology. Device manufacturers have created a new non-volatile memory (NVM) technology that can serve as both system memory and storage. NVM supports fast reads and writes similar to volatile memory, but all writes to it are persistent like a solid-state disk. The advent of NVM invalidates decades of design decisions that are deeply embedded in today's database management systems (DBMSs). These systems are unable to take full advantage of NVM because their internal architectures are predicated on the assumption that memory is volatile. With NVM, many of the components of today's DBMSs are unnecessary and will degrade the performance of data-intensive applications. Thus, the best way to resolve these shortcomings is by designing a new system explicitly tailored for NVM. In this talk, I will present our research on the design and development of an NVM DBMS, called Peloton. Peloton's architecture shows that the impact of NVM spans all the layers of the DBMS. I will first introduce write-behind logging, an NVM-centric protocol that improves the availability of the database system by two orders of magnitude compared to the widely-used write-ahead logging protocol. I will then present the BzTree, an NVM-centric index data structure that illustrates how to simplify programming on NVM. In drawing broader lessons from this work, I will argue that all types of software systems, including file systems, machine-learning systems, and key-value stores, are amenable to similar architectural changes to achieve high performance and availability on NVM.

【Keywords】: Information systems; Data management systems; Database management system engines

69. Formal Approaches to Querying Big Data in Shared-Nothing Systems.

Paper Link】 【Pages】:1115-1116

【Authors】: Bas Ketsman

【Abstract】: To meet today's data management needs, it is a widespread practice to use distributed data storage and processing systems. Since the publication of the MapReduce paradigm, a plethora of such systems arose, but although widespread, the capabilities of these systems are still poorly understood and putting them to effective use is often more of an art than a science. As one of the causes for this observation, we identify a lack of theoretical underpinnings for these systems, which makes it hard to understand what the advantages and disadvantages of the particular systems are and which, in addition, complicates the choice of a particular formalism for a particular task. In my PhD thesis, we zoom in on several important aspects of query evaluation using clusters of servers, including coordination and communication, data skew, load balancing, and data partitioning, and propose a set of elegant and theoretically sound frameworks and theories that help to understand the applicable limitations and trade-offs.

【Keywords】: Computing methodologies; Distributed computing methodologies; Distributed algorithms; Information systems; Data management systems; Database management system engines; Parallel and distributed DBMSs; Query languages; Relational database query languages; Theory of computation; Logic; Constraint and logic programming

Research 11: Systems & Machine Learning 4

70. DeepBase: Deep Inspection of Neural Networks.

Paper Link】 【Pages】:1117-1134

【Authors】: Thibault Sellam ; Kevin Lin ; Ian Yiran Huang ; Michelle Yang ; Carl Vondrick ; Eugene Wu

【Abstract】: Although deep learning models perform remarkably well across a range of tasks such as language translation and object recognition, it remains unclear what high-level logic, if any, they follow. Understanding this logic may lead to more transparency, better model design, and faster experimentation. Recent machine learning research has leveraged statistical methods to identify hidden units that behave (e.g., activate) similarly to human-understandable logic, but those analyses require considerable manual effort. Our insight is that many of those studies follow a common analysis pattern, and therefore there is an opportunity to provide a declarative abstraction to easily express, execute and optimize them. This paper describes DeepBase, a system to inspect neural network behaviors through a unified interface. We model logic with user-provided hypothesis functions that annotate the data with high-level labels (e.g., part-of-speech tags, image captions). DeepBase lets users quickly identify individual or groups of units that have strong statistical dependencies with desired hypotheses. We discuss how DeepBase can express existing analyses, propose a set of simple and effective optimizations to speed up a standard Python implementation by up to 72x, and reproduce recent studies from the NLP literature.
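The shared analysis pattern is simple to state outside any system: score every hidden unit against a hypothesis label sequence under some statistical measure. A minimal sketch using Pearson correlation as the measure (our choice for illustration, not DeepBase's API; inputs are assumed to be NumPy arrays):

```python
import numpy as np

def unit_hypothesis_scores(activations, labels):
    """activations: (n_inputs, n_units) hidden-unit outputs collected over
    a dataset; labels: (n_inputs,) numeric hypothesis annotations
    (e.g., 1 if the token is a verb). Returns one score per unit."""
    a = activations - activations.mean(axis=0)
    l = labels - labels.mean()
    denom = activations.std(axis=0) * labels.std() * len(labels)
    with np.errstate(divide="ignore", invalid="ignore"):
        corr = (a * l[:, None]).sum(axis=0) / denom    # Pearson r per unit
    return np.nan_to_num(corr)                         # dead units score 0
```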

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks; Information systems; Data management systems; Information systems applications; Data mining

71. BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees.

Paper Link】 【Pages】:1135-1152

【Authors】: Yongjoo Park ; Jingyi Qing ; Xiaoyang Shen ; Barzan Mozafari

【Abstract】: The rising volume of datasets has made training machine learning (ML) models a major computational cost in the enterprise. Given the iterative nature of model and parameter tuning, many analysts use a small sample of their entire data during their initial stage of analysis to make quick decisions (e.g., what features or hyperparameters to use) and use the entire dataset only in later stages (i.e., when they have converged to a specific model). This sampling, however, is performed in an ad-hoc fashion. Most practitioners cannot precisely capture the effect of sampling on the quality of their model, and eventually on their decision-making process during the tuning phase. Moreover, without systematic support for sampling operators, many optimizations and reuse opportunities are lost. In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML training. BlinkML allows users to make error-computation tradeoffs: instead of training a model on their full data (i.e., full model), BlinkML can quickly train an approximate model with quality guarantees using a sample. The quality guarantees ensure that, with high probability, the approximate model makes the same predictions as the full model. BlinkML currently supports any ML model that relies on maximum likelihood estimation (MLE), which includes Generalized Linear Models (e.g., linear regression, logistic regression, max entropy classifier, Poisson regression) as well as PPCA (Probabilistic Principal Component Analysis). Our experiments show that BlinkML can speed up the training of large-scale ML tasks by 6.26×–629× while guaranteeing the same predictions, with 95% probability, as the full model.
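The guarantee is about prediction agreement between the sample-trained and full-data models, and that property is easy to measure empirically. A toy experiment with synthetic data and scikit-learn, illustrating the quantity BlinkML bounds analytically rather than its actual machinery:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))
y = (X @ rng.normal(size=10) + rng.normal(size=100_000) > 0).astype(int)

full = LogisticRegression(max_iter=1000).fit(X, y)          # full model
idx = rng.choice(len(X), size=5_000, replace=False)         # 5% sample
approx = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])

agreement = (full.predict(X) == approx.predict(X)).mean()
print(f"approximate model agrees with full model on {agreement:.1%} of points")
```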

【Keywords】: Computing methodologies; Machine learning; Information systems; Data management systems; Database management system engines; Information systems applications; Decision support systems; Data analytics

72. SkinnerDB: Regret-Bounded Query Evaluation via Reinforcement Learning.

Paper Link】 【Pages】:1153-1170

【Authors】: Immanuel Trummer ; Junxiong Wang ; Deepak Maram ; Samuel Moseley ; Saehan Jo ; Joseph Antonakakis

【Abstract】: SkinnerDB is designed from the ground up for reliable join ordering. It maintains no data statistics and uses no cost or cardinality models. Instead, it uses reinforcement learning to learn optimal join orders on the fly, during the execution of the current query. To that purpose, we divide the execution of a query into many small time slices. Different join orders are tried in different time slices. We merge result tuples generated according to different join orders until a complete result is obtained. By measuring execution progress per time slice, we identify promising join orders as execution proceeds. Along with SkinnerDB, we introduce a new quality criterion for query execution strategies. We compare expected execution cost against execution cost for an optimal join order. SkinnerDB features multiple execution strategies that are optimized for that criterion. Some of them can be executed on top of existing database systems. For maximal performance, we introduce a customized execution engine, facilitating fast join order switching via specialized multi-way join algorithms and tuple representations. We experimentally compare SkinnerDB's performance against various baselines, including MonetDB, Postgres, and adaptive processing methods. We consider various benchmarks, including the join order benchmark and TPC-H variants with user-defined functions. Overall, the overheads of reliable join ordering are negligible compared to the performance impact of the occasional, catastrophic join order choice.
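Stripped of the execution engine, the selection logic is a multi-armed bandit over join orders. A minimal UCB1-style sketch, where `run_slice` is an assumed callback that executes one time slice under a given order and returns the fraction of work completed (SkinnerDB's actual reward model and order-switching machinery are more involved):

```python
import math

def choose_join_order(join_orders, run_slice, budget):
    counts = {o: 0 for o in join_orders}
    reward = {o: 0.0 for o in join_orders}
    for t in range(1, budget + 1):
        def ucb(o):
            if counts[o] == 0:
                return float("inf")               # try every order once
            mean = reward[o] / counts[o]
            return mean + math.sqrt(2.0 * math.log(t) / counts[o])
        order = max(join_orders, key=ucb)         # most promising arm
        reward[order] += run_slice(order)         # measured progress in [0, 1]
        counts[order] += 1
    return max(join_orders, key=lambda o: reward[o] / max(counts[o], 1))
```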

【Keywords】: Computing methodologies; Machine learning; Learning paradigms; Reinforcement learning; Information systems; Data management systems; Database management system engines; Database query processing; Join algorithms; Query optimization; Query planning; Theory of computation; Theory and algorithms for application domains; Database theory; Database query processing and optimization (theory)

73. Democratizing Data Science through Interactive Curation of ML Pipelines.

Paper Link】 【Pages】:1171-1188

【Authors】: Zeyuan Shang ; Emanuel Zgraggen ; Benedetto Buratti ; Ferdinand Kossmann ; Philipp Eichmann ; Yeounoh Chung ; Carsten Binnig ; Eli Upfal ; Tim Kraska

【Abstract】: Statistical knowledge and domain expertise are key to extracting actionable insights out of data, yet such skills rarely coexist. In Machine Learning, high-quality results are only attainable via mindful data preprocessing, hyperparameter tuning and model selection. Domain experts are often overwhelmed by such complexity, de facto inhibiting a wider adoption of ML techniques in other fields. Existing libraries that claim to solve this problem still require well-trained practitioners. Those frameworks involve heavy data preparation steps and are often too slow for interactive feedback from the user, severely limiting the scope of such systems.

【Keywords】: Computing methodologies; Machine learning

Research 12: Indexing 4

74. FITing-Tree: A Data-aware Index Structure.

Paper Link】 【Pages】:1189-1206

【Authors】: Alex Galakatos ; Michael Markovitch ; Carsten Binnig ; Rodrigo Fonseca ; Tim Kraska

【Abstract】: Index structures are one of the most important tools that DBAs leverage to improve the performance of analytics and transactional workloads. However, building several indexes over large datasets can often become prohibitive and consume valuable system resources. In fact, a recent study showed that indexes created as part of the TPC-C benchmark can account for 55% of the total memory available in a modern DBMS. This overhead consumes valuable and expensive main memory, and limits the amount of space available to store new data or process existing data. In this paper, we present a novel data-aware index structure called FITing-Tree which approximates an index using piece-wise linear functions with a bounded error specified at construction time. This error knob provides a tunable parameter that allows a DBA to FIT an index to a dataset and workload by being able to balance lookup performance and space consumption. To navigate this tradeoff, we provide a cost model that helps determine an appropriate error parameter given either (1) a lookup latency requirement (e.g., 500ns) or (2) a storage budget (e.g., 100MB). Using a variety of real-world datasets, we show that our index is able to provide performance that is comparable to full index structures while reducing the storage footprint by orders of magnitude.
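The core mechanism admits a compact sketch: greedily grow a segment while a single line predicts every key's position within the error bound, then search only a 2·error window at lookup time. A minimal version under the assumption of sorted, strictly increasing numeric keys (quadratic build for clarity; the paper builds segments in one streaming pass and stores them in a tree):

```python
import bisect

def build_segments(keys, error):
    segments = []                       # (first_key, first_pos, slope)
    i = 0
    while i < len(keys):
        j, slope = i + 1, 0.0
        while j < len(keys):
            s = (j - i) / (keys[j] - keys[i])          # candidate slope
            # accept j only if the line through (keys[i], i) predicts
            # every covered position to within `error` slots
            if all(abs(i + s * (keys[k] - keys[i]) - k) <= error
                   for k in range(i, j + 1)):
                slope, j = s, j + 1
            else:
                break
        segments.append((keys[i], i, slope))
        i = j
    return segments

def lookup(keys, segments, error, key):
    starts = [s[0] for s in segments]
    fk, fp, slope = segments[max(bisect.bisect_right(starts, key) - 1, 0)]
    pred = round(fp + slope * (key - fk))              # predicted position
    lo, hi = max(pred - error, 0), min(pred + error + 1, len(keys))
    k = bisect.bisect_left(keys, key, lo, hi)          # search small window
    return k if k < len(keys) and keys[k] == key else None
```

A larger `error` yields fewer, longer segments (less space) at the cost of a wider final search window, which is exactly the knob the cost model tunes.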

【Keywords】: Information systems; Data management systems; Data structures; Data access methods; Point lookups

75. Hyperion: Building the Largest In-memory Search Tree.

Paper Link】 【Pages】:1207-1222

【Authors】: Markus Mäsker ; Tim Süß ; Lars Nagel ; Lingfang Zeng ; André Brinkmann

【Abstract】: Indexes are essential in data management systems to increase the speed of data retrieval. Prefix tries are widespread data structures that provide fast and memory-efficient indexes. Implementations like Judy, ART, or HOT optimize their internal alignments for cache and vector-unit efficiency. While these measures usually improve performance substantially, they can have a negative impact on memory efficiency. In this paper we present Hyperion, a trie-based main-memory key-value store achieving extreme space efficiency. In contrast to other data structures, Hyperion does not depend on CPU vector units, but scans the data structure linearly. Combined with a custom memory allocator, Hyperion accomplishes a remarkable data density while achieving competitive point-query and exceptional range-query performance. Hyperion can significantly reduce the index memory footprint, and its performance-to-memory ratio is more than two times better than the best implemented alternative strategy for randomized string data sets.

【Keywords】: Information systems; Data management systems; Data structures; Data access methods; Data layout; Database management system engines; Main memory engines; Software and its engineering; Software organization and properties; Contextual software domains; Operating systems; Memory management; Theory of computation; Theory and algorithms for application domains; Database theory; Data modeling; Data structures and algorithms for data management

76. Designing Succinct Secondary Indexing Mechanism by Exploiting Column Correlations.

Paper Link】 【Pages】:1223-1240

【Authors】: Yingjun Wu ; Jia Yu ; Yuanyuan Tian ; Richard Sidle ; Ronald Barber

【Abstract】: Database administrators construct secondary indexes on data tables to accelerate query processing in relational database management systems (RDBMSs). These indexes are built on top of the most frequently queried columns according to the data statistics. Unfortunately, maintaining multiple secondary indexes in the same database can be extremely space consuming, causing significant performance degradation due to the potential exhaustion of memory space. In this paper, we demonstrate that there exist many opportunities to exploit column correlations for accelerating data access. We propose HERMIT, a succinct secondary indexing mechanism for modern RDBMSs. HERMIT judiciously leverages the rich soft functional dependencies hidden among columns to prune out redundant structures for indexed key access. Instead of building a complete index that stores every single entry in the key columns, HERMIT navigates any incoming key access queries to an existing index built on the correlated columns. This is achieved through the Tiered Regression Search Tree (TRS-Tree), a succinct, ML-enhanced data structure that performs fast curve fitting to adaptively and dynamically capture both column correlations and outliers. Our extensive experimental study in two different RDBMSs has confirmed that HERMIT can significantly reduce space consumption with limited performance overhead, especially when supporting complex range queries.
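The underlying idea can be shown with a plain linear fit plus an explicit outlier buffer, a stand-in for the TRS-Tree's tiered, adaptive regression: a predicate on the unindexed column becomes a small range scan on the correlated, indexed column plus an outlier check. `host` and `target` are assumed to be NumPy arrays holding the two columns' values.

```python
import numpy as np

def fit_correlation(host, target, max_err):
    # fit target ~ a * host + b, and remember rows the line misses badly
    a, b = np.polyfit(host, target, 1)
    misses = np.flatnonzero(np.abs(a * host + b - target) > max_err)
    outliers = {int(i): (host[i], target[i]) for i in misses}
    return (a, b), outliers

def host_range_for(tval, model, max_err):
    # invert the fit: the host-column range to scan via the existing index
    a, b = model
    lo, hi = (tval - max_err - b) / a, (tval + max_err - b) / a
    return (min(lo, hi), max(lo, hi))
```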

【Keywords】: Computing methodologies; Machine learning; Information systems; Data management systems; Data structures; Data access methods; Theory of computation; Design and analysis of algorithms; Data structures design and analysis; Data compression

77. AI Meets AI: Leveraging Query Executions to Improve Index Recommendations.

Paper Link】 【Pages】:1241-1258

【Authors】: Bailu Ding ; Sudipto Das ; Ryan Marcus ; Wentao Wu ; Surajit Chaudhuri ; Vivek R. Narasayya

【Abstract】: State-of-the-art index tuners rely on the query optimizer's cost estimates to search for the index configuration with the largest estimated execution cost improvement. Due to well-known limitations in the optimizer's estimates, in a significant fraction of cases, an index estimated to improve a query's execution cost, e.g., CPU time, makes it worse when implemented. Such errors are a major impediment for automated indexing in production systems. We observe that comparing the execution cost of two plans of the same query corresponding to different index configurations is a key step during index tuning. Instead of using the optimizer's estimates for such comparison, our key insight is that formulating it as a classification task in machine learning results in significantly higher accuracy. We present a study of the design space for this classification problem. We further show how to integrate this classifier into state-of-the-art index tuners with minimal modifications, i.e., how artificial intelligence (AI) can benefit automated indexing (AI). Our evaluation using industry-standard benchmarks and a large number of real customer workloads demonstrates up to a 5x reduction in the errors in identifying the cheaper plan in a pair, which eliminates almost all query execution cost regressions when the model is used in index tuning.
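The paper's key reframing, predicting which of two plans is cheaper as a classification problem, is easy to prototype; the feature encoding and model below are our own illustrative choices, not the design the paper converges on:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_plan_comparator(feat_a, feat_b, a_was_cheaper):
    """feat_a, feat_b: per-pair plan feature vectors (e.g., estimated rows
    and operator counts -- hypothetical features); a_was_cheaper: labels
    observed from past executions of both plans."""
    X = feat_a - feat_b                     # compare plans via feature deltas
    y = np.asarray(a_was_cheaper, dtype=int)
    return RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# At tuning time, comparator.predict(fa - fb) replaces the optimizer's
# cost-estimate comparison for a candidate-index plan pair.
```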

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Cloud computing; Information systems; Data management systems; Data structures; Data layout; Database administration; Autonomous database administration

Research 13: Fairness, Uncertainty 4

78. Designing Fair Ranking Schemes.

Paper Link】 【Pages】:1259-1276

【Authors】: Abolfazl Asudeh ; H. V. Jagadish ; Julia Stoyanovich ; Gautam Das

【Abstract】: Items from a database are often ranked based on a combination of criteria. The weight given to each criterion in the combination can greatly affect the fairness of the produced ranking, for example, preferring men over women. A user may have the flexibility to choose combinations that weigh these criteria differently, within limits. In this paper, we develop a system that helps users choose criterion weights that lead to greater fairness. We consider ranking functions that compute the score of each item as a weighted sum of (numeric) attribute values, and then sort items on their score. Each ranking function can be expressed as a point in a multi-dimensional space. For a broad range of fairness criteria, including proportionality, we show how to efficiently identify regions in this space that satisfy these criteria. Using this identification method, our system is able to tell users whether their proposed ranking function satisfies the desired fairness criteria and, if it does not, to suggest the smallest modification that does. Our extensive experiments on real datasets demonstrate that our methods are able to find solutions that satisfy fairness criteria effectively (usually with only small changes to proposed weight vectors) and efficiently (in interactive time, after some initial pre-processing).
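For a single weight vector, checking a proportionality-style criterion is straightforward; the paper's contribution is characterizing whole regions of weight vectors that pass such checks and suggesting minimal repairs. A toy per-vector check (the field names are hypothetical):

```python
def passes_proportionality(items, weights, protected_ids, top_k, min_share):
    # score = weighted sum of numeric attributes; rank descending by score
    ranked = sorted(
        items,
        key=lambda it: -sum(w * a for w, a in zip(weights, it["attrs"])))
    hits = sum(1 for it in ranked[:top_k] if it["id"] in protected_ids)
    return hits / top_k >= min_share      # protected group holds enough slots
```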

【Keywords】: General and reference; Cross-computing tools and techniques; Evaluation; Information systems; Information retrieval; Retrieval models and ranking; Top-k retrieval in databases; Theory of computation; Randomness, geometry and discrete structures; Computational geometry

79. Anti-Freeze for Large and Complex Spreadsheets: Asynchronous Formula Computation.

Paper Link】 【Pages】:1277-1294

【Authors】: Mangesh Bendre ; Tana Wattanawaroon ; Kelly Mack ; Kevin Chang ; Aditya G. Parameswaran

【Abstract】: Spreadsheet systems enable users to store and analyze data in an intuitive and flexible interface. Yet the scale of data being analyzed often leads to spreadsheets hanging and freezing on small changes. We propose a new asynchronous formula computation framework: instead of freezing the interface we return control to users quickly to ensure interactivity, while computing the formulae in the background. To ensure consistency, we indicate formulae being computed in the background via visual cues on the spreadsheet. Our asynchronous computation framework introduces two novel challenges: (a) How do we identify dependencies for a given change in a bounded time? (b) How do we schedule computation to maximize the number of spreadsheet cells available to the user over time? We bound the dependency identification time by compressing the formula dependency graph lossily, a problem we show to be NP-Hard. A compressed dependency table enables us to quickly identify the spreadsheet cells that need recomputation and indicate them as such to users. Finding an optimal computation schedule to maximize cell availability is also NP-Hard, and even merely obtaining a schedule can be expensive; we propose an on-the-fly scheduling technique to address this. We have incorporated asynchronous computation in DataSpread, a scalable spreadsheet system targeted at operating on arbitrarily large datasets on a spreadsheet frontend.

【Keywords】: Applied computing; Computers in other domains; Personal computers and PC applications; Spreadsheets; Human-centered computing; Human computer interaction (HCI); Interactive systems and tools; Information systems; Data management systems; Database management system engines; Database query processing; Query optimization

80. Anytime Approximation in Probabilistic Databases via Scaled Dissociations.

Paper Link】 【Pages】:1295-1312

【Authors】: Maarten Van den Heuvel ; Peter Ivanov ; Wolfgang Gatterbauer ; Floris Geerts ; Martin Theobald

【Abstract】: Speeding up probabilistic inference remains a key challenge in probabilistic databases (PDBs) and the related area of statistical relational learning (SRL). Since computing probabilities for query answers is #P-hard, even for fairly simple conjunctive queries, both the PDB and SRL communities have proposed a number of approximation techniques over the years. The two prevalent techniques are either (i) MCMC-style sampling or (ii) branch-and-bound (B&B) algorithms that iteratively improve model-based bounds using a combination of variable substitution and elimination. We propose a new anytime B&B approximation scheme that encompasses all prior model-based approximation schemes proposed in the PDB and SRL literature. Our approach relies on the novel idea of "scaled dissociation" which can improve both the upper and lower bounds of existing model-based algorithms. We apply our approach to the well-studied problem of evaluating self-join-free conjunctive queries over tuple-independent PDBs, and show a consistent reduction in approximation error in our experiments on TPC-H, Yago3, and a synthetic benchmark setting.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Mathematics of computing; Probability and statistics

81. Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers.

Paper Link】 【Pages】:1313-1330

【Authors】: Su Feng ; Aaron Huber ; Boris Glavic ; Oliver Kennedy

【Abstract】: Certain answers are a principled method for coping with uncertainty that arises in many practical data management tasks. Unfortunately, this method is expensive and may exclude useful (if uncertain) answers. Thus, users frequently resort to less principled approaches to resolve uncertainty. In this paper, we propose Uncertainty Annotated Databases (UA-DBs), which combine an under- and over-approximation of certain answers to achieve the reliability of certain answers with the performance of a classical database system. Furthermore, in contrast to prior work on certain answers, UA-DBs achieve a higher utility by including some (explicitly marked) answers that are not certain. UA-DBs are based on incomplete K-relations, which we introduce to generalize the classical set-based notion of incomplete databases and certain answers to a much larger class of data models. Using an implementation of our approach, we demonstrate experimentally that it efficiently produces tight approximations of certain answers that are of high utility.

【Keywords】: Information systems; Data management systems; Database design and models; Data model extensions; Data provenance; Incomplete data; Inconsistent data; Uncertainty; Information integration; Data cleaning; Entity resolution; Extraction, transformation and loading; Information storage systems; Storage management; Version management; Theory of computation; Theory and algorithms for application domains; Database theory; Data integration; Data provenance; Data structures and algorithms for data management; Incomplete, inconsistent, and uncertain databases

Research 14: Graphs 2 4

82. Efficient Estimation of Heat Kernel PageRank for Local Clustering.

Paper Link】 【Pages】:1339-1356

【Authors】: Renchi Yang ; Xiaokui Xiao ; Zhewei Wei ; Sourav S. Bhowmick ; Jun Zhao ; Rong-Hua Li

【Abstract】: Given an undirected graph G and a seed node s, the local clustering problem aims to identify a high-quality cluster containing s in time roughly proportional to the size of the cluster, regardless of the size of G. This problem finds numerous applications on large-scale graphs. Recently, heat kernel PageRank (HKPR), a measure of the proximity of nodes in graphs, has been applied to this problem and found to be more efficient than prior methods. However, existing solutions for computing HKPR either are prohibitively expensive or provide unsatisfactory error approximation on HKPR values, rendering them impractical, especially on billion-edge graphs. In this paper, we present TEA and TEA+, two novel local graph clustering algorithms based on HKPR, to address the aforementioned limitations. Specifically, these algorithms provide non-trivial theoretical guarantees on the relative error of HKPR values and the time complexity. The basic idea is to utilize deterministic graph traversal to produce a rough estimation of the exact HKPR vector, and then exploit Monte-Carlo random walks to refine the results in an optimized and non-trivial way. In particular, TEA+ offers practical efficiency and effectiveness due to non-trivial optimizations. Extensive experiments on real-world datasets demonstrate that TEA+ outperforms the state-of-the-art algorithm by more than four times on most benchmark datasets in terms of computational time when achieving the same clustering quality, and in particular, is an order of magnitude faster on large graphs including the widely studied Twitter and Friendster datasets.
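The Monte-Carlo half of the approach rests on a classic identity: a random walk whose length is drawn from Poisson(t) ends at v with probability exactly the HKPR value of v. A pure-sampling sketch of that baseline (TEA/TEA+ first run a deterministic push phase so that far fewer walks are needed):

```python
import math, random
from collections import Counter

def sample_poisson(lam):
    # Knuth's method; adequate for small heat parameters t
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

def hkpr_monte_carlo(adj, seed, t=5.0, walks=10_000):
    # adj: {v: list(neighbors)} for an undirected graph without dead ends
    hits = Counter()
    for _ in range(walks):
        v = seed
        for _ in range(sample_poisson(t)):   # walk length ~ Poisson(t)
            v = random.choice(adj[v])
        hits[v] += 1
    return {v: c / walks for v, c in hits.items()}   # HKPR estimates
```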

【Keywords】: Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms

83. Fractal: A General-Purpose Graph Pattern Mining System.

Paper Link】 【Pages】:1357-1374

【Authors】: Vinícius Vitor dos Santos Dias ; Carlos H. C. Teixeira ; Dorgival O. Guedes ; Wagner Meira Jr. ; Srinivasan Parthasarathy

【Abstract】: In this paper we propose Fractal, a high performance and high productivity system for supporting distributed graph pattern mining (GPM) applications. Fractal employs dynamic (auto-tuned) load balancing based on a hierarchical and locality-aware work-stealing mechanism, allowing the system to adapt to different workload characteristics. Additionally, Fractal enumerates subgraphs by combining a depth-first strategy with a from-scratch processing paradigm to avoid storing large amounts of intermediate state and, thus, improves memory efficiency. Regarding programmer productivity, Fractal presents an intuitive, expressive and modular API, allowing for rapid compositional expression of many GPM algorithms. Fractal-based implementations outperform both existing systemic solutions and specialized distributed solutions on many problems - from frequent graph mining to subgraph querying, over a range of datasets.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms; Graph enumeration; Software and its engineering; Software notations and tools; General programming languages; Language types; Distributed programming languages

84. Experimental Analysis of Streaming Algorithms for Graph Partitioning.

Paper Link】 【Pages】:1375-1392

【Authors】: Anil Pacaci ; M. Tamer Özsu

【Abstract】: We report a systematic performance study of streaming graph partitioning algorithms. Graph partitioning plays a crucial role in overall system performance as it has a significant impact on both load balancing and inter-machine communication. The streaming model for graph partitioning has recently gained attention due to its ability to scale to very large graphs with limited resources. The main objective of this study is to understand how the choice of graph partitioning algorithm affects system performance, resource usage and scalability. We focus on both offline graph analytics and online graph query workloads. The study considers both edge-cut and vertex-cut approaches. Our results show that no single partitioning algorithm performs best in all cases; the choice of graph partitioning algorithm depends on: (i) the type and degree distribution of the graph, (ii) the characteristics of the workloads, and (iii) specific application requirements.
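As a concrete example of the edge-cut streaming heuristics such a study compares, here is a sketch of Linear Deterministic Greedy (LDG): place each arriving vertex on the partition holding most of its already-placed neighbors, damped by how full that partition is.

```python
def ldg_partition(stream, k, capacity):
    # stream yields (vertex, list of its neighbors); k partitions,
    # each with a soft capacity used as the damping denominator
    part = {}                 # vertex -> partition id
    load = [0] * k
    for v, neighbors in stream:
        def score(p):
            placed = sum(1 for u in neighbors if part.get(u) == p)
            return placed * (1.0 - load[p] / capacity)   # LDG objective
        best = max(range(k), key=score)
        part[v] = best
        load[best] += 1
    return part
```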

【Keywords】: Computing methodologies; Parallel computing methodologies; Parallel algorithms; Vector / streaming algorithms; Information systems; Data management systems; Database administration; Database performance evaluation; Database design and models; Graph-based database models

85. Interactive Graph Search.

Paper Link】 【Pages】:1393-1410

【Authors】: Yufei Tao ; Yuanbing Li ; Guoliang Li

【Abstract】: We study interactive graph search (IGS), with the conceptual objective of departing from the conventional "top-down" strategy in searching a poly-hierarchy, a.k.a. a decision graph. In IGS, a machine assists a human in looking for a target node z in an acyclic directed graph G, by repetitively asking questions. In each question, the machine picks a node u in G, asks a human "is there a path from u to z?", and takes a boolean answer from the human. The efficiency goal is to locate z with as few questions as possible. We describe algorithms that solve the problem by asking a provably small number of questions, and establish lower bounds indicating that the algorithms are optimal up to a small additive factor. An experimental evaluation is presented to demonstrate the usefulness of our solutions in real-world scenarios.
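The gain over top-down search is easiest to see in the special case where the hierarchy degenerates to a single root-to-leaf path: binary search over the path locates the target with O(log n) reachability questions instead of walking down node by node. A sketch of that special case only (`ask(u)` is the human oracle; the paper's algorithms handle general DAGs):

```python
def igs_on_path(path, ask):
    # path[0] is the root; invariant: ask(path[lo]) is True, i.e. the
    # target lies at or below path[lo]
    lo, hi = 0, len(path) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if ask(path[mid]):        # "is there a path from path[mid] to z?"
            lo = mid              # target is at mid or deeper
        else:
            hi = mid - 1          # target is strictly above mid
    return path[lo]               # deepest node that still reaches z
```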

【Keywords】: Information systems; Information retrieval; Users and interactive retrieval; Collaborative search

Research 15: Graphs 3 6

86. Optimizing Declarative Graph Queries at Large Scale.

Paper Link】 【Pages】:1411-1428

【Authors】: Qizhen Zhang ; Akash Acharya ; Hongzhi Chen ; Simran Arora ; Ang Chen ; Vincent Liu ; Boon Thau Loo

【Abstract】: This paper presents GraphRex, an efficient, robust, scalable, and easy-to-program framework for graph processing on datacenter infrastructure. To users, GraphRex presents a declarative, Datalog-like interface that is natural and expressive. Underneath, it compiles those queries into efficient implementations. A key technical contribution of GraphRex is the identification and optimization of a set of global operators whose efficiency is crucial to the good performance of datacenter-based, large graph analysis. Our experimental results show that GraphRex significantly outperforms existing frameworks---both high- and low-level---in scenarios ranging across a wide variety of graph workloads and network conditions, sometimes by two orders of magnitude.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Information systems; Data management systems; Database management system engines; Parallel and distributed DBMSs; Relational parallel and distributed DBMSs

87. Efficient Subgraph Matching: Harmonizing Dynamic Programming, Adaptive Matching Order, and Failing Set Together.

Paper Link】 【Pages】:1429-1446

【Authors】: Myoungji Han ; Hyunjoon Kim ; Geonmo Gu ; Kunsoo Park ; Wook-Shin Han

【Abstract】: Subgraph matching (or subgraph isomorphism) is one of the fundamental problems in graph analysis. Extensive research has been done to develop practical solutions for subgraph matching. State-of-the-art algorithms such as CFL-Match and Turboiso convert a query graph into a spanning tree to obtain candidates for each query vertex and a good matching order. However, using the spanning tree instead of the original query graph can lead to lower pruning power and a sub-optimal matching order. Another limitation is that they perform redundant computation in search without utilizing the knowledge learned from past computation. In this paper, we introduce three novel concepts to address these inherent limitations: 1) dynamic programming between a directed acyclic graph (DAG) and a graph, 2) adaptive matching order with DAG ordering, and 3) pruning by failing sets, which together lead to a much faster algorithm, DAF, for subgraph matching. Extensive experiments with real datasets show that DAF outperforms the fastest existing solution by up to orders of magnitude in terms of recursive calls as well as elapsed time.

【Keywords】: Information systems; Information retrieval; Information retrieval query processing; Theory of computation; Design and analysis of algorithms; Data structures design and analysis; Pattern matching

88. CECI: Compact Embedding Cluster Index for Scalable Subgraph Matching.

Paper Link】 【Pages】:1447-1462

【Authors】: Bibek Bhattarai ; Hang Liu ; H. Howie Huang

【Abstract】: Subgraph matching finds all distinct isomorphic embeddings of a query graph on a data graph. For large graphs, current solutions face the scalability challenge due to expensive joins, excessive false candidates, and workload imbalance. In this paper, we propose a novel framework for subgraph listing based on the Compact Embedding Cluster Index (CECI), which divides the data graph into multiple embedding clusters for parallel processing. CECI has three unique techniques: utilizing BFS-based filtering and reverse-BFS-based refinement to prune unpromising candidates early on, replacing edge verification with set intersection to speed up candidate verification, and using search-cardinality-based cost estimation to detect and divide large embedding clusters in advance. Experiments performed on several real and synthetic datasets show that CECI outperforms state-of-the-art solutions on average by 20.4× for listing all embeddings and by 2.6× for enumerating the first 1,024 embeddings.

【Keywords】: Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms; Software and its engineering; Software organization and properties; Software system structures; Software system models; Massively parallel systems; Theory of computation; Design and analysis of algorithms; Data structures design and analysis; Pattern matching; Sorting and searching; Graph algorithms analysis

89. Efficiently Answering Regular Simple Path Queries on Large Labeled Networks.

Paper Link】 【Pages】:1463-1480

【Authors】: Sarisht Wadhwa ; Anagh Prasad ; Sayan Ranu ; Amitabha Bagchi ; Srikanta Bedathur

【Abstract】: A fundamental query in labeled graphs is to determine whether there exists a path between given source and target vertices such that the path satisfies a given label constraint. One of the most powerful forms of specifying label constraints is through regular expressions, and the resulting problem of reachability queries under regular simple paths (RSP) forms the core of many practical graph query languages such as SPARQL from W3C, Cypher of Neo4J, Oracle's PGQL, and LDBC's G-CORE. Despite its importance, since answering RSP queries is known to be NP-Hard, there are no scalable and practical solutions for answering reachability with the full range of regular expressions as constraints. In this paper, we circumvent this computational bottleneck by designing a random-walk based sampling algorithm called ARRIVAL, which is backed by theoretical guarantees on its expected quality. Extensive experiments on billion-sized real graph datasets with thousands of labels show ARRIVAL to be 100 times faster than baseline strategies with an average accuracy of 95%.
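The sampling idea can be sketched as a random walk over the product of the graph and the regular expression's automaton; for brevity this sketch ignores the simple-path constraint and ARRIVAL's accuracy bookkeeping. `labels` maps each edge (v, u) to its label, `dfa` is a partial transition map, and `accept` is the set of accepting states (all assumed inputs):

```python
import random

def rsp_reachable(adj, labels, dfa, accept, source, target,
                  walks=2000, max_len=50):
    for _ in range(walks):
        v, q = source, 0                       # DFA start state assumed 0
        for _ in range(max_len):
            # only follow edges whose label keeps the automaton alive
            moves = [(u, dfa[(q, labels[(v, u)])])
                     for u in adj[v] if (q, labels[(v, u)]) in dfa]
            if not moves:
                break                          # walk died; start a fresh one
            v, q = random.choice(moves)
            if v == target and q in accept:
                return True                    # some sampled walk succeeded
    return False
```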

【Keywords】: Information systems; Information systems applications; Data mining; Spatial-temporal systems; Location based services; Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms; Graph enumeration; Paths and connectivity problems

90. Answering Why-questions by Exemplars in Attributed Graphs.

Paper Link】 【Pages】:1481-1498

【Authors】: Mohammad Hossein Namaki ; Qi Song ; Yinghui Wu ; Shengqi Yang

【Abstract】: This paper studies the problem of answering Why-questions for graph pattern queries. Given a query Q, its answers Q(G) in a graph G, and an exemplar E that describes desired answers, it aims to compute a query rewrite Q', such that Q'(G) incorporates relevant entities and excludes irrelevant ones w.r.t. E under a closeness measure. (1) We characterize the problem by Q-Chase. It rewrites Q by applying a sequence of applicable operators guided by E, and backtracks to derive an optimal query rewrite. (2) We develop feasible Q-Chase-based algorithms, from anytime solutions to fixed-parameter approximations, to compute query rewrites. These algorithms implement Q-Chase by detecting picky operators at run time, which discriminately enforce E to retain answers that are closer to exemplars, and effectively prune both operators and irrelevant matches by consulting a cache of star patterns (called star views). Using real-world graphs, we experimentally verify the efficiency and effectiveness of Q-Chase techniques and their applications.

【Keywords】: Information systems; Data management systems; Database design and models; Data model extensions; Data provenance; Database management system engines; Database query processing; Query operators; Query optimization; Database views; Information retrieval; Evaluation of retrieval results; Retrieval effectiveness; Information retrieval query processing; Query intent; Query reformulation; Query suggestion; Retrieval tasks and goals; Question answering; Users and interactive retrieval; Mathematics of computing; Discrete mathematics; Graph theory; Approximation algorithms

91. An Efficient Index for RDF Query Containment.

Paper Link】 【Pages】:1499-1516

【Authors】: Theofilos Mailis ; Yannis Kotidis ; Vaggelis Nikolopoulos ; Evgeny Kharlamov ; Ian Horrocks ; Yannis E. Ioannidis

【Abstract】: Query containment is a fundamental operation used to expedite query processing in view materialisation and query caching techniques. Since query containment has been shown to be NP-complete for arbitrary conjunctive queries on RDF graphs, we introduce a simpler form of conjunctive queries that we name f-graph queries. We first show that containment checking for f-graph queries can be solved in polynomial time. Based on this observation, we propose a novel indexing structure, named mv-index, that allows for fast containment checking between a single f-graph query and an arbitrary number of stored queries. Search is performed in polynomial time in the combined size of the query and the index. We then show how our algorithms and structures can be extended to arbitrary conjunctive queries on RDF graphs by introducing f-graph witnesses, i.e., f-graph representatives of conjunctive queries. F-graph witnesses have the following interesting property: a conjunctive query for RDF graphs is contained in another query only if its corresponding f-graph witness is also contained in it. The latter allows us to use our indexing structure for the general case of conjunctive query containment. In practice, this translates to microseconds or less for the containment test against hundreds of thousands of queries indexed within our structure.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Database views; Query languages; Relational database query languages; Information retrieval; Document representation; Ontologies; World Wide Web; Web data description languages; Semantic web description languages; Resource Description Framework (RDF); Web Ontology Language (OWL); Theory of computation; Theory and algorithms for application domains; Database theory; Data structures and algorithms for data management; Database query processing and optimization (theory)

Research 16: Machine Learning 6

92. Tuple-oriented Compression for Large-scale Mini-batch Stochastic Gradient Descent.

Paper Link】 【Pages】:1517-1534

【Authors】: Fengan Li ; Lingjiao Chen ; Yijing Zeng ; Arun Kumar ; Xi Wu ; Jeffrey F. Naughton ; Jignesh M. Patel

【Abstract】: Data compression is a popular technique for improving the efficiency of data processing workloads such as SQL queries and more recently, machine learning (ML) with classical batch gradient methods. But the efficacy of such ideas for mini-batch stochastic gradient descent (MGD), arguably the workhorse algorithm of modern ML, is an open question. MGD's unique data access pattern renders prior art, including those designed for batch gradient methods, less effective. We fill this crucial research gap by proposing a new lossless compression scheme we call tuple-oriented compression (TOC) that is inspired by an unlikely source, the string/text compression scheme Lempel-Ziv-Welch, but tailored to MGD in a way that preserves tuple boundaries within mini-batches. We then present a suite of novel compressed matrix operation execution techniques tailored to the TOC compression scheme that operate directly over the compressed data representation and avoid decompression overheads. An extensive empirical evaluation with real-world datasets shows that TOC consistently achieves substantial compression ratios by up to 51x and reduces runtimes for MGD workloads by up to 10.2x in popular ML systems.
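The boundary-preserving twist on Lempel-Ziv-Welch can be illustrated with a toy encoder (not TOC's actual format): the phrase dictionary is shared across the mini-batch, but a phrase never spans two tuples, so every tuple stays individually decodable for the gradient computation.

```python
def lzw_encode_minibatch(minibatch):
    # minibatch: list of tuples of (hashable) feature values
    dictionary, out = {}, []
    def code_for(phrase):
        if phrase not in dictionary:
            dictionary[phrase] = len(dictionary)
        return dictionary[phrase]
    for row in minibatch:
        codes, phrase = [], ()
        for value in row:
            if phrase + (value,) in dictionary:
                phrase = phrase + (value,)       # grow the current match
            else:
                if phrase:
                    codes.append(code_for(phrase))
                code_for(phrase + (value,))      # learn the longer phrase
                phrase = (value,)
                code_for(phrase)                 # ensure the literal is known
        if phrase:
            codes.append(code_for(phrase))       # flush at tuple boundary
        out.append(codes)                        # one code list per tuple
    return out, dictionary
```

Because each tuple flushes its pending phrase, decoding a single tuple never needs its neighbors, which is what lets compressed kernels feed MGD one mini-batch member at a time.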

【Keywords】: Computing methodologies; Machine learning; Information systems; Data management systems; Data structures; Data layout; Data compression

93. Towards Model-based Pricing for Machine Learning in a Data Marketplace.

Paper Link】 【Pages】:1535-1552

【Authors】: Lingjiao Chen ; Paraschos Koutris ; Arun Kumar

【Abstract】: Data analytics using machine learning (ML) has become ubiquitous in science, business intelligence, journalism and many other domains. While a lot of work focuses on reducing the training cost, inference runtime and storage cost of ML models, little work studies how to reduce the cost of data acquisition, which potentially leads to a loss of sellers' revenue and buyers' affordability and efficiency. In this paper, we propose a model-based pricing (MBP) framework, which instead of pricing the data, directly prices ML model instances. We first formally describe the desired properties of the MBP framework, with a focus on avoiding arbitrage. Next, we show a concrete realization of the MBP framework via a noise injection approach, which provably satisfies the desired formal properties. Based on the proposed framework, we then provide algorithmic solutions on how the seller can assign prices to models under different market scenarios (such as to maximize revenue). Finally, we conduct extensive experiments, which validate that the MBP framework can provide high revenue to the seller, high affordability to the buyer, and also operate on low runtime cost.

【Keywords】: Computing methodologies; Machine learning; Information systems; Data management systems; Theory of computation; Theory and algorithms for application domains; Algorithmic game theory and mechanism design; Computational pricing and auctions; Database theory; Data exchange

94. DBEst: Revisiting Approximate Query Processing Engines with Machine Learning Models.

Paper Link】 【Pages】:1553-1570

【Authors】: Qingzhi Ma ; Peter Triantafillou

【Abstract】: In the era of big data, computing exact answers to analytical queries becomes prohibitively expensive. This greatly increases the value of approaches that can efficiently compute approximate, but highly accurate, answers to analytical queries. Alas, the state of the art still suffers from many shortcomings: errors are still high unless large memory investments are made; many important analytics tasks are not supported; query response times are too long, and thus approaches rely on parallel execution of queries atop large big data analytics clusters, in-situ or in the cloud, whose acquisition and use are costly. Hence, the following questions are crucial: Can we develop AQP engines that reduce response times by orders of magnitude, ensure high accuracy, and support most aggregate functions, with smaller memory footprints and small overheads to build the state upon which they are based? With this paper, we show that the answers to all of the questions above can be positive. The paper presents DBEst, a system based on machine learning models (regression models and probability density estimators). We discuss its limitations, promises, and how it can complement existing systems, and substantiate its advantages using queries and data from the TPC-DS benchmark and real-life datasets, compared against state-of-the-art AQP engines.
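The flavor of model-based AQP is easy to convey: answer an aggregate from a density estimate and a regression model fitted on a small sample, rather than from the data itself. A sketch for AVG over a range predicate, using SciPy's KDE and a polynomial fit as stand-ins for DBEst's model choices:

```python
import numpy as np
from scipy.stats import gaussian_kde

def approx_avg(x_sample, y_sample, lo, hi, grid=512):
    """Approximate AVG(y) WHERE lo <= x <= hi from a sample by combining
    a density estimate f(x) with a regression model r(x):
        AVG ~= integral of r(x) f(x) dx / integral of f(x) dx over [lo, hi]."""
    f = gaussian_kde(x_sample)                       # density of x
    r = np.poly1d(np.polyfit(x_sample, y_sample, deg=3))  # y as a function of x
    xs = np.linspace(lo, hi, grid)
    fx = f(xs)
    return np.trapz(r(xs) * fx, xs) / np.trapz(fx, xs)
```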

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Online analytical processing engines

95. Enabling and Optimizing Non-linear Feature Interactions in Factorized Linear Algebra.

Paper Link】 【Pages】:1571-1588

【Authors】: Side Li ; Lingjiao Chen ; Arun Kumar

【Abstract】: Accelerating machine learning (ML) over relational data is a key focus of the database community. While many real-world datasets are multi-table, most ML tools expect single-table inputs, forcing users to materialize joins before ML, leading to data redundancy and runtime waste. Recent works on ''factorized ML'' address such issues by pushing ML through joins. However, they have hitherto been restricted to ML models linear in the feature space, rendering them less effective when users construct non-linear feature interactions such as pairwise products to boost ML accuracy. In this work, we take a first step towards closing this gap by introducing a new abstraction to enable pairwise feature interactions in multi-table data and present an extensive framework of algebraic rewrite rules for factorized LA operators over feature interactions. Our rewrite rules carefully exploit the interplay of the redundancy caused by both joins and interactions. We prototype our framework in Python to build a tool we call MorpheusFI. An extensive empirical evaluation with both synthetic and real datasets shows that MorpheusFI yields up to 5x speedups over materialized execution for a popular second-order gradient method and even an order of magnitude speedups over a popular stochastic gradient method.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Information systems applications; Decision support systems; Data analytics; Theory of computation; Theory and algorithms for application domains; Database theory; Data structures and algorithms for data management

96. Incremental and Approximate Inference for Faster Occlusion-based Deep CNN Explanations.

Paper Link】 【Pages】:1589-1606

【Authors】: Supun Nakandala ; Arun Kumar ; Yannis Papakonstantinou

【Abstract】: Deep Convolutional Neural Networks (CNNs) now match human accuracy in many image prediction tasks, resulting in a growing adoption in e-commerce, radiology, and other domains. Naturally, explaining CNN predictions is a key concern for many users. Since the internal workings of CNNs are unintuitive for most users, occlusion-based explanations (OBE) are popular for understanding which parts of an image matter most for a prediction. One occludes a region of the image using a patch and moves it around to produce a heat map of changes to the prediction probability. Alas, this approach is computationally expensive due to the large number of re-inference requests produced, which wastes time and raises resource costs. We tackle this issue by casting the OBE task as a new instance of the classical incremental view maintenance problem. We create a novel and comprehensive algebraic framework for incremental CNN inference combining materialized views with multi-query optimization to reduce computational costs. We then present two novel approximate inference optimizations that exploit the semantics of CNNs and the OBE task to further reduce runtimes. We prototype our ideas in Python to create a tool we call Krypton that supports both CPUs and GPUs. Experiments with real data and CNNs show that Krypton reduces runtimes by up to 5X (resp. 35X) to produce exact (resp. high-quality approximate) results without raising resource requirements.
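For reference, the naive OBE loop that Krypton accelerates looks like this; each occluded position triggers a full re-inference, which is exactly the redundancy that incremental view maintenance removes. `predict` is an assumed callback mapping a batch of HxWxC images to class probabilities:

```python
import numpy as np

def occlusion_heatmap(predict, image, patch=16, stride=8, target=0):
    h, w, _ = image.shape
    base = predict(image[None])[0, target]          # unoccluded probability
    ys = list(range(0, h - patch + 1, stride))
    xs = list(range(0, w - patch + 1, stride))
    heat = np.zeros((len(ys), len(xs)))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch, :] = 0.5   # grey patch
            # one full re-inference per position: the cost OBE pays
            heat[i, j] = base - predict(occluded[None])[0, target]
    return heat                                     # large drop = important
```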

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks; Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Database views; Information systems applications; Decision support systems; Data analytics

97. MNC: Structure-Exploiting Sparsity Estimation for Matrix Expressions.

Paper Link】 【Pages】:1607-1623

【Authors】: Johanna Sommer ; Matthias Boehm ; Alexandre V. Evfimievski ; Berthold Reinwald ; Peter J. Haas

【Abstract】: Efficiently computing linear algebra expressions is central to machine learning (ML) systems. Most systems support sparse formats and operations because sparse matrices are ubiquitous and their dense representation can cause prohibitive overheads. Estimating the sparsity of intermediates, however, remains a key challenge when generating execution plans or performing sparse operations. These sparsity estimates are used for cost and memory estimates, format decisions, and result allocation. Existing estimators tend to focus on matrix products only, and struggle to attain good accuracy with low estimation overhead. However, a key observation is that real-world sparse matrices commonly exhibit structural properties such as a single non-zero per row, or columns with varying sparsity. In this paper, we introduce MNC (Matrix Non-zero Count), a remarkably simple, count-based matrix synopsis that exploits these structural properties for efficient, accurate, and general sparsity estimation. We describe estimators and sketch propagation for realistic linear algebra expressions. Our experiments - on a new estimation benchmark called SparsEst - show that the MNC estimator yields good accuracy with very low overhead. This behavior makes MNC practical and broadly applicable in ML systems.
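A simplified count-based estimate (not MNC's structure-exploiting synopsis) already shows why per-row and per-column non-zero counts are useful metadata: they give both a hard upper bound and an independence-style estimate for nnz(A @ B) without computing the product.

```python
import numpy as np

def product_nnz_bounds(A, B):
    # A: m x k, B: k x n dense arrays standing in for sparse inputs
    m, k = A.shape
    n = B.shape[1]
    hcA = (A != 0).sum(axis=0)          # non-zeros per column of A
    hrB = (B != 0).sum(axis=1)          # non-zeros per row of B
    # each scalar multiplication yields at most one output non-zero
    upper = min(int((hcA * hrB).sum()), m * n)
    # independence-style estimate: entry (i, j) is zero only if no
    # index t contributes, i.e. prod_t (1 - p_A[t] * p_B[t])
    p_zero = np.prod(1.0 - (hcA / m) * (hrB / n))
    return upper, m * n * (1.0 - p_zero)
```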

【Keywords】: Information systems; Data management systems; Theory of computation; Design and analysis of algorithms; Streaming, sublinear and near linear time algorithms; Sketching and sampling

Research 17: Scalability 6

98. A Scalable Index for Top-k Subtree Similarity Queries.

Paper Link】 【Pages】:1624-1641

【Authors】: Daniel Kocher ; Nikolaus Augsten

【Abstract】: Given a query tree Q, the top-k subtree similarity query retrieves the k subtrees in a large document tree T that are closest to Q in terms of tree edit distance. The classical solution scans the entire document, which is slow. The state-of-the-art approach precomputes an index to reduce the query time. However, the index is large (quadratic in the document size), building the index is expensive, updates are not supported, and data-specific tuning is required.

【Keywords】: Information systems; Information retrieval; Retrieval models and ranking; Top-k retrieval in databases

99. A Layered Aggregate Engine for Analytics Workloads.

Paper Link】 【Pages】:1642-1659

【Authors】: Maximilian Schleich ; Dan Olteanu ; Mahmoud Abo Khamis ; Hung Q. Ngo ; XuanLong Nguyen

【Abstract】: This paper introduces LMFAO (Layered Multiple Functional Aggregate Optimization), an in-memory optimization and execution engine for batches of aggregates over the input database. The primary motivation for this work stems from the observation that for a variety of analytics over databases, their data-intensive tasks can be decomposed into group-by aggregates over the join of the input database relations. We exemplify the versatility and competitiveness of LMFAO for a handful of widely used analytics: learning ridge linear regression, classification trees, regression trees, and the structure of Bayesian networks using Chow-Liu trees; and data cubes used for exploration in data warehousing. LMFAO consists of several layers of logical and code optimizations that systematically exploit sharing of computation, parallelism, and code specialization. We conducted two types of performance benchmarks. In experiments with four datasets, LMFAO outperforms by several orders of magnitude on one hand, a commercial database system and MonetDB for computing batches of aggregates, and on the other hand, TensorFlow, Scikit, R, and AC/DC for learning a variety of models over databases.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Information systems applications; Decision support systems; Data analytics

100. Towards Scalable Hybrid Stores: Constraint-Based Rewriting to the Rescue.

Paper Link】 【Pages】:1660-1677

【Authors】: Rana Alotaibi ; Damian Bursztyn ; Alin Deutsch ; Ioana Manolescu ; Stamatis Zampetakis

【Abstract】: Big data applications routinely involve diverse datasets: flat or nested relations, complex-structure graphs, documents, poorly structured logs, or even text data. To handle the data, application designers usually rely on several data stores used side-by-side, each capable of handling one or a few data models, and each very efficient for some, but not all, kinds of processing on the data. A current limitation is that applications are written taking into account which part of the data is stored in which store and how. This fails to take advantage of (i) possible redundancy, when the same data may be accessible (with different performance) from distinct data stores; (ii) partial query results (in the style of materialized views) which may be available in the stores. We present ESTOCADA, a novel approach connecting applications to the potentially heterogeneous systems where their input data resides. ESTOCADA can be used in a polystore setting to transparently enable each query to benefit from the best combination of stored data and available processing capabilities. ESTOCADA leverages recent advances in the area of view-based query rewriting under constraints, which we use to describe the various data models and stored data. Our experiments illustrate the significant performance gains achieved by ESTOCADA.

【Keywords】: Information systems; Data management systems; Database management system engines; Information retrieval; Information retrieval query processing; Query reformulation; Theory of computation; Theory and algorithms for application domains; Database theory; Data integration; Database constraints theory; Database query languages (principles)

101. MIFO: A Query-Semantic Aware Resource Allocation Policy.

Paper Link】 【Pages】:1678-1695

【Authors】: Prajakta Kalmegh ; Shivnath Babu

【Abstract】: Data analytics frameworks encourage sharing of clusters for execution of mixed workloads by promising fairness and isolation along with high performance and resource utilization. However, concurrent query executions on such shared clusters result in increased queue and resource waiting times for queries, affecting their overall performance. MIFO is a dataflow-aware scheduling policy that mitigates the impact of queue and resource contention by reducing the waiting times for queries near completion. We present heuristics that exploit query semantics to proactively trigger MIFO-based allocations in a workload. Our experiments on Apache Spark using the TPC-DS benchmark show that, compared to a FAIR policy, MIFO provides improved mean response time, reduced workload makespan, and average speedups between 1.2x and 2.7x in a highly concurrent setting, with only a momentary deviation in fairness.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Cloud computing; Computing methodologies; Distributed computing methodologies; Distributed algorithms; MapReduce algorithms; Information systems; Data management systems; Database administration; Database utilities and tools; Database management system engines; Online analytical processing engines; Parallel and distributed DBMSs; MapReduce-based systems

102. Dissecting the Performance of Strongly-Consistent Replication Protocols.

Paper Link】 【Pages】:1696-1710

【Authors】: Ailidani Ailijiang ; Aleksey Charapko ; Murat Demirbas

【Abstract】: Many distributed databases employ consensus protocols to ensure that data is replicated in a strongly-consistent manner on multiple machines despite failures and concurrency. Unfortunately, these protocols show widely varying performance under different network, workload, and deployment conditions, and no previous study offers a comprehensive dissection and comparison of their performance. To fill this gap, we study single-leader, multi-leader, hierarchical multi-leader, and leaderless (opportunistic leader) consensus protocols, and present a comprehensive evaluation of their performance in local area networks (LANs) and wide area networks (WANs). We take a two-pronged systematic approach. We present an analytic modeling of the protocols using queuing theory and show simulations under varying controlled parameters. To cross-validate the analytic model, we also present empirical results from our prototyping and evaluation framework, Paxi. We distill our findings to simple throughput and latency formulas over the most significant parameters. These formulas enable the developers to decide which category of protocols would be most suitable under given deployment conditions.
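
The paper distills its findings from its own queueing model; as a generic illustration of that style of reasoning (not the paper's actual formulas), a single-leader protocol can be viewed as an M/M/1 queue at the leader, whose latency explodes as the leader's message load approaches its service rate:

```python
def mm1_latency(arrival_rate, service_rate):
    """Classic M/M/1 sojourn time W = 1 / (mu - lambda)."""
    assert arrival_rate < service_rate, "queue must be stable"
    return 1.0 / (service_rate - arrival_rate)

# In a single-leader protocol the leader handles O(N) messages per commit,
# so its effective arrival rate grows with cluster size N (rates invented).
for n in (3, 5, 7):
    print(n, mm1_latency(arrival_rate=1000 * n, service_rate=10000))
```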

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Cloud computing; Computing methodologies; Distributed computing methodologies; Distributed algorithms; General and reference; Cross-computing tools and techniques; Performance; Information systems; Data management systems; Database management system engines; Parallel and distributed DBMSs

103. FishStore: Faster Ingestion with Subset Hashing.

Paper Link】 【Pages】:1711-1728

【Authors】: Dong Xie ; Badrish Chandramouli ; Yinan Li ; Donald Kossmann

【Abstract】: The last decade has witnessed a huge increase in data being ingested into the cloud, in forms such as JSON, CSV, and binary formats. Traditionally, data is either ingested into storage in raw form, indexed ad-hoc using range indices, or cooked into analytics-friendly columnar formats. None of these solutions is able to handle modern requirements on storage: making the data available immediately for ad-hoc and streaming queries while ingesting at extremely high throughputs. This paper builds on recent advances in parsing and indexing techniques to propose FishStore, a concurrent latch-free storage layer for data with flexible schema, based on multi-chain hash indexing of dynamically registered predicated subsets of data. We find predicated subset hashing to be a powerful primitive that supports a broad range of queries on ingested data and admits a high-performance concurrent implementation. Our detailed evaluation on real datasets and queries shows that FishStore can handle a wide range of workloads and can ingest and retrieve data at an order of magnitude lower cost than state-of-the-art alternatives.
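
A minimal sketch of the predicated-subset-hashing idea (class and method names invented; the real system is a latch-free concurrent store): each registered predicate defines a subset, and every ingested record is linked into the chain of each predicate it satisfies, so a query over a subset touches only that subset's chain:

```python
from collections import defaultdict

class SubsetIndex:
    def __init__(self):
        self.predicates = {}              # name -> callable(record) -> bool
        self.chains = defaultdict(list)   # name -> offsets of matching records

    def register(self, name, predicate):
        self.predicates[name] = predicate

    def ingest(self, offset, record):
        # Link the record into every registered subset it belongs to.
        for name, pred in self.predicates.items():
            if pred(record):
                self.chains[name].append(offset)

    def scan(self, name, fetch):
        # Retrieve only the records in the requested predicated subset.
        return [fetch(off) for off in self.chains[name]]
```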

【Keywords】: Information systems; Data management systems; Data structures; Database management system engines; Parallel and distributed DBMSs; Key-value stores; Record and buffer management; Stream management

Industry 3: Data Platforms 6

104. CFS: A Distributed File System for Large Scale Container Platforms.

Paper Link】 【Pages】:1729-1742

【Authors】: Haifeng Liu ; Wei Ding ; Yuan Chen ; Weilong Guo ; Shuoran Liu ; Tianpeng Li ; Mofei Zhang ; Jianxing Zhao ; Hongyin Zhu ; Zhengyi Zhu

【Abstract】: We propose CFS, a distributed file system for large-scale container platforms. CFS supports both sequential and random file accesses with optimized storage for both large and small files, and adopts different replication protocols for different write scenarios to improve replication performance. It employs a metadata subsystem to store and distribute the file metadata across different storage nodes based on memory usage. This metadata placement strategy avoids the need for data rebalancing during capacity expansion. CFS also provides POSIX-compliant APIs with relaxed semantics and metadata atomicity to improve system performance. We performed a comprehensive comparison with Ceph, a widely used distributed file system on container platforms. Our experimental results show that, in testing 7 commonly used metadata operations, CFS gives around a 3x performance boost on average. In addition, CFS exhibits better random-read/write performance in highly concurrent environments with multiple clients and processes.

【Keywords】: Information systems; Information storage systems; Storage architectures; Distributed storage

105. Socrates: The New SQL Server in the Cloud.

Paper Link】 【Pages】:1743-1756

【Authors】: Panagiotis Antonopoulos ; Alex Budovski ; Cristian Diaconu ; Alejandro Hernandez Saenz ; Jack Hu ; Hanuma Kodavalla ; Donald Kossmann ; Sandeep Lingam ; Umar Farooq Minhas ; Naveen Prakash ; Vijendra Purohit ; Hugh Qu ; Chaitanya Sreenivas Ravella ; Krystyna Reisteter ; Sheetal Shrotri ; Dixin Tang ; Vikram Wakade

【Abstract】: The database-as-a-service paradigm in the cloud (DBaaS) is becoming increasingly popular. Organizations adopt this paradigm because they expect higher security, higher availability, and lower and more flexible cost with high performance. It has become clear, however, that these expectations cannot be met in the cloud with the traditional, monolithic database architecture. This paper presents a novel DBaaS architecture, called Socrates. Socrates has been implemented in Microsoft SQL Server and is available in Azure as SQL DB Hyperscale. This paper describes the key ideas and features of Socrates, and it compares the performance of Socrates with the previous SQL DB offering in Azure.

【Keywords】: Information systems; Data management systems; Database management system engines; DBMS engine architectures

106. One SQL to Rule Them All - an Efficient and Syntactically Idiomatic Approach to Management of Streams and Tables.

Paper Link】 【Pages】:1757-1772

【Authors】: Edmon Begoli ; Tyler Akidau ; Fabian Hueske ; Julian Hyde ; Kathryn Knight ; Kenneth Knowles

【Abstract】: Real-time data analysis and management are increasingly critical for today's businesses. SQL is the de facto lingua franca for these endeavors, yet support for robust streaming analysis and management with SQL remains limited. Many approaches restrict semantics to a reduced subset of features and/or require a suite of non-standard constructs. Additionally, use of event timestamps to provide native support for analyzing events according to when they actually occurred is not pervasive, and often comes with important limitations. We present a three-part proposal for integrating robust streaming into SQL, namely: (1) time-varying relations as a foundation for classical tables as well as streaming data, (2) event time semantics, (3) a limited set of optional keyword extensions to control the materialization of time-varying query results. We show how with these minimal additions it is possible to utilize the complete suite of standard SQL semantics to perform robust stream processing. We motivate and illustrate these concepts using examples and describe lessons learned from implementations in Apache Calcite, Apache Flink, and Apache Beam. We conclude with syntax and semantics of a concrete proposal for extensions of the SQL standard and note further areas of exploration.
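
The key point of event time semantics is that records are grouped by when they occurred, not when they arrived; a toy sketch of event-time windowing (function names invented, and no relation to the proposed SQL syntax):

```python
from collections import defaultdict

def tumbling_window_counts(events, width_secs):
    """events: iterable of (event_time_secs, key). Grouping by event time
    means late or out-of-order arrivals still land in the right window."""
    windows = defaultdict(int)
    for event_time, key in events:
        window_start = (event_time // width_secs) * width_secs
        windows[(window_start, key)] += 1
    return dict(windows)

# The record with event time 3 arrives late but is counted in window [0, 10).
print(tumbling_window_counts([(12, "a"), (3, "a"), (14, "b")], 10))
```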

【Keywords】: Information systems; Data management systems; Database management system engines; Stream management; Query languages

107. Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing.

Paper Link】 【Pages】:1773-1786

【Authors】: Jesús Camacho-Rodríguez ; Ashutosh Chauhan ; Alan Gates ; Eugene Koifman ; Owen O'Malley ; Vineet Garg ; Zoltan Haindrich ; Sergey Shelukhin ; Prasanth Jayachandran ; Siddharth Seth ; Deepak Jaiswal ; Slim Bouguerra ; Nishant Bangarwa ; Sankar Hariappan ; Anishek Agarwal ; Jason Dere ; Daniel Dai ; Thejas Nair ; Nita Dembla ; Gopal Vijayaraghavan ; Günther Hagleitner

【Abstract】: Apache Hive is an open-source relational database system for analytic big-data workloads. In this paper we describe the key innovations on the journey from batch tool to fully fledged enterprise data warehousing system. We present a hybrid architecture that combines traditional MPP techniques with more recent big data and cloud concepts to achieve the scale and performance required by today's analytic applications. We explore the system by detailing enhancements along four main axes: transactions, optimizer, runtime, and federation. We then provide experimental results to demonstrate the performance of the system for typical workloads and conclude with a look at the community roadmap.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Distributed database transactions; Parallel and distributed DBMSs; Relational parallel and distributed DBMSs; Information integration; Federated databases; Information systems applications; Decision support systems; Data warehouses

108. FoundationDB Record Layer: A Multi-Tenant Structured Datastore.

Paper Link】 【Pages】:1787-1802

【Authors】: Christos Chrysafis ; Ben Collins ; Scott Dugas ; Jay Dunkelberger ; Moussa Ehsan ; Scott Gray ; Alec Grieser ; Ori Herrnstadt ; Kfir Lev-Ari ; Tao Lin ; Mike McMahon ; Nicholas Schiefer ; Alexander Shraer

【Abstract】: The FoundationDB Record Layer is an open source library that provides a record-oriented data store with semantics similar to a relational database implemented on top of FoundationDB, an ordered, transactional key-value store. The Record Layer provides a lightweight, highly extensible way to store structured data. It offers schema management and a rich set of query and indexing facilities, some of which are not usually found in traditional relational databases, such as nested record types, indexes on commit versions, and indexes that span multiple record types. The Record Layer is stateless and built for massive multi-tenancy, encapsulating and isolating all of a tenant's state, including indexes, into a separate logical database. We demonstrate how the Record Layer is used by CloudKit, Apple's cloud backend service, to provide powerful abstractions to applications serving hundreds of millions of users. CloudKit uses the Record Layer to host billions of independent databases, many with a common schema. Features provided by the Record Layer enable CloudKit to provide richer APIs and stronger semantics with reduced maintenance overhead and improved scalability.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Cloud computing; Information systems; Data management systems; Database management system engines; Parallel and distributed DBMSs

109. Data Platform for Machine Learning.

Paper Link】 【Pages】:1803-1816

【Authors】: Pulkit Agrawal ; Rajat Arya ; Aanchal Bindal ; Sandeep Bhatia ; Anupriya Gagneja ; Joseph Godlewski ; Yucheng Low ; Timothy Muss ; Mudit Manu Paliwal ; Sethu Raman ; Vishrut Shah ; Bochao Shen ; Laura Sugden ; Kaiyu Zhao ; Ming-Chuan Wu

【Abstract】: In this paper, we present a purpose-built data management system, MLdp, for all machine learning (ML) datasets. ML applications pose some unique requirements different from common conventional data processing applications, including but not limited to: data lineage and provenance tracking, rich data semantics and formats, integration with diverse ML frameworks and access patterns, trial-and-error driven data exploration and evolution, rapid experimentation, reproducibility of the model training, strict compliance and privacy regulations, etc. Current ML systems/services, often named MLaaS, to date focus on the ML algorithms and offer no integrated data management system. Instead, they require users to bring their own data and to manage that data on either blob storage or file systems. The burdens of data management tasks, such as versioning and access control, fall onto the users, and not all compliance features, such as terms of use, privacy measures, and auditing, are available. MLdp offers a minimalist and flexible data model for all varieties of data, strong version management to guarantee reproducibility of ML experiments, and integration with major ML frameworks. MLdp also maintains the data provenance to help users track lineage and dependencies among data versions and models in their ML pipelines. In addition to table-stakes features, such as security, availability and scalability, MLdp's internal design choices are strongly influenced by the goal to support rapid ML experiment iterations, which cycle through data discovery, data exploration, feature engineering, model training, model evaluation, and back to data discovery. The contributions of this paper are: 1) to recognize the needs and to call out the requirements of an ML data platform, 2) to share our experiences in building MLdp by adapting existing database technologies to the new problem as well as by devising new solutions, and 3) to call for actions from our communities on future challenges.

【Keywords】: Information systems; Data management systems; Data structures; Data layout; Record and block layout; Database design and models; Data model extensions; Data provenance; Data streams; Semi-structured data; Information storage systems; Storage management; Information lifecycle management; Version management; Theory of computation; Theory and algorithms for application domains; Database theory; Data modeling

Student Abstracts 16

110. Scalable Reservoir Sampling on Many-Core CPUs.

Paper Link】 【Pages】:1817-1819

【Authors】: Altan Birler

【Abstract】: Database systems need to be able to convert queries to efficient execution plans. As recent research has shown, correctly estimating cardinalities of subqueries is an important factor in the efficiency of the resulting plans [7, 8]. Many algorithms have been proposed in the literature that utilize a random sample to estimate cardinalities [6, 9, 13]. Thus, some modern database systems choose to store a materialized, uniformly random sample of their relations [3, 6]. Such samples are built and refreshed when statistics are gathered, by loading uniformly random tuples from the relation on disk using random I/O.
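
The classic one-pass method for maintaining such a sample is reservoir sampling (Algorithm R); the abstract concerns scaling this on many-core CPUs, but the sequential baseline looks like this:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """One pass, O(k) memory; every item ends up in the sample with
    probability k/n regardless of the stream length n."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # uniform over [0, i]
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), 5))
```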

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization; Main memory engines; Parallel and distributed DBMSs; Relational parallel and distributed DBMSs

111. Helios: An Adaptive and Query Workload-driven Partitioning Framework for Distributed Graph Stores.

Paper Link】 【Pages】:1820-1822

【Authors】: Ali Davoudian

【Abstract】: In this paper, we present a novel adaptive and workload-driven partitioning framework, named Helios, aiming to achieve low-latency and high-throughput online queries in distributed graph stores. As each workload typically contains popular or similar queries, our partitioning method uses the existing workload to capture active vertices and edges, which are frequently visited and traversed, respectively. This information is used to heuristically improve the quality of partitions, either by avoiding the concentration of active vertices in a few partitions proportional to their visit frequencies or by reducing the probability of the cut of active edges proportional to their traversal frequencies. In order to assess the impact of Helios on a graph store, and to show how easily the approach can be plugged on top of the system, we exploit it in a distributed, graph-based RDF store. The query engine of the store exploits Helios to reduce or eliminate data communication for future queries and to balance the load among nodes. We evaluate the store by using realistic query workloads over an RDF dataset. Our results demonstrate the ability of Helios to handle varying query workloads with minimum overhead while maintaining the quality of partitions over time, along with scaling as either the data size or the number of computing nodes increases.

【Keywords】: Information systems; Data management systems; Database design and models; Graph-based database models

112. CAvSAT: A System for Query Answering over Inconsistent Databases.

Paper Link】 【Pages】:1823-1825

【Authors】: Akhil A. Dixit

【Abstract】: Managing inconsistencies in databases is an old, but recurring, problem. An inconsistent database is a database that violates one or more integrity constraints. In the real world, inconsistent databases arise in several different contexts, including data warehousing and information integration. The framework of database repairs and consistent query answering (CQA) is a principled way of handling inconsistencies. In this work, we propose a novel approach that has the potential to yield a comprehensive and scalable CQA system. We report preliminary experimental results on a prototype CQA system, CAvSAT (Consistent Answering via Satisfiability), implemented using this approach.

【Keywords】: Hardware; Hardware validation; Functional verification; Theorem proving and SAT solving; Information systems; Data management systems; Database design and models; Data model extensions; Inconsistent data; Information integration; Data cleaning; Theory of computation; Theory and algorithms for application domains; Database theory; Incomplete, inconsistent, and uncertain databases

113. Interactive Visualization For Big Spatial Data.

Paper Link】 【Pages】:1826-1828

【Authors】: Saheli Ghosh

【Abstract】: The significance of spatial data in the present world is undeniable. From satellite data to GPS locations, from Facebook tags to Yelp check-ins, spatial data has become an integral part of our daily life. Interactive visualization of these datasets can be of immense help to the scientific community for exploratory analytics, which in turn helps to identify unique patterns and trends.

【Keywords】: Information systems; Data management systems

114. LSM-Trees and B-Trees: The Best of Both Worlds.

Paper Link】 【Pages】:1829-1831

【Authors】: Varun Jain ; James Lennon ; Harshita Gupta

【Abstract】: LSM-Trees and B-Trees are the two primary data structures used as storage engines in modern key-value (KV) stores. These two structures are optimal for different workloads; LSM-Trees perform better on update queries, whereas B-Trees are preferable for short range lookups. KV stores today use one or the other. However, for modern applications with increasingly diverse workloads, limiting KV stores to only one of the two designs leads to a significant loss in performance. We propose a novel method for transitioning a KV store online from an LSM-Tree to a B-Tree and vice versa. This allows KV stores to smoothly adapt to changing workloads and use the optimal data structure as the workload changes.

【Keywords】: Information systems; Data management systems; Data structures; Data access methods; Data layout; Record and block layout; Information storage systems; Storage management

115. Answering Range Queries Under Local Differential Privacy.

Paper Link】 【Pages】:1832-1834

【Authors】: Tejas Kulkarni

【Abstract】: Counting the fraction of a population having an input within a specified interval, i.e., a range count query, is a fundamental database operation. Range count queries can also be used to compute other interesting statistics such as quantiles. The framework of differential privacy [6] (DP) is becoming a standard for privacy-preserving data analysis [1]. While many works address the problem of range counting queries in the trusted aggregation model, surprisingly, this problem has not been addressed specifically under untrusted aggregation (local DP [10]). In this work we study the problem of answering 1-dimensional range count queries under the constraint of LDP.
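
To make the setting concrete, here is a sketch using k-ary randomized response, a standard LDP primitive (the paper studies more refined mechanisms for range queries): each user perturbs their bucket locally, and the untrusted aggregator debiases the per-bucket counts and sums them over the range:

```python
import math, random

def rr_perturb(bucket, domain, eps, rng=random):
    """Report the true bucket w.p. p = e^eps / (e^eps + k - 1),
    else a uniformly random other bucket."""
    k = len(domain)
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    if rng.random() < p:
        return bucket
    return rng.choice([b for b in domain if b != bucket])

def estimate_range(reports, domain, eps, lo, hi):
    """Debias the observed per-bucket counts, then answer the range query."""
    k, n = len(domain), len(reports)
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    q = (1 - p) / (k - 1)
    counts = {b: 0 for b in domain}
    for r in reports:
        counts[r] += 1
    # E[counts[b]] = n*q + true_count[b]*(p - q), so invert that.
    unbiased = {b: (counts[b] - n * q) / (p - q) for b in domain}
    return sum(unbiased[b] for b in domain if lo <= b <= hi)
```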

【Keywords】: Security and privacy; Security services; Privacy-preserving protocols; Theory of computation; Theory and algorithms for application domains; Database theory; Theory of database privacy and security

116. Fingerprints for Compressed Columnar Data Search.

Paper Link】 【Pages】:1835-1837

【Authors】: Carmen Kwan

【Abstract】: To enhance performance in main memory databases, compression techniques have been suggested to keep large volumes of data in memory, as opposed to loading data on demand from slower media storage. A high compression ratio, however, comes with both memory and performance overhead for queries; packed data needs to be decompressed into vectors before applying optimized scan algorithms. In this work, we propose data summaries at the column-block level. Our preliminary experimental studies on TPC-H data confirm that, under the same memory budget used for a MinMax synopsis, our block headers can lower false positive rates by up to 30% for compressed data scans and can reduce the overhead of employing advanced compression schemes.
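
For context, the MinMax baseline the work compares against operates as follows (a sketch; the proposed block headers carry richer summaries than min/max in order to cut false positives further):

```python
def build_minmax_headers(column, block_size):
    """One (offset, min, max) header per column block."""
    return [(i, min(column[i:i + block_size]), max(column[i:i + block_size]))
            for i in range(0, len(column), block_size)]

def candidate_blocks(headers, lo, hi):
    """Only blocks whose [min, max] overlaps [lo, hi] must be
    decompressed and scanned; all other blocks are skipped."""
    return [off for off, mn, mx in headers if not (mx < lo or mn > hi)]
```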

【Keywords】: Information systems; Data management systems; Data structures; Data access methods; Data scans; Data layout; Data compression

117. Learning to Generate Questions with Adaptive Copying Neural Networks.

Paper Link】 【Pages】:1838-1840

【Authors】: Xinyuan Lu

【Abstract】: Automatic question generation is an important problem in natural language processing. In this paper, we propose a novel adaptive copying recurrent neural network model to tackle the problem of question generation from sentences and paragraphs. The proposed model adds a copying mechanism component onto a bidirectional LSTM architecture to generate more suitable questions adaptively from the input data. Our experimental results show the proposed model can outperform the state-of-the-art question generation methods in terms of BLEU and ROUGE evaluation scores.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing; Natural language generation

118. Towards Understanding Data Analysis Workflows using a Large Notebook Corpus.

Paper Link】 【Pages】:1841-1843

【Authors】: Mohammed Suhail Rehman

【Abstract】: The advent of big data analysis as a profession as well as a hobby has brought an increase in novel forms of data exploration and analysis, particularly ad-hoc analysis. Analysis of raw datasets using frameworks such as pandas and R has become very popular [8]. Typically these types of workflows are geared towards ingesting and transforming data in an exploratory fashion in order to derive knowledge while minimizing time-to-insight. However, there exists very little work studying usability and performance concerns of such unstructured workflows.

【Keywords】: Human-centered computing; Human computer interaction (HCI); Empirical studies in HCI; Information systems; Data management systems; Database design and models; Data model extensions; Data provenance; Information integration; Extraction, transformation and loading

119. Query-Driven Learning for Next Generation Predictive Modeling & Analytics.

Paper Link】 【Pages】:1844-1846

【Authors】: Fotis Savva

【Abstract】: As data size is increasing exponentially, new paradigm shifts have to emerge that allow fast exploitation of data by everybody. Large-scale predictive analytics is restricted to wealthy organizations, as small-scale enterprises (SMEs) struggle to compete and are inundated by the sheer monetary cost of either procuring data infrastructures or analyzing datasets over the Cloud. The aim of this work is to study mechanisms which can democratize analytics, in the sense of making them affordable, while at the same time ensuring high efficiency, scalability, and accuracy. The crux of this proposal lies in developing query-driven solutions that can be used off the Cloud, thus minimizing costs. Our query-driven approach learns and adapts machine learning models on the fly, based solely on query-answer interactions, which can then be used for answering analytical queries. In this abstract we describe the methodology followed for the implementation and evaluation of the designed system.

【Keywords】: Computing methodologies; Machine learning; Learning paradigms; Supervised learning; Supervised learning by regression; Information systems; Data management systems; Information systems applications; Decision support systems; Data analytics

120. SpeakQL: Towards Speech-driven Multimodal Querying.

Paper Link】 【Pages】:1847-1849

【Authors】: Vraj Shah

【Abstract】: Speech-based inputs have become popular in many applications in constrained device environments such as smartphones and tablets, and even personal conversational assistants such as Siri, Alexa, and Cortana. Inspired by this recent success of speech-driven interfaces, in this work, we consider an important fundamental question: How should one design a speech-driven system to query structured data? Recent works have studied new querying modalities like visual [4, 8], touch-based [3, 7], and natural language interfaces (NLIs) [5, 6], especially for constrained querying environments such as tablets, smartphones, and conversational assistants. The commands given by the user are then translated to the Structured Query Language (SQL). But conspicuous by its absence is a speech-driven interface for regular SQL or other structured querying. One might wonder: Why dictate structured queries and not just use NLIs or visual interfaces? From a practical standpoint, many users in the C-suite, enterprise, Web, and other domains are already familiar with SQL (even if only a subset of it) and use it routinely. A spoken SQL interface could help them speed up query specification, especially in constrained settings such as smartphones and tablets, where typing SQL would be painful. More fundamentally, there is a trade-off inherent in any query interface, as illustrated in Figure 1(A).

【Keywords】: Human-centered computing; Human computer interaction (HCI); Interaction paradigms; Natural language interfaces; Information systems; Data management systems; Database management system engines; Query languages; Relational database query languages

121. Arachnid: Generalized Visual Data Cleaning.

Paper Link】 【Pages】:1850-1852

【Authors】: Conder L. Shou ; Amita Shukla

【Abstract】: Data cleaning is an inherently exploratory and visual process. Visualizations help analysts spot errors or surprising patterns/relationships that would otherwise go unnoticed. Once identified, the user will want to execute transformations in an attempt to correct these errors. However, alternating between the contexts of transforming and visualizing the data can be tedious. As a result, there is a need for data to be visualized and cleaned on the fly. Current solutions address this issue by allowing users to specify data cleaning transformations through a limited set of interactions that can directly manipulate visualizations pre-defined by the cleaning system itself. These visualizations are either generated as a table (OpenRefine, Wrangler, Microsoft Excel), or as bar graphs (Tableau Prep). By mapping a narrow set of mouse-based interactions to data cleaning specifications, these systems allow users to intuitively and quickly transform data within a constrained set of use cases. However, in order to gain a comprehensive visual understanding of the data and to quickly clean it for additional use cases, analysts often need to generate multiple visualizations from a set of common types and interactively execute data transformations beyond those supported by current cleaning software. Thus, we propose Arachnid as a novel system that builds upon existing work in generalized selection and direct manipulation to introduce a model for translating mouse-based interactions on common types of user-defined visualizations into an enhanced set of data cleaning transformations.

【Keywords】: Human-centered computing; Interaction design; Interaction design process and methods; User centered design; Interaction design theory, concepts and paradigms; Information systems; Data management systems; Information integration; Data cleaning; Information systems applications; Data mining; Data cleaning

122. Generating Selective Filters for Access Method and Physical Design Evaluation.

Paper Link】 【Pages】:1853-1855

【Authors】: Pranav Subramaniam

【Abstract】: It is a challenge for researchers and system developers to evaluate the impacts of new access methods and physical designs on query performance. One can evaluate these by executing selective filters over differing data distributions. For example, an index lookup may give better performance than a full table scan for a highly selective query [4]. However, generating a filter workload on nonsynthetic datasets to evaluate new techniques currently involves manually coming up with workloads of varying selectivities, which can be cumbersome. Automatically generating workloads with given selectivities over any dataset can facilitate a systematic study of the performance of data storage and query optimization methods that also allows researchers to leverage interesting datasets. Toward this goal, this paper describes a new query generation method that, given a table T in a database D, and a selectivity constraint (L, R), where 0 ≤ L < R ≤ 1, generates filter queries (i.e., queries with predicates in WHERE clauses) whose output size is between L·|T| and R·|T|. We first show that generating filter queries with selectivities satisfying the given constraint requires solving the subset sum problem for each integral value in the range [L·|T|, R·|T|]. We then present a polynomial-time heuristic for the same and discuss the performance of the heuristic on a large dataset.
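
A greedy illustration of the idea (not the paper's exact heuristic): treat per-value row counts as subset-sum items and assemble an IN-list whose total frequency lands in the target window:

```python
def pick_in_list(freqs, n_rows, L, R):
    """freqs: dict value -> row count for a column of table T.
    Returns values for 'WHERE col IN (...)' whose combined selectivity
    falls in [L, R], or None if this greedy pass fails."""
    lo, hi = L * n_rows, R * n_rows
    total, chosen = 0, []
    for v, f in sorted(freqs.items(), key=lambda kv: -kv[1]):
        if total + f <= hi:          # adding v keeps us under the upper bound
            chosen.append(v)
            total += f
            if total >= lo:          # lower bound reached: done
                return chosen
    return None

print(pick_in_list({"x": 500, "y": 300, "z": 50}, 1000, 0.3, 0.4))  # ['y']
```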

【Keywords】: Information systems; Data management systems; Database administration; Database performance evaluation; Database management system engines; Database query processing; Query planning

123. Deep Query Optimization.

Paper Link】 【Pages】:1856-1858

【Authors】: Tin Vu

【Abstract】: In recent decades, we have observed the rapid growth of several big data platforms. Each of them is designed for specific demands. For instance, Spark can efficiently process iterative queries, while Storm is designed for stream processing. In this context, the complexity of these distributed systems makes it much harder to develop rigorous cost models for query optimization problems. This paper aims to address two problems of the query optimization process: cost estimation and index selection. The cost estimation problem predicts the best execution plan by measuring the cost of alternative query plans. The index selection problem determines the most suitable indexing method for a given dataset. Both problems require the development of a complex function that measures the cost or suitability of alternatives for a specific dataset. Therefore, we employ deep learning to solve these problems due to its capability of learning complicated models. We first address a simple form of the cost estimation problem: selectivity estimation. Our preliminary results show that our deep learning models work efficiently, with selectivity estimation accuracy of up to 97%.

【Keywords】: Information systems; Data management systems

124. Recommending Deployment Strategies in Crowdsourcing Platforms.

Paper Link】 【Pages】:1859-1861

【Authors】: Dong Wei

【Abstract】: We initiate the study of how to recommend deployment strategies to the task designers in crowdsourcing platforms. The work proposes the first optimization-based formalism of task deployment strategies, based on different parameters that task designers have in mind during deployment, and presents principled algorithms designed using computational geometry techniques.

【Keywords】: Information systems; World Wide Web; Web applications; Crowdsourcing; Answer ranking; Theory of computation; Randomness, geometry and discrete structures; Computational geometry

125. Bootstrapping an End-to-End Natural Language Interface for Databases.

Paper Link】 【Pages】:1862-1864

【Authors】: Nathaniel Weir ; Prasetya Utama

【Abstract】: The ability to extract insights from data is critical for decision making. Intuitive natural language interfaces to databases provide non-technical users with an effective way to formulate complex questions and information needs efficiently and effectively. A recent trend in the area of Natural Language Interfaces for Databases (NLIDBs) has been the use of neural machine translation models to synthesize executable Structured Query Language (SQL) queries from natural language utterances. The main bottleneck in this type of approach is the acquisition of examples for training the model. Recent work has assumed access to a rich manually-curated training set for a given target database. However, this assumption ignores the large manual overhead required to curate the training set for any new database. As a result, NLIDB systems that can simply 'plug in' to any new database and perform effectively for naive users have yet to make their way into commercial products. Here we present DBPal, an end-to-end NLIDB framework in which a neural translation model is trained for any new database schema with minimal manual overhead. In addition to being the first off-the-shelf, neural machine translation-based system of its kind, the contributions of our project are 1) its use of a synthetic training set generation pipeline to bootstrap a translation model without requiring manually curated data, and 2) its use of state-of-the-art multi-task and cross-domain learning techniques that increase the robustness of the translation model towards unseen linguistic phenomena in new domains. In experiments we show that our system achieves competitive performance on recently released benchmarks for NL-to-SQL translation. Through ablation experiments we show the benefit of using cross-domain learning techniques on the performance of the system. In a user study we show that DBPal outperforms a well-known rule-based NLIDB and performs comparably to an approach using a similar neural model that relies on manually curated data.
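
A toy sketch of template-based training-set synthesis (templates and schema are invented; the real pipeline additionally applies paraphrasing, slot augmentation, and cross-domain techniques):

```python
import itertools

TEMPLATES = [
    ("show all {table} where {col} is {val}",
     "SELECT * FROM {table} WHERE {col} = '{val}'"),
    ("how many {table} have {col} {val}",
     "SELECT COUNT(*) FROM {table} WHERE {col} = '{val}'"),
]
SCHEMA = {"patients": {"diagnosis": ["flu", "asthma"]}}  # invented example

def synthesize():
    """Yield (natural language, SQL) pairs by instantiating templates
    with every table/column/value combination from the schema."""
    for (nl_t, sql_t), (table, cols) in itertools.product(TEMPLATES,
                                                          SCHEMA.items()):
        for col, vals in cols.items():
            for val in vals:
                fill = dict(table=table, col=col, val=val)
                yield nl_t.format(**fill), sql_t.format(**fill)

for nl, sql in synthesize():
    print(nl, "->", sql)
```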

【Keywords】: General and reference; Cross-computing tools and techniques; Experimentation; Human-centered computing; Human computer interaction (HCI); Interaction paradigms; Natural language interfaces; Information systems; Data management systems; Query languages; Relational database query languages; Structured Query Language

Demonstrations 40

126. GraphWrangler: An Interactive Graph View on Relational Data.

Paper Link】 【Pages】:1865-1868

【Authors】: Nafisa Anzum ; Semih Salihoglu ; Daniel Vogel

【Abstract】: Existing data stores of enterprises are full of connected data and users are increasingly finding value in performing graph querying, analytics and visualization on this data. This process involves a labor-intensive ETL pipeline, where users write scripts to extract graphs from data stored in legacy stores, often an RDBMS, and import these graphs into a graph-specific software. We demonstrate GraphWrangler, a system that allows users to connect to an RDBMS and within a few clicks extract graphs out of their tabular data, visualize and explore these graphs, and automatically generate scripts for their ETL pipelines. GraphWrangler adopts the predictive interaction framework and internally uses a data transformation language that is a limited subset of SQL. Our demonstration video can be found here: https://youtu.be/k92Qk6vuIsU

【Keywords】: Human-centered computing; Interaction design; Interaction design process and methods; User interface design; Visualization; Information systems; Data management systems; Database design and models; Graph-based database models; Information integration; Extraction, transformation and loading

127. Apollo: A Dataset Profiling and Operator Modeling System.

Paper Link】 【Pages】:1869-1872

【Authors】: Tasos Bakogiannis ; Ioannis Giannakopoulos ; Dimitrios Tsoumakos ; Nectarios Koziris

【Abstract】: The rapidly increasing amount of available data has created invaluable business opportunities but also new challenges. The focus on content-driven analytics is shifting attention from optimizing operators and systems to handle massive data sizes, to intelligent selection of those datasets that maximize the business competitive advantage. To date, there exists no efficient method to quantify the impact of numerous available datasets over different analytics tasks - a thorough execution over every input would be prohibitively expensive. In this demonstration, we present Apollo, a data profiling and operator modeling system that tackles this challenge. Our system quantifies dataset similarities and projects them into a low-dimensional space. Operator outputs are then estimated over the entire dataset, utilizing similarity information with Machine Learning and a small sample of actual executions. During the demo, attendees will be able to model and visualize multiple analytics operators over datasets from the domains of machine learning and graph analytics.

【Keywords】: Information systems; Data management systems

128. MapRepair: Mapping and Repairing under Policy Views.

Paper Link】 【Pages】:1873-1876

【Authors】: Angela Bonifati ; Ugo Comignani ; Efthymia Tsamoura

【Abstract】: Mapping design is overwhelming for end users, who must verify both the correctness of the mappings and the possible information disclosure over the exported source instance. In this demonstration, we focus on the latter problem by proposing a novel practical solution to ensure that a mapping faithfully complies with a set of privacy restrictions specified as source policy views. We showcase MapRepair, which guides the user through the tasks of visualizing the results of the data exchange process with and without the privacy restrictions. MapRepair leverages formal privacy guarantees and is inherently data-independent, i.e., if a set of criteria is satisfied by the mapping statement, then it guarantees that both the mapping and the underlying instances do not leak sensitive information. Furthermore, MapRepair also allows users to automatically repair an input mapping w.r.t. a set of policy views in case of information leakage. We build on various demonstration scenarios, including synthetic and real-world instances and mappings.

【Keywords】: Information systems; Data management systems; Information integration; Data exchange

129. Data Debugging and Exploration with Vizier.

Paper Link】 【Pages】:1877-1880

【Authors】: Mike Brachmann ; Carlos Bautista ; Sonia Castelo ; Su Feng ; Juliana Freire ; Boris Glavic ; Oliver Kennedy ; Heiko Mueller ; Rémi Rampin ; William Spoth ; Ying Yang

【Abstract】: We present Vizier, a multi-modal data exploration and debugging tool. The system supports a wide range of operations by seamlessly integrating Python, SQL, and automated data curation and debugging methods. Using Spark as an execution backend, Vizier handles large datasets in multiple formats. Ease-of-use is attained through integration of a notebook with a spreadsheet-style interface and with visualizations that guide and support the user in the loop. In addition, native support for provenance and versioning enable collaboration and uncertainty management. In this demonstration we will illustrate the diverse features of the system using several realistic data science tasks based on real data.

【Keywords】: Information systems; Data management systems; Information integration; Data cleaning; Entity resolution; Extraction, transformation and loading; Information storage systems; Storage management; Version management; Theory of computation; Theory and algorithms for application domains; Database theory; Data integration; Data provenance; Data structures and algorithms for data management; Incomplete, inconsistent, and uncertain databases

130. Large Scale Graph Mining with G-Miner.

Paper Link】 【Pages】:1881-1884

【Authors】: Hongzhi Chen ; Xiaoxi Wang ; Chenghuan Huang ; Juncheng Fang ; Yifan Hou ; Changji Li ; James Cheng

【Abstract】: This Demo presents G-Miner, a distributed system for graph mining. The take-aways for Demo attendees are: (1) a good understanding of the challenges of various graph mining workloads; (2) useful insights on how to design a good system for graph mining by comparing G-Miner with existing systems on performance, expressiveness and user-friendliness; and (3) how to use G-Miner for interactive graph analytics.

【Keywords】: Theory of computation; Models of computation; Concurrency; Distributed computing models

131. Demonstration of Nimbus: Model-based Pricing for Machine Learning in a Data Marketplace.

Paper Link】 【Pages】:1885-1888

【Authors】: Lingjiao Chen ; Hongyi Wang ; Leshang Chen ; Paraschos Koutris ; Arun Kumar

【Abstract】: Various domains such as business intelligence and journalism have made many achievements with the help of data analytics based on machine learning (ML). While a lot of work has studied how to reduce the cost of training, storing, and deploying ML models, there is little work on eliminating the data collection and purchase cost. Existing data markets provide only a simplistic mechanism allowing the sale of fixed datasets at fixed prices, which potentially hurts not only ML model availability for buyers with limited budgets, but market expansion and thus sellers' revenue as well. In this work, we demonstrate Nimbus, a data market framework for ML model exchange. Instead of pricing data, Nimbus prices ML models directly, which we call model-based pricing (MBP). Through interactive interfaces, the audience can play the role of sellers vending their own ML models with different price requirements, as well as the role of buyers purchasing ML model instances under different accuracy/budget constraints. We will further demonstrate how much gain in sellers' revenue and buyers' affordability Nimbus can achieve at low runtime cost, via both real-time and offline results.
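
To illustrate the model-based pricing interface (all numbers and names are invented, and the actual MBP mechanism enforces further properties such as arbitrage-freeness): the seller posts a price per error level, and a budget-constrained buyer receives the most accurate instance they can afford:

```python
# Price curve: price is non-increasing in error (more accurate = pricier).
PRICE_CURVE = [(0.20, 10.0), (0.10, 40.0), (0.05, 90.0)]  # (max_error, price)

def best_affordable(budget):
    """Return the (error, price) of the most accurate model instance the
    buyer can afford, or None if even the cheapest exceeds the budget."""
    affordable = [(err, p) for err, p in PRICE_CURVE if p <= budget]
    return min(affordable, default=None)  # tuple order: lowest error first

print(best_affordable(50.0))  # -> (0.10, 40.0)
```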

【Keywords】: Computing methodologies; Machine learning; Information systems; Data management systems; Theory of computation; Theory and algorithms for application domains; Algorithmic game theory and mechanism design

132. Peering through the Dark: An Owl's View of Inter-job Dependencies and Jobs' Impact in Shared Clusters.

Paper Link】 【Pages】:1889-1892

【Authors】: Andrew Chung ; Carlo Curino ; Subru Krishnan ; Konstantinos Karanasos ; Panagiotis Garefalakis ; Gregory R. Ganger

【Abstract】: Shared multi-tenant infrastructures have enabled companies to consolidate workloads and data, increasing data-sharing and cross-organizational re-use of job outputs. This same resource- and work-sharing has also increased the risk of missed deadlines and diverging priorities as recurring jobs and workflows developed by different teams evolve independently. To prevent incidental business disruptions, identifying and managing job dependencies with clarity becomes increasingly important. Owl is a cluster log analysis and visualization tool that (i) extracts and visualizes job dependencies derived from historical job telemetry and data provenance data sets, and (ii) introduces a novel job valuation algorithm estimating the impact of a job on dependent users and jobs. This demonstration showcases Owl's features that can help users identify critical job dependencies and quantify job importance based on jobs' impact.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Cloud computing; Software and its engineering; Software organization and properties; Contextual software domains; Operating systems; Process management; Scheduling

133. Capturing and Querying Structural Provenance in Spark with Pebble.

Paper Link】 【Pages】:1893-1896

【Authors】: Ralf Diestelkämper ; Melanie Herschel

【Abstract】: Analyzing and debugging Spark processing pipelines is a tedious task which typically involves a lot of engineering effort. The task becomes even more complex when the pipelines process nested data. Provenance solutions that track the derivation process of individual data items assist data engineers while debugging these pipelines. However, state-of-the-art solutions do not precisely track nested data items.

【Keywords】: Information systems; Data management systems; Database design and models; Data model extensions; Data provenance

134. CLASH: A High-Level Abstraction for Optimized, Multi-Way Stream Joins over Apache Storm.

Paper Link】 【Pages】:1897-1900

【Authors】: Manuel Dossinger ; Sebastian Michel ; Constantin Roudsarabi

【Abstract】: We propose the demonstration of CLASH, a high-level abstraction on top of Apache Storm. CLASH is designed around MultiStream, a novel join operator designed for native support of distributed, multi-way stream joins. MultiStream allows trading off materialization of intermediate results versus communication load. With this demonstration, we invite the audience to explore the full potential of CLASH: multi-way stream joins, creation of complex join plans and their automated optimization, and ultimately the hassle-free SQL-style user/application interface and the translation of the optimized query plans to deployable Storm topologies that are executed on our local compute cluster.

【Keywords】: Information systems; Data management systems; Database management system engines; Stream management

135. Visual Exploration of Time Series Anomalies with Metro-Viz.

Paper Link】 【Pages】:1901-1904

【Authors】: Philipp Eichmann ; Franco Solleza ; Nesime Tatbul ; Stan Zdonik

【Abstract】: This demo presents a novel data visualization solution for exploring the results of time series anomaly detection systems. When anomalies are reported, there is a need to reason about the results. We introduce Metro-Viz -- a visual tool to assist data scientists in performing this analysis. Metro-Viz offers a rich set of interaction features (e.g., comparative analysis, what-if testing) backed by data management strategies specifically tailored to the workload. We show our tool in action via multiple time series datasets and anomaly detectors.

【Keywords】: Computing methodologies; Machine learning; Human-centered computing; Human computer interaction (HCI); Interaction paradigms; Graphical user interfaces; Interactive systems and tools; Visualization; Information systems; Data management systems; Database design and models; Data model extensions; Temporal data; Database management system engines; Main memory engines; Middleware for databases; Database web servers

136. BlockchainDB - Towards a Shared Database on Blockchains.

Paper Link】 【Pages】:1905-1908

【Authors】: Muhammad El-Hindi ; Martin Heyden ; Carsten Binnig ; Ravi Ramamurthy ; Arvind Arasu ; Donald Kossmann

【Abstract】: In this demo we present BlockchainDB, which leverages blockchains as storage layer and introduces a database layer on top that extends blockchains by classical data management techniques (e.g., sharding). Further, BlockchainDB provides a standardized key/value-based query interface to facilitate the adoption of blockchains for data sharing use cases. With BlockchainDB we can thus not only improve the performance and scalability of blockchains for data sharing but also decrease the complexity for organizations intending to use blockchains for this use case.

【Keywords】: Information systems; Data management systems; Database management system engines; Parallel and distributed DBMSs; Key-value stores; Information systems applications

137. Cost-Effective, Workload-Adaptive Migration of Big Data Applications to the Cloud.

Paper Link】 【Pages】:1909-1912

【Authors】: Victor Giannakouris ; Alejandro Fernandez ; Alkis Simitsis ; Shivnath Babu

【Abstract】: More than 10,000 enterprises worldwide use the big data stack composed of multiple distributed systems. At Unravel, we build the next-generation APM platform for the big data stack, and we have worked with a representative sample of these enterprises that covers most industry verticals. This sample covers the spectrum of choices for deploying the big data stack across on-premises datacenters, private and public cloud deployments, and hybrid combinations of these. In this paper, we present a solution for assisting enterprises planning the migration of their big data stacks from on-premises deployments to the cloud. Our solution is goal driven and adapts to various migration scenarios. We present the system architecture we built and several cloud mapping options. We also describe a demonstration script that involves practical, real-world use-cases of the path to cloud adoption.

【Keywords】: Computing methodologies; Machine learning; Information systems; Data management systems; Data structures; Data layout; Data compression

138. MithraRanking: A System for Responsible Ranking Design.

Paper Link】 【Pages】:1913-1916

【Authors】: Yifan Guan ; Abolfazl Asudeh ; Pranav Mayuram ; H. V. Jagadish ; Julia Stoyanovich ; Gerome Miklau ; Gautam Das

【Abstract】: Items from a database are often ranked based on a combination of criteria. The weight given to each criterion in the combination can greatly affect the ranking produced. Often, a user may have a general sense of the relative importance of the different criteria, but beyond this may have the flexibility, within limits, to choose combinations that weigh these criteria differently, within an acceptable region. We demonstrate MithraRanking, a system that helps users choose criterion weights that lead to "better" rankings in terms of having desirable properties while remaining within the acceptable region. The goodness properties we focus on are stability and fairness.
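
A sketch of the underlying setting (not the system's algorithms): items are scored by a weight vector, and a ranking is stable when nearby weight choices inside the acceptable region produce the same order:

```python
def rank(items, weights):
    """items: dict name -> criterion vector; best-scoring name first."""
    score = lambda v: sum(w * x for w, x in zip(weights, v))
    return sorted(items, key=lambda name: -score(items[name]))

items = {"a": (0.9, 0.2), "b": (0.5, 0.7)}
# A small change in weights flips the order, i.e., (0.6, 0.4) sits close
# to an instability boundary (toy numbers).
print(rank(items, (0.6, 0.4)))  # ['a', 'b']
print(rank(items, (0.5, 0.5)))  # ['b', 'a']
```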

【Keywords】: Computing methodologies; Machine learning; Learning paradigms; Supervised learning; Ranking; Information systems; Information retrieval; Retrieval models and ranking; Learning to rank; Top-k retrieval in databases

139. MorphStore - In-Memory Query Processing based on Morphing Compressed Intermediates LIVE.

Paper Link】 【Pages】:1917-1920

【Authors】: Dirk Habich ; Patrick Damme ; Annett Ungethüm ; Johannes Pietrzyk ; Alexander Krause ; Juliana Hildebrandt ; Wolfgang Lehner

【Abstract】: In this demo, we present MorphStore, an in-memory column store with a novel compression-aware query processing concept. Basically, compression using lightweight integer compression algorithms already plays an important role in existing in-memory column stores, but mainly for base data. The continuous handling of compression from the base data to the intermediate results during query processing has already been discussed, but not investigated in detail since the computational effort for compression as well as decompression is often assumed to exceed the benefits of a reduced transfer cost between CPU and main memory. However, this argument increasingly loses its validity as we are going to show in our demo. Generally, our novel compression-aware query processing concept is characterized by the fact that we are able to speed up the query execution by morphing compressed intermediate results from one scheme to another scheme to dynamically adapt to the changing data characteristics during query processing. Our morphing decisions are made using a cost-based approach.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query operators; Query optimization; Query planning; DBMS engine architectures; Main memory engines

140. Fluid: A Blockchain based Framework for Crowdsourcing.

Paper Link】 【Pages】:1921-1924

【Authors】: Siyuan Han ; Zihuan Xu ; Yuxiang Zeng ; Lei Chen

【Abstract】: Recently, crowdsourcing has emerged as a new computing paradigm for solving problems that require intrinsic human abilities, such as image annotation. However, there are two limitations in existing crowdsourcing platforms, i.e., non-transparent incentive mechanisms and isolated worker profiles, which harm the interests of both requesters and workers. Meanwhile, Blockchain technology introduces a solution to build a transparent, immutable data model in a Byzantine environment. Moreover, Blockchain systems (e.g., Ethereum) can also support Turing-complete scripts called smart contracts. Thus, we are motivated to use Blockchain's transparent data model and smart contracts to address the two limitations. Based on the proposed solutions, we have designed a Blockchain-based framework that supports the foundations of general crowdsourcing platforms. In addition, our framework has the following novel features: (1) it provides transparent incentive mechanisms; (2) it supports trusted sharing of worker profiles across platforms.

【Keywords】: Information systems; World Wide Web; Web applications; Crowdsourcing; Incentive schemes

141. MigCast: Putting a Price Tag on Data Model Evolution in NoSQL Data Stores.

Paper Link】 【Pages】:1925-1928

【Authors】: Andrea Hillenbrand ; Maksym Levchenko ; Uta Störl ; Stefanie Scherzinger ; Meike Klettke

【Abstract】: We demonstrate MigCast, a tool-based advisor for exploring data migration strategies in the context of developing NoSQL-backed applications. Users of MigCast can consider their options for evolving their data model along with legacy data already persisted in the cloud-hosted production database. They can explore alternative actions, with the financial costs predicted for the chosen cloud provider. Thereby, they are better equipped to assess the potential consequences of imminent data migration decisions. To this end, MigCast maintains an internal cost model, taking into account characteristics of the data instance, expected workload, data model changes, and cloud provider pricing models. Hence, MigCast enables software project stakeholders to remain in control of operating costs and to make informed decisions when evolving their applications.

【Keywords】: Information systems; Data management systems; Middleware for databases

142. PgCuckoo: Laying Plan Eggs in PostgreSQL's Nest.

Paper Link】 【Pages】:1929-1932

【Authors】: Denis Hirn ; Torsten Grust

【Abstract】: We demonstrate how to use PostgreSQL's planner hook to open a side entrance through which we can pass plan trees for immediate execution. Since this reaches deep into PostgreSQL, we implement plan detail inference and decoration to ensure that externally crafted trees perfectly mimic regular plans. Plan trees may then (1) be generated by external code generators that want to use PostgreSQL as a reliable and efficient back-end for new (maybe even non-relational) languages, or (2) stem from experimental rewrites of SQL plans that PostgreSQL itself does not implement (yet). The demonstration provides a live account of what becomes possible once we let PostgreSQL hatch foreign plan eggs.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query planning

143. Demonstration of ModelarDB: Model-Based Management of Dimensional Time Series.

Paper Link】 【Pages】:1933-1936

【Authors】: Søren Kejser Jensen ; Torben Bach Pedersen ; Christian Thomsen

【Abstract】: Due to the large amounts of sensor data produced, it is infeasible to store all of the data points collected, and practitioners currently hide outliers by storing simple aggregates instead. As a remedy, we demonstrate ModelarDB, a model-based Time Series Management System (TSMS) for time series with dimensions and possibly gaps. In this demonstration, participants can ingest datasets from multiple domains and experience how ModelarDB provides fast ingestion and a high compression ratio by adaptively compressing time series using a set of models to accommodate changes in the structure of each time series over time. Models approximate time series within a user-defined error bound (possibly zero). Participants can also experience how the compression ratio can be improved by ingesting correlated time series in groups, created by ModelarDB from user hints. Participants provide these using primitives for describing correlation. Last, participants can execute SQL queries on the ingested datasets and see how the system optimizes queries directly on the models.
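
As a flavor of error-bounded, model-based compression, here is a sketch of one simple model type (a constant value per segment, in the spirit of PMC-style approximation; ModelarDB adaptively chooses among multiple model types, which this sketch does not show):

```python
def compress_constant(values, bound):
    """Greedily grow segments; each (start, end, const) reconstructs
    every value in values[start:end] to within +/- bound."""
    if not values:
        return []
    segments, start = [], 0
    lo = hi = values[0]
    for i, v in enumerate(values[1:], 1):
        nlo, nhi = min(lo, v), max(hi, v)
        if nhi - nlo <= 2 * bound:   # midrange still fits all values seen
            lo, hi = nlo, nhi
        else:                        # emit the segment, start a new one
            segments.append((start, i, (lo + hi) / 2))
            lo = hi = v
            start = i
    segments.append((start, len(values), (lo + hi) / 2))
    return segments

print(compress_constant([10.0, 10.2, 10.1, 15.0, 15.1], bound=0.25))
```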

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Cloud computing; Other architectures; Special purpose systems; Embedded and cyber-physical systems; Sensor networks; Computing methodologies; Parallel computing methodologies; Parallel algorithms; MapReduce algorithms; Information systems; Data management systems; Database design and models; Data model extensions; Data streams; Physical data models; Database management system engines; Database query processing; Query optimization; Database views; Online analytical processing engines; Stream management; Information integration; Data warehouses; Query languages; Relational database query languages; Structured Query Language; Information systems applications; Decision support systems; Data analytics; Online analytical processing; Networks; Network services; Cloud computing; Network types; Cyber-physical networks; Sensor networks

144. Estimating Cardinalities with Deep Sketches.

Paper Link】 【Pages】:1937-1940

【Authors】: Andreas Kipf ; Dimitri Vorona ; Jonas Müller ; Thomas Kipf ; Bernhard Radke ; Viktor Leis ; Peter A. Boncz ; Thomas Neumann ; Alfons Kemper

【Abstract】: We introduce Deep Sketches, which are compact models of databases that allow us to estimate the result sizes of SQL queries. Deep Sketches are powered by a new deep learning approach to cardinality estimation that can capture correlations between columns, even across tables. Our demonstration allows users to define such sketches on the TPC-H and IMDb datasets, monitor the training process, and run ad-hoc queries against trained sketches. We also estimate query cardinalities with HyPer and PostgreSQL to visualize the gains over traditional cardinality estimators.
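
To make the regression framing concrete, here is a minimal stand-in: featurize queries as vectors and fit their log-cardinality. Deep Sketches train a deep, set-based network on real query/result pairs; the ordinary least squares on synthetic data below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy featurization: pretend each row encodes table flags and predicate bounds.
X = rng.random((500, 6))
true_w = np.array([2.0, -1.0, 0.5, 3.0, 0.0, 1.5])
y = X @ true_w + rng.normal(0, 0.1, 500)      # log10(cardinality) labels

w, *_ = np.linalg.lstsq(X, y, rcond=None)     # "training" step

q = rng.random(6)                             # features of an unseen query
print(f"estimated cardinality ~= {10 ** (q @ w):.0f} rows")
```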

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Query optimization

145. Coconut Palm: Static and Streaming Data Series Exploration Now in your Palm.

Paper Link】 【Pages】:1941-1944

【Authors】: Haridimos Kondylakis ; Niv Dayan ; Kostas Zoumpatianos ; Themis Palpanas

【Abstract】: Many modern applications produce massive streams of data series and maintain them in indexes to be able to explore them through nearest neighbor search. Existing data series indexes, however, are expensive to operate as they issue many random I/Os to storage. To address this problem, we recently proposed Coconut, a new infrastructure that organizes data series based on a new sortable format. In this way, Coconut is able to leverage, for the first time, state-of-the-art indexing techniques that rely on sorting to build, maintain, and query data series indexes using fast sequential I/Os. In this demonstration, we present Coconut Palm, a new exploration tool that allows users to interactively combine different indexing techniques from within the Coconut infrastructure and thereby seamlessly explore data series from across various scientific domains. We highlight the rich indexing design choices that Coconut opens up, and we present a new recommender tool that allows users to intelligently navigate them for both static and streaming data exploration scenarios.
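
The sortable-format idea can be sketched as follows: summarize each series as a fixed-length symbolic word (SAX-style) so that lexicographic sorting clusters similar series, letting sort-based index structures operate with sequential I/O. The segment count and alphabet below are illustrative assumptions, not Coconut's actual encoding.

```python
def sortable_key(series, segments=4, alphabet="abcdefgh"):
    """Quantize per-segment means into letters, yielding a sortable word."""
    n = len(series)
    lo, hi = min(series), max(series)
    word = ""
    for s in range(segments):
        chunk = series[s * n // segments:(s + 1) * n // segments]
        mean = sum(chunk) / len(chunk)
        level = int((mean - lo) / (hi - lo + 1e-9) * len(alphabet))
        word += alphabet[min(level, len(alphabet) - 1)]
    return word

index = sorted((sortable_key(s), i) for i, s in enumerate([
    [1, 2, 1, 2, 8, 9, 8, 9],
    [9, 8, 9, 8, 2, 1, 2, 1],
    [1, 1, 2, 2, 8, 8, 9, 9],
]))
print(index)   # series 0 and 2 share a key prefix and therefore sort adjacently
```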

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing

146. NeMeSys - A Showcase of Data Oriented Near Memory Graph Processing.

Paper Link】 【Pages】:1945-1948

【Authors】: Alexander Krause ; Thomas Kissinger ; Dirk Habich ; Wolfgang Lehner

【Abstract】: NeMeSys is a NUMA-aware graph pattern processing engine that uses the Near Memory Processing paradigm to allow for high scalability. With modern server systems incorporating an increasing amount of main memory, we can store graphs and compute analytical graph algorithms like graph pattern matching completely in-memory. Our system blends state-of-the-art approaches from the transactional database world with graph processing principles. We demonstrate that graph pattern processing - both standalone queries and whole workloads - can be controlled by leveraging different partitioning strategies and applying Bloom-filter-based messaging optimization, and that, given performance constraints, the system can save energy by applying frequency scaling of CPU cores.
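
A minimal sketch of the Bloom-filter messaging optimization mentioned above: before sending a candidate to a remote partition, test it against that partition's Bloom filter and drop messages that cannot match. Filter size and hashing below are illustrative assumptions, not NeMeSys internals.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _hashes(self, item):
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.m

    def add(self, item):
        for pos in self._hashes(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all(self.bits >> pos & 1 for pos in self._hashes(item))

remote = BloomFilter()
for vertex in [7, 42, 99]:          # vertices stored on the remote partition
    remote.add(vertex)

candidates = [5, 42, 17]
to_send = [v for v in candidates if remote.might_contain(v)]
print(to_send)   # 42 survives; non-members are pruned locally (up to false positives)
```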

【Keywords】: Information systems; Data management systems; Data structures; Data access methods; Data scans; Point lookups; Database design and models; Graph-based database models; Database management system engines; Main memory engines; Parallel and distributed DBMSs; Relational parallel and distributed DBMSs

147. Ratel: Interactive Analytics for Large Scale Trajectories.

Paper Link】 【Pages】:1949-1952

【Authors】: Haoda Li ; Guoliang Li ; Jiayang Liu ; Haitao Yuan ; Haiquan Wang

【Abstract】: Trajectory data analytics plays an important role in many applications, such as transportation optimization, urban planning, and taxi scheduling. However, trajectory analytics faces a major challenge: query processing on big datasets is prohibitively slow. In this paper, we demonstrate Ratel, a Spark-based distributed in-memory framework for analyzing large-scale trajectories. Ratel groups trajectories into partitions by considering data locality and load balance. We build R-tree-based global indexes to prune partitions during trajectory search and join. Within each partition, Ratel uses a filter-refinement method to efficiently find similar trajectories. We show three kinds of scenarios - bus station planning, route recommendation, and transportation analytics. Demo attendees can interact with a web UI, pose different queries on the dataset, and navigate the query results.
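
The filter-refinement step can be sketched as follows: a cheap bounding-box lower bound prunes most candidates, and only the survivors pay for an exact similarity computation (discrete Fréchet distance here). The distance function, names, and thresholds are illustrative assumptions, not Ratel's implementation.

```python
from math import dist, inf
from functools import lru_cache

def bbox(traj):
    xs, ys = zip(*traj)
    return min(xs), min(ys), max(xs), max(ys)

def bbox_gap(a, b):
    """Lower bound on any point distance between two boxes (0 if they overlap)."""
    dx = max(b[0] - a[2], a[0] - b[2], 0)
    dy = max(b[1] - a[3], a[1] - b[3], 0)
    return (dx * dx + dy * dy) ** 0.5

def frechet(p, q):
    """Exact discrete Frechet distance (the expensive refinement step)."""
    @lru_cache(maxsize=None)
    def c(i, j):
        d = dist(p[i], q[j])
        if i == 0 and j == 0:
            return d
        prev = min(c(i - 1, j) if i else inf,
                   c(i, j - 1) if j else inf,
                   c(i - 1, j - 1) if i and j else inf)
        return max(prev, d)
    return c(len(p) - 1, len(q) - 1)

query = [(0, 0), (1, 1), (2, 2)]
data = {"t1": [(0.1, 0), (1, 1.2), (2, 1.9)], "t2": [(50, 50), (51, 51)]}
qb, eps = bbox(query), 1.0
hits = [name for name, t in data.items()
        if bbox_gap(qb, bbox(t)) <= eps and frechet(query, t) <= eps]
print(hits)   # only t1 ever reaches the exact refinement step
```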

【Keywords】: Computing methodologies; Distributed computing methodologies; Distributed algorithms

148. NEURON: Query Execution Plan Meets Natural Language Processing For Augmenting DB Education.

Paper Link】 【Pages】:1953-1956

【Authors】: Siyuan Liu ; Sourav S. Bhowmick ; Wanlu Zhang ; Shu Wang ; Wanyi Huang ; Shafiq R. Joty

【Abstract】: A core component of a database systems course at the undergraduate level is the design and implementation of the query optimizer in an RDBMS. The query optimization process produces a query execution plan (QEP), which represents an execution strategy for an SQL query. Unfortunately, in practice, it is often difficult for a student to comprehend a query execution strategy by perusing its QEP, hindering her learning process. In this demonstration, we present a novel system called NEURON that facilitates natural language interaction with QEPs to enhance their understanding. NEURON accepts an SQL query (which may include joins, aggregation, and nesting, among other things) as input, executes it, and generates a simplified natural language description (in both text and voice form) of the execution strategy deployed by the underlying RDBMS. Furthermore, it facilitates understanding of various features related to a QEP through a natural language question answering (NLQA) framework. We advocate that such a tool, the world's first of its kind, can greatly enhance students' learning of the query optimization topic.
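
The core translation step can be sketched as a bottom-up walk over a QEP that emits one English clause per operator. The plan dictionary below is a hand-written stand-in shaped like PostgreSQL's EXPLAIN (FORMAT JSON) output; NEURON's actual generation pipeline is far richer.

```python
def narrate(node):
    """Visit children first, so the narration follows execution order."""
    steps = []
    for child in node.get("Plans", []):
        steps += narrate(child)
    op = node["Node Type"]
    if op == "Seq Scan":
        steps.append(f"scan table '{node['Relation Name']}' row by row")
    elif op == "Hash Join":
        steps.append(f"join the two inputs on {node['Hash Cond']}")
    else:
        steps.append(f"apply operator '{op}'")
    return steps

plan = {"Node Type": "Hash Join", "Hash Cond": "(a.id = b.id)",
        "Plans": [{"Node Type": "Seq Scan", "Relation Name": "a"},
                  {"Node Type": "Seq Scan", "Relation Name": "b"}]}
print("The database will " + ", then ".join(narrate(plan)) + ".")
```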

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing

149. CrowdGame: A Game-Based Crowdsourcing System for Cost-Effective Data Labeling.

Paper Link】 【Pages】:1957-1960

【Authors】: Tongyu Liu ; Jingru Yang ; Ju Fan ; Zhewei Wei ; Guoliang Li ; Xiaoyong Du

【Abstract】: Large-scale data labeling has become a major bottleneck for many applications, such as machine learning and data integration. This paper presents CrowdGame, a crowdsourcing system that harnesses the crowd to gather data labels in a cost-effective way. CrowdGame focuses on generating high-quality labeling rules to largely reduce the labeling cost while preserving quality. It first generates candidate rules, and then devises a game-based crowdsourcing approach to select rules with high coverage and accuracy. CrowdGame applies the generated rules for effective data labeling. We have implemented CrowdGame and provided a user-friendly interface for users to deploy their labeling applications. We will demonstrate CrowdGame in two representative data labeling scenarios, entity matching and relation extraction.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Rule learning; Information systems; World Wide Web; Web applications; Crowdsourcing; Theory of computation; Theory and algorithms for application domains; Database theory; Data integration

150. RATest: Explaining Wrong Relational Queries Using Small Examples.

Paper Link】 【Pages】:1961-1964

【Authors】: Zhengjie Miao ; Sudeepa Roy ; Jun Yang

【Abstract】: We present a system called RATest, designed to help debug relational queries against reference queries and test database instances. In many applications, e.g., classroom learning and regression testing, we test the correctness of a user query Q by evaluating it over a test database instance D and comparing its result with that of evaluating a reference (correct) query $Q_0$ over D. If $Q(D)$ differs from $Q_0(D)$, the user knows Q is incorrect. However, D can be large (often by design), which makes debugging Q difficult. The key idea behind RATest is to show the user a much smaller database instance $D' \subseteq D$, which we call a counterexample, such that $Q(D') \neq Q_0(D')$. RATest builds on data provenance and constraint solving, and employs a suite of techniques to support, at interactive speed, complex queries involving differences and group-by aggregation. We demonstrate an application of RATest in learning: it has been used successfully by a large undergraduate database course in a university to help students with a relational algebra assignment.
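
The goal can be illustrated with a brute-force shrinking loop: keep dropping tuples while the reference query and the wrong query still disagree. RATest itself reaches small counterexamples via provenance and constraint solving; the sqlite3 sketch below only demonstrates the end result.

```python
import sqlite3

q0 = "SELECT name FROM emp WHERE salary > 100"    # reference (correct) query
q  = "SELECT name FROM emp WHERE salary >= 100"   # buggy student query

rows = [("ann", 100), ("bob", 150), ("carol", 90), ("dan", 300)]

def differ(subset):
    """Do the two queries disagree on this instance?"""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE emp(name TEXT, salary INT)")
    con.executemany("INSERT INTO emp VALUES (?, ?)", subset)
    return set(con.execute(q0)) != set(con.execute(q))

assert differ(rows)                  # the full instance already exposes the bug
for r in list(rows):
    if differ([x for x in rows if x != r]):
        rows.remove(r)               # this tuple is not needed to expose the bug
print(rows)                          # a one-tuple counterexample: [('ann', 100)]
```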

【Keywords】: Information systems; Data management systems; Database administration; Database utilities and tools

151. NAVIGATE: Explainable Visual Graph Exploration by Examples.

Paper Link】 【Pages】:1965-1968

【Authors】: Mohammad Hossein Namaki ; Qi Song ; Yinghui Wu

【Abstract】: We demonstrate NAVIGATE, an explaiNAble query engine for VIsual GrAph exploraTion by Examples. NAVIGATE interleaves query rewriting and query answering to help users (1) search graphs without writing complex queries, and (2) understand answers by providing intuitive explanations. Users can visually construct queries and specify missing or unwanted example entities to guide the exploration towards desired answers. NAVIGATE can rewrite queries with answers close to the examples, by minimally altering their topological and semantic constraints. Another unique feature is its ability to explain query results by tracing the query manipulation operators that are responsible for transforming the original answers to desirable ones. In addition, NAVIGATE optimizes system response time by referring to dynamically cached star views to reduce both query evaluation and rewriting cost at run time. We also demonstrate its ease-of-use, efficiency, and explainable exploration in applications such as recommendation and knowledge base search.

【Keywords】: Information systems; Data management systems; Database design and models; Data model extensions; Data provenance; Database management system engines; Database query processing; Query operators; Query optimization; Database views; Mathematics of computing; Discrete mathematics; Graph theory; Approximation algorithms

152. Pivotal Greenplum© for Kubernetes: Demonstration of Managing Greenplum Database on Kubernetes.

Paper Link】 【Pages】:1969-1972

【Authors】: Jemish Patel ; Goutam Tadi ; Oz Basarir ; Lawrence Hamel ; David Sharp ; Fei Yang ; Xin Zhang

【Abstract】: Greenplum Database (GPDB) has many features designed to enable data scientists. Before a data scientist can use GPDB, a database administrator (DBA) must provision a cluster and install any required data science packages. Provisioning a GPDB cluster on bare metal requires a lengthy setup process. Scaling, recovering, and securing the cluster post-deployment are also complex. Greenplum for Kubernetes (GP4K) abstracts away these complexities, simplifying and automating the process for users. In this demonstration, we introduce GP4K with an opinionated deployment and a declarative manifest. We provide a brief overview of GP4K's architecture and discuss its implementation. We also demonstrate the full life cycle of managing a cluster from birth to retirement, including scale-up and self-healing, all with minimal DBA input.

【Keywords】: Information systems; Data management systems; Database administration; Autonomous database administration; Database management system engines; DBMS engine architectures; Distributed database transactions; Distributed database recovery; Information storage systems; Storage architectures; Cloud based storage; Software and its engineering; Software organization and properties; Extra-functional properties; Interoperability; Software reliability; Software usability; Software system structures; Distributed systems organizing principles; Software system models; Massively parallel systems

153. NEWS: News Event Walker and Summarizer.

Paper Link】 【Pages】:1973-1976

【Authors】: Radityo Eko Prasojo ; Mouna Kacimi ; Werner Nutt

【Abstract】: Most news summarization techniques are static and thus do not satisfy users' needs for summaries with specific structures or details. Meanwhile, existing dynamic techniques such as query-based summarization fail to handle content-independent queries that target the type of summary information, such as the time, location, reasons, and consequences of reported events. The NEWS system supports multi-granular summarization along two dimensions: the level of detail and the type of information. The system employs fine-grained information extraction to extract facts and their facets with type tagging. The extracted information is then modeled as a graph used to create summaries. The system incrementally expands summaries based on the nodes visited by users, folding related events into the search space.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing; Information systems; Information retrieval; Document representation; Document structure; Retrieval tasks and goals; Summarization

154. ANMAT: Automatic Knowledge Discovery and Error Detection through Pattern Functional Dependencies.

Paper Link】 【Pages】:1977-1980

【Authors】: Abdulhakim Ali Qahtan ; Nan Tang ; Mourad Ouzzani ; Yang Cao ; Michael Stonebraker

【Abstract】: Knowledge discovery is critical to successful data analytics. We propose a new type of meta-knowledge, namely pattern functional dependencies (PFDs), which combine patterns (or regex-like rules) and integrity constraints (ICs) to model the dependencies (or meta-knowledge) between partial values (or patterns) across different attributes in a table. PFDs go beyond the classical functional dependencies and their extensions. For instance, in an employee table with IDs such as "F-9-107", the prefix "F" determines the finance department. Moreover, a key application of PFDs is to identify erroneous data, i.e., tuples that violate some PFDs. In this demonstration, attendees will experience the following features: (i) PFD discovery -- automatically discovering PFDs from (dirty) data in different domains; and (ii) error detection with PFDs -- we will show errors that are detected by PFDs but cannot be captured by existing approaches.
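
A minimal sketch of checking one PFD: a regex capture over one attribute (the ID prefix) is expected to determine another attribute (the department), and tuples that break the dominant mapping are flagged. The rule, data, and majority heuristic below are illustrative assumptions.

```python
import re
from collections import Counter, defaultdict

rows = [("F-9-107", "Finance"), ("F-2-001", "Finance"),
        ("E-1-203", "Engineering"), ("F-5-044", "Sales")]   # last tuple is dirty

pattern = re.compile(r"^([A-Z])-\d-\d{3}$")                 # capture the prefix

groups = defaultdict(Counter)
for emp_id, dept in rows:
    m = pattern.match(emp_id)
    if m:
        groups[m.group(1)][dept] += 1

for prefix, depts in groups.items():
    expected, _ = depts.most_common(1)[0]                   # dominant mapping
    for emp_id, dept in rows:
        m = pattern.match(emp_id)
        if m and m.group(1) == prefix and dept != expected:
            print(f"violation: {emp_id!r} has {dept!r}, expected {expected!r}")
```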

【Keywords】: Information systems; Data management systems; Information integration; Data cleaning

155. DuckDB: an Embeddable Analytical Database.

Paper Link】 【Pages】:1981-1984

【Authors】: Mark Raasveldt ; Hannes Mühleisen

【Abstract】: The immense popularity of SQLite shows that there is a need for unobtrusive in-process data management solutions. However, there is no such system yet geared towards analytical workloads. We demonstrate DuckDB, a novel data management system designed to execute analytical SQL queries while embedded in another process. In our demonstration, we pit DuckDB against other data management solutions to showcase its performance in the embedded analytics scenario. DuckDB is available as Open Source software under a permissive license.
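
A minimal embedded session with DuckDB's Python package (pip install duckdb) shows the in-process model: no server to set up, plain SQL against in-memory tables. The table and query are illustrative.

```python
import duckdb

con = duckdb.connect()                      # in-process, no server to manage
con.execute("CREATE TABLE sales(region TEXT, amount DOUBLE)")
con.execute("INSERT INTO sales VALUES ('EU', 10.0), ('EU', 5.5), ('US', 7.0)")

result = con.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()
print(result)                               # [('EU', 15.5), ('US', 7.0)]
```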

【Keywords】: Information systems; Data management systems; Database management system engines

156. ChronosDB in Action: Manage, Process, and Visualize Big Geospatial Arrays in the Cloud.

Paper Link】 【Pages】:1985-1988

【Authors】: Ramon Antonio Rodriges Zalipynis

【Abstract】: Immense volumes of geospatial arrays are generated daily; examples include satellite imagery, numerical simulations, and the resulting avalanche of derivative data. Array DBMSs are among the most prominent tools for working with large geospatial arrays, which usually natively come as raster files. ChronosDB (http://chronosdb.gis.land/) is a novel distributed, file-based geospatial array DBMS: it operates directly on raster files, delegates array processing to existing, elaborate command-line tools, and outperforms SciDB by up to 75x on average. This demonstration will showcase three new components of ChronosDB that enable users to interact with the system and appreciate its benefits: (i) a web GUI (edit and submit queries and get the output), (ii) an execution plan explainer (investigate the generated DAG), and (iii) a dataset visualizer (display ChronosDB arrays on an interactive web map).

【Keywords】: Information systems; Data management systems; Database management system engines; Parallel and distributed DBMSs; Information systems applications; Spatial-temporal systems; Geographic information systems

157. Ursprung: Provenance for Large-Scale Analytics Environments.

Paper Link】 【Pages】:1989-1992

【Authors】: Lukas Rupprecht ; James C. Davis ; Constantine Arnold ; Alexander Lubbock ; Darren Tyson ; Deepavali Bhagwat

【Abstract】: Modern analytics has produced wonders, but reproducing and verifying these wonders is difficult. Data provenance helps to solve this problem by collecting information on how data is created and accessed. Although provenance collection techniques have been used successfully on a smaller scale, tracking provenance in large-scale analytics environments is challenging due to the scale of the provenance generated and the heterogeneity of the domains involved. Without provenance, analysts struggle to keep track of and reproduce their analyses. We demonstrate Ursprung, a provenance collection system specifically targeted at such environments. Ursprung transparently collects the minimal set of system-level provenance required to track the relationships between data and processes. To collect domain-specific provenance, Ursprung enables users to specify capture rules to curate application-specific logs, intermediate results, etc. To reduce storage overhead and accelerate queries, it uses event hierarchies to synthesize raw provenance into compact summaries.

【Keywords】: Information systems; Data management systems; Database design and models; Data model extensions; Data provenance; Information systems applications; Decision support systems; Data analytics

158. Unit Testing Data with Deequ.

Paper Link】 【Pages】:1993-1996

【Authors】: Sebastian Schelter ; Felix Bießmann ; Dustin Lange ; Tammo Rukat ; Philipp Schmidt ; Stephan Seufert ; Pierre Brunelle ; Andrey Taptunov

【Abstract】: Modern companies and institutions rely on data to guide every single decision. Missing or incorrect information seriously compromises any decision process. We demonstrate "Deequ", an Apache Spark-based library for automating the verification of data quality at scale. This library provides a declarative API, which combines common quality constraints with user-defined validation code, and thereby enables "unit tests for data". Deequ is available as open source, meets the requirements of production use cases at Amazon, and scales to datasets with billions of records if the constraints to evaluate are chosen carefully. Our demonstration walks attendees through a fictitious business use case of validating daily product reviews from a public dataset, and is executed in a proprietary interactive notebook environment. We show attendees how to define data unit tests from automatically suggested constraints and how to create customized tests. Additionally, we demonstrate how to apply Deequ to validate incrementally growing datasets, and give examples of how to configure anomaly detection algorithms on time series of data quality metrics to further automate the data validation.
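
The "unit tests for data" idea can be sketched library-free: declare named constraints, evaluate them over a batch of records, and report failures. Deequ's real API is a declarative, Spark-based library; the names and checks below are illustrative only.

```python
reviews = [
    {"review_id": "r1", "stars": 5, "text": "great"},
    {"review_id": "r2", "stars": 9, "text": ""},        # violates the range check
]

# Declarative-style checks: a name mapped to a per-record predicate.
checks = {
    "review_id is complete": lambda r: r["review_id"] is not None,
    "stars in [1, 5]":       lambda r: 1 <= r["stars"] <= 5,
}

for name, predicate in checks.items():
    failures = [r for r in reviews if not predicate(r)]
    status = "OK " if not failures else "FAIL"
    print(f"{status} {name} ({len(failures)} bad records)")
```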

【Keywords】: Information systems; Data management systems; Database administration; Database performance evaluation; Database management system engines; Database query processing; Query planning

159. Natural Language Querying of Complex Business Intelligence Queries.

Paper Link】 【Pages】:1997-2000

【Authors】: Jaydeep Sen ; Fatma Ozcan ; Abdul Quamar ; Greg Stager ; Ashish R. Mittal ; Manasa Jammi ; Chuan Lei ; Diptikalyan Saha ; Karthik Sankaranarayanan

【Abstract】: A Natural Language Interface to Database (NLIDB) eliminates the need for an end user to use complex query languages like SQL by automatically translating input natural language statements to SQL. Although NLIDB systems have seen a rapid growth of interest recently, the current state-of-the-art systems can at best handle point queries that retrieve certain column values satisfying some filters, or aggregation queries involving basic SQL aggregation functions. In this demo, we showcase our NLIDB system with extended capabilities for business applications that require complex nested SQL queries, without prior training or human-in-the-loop feedback. In particular, our system uses novel algorithms that combine linguistic analysis with deep domain reasoning to solve the core challenges in handling nested queries. To demonstrate these capabilities, we propose a new benchmark dataset containing realistic business intelligence queries, conforming to an ontology derived from the FIBO and FRO financial ontologies. In this demo, we will showcase a wide range of complex business intelligence queries against our benchmark dataset, with increasing levels of complexity. Users will be able to examine the generated SQL queries and will also be provided with an English description of the interpretation.

【Keywords】: Human-centered computing; Human computer interaction (HCI); Interaction paradigms; Natural language interfaces

160. Demonstration of SpeakQL: Speech-driven Multimodal Querying of Structured Data.

Paper Link】 【Pages】:2001-2004

【Authors】: Vraj Shah ; Side Li ; Kevin Yang ; Arun Kumar ; Lawrence Saul

【Abstract】: In this demonstration, we present SpeakQL, a speech-driven query system and interface for structured data. SpeakQL supports a tractable and practically useful subset of regular SQL, allowing users to query in any domain with unbounded vocabulary with the help of speech/touch-based user-in-the-loop mechanisms for correction. When querying in such domains, automatic speech recognition introduces countless forms of errors in transcriptions, presenting us with a technical challenge. We characterize such errors and leverage our observations, along with SQL's unambiguous context-free grammar, to first correct the query structure. We then exploit a phonetic representation of the queried database to identify the correct literals, hence delivering the corrected transcribed query. In this demo, we show that SpeakQL helps users significantly reduce the time and effort of specifying SQL queries. In addition, we show that SpeakQL, unlike natural language interfaces and conversational assistants, allows users to query over any arbitrary database schema. We allow the audience to explore SpeakQL using an easy-to-use web-based interface to compose SQL queries.
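
The literal-correction step described above can be sketched as matching a misrecognized token against the actual values stored in the queried column. difflib's string similarity stands in here for SpeakQL's phonetic representation; the names and data are illustrative.

```python
import difflib

db_values = ["Smith", "Schmidt", "Smyth", "Jones"]     # values in the column
heard = "smit"                                         # ASR transcription

# Pick the stored value most similar to what the recognizer heard.
best = difflib.get_close_matches(heard.title(), db_values, n=1, cutoff=0.5)
print(f"SELECT * FROM emp WHERE name = '{best[0]}'")   # ... name = 'Smith'
```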

【Keywords】: Information systems; Data management systems; Database administration; Database utilities and tools; Query languages; Relational database query languages; Structured Query Language

161. C2Metadata: Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation.

Paper Link】 【Pages】:2005-2008

【Authors】: Jie Song ; George Alter ; H. V. Jagadish

【Abstract】: Datasets are often derived by manipulating raw data with statistical software packages. The derivation of a dataset must be recorded in terms of both the raw input and the manipulations applied to it. Statistics packages typically provide limited help in documenting provenance for the resulting derived data. At best, the operations performed by the statistical package are described in a script, and disparate representations make these scripts hard for users to understand. To address these challenges, we created Continuous Capture of Metadata (C2Metadata), a system that captures data transformations in scripts for statistical packages and represents them as metadata in a standard format that is easy to understand. We do so by devising a Structured Data Transformation Algebra (SDTA), which uses a small set of algebraic operators to express a large fraction of the data manipulation performed in practice. We then implement SDTA, inspired by relational algebra, in a data transformation specification language we call SDTL. In this demonstration, we showcase C2Metadata's capture of data transformations from a pool of sample transformation scripts in at least two languages: SPSS and Stata (SAS and R are under development), for social science data in a large academic repository. We will allow the audience to explore C2Metadata using a web-based interface, visualize the intermediate steps, and trace the provenance and changes of data at different levels for a better understanding of the process.
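
The normalization idea can be sketched as mapping lines from different statistical languages onto one small operator vocabulary. The two command shapes and the operator names below are illustrative assumptions, not SDTA/SDTL's actual definitions.

```python
import re

def to_sdta(line):
    """Translate one script line into a hypothetical algebra operator."""
    m = re.match(r"COMPUTE (\w+)\s*=\s*(.+)", line)     # SPSS-style derivation
    if m:
        return {"op": "compute", "target": m.group(1), "expr": m.group(2)}
    m = re.match(r"drop if (.+)", line)                 # Stata-style filtering
    if m:
        return {"op": "filter", "predicate": f"NOT ({m.group(1)})"}
    return {"op": "unknown", "source": line}

script = ["COMPUTE bmi = weight / (height * height)", "drop if age < 18"]
for line in script:
    print(to_sdta(line))
```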

【Keywords】: Information systems; Data management systems; Database design and models; Data model extensions; Data provenance; Information integration; Extraction, transformation and loading

162. [Demo] Low-latency Spark Queries on Updatable Data.

Paper Link】 【Pages】:2009-2012

【Authors】: Alexandru Uta ; Bogdan Ghit ; Ankur Dave ; Peter A. Boncz

【Abstract】: As data science gets deployed more and more into operational applications, it becomes important for data science frameworks to be able to perform computations in interactive, sub-second time. Indexing and caching are two key techniques that can make interactive query processing on large datasets possible. In this demo, we show the design, implementation and performance of a new indexing abstraction in Apache Spark, called the Indexed DataFrame. This is a cached DataFrame that incorporates an index to support fast lookup and join operations, and supports updates with multi-version concurrency. We demonstrate the Indexed Dataframe on a social network dataset using microbenchmarks and real-world graph processing queries, in datasets that are continuously growing.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Cloud computing; Information systems; Data management systems; Data structures; Data access methods; Database management system engines; Database query processing; Software and its engineering; Software organization and properties; Software system structures; Software architectures; Data flow architectures; Theory of computation; Theory and algorithms for application domains; Database theory; Data structures and algorithms for data management

163. SVQ: Streaming Video Queries.

Paper Link】 【Pages】:2013-2016

【Authors】: Ioannis Xarchakos ; Nick Koudas

【Abstract】: Recent advances in video processing utilizing deep learning primitives have achieved breakthroughs in fundamental problems in video analysis, such as frame classification and object detection, enabling an array of new applications.

【Keywords】: Computing methodologies; Artificial intelligence; Computer vision; Image and video acquisition; Information systems; Data management systems; Database management system engines; Stream management; Information retrieval; Specialized information retrieval; Multimedia and multimodal retrieval; Video search

164. FindYourFavorite: An Interactive System for Finding the User's Favorite Tuple in the Database.

Paper Link】 【Pages】:2017-2020

【Authors】: Min Xie ; Tianwen Chen ; Raymond Chi-Wing Wong

【Abstract】: When faced with a database containing millions of tuples, an end user might only be interested in finding his/her favorite tuple in the database. In this paper, we study how to help an end user find such a favorite tuple with a few user interactions. In each interaction, a user is presented with a small number of tuples (which can be artificial tuples outside the database or true tuples inside the database) and is asked to indicate the tuple s/he favors the most among them. Unlike previous work, which displays artificial tuples to users during the interaction and requires heavy user interaction, we achieve a stronger result. Specifically, we use a concept called the utility hyperplane to model the user preference and an effective pruning strategy to locate the favorite tuple for a user in the whole database. Based on these techniques, we developed an interactive system, called FindYourFavorite, and demonstrate that the system can identify the favorite tuple for a user with a few user interactions while always displaying true tuples from the database.
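
The utility-hyperplane idea can be sketched with Monte-Carlo sampling: each answer "a over b" keeps only utility vectors u with u·(a-b) >= 0, and a tuple can be pruned once no surviving u ranks it first. The sampling below stands in for the paper's exact geometric pruning; the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
tuples = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.95]])   # two-attribute scores

U = rng.random((5000, 2))
U /= U.sum(axis=1, keepdims=True)          # candidate utility vectors, weights sum to 1

answers = [(0, 2)]                         # user preferred tuple 0 over tuple 2
for winner, loser in answers:
    U = U[U @ (tuples[winner] - tuples[loser]) >= 0]   # keep consistent utilities

best = np.argmax(U @ tuples.T, axis=1)     # favorite tuple under each surviving u
alive = set(map(int, best))
print(f"still viable: {sorted(alive)}; pruned: {sorted(set(range(3)) - alive)}")
```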

【Keywords】: Information systems; Information systems applications; Decision support systems; Data analytics

165. PIClean: A Probabilistic and Interactive Data Cleaning System.

Paper Link】 【Pages】:2021-2024

【Authors】: Zhuoran Yu ; Xu Chu

【Abstract】: With the dramatically increasing interest in data analysis, ensuring data quality has become one of the most important topics in data science. Data cleaning, the process of ensuring data quality, is composed of two stages: error detection and error repair. Despite decades of research in data cleaning, existing cleaning systems still have limitations in terms of usability and error coverage. We propose PIClean, a probabilistic and interactive data cleaning system that aims at addressing these limitations. PIClean produces probabilistic errors and probabilistic fixes using low-rank approximation, which implicitly discovers and uses relationships between the columns of a dataset for cleaning. The probabilistic errors and fixes are confirmed or rejected by users, and this user feedback is continually incorporated by PIClean to produce more accurate and higher-coverage cleaning results.
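
PIClean's use of low-rank structure can be illustrated by exploiting the same inter-column redundancy directly: predict one column from the others and flag cells with outsized residuals as probable errors. The least-squares stand-in and synthetic data below are assumptions for illustration, not PIClean's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.random((50, 2))
W = np.array([[1.0, 2.0, 3.0], [0.5, 0.0, 1.0]])
data = base @ W                      # a rank-2 table: columns are correlated
data[7, 1] += 6.0                    # inject one error

X = data[:, [0, 2]]                  # predict column 1 from columns 0 and 2
coef, *_ = np.linalg.lstsq(X, data[:, 1], rcond=None)
residual = np.abs(data[:, 1] - X @ coef)

row = int(residual.argmax())
print(f"probable error in column 1 at row {row} "
      f"(residual {residual[row]:.2f}, {residual[row] / residual.mean():.0f}x the mean)")
```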

【Keywords】: Information systems; Data management systems; Information integration; Data cleaning

Tutorials 7

166. Towards Democratizing Relational Data Visualization.

Paper Link】 【Pages】:2025-2030

【Authors】: Nan Tang ; Eugene Wu ; Guoliang Li

【Abstract】: The problem of data visualization is to transform data into a visual context such that people can easily understand its significance. Data visualization has become especially important, because it is the de facto standard for modern business intelligence and successful data science. This tutorial will cover three specific topics: visualization languages define how users can interact with various visualization systems; efficient data visualization processes the data and produces visualizations based on well-specified user queries; and smart data visualization recommends visualizations based on underspecified user queries. In this tutorial, we will walk through this prior art, paying particular attention to problems that may attract interest from the database community.

【Keywords】: Human-centered computing; Visualization; Visualization techniques

167. Exploring the Data Wilderness through Examples.

Paper Link】 【Pages】:2031-2035

【Authors】: Davide Mottin ; Matteo Lissandrini ; Yannis Velegrakis ; Themis Palpanas

【Abstract】: Exploration is one of the primordial ways to accrue knowledge about the world and its nature. As we accumulate, mostly automatically, data at unprecedented volumes and speed, our datasets have become complex and hard to understand. In this context, exploratory search provides a handy tool for progressively gathering the necessary knowledge: one starts from a tentative query that hopefully leads to answers that are at least partially relevant and that provide cues about the next queries to issue. Recently, we have witnessed a rediscovery of so-called example-based methods, in which the user or the analyst circumvents query languages by using examples as input. This shift in semantics has led to a number of methods that receive as query a set of example members of the answer set. The search system then infers the entire answer set based on the given examples and any additional information provided by the underlying database. In this tutorial, we present an excursus over the main example-based methods for exploratory analysis, show techniques tailored to different data types, and provide a unifying view of the problem. We show how different data types require different techniques, and present algorithms that are specifically designed for relational, textual, and graph data.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Information retrieval; Information retrieval query processing; Query intent; Mathematics of computing; Probability and statistics; Statistical paradigms; Exploratory data analysis

168. Database and Distributed Computing Foundations of Blockchains.

Paper Link】 【Pages】:2036-2041

【Authors】: Sujaya Maiyya ; Victor Zakhary ; Mohammad Javad Amiri ; Divyakant Agrawal ; Amr El Abbadi

【Abstract】: The rise of Bitcoin and other peer-to-peer cryptocurrencies has opened many interesting and challenging problems in cryptography, distributed systems, and databases. The main underlying data structure is blockchain, a scalable fully replicated structure that is shared among all participants and guarantees a consistent view of all user transactions by all participants in the system. In this tutorial, we discuss the basic protocols used in blockchain, and elaborate on its main advantages and limitations. To overcome these limitations, we provide the necessary distributed systems background in managing large scale fully replicated ledgers, using Byzantine Agreement protocols to solve the consensus problem. Finally, we expound on some of the most recent proposals to design scalable and efficient blockchains in both permissionless and permissioned settings. The focus of the tutorial is on the distributed systems and database aspects of the recent innovations in blockchains.

【Keywords】: Computer systems organization; Architectures; Distributed architectures; Peer-to-peer architectures; Dependable and fault-tolerant systems and networks; Information systems; Data management systems; Database management system engines; Distributed database transactions; Networks; Network protocols

169. Classical and Contemporary Approaches to Big Time Series Forecasting.

Paper Link】 【Pages】:2042-2047

【Authors】: Christos Faloutsos ; Jan Gasthaus ; Tim Januschowski ; Yuyang Wang

【Abstract】: Time series forecasting is a key ingredient in the automation and optimization of business processes: in retail, deciding which products to order and where to store them depends on forecasts of future demand in different regions; in cloud computing, the estimated future usage of services and infrastructure components guides capacity planning; and workforce scheduling in warehouses and factories requires forecasts of the future workload. Recent years have witnessed a paradigm shift in forecasting techniques and applications, from computer-assisted model- and assumption-based approaches to data-driven and fully automated ones. This shift can be attributed to the availability of large, rich, and diverse time series corpora and results in a set of challenges that need to be addressed, such as the following. How can we build statistical models to efficiently and effectively learn to forecast from large and diverse data sources? How can we leverage the statistical power of "similar" time series to improve forecasts in the case of limited observations? What are the implications for building forecasting systems that can handle large data volumes? The objective of this tutorial is to provide a concise and intuitive overview of the most important methods and tools available for solving large-scale forecasting problems. We review the state of the art in three related fields: (1) classical modeling of time series, (2) scalable tensor methods, and (3) deep learning for forecasting. Further, we share lessons learned from building scalable forecasting systems. While our focus is on providing an intuitive overview of the methods and practical issues, which we will illustrate via case studies, we also present some technical details underlying these powerful tools.

【Keywords】: Applied computing; Operations research; Forecasting; Computing methodologies; Machine learning; Machine learning approaches; Factorization methods; Learning in probabilistic graphical models; Neural networks; Mathematics of computing; Probability and statistics; Statistical paradigms; Time series analysis

170. Data Pipelines for User Group Analytics.

Paper Link】 【Pages】:2048-2053

【Authors】: Behrooz Omidvar-Tehrani ; Sihem Amer-Yahia

【Abstract】: User data is becoming increasingly available in various domains ranging from the social Web to electronic patient health records (EHRs). User data is characterized by a combination of demographics (e.g., age, gender, life status) and user actions (e.g., posting a tweet, following a diet). Domain experts rely on user data to conduct large-scale population studies. Information consumers, on the other hand, rely on user data for routine tasks such as finding a book club and getting advice from look-alike patients. User data analytics is usually based on identifying group-level behaviors such as "teenage females who watch Titanic" and "old male patients in Paris who suffer from Bronchitis." In this tutorial, we review data pipelines for User Group Analytics (UGA). These pipelines admit raw user data as input and return insights in the form of user groups. We review research on UGA pipelines and discuss approaches and open challenges for discovering, exploring, and visualizing user groups. Throughout the tutorial, we will illustrate examples in two key domains: "the social Web" and "health-care".

【Keywords】: Human-centered computing; Human computer interaction (HCI); HCI design and evaluation methods; User models; Visualization; Visualization application domains; Visual analytics; Visualization systems and tools; Information systems; Information retrieval; Users and interactive retrieval; World Wide Web; Web searching and information discovery; Mathematics of computing; Probability and statistics; Statistical paradigms; Exploratory data analysis

171. From Auto-tuning One Size Fits All to Self-designed and Learned Data-intensive Systems.

Paper Link】 【Pages】:2054-2059

【Authors】: Stratos Idreos ; Tim Kraska

【Abstract】: We survey new opportunities to design data systems, data structures and algorithms that can adapt to both data and query workloads. Data keeps growing, hardware keeps changing and new applications appear ever more frequently. One size does not fit all, but data-intensive applications would like to balance and control memory requirements, read costs, write costs, as well as monetary costs on the cloud. This calls for tailored data systems, storage, and computation solutions that match the exact requirements of the scenario at hand. Such systems should be "synthesized" quickly and nearly automatically, removing the human system designers and administrators from the loop as much as possible to keep up with the quick evolution of applications and workloads. In addition, such systems should "learn" from both past and current system performance and workload patterns to keep adapting their design. We survey new trends in 1) self-designed, and 2) learned data systems and how these technologies can apply to relational, NoSQL, and big data systems as well as to broad data science applications. We focus on both recent research advances and practical applications of this technology, as well as numerous open research opportunities that come from their fusion. We specifically highlight recent work on data structures, algorithms, and query optimization, and how machine learning inspired designs as well as a detailed mapping of the possible design space of solutions can drive innovation to create tailored systems. We also position and connect with past seminal system designs and research in auto-tuning, modular/extensible, and adaptive data systems to highlight the new challenges as well as the opportunities to combine past and new technologies.

【Keywords】: Information systems; Data management systems; Database management system engines; Theory of computation; Design and analysis of algorithms; Data structures design and analysis; Theory and algorithms for application domains; Database theory; Data structures and algorithms for data management

172. Schemas and Types for JSON Data: From Theory to Practice.

Paper Link】 【Pages】:2060-2063

【Authors】: Mohamed Amine Baazizi ; Dario Colazzo ; Giorgio Ghelli ; Carlo Sartiani

【Abstract】: The last few years have seen the fast and ubiquitous diffusion of JSON as one of the most widely used formats for publishing and interchanging data, as it combines the flexibility of semistructured data models with well-known data structures like records and arrays. Users willing to effectively manage JSON data collections can rely on several schema languages, like JSON Schema, JSound, and Joi, as well as on the type abstractions offered by modern programming and scripting languages like Swift or TypeScript. The main aim of this tutorial is to provide the audience (both researchers and practitioners) with the basic notions for enjoying all the benefits that schemas and types can offer while processing and manipulating JSON data. This tutorial focuses on three main aspects of the relation between JSON and schemas: (1) we survey existing schema language proposals and discuss their prominent features; (2) we analyze tools that can infer schemas from data, or that exploit schema information for improving data parsing and management; and (3) we discuss some open research challenges and opportunities related to JSON data.
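
One of the tool categories surveyed, schema inference, can be sketched as inferring a structural type per document and then merging the types across a collection. The notation and the union handling below are deliberately simplistic illustrations, not any surveyed tool's output format.

```python
def infer(value):
    """Infer a structural type for one JSON value."""
    if isinstance(value, dict):
        return {k: infer(v) for k, v in value.items()}
    if isinstance(value, list):
        return sorted({str(infer(v)) for v in value}) or ["empty-array"]
    return type(value).__name__

def merge(a, b):
    """Merge two inferred types; disagreements become a crude union."""
    if isinstance(a, dict) and isinstance(b, dict):
        return {k: merge(a.get(k, "missing"), b.get(k, "missing"))
                for k in sorted(a.keys() | b.keys())}
    return a if a == b else f"union({a} | {b})"

docs = [
    {"name": "ada", "tags": ["dev", "math"], "age": 36},
    {"name": "bob", "tags": ["ops"], "address": {"zip": "10115"}},
]
print(merge(infer(docs[0]), infer(docs[1])))
# Optional fields surface as union(... | missing), e.g. for 'age' and 'address'.
```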

【Keywords】: Information systems; Data management systems; Database design and models; Data model extensions; Semi-structured data; Theory of computation; Logic; Type theory

Workshop Summaries 7

173. GRADES-NDA 2019: Joint International Workshop on Graph Data Management Experiences & Systems and Network Data Analytics.

Paper Link】 【Pages】:2064-2065

【Authors】: Akhil Arora ; Arnab Bhattacharya ; George H. L. Fletcher

【Abstract】: GRADES-NDA 2019 is the second joint meeting of the GRADES and NDA workshops, which were each independently organized at previous SIGMOD-PODS meetings, GRADES since 2013 and NDA since 2016. The focus of GRADES-NDA is the application areas, usage scenarios, and open challenges in managing large-scale graph-shaped data. In summary, GRADES-NDA aims to present technical contributions on graph, RDF, and other data management systems for massive graphs.

【Keywords】: Information systems; Data management systems; Query languages; Query languages for non-relational engines; Information systems applications; Data mining; World Wide Web; Web applications; Social networks; Web data description languages

174. DEEM 2019: Workshop on Data Management for End-to-End Machine Learning.

Paper Link】 【Pages】:2066-2067

【Authors】: Sebastian Schelter ; Neoklis Polyzotis ; Manasi Vartak ; Stephan Seufert

【Abstract】: The DEEM workshop brings together researchers and practitioners at the intersection of applied machine learning, data management, and systems research, with the goal of discussing the arising data management issues in machine learning application scenarios.

【Keywords】: Information systems; Data management systems

175. DSMM'19: The 5th Workshop on Data Science for Macro-modeling with Financial and Economic Datasets.

Paper Link】 【Pages】:2068-2069

【Authors】: Douglas Burdick ; Rajasekar Krishnamurthy ; Louiqa Raschid

【Abstract】: DSMM 2019 will explore the challenges of macro-modeling with financial and economic datasets. The workshop will also showcase the 2019 Financial Entity Identification and Information Integration (FEIII) Challenge which involves two challenge tasks, one over small business data and the other over customs data for shipping manifests.

【Keywords】: General and reference; Document types; General conference proceedings

176. DaMoN 19: The 15th International Workshop on Data Management on New Hardware.

Paper Link】 【Pages】:2070-2071

【Authors】: Thomas Neumann ; Ken Salem

【Abstract】: The 15th International Workshop on Data Management on New Hardware is held in Amsterdam, The Netherlands on July 1st, 2019, co-located with the ACM Conference on Management of Data (SIGMOD). The focus of this workshop is to strengthen the communication between the database community and the broader computer systems communities, specifically the computer architecture, compiler, operating systems, and storage communities.

【Keywords】: Information systems; Data management systems; Database management system engines

177. International Workshop on Human-In-the-Loop Data Analytics (HILDA).

Paper Link】 【Pages】:2072

【Authors】: Leilani Battle ; Surajit Chaudhuri ; Arnab Nandi

【Abstract】: The Human In the Loop Data Analytics (HILDA) workshop aims to foster interdisciplinary efforts that tackle important challenges in better supporting humans in the loop in the context of data-intensive computations, such as interactive data exploration, integration, analytics, and machine learning. Over the past several years, HILDA has brought together DB researchers interested in the distinctive ways that people impact data management tasks, as well as like-minded researchers in other communities, such as Information Visualization, Data Mining/Machine Learning, and HCI. The work presented at HILDA covers a broad range of topics, from algorithmic, interface, and system design to the user's cognitive, physical, and goal-seeking perspectives when managing and exploring data as well as notions of approximation/prediction. This year, we continued to encourage submissions for initial ideas and visions, early reports of work in progress, as well as reflections on completed projects.

【Keywords】: Computing methodologies; Machine learning; Human-centered computing; Visualization; Information systems; Data management systems

178. Overview of the 2nd International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM'19).

Paper Link】 【Pages】:2073-2074

【Authors】: Rajesh Bordawekar ; Oded Shmueli

【Abstract】: Recently, the Artificial Intelligence (AI) field has been experiencing a resurgence. AI broadly covers a wide swath of techniques, which include logic-based approaches, probabilistic graphical models, and machine learning/deep learning approaches. Advances in hardware capabilities, such as Graphics Processing Units (GPUs), software components (e.g., accelerated libraries, programming frameworks), and systems infrastructures (e.g., GPU-enabled cloud providers) have led to widespread adoption of AI techniques in a variety of domains. Examples of such domains include image classification, autonomous driving, automatic speech recognition (ASR), and conversational systems (chatbots). AI techniques not only support multiple datatypes (e.g., free text, images, or speech), but are also available in various configurations, from personal devices to large-scale distributed systems.

【Keywords】: Computing methodologies; Artificial intelligence; Machine learning; Learning paradigms; Information systems; Data management systems; Database management system engines; DBMS engine architectures; Information integration

179. SBD'19: Fourth Edition of the International Workshop on Semantic Big Data.

Paper Link】 【Pages】:2075-2076

【Authors】: Sven Groppe ; Le Gruenwald

【Abstract】: The International Workshop on Semantic Big Data (SBD) is now in its fourth year in conjunction with the ACM SIGMOD conference. By focusing on topics related to the Semantic Web and Big Data, it highlights these important research areas to all participants of the SBD workshop and the main conference. Hence, the workshop offers an ideal forum for open discussions and the establishment of professional relationships between academic researchers and industry members of the Semantic Web and database communities.

【Keywords】: Information systems; Data management systems; Information retrieval; World Wide Web; Theory of computation; Logic