Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018. ACM 【DBLP Link】
【Paper Link】 【Pages】:1
【Authors】: Eric Brewer
【Abstract】:
【Keywords】:
【Paper Link】 【Pages】:3-18
【Authors】: Sainyam Galhotra ; Donatella Firmani ; Barna Saha ; Divesh Srivastava
【Abstract】: Entity resolution (ER) seeks to identify which records in a data set refer to the same real-world entity. Given the diversity of ways in which entities can be represented, matched and distinguished, ER is known to be a challenging task for automated strategies, but relatively easier for expert humans. In our work, we abstract the knowledge of experts with the notion of a binary oracle. Our oracle can answer questions of the form "do records u and v refer to the same entity?" under a flexible error model, allowing for some questions to be more difficult to answer correctly than others. Our contribution is a general error correction tool that can be leveraged by a variety of hybrid human-machine ER algorithms, based on a formal way of selecting indirect "control queries". In our experiments we demonstrate that correction-less ER algorithms equipped with our tool can perform even better than recent ER algorithms specifically designed for correcting errors. Our control queries are selected among those that provide the strongest connectivity between records of each cluster, based on the concept of graph expanders (which are sparse graphs with formal connectivity properties). We give formal performance guarantees for our toolkit and provide experiments on real and synthetic data.
【Keywords】: crowdsourcing; data cleaning; entity resolution; expander graphs
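To make the binary-oracle abstraction above concrete, here is a minimal simulation. It is not the paper's algorithm: the actual control queries are chosen for expander-graph connectivity across a cluster, whereas this sketch simply repeats the same query and takes a majority vote.

```python
import random

def noisy_oracle(u, v, truth, error_rate, rng):
    """Answer 'do u and v refer to the same entity?' -- wrong with prob. error_rate."""
    same = truth[u] == truth[v]
    return same if rng.random() >= error_rate else not same

def corrected_answer(u, v, truth, error_rate, rng, k=5):
    """Majority vote over k queries -- a crude stand-in for the paper's
    expander-based control queries, which pick *different* indirect pairs."""
    votes = sum(noisy_oracle(u, v, truth, error_rate, rng) for _ in range(k))
    return votes > k // 2

rng = random.Random(42)
truth = {"r1": "A", "r2": "A", "r3": "B"}
# With error_rate=0.0 the oracle is perfect; with 0.2, majority voting over
# k=5 queries already pushes the per-pair error well below 20%.
print(corrected_answer("r1", "r2", truth, 0.0, rng))  # True
```

The record names and the repeated-query correction are illustrative assumptions; the point is only that redundant oracle answers can be aggregated to suppress errors.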
【Paper Link】 【Pages】:19-34
【Authors】: Sidharth Mudgal ; Han Li ; Theodoros Rekatsinas ; AnHai Doan ; Youngchoon Park ; Ganesh Krishnan ; Rohit Deep ; Esteban Arcaute ; Vijay Raghavendra
【Abstract】: Entity matching (EM) finds data instances that refer to the same real-world entity. In this paper we examine applying deep learning (DL) to EM, to understand DL's benefits and limitations. We review many DL solutions that have been developed for related matching tasks in text processing (e.g., entity linking, textual entailment, etc.). We categorize these solutions and define a space of DL solutions for EM, as embodied by four solutions with varying representational power: SIF, RNN, Attention, and Hybrid. Next, we investigate the types of EM problems for which DL can be helpful. We consider three such problem types, which match structured data instances, textual instances, and dirty instances, respectively. We empirically compare the above four DL solutions with Magellan, a state-of-the-art learning-based EM solution. The results show that DL does not outperform current solutions on structured EM, but it can significantly outperform them on textual and dirty EM. For practitioners, this suggests that they should seriously consider using DL for textual and dirty EM problems. Finally, we analyze DL's performance and discuss future research directions.
【Keywords】: deep learning; entity matching; entity resolution
【Paper Link】 【Pages】:35-50
【Abstract】: Given a table of data, existing systems can often detect basic atomic types (e.g., strings vs. numbers) for each column. A new generation of data-analytics and data-preparation systems are starting to automatically recognize rich semantic types such as date-time, email address, etc., for such metadata can bring an array of benefits including better table understanding, improved search relevance, precise data validation, and semantic data transformation. However, existing approaches only detect a limited number of types using regular-expression-like patterns, which are often inaccurate, and cannot handle rich semantic types such as credit card and ISBN numbers that encode semantic validations (e.g., checksum). We developed AUTOTYPE, a system that synthesizes type-detection logic from code in open-source repositories like GitHub. Users only need to provide a set of positive examples for a target data type and a search keyword; our system will then automatically identify relevant code and synthesize type-detection functions using execution traces. We compiled a benchmark with 112 semantic types, out of which the proposed system can synthesize code to detect 84 such types at a high precision. Applying the synthesized type-detection logic on web table columns has also resulted in a significant increase in data types discovered compared to alternative approaches.
【Keywords】: code search; data preparation; metadata management; open-source code; semantic data types; type detection
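The checksum-based validations mentioned above (credit card, ISBN) are the kind of logic AUTOTYPE aims to synthesize from open-source code. As an independent, hand-written illustration of what such type-detection functions look like:

```python
def luhn_valid(number: str) -> bool:
    """Credit-card checksum (Luhn): double every second digit from the right,
    subtract 9 from doubles above 9, and require the total to be divisible by 10."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def isbn10_valid(isbn: str) -> bool:
    """ISBN-10 checksum: sum of value * (10 - position) must be divisible by 11;
    the final check character may be 'X' (worth 10)."""
    chars = [c for c in isbn if c.isalnum()]
    if len(chars) != 10:
        return False
    total = 0
    for i, c in enumerate(chars):
        if c == "X" and i == 9:
            v = 10
        elif c.isdigit():
            v = int(c)
        else:
            return False
        total += v * (10 - i)
    return total % 11 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True -- a classic test card number
print(isbn10_valid("0-306-40615-2"))      # True
```

These validators are written by hand here purely for illustration; AUTOTYPE's contribution is finding and repurposing such logic automatically from execution traces of existing code.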
【Paper Link】 【Pages】:51-66
【Authors】: Jian Dai ; Meihui Zhang ; Gang Chen ; Ju Fan ; Kee Yuan Ngiam ; Beng Chin Ooi
【Abstract】: To unlock the wealth of the healthcare data, we often need to link the real-world text snippets to the referred medical concepts described by the canonical descriptions. However, existing healthcare concept linking methods, such as dictionary-based and simple machine learning methods, are not effective due to the word discrepancy between the text snippet and the canonical concept description, and the overlapping concept meaning among the fine-grained concepts. To address these challenges, we propose a Neural Concept Linking (NCL) approach for accurate concept linking using systematically integrated neural networks. We call the novel neural network architecture the COMposite AttentIonal encode-Decode neural network (COM-AID). COM-AID performs an encode-decode process that encodes a concept into a vector and decodes the vector into a text snippet with the help of two devised contexts. On the one hand, it injects the textual context into the neural network through the attention mechanism, so that the word discrepancy can be overcome from the semantic perspective. On the other hand, it incorporates the structural context into the neural network through the attention mechanism, so that minor concept meaning differences can be enlarged and effectively differentiated. Empirical studies on two real-world datasets confirm that NCL produces accurate concept linking results and significantly outperforms state-of-the-art techniques.
【Keywords】: attention mechanism; fine-grained concept linking; healthcare; neural networks
【Paper Link】 【Pages】:67-81
【Authors】: Disheng Qiu ; Luciano Barbosa ; Valter Crescenzi ; Paolo Merialdo ; Divesh Srivastava
【Abstract】: An increasing number of product pages are available from thousands of web sources, each page associated with a product, containing its attributes and one or more product identifiers. The sources provide overlapping information about the products, using diverse schemas, making web-scale integration extremely challenging. In this paper, we take advantage of the opportunity that sources publish product identifiers to perform big data linkage across sources at the beginning of the data integration pipeline, before schema alignment. To realize this opportunity, several challenges need to be addressed: identifiers need to be discovered on product pages, made difficult by the diversity of identifiers; the main product identifier on the page needs to be identified, made difficult by the many related products presented on the page; and identifiers across pages need to be resolved, made difficult by the ambiguity between identifiers across product categories. We present our RaF (Redundancy as Friend) solution to the problem of big data linkage for product specification pages, which takes advantage of the redundancy of identifiers at a global level, and the homogeneity of structure and semantics at the local source level, to effectively and efficiently link millions of pages of head and tail products across thousands of head and tail sources. We perform a thorough empirical evaluation of our RaF approach using the publicly available Dexter dataset consisting of 1.9M product pages from 7.1k sources of 3.5k websites, and demonstrate its effectiveness in practice.
【Keywords】: big data; data extraction; data integration; data linkage
【Paper Link】 【Pages】:83-98
【Authors】: Ben McCamish ; Vahid Ghadakchi ; Arash Termehchy ; Behrouz Touri ; Liang Huang
【Abstract】: As many users do not precisely know the structure and/or the content of databases, their queries do not exactly reflect their information needs. The database management system (DBMS) may interact with users and leverage their feedback on the returned results to learn the information needs behind users' queries. Current query interfaces assume that users follow a fixed strategy of expressing their information needs, that is, the likelihood by which a user submits a query to express an information need remains unchanged during her interaction with the DBMS. Using a real-world interaction workload, we show that users learn and modify how to express their information needs during their interactions with the DBMS. We also show that users' learning is accurately modeled by a well-known reinforcement learning mechanism. As current data interaction systems assume that users do not modify their strategies, they cannot discover the information needs behind users' queries effectively. We model the interaction between users and DBMS as a game with identical interest between two rational agents whose goal is to establish a common language for representing information needs in the form of queries. We propose a reinforcement learning method that learns and answers the information needs behind queries and adapts to changes in users' strategies, and prove that, stochastically speaking, it improves the effectiveness of answering queries. We analyze the challenges of efficient implementation of this method over large-scale relational databases and propose two efficient adaptations of this algorithm. Our extensive empirical studies over real-world query workloads and large-scale relational databases indicate that our algorithms are efficient. Our empirical results also show that our proposed learning mechanism is more effective than the state-of-the-art query answering method.
【Keywords】: collaborative interaction; database interaction; game theory; reinforcement learning
【Paper Link】 【Pages】:99-114
【Authors】: Yinjun Wu ; Abdussalam Alawini ; Susan B. Davidson ; Gianmaria Silvello
【Abstract】: An increasing amount of information is being published in structured databases and retrieved using queries, raising the question of how query results should be cited. Since there are a large number of possible queries over a database, one strategy is to specify citations to a small set of frequent queries - citation views - and use these to construct citations to other "general" queries. We present three approaches to implementing citation views and describe alternative policies for the joint, alternate and aggregated use of citation views. Extensive experiments using both synthetic and realistic citation views and queries show the trade-offs between the approaches in terms of the time to generate citations, as well as the size of the resulting citation. They also show that the choice of policy has a huge effect both on performance and size, leading to useful guidelines for what policies to use and how to specify citation views.
【Keywords】: data citation; provenance; scientific databases
【Paper Link】 【Pages】:115-130
【Authors】: Dan Zhang ; Ryan McKenna ; Ios Kotsogiannis ; Michael Hay ; Ashwin Machanavajjhala ; Gerome Miklau
【Abstract】: The adoption of differential privacy is growing but the complexity of designing private, efficient and accurate algorithms is still high. We propose a novel programming framework and system, Ektelo, for implementing both existing and new privacy algorithms. For the task of answering linear counting queries, we show that nearly all existing algorithms can be composed from operators, each conforming to one of a small number of operator classes. While past programming frameworks have helped to ensure the privacy of programs, the novelty of our framework is its significant support for authoring accurate and efficient (as well as private) programs. After describing the design and architecture of the Ektelo system, we show that Ektelo is expressive, that it allows for safer implementations through code reuse, and that it allows both privacy novices and experts to easily design algorithms. We demonstrate the use of Ektelo by designing several new state-of-the-art algorithms.
【Keywords】:
【Paper Link】 【Pages】:131-146
【Authors】: Graham Cormode ; Tejas Kulkarni ; Divesh Srivastava
【Abstract】: Many analysis and machine learning tasks require the availability of marginal statistics on multidimensional datasets while providing strong privacy guarantees for the data subjects. Applications for these statistics range from finding correlations in the data to fitting sophisticated prediction models. In this paper, we provide a set of algorithms for materializing marginal statistics under the strong model of local differential privacy. We prove the first tight theoretical bounds on the accuracy of marginals compiled under each approach, perform empirical evaluation to confirm these bounds, and evaluate them for tasks such as modeling and correlation testing. Our results show that releasing information based on (local) Fourier transformations of the input is preferable to alternatives based directly on (local) marginals.
【Keywords】:
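As a rough illustration of the local-differential-privacy setting above, here is the classic randomized-response primitive for a one-bit marginal, with an unbiased frequency estimator. This is a textbook baseline, not the Fourier-based mechanisms the paper advocates.

```python
import math
import random

def randomized_response(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (e^eps + 1), else flip it."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if rng.random() < p else 1 - bit

def estimate_frequency(reports, epsilon):
    """Invert the noise: an unbiased estimate of the true fraction of 1s."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

rng = random.Random(7)
true_bits = [1] * 300 + [0] * 700          # true frequency of 1s: 0.3
reports = [randomized_response(b, 1.0, rng) for b in true_bits]
print(round(estimate_frequency(reports, 1.0), 2))  # close to 0.3
```

Each user perturbs only their own bit before it leaves their device, which is the "local" model; the aggregator never sees raw data yet recovers the marginal with quantifiable error.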
【Paper Link】 【Pages】:147-162
【Authors】: Cheng Xu ; Jianliang Xu ; Haibo Hu ; Man Ho Au
【Abstract】: Query authentication has been extensively studied to ensure the integrity of query results for outsourced databases, which are often not fully trusted. However, access control, another important security concern, is largely ignored by existing works. Notably, recent breakthroughs in cryptography have enabled fine-grained access control over outsourced data. In this paper, we take the first step toward studying the problem of authenticating relational queries with fine-grained access control. The key challenge is how to protect information confidentiality during query authentication, which is essential to many critical applications. To address this challenge, we propose a novel access-policy-preserving (APP) signature as the primitive authenticated data structure. A useful property of the APP signature is that it can be used to derive customized signatures for unauthorized users to prove inaccessibility while achieving zero-knowledge confidentiality. We also propose a grid-index-based tree structure that can aggregate APP signatures for efficient range and join query authentication. In addition, a number of optimization techniques are proposed to further improve the authentication performance. Security analysis and performance evaluation show that the proposed solutions and techniques are robust and efficient under various system settings.
【Keywords】: data integrity; fine-grained access control; query processing
【Paper Link】 【Pages】:163-176
【Authors】: Florian Hahn ; Nicolas Loza ; Florian Kerschbaum
【Abstract】: In this paper we address the problem of outsourcing sensitive strings while still providing the functionality of substring searches. While security is one important aspect that requires careful system design, the practical application of the solution depends on feasible processing time and integration efforts into existing systems. That is, searchable symmetric encryption (SSE) allows queries on encrypted data but makes common indexing techniques used in database management systems for fast query processing impossible. As a result, deploying such functional and secure encryption schemes into database systems while maintaining acceptable processing time requires carefully designed special-purpose index structures. Such structures are not available on common database systems but require individual modifications depending on the deployed SSE scheme. Our technique transforms the problem of secure substring search into range queries that can be answered efficiently and in a privacy-preserving way on common database systems without further modifications, using frequency-hiding order-preserving encryption. We evaluated our prototype implementation deployed in a real-world scenario. Taking network latency into account, we demonstrate the practicability of our scheme, with a search time of 98.3 ms over 10,000 indexed emails. Further, we provide a practical security evaluation of this transformation based on the bucketing attack, the best known published attack against this kind of property-preserving encryption.
【Keywords】: encrypted databases; secure substring search
【Paper Link】 【Pages】:177-190
【Authors】: Adam Dziedzic ; Jingjing Wang ; Sudipto Das ; Bolin Ding ; Vivek R. Narasayya ; Manoj Syamala
【Abstract】: Commercial DBMSs, such as Microsoft SQL Server, cater to diverse workloads including transaction processing, decision support, and operational analytics. They also support variety in physical design structures such as B+ tree and columnstore. The benefits of B+ tree for OLTP workloads and columnstore for decision support workloads are well-understood. However, the importance of hybrid physical designs, consisting of both columnstore and B+ tree indexes on the same database, is not well-studied --- a focus of this paper. We first quantify the trade-offs using carefully-crafted micro-benchmarks. This micro-benchmarking indicates that hybrid physical designs can result in orders of magnitude better performance depending on the workload. For complex real-world applications, choosing an appropriate combination of columnstore and B+ tree indexes for a database workload is challenging. We extend the Database Engine Tuning Advisor for Microsoft SQL Server to recommend a suitable combination of B+ tree and columnstore indexes for a given workload. Through extensive experiments using industry-standard benchmarks and several real-world customer workloads, we quantify how a physical design tool capable of recommending hybrid physical designs can result in orders of magnitude better execution costs compared to approaches that rely either on columnstore-only or B+ tree-only designs.
【Keywords】: b+ tree; columnstore; hybrid physical designs; hybrid transactional and analytical processing; operational analytics
【Paper Link】 【Pages】:191-203
【Authors】: Alekh Jindal ; Shi Qiao ; Hiren Patel ; Zhicheng Yin ; Jieming Di ; Malay Bag ; Marc Friedman ; Yifung Lin ; Konstantinos Karanasos ; Sriram Rao
【Abstract】: Analytics-as-a-service, or analytics job service, is emerging as a new paradigm for data analytics, be it in a cloud environment or within enterprises. In this setting, users are not required to manage or tune their hardware and software infrastructure, and they pay only for the processing resources consumed per job. However, the shared nature of these job services across several users and teams leads to significant overlaps in partial computations, i.e., parts of the processing are duplicated across multiple jobs, thus generating redundant costs. In this paper, we describe a computation reuse framework, coined CLOUDVIEWS, which we built to address the computation overlap problem in Microsoft's SCOPE job service. We present a detailed analysis from our production workloads to motivate the computation overlap problem and the possible gains from computation reuse. The key aspects of our system are the following: (i) we reuse computations by creating materialized views over recurring workloads, i.e., periodically executing jobs that have the same script templates but process new data each time, (ii) we select the views to materialize using a feedback loop that reconciles the compile-time and run-time statistics and gathers precise measures of the utility and cost of each overlapping computation, and (iii) we create materialized views in an online fashion, without requiring an offline phase to materialize the overlapping computations.
【Keywords】: computation reuse; materialized views; shared clouds
【Paper Link】 【Pages】:205-219
【Authors】: Rebecca Taft ; Nosayba El-Sayed ; Marco Serafini ; Yu Lu ; Ashraf Aboulnaga ; Michael Stonebraker ; Ricardo Mayerhofer ; Francisco Andrade
【Abstract】: OLTP database systems are a critical part of the operation of many enterprises. Such systems are often configured statically with sufficient capacity for peak load. For many OLTP applications, however, the maximum load is an order of magnitude larger than the minimum, and load varies in a repeating daily pattern. It is thus prudent to allocate computing resources dynamically to match demand. One can allocate resources reactively after a load increase is detected, but this places additional burden on the already-overloaded system to reconfigure. A predictive allocation, in advance of load increases, is clearly preferable. We present P-Store, the first elastic OLTP DBMS to use prediction, and apply it to the workload of B2W Digital (B2W), a large online retailer. Our study shows that P-Store outperforms a reactive system on B2W's workload by causing 72% fewer latency violations, and achieves performance comparable to static allocation for peak demand while using 50% fewer servers.
【Keywords】: database systems; predictive allocation
【Paper Link】 【Pages】:221-230
【Authors】: Edmon Begoli ; Jesús Camacho-Rodríguez ; Julian Hyde ; Michael J. Mior ; Daniel Lemire
【Abstract】: Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. The goal of this paper is to formally introduce Calcite to the broader research community, briefly present its history, and describe its architecture, features, functionality, and patterns for adoption. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for new types of data sources, query languages, and approaches to query processing and optimization.
【Keywords】: apache calcite; data management; modular query optimization; query algebra; relational semantics; storage adapters
【Paper Link】 【Pages】:231-243
【Authors】: Xinan Yan ; Linguan Yang ; Hongbo Zhang ; Xiayue Charles Lin ; Bernard Wong ; Kenneth Salem ; Tim Brecht
【Abstract】: The trend towards global applications and services has created an increasing demand for transaction processing on globally-distributed data. Many database systems, such as Spanner and CockroachDB, support distributed transactions but require a large number of wide-area network roundtrips to commit each transaction and ensure the transaction's state is durably replicated across multiple datacenters. This can significantly increase transaction completion time, resulting in developers replacing database-level transactions with their own error-prone application-level solutions. This paper introduces Carousel, a distributed database system that provides low-latency transaction processing for multi-partition globally-distributed transactions. Carousel shortens transaction processing time by reducing the number of sequential wide-area network round trips required to commit a transaction and replicate its results while maintaining serializability. This is possible in part by using information about a transaction's potential write set to enable transaction processing, including any necessary remote read operations, to overlap with 2PC and state replication. Carousel further reduces transaction completion time by introducing a consensus protocol that can perform state replication in parallel with 2PC. For a multi-partition 2-round Fixed-set Interactive (2FI) transaction, Carousel requires at most two wide-area network roundtrips to commit the transaction when there are no failures, and only one round trip in the common case if local replicas are available.
【Keywords】: distributed transactions; globally-distributed data
【Paper Link】 【Pages】:245-258
【Authors】: Ankur Sharma ; Felix Martin Schuhknecht ; Jens Dittrich
【Abstract】: Efficient transaction management is a delicate task. As systems face transactions of inherently different types, ranging from point updates to long-running analytical queries, it is hard to satisfy their requirements with a single execution engine. Unfortunately, most systems rely on such a design, implementing parallelism using multi-version concurrency control (MVCC). While MVCC parallelizes short-running OLTP transactions well, it struggles in the presence of mixed workloads containing long-running OLAP queries, as scans have to work their way through vast amounts of versioned data. To overcome this problem, we reintroduce the concept of hybrid processing and combine it with state-of-the-art MVCC: OLAP queries are outsourced to run on separate virtual snapshots while OLTP transactions run on the most recent version of the database. Inside both execution engines, we still apply MVCC. The most significant challenge of a hybrid approach is to generate the snapshots at a high frequency. Previous approaches heavily suffered from the high cost of snapshot creation. In our approach, termed AnKer, we follow the current trend of co-designing underlying system components and the DBMS, to overcome the restrictions of the OS by introducing a custom system call vm_snapshot. It allows fine-granular snapshot creation that is orders of magnitude faster than state-of-the-art approaches. Our experimental evaluation on an HTAP workload based on TPC-C transactions and OLAP queries shows that our snapshotting mechanism is more than a factor of 100x faster than fork-based snapshotting and that the latency of OLAP queries is up to a factor of 4x lower than MVCC in a single execution engine. Besides, our approach enables a higher OLTP throughput than all state-of-the-art methods.
【Keywords】: htap; main-memory databases; mvcc; transaction processing; virtual snapshotting
【Paper Link】 【Pages】:259-274
【Authors】: Vivek Shah ; Marcos Antonio Vaz Salles
【Abstract】: The requirements for OLTP database systems are becoming ever more demanding. Domains such as finance and computer games increasingly mandate that developers be able to encode complex application logic and control transaction latencies in in-memory databases. At the same time, infrastructure engineers in these domains need to experiment with and deploy OLTP database architectures that ensure application scalability and maximize resource utilization in modern machines. In this paper, we propose a relational actor programming model for in-memory databases as a novel, holistic approach towards fulfilling these challenging requirements. Conceptually, relational actors, or reactors for short, are application-defined, isolated logical actors that encapsulate relations and process function calls asynchronously. Reactors ease reasoning about correctness by guaranteeing serializability of application-level function calls. In contrast to classic transactional models, however, reactors allow developers to take advantage of intra-transaction parallelism and state encapsulation in their applications to reduce latency and improve locality. Moreover, reactors enable a new degree of flexibility in database deployment. We present ReactDB, a system design exposing reactors that allows for flexible virtualization of database architecture between the extremes of shared-nothing and shared-everything without changes to application code. Our experiments illustrate latency control, low overhead, and asynchronicity trade-offs with ReactDB in OLTP benchmarks.
【Keywords】: actor database systems; relational actors; transactions
【Paper Link】 【Pages】:275-290
【Authors】: Badrish Chandramouli ; Guna Prasaad ; Donald Kossmann ; Justin J. Levandoski ; James Hunter ; Mike Barnett
【Abstract】: Over the last decade, there has been a tremendous growth in data-intensive applications and services in the cloud. Data is created on a variety of edge sources, e.g., devices, browsers, and servers, and processed by cloud applications to gain insights or take decisions. Applications and services either work on collected data, or monitor and process data in real time. These applications are typically update intensive and involve a large amount of state beyond what can fit in main memory. However, they display significant temporal locality in their access pattern. This paper presents FASTER, a new key-value store for point read, blind update, and read-modify-write operations. FASTER combines a highly cache-optimized concurrent hash index with a hybrid log: a concurrent log-structured record store that spans main memory and storage, while supporting fast in-place updates of the hot set in memory. Experiments show that FASTER achieves orders-of-magnitude better throughput - up to 160M operations per second on a single machine - than alternative systems deployed widely today, and exceeds the performance of pure in-memory data structures when the workload fits in memory.
【Keywords】: concurrent; hash index; high performance; key-value store; latch-free; log-structured storage
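FASTER's three operations, point read, blind update, and read-modify-write, can be sketched with a toy store. Lock striping stands in for the paper's latch-free concurrent hash index, and the hybrid log spanning memory and storage is omitted entirely; this is a simplified illustration of the interface, not the system.

```python
import threading

class TinyKV:
    """Toy store exposing FASTER-style operations: point read, blind
    upsert, and read-modify-write (RMW). Lock striping is a deliberate
    simplification of the paper's latch-free hash index."""
    def __init__(self, stripes=16):
        self._data = {}
        self._locks = [threading.Lock() for _ in range(stripes)]

    def _lock(self, key):
        return self._locks[hash(key) % len(self._locks)]

    def read(self, key):                    # point read
        return self._data.get(key)

    def upsert(self, key, value):           # blind update
        with self._lock(key):
            self._data[key] = value

    def rmw(self, key, update, initial):    # read-modify-write
        with self._lock(key):
            self._data[key] = update(self._data.get(key, initial))
            return self._data[key]

kv = TinyKV()
threads = [threading.Thread(target=lambda: [kv.rmw("hits", lambda v: v + 1, 0)
                                            for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(kv.read("hits"))  # 4000 -- no lost updates despite 4 threads
```

RMW is the update-intensive pattern (e.g., counters, aggregates) whose in-place, in-memory handling of the hot set is central to FASTER's reported throughput.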
【Paper Link】 【Pages】:291-306
【Authors】: Mustafa Korkmaz ; Martin Karsten ; Kenneth Salem ; Semih Salihoglu
【Abstract】: Natural short term fluctuations in the load of transactional data systems present an opportunity for power savings. For example, a system handling 1000 requests per second on average can expect more than 1000 requests in some seconds, fewer in others. By quickly adjusting processing capacity to match such fluctuations, power consumption can be reduced. Many systems do this already, using dynamic voltage and frequency scaling (DVFS) to reduce processor performance and power consumption when the load is low. DVFS is typically controlled by frequency governors in the operating system, or by the processor itself. In this paper, we show that transactional database systems can manage DVFS more effectively than the underlying operating system. This is because the database system has more information about the workload, and more control over that workload, than is available to the operating system. We present a technique called POLARIS for reducing the power consumption of transactional database systems. POLARIS directly manages processor DVFS and controls database transaction scheduling. Its goal is to minimize power consumption while ensuring the transactions are completed within a specified latency target. POLARIS is workload-aware, and can accommodate concurrent workloads with different characteristics and latency budgets. We show that POLARIS can simultaneously reduce power consumption and reduce missed latency targets, relative to operating-system-based DVFS governors.
【Keywords】: cpu; dbms systems; deadline; dvfs; power; scheduling; transaction
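The core DVFS trade-off POLARIS manages, running as slowly as the latency target allows, can be sketched as a frequency selector. This is a simplification: the real system also schedules transactions and tracks per-workload latency budgets, and the frequency values below are made up.

```python
def pick_frequency(work_cycles, deadline_s, freqs_hz):
    """Return the lowest frequency that still finishes the estimated work
    before the deadline; if none can, run flat out at the maximum."""
    for f in sorted(freqs_hz):
        if work_cycles / f <= deadline_s:
            return f
    return max(freqs_hz)

freqs = [1.2e9, 1.8e9, 2.4e9, 3.0e9]        # hypothetical P-states
print(pick_frequency(3.0e6, 0.002, freqs))  # 1800000000.0
```

Lower frequency means lower power, so choosing the slowest deadline-respecting setting is what lets a workload-aware DBMS beat a generic OS governor that only sees utilization, not latency targets.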
【Paper Link】 【Pages】:307-322
【Authors】: Ruby Y. Tahboub ; Grégory M. Essertel ; Tiark Rompf
【Abstract】: To leverage modern hardware platforms to their fullest, more and more database systems embrace compilation of query plans to native code. In the research community, there is an ongoing debate about the best way to architect such query compilers. This is perceived to be a difficult task, requiring techniques fundamentally different from traditional interpreted query execution. We aim to contribute to this discussion by drawing attention to an old but underappreciated idea known as Futamura projections, which fundamentally link interpreters and compilers. Guided by this idea, we demonstrate that efficient query compilation can actually be very simple, using techniques that are no more difficult than writing a query interpreter in a high-level language. Moreover, we demonstrate how intricate compilation patterns that were previously used to justify multiple compiler passes can be realized in a single, straightforward generation pass. Key examples are injection of specialized index structures, data representation changes such as string dictionaries, and various kinds of code motion to reduce the amount of work on the critical path. We present LB2: a high-level query compiler developed in this style that performs on par with, and sometimes beats, the best compiled query engines on the standard TPC-H benchmark.
【Keywords】: futamura projections; query compilation
【Paper Link】 【Pages】:323-336
【Authors】: Huanchen Zhang ; Hyeontaek Lim ; Viktor Leis ; David G. Andersen ; Michael Kaminsky ; Kimberly Keeton ; Andrew Pavlo
【Abstract】: We present the Succinct Range Filter (SuRF), a fast and compact data structure for approximate membership tests. Unlike traditional Bloom filters, SuRF supports both single-key lookups and common range queries: open-range queries, closed-range queries, and range counts. SuRF is based on a new data structure called the Fast Succinct Trie (FST) that matches the point and range query performance of state-of-the-art order-preserving indexes, while consuming only 10 bits per trie node. The false positive rates in SuRF for both point and range queries are tunable to satisfy different application needs. We evaluate SuRF in RocksDB as a replacement for its Bloom filters to reduce I/O by filtering requests before they access on-disk data structures. Our experiments on a 100 GB dataset show that replacing RocksDB's Bloom filters with SuRFs speeds up open-seek (without upper-bound) and closed-seek (with upper-bound) queries by up to 1.5× and 5× with a modest cost on the worst-case (all-missing) point query throughput due to slightly higher false positive rate.
【Keywords】: fast succinct tries; lsm-trees; range filter; succinct data structures; surf
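The filtering idea behind SuRF can be illustrated with a much cruder structure: store only fixed-length key prefixes, which makes both point and range membership answerable with no false negatives and a tunable false-positive rate. This is a hedged sketch of the concept; SuRF itself uses a succinct trie, not a Python set, and the class below is hypothetical.

```python
# Illustrative sketch of the idea behind a succinct range filter: keep only
# fixed-length key prefixes, trading a tunable false-positive rate for space.
# (SuRF itself uses the Fast Succinct Trie, not a hash set.)

class PrefixRangeFilter:
    def __init__(self, prefix_len):
        self.p = prefix_len
        self.prefixes = set()

    def add(self, key):
        self.prefixes.add(key[: self.p])

    def may_contain(self, key):
        return key[: self.p] in self.prefixes

    def may_contain_range(self, lo, hi):
        # Truncation preserves ordering, so this never yields false negatives.
        return any(lo[: self.p] <= s <= hi[: self.p] for s in self.prefixes)

f = PrefixRangeFilter(3)
for k in ["apple", "apricot", "banana"]:
    f.add(k)
print(f.may_contain("applause"))          # True: shares prefix "app"
print(f.may_contain_range("avo", "azz"))  # False: no prefix falls in the range
```

Longer prefixes lower the false-positive rate at the cost of space, which mirrors SuRF's tunable bits-per-key trade-off.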
【Paper Link】 【Pages】:337-350
【Authors】: Dmitri V. Kalashnikov ; Laks V. S. Lakshmanan ; Divesh Srivastava
【Abstract】: We study the problem of Query Reverse Engineering (QRE), where given a database and an output table, the task is to find a simple project-join SQL query that generates that table when applied on the database. This problem is known to be computationally challenging, mainly for two reasons. First, the problem has a very large search space and its various variants are known to be NP-hard. Second, executing even a single candidate SQL query can be very computationally expensive. In this work we propose a novel approach for solving the QRE problem efficiently. Our solution outperforms the existing state of the art by 2-3 orders of magnitude for complex queries, resolving those queries in seconds rather than days, thus making our approach more practical in real-life settings.
【Keywords】: automated data lineage discovery; cgm; column coherence
【Paper Link】 【Pages】:351-364
【Authors】: Thomas Kissinger ; Dirk Habich ; Wolfgang Lehner
【Abstract】: The ever-increasing demand for scalable database systems is limited by their energy consumption, which is one of the major challenges in research today. While existing approaches have mainly focused on transaction-oriented disk-based database systems, we investigate and optimize the energy consumption and performance of data-oriented scale-up in-memory database systems, which make heavy use of the main power consumers: processors and main memory. We give an in-depth energy analysis of a current mainstream server system and show that modern processors provide a rich set of energy-control features, but lack the application-specific knowledge required to control them appropriately. Thus, we propose the Energy-Control Loop (ECL) as a DBMS-integrated approach for adaptive energy control on scale-up in-memory database systems that obeys a query latency limit as a soft constraint and actively optimizes energy efficiency and performance of the DBMS. The ECL relies on adaptive workload-dependent energy profiles that are continuously maintained at runtime. In our evaluation, we observed energy savings ranging from 20% to 40% for a real-world load profile.
【Keywords】: adaptivity; database systems; energy efficiency; in-memory
【Paper Link】 【Pages】:365-380
【Authors】: Milos Nikolic ; Dan Olteanu
【Abstract】: We introduce F-IVM, a unified incremental view maintenance (IVM) approach for a variety of tasks, including gradient computation for learning linear regression models over joins, matrix chain multiplication, and factorized evaluation of conjunctive queries. F-IVM is a higher-order IVM algorithm that reduces the maintenance of the given task to the maintenance of a hierarchy of increasingly simpler views. The views are functions mapping keys, which are tuples of input data values, to payloads, which are elements from a task-specific ring. Whereas the computation over the keys is the same for all tasks, the computation over the payloads depends on the task. F-IVM achieves efficiency by factorizing the computation of the keys, payloads, and updates. We implemented F-IVM as an extension of DBToaster. We show in a range of scenarios that it can outperform classical first-order IVM, DBToaster's fully recursive higher-order IVM, and plain recomputation by orders of magnitude while using less memory.
【Keywords】: factorized representation; incremental view maintenance; materialized views; query optimization; rings; stream processing
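The key-to-payload view model in the abstract can be made concrete with the simplest possible ring, the integers: an insert contributes payload +1, a delete contributes -1, and maintaining a COUNT(*) GROUP BY view reduces to pointwise ring addition. This is a minimal sketch of the payload idea only, not the F-IVM view hierarchy.

```python
# Minimal sketch of ring-based IVM: a view maps keys to payloads in the
# integer ring, so an insert is a +1 update, a delete a -1 update, and
# maintaining a COUNT(*) GROUP BY view is just pointwise ring addition.
from collections import defaultdict

def apply_delta(view, delta):
    """Merge a delta (key -> ring element) into the view; drop zero payloads."""
    for key, payload in delta.items():
        view[key] += payload
        if view[key] == 0:
            del view[key]
    return view

view = defaultdict(int)
apply_delta(view, {"red": 1, "blue": 1})   # two inserts
apply_delta(view, {"red": 1})              # another red tuple
apply_delta(view, {"blue": -1})            # delete the blue tuple
print(dict(view))                          # {'red': 2}
```

Swapping the integer ring for a task-specific ring (e.g., one carrying gradient aggregates) changes what the view computes without changing the key-side maintenance logic, which is the point of the unified approach.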
【Paper Link】 【Pages】:381-393
【Authors】: Wenfei Fan ; Xueli Liu ; Ping Lu ; Chao Tian
【Abstract】: Numeric inconsistencies are common in real-life knowledge bases and social networks. To catch such errors, we propose to extend graph functional dependencies with linear arithmetic expressions and comparison predicates, referred to as NGDs. We study fundamental problems for NGDs. We show that their satisfiability, implication and validation problems are Σ2p-complete, Π2p-complete and coNP-complete, respectively. However, if we allow non-linear arithmetic expressions, even of degree at most 2, the satisfiability and implication problems become undecidable. In other words, NGDs strike a balance between expressivity and complexity. To make practical use of NGDs, we develop an incremental algorithm IncDect to detect errors in a graph G using NGDs, in response to updates ΔG to G. We show that the incremental validation problem is coNP-complete. Nonetheless, algorithm IncDect is localizable, i.e., its cost is determined by small neighborhoods of nodes in ΔG rather than by the entire G. Moreover, we parallelize IncDect such that it is guaranteed to reduce running time as more processors are used. Using real-life and synthetic graphs, we experimentally verify the scalability and efficiency of the algorithms.
【Keywords】: graph dependencies; incremental validation; numeric errors
【Paper Link】 【Pages】:395-410
【Authors】: Seongyun Ko ; Wook-Shin Han
【Abstract】: Existing distributed graph analytics systems are categorized into two main groups: those that focus on efficiency with a risk of out-of-memory error and those that focus on scale-up with a fixed memory budget and a sacrifice in performance. While the former group keeps a partitioned graph resident in memory of each machine and uses an in-memory processing technique, the latter stores the partitioned graph in external memory of each machine and exploits a streaming processing technique. Gemini and Chaos are the state-of-the-art distributed graph systems in each group, respectively. We present TurboGraph++, a scalable and fast graph analytics system which efficiently processes large graphs by exploiting external memory for scale-up without compromising efficiency. First, TurboGraph++ provides a new graph processing abstraction for efficiently supporting neighborhood analytics that requires processing multi-hop neighborhoods of vertices, such as triangle counting and local clustering coefficient computation, with a fixed memory budget. Second, TurboGraph++ provides a balanced and buffer-aware partitioning scheme for ensuring balanced workloads across machines with reasonable cost. Lastly, TurboGraph++ leverages three-level parallel and overlapping processing for fully utilizing three hardware resources, CPU, disk, and network, in a cluster. Extensive experiments show that TurboGraph++ is designed to scale well to very large graphs, like Chaos, while its performance is comparable to Gemini.
【Keywords】: disk-based system; distributed system; graph analytics system; scalability
【Paper Link】 【Pages】:411-426
【Authors】: Kyoungmin Kim ; In Seo ; Wook-Shin Han ; Jeong-Hoon Lee ; Sungpack Hong ; Hassan Chafi ; Hyungyu Shin ; Geonhwa Jeong
【Abstract】: A dynamic graph is defined by an initial graph and a graph update stream consisting of edge insertions and deletions. Identifying and monitoring critical patterns in the dynamic graph is important in various application domains such as fraud detection, cyber security, and emergency response. Given a dynamic data graph and a query graph, a continuous subgraph matching system reports positive matches for an edge insertion and reports negative matches for an edge deletion. Previous systems show significantly low throughput due to either repeated subgraph matching for each edge update or expensive overheads in maintaining enormous intermediate results. We present a fast continuous subgraph matching system called TurboFlux which provides high throughput over a fast graph update stream. TurboFlux employs a concise representation of intermediate results, and its execution model allows fast incremental maintenance. Our empirical evaluation shows that TurboFlux significantly outperforms existing competitors by up to six orders of magnitude.
【Keywords】: continuous subgraph matching; data-centric graph; dynamic graph; edge transition model
【Paper Link】 【Pages】:427-439
【Authors】: Wenfei Fan ; Chunming Hu ; Xueli Liu ; Ping Lu
【Abstract】: This paper studies discovery of GFDs, a class of functional dependencies defined on graphs. We investigate the fixed-parameter tractability of three fundamental problems related to GFD discovery. We show that the implication and satisfiability problems are fixed-parameter tractable, but the validation problem is co-W[1]-hard. We introduce notions of reduced GFDs and their topological support, and formalize the discovery problem for GFDs. We develop algorithms for discovering GFDs and computing their covers. Moreover, we show that GFD discovery is feasible over large-scale graphs, by providing parallel scalable algorithms for discovering GFDs that guarantee to reduce running time when more processors are used. Using real-life and synthetic data, we experimentally verify the effectiveness and scalability of the algorithms.
【Keywords】: fixed-parameter tractability; gfd discovery; parallel scalable
【Paper Link】 【Pages】:441-456
【Authors】: Zhewei Wei ; Xiaodong He ; Xiaokui Xiao ; Sibo Wang ; Shuo Shang ; Ji-Rong Wen
【Abstract】: Personalized PageRank (PPR) is a classic metric that measures the relevance of graph nodes with respect to a source node. Given a graph G, a source node s, and a parameter k, a top-k PPR query returns a set of k nodes with the highest PPR values with respect to s. This type of query serves as an important building block for numerous applications in web search and social networks, such as Twitter's Who-To-Follow recommendation service. Existing techniques for top-k PPR, however, suffer from two major deficiencies. First, they either incur prohibitive space and time overheads on large graphs, or fail to provide any guarantee on the precision of top-k results (i.e., the results returned might miss a number of actual top-k answers). Second, most of them require significant pre-computation on the input graph G, which renders them unsuitable for graphs with frequent updates (e.g., Twitter's social graph). To address the deficiencies of existing solutions, we propose TopPPR, an algorithm for top-k PPR queries that ensures at least ρ precision (i.e., at least a ρ fraction of the actual top-k results are returned) with at least 1 - 1/n probability, where ρ ∈ (0, 1] is a user-specified parameter and n is the number of nodes in G. In addition, TopPPR offers non-trivial guarantees on query time in terms of ρ, and it can easily handle dynamic graphs as it does not require any preprocessing. We experimentally evaluate TopPPR using a variety of benchmark datasets, and demonstrate that TopPPR outperforms the state-of-the-art solutions in terms of both efficiency and precision, even when we set ρ = 1 (i.e., when TopPPR returns the exact top-k results). Notably, on a billion-edge Twitter graph, TopPPR only requires 15 seconds to answer a top-500 PPR query with ρ = 1.
【Keywords】: personalized pagerank; top-k queries
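For context, the textbook baseline that precision-guaranteed top-k PPR methods are usually measured against is Monte Carlo estimation with random walks. The sketch below shows that baseline only; the paper's algorithm additionally combines such walks with deterministic push phases, which is not reproduced here.

```python
# Baseline sketch: estimate PPR by random walks with restart probability
# alpha. Each walk terminates at some node; the fraction of walks ending at
# node u estimates its PPR value with respect to the source.
import random
from collections import Counter

def monte_carlo_ppr(graph, source, alpha=0.2, walks=10000, rng=None):
    rng = rng or random.Random(0)
    hits = Counter()
    for _ in range(walks):
        v = source
        while rng.random() > alpha:            # continue with prob 1 - alpha
            nbrs = graph.get(v)
            if not nbrs:
                break                          # dangling node: stop the walk
            v = rng.choice(nbrs)
        hits[v] += 1                           # the walk terminates at v
    return {u: c / walks for u, c in hits.items()}

g = {0: [1, 2], 1: [2], 2: [0]}
ppr = monte_carlo_ppr(g, 0)
print(max(ppr, key=ppr.get))   # the source itself ranks highest here
```

The estimates converge at a rate of O(1/sqrt(walks)), which is exactly the accuracy/time trade-off that index-free methods with precision guarantees aim to improve on.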
【Paper Link】 【Pages】:457-472
【Authors】: Rong-Hua Li ; Lu Qin ; Fanghua Ye ; Jeffrey Xu Yu ; Xiaokui Xiao ; Nong Xiao ; Zibin Zheng
【Abstract】: Given a scientific collaboration network, how can we find a group of collaborators with high research indicator (e.g., h-index) and diverse research interests? Given a social network, how can we identify the communities that have high influence (e.g., PageRank) and also have similar interests to a specified user? In such settings, the network can be modeled as a multi-valued network where each node has d (d ≥ 1) numerical attributes (i.e., h-index, diversity, PageRank, similarity score, etc.). In the multi-valued network, we want to find communities that are not dominated by the other communities in terms of d numerical attributes. Most existing community search algorithms either completely ignore the numerical attributes or only consider one numerical attribute of the nodes. To capture d numerical attributes, we propose a novel community model, called skyline community, based on the concepts of k-core and skyline. A skyline community is a maximal connected k-core that cannot be dominated by the other connected k-cores in the d-dimensional attribute space. We develop an elegant space-partition algorithm to efficiently compute the skyline communities. Two striking advantages of our algorithm are that (1) its time complexity relies mainly on the size of the answer s (i.e., the number of skyline communities), thus it is very efficient if s is small; and (2) it can progressively output the skyline communities, which is very useful for applications that only require part of the skyline communities. Extensive experiments on both synthetic and real-world networks demonstrate the efficiency, scalability, and effectiveness of the proposed algorithm.
【Keywords】: k-core; community search; massive graphs; skyline
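The dominance relation that defines a skyline community is the standard skyline test over d-dimensional attribute vectors. The sketch below shows that test and a naive skyline filter; the paper's space-partition algorithm applies it to connected k-cores, which this toy version does not model.

```python
# Sketch of the dominance test underlying skyline communities: A dominates B
# if A is >= B on every one of the d attributes and strictly greater on at
# least one. The skyline keeps the communities nobody dominates.

def dominates(a, b):
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# Hypothetical per-community vectors, e.g. (average h-index, similarity score).
communities = [(30, 0.9), (25, 0.95), (20, 0.5)]
print(skyline(communities))   # (20, 0.5) drops out: (30, 0.9) dominates it
```

The naive filter is quadratic in the number of communities; the appeal of the paper's algorithm is that its cost depends mainly on the (often small) size of the skyline itself.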
【Paper Link】 【Pages】:473-488
【Authors】: Ziqi Wang ; Andrew Pavlo ; Hyeontaek Lim ; Viktor Leis ; Huanchen Zhang ; Michael Kaminsky ; David G. Andersen
【Abstract】: In 2013, Microsoft Research proposed the Bw-Tree (humorously termed the "Buzz Word Tree"), a lock-free index that provides high throughput for transactional database workloads in SQL Server's Hekaton engine. The Bw-Tree avoids locks by appending delta records to tree nodes and using an indirection layer that allows it to atomically update physical pointers using compare-and-swap (CaS). Correctly implementing these techniques requires careful attention to detail. Unfortunately, the Bw-Tree papers from Microsoft are missing important details and the source code has not been released. This paper has two contributions: First, it is the missing guide for how to build a lock-free Bw-Tree. We clarify missing points in Microsoft's original design documents and then present techniques to improve the index's performance. Although our focus here is on the Bw-Tree, many of our methods apply more broadly to designing and implementing future lock-free in-memory data structures. Our experimental evaluation shows that our optimized variant achieves 1.1--2.5× better performance than the original Microsoft proposal for highly concurrent workloads. Second, our evaluation shows that despite our improvements, the Bw-Tree still does not perform as well as other concurrent data structures that use locks.
【Keywords】: bw-tree; database index; lock-free; main memory oltp; multicore
【Paper Link】 【Pages】:489-504
【Authors】: Tim Kraska ; Alex Beutel ; Ed H. Chi ; Jeffrey Dean ; Neoklis Polyzotis
【Abstract】: Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show that our learned indexes can have significant advantages over traditional indexes. More importantly, we believe that the idea of replacing core components of a data management system through learned models has far reaching implications for future systems designs and that this work provides just a glimpse of what might be possible.
【Keywords】: b-tree; bloom-filter; cdf; hash-map; index structures; learned data structures; learned index; learned index structure; linear regression; mixture of experts; neural net
【Paper Link】 【Pages】:505-520
【Authors】: Niv Dayan ; Stratos Idreos
【Abstract】: In this paper, we show that all mainstream LSM-tree based key-value stores in the literature and in industry are suboptimal with respect to how they trade off among the I/O costs of updates, point lookups, range lookups, as well as the cost of storage, measured as space-amplification. The reason is that they perform expensive merge operations in order to (1) bound the number of runs that a lookup has to probe, and to (2) remove obsolete entries to reclaim space. However, most of these merge operations reduce point lookup cost, long range lookup cost, and space-amplification by a negligible amount. To address this problem, we expand the LSM-tree design space with Lazy Leveling, a new design that prohibits merge operations at all levels of LSM-tree but the largest. We show that Lazy Leveling improves the worst-case cost complexity of updates while maintaining the same bounds on point lookup cost, long range lookup cost, and space-amplification. To be able to navigate between Lazy Leveling and other designs, we make the LSM-tree design space fluid by introducing Fluid LSM-tree, a generalization of LSM-tree that can be parameterized to assume all existing LSM-tree designs. We show how to fluidly transition from Lazy Leveling to (1) designs that are more optimized for updates by merging less at the largest level, and (2) designs that are more optimized for small range lookups by merging more at all other levels. We put everything together to design Dostoevsky, a key-value store that navigates the entire Fluid LSM-tree design space based on the application workload and hardware to maximize throughput using a novel closed-form performance model. We implemented Dostoevsky on top of RocksDB, and we show that it strictly dominates state-of-the-art LSM-tree based key-value stores in terms of performance and space-amplification.
【Keywords】: bloom filters; compaction; lsm-tree
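A back-of-envelope model makes the merge-policy trade-off above concrete: with L levels and size ratio T, the number of runs a point lookup may have to probe differs sharply across leveling, tiering, and Lazy Leveling. The function below is an illustrative simplification, not Dostoevsky's full cost analysis (which also accounts for Bloom filters and range lookups).

```python
# Back-of-envelope run counts a point lookup may probe in an LSM-tree with
# `levels` levels and size ratio `size_ratio`, for the three merge policies
# the paper contrasts (illustrative model, ignoring Bloom filters).

def runs_probed(levels, size_ratio, policy):
    if policy == "leveling":       # eager merging: one run per level
        return levels
    if policy == "tiering":        # lazy merging: up to T-1 runs per level
        return levels * (size_ratio - 1)
    if policy == "lazy_leveling":  # tiered everywhere but the largest level
        return (levels - 1) * (size_ratio - 1) + 1
    raise ValueError(policy)

for p in ("leveling", "lazy_leveling", "tiering"):
    print(p, runs_probed(5, 4, p))
# leveling 5, lazy_leveling 13, tiering 15
```

Since the largest level holds most of the data and most of the obsolete entries, merging eagerly only there (Lazy Leveling) keeps space-amplification and long-range-lookup cost close to leveling while paying tiering-like update costs elsewhere.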
【Paper Link】 【Pages】:521-534
【Authors】: Robert Binna ; Eva Zangerle ; Martin Pichl ; Günther Specht ; Viktor Leis
【Abstract】: We present the Height Optimized Trie (HOT), a fast and space-efficient in-memory index structure. The core algorithmic idea of HOT is to dynamically vary the number of bits considered at each node, which enables a consistently high fanout and thereby good cache efficiency. The layout of each node is carefully engineered for compactness and fast search using SIMD instructions. Our experimental results, which use a wide variety of workloads and data sets, show that HOT outperforms other state-of-the-art index structures for string keys both in terms of search performance and memory footprint, while being competitive for integer keys. We believe that these properties make HOT highly useful as a general-purpose index structure for main-memory databases.
【Keywords】: height optimized trie; index; main memory; simd
【Paper Link】 【Pages】:535-550
【Authors】: Stratos Idreos ; Kostas Zoumpatianos ; Brian Hentschel ; Michael S. Kester ; Demi Guo
【Abstract】: Data structures are critical in any data-driven scenario, but they are notoriously hard to design due to a massive design space and the dependence of performance on workload and hardware which evolve continuously. We present a design engine, the Data Calculator, which enables interactive and semi-automated design of data structures. It brings two innovations. First, it offers a set of fine-grained design primitives that capture the first principles of data layout design: how data structure nodes lay data out, and how they are positioned relative to each other. This allows for a structured description of the universe of possible data structure designs that can be synthesized as combinations of those primitives. The second innovation is computation of performance using learned cost models. These models are trained on diverse hardware and data profiles and capture the cost properties of fundamental data access primitives (e.g., random access). With these models, we synthesize the performance cost of complex operations on arbitrary data structure designs without having to: 1) implement the data structure, 2) run the workload, or even 3) access the target hardware. We demonstrate that the Data Calculator can assist data structure designers and researchers by accurately answering rich what-if design questions on the order of a few seconds or minutes, i.e., computing how the performance (response time) of a given data structure design is impacted by variations in the: 1) design, 2) hardware, 3) data, and 4) query workloads. This makes it effortless to test numerous designs and ideas before embarking on lengthy implementation, deployment, and hardware acquisition steps. We also demonstrate that the Data Calculator can synthesize entirely new designs, auto-complete partial designs, and detect suboptimal design choices.
【Keywords】: data structure synthesis; learned cost models
【Paper Link】 【Pages】:551-566
【Authors】: Mohiuddin Abdul Qader ; Shiwen Cheng ; Vagelis Hristidis
【Abstract】: NoSQL databases are increasingly used in big data applications, because they achieve fast write throughput and fast lookups on the primary key. Many of these applications also require queries on non-primary attributes. For that reason, several NoSQL databases have added support for secondary indexes. However, these works are fragmented, as each system generally supports one type of secondary index, and may be using different names or no name at all to refer to such indexes. As there is no single system that supports all types of secondary indexes, no experimental head-to-head comparison or performance analysis of the various secondary indexing techniques in terms of throughput and space exists. In this paper, we present a taxonomy of NoSQL secondary indexes, broadly split into two classes: Embedded Indexes (i.e. lightweight filters embedded inside the primary table) and Stand-Alone Indexes (i.e. separate data structures). To ensure the fairness of our comparative study, we built a system, LevelDB++, on top of Google's popular open-source LevelDB key-value store. There, we implemented two Embedded Indexes and three state-of-the-art Stand-Alone indexes, which cover most of the popular NoSQL databases. Our comprehensive experimental study and theoretical evaluation show that none of these indexing techniques dominate the others: the embedded indexes offer superior write throughput and are more space efficient, whereas the stand-alone secondary indexes achieve faster query response times. Thus, the optimal choice of secondary index depends on the application workload. This paper provides an empirical guideline for choosing secondary indexes.
【Keywords】: leveldb; lsm-tree; nosql; secondary indexing; top-k
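The Stand-Alone class from the taxonomy above amounts to a second table mapping each secondary-attribute value to the primary keys that hold it, kept consistent on every write. The in-memory class below is a hypothetical stand-in for that structure (the paper implements it over LevelDB's on-disk LSM-tree, which this sketch does not model).

```python
# Sketch of a Stand-Alone secondary index: a separate mapping from secondary
# attribute value -> {primary keys}, maintained on every put so that queries
# on the non-primary attribute avoid a full scan.
from collections import defaultdict

class KVWithSecondaryIndex:
    def __init__(self, attr):
        self.attr = attr
        self.primary = {}                    # pk -> record
        self.index = defaultdict(set)        # attr value -> {pk}

    def put(self, pk, record):
        old = self.primary.get(pk)
        if old is not None:                  # updates must move the pk
            self.index[old[self.attr]].discard(pk)
        self.primary[pk] = record
        self.index[record[self.attr]].add(pk)

    def query(self, value):                  # lookup on the secondary attr
        return [self.primary[pk] for pk in sorted(self.index[value])]

db = KVWithSecondaryIndex("city")
db.put("u1", {"city": "Houston"})
db.put("u2", {"city": "Austin"})
db.put("u1", {"city": "Austin"})            # update moves u1 in the index
print(db.query("Austin"))
```

The extra index write on every `put` is exactly the write-throughput and space cost that the paper finds Embedded Indexes avoid, at the price of slower secondary-attribute queries.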
【Paper Link】 【Pages】:567-582
【Authors】: Tao Guo ; Kaiyu Feng ; Gao Cong ; Zhifeng Bao
【Abstract】: With the proliferation of mobile devices, large collections of geospatial data are becoming available, such as geo-tagged photos. Map rendering systems play an important role in presenting such large geospatial datasets to end users. We propose that such systems should support the following desirable features: representativeness, visibility constraint, zooming consistency, and panning consistency. The first two constraints are fundamental challenges to a map exploration system, which aims to efficiently select a small set of representative objects from the current region of user's interest, and any two selected objects should not be too close to each other for users to distinguish in the limited space of a screen. We formalize it as the Spatial Object Selection (SOS) problem, prove that it is an NP-hard problem, and develop a novel approximation algorithm with performance guarantees. To further support interactive exploration of geospatial data on maps, we propose the Interactive SOS (ISOS) problem, in which we enrich the SOS problem with the zooming consistency and panning consistency constraints. The objective of ISOS is to provide seamless experience for end-users to interactively explore the data by navigating the map. We extend our algorithm for the SOS problem to solve the ISOS problem, and propose a new strategy based on pre-fetching to significantly enhance the efficiency. Finally we have conducted extensive experiments to show the efficiency and scalability of our approach.
【Keywords】: geospatial data visualization; map exploration; sampling
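The visibility constraint from the SOS problem can be shown with a simple greedy scan: keep an object only if it lies at least a minimum distance from everything already kept. This illustrates the constraint itself, not the paper's approximation algorithm with guarantees; the coordinates and threshold below are hypothetical.

```python
# Greedy sketch of the visibility constraint in spatial object selection:
# keep an object (e.g. a geo-tagged photo) only if it is at least `min_dist`
# away from every object already selected for the screen.
import math

def select_visible(objects, min_dist):
    chosen = []
    for x, y in objects:
        if all(math.hypot(x - cx, y - cy) >= min_dist for cx, cy in chosen):
            chosen.append((x, y))
    return chosen

photos = [(0, 0), (0.5, 0.5), (3, 0), (3.2, 0.1), (0, 4)]
print(select_visible(photos, 2.0))   # [(0, 0), (3, 0), (0, 4)]
```

Scanning objects in order of a representativeness score (rather than arbitrary order, as here) is one natural way to combine the representativeness and visibility requirements in a single pass.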
【Paper Link】 【Pages】:583-594
【Authors】: Jean-François Im ; Kishore Gopalakrishna ; Subbu Subramaniam ; Mayank Shrivastava ; Adwait Tumbde ; Xiaotian Jiang ; Jennifer Dai ; Seunghyun Lee ; Neha Pawar ; Jialiang Li ; Ravi Aringunram
【Abstract】: Modern users demand analytical features on fresh, real-time data. Offering these analytical features to hundreds of millions of users is a relevant problem encountered by many large-scale web companies. Relational databases and key-value stores can be scaled to provide point lookups for a large number of users, but fall apart under the combination of high ingest rates and high rates of low-latency analytical queries. Online analytical databases typically rely on bulk data loads and are not built to handle nonstop operation in demanding web environments. Offline analytical systems have high throughput but neither offer low query latencies nor scale to serving tens of thousands of queries per second. We present Pinot, a single system used in production at LinkedIn that can serve tens of thousands of analytical queries per second, offers near-real-time data ingestion from streaming data sources, and handles the operational requirements of large web properties. We also provide a performance comparison with Druid, a system similar to Pinot.
【Keywords】: near real time data ingestion; olap; parallel and distributed dbmss; pinot
【Paper Link】 【Pages】:595-599
【Authors】: Peilin Yang ; Srikanth Thiagarajan ; Jimmy Lin
【Abstract】: Twitter's data engineering team is faced with the challenge of processing billions of events every day in batch and in real time, and we have built various tools to meet these demands. In this paper, we describe TSAR (TimeSeries AggregatoR), a robust, scalable, real-time event time series aggregation framework built primarily for engagement monitoring: aggregating interactions with Tweets, segmented along a multitude of dimensions such as device, engagement type, etc. TSAR is built on top of Summingbird, an open-source framework for integrating batch and online MapReduce computations, and removes much of the tedium associated with building end-to-end aggregation pipelines---from the ingestion and processing of events to the publication of results in heterogeneous datastores. Clients are provided a query interface that powers dashboards and supports downstream ad hoc analytics.
【Keywords】: batch processing; online processing
【Paper Link】 【Pages】:601-613
【Authors】: Michael Armbrust ; Tathagata Das ; Joseph Torres ; Burak Yavuz ; Shixiong Zhu ; Reynold Xin ; Ali Ghodsi ; Ion Stoica ; Matei Zaharia
【Abstract】: With the ubiquity of real-time data, organizations need streaming systems that are scalable, easy to use, and easy to integrate into business applications. Structured Streaming is a new high-level streaming API in Apache Spark based on our experience with Spark Streaming. Structured Streaming differs from other recent streaming APIs, such as Google Dataflow, in two main ways. First, it is a purely declarative API based on automatically incrementalizing a static relational query (expressed using SQL or DataFrames), in contrast to APIs that ask the user to build a DAG of physical operators. Second, Structured Streaming aims to support end-to-end real-time applications that integrate streaming with batch and interactive analysis. We found that this integration was often a key challenge in practice. Structured Streaming achieves high performance via Spark SQL's code generation engine and can outperform Apache Flink by up to 2x and Apache Kafka Streams by 90x. It also offers rich operational features such as rollbacks, code updates, and mixed streaming/batch execution. We describe the system's design and use cases from several hundred production deployments on Databricks, the largest of which process over 1 PB of data per month.
【Keywords】: apache spark; programming models; stream processing
【Paper Link】 【Pages】:615-627
【Authors】: Wei Cao ; Yusong Gao ; Bingchen Lin ; Xiaojie Feng ; Yu Xie ; Xiao Lou ; Peng Wang
【Abstract】: Smooth end-to-end performance of mission-critical database systems is essential to the stability of applications deployed on the cloud. It is a challenge for cloud database vendors to detect performance degradation in real time and locate the root cause quickly in a sophisticated network environment. Cloud database vendors tend to favor a multi-tier distributed architecture to achieve multi-tenant management, scalability and high availability, which may further complicate the problem. This paper presents TcpRT, the instrumentation and diagnosis infrastructure in Alibaba Cloud RDS that achieves real-time anomaly detection. We wrote a Linux kernel module to collect trace data for each SQL query; designed to be efficient with minimal overhead, it adds tracepoints in the callbacks of the TCP congestion-control kernel module and is completely transparent to database processes. To significantly reduce the amount of data before sending it to the backend, raw trace data is aggregated. Aggregated trace data is then processed, grouped and analyzed in a distributed streaming computing platform. By fitting a self-adjusting Cauchy-distribution statistical model to historical performance data for each DB instance, anomalous events can be detected automatically, which eliminates manually configuring thresholds by experience. A fault or hiccup in any network component shared among multiple DB instances (e.g., hosted on the same physical machine or uplinked to the same pair of TOR switches) may cause large-scale service-quality degradation. The ratio of anomalous DB instances to network components is calculated, which helps pinpoint the faulty component. TcpRT has been deployed in production at Alibaba Cloud for the past 3 years; it collects over 20 million raw traces per second, processes over 10 billion locally aggregated results in the backend per day, and keeps its performance impact on the DB system within 1%. We present case studies of typical scenarios where TcpRT helped solve various problems that occurred in the production system.
【Keywords】:
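The threshold-free detection idea above can be sketched with a heavy-tailed fit over historical latencies: estimate a Cauchy distribution's location and scale with robust statistics, then flag samples that land in the extreme tail. This is an illustrative assumption-laden sketch (the estimators, cutoff value, and sample data below are hypothetical), not TcpRT's production model.

```python
# Hedged sketch of threshold-free anomaly flagging in the spirit of TcpRT:
# fit a Cauchy distribution to historical latencies via the robust
# median / half-IQR estimators, then flag samples whose CDF position is
# extreme rather than comparing against a hand-tuned latency threshold.
import math

def fit_cauchy(samples):
    s = sorted(samples)
    n = len(s)
    loc = s[n // 2]                          # median ~ location parameter
    scale = (s[3 * n // 4] - s[n // 4]) / 2  # half the IQR ~ scale parameter
    return loc, max(scale, 1e-9)

def is_anomalous(x, loc, scale, cutoff=0.995):
    cdf = 0.5 + math.atan((x - loc) / scale) / math.pi  # Cauchy CDF
    return cdf > cutoff

history = [10, 11, 9, 12, 10, 13, 11, 10, 12, 9, 11, 10]   # ms, hypothetical
loc, scale = fit_cauchy(history)
print(is_anomalous(11, loc, scale))    # False: within the normal band
print(is_anomalous(500, loc, scale))   # True: extreme tail event
```

Because the fit self-adjusts as the history window moves, each DB instance effectively carries its own detection threshold, which is the point of replacing manually configured limits.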
【Paper Link】 【Pages】:629
【Authors】: Pedro Domingos
【Abstract】: Machine learning has made great strides in recent years, and its applications are spreading rapidly. Unfortunately, the standard machine learning formulation does not match well with data management problems. For example, most learning algorithms assume that the data is contained in a single table, and consists of i.i.d. (independent and identically distributed) samples. This leads to a proliferation of ad hoc solutions, slow development, and suboptimal results. Fortunately, a body of machine learning theory and practice is being developed that dispenses with such assumptions, and promises to make machine learning for data management much easier and more effective [1]. In particular, representations like Markov logic, which includes many types of deep networks as special cases, allow us to define very rich probability distributions over non-i.i.d., multi-relational data [2]. Despite their generality, learning the parameters of these models is still a convex optimization problem, allowing for efficient solution. Learning structure-in the case of Markov logic, a set of formulas in first-order logic-is intractable, as in more traditional representations, but can be done effectively using inductive logic programming techniques. Inference is performed using probabilistic generalizations of theorem proving, and takes linear time and space in tractable Markov logic, an object-oriented specialization of Markov logic [3]. These techniques have led to state-of-the-art, principled solutions to problems like entity resolution, schema matching, ontology alignment, and information extraction. Using tractable Markov logic, we have extracted from the Web a probabilistic knowledge base with millions of objects and billions of parameters, which can be queried exactly in subsecond times using an RDBMS backend [3]. With these foundations in place, we expect the pace of machine learning applications in data management to continue to accelerate in coming years.
【Keywords】: graphical models; non-i.i.d. data; probabilistic databases; probabilistic theorem proving
【Paper Link】 【Pages】:631-645
【Authors】: Lin Ma ; Dana Van Aken ; Ahmed Hefny ; Gustavo Mezerhane ; Andrew Pavlo ; Geoffrey J. Gordon
【Abstract】: The first step towards an autonomous database management system (DBMS) is the ability to model the target application's workload. This is necessary to allow the system to anticipate future workload needs and select the proper optimizations in a timely manner. Previous forecasting techniques model the resource utilization of the queries. Such metrics, however, change whenever the physical design of the database and the hardware resources change, thereby rendering previous forecasting models useless. We present a robust forecasting framework called QueryBot 5000 that allows a DBMS to predict the expected arrival rate of queries in the future based on historical data. To better support highly dynamic environments, our approach uses the logical composition of queries in the workload rather than the amount of physical resources used for query execution. It provides multiple horizons (short- vs. long-term) with different aggregation intervals. We also present a clustering-based technique for reducing the total number of forecasting models to maintain. To evaluate our approach, we compare our forecasting models against other state-of-the-art models on three real-world database traces. We implemented our models in an external controller for PostgreSQL and MySQL and demonstrate their effectiveness in selecting indexes.
【Keywords】: autonomic computing; autonomous dbms; database management systems; machine learning; query forecasting; workload prediction
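The forecasting step described above can be illustrated with a toy model: fit a trend line to recent per-interval query arrival counts and extrapolate one interval ahead. This is only an illustrative sketch, not QueryBot 5000 itself, which combines multiple learned models over clustered query templates; the function name is hypothetical.

```python
def forecast_next(counts):
    """Fit a least-squares line to a series of per-interval query
    arrival counts and predict the count for the next interval
    (a minimal stand-in for a workload forecasting model)."""
    n = len(counts)
    mx = (n - 1) / 2                      # mean of x = 0..n-1
    my = sum(counts) / n                  # mean of y
    sxx = sum((x - mx) ** 2 for x in range(n))
    sxy = sum((x - mx) * (y - my) for x, y in zip(range(n), counts))
    slope = sxy / sxx
    return my + slope * (n - mx)          # extrapolate to x = n

# A perfectly linear workload trend extrapolates exactly.
print(forecast_next([10, 20, 30, 40]))
```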
【Paper Link】 【Pages】:647-662
【Authors】: Zechao Shang ; Jeffrey Xu Yu ; Aaron J. Elmore
【Abstract】: Motivated by the applicability of HogWild!-style algorithms, researchers have turned their focus to system architectures that provide ultra-high throughput random access with very limited or no isolation guarantees, and build inconsistency-tolerant applications (i.e., large-scale optimization algorithms) on top of them. Although some optimization algorithms have theoretical convergence guarantees, these systems sometimes fail to compute the correct results when the presumptions of convergence do not hold. Moreover, there is no practical way to tell whether a given result is accurate (without cross validation) or to tune the isolation strength on the fly. To resolve these problems, these systems need an indicator that reports the number of "bad events" caused by "out-of-order" executions. In this paper, we tackle this problem. Based on transaction processing theory, we find that the number of cycles in the dependency graph is a good indicator. With this observation, we propose the first real-time isolation anomalies monitor. Our monitor is at least 1000x faster than naive implementations and reports accurate isolation anomaly levels with less than 1% extra overhead. Monitoring anomalies in a real-time manner efficiently protects the systems from excessive isolation anomalies which could lead to incorrect results. We verify the performance and effectiveness of our monitor via extensive experimental studies.
【Keywords】: isolation anomalies; large-scale optimization
【Paper Link】 【Pages】:663-675
【Authors】: Florian Wolf ; Norman May ; Paul R. Willems ; Kai-Uwe Sattler
【Abstract】: Cardinality estimation is a crucial task in query optimization and typically relies on heuristics and basic statistical approximations. At execution time, estimation errors might result in situations where intermediate result sizes differ from the estimated ones, so that the originally chosen plan is no longer the optimal plan. In this paper we analyze the deviation from the estimate, and denote as the optimality range the cardinality range of an intermediate result within which the chosen plan remains optimal. While previous work used simple heuristics to calculate similar ranges, we generate the precise bounds for the optimality range considering all relevant plan alternatives. Our experimental results show that the fixed optimality ranges used in previous work fail to characterize the range of cardinalities where a plan is optimal. We derive theoretical worst-case bounds for the number of enumerated plans required to compute the precise optimality range, and experimentally show that in real queries this number is significantly smaller. Our experiments also show the benefit for applications like Mid-Query Re-Optimization in terms of significant execution time improvement.
【Keywords】: mid-query re-optimization; optimality ranges; parametric queries; query optimization; query plan caching; query processing
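The optimality-range idea can be made concrete in a toy setting: if each candidate plan's cost is linear in one intermediate cardinality f (cost = fixed + per_row * f), the range where the chosen plan stays cheapest is the intersection of pairwise linear inequalities. This is only an illustrative sketch under that simplifying cost model, not the paper's algorithm over full plan spaces; the function name is hypothetical.

```python
def optimality_range(plans, chosen):
    """plans: list of (fixed, per_row) cost pairs; cost(f) = fixed + per_row*f.
    Return the interval (lo, hi) of the intermediate cardinality f over
    which plans[chosen] is the cheapest plan, or None if it never is."""
    a_p, b_p = plans[chosen]
    lo, hi = 0.0, float("inf")
    for i, (a, b) in enumerate(plans):
        if i == chosen:
            continue
        # require a_p + b_p*f <= a + b*f, i.e. (b_p - b)*f <= a - a_p
        db, da = b_p - b, a - a_p
        if db > 0:
            hi = min(hi, da / db)
        elif db < 0:
            lo = max(lo, da / db)
        elif da < 0:
            return None  # strictly dominated for every cardinality
    return (lo, hi) if lo <= hi else None

# Plan 0: cost f (e.g., index-nested-loop-like); plan 1: cost 100 + 0.5f.
plans = [(0.0, 1.0), (100.0, 0.5)]
print(optimality_range(plans, 0))   # plan 0 optimal for small f
```

Once the actual cardinality observed at run time leaves this interval, re-optimization is warranted, which is the hook to Mid-Query Re-Optimization mentioned in the abstract.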
【Paper Link】 【Pages】:677-692
【Authors】: Thomas Neumann ; Bernhard Radke
【Abstract】: The use of business intelligence tools and other means to generate queries has led to great variety in the size of join queries. While most queries are reasonably small, join queries with up to a hundred relations are not that exotic anymore, and the distribution of query sizes has an incredibly long tail. The largest real-world query that we are aware of accesses more than 4,000 relations. This large spread makes query optimization very challenging. Join ordering is known to be NP-hard, which means that we cannot hope to solve such large problems exactly. On the other hand, most queries are much smaller, and there is no reason to sacrifice optimality there. This paper introduces an adaptive optimization framework that is able to solve most common join queries exactly, while simultaneously scaling to queries with thousands of joins. A key component is a novel search space linearization technique that leads to near-optimal execution plans for large classes of queries. In addition, we describe implementation techniques that are necessary to scale join ordering algorithms to these extremely large queries. Extensive experiments with over 10 different approaches show that the new adaptive approach proposed here performs excellently over a huge spectrum of query sizes, and produces optimal or near-optimal solutions for most common queries.
【Keywords】:
【Paper Link】 【Pages】:693-708
【Authors】: TaiNing Wang ; Chee-Yong Chan
【Abstract】: A critical task in query optimization is the join reordering problem which is to find an efficient evaluation order for the join operators in a query plan. While the join reordering problem is well studied for queries with only inner-joins, the problem becomes considerably harder when outerjoins/antijoins are involved as such operators are generally not associative. The existing solutions for this problem do not enumerate the complete space of join orderings due to various restrictions on the query rewriting rules considered. In this paper, we present a novel approach for this problem for the class of queries involving inner-joins, single-sided outerjoins, and/or antijoins. Our work is able to support complete join reorderability for this class of queries which supersedes the state-of-the-art approaches.
【Keywords】: join reordering; query optimization
【Paper Link】 【Pages】:709-724
【Authors】: Dian Ouyang ; Lu Qin ; Lijun Chang ; Xuemin Lin ; Ying Zhang ; Qing Zhu
【Abstract】: Computing the shortest distance between two vertices is a fundamental problem in road networks. Since a direct search using Dijkstra's algorithm results in a large search space, researchers resort to indexing-based approaches. State-of-the-art indexing-based solutions can be categorized into hierarchy-based solutions and hop-based solutions. However, the hierarchy-based solutions require a large search space for long-distance queries, while the hop-based solutions result in high computational waste for short-distance queries. To overcome the drawbacks of both solutions, in this paper, we propose a novel hierarchical 2-hop index (H2H-Index) which assigns a label for each vertex and at the same time preserves a hierarchy among all vertices. With the H2H-Index, we design an efficient query processing algorithm with performance guarantees by visiting part of the labels for the source and destination based on the vertex hierarchy. We also propose an algorithm to construct the H2H-Index based on distance preserved graphs. The algorithm is further optimized by computing the labels based on the partially computed labels of other vertices. We conducted extensive performance studies using large real road networks including the whole USA road network. The experimental results demonstrate that our approach can achieve a speedup of an order of magnitude in query processing compared to the state-of-the-art while consuming comparable indexing time and index size.
【Keywords】: road network; shortest distance; tree decomposition
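The 2-hop labeling principle underlying the H2H-Index can be sketched in a few lines: each vertex stores distances to a set of "hub" vertices, and a query takes the minimum of d(s,h) + d(h,t) over hubs common to both labels. This is the generic 2-hop query, not the paper's hierarchy-aware construction; the toy labels below are hand-built for a three-vertex path graph.

```python
def query(labels, s, t):
    """2-hop distance query. labels: dict vertex -> {hub: distance}.
    The answer is min over common hubs h of d(s,h) + d(h,t)."""
    ls, lt = labels[s], labels[t]
    common = ls.keys() & lt.keys()
    if not common:
        return float("inf")
    return min(ls[h] + lt[h] for h in common)

# Toy labels for the path graph a - b - c; hub b covers every pair.
labels = {
    "a": {"a": 0, "b": 1},
    "b": {"b": 0},
    "c": {"b": 1, "c": 0},
}
print(query(labels, "a", "c"))
```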
【Paper Link】 【Pages】:725-740
【Authors】: Zeyuan Shang ; Guoliang Li ; Zhifeng Bao
【Abstract】: Trajectory analytics can benefit many real-world applications, e.g., frequent-trajectory-based navigation systems, road planning, car pooling, and transportation optimizations. Existing algorithms focus on optimizing this problem on a single machine. However, the volume of trajectories can exceed the storage and processing capability of a single machine, which calls for large-scale trajectory analytics in distributed environments. Distributed trajectory analytics faces challenges of data-locality-aware partitioning, load balance, easy-to-use interfaces, and versatility to support various trajectory similarity functions. To address these challenges, we propose a distributed in-memory trajectory analytics system, DITA. We propose an effective partitioning method, with a global index and local indexes, to address the data locality problem. We devise cost-based techniques to balance the workload. We develop a filter-verification framework to improve the performance. Moreover, DITA can support most existing similarity functions to quantify the similarity between trajectories. We integrate our framework seamlessly into Spark SQL, and make it support SQL and DataFrame API interfaces. We have conducted extensive experiments on real-world datasets, and experimental results show that DITA outperforms existing distributed trajectory similarity search and join approaches significantly.
【Keywords】:
【Paper Link】 【Pages】:741-756
【Authors】: Yang Zhou ; Tong Yang ; Jie Jiang ; Bin Cui ; Minlan Yu ; Xiaoming Li ; Steve Uhlig
【Abstract】: Approximate stream processing algorithms, such as the Count-Min sketch, Space-Saving, etc., support numerous applications in databases, storage systems, networking, and other domains. However, the unbalanced distribution in real data streams poses great challenges to existing algorithms. To enhance these algorithms, we propose a meta-framework, called Cold Filter (CF), that enables faster and more accurate stream processing. Different from existing filters that mainly focus on hot items, our filter captures cold items in the first stage, and hot items in the second stage. Also, existing filters require two-direction communication, with frequent exchanges between the two stages; our filter, on the other hand, is one-direction: each item enters one stage at most once. Our filter can accurately estimate both cold and hot items, giving it a genericity that makes it applicable to many stream processing tasks. To illustrate the benefits of our filter, we deploy it on three typical stream processing tasks, and experimental results show speed improvements of up to 4.7 times, and accuracy improvements of up to 51 times. All source code is made publicly available on GitHub.
【Keywords】: approximate algorithms; data streams; data structures; sketch
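The two-stage, one-direction idea can be sketched with a compact Count-Min sketch as the cold stage and an exact table as the hot stage. This is a toy illustration in the spirit of Cold Filter, not the paper's data structure; the class names, threshold, and sketch sizes are hypothetical choices.

```python
import hashlib

class CountMin:
    """Minimal Count-Min sketch (estimates never undercount)."""
    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _idx(self, item, row):
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.width

    def add(self, item):
        for r in range(self.depth):
            self.rows[r][self._idx(item, r)] += 1

    def estimate(self, item):
        return min(self.rows[r][self._idx(item, r)] for r in range(self.depth))


class TwoStageFilter:
    """Toy two-stage filter: cold items are counted in a compact sketch;
    once an item's estimate reaches `threshold` it is promoted to an
    exact hot-item table and never returns to the first stage
    (one-direction flow, as the abstract describes)."""
    def __init__(self, threshold=3, width=1024, depth=3):
        self.threshold = threshold
        self.cold = CountMin(width, depth)
        self.hot = {}

    def add(self, item):
        if item in self.hot:
            self.hot[item] += 1
            return
        self.cold.add(item)
        est = self.cold.estimate(item)
        if est >= self.threshold:
            self.hot[item] = est  # promote: exact tracking from now on

    def estimate(self, item):
        return self.hot[item] if item in self.hot else self.cold.estimate(item)
```

Frequent items thus pay one hash-table lookup per update after promotion, while the long tail of cold items stays inside the small sketch.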
【Paper Link】 【Pages】:757-772
【Authors】: Kai Sheng Tai ; Vatsal Sharan ; Peter Bailis ; Gregory Valiant
【Abstract】: We introduce a new sub-linear space sketch, the Weight-Median Sketch, for learning compressed linear classifiers over data streams while supporting the efficient recovery of large-magnitude weights in the model. This enables memory-limited execution of several statistical analyses over streams, including online feature selection, streaming data explanation, relative deltoid detection, and streaming estimation of pointwise mutual information. Unlike related sketches that capture the most frequently-occurring features (or items) in a data stream, the Weight-Median Sketch captures the features that are most discriminative of one stream (or class) compared to another. The Weight-Median Sketch adopts the core data structure used in the Count-Sketch, but, instead of sketching counts, it captures sketched gradient updates to the model parameters. We provide a theoretical analysis that establishes recovery guarantees for batch and online learning, and demonstrate empirical improvements in memory-accuracy trade-offs over alternative memory-budgeted methods, including count-based sketches and feature hashing.
【Keywords】: linear classification; online learning; sketching
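The core mechanism (a Count-Sketch fed with real-valued gradient updates instead of counts, with weights recovered by a median over rows) can be sketched minimally as follows. This is only an illustrative Count-Sketch, not the full Weight-Median Sketch with its learning loop; the sizes and the hashing scheme are hypothetical.

```python
import hashlib
import statistics

class CountSketch:
    """Minimal Count-Sketch over real-valued updates. Feeding in
    gradient updates to model weights (rather than item counts) and
    reading back the median over rows recovers large-magnitude weights,
    which is the Weight-Median Sketch idea in miniature."""
    def __init__(self, width=512, depth=5):
        self.width, self.depth = width, depth
        self.rows = [[0.0] * width for _ in range(depth)]

    def _hash(self, key, row):
        h = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8)
        v = int.from_bytes(h.digest(), "big")
        # derive a bucket and a +/-1 sign from the same hash value
        return (v // 2) % self.width, 1 if v % 2 == 0 else -1

    def update(self, key, value):
        for r in range(self.depth):
            idx, sign = self._hash(key, r)
            self.rows[r][idx] += sign * value

    def query(self, key):
        ests = []
        for r in range(self.depth):
            idx, sign = self._hash(key, r)
            ests.append(sign * self.rows[r][idx])
        return statistics.median(ests)
```

With few active keys and a width of 512, collisions are rare and the median makes the estimate robust to the occasional corrupted row.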
【Paper Link】 【Pages】:773-788
【Authors】: Yongxin Tong ; Libin Wang ; Zimu Zhou ; Lei Chen ; Bowen Du ; Jieping Ye
【Abstract】: In spatial crowdsourcing, requesters submit their task-related locations, increasing the demand of a local area. The platform prices these tasks and assigns spatial workers to serve if the prices are accepted by requesters. There exist mature pricing strategies which specialize in tackling the imbalance between supply and demand in a local market. However, in global optimization, the platform should consider the mobility of workers; that is, any single worker can be the potential supply for several areas, but can only be the true supply of one area when assigned by the platform. The hardness lies in the uncertainty of the true supply of each area, hence the existing pricing strategies do not work. In this paper, we formally define this Global Dynamic Pricing (GDP) problem in spatial crowdsourcing. Since the objective is concerned with how the platform matches the supply to areas, we let the matching algorithm guide us in how to price. We propose a MAtching-based Pricing Strategy (MAPS) with a guaranteed bound. Extensive experiments conducted on synthetic and real datasets demonstrate the effectiveness of MAPS.
【Keywords】: pricing strategy; spatial crowdsourcing
【Paper Link】 【Pages】:789-796
【Authors】: Alexandre Verbitski ; Anurag Gupta ; Debanjan Saha ; James Corey ; Kamal Gupta ; Murali Brahmadesam ; Raman Mittal ; Sailesh Krishnamurthy ; Sandor Maurice ; Tengiz Kharatishvili ; Xiaofeng Bao
【Abstract】: Amazon Aurora is a high-throughput cloud-native relational database offered as part of Amazon Web Services (AWS). One of the more novel differences between Aurora and other relational databases is how it pushes redo processing to a multi-tenant scale-out storage service, purpose-built for Aurora. Doing so reduces networking traffic, avoids checkpoints and crash recovery, enables failovers to replicas without loss of data, and enables fault-tolerant storage that heals without database involvement. Traditional implementations that leverage distributed storage would use distributed consensus algorithms for commits, reads, replication, and membership changes and amplify cost of underlying storage. In this paper, we describe how Aurora avoids distributed consensus under most circumstances by establishing invariants and leveraging local transient state. Doing so improves performance, reduces variability, and lowers costs.
【Keywords】: databases; distributed systems; fault tolerance; log processing; performance; quorum models; quorum sets; recovery; replication
【Paper Link】 【Pages】:797-809
【Authors】: Ben Vandiver ; Shreya Prasad ; Pratibha Rana ; Eden Zik ; Amin Saeidi ; Pratyush Parimal ; Styliani Pantela ; Jaimin Dave
【Abstract】: The Vertica Analytic Database is a powerful tool for high performance, large scale SQL analytics. Historically, Vertica has managed direct-attached disk for performance and reliability, at a cost of product complexity and scalability. Eon mode is a new architecture for Vertica that places the data on a reliable shared storage, matching the original architecture's performance on existing workloads and supporting new workloads. While the design reuses Vertica's optimizer and execution engine, the metadata, storage, and fault tolerance mechanisms are re-architected to enable and take advantage of shared storage. A sharding mechanism distributes load over the nodes while retaining the capability of running node-local table joins. Running on Amazon EC2 compute and S3 storage, Eon mode demonstrates good performance, superior scalability, and robust operational behavior. With these improvements, Vertica delivers on the promise of cloud economics, consuming only the compute and storage resources needed, while supporting efficient elasticity.
【Keywords】: cloud storage; column stores; databases; shared storage
【Paper Link】 【Pages】:811-823
【Authors】: Jose Picado ; Willis Lang ; Edward C. Thayer
【Abstract】: Public cloud database providers observe all sorts of different usage patterns and behaviors while operating their services. Service providers such as Microsoft try to understand and characterize these behaviors in order to improve the quality of their service, provide new features for customers, and/or increase the efficiency of the operations. While there are many types of patterns of behavior that are of interest to providers, such as query types, workload intensity, and temporal activity, in this paper, we focus on the lowest level of behavior -- how long do public cloud databases survive before being dropped? Given the large and diverse relational database population that Azure SQL DB has, we present a large-scale survivability study of our service and identify some factors that can demonstrably help predict the lifespan of cloud databases. The results of this study are being used to influence how Azure SQL DB operates in order to increase efficiency as well as improve customer experience.
【Keywords】: cloud; database; learning; lifespan; longevity; prediction; survival
【Paper Link】 【Pages】:825-839
【Authors】: Lyublena Antova ; Derrick Bryant ; Tuan Cao ; Michael Duller ; Mohamed A. Soliman ; Florian M. Waas
【Abstract】: The database industry is about to undergo a fundamental transformation of unprecedented magnitude as enterprises start trading their well-established database stacks on premises for cloud database technology in order to take advantage of the economics cloud service providers have long promised. Industry experts and analysts expect the next years to prove a watershed moment in this transformation, as cloud databases have finally reached critical mass and maturity. Enterprises eager to move to the cloud face a significant dilemma: while moving the content of their databases to the cloud is a well-studied problem, making existing applications work with new database platforms is an enormously costly undertaking that calls for rewriting and adjusting hundreds if not thousands of applications. In this paper, we present a next-generation virtualization technology that lets existing applications run natively on cloud-based database systems. Using this platform, enterprises can move rapidly to the cloud and innovate and create competitive advantage in a matter of months instead of years. We describe the technology and application scenarios and demonstrate the effectiveness and performance of the approach through actual customer use cases.
【Keywords】: adaptive data warehouse virtualization; cloud data warehouses; query processing
【Paper Link】 【Pages】:841-855
【Authors】: Ildar Absalyamov ; Michael J. Carey ; Vassilis J. Tsotras
【Abstract】: Data sources, such as social media, mobile apps and IoT sensors, generate billions of records each day. Keeping up with this influx of data while providing useful analytics to the users is a major challenge for today's data-intensive systems. A popular solution that allows such systems to handle rapidly incoming data is to rely on log-structured merge (LSM) storage models. LSM-based systems provide a tunable trade-off between ingesting vast amounts of data at a high rate and running efficient analytical queries on top of that data. For queries, it is well-known that the query processing performance largely depends on the ability to generate efficient execution plans. Previous research showed that OLAP query workloads rely on having small, yet precise, statistical summaries of the underlying data, which can drive the cost-based query optimization. In this paper we address the problem of computing data statistics for workloads with rapid data ingestion and propose a lightweight statistics-collection framework that exploits the properties of LSM storage. Our approach is designed to piggyback on the events (flush and merge) of the LSM lifecycle. This allows us to easily create initial statistics and then keep them in sync with rapidly changing data while minimizing the overhead to the existing system. We have implemented and adapted well-known algorithms to produce various types of statistical synopses, including equi-width histograms, equi-height histograms, and wavelets. We performed an in-depth empirical evaluation that considers both the cardinality estimation accuracy and runtime overheads of collecting and using statistics. The experiments were conducted by prototyping our approach on top of Apache AsterixDB, an open source Big Data management system that has an entirely LSM-based storage backend.
【Keywords】: database statistics; lsm storage
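One of the synopsis types named above, the equi-width histogram, is simple enough to sketch along with the cardinality estimate it drives. This is a generic illustration of the synopsis, not the paper's LSM-integrated collection framework; the function names are hypothetical.

```python
def equi_width_histogram(values, lo, hi, buckets):
    """Build an equi-width histogram: `buckets` equal-size ranges over
    [lo, hi), counting how many values fall in each."""
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        i = min(int((v - lo) / width), buckets - 1)  # clamp v == hi edge
        counts[i] += 1
    return counts

def estimate_range(counts, lo, hi, a, b):
    """Estimate |{v : a <= v < b}| from the histogram, assuming values
    are uniformly distributed within each bucket."""
    buckets = len(counts)
    width = (hi - lo) / buckets
    est = 0.0
    for i, c in enumerate(counts):
        b_lo, b_hi = lo + i * width, lo + (i + 1) * width
        overlap = max(0.0, min(b, b_hi) - max(a, b_lo))
        est += c * overlap / width
    return est
```

In an LSM setting, a synopsis like this could be computed once per flushed or merged component and merged with the synopses of other components, which is the piggybacking opportunity the abstract describes.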
【Paper Link】 【Pages】:857-872
【Authors】: Brian Hentschel ; Michael S. Kester ; Stratos Idreos
【Abstract】: While numerous indexing and storage schemes have been developed to address the core functionality of predicate evaluation in data systems, they all require specific workload properties (query selectivity, data distribution, data clustering) to provide good performance and fail in other cases. We present a new class of indexing scheme, termed a Column Sketch, which improves the performance of predicate evaluation independently of workload properties. Column Sketches work primarily through the use of lossy compression schemes which are designed so that the index ingests data quickly, evaluates any query performantly, and has small memory footprint. A Column Sketch works by applying this lossy compression on a value-by-value basis, mapping base data to a representation of smaller fixed width codes. Queries are evaluated affirmatively or negatively for the vast majority of values using the compressed data, and only if needed check the base data for the remaining values. Column Sketches work over column, row, and hybrid storage layouts. We demonstrate that by using a Column Sketch, the select operator in modern analytic systems attains better CPU efficiency and less data movement than state-of-the-art storage and indexing schemes. Compared to standard scans, Column Sketches provide an improvement of 3x-6x for numerical attributes and 2.7x for categorical attributes. Compared to state-of-the-art scan accelerators such as Column Imprints and BitWeaving, Column Sketches perform 1.4x-4.8x better.
【Keywords】: column; index; lossy compression; sketch
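The value-by-value lossy mapping can be illustrated with an order-preserving bucketing into small codes: a predicate is decided on codes alone for almost all positions, and the base data is touched only for the single ambiguous code containing the query constant. This is a toy sketch of the principle, not the paper's code-assignment scheme; the bucket boundaries and function names are hypothetical.

```python
import bisect

def build_sketch(base, boundaries):
    """Map each value to a small fixed-width code via order-preserving
    bucketing over the sorted split points in `boundaries`."""
    return [bisect.bisect_right(boundaries, v) for v in base]

def select_less_than(base, codes, boundaries, q):
    """Return positions i with base[i] < q. Codes strictly below the
    query's code qualify outright; only the one boundary code forces a
    lookup of the base data."""
    q_code = bisect.bisect_right(boundaries, q)
    out = []
    for i, c in enumerate(codes):
        if c < q_code:                        # definitely qualifies
            out.append(i)
        elif c == q_code and base[i] < q:     # ambiguous: check base data
            out.append(i)
    return out

base = [5, 12, 37, 41, 8, 25]
boundaries = [10, 20, 30, 40]     # 1-byte codes in a real sketch
codes = build_sketch(base, boundaries)
print(select_less_than(base, codes, boundaries, 25))
```

Because the codes are small and fixed-width, the first loop body vectorizes well in a real column-store implementation, which is where the CPU-efficiency claim comes from.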
【Paper Link】 【Pages】:873-888
【Authors】: Wenhai Li ; Lingfeng Deng ; Yang Li ; Chen Li
【Abstract】: In this paper we study the problem of supporting similarity queries on a large number of records using a vector space model, where each record is a bag of tokens. We consider similarity functions that incorporate non-negative global token weights as well as record-specific token degrees. We develop a family of algorithms based on an inverted index for large data sets, especially for the case of using external storage such as hard disks or flash drives, and present pruning techniques based on various bounds to improve their performance. We formally prove the correctness of these techniques, and show how to achieve better pruning power by iteratively tightening these bounds to exactly filter dissimilar records. We conduct an extensive experimental study using real, large-scale data sets based on different storage platforms, including memory, hard disks, and flash drives. The results show that these algorithms and techniques can efficiently support similarity queries on large data sets.
【Keywords】: elastic bounds; similarity queries; vector space models; zigzag
【Paper Link】 【Pages】:889-903
【Authors】: Yiqiu Wang ; Anshumali Shrivastava ; Jonathan Wang ; Junghee Ryu
【Abstract】: We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a similarity search system for ultra-high dimensional datasets on a single machine that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging an LSH-style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count-based estimations, we reduce the computational and parallelization costs of similarity search, while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URLs, click-through prediction, social networks, etc. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail on the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset using brute force (n^2 D operations) would require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results.
【Keywords】: gpgpu; locality sensitive hashing; reservoir sampling; similarity search
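One of the principled techniques named above, reservoir sampling, is standard enough to sketch exactly: Algorithm R keeps a uniform sample of fixed size from a stream of unknown length in one pass, which is how FLASH-style systems can bound the size of LSH buckets. The sketch below is the textbook algorithm, not FLASH's specific use of it.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: maintain a uniform random sample of k items from a
    stream of unknown length, using O(k) memory and one pass."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)     # item i survives with prob k/(i+1)
            if j < k:
                sample[j] = item
    return sample
```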
【Paper Link】 【Pages】:905-920
【Authors】: Dong Deng ; Yufei Tao ; Guoliang Li
【Abstract】: This paper studies the set similarity join problem with overlap constraints which, given two collections of sets and a constant c, finds all the set pairs in the datasets that share at least c common elements. This is a fundamental operation in many fields, such as information retrieval, data mining, and machine learning. The time complexity of all existing methods is O(n^2), where n is the total size of all the sets. In this paper, we present a size-aware algorithm with a time complexity of O(n^(2-1/c) * k^(1/(2c))) = o(n^2) + O(k), where k is the number of results. The size-aware algorithm divides all the sets into small and large ones based on their sizes and processes them separately. We can use existing methods to process the large sets and focus on the small sets in this paper. We develop several optimization heuristics for the small sets to improve the practical performance significantly. As the size boundary between the small sets and the large sets is crucial to the efficiency, we propose an effective size boundary selection algorithm to judiciously choose an appropriate size boundary, which works very well in practice. Experimental results on real-world datasets show that our methods achieve high performance and outperform the state-of-the-art approaches by up to an order of magnitude.
【Keywords】: overlap; scalable; set; similarity join; sub-quadratic; theoretical guarantee
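The small-set side of the size-aware idea admits a compact sketch: two sets sharing at least c elements must share at least one c-element subset, so grouping sets by their c-subsets yields exactly the qualifying pairs, with no verification needed. This illustrative version omits the paper's size boundary selection and its handling of large sets (whose subset counts would explode); the function name is hypothetical.

```python
from itertools import combinations
from collections import defaultdict

def overlap_join_small(sets, c):
    """Find all index pairs (i, j), i < j, with |sets[i] & sets[j]| >= c
    by bucketing sets on their c-element subsets. Any two sets in the
    same bucket share that c-subset, so every emitted pair qualifies."""
    buckets = defaultdict(list)
    for i, s in enumerate(sets):
        for sub in combinations(sorted(s), c):
            buckets[sub].append(i)
    result = set()                 # de-duplicate pairs sharing several subsets
    for ids in buckets.values():
        for a, b in combinations(ids, 2):
            result.add((a, b))
    return sorted(result)

sets = [{1, 2, 3}, {2, 3, 4}, {5, 6}, {1, 2, 3, 4}]
print(overlap_join_small(sets, 2))
```

The cost of enumerating c-subsets grows as (set size choose c), which is exactly why the algorithm confines this technique to small sets and routes large sets to a different method.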
【Paper Link】 【Pages】:927-942
【Authors】: Yinglong Song ; Huey-Eng Chua ; Sourav S. Bhowmick ; Byron Choi ; Shuigeng Zhou
【Abstract】: Visual graph query interfaces (a.k.a. GUIs) make it easy for non-expert users to query graphs. Recent research has laid out and implemented a vision of a novel subgraph query processing paradigm where the latency offered by the GUI is exploited to blend visual query construction and processing by generating and refining candidate result matches iteratively during query formulation. This paradigm brings in several potential benefits such as superior system response time (srt) and opportunities to enhance usability of graph databases. However, these early efforts focused on subgraph isomorphism-based graph queries where blending is performed by iterative edge-to-edge mapping. In this paper, we explore how this vision can be realized for the more generic but complex 1-1 p-homomorphic (p-hom) queries introduced by Fan et al. A 1-1 p-hom query maps an edge of the query to paths in the data graph. We present a novel framework called BOOMER for blending bounded 1-1 p-hom (bph) queries, a variant of 1-1 p-hom where the length of the path is bounded instead of arbitrary. Our framework is based on a novel online, adaptive indexing scheme called the CAP index. We present two strategies for CAP index construction, immediate and deferment-based, and show how they can be utilized to facilitate judicious interleaving of visual bph query formulation and query processing. BOOMER is also amenable to modifications to a bph query during visual formulation. Experiments on real-world datasets demonstrate both the efficiency and effectiveness of BOOMER for realizing the visual querying paradigm on an important type of graph query.
【Keywords】: adaptive index; blender; deferment-based evaluation; immediate evaluation; large networks; p-homomorphic query; visual graph query interface
【Paper Link】 【Pages】:943-958
【Authors】: Yihan Gao ; Silu Huang ; Aditya G. Parameswaran
【Abstract】: Organizations routinely accumulate semi-structured log datasets generated as the output of code; these datasets remain unused and uninterpreted, and occupy wasted space; this phenomenon has been colloquially referred to as the "data lake" problem. One approach to leverage these semi-structured datasets is to convert them into a structured relational format, following which they can be analyzed in conjunction with other datasets. We present DATAMARAN, a tool that extracts structure from semi-structured log datasets with no human supervision. DATAMARAN automatically identifies field and record endpoints, separates the structured parts from the unstructured noise or formatting, and can tease apart multiple structures from within a dataset, in order to efficiently extract structured relational datasets from semi-structured log datasets, at scale with high accuracy. Compared to other unsupervised log dataset extraction tools developed in prior work, DATAMARAN does not require the record boundaries to be known beforehand, making it much more applicable to the noisy log files that are ubiquitous in data lakes. DATAMARAN can successfully extract structured information from all datasets used in prior work, and can achieve 95% extraction accuracy on automatically collected log datasets from GitHub, a substantial 66% increase in accuracy compared to unsupervised schemes from prior work. Our user study further demonstrates that the extraction results of DATAMARAN are closer to the desired structure than competing algorithms.
【Keywords】: log datasets; unsupervised structure extraction
【Paper Link】 【Pages】:959-974
【Authors】: Min Xie ; Raymond Chi-Wing Wong ; Jian Li ; Cheng Long ; Ashwin Lall
【Abstract】: Extracting interesting tuples from a large database is an important problem in multi-criteria decision making. Two representative queries were proposed in the literature: top-k queries and skyline queries. A top-k query requires users to specify their utility functions beforehand and then returns k tuples to the users. A skyline query does not require any utility function from users but it puts no control on the number of tuples returned to users. Recently, a k-regret query was proposed and received attention from the community because it does not require any utility function from users and the output size is controllable, and thus it avoids those deficiencies of top-k queries and skyline queries. Specifically, it returns k tuples that minimize a criterion called the maximum regret ratio. In this paper, we present the lower bound of the maximum regret ratio for the k-regret query. Besides, we propose a novel algorithm, called SPHERE, whose upper bound on the maximum regret ratio is asymptotically optimal and restriction-free for any dimensionality, the best-known result in the literature. We conducted extensive experiments to show that SPHERE performs better than the state-of-the-art methods for the k-regret query.
【Keywords】: data analytics; k-regret; multi-criteria decision making
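The maximum regret ratio that a k-regret query minimizes can be estimated numerically. The sketch below is an illustrative Monte-Carlo computation of this criterion (not the SPHERE algorithm); it assumes nonnegative attribute values and samples random nonnegative linear utility functions:

```python
import numpy as np

def max_regret_ratio(D, S, n_utils=10000, seed=0):
    """Estimate the maximum regret ratio of a selected subset S of
    dataset D by sampling random nonnegative linear utility vectors.
    This illustrates the criterion a k-regret query minimizes; it is
    not the SPHERE algorithm from the paper."""
    rng = np.random.default_rng(seed)
    d = D.shape[1]
    U = np.abs(rng.standard_normal((n_utils, d)))  # nonneg utility vectors
    best_all = (U @ D.T).max(axis=1)               # best utility over D
    best_sub = (U @ S.T).max(axis=1)               # best utility over S
    return float(((best_all - best_sub) / best_all).max())
```

Selecting all tuples gives regret ratio 0; the query asks for the k-subset whose ratio is smallest.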
【Paper Link】 【Pages】:975-990
【Authors】: Kaiyu Li ; Xiaohang Zhang ; Guoliang Li
【Abstract】: Crowdsourced top-k computation aims to utilize the human ability to identify the top-k objects from a given set of objects. Most existing studies employ a pairwise-comparison-based method, which first asks workers to compare each pair of objects and then infers the top-k results from the pairwise comparison results. Comparing every object pair is quadratic in the number of objects, so these methods incur a huge monetary cost, especially for large datasets. To address this problem, we propose a rating-ranking-based approach, which asks the crowd two types of questions. The first is a rating question, which asks the crowd to give a score to an object. The second is a ranking question, which asks the crowd to rank several (e.g., 3) objects. Rating questions are coarse grained and yield a rough score for each object, which can be used to prune the objects whose scores are much smaller than those of the top-k objects. Ranking questions are fine grained and can be used to refine the scores. We propose a unified model for the rating and ranking questions, and seamlessly combine them to compute the top-k results. We also study how to judiciously select appropriate rating or ranking questions and assign them to a coming worker. Experimental results on real datasets show that our method significantly outperforms existing approaches.
【Keywords】: crowdsourcing; ranking; rating; top-k computation
【Paper Link】 【Pages】:991-1005
【Authors】: Jing Tang ; Xueyan Tang ; Xiaokui Xiao ; Junsong Yuan
【Abstract】: Influence maximization is a classic and extensively studied problem with important applications in viral marketing. Existing algorithms for influence maximization, however, mostly focus on offline processing, in the sense that they do not provide any output to the user until the final answer is derived, and that the user is not allowed to terminate the algorithm early to trade the quality of solution for efficiency. Such lack of interactiveness and flexibility leads to poor user experience, especially when the algorithm incurs long running time. To address the above problem, this paper studies algorithms for online processing of influence maximization (OPIM), where the user can pause the algorithm at any time and ask for a solution (to the influence maximization problem) and its approximation guarantee, and can resume the algorithm to let it improve the quality of solution by giving it more time to run. (This interactive paradigm is similar in spirit to online query processing in database systems.) We show that the only existing algorithm for OPIM is vastly ineffective in practice, and that adopting existing influence maximization methods for OPIM yields unsatisfactory results. Motivated by this, we propose a new algorithm for OPIM with both superior empirical effectiveness and strong theoretical guarantees, and we show that it can also be extended to handle conventional influence maximization. Extensive experiments on real data demonstrate that our solutions outperform the state of the art for both OPIM and conventional influence maximization.
【Keywords】: influence maximization; online processing algorithm; sampling
【Paper Link】 【Pages】:1007-1019
【Authors】: Eyal Dushkin ; Tova Milo
【Abstract】: We address the problem of sorting the top-k elements of a set, given a predefined partial order over the set elements. Our means of obtaining missing order information is a comparison operator that interacts with a crowd of domain experts to determine the order between two unordered items. The practical motivation for studying this problem is the common scenario where elements cannot be easily compared by machines and thus human experts are harnessed for the task. As some initial partial order is given, our goal is to exploit it optimally in order to minimize the domain experts' work. The problem lies at the intersection of two well-studied problems in the theory and crowdsourcing communities: full sorting under partial order information, and top-k sorting with no prior partial order information. As we show, resorting to one of the existing state-of-the-art algorithms for these two problems turns out to be extravagant in terms of the number of comparisons performed by the users. In light of this, we present a dedicated algorithm for top-k sorting that aims to minimize the number of comparisons by thoroughly leveraging the partial order information. We examine two possible interpretations of the comparison operator, taken from the theory and crowdsourcing communities, and demonstrate the efficiency and effectiveness of our algorithm for both of them. We further demonstrate the utility of our algorithm, beyond identifying the top-k elements in a dataset, as a vehicle for improving the performance of Learning-to-Rank algorithms in a machine learning context. We conduct a comprehensive experimental evaluation in both synthetic and real-world settings.
【Keywords】:
【Paper Link】 【Pages】:1021-1035
【Authors】: Babak Salimi ; Johannes Gehrke ; Dan Suciu
【Abstract】: Online analytical processing (OLAP) is an essential element of decision-support systems. OLAP tools provide the insights and understanding needed for improved decision making. However, the answers to OLAP queries can be biased and lead to perplexing and incorrect insights. In this paper, we propose a system to detect, explain, and resolve bias in decision-support queries. We give a simple definition of a biased query and perform a set of independence tests on the data to detect bias. We propose a novel technique that gives explanations for bias, thus assisting an analyst in understanding what is going on. Additionally, we develop an automated method for rewriting a biased query into an unbiased query, which shows what the analyst intended to examine. In a thorough evaluation on several real datasets we show both the quality and the performance of our techniques, including the completely automatic discovery of revolutionary insights from a famous 1973 discrimination case.
【Keywords】: algorithmic fairness; biased query; causal inference; olap; simpson's paradox
【Paper Link】 【Pages】:1037-1052
【Authors】: Yanqing Peng ; Jinwei Guo ; Feifei Li ; Weining Qian ; Aoying Zhou
【Abstract】: Membership testing is the problem of testing whether an element is in a set of elements. Performing the test exactly is expensive space-wise, as it requires storing all elements of the set. In many applications, an approximate test that can be done quickly using small space is often desired. The Bloom filter (BF) was designed for this purpose and has witnessed great success across numerous application domains. But no compact structure supports set membership testing for temporal queries, e.g., has person A visited a web server between 9:30am and 9:40am? And has the same person visited the web server again between 9:45am and 9:50am? It is possible to support such "temporal membership testing" using BFs, but we show that doing so is fairly expensive. To that end, this paper designs the persistent bloom filter (PBF), a novel data structure for temporal membership testing with compact space.
【Keywords】: bloom filter; persistent bloom filter; persistent data structure
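To see why naive BF-based temporal membership testing is expensive, consider the obvious baseline: one Bloom filter per fixed-size time window. The sketch below is our own illustration of that baseline, not the PBF structure; its space grows linearly with the number of windows, which is exactly the cost the PBF design targets:

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter over an integer bitset, with k hash
    positions derived from SHA-256."""
    def __init__(self, m=1 << 16, k=4):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

class TemporalMembership:
    """Naive baseline: one Bloom filter per time window; a range
    query probes every overlapping window's filter."""
    def __init__(self, window=60):
        self.window, self.filters = window, {}

    def insert(self, item, t):
        self.filters.setdefault(t // self.window, BloomFilter()).add(item)

    def query(self, item, t0, t1):
        return any(item in self.filters.get(w, BloomFilter())
                   for w in range(t0 // self.window, t1 // self.window + 1))
```

Both query cost and space scale with the number of windows touched, while false positives accumulate across probed filters.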
【Paper Link】 【Pages】:1053-1066
【Authors】: Michele Linardi ; Yan Zhu ; Themis Palpanas ; Eamonn J. Keogh
【Abstract】: In the last fifteen years, data series motif discovery has emerged as one of the most useful primitives for data series mining, with applications to many domains, including robotics, entomology, seismology, medicine, and climatology. Nevertheless, state-of-the-art motif discovery tools still require the user to provide the motif length. Yet, in at least some cases, the choice of motif length is critical and unforgiving. Unfortunately, the obvious brute-force solution, which tests all lengths within a given range, is computationally untenable. In this work, we introduce VALMOD, an exact and scalable motif discovery algorithm that efficiently finds all motifs in a given range of lengths. We evaluate our approach on five diverse real datasets, and demonstrate that it is up to 20 times faster than the state-of-the-art. Our results also show that removing the unrealistic assumption that the user knows the correct length can often produce more intuitive and actionable results, which might otherwise have been missed.
【Keywords】: data mining; data series; motif discovery; time series; variable length
【Paper Link】 【Pages】:1067-1082
【Authors】: Junhao Gan ; Yufei Tao
【Abstract】: OPTICS is a popular method for visualizing multidimensional clusters. All the existing implementations of this method have a time complexity of $O(n^2)$ --- where n is the size of the input dataset --- and thus may not be suitable for datasets of large volumes. This paper alleviates the problem by resorting to approximation with guarantees. The main result is a new algorithm that runs in $O(n \log n)$ time under any fixed dimensionality, and computes a visualization that has provably small discrepancies from that of OPTICS. As a side product, our algorithm gives an index structure that occupies linear space, and supports the cluster group-by query with near-optimal cost. The quality of the cluster visualizations produced by our techniques and the efficiency of the proposed algorithms are demonstrated with an empirical evaluation on real data.
【Keywords】: density-based clustering; optics; visualizations
【Paper Link】 【Pages】:1083-1096
【Authors】: Matthias Ruhl ; Mukund Sundararajan ; Qiqi Yan
【Abstract】: We study changes in metrics that are defined on a cartesian product of trees. Such metrics occur naturally in many practical applications, where a global metric (such as revenue) can be broken down along several hierarchical dimensions (such as location, gender, etc.). Given a change in such a metric, our goal is to identify a small set of non-overlapping data segments that account for a majority of the change. An organization interested in improving the metric can then focus its attention on these data segments. Our key contribution is an algorithm that naturally mimics the operation of a hierarchical organization of analysts. The algorithm has been successfully applied within Google's ad platform (AdWords) to help Google's advertisers triage the performance of their advertising campaigns, and within Google Analytics to help website developers understand their traffic. We empirically analyze the runtime and quality of the algorithm by comparing it against benchmarks for a census dataset. We prove theoretical, worst-case bounds on the performance of the algorithm. For instance, we show that the algorithm is optimal for two dimensions, and has an approximation ratio of $\log^{d-2}(n+1)$ for $d \geq 3$ dimensions, where n is the number of input data segments. For the advertising application, we can show that our algorithm is a 2-approximation. To characterize the hardness of the problem, we study data patterns called conflicts. These allow us to construct hard instances of the problem, to derive a lower bound of $1.144^{d-2}$ (again for $d \geq 3$) for our algorithm, and to show that the problem is NP-hard; this justifies our focus on approximation.
【Keywords】: business intelligence; business metrics; decision support; explanation
【Paper Link】 【Pages】:1097-1111
【Authors】: Xiangyu Ke ; Arijit Khan ; Gao Cong
【Abstract】: We study the novel problem of jointly finding the top-k seed nodes and the top-r relevant tags for targeted influence maximization in a social network. The bulk of the research on influence maximization assumes that the influence diffusion probabilities across edges are fixed, and that the top-k seed users are identified to maximize the cascade in the entire graph. However, in real-world applications, edge probabilities typically depend on the information being cascaded, e.g., in social influence networks, the probability that a tweet of some user will be re-tweeted by her followers depends on whether the tweet contains specific hashtags. In addition, a campaigner often has a specific group of target customers in mind. In this work, we model such practical constraints, and investigate the novel problem of jointly finding the top-k seed nodes and the top-r relevant tags that maximize the influence inside a target set of users. Due to the hardness of the influence maximization problem, we develop heuristic solutions---with smart indexing, iterative algorithms, and good initial conditions---that target high quality, efficiency, and scalability.
【Keywords】: conditional influence probability; indexing; reverse sketching; targeted influence maximization
【Paper Link】 【Pages】:1113-1127
【Authors】: Sibo Wang ; Yufei Tao
【Abstract】: Given a directed graph G, a source node s, and a target node t, the personalized PageRank (PPR) of t with respect to s is the probability that a random walk starting from s terminates at t. The average of the personalized PageRank scores of t with respect to each source node υ ∈ V is exactly the PageRank score π(t) of node t, which denotes the overall importance of node t in the graph. A heavy hitter of node t is a node whose contribution to π(t) is above a φ fraction, where φ is a value between 0 and 1. Finding heavy hitters has important applications in link spam detection, classification of web pages, and friend recommendation. In this paper, we propose BLOG, an efficient framework for three types of heavy hitter queries: pairwise approximate heavy hitter (AHH) queries, reverse AHH queries, and multi-source reverse AHH queries. For pairwise AHH queries, our algorithm combines the Monte-Carlo approach and the backward propagation approach to reduce the cost of both methods, and incorporates new techniques to deal with high in-degree nodes. For reverse AHH and multi-source reverse AHH queries, our algorithm extends the ideas behind the pairwise AHH algorithm with a new "logarithmic bucketing'' technique to improve query efficiency. Extensive experiments demonstrate that BLOG is far more efficient than alternative solutions on all three query types.
【Keywords】: approximate algorithms; heavy hitters; personalized pagerank; social recommendation
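The Monte-Carlo half of this approach can be illustrated in a few lines: estimate PPR by simulating random walks from the source that terminate with probability α at each step. This is our own simplified sketch; BLOG's backward propagation and logarithmic bucketing are omitted, and stopping the walk at dangling nodes is an assumption of this sketch:

```python
import random
from collections import Counter

def ppr_monte_carlo(graph, s, alpha=0.15, walks=10000, seed=1):
    """Monte-Carlo PPR estimate: PPR(s, t) is the fraction of
    alpha-terminating random walks from s that end at t.
    graph maps each node to its out-neighbor list."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(walks):
        v = s
        while rng.random() > alpha:      # continue with prob 1 - alpha
            out = graph.get(v, [])
            if not out:
                break                    # assumption: stop at dangling nodes
            v = rng.choice(out)
        hits[v] += 1
    return {u: c / walks for u, c in hits.items()}
```

A heavy hitter of t would then be any source whose estimated contribution to π(t) exceeds the φ fraction.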
【Paper Link】 【Pages】:1129-1140
【Authors】: Daniel Ting
【Abstract】: We introduce and study a new data sketch for processing massive datasets. It addresses two common problems: 1) computing a sum given arbitrary filter conditions and 2) identifying the frequent items or heavy hitters in a data set. For the former, the sketch provides unbiased estimates with state-of-the-art accuracy. It handles the challenging scenario where the data is disaggregated, so that computing the per-unit metric of interest requires an expensive aggregation. For example, the metric of interest may be total clicks per user while the raw data is a click stream with multiple rows per user. Thus the sketch is suitable for use in a wide range of applications, including computing historical click-through rates for ad prediction, reporting user metrics from event streams, and measuring network traffic for IP flows. We prove and empirically show that the sketch has good properties for both the disaggregated subset sum estimation and frequent item problems. On i.i.d. data, it not only picks out the frequent items but gives strongly consistent estimates for the proportion of each frequent item. The resulting sketch asymptotically draws a probability-proportional-to-size sample that is optimal for estimating sums over the data. For non-i.i.d. data, we show that it typically does much better than random sampling for the frequent item problem and never does worse. For subset sum estimation, we show that even for pathological sequences, the variance is close to that of an optimal sampling design. Empirically, despite the disadvantage of operating on disaggregated data, our method matches or bests priority sampling, a state-of-the-art method for pre-aggregated data, and performs orders of magnitude better than uniform sampling on skewed data. We propose extensions to the sketch that allow it to be used for combining multiple data sets, in distributed systems, and for time-decayed aggregation.
【Keywords】: counting; data sketching; frequent item; heavy hitters; sampling; subset sum estimation
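Priority sampling, the pre-aggregated baseline the abstract compares against, is compact enough to sketch. This is a standard textbook rendition of Duffield et al.'s scheme, not the paper's sketch: each weighted key gets priority w/u with u uniform on (0, 1], the k highest priorities are kept, and each kept weight is estimated as max(w, τ) where τ is the (k+1)-st priority:

```python
import random

def priority_sample(items, k, seed=0):
    """Priority sampling over (key, weight) pairs. Returns the k
    sampled keys with their weight estimates max(w, tau), plus the
    threshold tau; summing estimates over any key subset gives an
    unbiased estimate of that subset's total weight."""
    rng = random.Random(seed)
    prio = [(w / (1.0 - rng.random()), key, w)   # u drawn from (0, 1]
            for key, w in items]
    prio.sort(reverse=True)
    tau = prio[k][0] if len(prio) > k else 0.0
    return {key: max(w, tau) for _, key, w in prio[:k]}, tau
```

The paper's sketch must achieve similar guarantees without the per-key pre-aggregation this baseline assumes.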
【Paper Link】 【Pages】:1141-1156
【Authors】: Wenfei Fan ; Ping Lu ; Xiaojian Luo ; Jingbo Xu ; Qiang Yin ; Wenyuan Yu ; Ruiqi Xu
【Abstract】: This paper proposes an Adaptive Asynchronous Parallel (AAP) model for graph computations. As opposed to Bulk Synchronous Parallel (BSP) and Asynchronous Parallel (AP) models, AAP reduces both stragglers and stale computations by dynamically adjusting the relative progress of workers. We show that BSP, AP and the Stale Synchronous Parallel (SSP) model are special cases of AAP. Better yet, AAP optimizes parallel processing by adaptively switching among these models at different stages of a single execution. Moreover, employing the programming model of GRAPE, AAP aims to parallelize existing sequential algorithms based on fixpoint computation with partial and incremental evaluation. Under a monotone condition, AAP is guaranteed to converge on correct answers if the sequential algorithms are correct. Furthermore, we show that AAP can optimally simulate MapReduce, PRAM, BSP, AP and SSP. Using real-life and synthetic graphs, we experimentally verify that AAP outperforms BSP, AP and SSP for a variety of graph computations.
【Keywords】: church-rosser; graph computations; parallel model; parallelization
【Paper Link】 【Pages】:1157-1172
【Authors】: Raul Castro Fernandez ; William Culhane ; Pijika Watcharapichat ; Matthias Weidlich ; Victoria Lopez Morales ; Peter R. Pietzuch
【Abstract】: Distributed dataflow systems such as Apache Spark and Apache Flink are used to derive new insights from large datasets. While they efficiently execute concrete data processing workflows, expressed as dataflow graphs, they lack generic support for exploratory workflows: if a user is uncertain about the correct processing pipeline, e.g. in terms of data cleaning strategy or choice of model parameters, they must repeatedly submit modified jobs to the system. This, however, misses out on optimisation opportunities for exploratory workflows, both in terms of scheduling and memory allocation. We describe meta-dataflows (MDFs), a new model to effectively express exploratory workflows and efficiently execute them on compute clusters. With MDFs, users specify a family of dataflows using two primitives: (a) an explore operator automatically considers choices in a dataflow; and (b) a choose operator assesses the result quality of explored dataflow branches and selects a subset of the results. We propose optimisations to execute MDFs: a system can (i) avoid redundant computation when exploring branches by reusing intermediate results, discarding results from underperforming branches, and pruning unnecessary branches; and (ii) consider future data access patterns in the MDF when allocating cluster memory. Our evaluation shows that MDFs improve the runtime of exploratory workflows by up to 90% compared to sequential job execution.
【Keywords】: dataflow; distributed data processing; exploratory workflows; parallel data processing; parameter space exploration
【Paper Link】 【Pages】:1173-1187
【Authors】: Hwanjun Song ; Jae-Gil Lee
【Abstract】: In most parallel DBSCAN algorithms, neighboring points are assigned to the same data partition for parallel processing, to facilitate calculation of the density of the neighbors. This data partitioning scheme causes a few critical problems, including load imbalance between data partitions, especially on skewed data sets. To remedy these problems, we propose a cell-based data partitioning scheme, pseudo random partitioning, that randomly distributes small cells rather than the points themselves. It achieves high load balance regardless of data skewness while retaining the data contiguity required by DBSCAN. In addition, we build and broadcast a highly compact summary of the entire data set, which we call a two-level cell dictionary, to supplement the random partitions. We then develop a novel parallel DBSCAN algorithm, Random Partitioning-DBSCAN (RP-DBSCAN for short), that uses pseudo random partitioning together with the two-level cell dictionary. The algorithm finds the local clusters in each data partition simultaneously and then merges these local clusters to obtain the global clustering. To validate the merit of our approach, we implement RP-DBSCAN on Spark and conduct extensive experiments using various real-world data sets on 12 Microsoft Azure machines (48 cores). In RP-DBSCAN, data partitioning and cluster merging are very light, and clustering on each split is not dragged out by a specific worker. As a result, the performance results show that RP-DBSCAN significantly outperforms the state-of-the-art algorithms, by up to 180 times.
【Keywords】: clustering; dbscan; parallelization; spark
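The core idea of pseudo random partitioning, distributing cells rather than points, can be sketched in a few lines. This is a toy illustration under our own assumptions (cells of side eps/√d, one random worker per cell); the two-level cell dictionary and merge phase are omitted:

```python
import random
from collections import defaultdict

def pseudo_random_partition(points, eps, n_workers, seed=0):
    """Assign each point to a grid cell of side eps/sqrt(d), then
    scatter whole cells randomly across workers. Because the cell,
    not the point, is the unit of distribution, points that are
    eps-close within a cell always land on the same worker, while
    cells spread evenly regardless of data skew."""
    d = len(points[0])
    side = eps / d ** 0.5
    cells = defaultdict(list)
    for p in points:
        cells[tuple(int(x // side) for x in p)].append(p)
    rng = random.Random(seed)
    parts = [[] for _ in range(n_workers)]
    for members in cells.values():
        parts[rng.randrange(n_workers)].extend(members)
    return parts
```

Skewed regions produce many small cells, so their load spreads across workers instead of overloading one spatial partition.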
【Paper Link】 【Pages】:1189-1204
【Authors】: Jia Zou ; R. Matthew Barnett ; Tania Lorido-Botran ; Shangyu Luo ; Carlos Monroy ; Sourav Sikdar ; Kia Teymourian ; Binhang Yuan ; Chris Jermaine
【Abstract】: This paper describes PlinyCompute, a system for development of high-performance, data-intensive, distributed computing tools and libraries. In the large, PlinyCompute presents the programmer with a very high-level, declarative interface, relying on automatic, relational-database-style optimization to figure out how to stage distributed computations. In the small, however, PlinyCompute presents the capable systems programmer with a persistent object data model and API (the "PC object model'') and an associated memory management system that has been designed from the ground up for high-performance, distributed, data-intensive computing. This contrasts with most other Big Data systems, which are constructed on top of the Java Virtual Machine (JVM) and hence must at least partially cede performance-critical concerns such as memory management (including layout and de/allocation) and virtual method/function dispatch to the JVM. This hybrid approach---declarative in the large, trusting the programmer's ability to utilize the PC object model efficiently in the small---results in a system that is ideal for the development of reusable, data-intensive tools and libraries.
【Keywords】: distributed computing; object model; query compilation
【Paper Link】 【Pages】:1205-1220
【Authors】: Maaz Bin Safeer Ahmad ; Alvin Cheung
【Abstract】: MapReduce is a popular programming paradigm for developing large-scale, data-intensive computation. Many frameworks that implement this paradigm have recently been developed. To leverage these frameworks, however, developers must become familiar with their APIs and rewrite existing code. We present Casper, a new tool that automatically translates sequential Java programs into the MapReduce paradigm. Casper identifies potential code fragments to rewrite and translates them in two steps: (1) Casper uses program synthesis to search for a program summary (i.e., a functional specification) of each code fragment. The summary is expressed using a high-level intermediate language resembling the MapReduce paradigm and verified to be semantically equivalent to the original using a theorem prover. (2) Casper generates executable code from the summary, using either the Hadoop, Spark, or Flink API. We evaluated Casper by automatically converting real-world, sequential Java benchmarks to MapReduce. The resulting benchmarks perform up to 48.2x faster compared to the original.
【Keywords】: mapreduce; program synthesis; verification; compilers; verified lifting
【Paper Link】 【Pages】:1221-1236
【Authors】: Faisal Nawab ; Divyakant Agrawal ; Amr El Abbadi
【Abstract】: In this paper, we propose Dynamic Paxos (DPaxos), a Paxos-based consensus protocol to manage access to partitioned data across globally-distributed datacenters and edge nodes. DPaxos is intended to implement a State Machine Replication component in data management systems for the edge. DPaxos targets the unique opportunities of utilizing edge computing resources to support emerging applications with stringent mobility and real-time requirements, such as Augmented and Virtual Reality and vehicular applications. The main objective of DPaxos is to reduce the latency of serving user requests, recovering from failures, and reacting to mobility. DPaxos achieves these objectives through a few proposed changes to the traditional Paxos protocol. Most notably, DPaxos proposes a dynamic allocation of the quorums (i.e., groups of nodes) that are needed for Paxos Leader Election. Leader Election quorums in DPaxos are smaller than in traditional Paxos and expand only in the presence of conflicts.
【Keywords】: edge computing; geo-replication; multi-datacenter; paxos; transaction processing
【Paper Link】 【Pages】:1237-1252
【Authors】: Rundong Li ; Mirek Riedewald ; Xinyan Deng
【Abstract】: We study distributed equi-join computation in the presence of join-attribute skew, which causes load imbalance. Skew can be addressed by more fine-grained partitioning, at the cost of input duplication. For random load assignment, e.g., using a hash function, fine-grained partitioning creates a tradeoff between load expectation and variance. We show that minimizing load variance subject to a constraint on expectation is a monotone submodular maximization problem with Knapsack constraints, hence admitting provably near-optimal greedy solutions. In contrast to previous work on formal optimality guarantees, constant factors are not abstracted away, and we can prove this result also for more general load functions accounting for both input and output. We further demonstrate through experiments that this theoretical result leads to an effective algorithm for the problem of minimizing running time, even when load is assigned deterministically.
【Keywords】: distributed join; load balancing; load variance minimization
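The tradeoff the paper formalizes, finer partitioning of heavy keys at the cost of input duplication, can be illustrated with a naive heavy-key splitting scheme. This is our own sketch, not the submodular algorithm; `heavy_threshold` is a hypothetical tuning knob standing in for the paper's optimized choice of partitioning granularity:

```python
from collections import defaultdict

def partition_skewed_join(R, S, n_workers, heavy_threshold):
    """Distribute an equi-join of (key, value) lists R and S.
    Rows of a heavy key on R's side are split round-robin across
    all workers (balancing load), and S's rows for that key are
    duplicated to every worker (the input-duplication cost).
    Light keys hash to a single worker as usual."""
    freq = defaultdict(int)
    for key, _ in R:
        freq[key] += 1
    parts = [([], []) for _ in range(n_workers)]  # (R-side, S-side) per worker
    i = 0
    for key, val in R:
        if freq[key] >= heavy_threshold:
            parts[i % n_workers][0].append((key, val))  # split heavy key
            i += 1
        else:
            parts[hash(key) % n_workers][0].append((key, val))
    for key, val in S:
        if freq[key] >= heavy_threshold:
            for r_side, s_side in parts:                # duplicate S side
                s_side.append((key, val))
        else:
            parts[hash(key) % n_workers][1].append((key, val))
    return parts
```

Lowering the threshold (finer partitioning) reduces load variance but increases the duplicated input, which is the expectation/variance tradeoff the paper optimizes.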
【Paper Link】 【Pages】:1253-1267
【Authors】: Ryan Marcus ; Olga Papaemmanouil ; Sofiya Semenova ; Solomon Garber
【Abstract】: Distributed data management systems often operate on "elastic'' clusters that can scale up or down on demand. These systems face numerous challenges, including data fragmentation, replication, and cluster sizing. Unfortunately, these challenges have traditionally been treated independently, leaving administrators with little insight on how the interplay of these decisions affects query performance. This paper introduces NashDB, an adaptive data distribution framework that relies on an economic model to automatically balance the supply and demand of data fragments, replicas, and cluster nodes. NashDB adapts its decisions to query priorities and shifting workloads, while avoiding underutilized cluster nodes and redundant replicas. This paper introduces and evaluates NashDB's model, as well as a suite of optimization techniques designed to efficiently identify data distribution schemes that match workload demands and transition the system to this new scheme with minimum data transfer overhead. Experimentally, we show that NashDB is often Pareto dominant compared to other solutions.
【Keywords】: database management systems; fragmentation; partitioning
【Paper Link】 【Pages】:1269-1284
【Authors】: Jiawei Jiang ; Fangcheng Fu ; Tong Yang ; Bin Cui
【Abstract】: To address the challenge of explosive big data, distributed machine learning (ML) has drawn the interest of many researchers. Since many distributed ML algorithms trained by stochastic gradient descent (SGD) involve communicating gradients through the network, it is important to compress the transferred gradients. A category of low-precision algorithms can significantly reduce the size of gradients, at the expense of some precision loss. However, existing low-precision methods are not suitable for many cases where the gradients are sparse and nonuniformly distributed. In this paper, we ask: is there a compression method that can efficiently handle sparse and nonuniform gradients consisting of key-value pairs? Our first contribution is a sketch-based method that compresses the gradient values. A sketch is a probabilistic data structure that approximates the distribution of its input data. We design a quantile-bucket quantification method that uses a quantile sketch to sort gradient values into buckets and encodes them with the bucket indexes. To further compress the bucket indexes, our second contribution is a sketch algorithm, namely MinMaxSketch. MinMaxSketch builds a set of hash tables and resolves hash collisions with a MinMax strategy. The third contribution of this paper is a delta-binary encoding method that calculates the increments of the gradient keys and stores them with fewer bytes. We also theoretically discuss the correctness and the error bounds of the three proposed methods. To the best of our knowledge, this is the first effort combining data sketches with ML. We implement a prototype system in a real cluster of our industrial partner Tencent Inc., and show that our method is up to 10X faster than existing methods.
【Keywords】: distributed machine learning; frequency sketch; quantification; quantile sketch; stochastic gradient descent
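Of the three techniques, delta-binary encoding is the simplest to sketch: for sorted gradient keys, store each key's increment over its predecessor as a variable-length integer, so dense key ranges cost roughly one byte per key. This is a generic varint rendition under our own assumptions, not the paper's exact byte layout:

```python
def delta_varint_encode(keys):
    """Encode a sorted list of nonnegative integer keys as
    varint-encoded deltas (7 payload bits per byte; the high bit
    marks a continuation)."""
    out, prev = bytearray(), 0
    for k in keys:
        delta, prev = k - prev, k
        while True:
            byte = delta & 0x7F
            delta >>= 7
            if delta:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                break
    return bytes(out)

def delta_varint_decode(data):
    """Invert delta_varint_encode, reassembling varints and
    running-summing the deltas back into absolute keys."""
    keys, cur, acc, shift = [], 0, 0, 0
    for b in data:
        acc |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            cur += acc
            keys.append(cur)
            acc = shift = 0
    return keys
```

Small deltas dominate when keys are clustered, which is exactly the sparse-but-bursty key pattern of a sparse gradient.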
【Paper Link】 【Pages】:1285-1300
【Authors】: Manasi Vartak ; Joana M. F. da Trindade ; Samuel Madden ; Matei Zaharia
【Abstract】: Model diagnosis is the process of analyzing machine learning (ML) model performance to identify where the model works well and where it doesn't. It is a key part of the modeling process and helps ML developers iteratively improve model accuracy. Often, model diagnosis is performed by analyzing different datasets or intermediates associated with the model, such as the input data and the hidden representations learned by the model (e.g., [4, 24, 39]). The bottleneck in fast model diagnosis is the creation and storage of model intermediates. Storing these intermediates requires tens to hundreds of GB of storage, whereas re-running the model for each diagnostic query slows down model diagnosis. To address this bottleneck, we propose a system called MISTIQUE that can work with traditional ML pipelines as well as deep neural networks to efficiently capture, store, and query model intermediates for diagnosis. For each diagnostic query, MISTIQUE intelligently chooses whether to re-run the model or read a previously stored intermediate. For intermediates that are stored in MISTIQUE, we propose a range of optimizations to reduce the storage footprint, including quantization, summarization, and data de-duplication. We evaluate our techniques on a range of real-world ML models in scikit-learn and Tensorflow. We demonstrate that our optimizations reduce storage by up to 110X for traditional ML pipelines and up to 6X for deep neural networks. Furthermore, by using MISTIQUE, we can speed up diagnostic queries by up to 390X on traditional ML pipelines and up to 210X on deep neural networks.
【Keywords】: machine learning; model diagnosis; model interpretability; systems for machine learning
【Paper Link】 【Pages】:1301-1316
【Authors】: Sen Wu ; Luke Hsiao ; Xiao Cheng ; Braden Hancock ; Theodoros Rekatsinas ; Philip Levis ; Christopher Ré
【Abstract】: We focus on knowledge base construction (KBC) from richly formatted data. In contrast to KBC from text or tabular data, KBC from richly formatted data aims to extract relations conveyed jointly via textual, structural, tabular, and visual expressions. We introduce Fonduer, a machine-learning-based KBC system for richly formatted data. Fonduer presents a new data model that accounts for three challenging characteristics of richly formatted data: (1) prevalent document-level relations, (2) multimodality, and (3) data variety. Fonduer uses a new deep-learning model to automatically capture the representation (i.e., features) needed to learn how to extract relations from richly formatted data. Finally, Fonduer provides a new programming model that enables users to convert domain expertise, based on multiple modalities of information, to meaningful signals of supervision for training a KBC system. Fonduer-based KBC systems are in production for a range of use cases, including at a major online retailer. We compare Fonduer against state-of-the-art KBC approaches in four different domains. We show that Fonduer achieves an average improvement of 41 F1 points on the quality of the output knowledge base---and in some cases produces up to 1.87x the number of correct entries---compared to expert-curated public knowledge bases. We also conduct a user study to assess the usability of Fonduer's new programming model. We show that after using Fonduer for only 30 minutes, non-domain experts are able to design KBC systems that achieve on average 23 F1 points higher quality than traditional machine-learning-based KBC approaches.
【Keywords】: knowledge base construction; multimodal supervision; richly formatted data; weak supervision
【Paper Link】 【Pages】:1317-1332
【Authors】: Gensheng Zhang ; Damian Jimenez ; Chengkai Li
【Abstract】: We present Maverick, a general, extensible framework that discovers exceptional facts about entities in knowledge graphs. To the best of our knowledge, there has been no previous study of this problem. We model an exceptional fact about an entity of interest as a context-subspace pair, in which a subspace is a set of attributes and a context is defined by a graph query pattern of which the entity is a match. The entity is exceptional among the entities in the context, with regard to the subspace. The search spaces of both patterns and subspaces are exponentially large. Maverick conducts beam search over the patterns, using a match-based pattern construction method to avoid evaluating invalid patterns, and applies two heuristics to select promising patterns to form the beam in each iteration. Maverick traverses and prunes the subspaces, organized as a set enumeration tree, by exploiting upper-bound properties of the exceptionality scoring functions. Results of experiments and user studies on real-world datasets demonstrate substantial performance improvements of the proposed framework over the baselines, as well as its effectiveness in discovering exceptional facts.
【Keywords】: contextual outlying subspaces; entity exceptional facts; knowledge graph mining
【Paper Link】 【Pages】:1333-1347
【Authors】: Jinfeng Li ; Xiao Yan ; Jian Zhang ; An Xu ; James Cheng ; Jie Liu ; Kelvin K. W. Ng ; Ti-chung Cheng
【Abstract】: As an effective solution to the approximate nearest neighbors (ANN) search problem, learning to hash (L2H) is able to learn similarity-preserving hash functions tailored for a given dataset. However, existing L2H research mainly focuses on improving query performance by learning good hash functions, while Hamming ranking (HR) is used as the default querying method. We show by analysis and experiments that Hamming distance, the similarity indicator used in HR, is too coarse-grained and thus limits the performance of query processing. We propose a new fine-grained similarity indicator, quantization distance (QD), which provides more information about the similarity between a query and the items in a bucket. We then develop two efficient querying methods based on QD, which achieve significantly better query performance than HR. Our methods are general and can work with various L2H algorithms. Our experiments demonstrate that a simple and elegant querying method can produce performance gain equivalent to advanced and complicated learning algorithms.
【Keywords】: distributed computing; large-scale similarity search
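To make the quantization-distance idea concrete, here is a toy Python sketch (not the paper's algorithm: the sign-thresholding hash and the exact scoring rule are illustrative assumptions). Where a bucket's code disagrees with the query's, QD charges the distance of the query's projection from the quantization boundary, so disagreements on "borderline" bits cost little while Hamming distance treats every differing bit equally.

```python
import numpy as np

def binary_code(proj):
    """Quantize real-valued projections to a binary code (sign thresholding)."""
    return (proj > 0).astype(int)

def hamming_distance(code_a, code_b):
    """Coarse-grained: counts differing bits only."""
    return int(np.sum(code_a != code_b))

def quantization_distance(query_proj, bucket_code):
    """Finer-grained: for each bit where the bucket's code disagrees with the
    query's code, add how far the query's projection is from the quantization
    boundary (0). Buckets disagreeing only on borderline bits score low."""
    differing = binary_code(query_proj) != bucket_code
    return float(np.sum(np.abs(query_proj) * differing))

# A query whose projections are near the boundary on dimensions 0 and 3.
q = np.array([0.05, -2.0, 1.5, -0.1])
bucket1 = np.array([0, 0, 1, 1])  # differs on bits 0 and 3 (both borderline)
bucket2 = np.array([1, 1, 1, 0])  # differs on bit 1 (a confident bit)

# Hamming ranking says bucket2 is closer (1 vs 2 differing bits) ...
assert hamming_distance(binary_code(q), bucket1) == 2
assert hamming_distance(binary_code(q), bucket2) == 1
# ... but quantization distance ranks bucket1 closer (0.05 + 0.1 < 2.0).
assert quantization_distance(q, bucket1) < quantization_distance(q, bucket2)
```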
【Paper Link】 【Pages】:1349-1361
【Authors】: Hao Xin ; Rui Meng ; Lei Chen
【Abstract】: Knowledge base construction (KBC) has become a timely topic with the growing application needs of large-scale knowledge bases (KBs), such as semantic search, QA systems, the Google Knowledge Graph, and the IBM Watson QA system. Existing KBs mainly focus on encoding objective facts about the world, e.g., city area and company product, whereas subjective knowledge, which is frequently mentioned in Web queries, has been neglected. Subjective knowledge has no documented ground truth; instead, the truth relies on people's dominant opinion, which can be solicited from online crowd workers. In this work, we propose a KBC framework for subjective knowledge base construction that takes advantage of knowledge from the crowd and from existing KBs. Our two-stage framework consists of core subjective KB construction and subjective KB enrichment. First, we build a core subjective KB mined from existing KBs, in which every instance has rich objective properties. Then, we populate the core subjective KB with instances extracted from existing KBs, leveraging the crowd to annotate the subjective properties of the instances. To optimize the crowd annotation process, we formulate subjective KB enrichment as a cost-aware instance annotation problem and propose two instance annotation algorithms: an adaptive algorithm and a batch-mode algorithm. We evaluate our framework on real knowledge bases and a real crowdsourcing platform; the experimental results show that our framework derives high-quality subjective knowledge facts from existing KBs and crowdsourcing techniques.
【Keywords】: crowdsourcing; knowledge base construction; subjective knowledge
【Paper Link】 【Pages】:1363-1376
【Authors】: Jiawei Jiang ; Bin Cui ; Ce Zhang ; Fangcheng Fu
【Abstract】: Gradient boosting decision tree (GBDT) is one of the most popular machine learning models, widely used in both academia and industry. Although GBDT has been widely supported by existing systems such as XGBoost, LightGBM, and MLlib, a system bottleneck appears when the dimensionality of the data becomes high. As a result, when we tried to support our industrial partner on datasets with dimensionality up to 330K, we observed suboptimal performance for all of these systems. In this paper, we ask: "Can we build a scalable GBDT training system whose performance scales better with respect to the dimensionality of the data?" The first contribution of this paper is a careful investigation of existing systems through a performance model parameterized by the dimensionality of the data. We find that the collective communication operations in many existing systems only implement the algorithm designed for small messages. By just fixing this problem, we are able to speed up these systems by up to 2X. Our second contribution is a series of optimizations that further improve the performance of collective communication: a task scheduler, a two-phase split finding method, and low-precision gradient histograms. Our third contribution is a sparsity-aware algorithm to build gradient histograms and a novel index structure to build histograms in parallel. We implement these optimizations in DimBoost and show that it can be 2-9X faster than existing systems.
【Keywords】: collective communication; gradient boosting decision tree; gradient histogram; high-dimensional feature; parameter server
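The gradient histograms at the heart of such systems can be sketched in a few lines (a toy single-feature version with equi-width bins and an XGBoost-style gain; DimBoost's parallel, sparsity-aware, low-precision construction is far more involved):

```python
import numpy as np

def build_gradient_histogram(feature, grad, hess, n_bins):
    """Accumulate first- and second-order gradient sums per feature bin;
    these compact histograms are what GBDT workers exchange."""
    bins = np.minimum((feature * n_bins).astype(int), n_bins - 1)
    g_hist, h_hist = np.zeros(n_bins), np.zeros(n_bins)
    np.add.at(g_hist, bins, grad)   # unbuffered scatter-add per bin
    np.add.at(h_hist, bins, hess)
    return g_hist, h_hist

def best_split(g_hist, h_hist, lam=1.0):
    """Scan bin boundaries, computing an XGBoost-style gain from the
    cumulative left/right gradient sums."""
    G, H = g_hist.sum(), h_hist.sum()
    best_gain, best_bin, gl, hl = 0.0, None, 0.0, 0.0
    for b in range(len(g_hist) - 1):
        gl, hl = gl + g_hist[b], hl + h_hist[b]
        gr, hr = G - gl, H - hl
        gain = gl**2 / (hl + lam) + gr**2 / (hr + lam) - G**2 / (H + lam)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain

# Features in [0, 1); gradients flip sign at 0.5, so the best split
# should fall at the boundary after bin 1 (i.e., feature value 0.5).
feature = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
grad = np.array([-1., -1., -1., -1., 1., 1., 1., 1.])
hess = np.ones(8)
g_hist, h_hist = build_gradient_histogram(feature, grad, hess, n_bins=4)
assert best_split(g_hist, h_hist)[0] == 1
```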
【Paper Link】 【Pages】:1377-1392
【Authors】: Zhipeng Huang ; Yeye He
【Abstract】: Given a single column of values, existing approaches typically employ regex-like rules to detect errors by finding anomalous values inconsistent with the others. Such techniques make local decisions based only on the values in the given input column, without considering a more global notion of compatibility that can be inferred from large corpora of clean tables. We propose Auto-Detect, a statistics-based technique that leverages co-occurrence statistics from large corpora for error detection, a significant departure from existing rule-based methods. Our approach automatically detects incompatible values by leveraging an ensemble of judiciously selected generalization languages, each of which uses different generalizations and is sensitive to different types of errors. Errors so detected are based on global statistics, which is robust and aligns well with human intuition about errors. We test Auto-Detect on a large set of public Wikipedia tables, as well as proprietary enterprise Excel files. Although both of these test sets are supposed to be of high quality, Auto-Detect discovers tens of thousands of errors in each, which are manually verified to be of high precision (over 0.98). Our labeled benchmark set of Wikipedia tables is released for future research.
【Keywords】: data cleaning; data-driven error detection; outliers; table corpus
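The global-statistics intuition can be illustrated with a toy sketch (illustrative only: the corpus, the raw co-occurrence score, and the flagging rule below are simplifications of the paper's generalization-language ensemble):

```python
from collections import Counter
from itertools import combinations

# Toy "corpus" of clean columns; real systems mine millions of web tables.
corpus = [
    ["2001", "2002", "2003"],
    ["1999", "2001", "2005"],
    ["Jan", "Feb", "Mar"],
    ["Feb", "Apr", "2003"],  # a little corpus noise; statistics dominate
]

# Global statistic: how often two values appear in the same (clean) column.
pair_count = Counter()
for col in corpus:
    for a, b in combinations(sorted(set(col)), 2):
        pair_count[(a, b)] += 1

def cooccur(a, b):
    return pair_count[tuple(sorted((a, b)))]

def least_compatible(column):
    """Flag the value with the lowest total co-occurrence with its peers:
    a global decision, unlike local regex-style anomaly rules."""
    scores = {v: sum(cooccur(v, u) for u in column if u != v) for v in column}
    return min(scores, key=scores.get)

# "Mar" never co-occurs with years in the corpus, so it is flagged.
assert least_compatible(["2001", "2002", "Mar"]) == "Mar"
```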
【Paper Link】 【Pages】:1393-1405
【Authors】: Pramod A. Jamkhedkar ; Theodore Johnson ; Yaron Kanza ; Aman Shaikh ; N. K. Shankaranarayanan ; Vladislav Shkapenyuk
【Abstract】: Modern communication networks are large, dynamic, complex, and increasingly use virtualized network infrastructure. To deploy, maintain, and troubleshoot such networks, it is essential to understand how network elements - such as servers, switches, virtual machines, and virtual network functions - are connected to one another, and to be able to discover communication paths between them. For network maintenance applications such as troubleshooting and service quality management, it is also essential to understand how connections change over time, and to be able to pose time-travel queries to retrieve information about past network states. With the industry-wide move to Software Defined Networks and Virtualized Network Functions (VNFs) [26, 24], maintaining these inventory and topology databases becomes a critical issue. In this paper, we explore the database requirements for the management and troubleshooting of network services using VNF and SDN technologies. This work was initiated in the context of Open source ECOMP, which has now been merged into ONAP [24], the new industry standard for managing network automation. We develop a graph-based layered network model with layers representing increasing levels of specificity, from VNFs to physical hardware. We then describe the kinds of queries required for activities such as operations management and troubleshooting. These considerations have led us to develop Nepal, a model-driven graph database system to represent and reason over network service topology and data flows within the network. Nepal has several features making it particularly applicable for querying inventory: Nepal has a strongly-typed but flexible schema to support model-driven networking; it makes graph paths a first-class object in its query system; it has sophisticated support for in-the-past queries; and it works as a layer over one or more underlying databases.
We demonstrate the capabilities of Nepal through examples, discuss its model-driven query capabilities, and describe its implementation on Gremlin and Postgres. We illustrate how path queries can simplify the extraction of information from a dynamic inventory of a multi-layer network and can be used for troubleshooting.
【Keywords】: graph database; software defined networking; virtualized network functions
【Paper Link】 【Pages】:1407-1419
【Authors】: Cagri Balkesen ; Nitin Kunal ; Georgios Giannikis ; Pit Fender ; Seema Sundara ; Felix Schmidt ; Jarod Wen ; Sandeep R. Agrawal ; Arun Raghavan ; Venkatanathan Varadarajan ; Anand Viswanathan ; Balakrishnan Chandrasekaran ; Sam Idicula ; Nipun Agarwal ; Eric Sedlar
【Abstract】: Today, an ever-increasing number of transistors is packed into processor designs, with extra features to support a broad range of applications. As a consequence, processors are becoming more and more complex and power-hungry. At the same time, they sustain only average performance across a wide variety of applications, without providing the best performance for any specific application. In this paper, we demonstrate, through a carefully designed modern data processing system called RAPID and a simple, low-power processor specially tailored for data processing, that at least an order-of-magnitude performance/power improvement in SQL processing can be achieved over a modern system running on today's complex processors. RAPID is designed from the ground up with hardware/software co-design in mind, to provide architecture-conscious extreme performance while consuming less power than modern database systems. The paper presents in detail the design and implementation of RAPID, a relational, columnar, in-memory query processing engine supporting analytical query workloads.
【Keywords】: analytic query processing; databases; dpu; hardware/software co-design; in-memory data processing; low-power
【Paper Link】 【Pages】:1421-1432
【Authors】: Renzo Angles ; Marcelo Arenas ; Pablo Barceló ; Peter A. Boncz ; George H. L. Fletcher ; Claudio Gutierrez ; Tobias Lindaaker ; Marcus Paradies ; Stefan Plantikow ; Juan F. Sequeda ; Oskar van Rest ; Hannes Voigt
【Abstract】: We report on a community effort between industry and academia to shape the future of graph query languages. We argue that existing graph database management systems should consider supporting a query language with two key characteristics. First, it should be composable, meaning that graphs are the input and the output of queries. Second, the graph query language should treat paths as first-class citizens. Our result is G-CORE, a powerful graph query language design that fulfills these goals and strikes a careful balance between path query expressivity and evaluation complexity.
【Keywords】: graph data models; graph databases; graph query languages
【Paper Link】 【Pages】:1433-1445
【Authors】: Nadime Francis ; Alastair Green ; Paolo Guagliardo ; Leonid Libkin ; Tobias Lindaaker ; Victor Marsault ; Stefan Plantikow ; Mats Rydberg ; Petra Selmer ; Andrés Taylor
【Abstract】: The Cypher property graph query language is an evolving language, originally designed and implemented as part of the Neo4j graph database, and it is currently used by several commercial database products and researchers. We describe Cypher 9, which is the first version of the language governed by the openCypher Implementers Group. We first introduce the language by example, and describe its uses in industry. We then provide a formal semantic definition of the core read-query features of Cypher, including its variant of the property graph data model, and its ASCII Art graph pattern matching mechanism for expressing subgraphs of interest to an application. We compare the features of Cypher to other property graph query languages, and describe extensions, at an advanced stage of development, which will form part of Cypher 10, turning the language into a compositional language which supports graph projections and multiple named graphs.
【Keywords】: cypher; formal semantics; formal specification; graph databases; property graphs; query language
【Paper Link】 【Pages】:1447-1459
【Authors】: Michal Nowakiewicz ; Eric Boutin ; Eric N. Hanson ; Robert Walzer ; Akash Katipally
【Abstract】: Advances in modern hardware, such as increases in the size of main memory available on computers, have made it possible to analyze data at a much higher rate than before. In this paper, we demonstrate that there is tremendous room for improvement in the processing of analytical queries on modern commodity hardware. We introduce BIPie, an engine for query processing implementing highly efficient decoding, selection, and aggregation for analytical queries executing on a columnar storage engine in MemSQL. We demonstrate that these operations are interdependent, and must be fused and considered together to achieve very high performance. We propose and compare multiple strategies for decoding, selection and aggregation (with GROUP BY), all of which are designed to take advantage of modern CPU architectures, including SIMD. We implemented these approaches in MemSQL, a high performance hybrid transaction and analytical processing database designed for commodity hardware. We thoroughly evaluate the performance of the approach across a range of parameters, and demonstrate a two to four times speedup over previously published TPC-H Query 1 performance.
【Keywords】: aggregation; bipie; column store; encoded data; operator specialization; query processing; selection
【Paper Link】 【Pages】:1461-1476
【Authors】: Yongjoo Park ; Barzan Mozafari ; Joseph Sorenson ; Junhao Wang
【Abstract】: Despite 25 years of research in academia, approximate query processing (AQP) has had little industrial adoption. One of the major causes of this slow adoption is the reluctance of traditional vendors to make radical changes to their legacy codebases, and the preoccupation of newer vendors (e.g., SQL-on-Hadoop products) with implementing standard features. Additionally, the few AQP engines that are available are each tied to a specific platform and require users to completely abandon their existing databases---an unrealistic expectation given the infancy of AQP technology. Therefore, we argue that a universal solution is needed: a database-agnostic approximation engine that will widen the reach of this emerging technology across various platforms. Our proposal, called VerdictDB, uses a middleware architecture that requires no changes to the backend database, and thus can work with all off-the-shelf engines. Operating at the driver level, VerdictDB intercepts analytical queries issued to the database and rewrites them into another query that, if executed by any standard relational engine, will yield sufficient information for computing an approximate answer. VerdictDB uses the returned result set to compute an approximate answer and error estimates, which are then passed on to the user or application. However, lack of access to the query execution layer introduces significant challenges in terms of generality, correctness, and efficiency. This paper shows how VerdictDB overcomes these challenges and delivers up to 171× speedup (18.45× on average) for a variety of existing engines, such as Impala, Spark SQL, and Amazon Redshift, while incurring less than 2.6% relative error. VerdictDB is open-sourced under the Apache License.
【Keywords】: approximate query processing; data analytics
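The driver-level idea can be sketched in a few lines of Python (a toy stand-in: a real middleware rewrites SQL to run against a sample table in the backend engine and supports far more operators; the data and sample ratio here are illustrative):

```python
import random
import statistics

random.seed(0)
population = [random.gauss(100, 15) for _ in range(100_000)]  # the full table

def approximate_avg(sample_ratio=0.01):
    """Run the aggregate over a small uniform sample and attach a
    CLT-based 95% confidence half-width, as a middleware could, instead
    of scanning the full table."""
    n = int(len(population) * sample_ratio)
    sample = random.sample(population, n)
    estimate = statistics.fmean(sample)
    error = 1.96 * statistics.stdev(sample) / n ** 0.5
    return estimate, error

estimate, error = approximate_avg()
exact = statistics.fmean(population)
# The approximate answer lands near the truth, at 1% of the scan cost.
assert abs(estimate - exact) < 3 * error
```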
【Paper Link】 【Pages】:1477-1492
【Authors】: Jinglin Peng ; Dongxiang Zhang ; Jiannan Wang ; Jian Pei
【Abstract】: Interactive analytics requires database systems to be able to answer aggregation queries within interactive response times. As the amount of data is continuously growing at an unprecedented rate, this is becoming increasingly challenging. In the past, the database community has proposed two separate ideas, sampling-based approximate query processing (AQP) and aggregate precomputation (AggPre) such as data cubes, to address this challenge. In this paper, we argue for the need to connect these two separate ideas for interactive analytics. We propose AQP++, a novel framework to enable the connection. The framework can leverage both a sample as well as a precomputed aggregate to answer user queries. We discuss the advantages of having such a unified framework and identify new challenges to fulfill this vision. We conduct an in-depth study of these challenges for range queries and explore both optimal and heuristic solutions to address them. Our experiments using two public benchmarks and one real-world dataset show that AQP++ achieves a more flexible and better trade-off among preprocessing cost, query response time, and answer quality than AQP or AggPre.
【Keywords】: aggregation; aqp; aqp++; data cube; range query; sampling
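A toy sketch of the AQP++ idea for range-sum queries (illustrative only: here the precomputed aggregates are exact prefix sums at block granularity, and a uniform sample estimates just the residual partial block; the paper tackles the general problem of choosing which aggregates to precompute):

```python
import random

random.seed(1)
data = [random.randint(0, 100) for _ in range(10_000)]
BLOCK = 1000

# AggPre side: exact prefix sums precomputed every BLOCK rows.
prefix, run = {0: 0}, 0
for i, v in enumerate(data, 1):
    run += v
    if i % BLOCK == 0:
        prefix[i] = run

def sample_sum(lo, hi, n=100):
    """AQP side: estimate sum(data[lo:hi]) from n uniform random probes."""
    probes = [data[random.randrange(lo, hi)] for _ in range(n)]
    return (hi - lo) * sum(probes) / n

def aqppp_sum(b):
    """AQP++: exact precomputed prefix plus a sampled estimate of the
    remainder, confining sampling error to the last partial block."""
    m = (b // BLOCK) * BLOCK
    return prefix[m] + sample_sum(m, b)

def aqp_sum(b):
    """Plain AQP: estimate the whole range from samples alone."""
    return sample_sum(0, b)

b = 7_500
exact = sum(data[:b])
# Averaged over repeated runs, the hybrid estimator's error is far smaller.
err_pp = sum(abs(aqppp_sum(b) - exact) for _ in range(200))
err_aqp = sum(abs(aqp_sum(b) - exact) for _ in range(200))
assert err_pp < err_aqp
```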
【Paper Link】 【Pages】:1493-1508
【Authors】: Yao Lu ; Aakanksha Chowdhery ; Srikanth Kandula ; Surajit Chaudhuri
【Abstract】: Classic query optimization techniques, including predicate pushdown, are of limited use for machine learning inference queries, because the user-defined functions (UDFs) which extract relational columns from unstructured inputs are often very expensive; query predicates will remain stuck behind these UDFs if they happen to require relational columns that are generated by the UDFs. In this work, we demonstrate constructing and applying probabilistic predicates to filter data blobs that do not satisfy the query predicate; such filtering is parametrized to different target accuracies. Furthermore, to support complex predicates and to avoid per-query training, we augment a cost-based query optimizer to choose plans with appropriate combinations of simpler probabilistic predicates. Experiments with several machine learning workloads on a big-data cluster show that query processing improves by as much as 10x.
【Keywords】: image analysis; inference; machine learning; model cascades; probabilistic predicates; query processing; user-defined functions; video analysis
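The plan transformation can be illustrated with a toy sketch (illustrative assumptions throughout: the "blobs", the stand-in UDF, the noisy proxy score, and the threshold are all made up; real probabilistic predicates are trained filters parametrized by a target accuracy):

```python
import random

random.seed(2)

# Each "blob" is raw input; the expensive UDF extracts the relational column.
blobs = [random.random() for _ in range(10_000)]
udf_calls = 0

def expensive_udf(blob):
    """Stand-in for, e.g., a neural net extracting a label from an image."""
    global udf_calls
    udf_calls += 1
    return blob > 0.9          # the query predicate needs this derived column

def probabilistic_predicate(blob, threshold):
    """Cheap, imperfect score correlated with the UDF output. Lowering the
    threshold trades more UDF work for higher recall."""
    noisy_score = blob + random.gauss(0, 0.05)
    return noisy_score > threshold

# Transformed plan: the probabilistic predicate filters blobs first,
# so the expensive UDF runs only on the survivors.
survivors = [b for b in blobs if probabilistic_predicate(b, 0.7)]
result = [b for b in survivors if expensive_udf(b)]

assert udf_calls == len(survivors)
assert udf_calls < len(blobs) / 2   # most UDF invocations were avoided
```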
【Paper Link】 【Pages】:1509-1524
【Authors】: Uzi Cohen ; Batya Kenig ; Haoyue Ping ; Benny Kimelfeld ; Julia Stoyanovich
【Abstract】: Models of uncertain preferences, such as Mallows, have been extensively studied due to their plethora of application domains. In a recent work, a conceptual and theoretical framework has been proposed for supporting uncertain preferences as first-class citizens in a relational database. The resulting database is probabilistic, and, consequently, query evaluation entails inference of marginal probabilities of query answers. In this paper, we embark on the challenge of a practical realization of this framework. We first describe an implementation of a query engine that supports querying probabilistic preferences alongside relational data. Our system accommodates preference distributions in the general form of the Repeated Insertion Model (RIM), which generalizes Mallows and other models. We then devise a novel inference algorithm for conjunctive queries over RIM, and show that it significantly outperforms the state of the art in terms of both asymptotic and empirical execution cost. We also develop performance optimizations that are based on sharing computation among different inference tasks in the workload. Finally, we conduct an extensive experimental evaluation and demonstrate that clear performance benefits can be realized by a query engine with built-in probabilistic inference, as compared to a stand-alone implementation with a black-box inference solver.
【Keywords】: preferences; probabilistic databases
【Paper Link】 【Pages】:1525-1539
【Authors】: Zhuoyue Zhao ; Robert Christensen ; Feifei Li ; Xiao Hu ; Ke Yi
【Abstract】: Joins are expensive, especially on large data and/or multiple relations. One promising approach in mitigating their high costs is to just return a simple random sample of the full join results, which is sufficient for many tasks. Indeed, in as early as 1999, Chaudhuri et al. posed the problem of sampling over joins as a fundamental challenge in large database systems. They also pointed out a fundamental barrier for this problem, that the sampling operator cannot be pushed through a join, i.e., sample(R ⋈ S) ≠ sample(R) ⋈ sample(S). To overcome this barrier, they used precomputed statistics to guide the sampling process, but only showed how this works for two-relation joins. This paper revisits this classic problem for both acyclic and cyclic multi-way joins. We build upon the idea of Chaudhuri et al., but extend it in several nontrivial directions. First, we propose a general framework for random sampling over multi-way joins, which includes the algorithm of Chaudhuri et al. as a special case. Second, we explore several ways to instantiate this framework, depending on what prior information is available about the underlying data, and offer different tradeoffs between sample generation latency and throughput. We analyze the properties of different instantiations and evaluate them against the baseline methods; the results clearly demonstrate the superiority of our new techniques.
【Keywords】: join sampling framework; multi-way joins; random sampling
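The two-relation case of the Chaudhuri et al. idea fits in a few lines (a toy sketch with hypothetical relations R and S; the paper's framework generalizes this weighted-then-uniform scheme to acyclic and cyclic multi-way joins):

```python
import random
from collections import Counter

random.seed(3)

R = [("a",), ("b",), ("c",)]
S = [("a", 1), ("a", 2), ("a", 3), ("b", 4)]  # "a" joins 3 ways, "b" once

# Precomputed statistic: how many S-tuples each join key matches.
freq = Counter(key for key, _ in S)
weights = [freq[r[0]] for r in R]             # [3, 1, 0]

def sample_join():
    """Draw r proportionally to its fan-out in S, then a matching s
    uniformly; the result is a uniform sample of R join S, which naive
    per-relation sampling (sample(R) join sample(S)) cannot produce."""
    r = random.choices(R, weights=weights)[0]
    s = random.choice([t for t in S if t[0] == r[0]])
    return r + s

counts = Counter(sample_join() for _ in range(40_000))
# All 4 join results appear, each ~25% of the time (uniform over the join).
assert len(counts) == 4
assert all(abs(c / 40_000 - 0.25) < 0.03 for c in counts.values())
```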
【Paper Link】 【Pages】:1541-1555
【Authors】: Alexander van Renen ; Viktor Leis ; Alfons Kemper ; Thomas Neumann ; Takushi Hashida ; Kazuichi Oe ; Yoshiyasu Doi ; Lilian Harada ; Mitsuru Sato
【Abstract】: Non-volatile memory (NVM) is a new storage technology that combines the performance and byte addressability of DRAM with the persistence of traditional storage devices like flash (SSD). While these properties make NVM highly promising, it is not yet clear how to best integrate NVM into the storage layer of modern database systems. Two system designs have been proposed. The first is to use NVM exclusively, i.e., to store all data and index structures on it. However, because NVM has a higher latency than DRAM, this design can be less efficient than main-memory database systems. For this reason, the second approach uses a page-based DRAM cache in front of NVM. This approach, however, does not utilize the byte addressability of NVM and, as a result, accessing an uncached tuple on NVM requires retrieving an entire page. In this work, we evaluate these two approaches and compare them with in-memory databases as well as more traditional buffer managers that use main memory as a cache in front of SSDs. This allows us to determine how much performance gain can be expected from NVM. We also propose a lightweight storage manager that simultaneously supports DRAM, NVM, and flash. Our design utilizes the byte addressability of NVM and uses it as an additional caching layer that improves performance without losing the benefits from the even faster DRAM and the large capacities of SSDs.
【Keywords】: database architecture; non-volatile memory
【Paper Link】 【Pages】:1557-1570
【Authors】: Anil Shanbhag ; Holger Pirk ; Samuel Madden
【Abstract】: A common operation in many data analytics workloads is to find the top-k items, i.e., the largest or smallest items according to some sort order (implemented via LIMIT or ORDER BY expressions in SQL). A naive implementation of top-k is to sort all of the items and then return the first k, but this does much more work than needed. Although efficient implementations for top-k have been explored on traditional multi-core processors, there has been no prior systematic study of top-k implementations on GPUs, despite open requests for such implementations in GPU-based frameworks like TensorFlow and ArrayFire. In this work, we present several top-k algorithms for GPUs, including a new algorithm based on bitonic sort called bitonic top-k. The bitonic top-k algorithm is up to 15x faster than sort and 4x faster than a variety of other possible implementations for values of k up to 256. We also develop a cost model to predict the performance of several of our algorithms, and show that it accurately predicts actual performance on modern GPUs.
【Keywords】: bitonic top-k; top-k algorithms for gpu
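The work a full sort wastes is easy to see with a sequential sketch (a heap-based CPU analogue, not the paper's GPU bitonic algorithm): one pass with a size-k min-heap costs O(n log k) instead of the naive plan's O(n log n).

```python
import heapq
import random

random.seed(4)
values = [random.random() for _ in range(100_000)]
k = 256

# Naive plan: sort everything, keep the first k.
top_sorted = sorted(values, reverse=True)[:k]

# Avoiding the full sort: keep a size-k min-heap of the best items seen.
heap = values[:k]
heapq.heapify(heap)
for v in values[k:]:
    if v > heap[0]:              # candidate beats the current k-th best
        heapq.heapreplace(heap, v)

# Both plans agree on the top-k set.
assert sorted(heap, reverse=True) == top_sorted
```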
【Paper Link】 【Pages】:1571-1586
【Authors】: Dong Young Yoon ; Mosharaf Chowdhury ; Barzan Mozafari
【Abstract】: Lock managers are a crucial component of modern distributed systems. However, with the increasing availability of fast RDMA-enabled networks, traditional lock managers can no longer keep up with the latency and throughput requirements of modern systems. Centralized lock managers can ensure fairness and prevent starvation using global knowledge of the system, but are themselves single points of contention and failure. Consequently, they fall short in leveraging the full potential of RDMA networks. On the other hand, decentralized (RDMA-based) lock managers either completely sacrifice global knowledge to achieve higher throughput at the risk of starvation and higher tail latencies, or they resort to costly communications in order to maintain global knowledge, which can result in significantly lower throughput. In this paper, we show that it is possible for a lock manager to be fully decentralized and yet exchange the partial knowledge necessary for preventing starvation and thereby reducing tail latencies. Our main observation is that we can design a lock manager primarily using RDMA's fetch-and-add (FA) operations, which always succeed, rather than compare-and-swap (CAS) operations, which only succeed if a given condition is satisfied. While this requires us to rethink the locking mechanism from the ground up, it enables us to sidestep the performance drawbacks of the previous CAS-based proposals that relied solely on blind retries upon lock conflicts. Specifically, we present DSLR (Decentralized and Starvation-free Lock management with RDMA), a decentralized lock manager that targets distributed systems running on RDMA-enabled networks. We demonstrate that, despite being fully decentralized, DSLR prevents starvation and blind retries by guaranteeing first-come-first-serve (FCFS) scheduling without maintaining explicit queues. 
We adapt Lamport's bakery algorithm to an RDMA-enabled environment with multiple bakers, utilizing only one-sided READ and atomic FA operations. Our experiments show that, on average, DSLR delivers 1.8x (and up to 2.8x) higher throughput than all existing RDMA-based lock managers, while reducing their mean and 99.9th-percentile latencies by 2.0x and 18.3x (and up to 2.5x and 47x), respectively.
【Keywords】: decentralized; distributed locking; distributed systems; infiniband; rdma; serializability; starvation-free
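The fetch-and-add ticket idea behind DSLR can be sketched with threads (a toy emulation: DSLR issues the fetch-and-add as a one-sided RDMA verb and polls via RDMA READs; here a CPython counter stands in for the atomic, and `time.sleep(0)` for the remote polling):

```python
import threading
import time
from itertools import count

class TicketLock:
    """FCFS spinlock in the spirit of DSLR: a single fetch-and-add hands
    out tickets, so waiters are served in arrival order with no blind
    retries and no explicit queue."""
    def __init__(self):
        self._tickets = count()   # next(...) emulates atomic fetch-and-add
        self.now_serving = 0

    def acquire(self):
        my_ticket = next(self._tickets)       # atomic under CPython's GIL
        while self.now_serving != my_ticket:  # DSLR polls this via RDMA READ
            time.sleep(0)                     # yield while spinning

    def release(self):
        self.now_serving += 1                 # hand off to the next ticket

lock, counter = TicketLock(), 0

def worker():
    global counter
    for _ in range(500):
        lock.acquire()
        counter += 1          # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter == 2000        # mutual exclusion held: no lost updates
```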
【Paper Link】 【Pages】:1587-1602
【Authors】: Shuo Han ; Lei Zou ; Jeffrey Xu Yu
【Abstract】: In this paper, we focus on accelerating a widely employed computing pattern --- set intersection, to boost a group of graph algorithms. A graph's adjacency lists can be naturally considered as node sets, so set intersection is a primitive operation in many graph algorithms. We propose QFilter, a set intersection algorithm using SIMD instructions. QFilter adopts a merge-based framework and compares two blocks of elements iteratively by SIMD instructions. The key insight behind our improvement is that we quickly filter out most of the unnecessary comparisons in one byte-checking step. We also present a binary representation called BSR that encodes sets in a compact layout. By combining QFilter and BSR, we achieve data-parallelism at two levels --- inter-chunk and intra-chunk parallelism. Moreover, we find that node ordering impacts the performance of intersection by affecting the compactness of BSR. We formulate the graph reordering problem as an optimization of the compactness of BSR, and prove that it is strongly NP-complete. We therefore propose an approximate algorithm that can find a better ordering to enhance the intra-chunk parallelism. We conduct extensive experiments to confirm that our approach can significantly improve the performance of set intersection in graph algorithms.
【Keywords】: graph ordering; graph processing; set intersection; simd
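The primitive in question is just the textbook merge intersection (sketched below in plain Python; QFilter's contribution is performing the block-wise comparisons and the byte-level filtering step with SIMD instructions):

```python
def intersect(a, b):
    """Merge-based intersection of two sorted adjacency lists: advance the
    pointer of whichever list has the smaller head, emit on matches."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Triangle counting uses exactly this primitive: common neighbors of u, v.
neighbors_u = [1, 3, 4, 7, 9]
neighbors_v = [2, 3, 7, 8, 9]
assert intersect(neighbors_u, neighbors_v) == [3, 7, 9]
```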
【Paper Link】 【Pages】:1603-1618
【Authors】: Henning Funke ; Sebastian Breß ; Stefan Noll ; Volker Markl ; Jens Teubner
【Abstract】: Query processing on GPU-style coprocessors is severely limited by the movement of data. With teraflops of compute throughput in one device, even high-bandwidth memory cannot provision enough data for a reasonable utilization. Query compilation is a proven technique to improve memory efficiency. However, its inherent tuple-at-a-time processing style does not suit the massively parallel execution model of GPU-style coprocessors. This compromises the improvements in efficiency offered by query compilation. In this paper, we show how query compilation and GPU-style parallelism can be made to play in unison nevertheless. We describe a compiler strategy that merges multiple operations into a single GPU kernel, thereby significantly reducing bandwidth demand. Compared to operator-at-a-time, we show reductions of memory access volumes by factors of up to 7.5x resulting in shorter kernel execution times by factors of up to 9.5x.
【Keywords】: just-in-time query compilation; massively parallel query processing; olap; operator pipelining; query-coprocessing
【Paper Link】 【Pages】:1619-1634
【Authors】: Till Kolditz ; Dirk Habich ; Wolfgang Lehner ; Matthias Werner ; Stefan T. J. de Bruijn
【Abstract】: It has long been known that hardware components are not perfect and that soft errors, in the form of single bit flips, happen all the time. Up to now, these single bit flips have mainly been addressed in hardware using general-purpose protection techniques. However, recent studies have shown that future hardware components will become less and less reliable overall, and that multi-bit flips will occur regularly rather than exceptionally. Additionally, hardware aging effects will lead to error models that change at run-time. Scaling hardware-based protection techniques to cover changing multi-bit flips is possible, but introduces large performance, chip area, and power overheads, which will become unaffordable in the future. To tackle this, an emerging research direction employs protection techniques in higher software layers such as compilers or applications, where the available knowledge can be used to specialize and adapt the protection efficiently. In this paper, we propose AHEAD, a novel adaptable and on-the-fly hardware error detection approach for database systems. AHEAD provides configurable error detection in an end-to-end fashion and reduces the overhead (storage and computation) compared to other techniques at this level. Our approach uses an arithmetic error coding technique that allows query processing to work entirely on hardened data and, at the same time, enables on-the-fly detection during query processing of (i) errors that modify data stored in memory or transferred on an interconnect and (ii) errors induced during computations. Our exhaustive evaluation clearly shows the benefits of the AHEAD approach.
【Keywords】: an coding; column stores; in-memory; reliability; resilience
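The arithmetic error coding mentioned above is an AN code: every value is multiplied by a constant A, so any word that is not divisible by A reveals a bit flip, and sums of code words remain code words. A minimal sketch (the choice A = 61 is illustrative, not AHEAD's actual parameterization):

```python
A = 61  # illustrative encoding constant; odd, so no single power-of-two
        # bit flip can map one valid code word to another

def encode(value: int) -> int:
    """Harden a value by multiplying it with A."""
    return value * A

def decode(code: int) -> int:
    """Recover the value; any word not divisible by A signals corruption."""
    if code % A != 0:
        raise ValueError(f"bit flip detected in code word {code}")
    return code // A

# Query processing can stay on hardened data: A*x + A*y == A*(x + y).
hardened_sum = encode(10) + encode(32)
assert decode(hardened_sum) == 42

# A single bit flip changes a word by +/- 2^k, never a multiple of 61,
# so it is always caught on decode.
corrupted = encode(10) ^ (1 << 3)
```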
【Paper Link】 【Pages】:1635-1636
【Authors】: Julia Stoyanovich ; Bill Howe ; H. V. Jagadish
【Abstract】:
【Keywords】: accountability; data ethics; fairness; responsible data management; transparency
【Paper Link】 【Pages】:1637-1644
【Authors】: Lilong Jiang ; Protiva Rahman ; Arnab Nandi
【Abstract】: Highly interactive query interfaces have become a popular tool for ad-hoc data analysis and exploration, posing a new kind of workload to the underlying data infrastructure. Compared with traditional systems that are optimized for throughput or batched performance, ad-hoc and interactive data exploration systems focus more on user-centric interactivity, which raises a new class of performance challenges. Further, with the advent of new interaction devices (e.g., touch, gesture) and different query interface paradigms (e.g., sliders), maintaining interactive performance becomes even more challenging. Thus, when building interactive data systems, there is a clear need to articulate the design space. In this tutorial, we will describe unique characteristics of interactive workloads for a variety of user input devices and query interfaces. We will catalog popular metrics based on an extensive survey of current literature. Through two case studies, we will not only walk through previously defined metrics using real-world user traces but also highlight where these metrics are inadequate. Further, we will introduce new metrics that are required to capture a complete picture of interactivity. In each case study, we also demonstrate how analyses of user traces and performance experiments can provide guidelines to help researchers and developers design better interactive data systems.
【Keywords】: benchmark; databases; human-computer interaction
【Paper Link】 【Pages】:1645-1650
【Authors】: Xin Luna Dong ; Theodoros Rekatsinas
【Abstract】: There is now more data to analyze than ever before. As data volume and variety have increased, the ties between machine learning and data integration have grown stronger. For machine learning to be effective, one must utilize data from the greatest possible variety of sources; this is why data integration plays a key role. At the same time, machine learning is driving automation in data integration, resulting in an overall reduction of integration costs and improved accuracy. This tutorial focuses on three aspects of the synergistic relationship between data integration and machine learning: (1) we survey how state-of-the-art data integration solutions rely on machine learning-based approaches for accurate results and effective human-in-the-loop pipelines, (2) we review how end-to-end machine learning applications rely on data integration to identify accurate, clean, and relevant data for their analytics exercises, and (3) we discuss open research challenges and opportunities that span data integration and machine learning.
【Keywords】: data enrichment; data integration; machine learning
【Paper Link】 【Pages】:1651-1654
【Authors】: Georgia Koutrika
【Abstract】: Starting with the Netflix Prize, which fueled much recent progress in the field of collaborative filtering, recent years have witnessed rapid development of new recommendation algorithms and increasingly more complex systems, which greatly differ from the early content-based and collaborative filtering systems. Modern recommender systems leverage several novel algorithmic approaches: from matrix factorization methods and multi-armed bandits to deep neural networks. In this tutorial, we will cover recent algorithmic advances in recommender systems, highlight their capabilities, and their impact. We will give many examples of industrial-scale recommender systems that define the future of the recommender systems area. We will discuss related evaluation issues, and outline future research directions. The ultimate goal of the tutorial is to encourage the application of novel recommendation approaches to solve problems that go beyond user consumption and to further promote research in the intersection of recommender systems and databases.
【Keywords】:
【Paper Link】 【Pages】:1655-1658
【Authors】: Graham Cormode ; Somesh Jha ; Tejas Kulkarni ; Ninghui Li ; Divesh Srivastava ; Tianhao Wang
【Abstract】: Local differential privacy (LDP), where users randomly perturb their inputs to provide plausible deniability of their data without the need for a trusted party, has been adopted recently by several major technology organizations, including Google, Apple and Microsoft. This tutorial aims to introduce the key technical underpinnings of these deployed systems, to survey current research that addresses related problems within the LDP model, and to identify relevant open problems and research directions for the community.
【Keywords】: data collection; differential privacy; local differential privacy; privacy
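The core primitive behind these deployed LDP systems is randomized response: each user reports their true bit with a calibrated probability and lies otherwise, and the aggregator debiases the noisy counts. A minimal sketch (the population and epsilon below are invented for illustration):

```python
import math
import random

def perturb(bit: int, eps: float, rng: random.Random) -> int:
    """Report the true bit with probability e^eps / (e^eps + 1), else lie."""
    p = math.exp(eps) / (math.exp(eps) + 1.0)
    return bit if rng.random() < p else 1 - bit

def estimate_count(reports, eps: float) -> float:
    """Unbiased estimate of the number of 1-bits from the noisy reports."""
    p = math.exp(eps) / (math.exp(eps) + 1.0)
    n = len(reports)
    return (sum(reports) - n * (1.0 - p)) / (2.0 * p - 1.0)

rng = random.Random(0)
true_bits = [1] * 600 + [0] * 400          # 600 users hold the sensitive bit
reports = [perturb(b, 1.0, rng) for b in true_bits]
estimate = estimate_count(reports, 1.0)    # concentrates around 600
```

No trusted party ever sees a true bit; only the perturbed reports leave the users' devices.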
【Paper Link】 【Pages】:1659-1664
【Authors】: Paris Koutris ; Semih Salihoglu ; Dan Suciu
【Abstract】: In the last decade we have witnessed a growing interest in processing large data sets on large-scale distributed clusters. A big part of the complex data analysis pipelines performed by these systems consists of a sequence of relatively simple query operations, such as joining two or more tables, or sorting. This tutorial discusses several recent algorithmic developments for data processing in such large distributed clusters. It uses as a model of computation the Massively Parallel Computation (MPC) model, a simplification of the BSP model, where the only cost is given by the amount of communication and the number of communication rounds. Based on the MPC model, we study and analyze several algorithms for three core data processing tasks: multiway join queries, sorting and matrix multiplication. We discuss the common algorithmic techniques across all tasks, relate the algorithms to what is used in practical systems, and finally present open problems for future research.
【Keywords】: bulk synchronous parallel model; distributed query evaluation
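The flavor of these algorithms can be seen in the simplest one-round MPC computation, a hash join: every tuple is sent to the server that owns its join key (this communication is the only cost in the model), after which servers join their fragments locally. A toy single-process simulation (the schemas R(a, b), S(b, c) and p = 4 servers are invented for illustration):

```python
from collections import defaultdict

def mpc_hash_join(R, S, p=4):
    """One-round MPC hash join of R(a, b) and S(b, c) on attribute b."""
    servers_R = defaultdict(list)
    servers_S = defaultdict(list)
    # Round 1 (communication): route each tuple to the server for its key.
    for a, b in R:
        servers_R[hash(b) % p].append((a, b))
    for b, c in S:
        servers_S[hash(b) % p].append((b, c))
    # Local computation: each server joins its own fragment.
    out = []
    for s in range(p):
        by_key = defaultdict(list)
        for a, b in servers_R[s]:
            by_key[b].append(a)
        for b, c in servers_S[s]:
            for a in by_key[b]:
                out.append((a, b, c))
    return out

R = [(1, 10), (2, 10), (3, 20)]
S = [(10, "x"), (20, "y"), (30, "z")]
```

Skewed keys overload a single server in this scheme, which is exactly what the more refined MPC algorithms in the tutorial are designed to avoid.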
【Paper Link】 【Pages】:1665-1668
【Authors】: Wen He ; Yongjoo Park ; Idris Hanafi ; Jacob Yatvitskiy ; Barzan Mozafari
【Abstract】: We demonstrate VerdictDB, the first platform-independent approximate query processing (AQP) system. Unlike existing AQP systems that are tightly-integrated into a specific database, VerdictDB operates at the driver-level, acting as a middleware between users and off-the-shelf database systems. In other words, VerdictDB requires no modifications to the database internals; it simply relies on rewriting incoming queries such that the standard execution of the rewritten queries under relational semantics yields approximate answers to the original queries. VerdictDB exploits a novel technique for error estimation called variational subsampling, which is amenable to efficient computation via SQL. In this demonstration, we showcase VerdictDB's performance benefits (up to two orders of magnitude) compared to the queries that are issued directly to existing query engines. We also illustrate that the approximate answers returned by VerdictDB are nearly identical to the exact answers. We use Apache Spark SQL and Amazon Redshift as two examples of modern distributed query platforms. We allow the audience to explore VerdictDB using a web-based interface (e.g., Hue or Apache Zeppelin) to issue queries and visualize their answers. VerdictDB is currently open-sourced and available under Apache License (V2).
【Keywords】: approximate query processing; data analytics
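The statistical idea behind subsample-based error estimation can be sketched in a few lines: compute the aggregate on a sample, then use the spread of per-subsample aggregates as an error estimate. This illustrates only the general subsampling principle; VerdictDB's variational subsampling and its SQL rewriting differ in the details, and the data here is synthetic.

```python
import math
import random
import statistics

rng = random.Random(42)
table = [rng.gauss(100.0, 15.0) for _ in range(50_000)]   # synthetic column

sample = rng.sample(table, 2_000)
estimate = statistics.mean(sample)            # approximate AVG answer

k = 20                                        # number of subsamples
subsamples = [sample[i::k] for i in range(k)] # disjoint subsamples
sub_means = [statistics.mean(s) for s in subsamples]
stderr = statistics.stdev(sub_means) / math.sqrt(k)       # error estimate
```

Both the estimate and the subsample aggregates are expressible as plain SQL over the sample, which is what lets such a system run on top of an unmodified engine.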
【Paper Link】 【Pages】:1669-1672
【Authors】: Yao Lu ; Srikanth Kandula ; Surajit Chaudhuri
【Abstract】: We will demonstrate a prototype query processing engine that uses probabilistic predicates (PPs) to speed up machine learning inference jobs. In current analytic engines, machine learning functions are modeled as user-defined functions (UDFs), which are both time and resource intensive. These UDFs prevent predicate pushdown: predicates that use the outputs of these UDFs cannot be evaluated before the UDFs themselves. Hence, considerable time and resources are wasted applying the UDFs to inputs that will be rejected by the subsequent predicate. We use PPs, which are lightweight classifiers applied directly to the raw input, to filter data blobs that disagree with the query predicate. By reducing the input to be processed by the UDFs, PPs substantially improve query processing. We will show that PPs are broadly applicable by constructing PPs for many inference tasks, including image recognition, document classification and video analysis. We will also demonstrate query optimization methods that extend PPs to complex query predicates and support different accuracy requirements.
【Keywords】: image analysis; inference; machine learning; model cascades; probabilistic predicates; query processing; user-defined functions; video analysis
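The effect of a probabilistic predicate can be sketched in a few lines: a cheap filter on the raw input discards blobs that are unlikely to satisfy the query predicate, so the expensive UDF touches far fewer rows. The threshold classifier, the "brightness" feature, and the data below are all invented stand-ins for the paper's learned PPs:

```python
EXPENSIVE_CALLS = 0

def expensive_udf(blob):
    """Stand-in for a costly ML function, e.g. full image recognition."""
    global EXPENSIVE_CALLS
    EXPENSIVE_CALLS += 1
    return blob["label"]

def probabilistic_predicate(blob):
    """Cheap filter on raw input; may keep extra rows, must not drop matches."""
    return blob["brightness"] > 0.2

# Synthetic blobs: every 4th one is a bright 'car'.
blobs = [{"label": "car" if i % 4 == 0 else "other",
          "brightness": 0.9 if i % 4 == 0 else (0.1 if i % 2 else 0.5)}
         for i in range(100)]

# Query: SELECT * FROM blobs WHERE expensive_udf(blob) = 'car'
survivors = [b for b in blobs if probabilistic_predicate(b)]  # 50 of 100 pass
results = [b for b in survivors if expensive_udf(b) == "car"]
```

The PP halves the number of expensive UDF calls here without changing the answer, since every matching blob passes the cheap filter.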
【Paper Link】 【Pages】:1673-1676
【Authors】: Sheng Wang ; Mingzhao Li ; Yipeng Zhang ; Zhifeng Bao ; David Alexander Tedjopurnomo ; Xiaolin Qin
【Abstract】: In this paper, we build a trip planning system called TISP, which enables users to interactively explore POIs and trajectories during incremental trip planning. At the back end, TISP is able to support seven types of common queries over spatial-only, spatial-textual and textual-only data, based on our proposed unified indexing and search paradigm [7]. At the front end, we propose novel visualisation designs to present the results of different types of queries; our user-friendly interaction designs allow users to construct further queries without inputting any text.
【Keywords】: indexing; trajectory search; trip planning
【Paper Link】 【Pages】:1677-1680
【Authors】: Tao Guo ; Mingzhao Li ; Peishan Li ; Zhifeng Bao ; Gao Cong
【Abstract】: In this demonstration we present POISam, a visualization system supporting the following desirable features: representativeness, visibility constraint, zooming consistency, and panning consistency. The first two constraints aim to efficiently select a small set of representative objects from the current region of the user's interest, such that no two selected objects are too close for users to distinguish in the limited space of a screen. One unique feature of POISam is that any similarity metric can be plugged in to meet the user's specific needs in different scenarios. The latter two consistencies are fundamental challenges in efficiently updating the selection result w.r.t. the user's zoom-in, zoom-out and panning operations when they interact with the map. POISam drops a common assumption of all previous work, namely that zoom levels and region cells are pre-defined and indexed, and that objects are selected from such region cells at a particular zoom level rather than from the user's current region of interest (which in most cases does not correspond to the pre-defined cells). This introduces an extra challenge, as object selection must be done via online computation. To the best of our knowledge, this is the first system that meets all four features to achieve an interactive map exploration system.
【Keywords】: geospatial data visualization; map exploration; sampling
【Paper Link】 【Pages】:1681-1684
【Authors】: Zeyuan Shang ; Guoliang Li ; Zhifeng Bao
【Abstract】: Trajectory analytics can benefit many real-world applications, e.g., frequent-trajectory-based navigation systems, road planning, car pooling, and transportation optimization. In this paper, we demonstrate a distributed in-memory trajectory analytics system, DITA, to support large-scale trajectory data analytics. DITA exhibits three unique features. First, DITA supports threshold-based and KNN-based trajectory similarity search and join operations, as well as range queries (i.e., space and time). Second, DITA is versatile enough to support most existing similarity functions to cater for different analytic purposes and scenarios. Last, DITA is seamlessly integrated into Spark SQL to support easy-to-use SQL and DataFrame API interfaces. Technically, DITA proposes an effective partitioning method, with a global index and local indexes, to address the data locality problem. It also devises cost-based techniques to balance the workload, and develops a filter-verification framework for efficient and scalable search and join.
【Keywords】:
【Paper Link】 【Pages】:1685-1688
【Authors】: Nicola Fiorentino ; Sergio Greco ; Cristian Molinaro ; Irina Trubitsyna
【Abstract】: Incomplete information arises in many current database applications. Certain answers are a widely accepted semantics of query answering over incomplete databases. Since their computation is a coNP-hard problem, recent research has focused on developing polynomial-time approximation algorithms computing a sound (but possibly incomplete) set of certain answers. In this demo we showcase ACID, a system to compute sound sets of certain answers. The central tools of its underlying algorithms are conditional tables and the conditional evaluation of relational algebra. Different evaluation strategies can be applied, with more accurate ones having higher complexity but returning more certain answers. We show how to query incomplete databases using the ACID system, which offers a suite of approximation algorithms enabling users to choose the technique that best meets their needs in terms of balance between efficiency and quality of the approximation.
【Keywords】: approximation algorithm; certain query answer; incomplete database
【Paper Link】 【Pages】:1689-1692
【Authors】: Ibrahim Sabek ; Mashaal Musleh ; Mohamed F. Mokbel
【Abstract】: This demo presents Sya, the first full-fledged spatial probabilistic knowledge base construction system. Sya is a comprehensive extension of the DeepDive system that exploits the spatial relationships between extracted relations during the knowledge base construction process, and hence produces a better knowledge base. Sya runs existing DeepDive programs as is, yet extracts more accurate relations than DeepDive when dealing with input data that has spatial attributes. Sya employs a simple spatial high-level language, a rule-based spatial SQL query engine, a spatially-indexed probabilistic graphical model, and an adapted spatial statistical inference technique to infer the factual scores of relations. We demonstrate a real system prototype of Sya through a case study of constructing a crime knowledge base. The demonstration shows the audience the internal steps of building the knowledge base, as well as a comparison with the output of DeepDive.
【Keywords】: knowledge base construction; spatial probabilistic knowledge bases
【Paper Link】 【Pages】:1693-1696
【Authors】: Harish Doraiswamy ; Eleni Tzirita Zacharatou ; Fábio Miranda ; Marcos Lage ; Anastasia Ailamaki ; Cláudio T. Silva ; Juliana Freire
【Abstract】: The recent explosion in the number and size of spatio-temporal data sets from urban environments and social sensors creates new opportunities for data-driven approaches to understand and improve cities. Visual analytics systems like Urbane aim to empower domain experts to explore multiple data sets, at different time and space resolutions. Since these systems rely on computationally-intensive spatial aggregation queries that slice and summarize the data over different regions, an important challenge is how to attain interactivity. While traditional pre-aggregation approaches support interactive exploration, they are unsuitable in this setting because they do not support ad-hoc query constraints or polygons of arbitrary shapes. To address this limitation, we have recently proposed Raster Join, an approach that converts a spatial aggregation query into a set of drawing operations on a canvas and leverages the rendering pipeline of the graphics hardware (GPU). By doing so, Raster Join evaluates queries on the fly at interactive speeds on commodity laptops and desktops. In this demonstration, we showcase the efficiency of Raster Join by integrating it with Urbane and enabling interactivity. Demo visitors will interact with Urbane to filter and visualize several urban data sets over multiple resolutions.
【Keywords】:
【Paper Link】 【Pages】:1697-1700
【Authors】: Yeshwanth D. Gunasekaran ; Md Farhadur Rahman ; Sona Hasani ; Nan Zhang ; Gautam Das
【Abstract】: Location Based Services (LBS) have become extremely popular over the past decade. Popular LBS run the entire gamut from mapping services (such as Google Maps) to restaurant reviews (such as Yelp) and real-estate search (such as Zillow). The backend databases of these applications can be a rich source of geospatial and commercial information such as Point-Of-Interest (POI) locations, reviews, ratings, user geo-distributions, etc. However, access to the backend database is often restricted by a public query interface (often web-based) provided by the LBS owners. In most cases, the public search interface of these applications can be abstractly modeled as a kNN interface, taking a geolocation (i.e., latitude and longitude) as input and returning the top-k POIs that are closest to the query point, where k is a small constant such as 50 or 100. Because of this restriction, it becomes extremely difficult for third-party users to perform analytics or mining over LBS. We demonstrate DBLOC, a web-based system that enables analytics over an LBS using nothing but limited access to the kNN interface provided by the LBS. Specifically, using DBLOC, users can perform density-based clustering over the backend database of an LBS. Due to the query rate limit constraint (i.e., the maximum number of kNN queries a user/IP address can issue over a specific period of time), it is often impossible to access all the tuples in the backend database of an LBS. Thus, DBLOC aims to mine from the LBS a cluster assignment function f(.), such that for any tuple t in the database (which may or may not have been accessed), f(.) can produce the cluster assignment of t with high accuracy. We also demonstrate how DBLOC enables users to further analyze the discovered clusters in order to mine interesting intra/inter-cluster information.
【Keywords】: clustering; density; density based clustering; hdbscan; knn query; lbs; location based services; space-filling curve; third-party service
【Paper Link】 【Pages】:1701-1704
【Authors】: Mohammadreza Esfandiari ; Kavan Bharat Patel ; Sihem Amer-Yahia ; Senjuti Basu Roy
【Abstract】: We propose to demonstrate CrowdCur, a system that allows platform administrators, requesters, and workers to conduct various analytics of interest. CrowdCur includes a worker curation component that relies on explicit feedback elicitation to best capture workers' preferences, a task curation component that monitors task completion and aggregates their statistics, and an OLAP-style component to query and combine analytics by worker, by task type, etc. Administrators can fine-tune their system's performance. Requesters can compare platforms and better choose the set of workers to target. Workers can compare themselves to others and find tasks and requesters that suit them best.
【Keywords】: constrained optimization; crowdsourcing; crowdsourcing analytics
【Paper Link】 【Pages】:1705-1708
【Authors】: Mohammed Al-Kateb ; Paul Sinclair ; Grace Au ; Sanjay Nair ; Mark Sirek ; Lu Ma ; Mohamed Y. Eltabakh
【Abstract】: The UNION ALL set operator is useful for combining data from multiple sources. With the emergence and prevalence of big data ecosystems in which data is typically stored on multiple systems, UNION ALL has become even more important in many analytical queries. In this project, we demonstrate novel cost-based optimization techniques implemented in Teradata Database for join queries involving UNION ALL views and derived tables. Instead of the naive and traditional way of spooling each UNION ALL branch to a common spool prior to performing join operations, which can be prohibitively expensive, we demonstrate new techniques developed in Teradata Database including: 1) Cost-based pushing of joins into UNION ALL branches, 2) Branch grouping strategy prior to join pushing, 3) Geography adjustment of the pushed relations to avoid unnecessary redistribution or duplication, 4) Iterative join decomposition of a pushed join to multiple joins, and 5) Combining multiple join steps into a single multisource join step. In the demonstration, we use the Teradata Visual Explain tool, which offers a rich set of visual rendering capabilities of query plans, the display of various metadata information for each plan step, and several interactive GUI options for end-users.
【Keywords】: cost-based optimization; joins over union all; query optimization
【Paper Link】 【Pages】:1709-1712
【Authors】: Yuhao Wen ; Xiaodan Zhu ; Sudeepa Roy ; Jun Yang
【Abstract】: Methods for summarizing and diversifying query results have drawn significant attention recently, because they help present query results with many tuples to users in more informative ways. We present QAGView (Quick AGgregate View), which provides a holistic overview of high-valued aggregate query answers in the form of summaries (showing high-level properties that emerge from subsets of answers) with a coverage guarantee (for a user-specified number of top-valued answers) that is both diverse (avoiding overlapping or similar summaries) and relevant (focusing on high-valued aggregate answers). QAGView allows users to view the high-level summaries as clusters, and to expand individual clusters into their constituent result tuples. Users can fine-tune the behavior of QAGView by specifying a number of parameters according to their preferences. To help users choose appropriate parameters interactively, QAGView employs a suite of optimizations that enable a quick preview of how the quality of the summaries changes over wide ranges of parameter settings, as well as real-time visualization of how the summaries evolve in response to parameter updates.
【Keywords】: clustering; data mining; database usability; database visualization
【Paper Link】 【Pages】:1713-1716
【Authors】: Siyuan Wu ; Leong Hou U ; Sourav S. Bhowmick ; Wolfgang Gatterbauer
【Abstract】: Detecting conflicts of interest (COIs) is key to guaranteeing the fairness of a peer-review process. In many conference management systems, the COIs of authors and reviewers are self-declared, and the declaration process is time consuming and potentially incomplete. To address this problem, we demonstrate a novel interactive system called PISTIS that assists the declaration process in a semi-automatic manner. Apart from keyword search and simple filtering, our system provides an interactive graphical interface that helps users explore potential COIs based on heterogeneous data sources. To simplify the declaration process, we also recommend latent COIs using a supervised ranking model that can be iteratively refined with data collected from past declarations. We believe that PISTIS can be useful as an assistant tool in many real-world conference management systems.
【Keywords】: conflict of interest; heterogeneous network; peer review process
【Paper Link】 【Pages】:1717-1720
【Authors】: Thomas Kissinger ; Marcus Hähnel ; Till Smejkal ; Dirk Habich ; Hermann Härtig ; Wolfgang Lehner
【Abstract】: The ever-increasing demand for scalable database systems is limited by their energy consumption, which is one of the major challenges in research today. While existing approaches mainly focused on transaction-oriented disk-based database systems, we are investigating and optimizing the energy consumption and performance of data-oriented scale-up in-memory database systems that make heavy use of the main power consumers, which are processors and main memory. In this demo, we present energy-utility functions as an approach for enabling the operating system to improve the energy efficiency of scalable in-memory database systems. Our highly interactive demo setup mainly allows attendees to switch between multiple DBMS workloads and watch in detail how the system responds by adapting the hardware configuration appropriately.
【Keywords】: adaptivity; elasticity; energy efficiency; in-memory
【Paper Link】 【Pages】:1721-1724
【Authors】: Prajakta Kalmegh ; Harrison Lundberg ; Frederick Xu ; Shivnath Babu ; Sudeepa Roy
【Abstract】: Unpredictability in query runtimes can arise in a shared cluster as a result of resource contentions caused by inter-query interactions. iQCAR (inter-Query Contention AnalyzeR) is a system that formally models these interferences between concurrent queries and provides a framework to attribute blame for contentions. iQCAR leverages a multi-level directed acyclic graph called the iQCAR-Graph to diagnose the aberrations in query schedules that lead to these resource contentions. The demonstration will enable users to perform a step-wise deep exploration of such resource contentions faced by a query at various stages of its execution. The interface will allow users to identify top-k victims and sources of contentions, diagnose high-contention nodes and resources in the cluster, and rank their impacts on the performance of a query. Users will also be able to navigate through a set of rules recommended by iQCAR to compare how the application of each rule by the cluster scheduler resolves the contentions in subsequent executions.
【Keywords】:
【Paper Link】 【Pages】:1725-1728
【Authors】: Sameera Ghayyur ; Yan Chen ; Roberto Yus ; Ashwin Machanavajjhala ; Michael Hay ; Gerome Miklau ; Sharad Mehrotra
【Abstract】: Emerging IoT technologies promise to bring revolutionary changes to many domains including health, transportation, and building management. However, continuous monitoring of individuals threatens privacy. The success of IoT thus depends on integrating privacy protections into IoT infrastructures. This demonstration adapts a recently-proposed system, PeGaSus, which releases streaming data under the formal guarantee of differential privacy, with a state-of-the-art IoT testbed (TIPPERS) located at UC Irvine. PeGaSus protects individuals' data by introducing distortion into the output stream. While PeGaSus has been shown to offer lower numerical error compared to competing methods, assessing the usefulness of the output is application dependent. The goal of the demonstration is to assess the usefulness of private streaming data in a real-world IoT application setting. The demo consists of a game, IoT-Detective, in which participants carry out visual data analysis tasks on private data streams, earning points when they achieve results similar to those on the true data stream. The demo will educate participants about the impact of privacy mechanisms on IoT data while at the same time generating insights into privacy-utility trade-offs in IoT applications.
【Keywords】: differential privacy; iot; pegasus; tippers
【Paper Link】 【Pages】:1729-1732
【Authors】: Mohammad Hossein Namaki ; Yinghui Wu ; Xin Zhang
【Abstract】: We demonstrate GExp, an interactive graph exploration tool that uses keywords to support effective access and exploration of large graphs. GExp interleaves keyword query suggestion, which generates keyword queries that expand the original query, and query evaluation, which returns the answers to suggested queries for feedback. It advocates (1) cost-aware exploration, which suggests keyword queries that have low answer cost (and thus high answer quality), and (2) incremental query evaluation, which updates the query answers within a bounded time cost. It differs from prior systems in its ability to identify and leverage substructures that augment original answers for query expansion and evaluation with provable cost bounds. A unique feature of GExp is that users can trade the quality of query answers against their evaluation cost by tuning the bound on answer cost in an ad-hoc manner. We demonstrate how GExp supports graph exploration with three established keyword query classes, with bounded time cost and guarantees on result quality, using real-world knowledge bases and information networks.
【Keywords】: graph database; keyword search; query expansion; query suggestion
【Paper Link】 【Pages】:1733-1736
【Authors】: Yuyu Luo ; Xuedi Qin ; Nan Tang ; Guoliang Li ; Xinran Wang
【Abstract】: Creating good visualizations is hard for ordinary users, even with the help of state-of-the-art interactive data visualization tools such as Tableau and Qlik, because these tools require users to understand the data and visualizations very well. DeepEye is an innovative visualization system that aims at helping everyone create good visualizations as simply as a Google search. Given a dataset and a keyword query, DeepEye understands the query intent, then generates and ranks good visualizations. The user can pick the one she likes and use faceted navigation to easily browse the candidate visualizations. In this demonstration, the attendees will have the opportunity to experience the following features: (1) visualization recommendation -- our system can automatically recommend meaningful visualizations by learning from existing datasets and known good visualizations; (2) keyword search -- the attendee can pose text queries specifying what visualizations she wants (e.g., trends) without specifying how to generate them; (3) faceted navigation -- one can further refine the results via click-based faceted navigation to find other relevant and interesting visualizations.
【Keywords】: data visualization; faceted navigation; keyword-based visualization search; visualization recommendation
【Paper Link】 【Pages】:1737-1740
【Authors】: Zhongle Xie ; Qingchao Cai ; Fei He ; Gene Yan Ooi ; Weilong Huang ; Beng Chin Ooi
【Abstract】: The tremendous volume of user behavior records generated in various domains provides data analysts with new opportunities to mine valuable insights into user behavior. Cohort analysis, which aims to find user behavioral trends hidden in time series, is one of the most commonly used techniques. Since traditional database systems suffer in both operability and efficiency when processing cohort analysis queries, we proposed COHANA, a query processing system specialized for cohort analysis. To make COHANA easy to use, we present a comprehensive and powerful tool in this demo, covering the major use cases in cohort analysis with intuitive and accessible operations. Analysts can easily adapt COHANA to their own use with the provided visualizations, which can help verify their analysis assumptions and reveal inconspicuous trends hidden in user behavior data.
【Keywords】: cohana; cohort analysis; query processing
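The kind of query COHANA accelerates can be sketched with plain Python: group users into cohorts by the week of their first action, then count who is still active at each cohort age. The tiny (user, week) event log below is invented for illustration:

```python
from collections import defaultdict

# Minimal event log: (user, week-of-activity).
events = [("u1", 0), ("u1", 1), ("u1", 2),
          ("u2", 0), ("u2", 1),
          ("u3", 1), ("u3", 2)]

# Cohort of a user = the week of their first recorded action.
first_week = {}
for user, week in sorted(events, key=lambda e: e[1]):
    first_week.setdefault(user, week)

# (cohort, age) -> set of users active at that cohort age.
retention = defaultdict(set)
for user, week in events:
    cohort = first_week[user]
    retention[(cohort, week - cohort)].add(user)

table = {key: len(users) for key, users in retention.items()}
```

In a relational engine this is a self-join-heavy query, which is the operability and efficiency gap a specialized cohort engine targets.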
【Paper Link】 【Pages】:1741-1744
【Authors】: Miro Mannino ; Azza Abouzied
【Abstract】: Query-by-sketch tools allow users to sketch a pattern to search a time series database for matches. Prior work adopts a bottom-up design approach: the sketching interface is built to reflect the inner workings of popular matching algorithms like dynamic time warping (DTW) or Euclidean distance (ED). We design Qetch, a query-by-sketch tool for time series data, top-down. Users freely sketch patterns on a scale-less canvas. By studying how humans sketch time series patterns, we develop a matching algorithm that accounts for human sketching errors. Qetch's top-down design and novel matching algorithm enable the easy construction of expressive queries that include regular expressions over sketches and queries over multiple time series. Our demonstration showcases Qetch and summarizes results from our evaluation of Qetch's effectiveness.
【Keywords】: regular expressions; scale-less sketches; time series querying by sketching
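For reference, the classic dynamic-time-warping distance that prior query-by-sketch tools use as their matcher (Qetch's own error-aware matcher differs) is a short dynamic program:

```python
def dtw(a, b):
    """Dynamic-time-warping distance between two sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: match, skip in a, skip in b.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

sketch = [0, 1, 2, 1, 0]
series = [0, 0, 1, 2, 2, 1, 0]
# Warping absorbs the repeated samples, so these match perfectly even
# though plain Euclidean distance is undefined for unequal lengths.
```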
【Paper Link】 【Pages】:1745-1748
【Authors】: Anna Fariha ; Sheikh Muhammad Sarwar ; Alexandra Meliou
【Abstract】: Recent expansion of database technology demands a convenient framework for non-expert users to explore datasets. Several approaches exist to assist these non-expert users where they can express their query intent by providing example tuples for their intended query output. However, these approaches treat the structural similarity among the example tuples as the only factor specifying query intent and ignore the richer context present in the data. In this demo, we present SQuID, a system for Semantic similarity-aware Query Intent Discovery. SQuID takes a few example tuples from the user as input, through a simple interface, and consults the database to discover deeper associations among these examples. These data-driven associations reveal the semantic context of the provided examples, allowing SQuID to infer the user's intended query precisely and effectively. SQuID further explains its inference, by displaying the discovered semantic context to the user, who can then provide feedback and tune the result. We demonstrate how SQuID can capture even esoteric and complex semantic contexts, alleviating the need for constructing complex SQL queries, while not requiring the user to have any schema or query language knowledge.
【Keywords】: query by example; semantic similarity; sql query discovery
【Paper Link】 【Pages】:1749-1752
【Authors】: Ashish R. Mittal ; Jaydeep Sen ; Diptikalyan Saha ; Karthik Sankaranarayanan
【Abstract】: In this paper, we extend the state-of-the-art NLIDB system and present a dialog interface to relational databases. The dialog interface enables users to automatically exploit the semantic context of the conversation while asking natural language queries over an RDBMS, thereby making it simpler to express complex questions in a natural, piece-wise manner. We propose novel ontology-driven techniques for addressing each of the dialog-specific challenges, such as co-reference resolution, ellipsis resolution, and query disambiguation, and use them in determining the overall intent of the user query. We demonstrate the applicability and usefulness of the dialog interface over two different domains, viz., finance and healthcare.
【Keywords】: database; dialog; natural language interface; ontology
【Paper Link】 【Pages】:1753-1756
【Authors】: Xinbo Zhang ; Lei Zou
【Abstract】: RDF Question/Answering (Q/A) systems interpret a user's question N as a SPARQL query Q and return the answer set $Q(D)$ over an RDF repository D. However, due to the complexity of linking natural language phrases to specific RDF items (e.g., entities and predicates), it remains difficult to understand users' questions precisely; hence $Q(D)$ may not meet users' expectations, offering wrong answers and missing some correct ones. In this demo, we design an Interactive Mechanism aiming for PROmotion Via feedback to Q/A systems (IMPROVE-QA), a whole platform to make existing Q/A systems return more precise answers (denoted as $\mathcal{Q}^\prime(D)$) to users. Based on the user's feedback over $Q(D)$, IMPROVE-QA automatically refines the original query Q into a new query graph $\mathcal{Q}^\prime$ with minimum modifications, where $\mathcal{Q}^\prime(D)$ provides more precise answers. We will also demonstrate how IMPROVE-QA can apply the "lesson" learned from the user in each query to improve the precision of Q/A systems on subsequent natural language questions.
【Keywords】: interactive query; learning from users; q/a; rdf; self-correction
【Paper Link】 【Pages】:1757-1760
【Authors】: Michele Linardi ; Yan Zhu ; Themis Palpanas ; Eamonn J. Keogh
【Abstract】: Data series motif discovery represents one of the most useful primitives for data series mining, with applications to many domains, such as robotics, entomology, seismology, medicine, and climatology. The state-of-the-art motif discovery tools still require the user to provide the motif length. Yet, in several cases, the choice of motif length is critical for their detection. Unfortunately, the obvious brute-force solution, which tests all lengths within a given range, is computationally untenable, and does not provide any support for ranking motifs at different resolutions (i.e., lengths). We demonstrate VALMOD, our scalable motif discovery algorithm that efficiently finds all motifs in a given range of lengths, and outputs a length-invariant ranking of motifs. Furthermore, we support the analysis process by means of a newly proposed meta-data structure that helps the user to select the most promising pattern length. This demo aims at illustrating in detail the steps of the proposed approach, showcasing how our algorithm and corresponding graphical insights enable users to efficiently identify the correct motifs.
【Keywords】:
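The brute-force baseline the abstract calls computationally untenable can be sketched in a few lines: for every length in the range, compare every non-overlapping pair of subsequences. The sqrt(L) normalization below mirrors the length-invariant ranking idea; VALMOD's actual algorithm and pruning differ.

```python
import math

def best_motif(series, min_len, max_len):
    """Brute-force motif discovery over a range of lengths: for each
    length L, find the closest non-overlapping pair of subsequences.
    Distances are divided by sqrt(L) so motifs of different lengths
    are comparable. Cost is O(n^2) distance computations per length,
    which is the bottleneck VALMOD is designed to avoid."""
    best_d, best_at = float("inf"), None
    for L in range(min_len, max_len + 1):
        n = len(series) - L + 1
        for i in range(n):
            for j in range(i + L, n):  # non-overlapping pairs only
                d = math.dist(series[i:i+L], series[j:j+L]) / math.sqrt(L)
                if d < best_d:
                    best_d, best_at = d, (L, i, j)
    return best_d, best_at
```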
【Paper Link】 【Pages】:1761-1764
【Authors】: Xiaoyu Ge ; Panos K. Chrysanthis ; Konstantinos Pelechrinis ; Demetrios Zeinalipour-Yazti
【Abstract】:
【Keywords】:
【Paper Link】 【Pages】:1765-1768
【Authors】: Fuat Basik ; Benjamin Hättasch ; Amir Ilkhechi ; Arif Usta ; Shekar Ramaswamy ; Prasetya Utama ; Nathaniel Weir ; Carsten Binnig ; Ugur Çetintemel
【Abstract】: In this demo, we present DBPal, a novel data exploration tool with a natural language interface. DBPal leverages recent advances in deep models to make query understanding more robust in the following ways: First, DBPal uses novel machine translation models to translate natural language statements to SQL, making the translation process more robust to paraphrasing and linguistic variations. Second, to support the users in phrasing questions without knowing the database schema and the query features, DBPal provides a learned auto-completion model that suggests to users partial query extensions during query formulation and thus helps to write complex queries.
【Keywords】: natural language to sql; nlidb; relational database; robust natural language interface
【Paper Link】 【Pages】:1769-1772
【Authors】: Gunce Su Yilmaz ; Tana Wattanawaroon ; Liqi Xu ; Abhishek Nigam ; Aaron J. Elmore ; Aditya G. Parameswaran
【Abstract】: Interest in collaborative dataset versioning has emerged due to the complex, ad-hoc, and collaborative nature of data science, and the need to record and reason about data at various stages of pre-processing, cleaning, and analysis. To support effective collaborative dataset versioning, one critical operation is differentiation: to succinctly describe what has changed from one dataset to the next. Differentiation, or diffing, allows users to understand changes between two versions, to better understand the evolution process, or to support effective merging or conflict detection across versions. We demonstrate DataDiff, a practical and concise data-diff tool that provides human-interpretable explanations of changes between datasets without reliance on the operations that led to the changes.
【Keywords】: differentiation; versioning
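An operation-free diff boils down to comparing the two versions directly. A minimal multiset-based sketch follows; DataDiff's human-interpretable explanations go well beyond raw row deltas like these:

```python
from collections import Counter

def data_diff(old_rows, new_rows):
    """Describe what changed between two dataset versions without
    knowing the operations that produced them: rows added and rows
    removed, computed as multiset differences so duplicate rows are
    handled correctly. Rows must be hashable (e.g., tuples)."""
    old, new = Counter(old_rows), Counter(new_rows)
    added = list((new - old).elements())
    removed = list((old - new).elements())
    return added, removed
```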
【Paper Link】 【Pages】:1773-1776
【Authors】: Ke Yang ; Julia Stoyanovich ; Abolfazl Asudeh ; Bill Howe ; H. V. Jagadish ; Gerome Miklau
【Abstract】: Algorithmic decisions often result in scoring and ranking individuals to determine creditworthiness, qualifications for college admissions and employment, and compatibility as dating partners. While automatic and seemingly objective, ranking algorithms can discriminate against individuals and protected groups, and exhibit low diversity. Furthermore, ranked results are often unstable -- small changes in the input data or in the ranking methodology may lead to drastic changes in the output, making the result uninformative and easy to manipulate. Similar concerns apply in cases where items other than individuals are ranked, including colleges, academic departments, or products. Despite the ubiquity of rankers, there is, to the best of our knowledge, no technical work that focuses on making rankers transparent. In this demonstration we present Ranking Facts, a Web-based application that generates a "nutritional label" for rankings. Ranking Facts is made up of a collection of visual widgets that implement our latest research results on fairness, stability, and transparency for rankings, and that communicate details of the ranking methodology, or of the output, to the end user. We will showcase Ranking Facts on real datasets from different domains, including college rankings, criminal risk assessment, and financial services.
【Keywords】: accountability; data ethics; data, responsibly; diversity; fairness; ranking; stability; transparency
【Paper Link】 【Pages】:1777-1780
【Authors】: Haoci Zhang ; Viraj Raj ; Thibault Sellam ; Eugene Wu
【Abstract】: Building interactive tools to support data analysis is hard because it is not always clear what to build and how to build it. To address this problem, we present Precision Interfaces, a semi-automatic system to generate task-specific data analytics interfaces. Precision Interfaces can turn a log of executed programs into an interface, by identifying micro-variations between the programs and mapping them to interface components. This paper focuses on SQL query logs, but we can generalize the approach to other languages. Our system operates in two steps: it first builds an interaction graph, which describes how the queries can be transformed into each other. Then, it finds a set of UI components that covers a maximal number of transformations. To restrict the domain of changes to be detected, our system uses a domain-specific language, PILang. We describe each of Precision Interfaces' components, showcase an early prototype on real program logs, and discuss future research opportunities. This demonstration highlights the potential for data-driven interactive interface mining from query logs. We will first walk participants through the process that Precision Interfaces goes through to generate interactive analysis interfaces from query logs. We will then show the versatility of Precision Interfaces by letting participants choose from multiple different interface modalities, interaction designs, and query logs to generate 2D point-and-click, gestural, and even natural language analysis interfaces for commonly performed analyses.
【Keywords】: interface generation; multi-modal interaction; query log analysis
【Paper Link】 【Pages】:1781-1784
【Authors】: Fotis Psallidas ; Eugene Wu
【Abstract】: Data lineage is a fundamental type of information that describes the relationships between input and output data items in a workflow. As such, an immense amount of data-intensive applications with logic over the input-output relationships can be expressed declaratively in lineage terms. Unfortunately, many applications resort to hand-tuned implementations because either lineage systems are not fast enough to meet their requirements or due to no knowledge of the lineage capabilities. Recently, we introduced a set of implementation design principles and associated techniques to optimize lineage-enabled database engines and realized them in our prototype database engine, namely, Smoke. In this demonstration, we showcase lineage as the building block across a variety of data-intensive applications, including tooltips and details on demand; crossfilter; and data profiling. In addition, we show how Smoke outperforms alternative lineage systems to meet or improve on existing hand-tuned implementations of these applications.
【Keywords】: databases; interactive visualizations; provenance
【Paper Link】 【Pages】:1785-1788
【Authors】: Yeye He ; Kris Ganjam ; Kukjin Lee ; Yue Wang ; Vivek R. Narasayya ; Surajit Chaudhuri ; Xu Chu ; Yudian Zheng
【Abstract】: Business analysts and data scientists today increasingly need to clean, standardize and transform diverse data sets, such as name, address, date time, phone number, etc., before they can perform analysis. These ad-hoc transformation problems are typically solved by one-off scripts, which is both difficult and time-consuming. Our observation is that these domain-specific transformation problems have long been solved by developers with code libraries, which are often shared in places like GitHub. We thus develop an extensible data transformation system called Transform-Data-by-Example (TDE) that can leverage rich transformation logic in source code, DLLs, web services and mapping tables, so that end-users only need to provide a few (typically 3) input/output examples, and TDE can synthesize desired programs using relevant transformation logic from these sources. The beta version of TDE was released in Office Store for Excel.
【Keywords】: data cleaning; data integration; data preparation; etl
【Paper Link】 【Pages】:1789-1792
【Authors】: Mohamed S. Hassan ; Tatiana Kuznetsova ; Hyun Chai Jeong ; Walid G. Aref ; Mohammad Sadoghi
【Abstract】: The maturity of RDBMSs has motivated academia and industry to invest efforts in leveraging RDBMSs for graph processing, where efficiency is proven for vital graph queries. However, none of these efforts process graphs natively inside the RDBMS, which is particularly challenging due to the impedance mismatch between the relational and the graph models. In this demonstration, we present GRFusion, an in-memory relational database system, where graphs are managed as first-class citizens. GRFusion is realized inside VoltDB. The SQL and query engines of VoltDB are empowered to declaratively define graphs and execute cross-data-model query plans that consist of relational operators and newly-introduced graph operators. Using a social network and a real continental-sized road network covering the entire U.S., we demonstrate the functionality and the performance of GRFusion in evaluating queries that reference both relational tables and graphs seamlessly in the same query execution pipeline. GRFusion shows up to four orders-of-magnitude speed-up in query-time w.r.t. state-of-the-art approaches.
【Keywords】: cross-data-model query plans; graph queries; main-memory databases; path traversals
【Paper Link】 【Pages】:1793-1796
【Authors】: Ziheng Wei ; Sebastian Link
【Abstract】: We showcase the first semantic data profiler, DataProf. For the constraint class of interest, current profilers compute all constraints that hold on the given data set. DataProf also computes perfect sample records that together satisfy the same constraints as the given data set. Such perfect samples make it easier to spot violations of business rules, which experts can cleanse. This novel iterative process of discovery and sampling facilitates the interaction of experts with the data, and provides the key to improving data quality and business rule acquisition. The demonstration will exemplify the process on a real-world data set and the novel class of embedded uniqueness constraints. The audience will experience how DataProf guides them in cleansing data and discovering business rules.
【Keywords】: armstrong relation; business rule; constraint; data cleansing; data profiling; missing data; sql; uniqueness
【Paper Link】 【Pages】:1797-1800
【Authors】: Jie Song ; Danai Koutra ; Murali Mani ; H. V. Jagadish
【Abstract】: Data integration is frequently required to obtain the full value of data from multiple sources. In spite of extensive research on tools to assist users, data integration remains hard, particularly for users with limited technical proficiency. To address this barrier, we study how much we can do with no user guidance. Our vision is that the user should merely specify two input datasets to be joined and get a meaningful integrated result. It turns out that our vision can be realized if the system can correctly determine the join key, for example based on domain knowledge. We demonstrate this notion by considering a broad domain: socioeconomic data aggregated by geography, a widespread category that accounts for 80% of the data published by government agencies. Intuitively two such datasets can be integrated by joining on the geographic unit column. Although it sounds easy, this task has many challenges: How can we automatically identify columns corresponding to geographic units, other dimension variables and measure variables, respectively? If multiple geographic types exist, which one should be chosen for the join? How to join tables with idiosyncratic schema, different geographic units of aggregation or no aggregation at all? We have developed GeoFlux, a data integration system that handles all these challenges and joins tabular data by automatically aggregating geographic information with a new, advanced crosswalk algorithm. In this demo paper, we overview the architecture of the system and its user-friendly interfaces, and then demonstrate via a real-world example that it is general, fully automatic and easy-to-use. In the demonstration, we invite users to interact with GeoFlux to integrate more sample socioeconomic data from data.ny.gov.
【Keywords】: automatic data integration; crosswalk; geographic data; multi-dimensional aggregate interpolation; socioeconomic data
【Paper Link】 【Pages】:1801-1804
【Authors】: Pei Wang ; Yongjun He ; Ryan Shea ; Jiannan Wang ; Eugene Wu
【Abstract】: Data scientists often spend more than 80% of their time on data preparation. Data enrichment, the act of extending a local database with new attributes from external data sources, is among the most time-consuming tasks. Existing data enrichment works are resource intensive: data-intensive by relying on web tables or knowledge bases, monetarily-intensive by purchasing entire datasets, or time-intensive by fully crawling a web-based data source. In this work, we explore a more targeted alternative that uses resources (in terms of web API calls) proportional to the size of the local database of interest. We build Deeper, a data enrichment system powered by the deep web. The goal of Deeper is to help data scientists to link a local database to a hidden database so that they can easily enrich the local database with the attributes from the hidden database. We find that a challenging problem is how to crawl a hidden database. This is different from a typical deep web crawling problem, whose goal is to crawl the entire hidden database rather than only the content relating to the data enrichment task. We demonstrate the limitations of straightforward solutions and propose an effective new crawling strategy. We also present the Deeper system architecture and discuss how to implement each component. During the demo, we will use Deeper to enrich a publication database and aim to show that (1) Deeper is an end-to-end data enrichment solution, and (2) the proposed crawling strategy is superior to the straightforward ones.
【Keywords】: data enrichment; deep web; entity resolution
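The "resources proportional to the local database" idea can be illustrated by the straightforward per-record lookup that such a system improves upon. Here `api_search` and the record fields are assumptions standing in for the hidden database's search interface:

```python
def enrich(local_records, api_search, key=lambda r: r["title"]):
    """Naive targeted enrichment sketch: issue one hidden-database
    query per local record (so API cost grows with the local database,
    not the hidden one) and merge the top hit's attributes into the
    record. Deeper's crawling strategy is smarter than this baseline."""
    enriched = []
    for rec in local_records:
        hits = api_search(key(rec))
        enriched.append({**rec, **hits[0]} if hits else dict(rec))
    return enriched
```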
【Paper Link】 【Pages】:1805-1807
【Authors】: Walter Cai
【Abstract】:
【Keywords】:
【Paper Link】 【Pages】:1809-1811
【Authors】: Siddharth Bhatia
【Abstract】: Two important metrics used to characterise a graph are its triangle count and clustering coefficient. In this paper, we present methods to approximate these metrics for graphs.
【Keywords】: clustering coefficient; graph characteristics; large-scale graphs; triangle count
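The two metrics being approximated have simple exact definitions; approximation becomes necessary precisely when enumeration like the following is too slow on large graphs:

```python
from itertools import combinations

def triangles_and_clustering(adj):
    """Exact triangle count and global clustering coefficient for an
    undirected graph given as {node: set(neighbors)}. The coefficient
    is 3 * triangles / wedges, where a wedge is a path of length two."""
    triangle_ends = 0
    wedges = 0
    for v, nbrs in adj.items():
        d = len(nbrs)
        wedges += d * (d - 1) // 2
        for u, w in combinations(nbrs, 2):
            if w in adj[u]:
                triangle_ends += 1
    triangles = triangle_ends // 3  # each triangle counted at all 3 vertices
    coeff = 3 * triangles / wedges if wedges else 0.0
    return triangles, coeff
```

This is O(sum of d^2) work; sampling-based estimators trade exactness for scalability on large-scale graphs.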
【Paper Link】 【Pages】:1813-1815
【Authors】: Roee Shraga
【Abstract】:
【Keywords】: data integration; human-machine interaction; learn-to-rank; schema matching
【Paper Link】 【Pages】:1817-1819
【Authors】: Michael Günther
【Abstract】:
【Keywords】: nearest-neighbor search; postgresql; product quantization; relational database; word embeddings
【Paper Link】 【Pages】:1821-1823
【Authors】: Haolin Yu
【Abstract】:
【Keywords】: message queue; shared logs
【Paper Link】 【Pages】:1825-1827
【Authors】: Oren Kalinsky
【Abstract】: Harnessing the potential of the Semantic Web for building knowledgeable machines entails the ability to understand RDF graphs and integrate them with applications. Analyzing this vast information with common tools requires skill and time. Toward that end, we develop ELinda, an explorer for linked data. ELinda enables the understanding of the rich content stored in an RDF graph, via a visual query language for interactive exploration. The focus is on rich and open-domain datasets where it is especially challenging to detect the precise value to the application at hand. In essence, the model is based on the concept of a bar chart that depicts the distribution of a focus set of nodes (URIs), and each bar can be expanded to a new bar chart for further exploration. Three types of expansions are supported: subclass, property, and object. Under the hood, our visual query language is compiled into SPARQL. Yet, these queries require prohibitively long execution times on a standard SPARQL engine. To address this challenge, we develop a specialized query engine that is based on the concept of a worst-case-optimal join algorithm. The novel query engine provides a speedup of 1-2 orders of magnitude compared to standard SPARQL engines, and thereby facilitates the practical implementation of ELinda.
【Keywords】:
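Each bar-chart expansion compiles to an aggregate SPARQL query. A plausible sketch of the "property" expansion as a query builder is shown below; the query shape is an assumption, and ELinda's actual compiled queries may differ:

```python
def property_expansion(focus_class_uri):
    """Build SPARQL for a 'property' expansion: one bar per property,
    sized by how many instances of the focus class use that property."""
    return (
        "SELECT ?p (COUNT(DISTINCT ?s) AS ?bar)\n"
        "WHERE {\n"
        f"  ?s a <{focus_class_uri}> .\n"
        "  ?s ?p ?o .\n"
        "}\n"
        "GROUP BY ?p\n"
        "ORDER BY DESC(?bar)"
    )
```

Queries of this shape join every focus instance with every property it carries, which is where a worst-case-optimal join engine pays off on open-domain data.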
【Paper Link】 【Pages】:1829-1831
【Authors】: Soyee Choi
【Abstract】:
【Keywords】: offloading; sqlite; ssd; ssd as sqlite
【Paper Link】 【Pages】:1833-1835
【Authors】: Yuxing Chen
【Abstract】: In the modern data management ecosystem, one of the greatest challenges is data variety. Data comes in multiple formats, such as relational and (semi-)structured data. A traditional database handles a single data format, so its ability to deal with different formats is limited. To overcome this limitation, we propose a multi-model processing framework for relational and semi-structured data (i.e., XML), and design a worst-case optimal join algorithm. The salient feature of our algorithm is that it guarantees the intermediate results are no larger than the worst-case join results. Preliminary results show that our multi-model algorithm significantly outperforms the baseline join methods in terms of running time and intermediate result size.
【Keywords】: relational; size bound; worst-case optimal; xml
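The worst-case optimal join idea can be illustrated on the classic triangle query, binding one variable at a time and intersecting the candidate sets of every relation that mentions it. This is a generic relational sketch, not the paper's relational-plus-XML algorithm:

```python
from collections import defaultdict

def triangle_join(R, S, T):
    """Worst-case optimal style evaluation of the triangle query
    Q(a,b,c) :- R(a,b), S(b,c), T(a,c). Because each variable is bound
    by intersecting all relations that mention it, no intermediate
    result can exceed the query's worst-case output size."""
    Rab, Sbc, Tac = defaultdict(set), defaultdict(set), defaultdict(set)
    for a, b in R: Rab[a].add(b)
    for b, c in S: Sbc[b].add(c)
    for a, c in T: Tac[a].add(c)
    out = []
    for a in Rab.keys() & Tac.keys():   # bind a: R and T mention it
        for b in Rab[a] & Sbc.keys():   # bind b: R and S mention it
            for c in Sbc[b] & Tac[a]:   # bind c: S and T mention it
                out.append((a, b, c))
    return out
```

A pairwise plan (R join S, then join T) can materialize a quadratic intermediate on the same inputs; the variable-at-a-time plan never does.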
【Paper Link】 【Pages】:1837-1838
【Authors】: Mark Raasveldt
【Abstract】:
【Keywords】: embedded databases; machine learning; olap
【Paper Link】 【Pages】:1839-1841
【Authors】: Thomas Lively ; Luca Schroeder ; Carlos Mendizábal
【Abstract】: Modern persistent key-value stores typically use a log-structured merge-tree (LSM-tree) design, which allows for high write throughput. Our observation is that the LSM-tree, however, has suboptimal performance during read-intensive workload windows with non-uniform key access distributions. To address this shortcoming, we propose and analyze a simple decision scheme that can be added to any LSM-based key-value store and dramatically reduce the number of disk I/Os for these classes of workloads. The key insight is that copying a frequently accessed key to the top of an LSM-tree ("splaying'') allows cheaper reads on that key in the near future.
【Keywords】: lsm-tree; rocksdb; splaying
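The splaying decision scheme described above can be sketched with a toy in-memory model. Levels are plain dicts and the threshold policy is an assumption for illustration; this is not RocksDB's implementation:

```python
class SplayingLSM:
    """Toy model of LSM 'splaying': a read that finds its key deep in
    the tree counts the access, and once a key is hot enough its
    key-value pair is copied to the top level, so near-future reads on
    it stop early instead of probing every level."""
    def __init__(self, levels, hot_threshold=3):
        self.levels = levels                # list of dicts, top level first
        self.hot_threshold = hot_threshold  # reads before a key is splayed
        self.read_counts = {}

    def get(self, key):
        """Return (value, depth found at), or (None, -1) if absent."""
        for depth, level in enumerate(self.levels):
            if key in level:
                n = self.read_counts.get(key, 0) + 1
                self.read_counts[key] = n
                if depth > 0 and n >= self.hot_threshold:
                    self.levels[0][key] = level[key]  # splay to the top
                return level[key], depth
        return None, -1
```

In a disk-backed store, each level probe avoided by the top-level copy is a disk I/O saved, which is where the gains on skewed read workloads come from.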
【Paper Link】 【Pages】:1843-1845
【Authors】: Gábor Szárnyas
【Abstract】: Graph processing challenges are common in modern database systems, with the property graph data model gaining widespread adoption. Due to the novelty of the field, graph databases and frameworks typically provide their own query language, such as Cypher for Neo4j, Gremlin for TinkerPop and GraphScript for SAP HANA. These languages often lack a formal background for their data model and semantics. To address this, the openCypher initiative aims to standardise a subset of the Cypher language, for which it currently provides grammar specification and a set of acceptance tests to allow vendors to implement their openCypher compatible engine.
【Keywords】: cypher; graph queries; incremental view maintenance; opencypher; property graphs