Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015. ACM 【DBLP Link】
【Paper Link】 【Pages】:1
【Authors】: Jignesh M. Patel
【Abstract】: Data analytics platforms today largely employ data processing kernels (e.g. implementation of selection and join operator algorithms) that were developed for a now bygone hardware era. Hardware has made fundamental shifts in recent years, driven by the need to consider energy as a first-class design parameter. Consequently, across the processor-IO hierarchy, the hardware paradigm today looks very different than it did just a few years ago. I argue that because of this shift, we are now building a 'deficit' between the pace at which the hardware is evolving and the pace that is demanded of data processing kernels to keep up with the growth of big data. This deficit is unsustainable in the long run. One way to 'pay off' this deficit is to have hardware and software co-evolve to exploit the full potential of the hardware. I will provide some examples of recent work from our Wisconsin Quickstep project that demonstrates the merit of this line of thinking. I'll focus on analytical data processing environments, and argue that our new way of storing data, called BitWeaving, and flattening databases into BitWeaved de-normalized tables, which we call WideTables, provides a dramatic new way to build 'sustainable' analytical data processing systems. I will also discuss the implications of our approach on future hardware-software co-design for data analytics platforms.
【Keywords】: analytics; hardware-software co-design; intra-cycle parallelism
【Paper Link】 【Pages】:3-16
【Authors】: Ying Yan ; Jiaxing Zhang ; Bojun Huang ; Xuzhan Sun ; Jiaqi Mu ; Zheng Zhang ; Thomas Moscibroda
【Abstract】: Computing outliers and related statistical aggregation functions from large-scale big data sources is a critical operation in many cloud computing scenarios, e.g. service quality assurance, fraud detection, or novelty discovery. Such problems commonly have to be solved in a distributed environment where each node only has a local slice of the entirety of the data. To process a query on the global data, each node must transmit its local slice of data or an aggregated subset thereof to a global aggregator node, which can then compute the desired statistical aggregation function. In this context, reducing the total communication cost is often critical to the overall efficiency. In this paper, we show both theoretically and empirically that these communication costs can be significantly reduced for common distributed computing problems if we take advantage of the fact that production-level big data usually exhibits a form of sparse structure. Specifically, we devise a new aggregation paradigm for outlier detection and related queries. The paradigm leverages compressive sensing for data sketching in combination with outlier detection techniques. We further propose an algorithm that works even for non-sparse data that concentrates around an unknown value. In both cases, we show that the communication cost is reduced to the logarithm of the global data size. We incorporate our approach into Hadoop and evaluate it on real web-scale production data (distributed click-data logs). Our approach reduces data shuffling IO by up to 99%, and end-to-end job duration by up to 40% on many actual production queries.
【Keywords】: big sparse data; compressive sensing; distributed aggregation; outlier detection
【Paper Link】 【Pages】:17-30
【Authors】: Erfan Zamanian ; Carsten Binnig ; Abdallah Salama
【Abstract】: Parallel database systems horizontally partition large amounts of structured data in order to provide parallel data processing capabilities for analytical workloads in shared-nothing clusters. One major challenge when horizontally partitioning large amounts of data is to reduce the network costs for a given workload and a database schema. A common technique to reduce the network costs in parallel database systems is to co-partition tables on their join key in order to avoid expensive remote join operations. However, existing partitioning schemes are limited in that respect since only subsets of tables in complex schemata sharing the same join key can be co-partitioned unless tables are fully replicated. In this paper we present a novel partitioning scheme called predicate-based reference partition (or PREF for short) that allows to co-partition sets of tables based on given join predicates. Moreover, based on PREF, we present two automatic partitioning design algorithms to maximize data-locality. One algorithm only needs the schema and data whereas the other algorithm additionally takes the workload as input. In our experiments we show that our automated design algorithms can partition database schemata of different complexity and thus help to effectively reduce the runtime of queries under a given workload when compared to existing partitioning approaches.
【Keywords】: partitioning schemes
【Paper Link】 【Pages】:31-46
【Authors】: Ziqiang Feng ; Eric Lo ; Ben Kao ; Wenjian Xu
【Abstract】: Scan and lookup are two core operations in main memory column stores. A scan operation scans a column and returns a result bit vector that indicates which records satisfy a filter. Once a column scan is completed, the result bit vector is converted into a list of record numbers, which is then used to look up values from other columns of interest for a query. Recently there are several in-memory data layout proposals that aim to improve the performance of in-memory data processing. However, these solutions all stand at either end of a trade-off --- each is either good in lookup performance or good in scan performance, but not both. In this paper we present ByteSlice, a new main memory storage layout that supports both highly efficient scans and lookups. ByteSlice is a byte-level columnar layout that fully leverages SIMD data-parallelism. Micro-benchmark experiments show that ByteSlice achieves a data scan speed at less than 0.5 processor cycle per column value --- a new limit of main memory data scan, without sacrificing lookup performance. Our experiments on TPC-H data and real data show that ByteSlice offers significant performance improvement over all state-of-the-art approaches.
【Keywords】: column store; main memory; olap; simd; storage layout
【Paper Link】 【Pages】:47-61
【Authors】: Alexander Alexandrov ; Andreas Kunft ; Asterios Katsifodimos ; Felix Schüler ; Lauritz Thamsen ; Odej Kao ; Tobias Herb ; Volker Markl
【Abstract】: The appeal of MapReduce has spawned a family of systems that implement or extend it. In order to enable parallel collection processing with User-Defined Functions (UDFs), these systems expose extensions of the MapReduce programming model as library-based dataflow APIs that are tightly coupled to their underlying runtime engine. Expressing data analysis algorithms with complex data and control flow structure using such APIs reveals a number of limitations that impede programmer's productivity. In this paper we show that the design of data analysis languages and APIs from a runtime engine point of view bloats the APIs with low-level primitives and affects programmer's productivity. Instead, we argue that an approach based on deeply embedding the APIs in a host language can address the shortcomings of current data analysis languages. To demonstrate this, we propose a language for complex data analysis embedded in Scala, which (i) allows for declarative specification of dataflows and (ii) hides the notion of data-parallelism and distributed runtime behind a suitable intermediate representation. We describe a compiler pipeline that facilitates efficient data-parallel processing without imposing runtime engine-bound syntactic or semantic restrictions on the structure of the input programs. We present a series of experiments with two state-of-the-art systems that demonstrate the optimization potential of our approach.
【Keywords】: control flow; data-parallel execution; large-scale data analysis; mapreduce; monad comprehensions; scala macros
【Paper Link】 【Pages】:63-78
【Authors】: Shumo Chu ; Magdalena Balazinska ; Dan Suciu
【Abstract】: Big data analytics often requires processing complex queries using massive parallelism, where the main performance metrics is the communication cost incurred during data reshuffling. In this paper, we describe a system that can compute efficiently complex join queries, including queries with cyclic joins, on a massively parallel architecture. We build on two independent lines of work for multi-join query evaluation: a communication-optimal algorithm for distributed evaluation, and a worst-case optimal algorithm for sequential evaluation. We evaluate these algorithms together, then describe novel, practical optimizations for both algorithms.
【Keywords】: join query evaluation; parallel database system
【Paper Link】 【Pages】:79-91
【Authors】: Tarek Elgamal ; Maysam Yabandeh ; Ashraf Aboulnaga ; Waleed Mustafa ; Mohamed Hefeeda
【Abstract】: Web sites, social networks, sensors, and scientific experiments currently generate massive amounts of data. Owners of this data strive to obtain insights from it, often by applying machine learning algorithms. Many machine learning algorithms, however, do not scale well to cope with the ever increasing volumes of data. To address this problem, we identify several optimizations that are crucial for scaling various machine learning algorithms in distributed settings. We apply these optimizations to the popular Principal Component Analysis (PCA) algorithm. PCA is an important tool in many areas including image processing, data visualization, information retrieval, and dimensionality reduction. We refer to the proposed optimized PCA algorithm as scalable PCA, or sPCA. sPCA achieves scalability via employing efficient large matrix operations, effectively leveraging matrix sparsity, and minimizing intermediate data. We implement sPCA on the widely-used MapReduce platform and on the memory-based Spark platform. We compare sPCA against the closest PCA implementations, which are the ones in Mahout/ MapReduce and MLlib/Spark. Our experiments show that sPCA outperforms both Mahout-PCA and MLlib-PCA by wide margins in terms of accuracy, running time, and volume of intermediate data generated during the computation.
【Keywords】: hadoop; high-dimensional pca; principal component analysis; scalable machine learning; spark pca; sparse pca
【Paper Link】 【Pages】:93-105
【Authors】: Lele Yu ; Yingxia Shao ; Bin Cui
【Abstract】: Distributed matrix computation is a popular approach for many large-scale data analysis and machine learning tasks. However existing distributed matrix computation systems generally incur heavy communication cost during the runtime, which degrades the overall performance. In this paper, we propose a novel matrix computation system, named DMac, which exploits the matrix dependencies in matrix programs for efficient matrix computation in the distributed environment. We decompose each matrix program into a sequence of operations, and reveal the matrix dependencies between operations in the program. We next design a dependency-oriented cost model to select an optimal execution strategy for each operation, and generate a communication efficient execution plan for the matrix computation program. To facilitate the matrix computation in distributed systems, we further divide the execution plan into multiple un-interleaved stages which can run in a distributed cluster with efficient local execution strategy on each worker. The DMac system has been implemented on a popular general-purpose data processing framework, Spark. The experimental results demonstrate that our techniques can significantly improve the performance of a wide range of matrix programs.
【Keywords】: dependency analysis; distributed system; matrix computing
【Paper Link】 【Pages】:107-122
【Authors】: Christina Teflioudi ; Rainer Gemulla ; Olga Mykytiuk
【Abstract】: We study the problem of efficiently retrieving large entries in the product of two given matrices, which arises in a number of data mining and information retrieval tasks. We focus on the setting where the two input matrices are tall and skinny, i.e., with millions of rows and tens to hundreds of columns. In such settings, the product matrix is large and its complete computation is generally infeasible in practice. To address this problem, we propose the LEMP algorithm, which efficiently retrieves only the large entries in the product matrix without actually computing it. LEMP maps the large-entry retrieval problem to a set of smaller cosine similarity search problems, for which existing methods can be used. We also propose novel algorithms for cosine similarity search, which are tailored to our setting. Our experimental study on large real-world datasets indicates that LEMP is up to an order of magnitude faster than state-of-the-art approaches.
【Keywords】: large-entry retrieval; matrix product; maximum inner product search
【Paper Link】 【Pages】:123-135
【Authors】: Jennie Duggan ; Olga Papaemmanouil ; Leilani Battle ; Michael Stonebraker
【Abstract】: Science applications are accumulating an ever-increasing amount of multidimensional data. Although some of it can be processed in a relational database, much of it is better suited to array-based engines. As such, it is important to optimize the query processing of these systems. This paper focuses on efficient query processing of join operations within an array database. These engines invariably ``chunk'' their data into multidimensional tiles that they use to efficiently process spatial queries. As such, traditional relational algorithms need to be substantially modified to take advantage of array tiles. Moreover, most n-dimensional science data is unevenly distributed in array space because its underlying observations rarely follow a uniform pattern. It is crucial that the optimization of array joins be skew-aware. In addition, owing to the scale of science applications, their query processing usually spans multiple nodes. This further complicates the planning of array joins. In this paper, we introduce a join optimization framework that is skew-aware for distributed joins. This optimization consists of two phases. In the first, a logical planner selects the query's algorithm (e.g., merge join), the granularity of the its tiles, and the reorganization operations needed to align the data. The second phase implements this logical plan by assigning tiles to cluster nodes using an analytical cost model. Our experimental results, on both synthetic and real-world data, demonstrate that this optimization framework speeds up array joins by up to 2.5X in comparison to the baseline.
【Keywords】: array database; distributed computation; join optimization; skew
【Paper Link】 【Pages】:137-152
【Authors】: Botong Huang ; Matthias Boehm ; Yuanyuan Tian ; Berthold Reinwald ; Shirish Tatikonda ; Frederick R. Reiss
【Abstract】: Declarative large-scale machine learning (ML) aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single node, in-memory computations to distributed computations on MapReduce (MR) or similar frameworks. State-of-the-art compilers in this context are very sensitive to memory constraints of the master process and MR cluster configuration. Different memory configurations can lead to significant performance differences. Interestingly, resource negotiation frameworks like YARN allow us to explicitly request preferred resources including memory. This capability enables automatic resource elasticity, which is not just important for performance but also removes the need for a static cluster configuration, which is always a compromise in multi-tenancy environments. In this paper, we introduce a simple and robust approach to automatic resource elasticity for large-scale ML. This includes (1) a resource optimizer to find near-optimal memory configurations for a given ML program, and (2) dynamic plan migration to adapt memory configurations during runtime. These techniques adapt resources according to data, program, and cluster characteristics. Our experiments demonstrate significant improvements up to 21x without unnecessary over-provisioning and low optimization overhead.
【Keywords】: large-scale; machine learning; resource elasticity; resource optimization; what-if analysis
【Paper Link】 【Pages】:153-166
【Authors】: Kerim Yasin Oktay ; Sharad Mehrotra ; Vaibhav Khadilkar ; Murat Kantarcioglu
【Abstract】: This paper describes SEMROD, a sensitive data aware MapReduce (MR) framework for hybrid clouds. SEMROD steers data and computation through public and private machines in such a way that no knowledge about sensitive data is leaked to public machines. For this purpose, SEMROD keeps trace of intermediate keys (generated during MR execution) that become sensitive, based on which it makes dynamic task scheduling decisions. SEMROD guarantees that adversaries viz. public machines) cannot gain any ``additional'' information about sensitive data from either the data stored on public machines or the communication between public and private machines during job execution. SEMROD extends naturally from a single MR job to multi-phase MR jobs that result, for instance, from compiling Hive queries into MR jobs. Using SEMROD, computation that may involve sensitive data can exploit public machines, thereby bringing significant performance benefits. Such computation would otherwise be restricted to only private clouds. Our experiments clearly demonstrate performance advantages to using SEMROD as compared with other secure alternatives, even when the percentage of sensitive data is as high as 50%.
【Keywords】: data processing; hybrid cloud; mapreduce; secure
【Paper Link】 【Pages】:167-181
【Authors】: Qian Chen ; Haibo Hu ; Jianliang Xu
【Abstract】: Data integration involves combining data from multiple sources and providing users with a unified query interface. Data integrity has been a key problem in online data integration. Although a variety of techniques have been proposed to address the data consistency and reliability issues, there is little work on assuring the integrity of integrated data and the correctness of query results. In this paper, we take the first step to propose authenticated data integration services to ensure data and query integrity even in the presence of an untrusted integration server. We develop a novel authentication code called homomorphic secret sharing seal that can aggregate the inputs from individual sources faithfully by the untrusted server for future query authentication. Based on this, we design two authenticated index structures and authentication schemes for queries on multi-dimensional data. We further study the freshness problem in multi-source query authentication and propose several advanced update strategies. Analytical models and empirical results show that our seal design and authentication schemes are efficient and robust under various system settings.
【Keywords】: algorithms; data integration; data integrity; experimentation; query authentication; security
【Paper Link】 【Pages】:183-196
【Authors】: Isabelle Hang ; Florian Kerschbaum ; Ernesto Damiani
【Abstract】: A data owner outsourcing the database of a multi user application wants to prevent information leaks caused by outside attackers exploiting software vulnerabilities or by curious personnel. Query processing over encrypted data solves this problem for a single user, but provides only limited functionality in the face of access restrictions for multiple users and keys. ENKI is a system for securely executing queries over sensitive, access restricted data on an outsourced database. It introduces an encryption based access control model and techniques for query execution over encrypted, access restricted data on the database with only a few cases requiring computations on the client. A prototype of ENKI supports all queries seen in three real world use cases and executes queries from TPC-C benchmark with a modest overhead compared to the single user mode.
【Keywords】: encrypted query processing; encryption-based access control; multi user
【Paper Link】 【Pages】:197-211
【Authors】: Vera Zaychik Moffitt ; Julia Stoyanovich ; Serge Abiteboul ; Gerome Miklau
【Abstract】: The management of Web users' personal information is increasingly distributed across a broad array of applications and systems, including online social networks and cloud-based services. Users wish to share data using these systems, but avoiding the risks of unintended disclosures or unauthorized access by applications has become a major challenge. We propose a novel access control model that operates within a distributed data management framework based on datalog. Using this model, users can control access to data they own and control applications they run. They can conveniently specify access control policies providing flexible tuple-level control derived using provenance information. We present a formal specification of the model, an implementation built using an open-source distributed datalog engine, and an extensive experimental evaluation showing that the computational cost of access control is modest.
【Keywords】: collaborative access control; distributed datalog; personal information management; provenance
【Paper Link】 【Pages】:213-225
【Authors】: Prasang Upadhyaya ; Magdalena Balazinska ; Dan Suciu
【Abstract】: Data has value and is increasingly being exchanged for commercial and research purposes. Data, however, is typically accompanied by terms of use, which limit how it can be used. To date, there are only a few, ad-hoc methods to enforce these terms. We propose DataLawyer, a new system to formally specify usage policies and check them automatically at query runtime in a relational database management system (DBMS). We develop a new model to specify policies compactly and precisely. We introduce novel algorithms to efficiently evaluate policies that can cut policy-checking overheads to only a few percent of the total query runtime. We implement DataLawyer and evaluate it on a real database from the health-care domain.
【Keywords】: data markets; licenses; terms of use; usage management
【Paper Link】 【Pages】:227-238
【Authors】: Yanxiang Huang ; Bin Cui ; Wenyu Zhang ; Jie Jiang ; Ying Xu
【Abstract】:
With the arrival of the big data era, opportunities as well as challenges arise in both industry and academia. As an important service in most web applications, accurate real-time recommendation in the context of big data is of high demand. Traditional recommender systems that analyze data and update models at regular time intervals cannot satisfy the requirements of modern web applications, calling for real-time recommender systems. In this paper, we tackle the big",
real-time" and accurate" challenges in real-time recommendation, and propose a general real-time stream recommender system built on Storm named TencentRec from three aspects, i.e.,
system", algorithm", and
data". We analyze the large amount of data streams from a wide range of applications leveraging the considerable computation ability of Storm, together with a data access component and a data storage component developed by us. To deal with various application specific demands, we have implemented several classic practical recommendation algorithms in TencentRec, including the item-based collaborative filtering, the content based, and the demographic based algorithms. Specially, we present a practical scalable item-based CF algorithm in detail, with the super characteristics such as robust to the implicit feedback problem, incremental update and real-time pruning. With the enhancement of real-time data collection and processing, we can capture the recommendation changes in real-time. We deploy the TencentRec in a series of production applications, and observe the superiority of TencentRec in providing accurate real-time recommendations for 10 billion user requests everyday.
【Keywords】: application; big data; practice; real-time recommendation; scalability
【Paper Link】 【Pages】:239-250
【Authors】: Sanjeev Kulkarni ; Nikunj Bhagat ; Maosong Fu ; Vikas Kedigehalli ; Christopher Kellogg ; Sailesh Mittal ; Jignesh M. Patel ; Karthik Ramasamy ; Siddarth Taneja
【Abstract】: Storm has long served as the main platform for real-time analytics at Twitter. However, as the scale of data being processed in real-time at Twitter has increased, along with an increase in the diversity and the number of use cases, many limitations of Storm have become apparent. We need a system that scales better, has better debug-ability, has better performance, and is easier to manage -- all while working in a shared cluster infrastructure. We considered various alternatives to meet these needs, and in the end concluded that we needed to build a new real-time stream data processing system. This paper presents the design and implementation of this new system, called Heron. Heron is now the de facto stream data processing engine inside Twitter, and in this paper we also share our experiences from running Heron in production. In this paper, we also provide empirical evidence demonstrating the efficiency and scalability of Heron.
【Keywords】: real-time data processing.; stream data processing systems
【Paper Link】 【Pages】:251-264
【Authors】: Lucas Braun ; Thomas Etter ; Georgios Gasparis ; Martin Kaufmann ; Donald Kossmann ; Daniel Widmer ; Aharon Avitzur ; Anthony Iliopoulos ; Eliezer Levy ; Ning Liang
【Abstract】: Modern data-centric flows in the telecommunications industry require real time analytical processing over a rapidly changing and large dataset. The traditional approach of separating OLTP and OLAP workloads cannot satisfy this requirement. Instead, a new class of integrated solutions for handling hybrid workloads is needed. This paper presents an industrial use case and a novel architecture that integrates key-value-based event processing and SQL-based analytical processing on the same distributed store while minimizing the total cost of ownership. Our approach combines several well-known techniques such as shared scans, delta processing, a PAX-fashioned storage layout, and an interleaving of scanning and delta merging in a completely new way. Performance experiments show that our system scales out linearly with the number of servers. For instance, our system sustains event streams of 100,000 events per second while simultaneously processing 100 ad-hoc analytical queries per second, using a cluster of 12 commodity servers. In doing so, our system meets all response time goals of our telecommunication customers; that is, 10 milliseconds per event and 100 milliseconds for an ad-hoc analytical query. Moreover, our system beats commercial competitors by a factor of 2.5 in analytical and two orders of magnitude in update performance.
【Keywords】: analytics; event-processing; oltp/olap engine
【Paper Link】 【Pages】:265-276
【Authors】: Paul Suganthan G. C. ; Chong Sun ; Krishna Gayatri K. ; Haojun Zhang ; Frank Yang ; Narasimhan Rampalli ; Shishir Prasad ; Esteban Arcaute ; Ganesh Krishnan ; Rohit Deep ; Vijay Raghavendra ; AnHai Doan
【Abstract】: Big Data industrial systems that address problems such as classification, information extraction, and entity matching very commonly use hand-crafted rules. Today, however, little is understood about the usage of such rules. In this paper we explore this issue. We discuss how these systems differ from those considered in academia. We describe default solutions, their limitations, and reasons for using rules. We show examples of extensive rule usage in industry. Contrary to popular perceptions, we show that there is a rich set of research challenges in rule generation, evaluation, execution, optimization, and maintenance. We discuss ongoing work at WalmartLabs and UW-Madison that illustrate these challenges. Our main conclusions are (1) using rules (together with techniques such as learning and crowdsourcing) is fundamental to building semantics-intensive Big Data systems, and (2) it is increasingly critical to address rule management, given the tens of thousands of rules industrial systems often manage today in an ad-hoc fashion.
【Keywords】: big data; classification; rule management
【Paper Link】 【Pages】:277-281
【Authors】: Stratos Idreos ; Olga Papaemmanouil ; Surajit Chaudhuri
【Abstract】: Data exploration is about efficiently extracting knowledge from data even if we do not know exactly what we are looking for. In this tutorial, we survey recent developments in the emerging area of database systems tailored for data exploration. We discuss new ideas on how to store and access data as well as new ideas on how to interact with a data system to enable users and applications to quickly figure out which data parts are of interest. In addition, we discuss how to exploit lessons-learned from past research, the new challenges data exploration crafts, emerging applications and future research directions.
【Keywords】: data exploration
【Paper Link】 【Pages】:283-284
【Authors】: Christopher Ré ; Divy Agrawal ; Magdalena Balazinska ; Michael J. Cafarella ; Michael I. Jordan ; Tim Kraska ; Raghu Ramakrishnan
【Abstract】: Machine learning seems to be eating the world with a new breed of high-value data-driven applications in image analysis, search, voice recognition, mobile, and office productivity products. To paraphrase Mike Stonebraker, machine learning is no longer a zero-billion-dollar business. As the home of high-value, data-driven applications for over four decades, a natural question for database researchers to ask is: what role should the database community play in these new data-driven machine-learning-based applications?
【Keywords】: database research; machine learning; panel
【Paper Link】 【Pages】:285-297
【Authors】: Abdallah Salama ; Carsten Binnig ; Tim Kraska ; Erfan Zamanian
【Abstract】: In order to deal with mid-query failures in parallel data engines (PDEs), different fault-tolerance schemes are implemented today: (1) fault-tolerance in parallel databases is typically implemented in a coarse-grained manner by restarting a query completely when a mid-query failure occurs, and (2) modern MapReduce-style PDEs implement a fine-grained fault-tolerance scheme, which either materializes intermediate results or implements a lineage model to recover from mid-query failures. However, neither of these schemes can efficiently handle mixed workloads with both short running interactive queries as well as long running batch queries nor do these schemes efficiently support a wide range of different cluster setups which vary in cluster size and other parameters such as the mean time between failures. In this paper, we present a novel cost-based fault-tolerance scheme which tackles this issue. Compared to the existing schemes, our scheme selects a subset of intermediates to be materialized such that the total query runtime is minimized under mid-query failures. Our experiments show that our cost-based fault-tolerance scheme outperforms all existing strategies and always selects the sweet spot for short- and long running queries as well as for different cluster setups.
【Keywords】:
【Paper Link】 【Pages】:299-313
【Authors】: Aaron J. Elmore ; Vaibhav Arora ; Rebecca Taft ; Andrew Pavlo ; Divyakant Agrawal ; Amr El Abbadi
【Abstract】: For data-intensive applications with many concurrent users, modern distributed main memory database management systems (DBMS) provide the necessary scale-out support beyond what is possible with single-node systems. These DBMSs are optimized for the short-lived transactions that are common in on-line transaction processing (OLTP) workloads. One way that they achieve this is to partition the database into disjoint subsets and use a single-threaded transaction manager per partition that executes transactions one-at-a-time in serial order. This minimizes the overhead of concurrency control mechanisms, but requires careful partitioning to limit distributed transactions that span multiple partitions. Previous methods used off-line analysis to determine how to partition data, but the dynamic nature of these applications means that they are prone to hotspots. In these situations, the DBMS needs to reconfigure how data is partitioned in real-time to maintain performance objectives. Bringing the system off-line to reorganize the database is unacceptable for on-line applications. To overcome this problem, we introduce the Squall technique for supporting live reconfiguration in partitioned, main memory DBMSs. Squall supports fine-grained repartitioning of databases in the presence of distributed transactions, high throughput client workloads, and replicated data. An evaluation of our approach on a distributed DBMS shows that Squall can reconfigure a database with no downtime and minimal overhead on transaction latency.
【Keywords】: load-balancing; migration; reconfiguration
【Paper Link】 【Pages】:315-329
【Authors】: Takeshi Mishima ; Yasuhiro Fujiwara
【Abstract】: Database-as-a-service has been gaining popularity in cloud computing because multitenant databases can reduce costs by sharing off-the-shelf resources. However, due to heavy workloads, resource sharing often causes a hot spot; one node is overloaded even while others are not. Unfortunately, a hot spot can lead to violation of service level agreements and destroy customer satisfaction. To efficiently address the hot spot problem, we propose a middleware approach called Madeus that conducts database live migration. To make efficient database live migration possible, we also introduce the lazy snapshot isolation rule (LSIR) that enables concurrently propagating syncsets, which are the datasets needed to synchronize slave with master databases. Madeus provides efficient database live migration by implementing the LSIR under snapshot isolation. Unlike current approaches, Madeus is pure middleware that is transparent to the database management system and based on commodity hardware and software. To demonstrate the superiority of our approach over current approaches, we experimentally evaluated Madeus by using PostgreSQL with the TPC-W benchmark. The results indicate that Madeus achieves more efficient live migration than three other types of middleware approaches, especially under heavy workloads; therefore, it can effectively resolve hot spots.
【Keywords】: database live migration; hot spot; middleware;concurrency control
【Paper Link】 【Pages】:331-346
【Authors】: Peter Alvaro ; Joshua Rosen ; Joseph M. Hellerstein
【Abstract】: In large-scale data management systems, failure is practically a certainty. Fault-tolerant protocols and components are notoriously difficult to implement and debug. Worse still, choosing existing fault-tolerance mechanisms and integrating them correctly into complex systems remains an art form, and programmers have few tools to assist them. We propose a novel approach for discovering bugs in fault-tolerant data management systems: lineage-driven fault injection. A lineage-driven fault injector reasons backwards from correct system outcomes to determine whether failures in the execution could have prevented the outcome. We present MOLLY, a prototype of lineage-driven fault injection that exploits a novel combination of data lineage techniques from the database literature and state-of-the-art satisfiability testing. If fault-tolerance bugs exist for a particular configuration, MOLLY finds them rapidly, in many cases using a order of magnitude fewer executions than random fault injection. Otherwise, MOLLY certifies that the code is bug-free for that configuration.
【Keywords】: fault-tolerance; provenance; verification
【Paper Link】 【Pages】:347-362
【Authors】: Lisi Chen ; Gao Cong
【Abstract】: Massive amount of text data are being generated by a huge number of web users at an unprecedented scale. These data cover a wide range of topics. Users are interested in receiving a few up-to-date representative documents (e.g., tweets) that can provide them with a wide coverage of different aspects of their query topics. To address the problem, we consider the Diversity-Aware Top-k Subscription (DAS) query. Given a DAS query, we continuously maintain an up-to-date result set that contains k most recently returned documents over a text stream for the query. The DAS query takes into account text relevance, document recency, and result diversity. We propose a novel solution to efficiently processing a large number of DAS queries over a stream of documents. We demonstrate the efficiency of our approach on real-world dataset and the experimental results show that our solution is able to achieve a reduction of the processing time by 60--75% compared with two baselines. We also study the effectiveness of the DAS query.
【Keywords】: diversification; publish/subscribe; text stream
【Paper Link】 【Pages】:363-375
【Authors】: Georgios John Fakas ; Zhi Cai ; Nikos Mamoulis
【Abstract】: The abundance and ubiquity of graphs (e.g., Online Social Networks such as Google+ and Facebook; bibliographic graphs such as DBLP) necessitates the effective and efficient search over them. Given a set of keywords that can identify a Data Subject (DS), a recently proposed relational keyword search paradigm produces, as a query result, a set of Object Summaries (OSs). An OS is a tree structure rooted at the DS node (i.e., a tuple containing the keywords) with surrounding nodes that summarize all data held on the graph about the DS. OS snippets, denoted as size-l OSs, have also been investigated. Size-l OSs are partial OSs containing l nodes such that the summation of their importance scores results in the maximum possible total score. However, the set of nodes that maximize the total importance score may result in an uninformative size-l OSs, as very important nodes may be repeated in it, dominating other representative information. In view of this limitation, in this paper we investigate the effective and efficient generation of two novel types of OS snippets, i.e. diverse and proportional size-l OSs, denoted as DSize-l and PSize-l OSs. Namely, apart from the importance of each node, we also consider its frequency in the OS and its repetitions in the snippets. We conduct an extensive evaluation on two real graphs (DBLP and Google+). We verify effectiveness by collecting user feedback, e.g. by asking DBLP authors (i.e. the DSs themselves) to evaluate our results. In addition, we verify the efficiency of our algorithms and evaluate the quality of the snippets that they produce.
【Keywords】: diversity; keyword search; proportionality; ranking; summaries
【Paper Link】 【Pages】:377-392
【Authors】: Xiaochun Yang ; Yaoshu Wang ; Bin Wang ; Wei Wang
【Abstract】: We study efficient query processing for approximate string queries, which find strings within a string collection whose edit distances to the query strings are within the given thresholds. Existing methods typically hinge on the property that globally similar strings must share at least certain number of identical substrings or subsequences. They become ineffective when there are burst errors or when the number of errors is large. In this paper, we explore the opposite paradigm focusing on finding out the differences of database strings to the query string. We propose a new filtering method, called local filtering, based on the idea that two strings exhibiting substantial local dissimilarities must be globally dissimilar. We propose the concept of (positional) local distance to quantify the minimum amount of errors a query fragment contributes to the edit distance between the query and a data string. It also leads to effective pruning rules and can speed up verification via early termination. We devise a family of indexing methods based on the idea of precomputing (positional) local distances for all possible combinations of query fragments and edit distance thresholds. Based on careful analyses of subtle relationships among local distances, novel techniques are proposed to drastically reduce the amount of enumeration with no or little impact on the pruning power. Efficient query processing methods exploiting the new index and bit-parallelism are also proposed. Experimental results on real datasets show that our local filtering-based methods can achieve substantial speedup compared with state-of-the-art methods, and they are robust against factors such as dataset characteristics and large edit distance thresholds.
【Keywords】: approximate query; edit distance; estimation; filtering; string
【Paper Link】 【Pages】:393-404
【Authors】: Minhao Jiang ; Ada Wai-Chee Fu ; Raymond Chi-Wing Wong
【Abstract】: Top-k nearest keyword search has been of interest because of applications ranging from road network location search by keyword to search of information on an RDF repository. We consider the evaluation of a query with a given vertex and a keyword, and the problem is to find a set of $k$ nearest vertices that contain the keyword. The known algorithms for handling this problem only give approximate answers. In this paper, we propose algorithms for top-k nearest keyword search that provide exact solutions and which handle networks of very large sizes. We have also verified the performance of our solutions compared with the best-known approximation algorithms with experiments on real datasets.
【Keywords】: 2-hop labeling; keyword-lookup tree; nearest keyword search
【Paper Link】 【Pages】:405-418
【Authors】: Tao Guo ; Xin Cao ; Gao Cong
【Abstract】: As an important type of spatial keyword query, the m-closest keywords (mCK) query finds a group of objects such that they cover all query keywords and have the smallest diameter, which is defined as the largest distance between any pair of objects in the group. The query is useful in many applications such as detecting locations of web resources. However, the existing work does not study the intractability of this problem and only provides exact algorithms, which are computationally expensive. In this paper, we prove that the problem of answering mCK queries is NP-hard. We first devise a greedy algorithm that has an approximation ratio of 2. Then, we observe that an mCK query can be approximately answered by finding the circle with the smallest diameter that encloses a group of objects together covering all query keywords. We prove that the group enclosed in the circle can answer the mCK query with an approximation ratio of 2 over 3. Based on this, we develop an algorithm for finding such a circle exactly, which has a high time complexity. To improve efficiency, we propose another two algorithms that find such a circle approximately, with a ratio of 2 over √3 + ε. Finally, we propose an exact algorithm that utilizes the group found by the 2 over √3 + ε)-approximation algorithm to obtain the optimal group. We conduct extensive experiments using real-life datasets. The experimental results offer insights into both efficiency and accuracy of the proposed approximation algorithms, and the results also demonstrate that our exact algorithm outperforms the best known algorithm by an order of magnitude.
【Keywords】: geo-textual objects; spatial keyword query
【Paper Link】 【Pages】:419-430
【Authors】: Silu Huang ; Ada Wai-Chee Fu ; Ruifeng Liu
【Abstract】: The computation of Minimum Spanning Trees (MSTs) is a fundamental graph problem with important applications. However, there has been little study of MSTs for temporal graphs, which is becoming common as time information is collected for many existing networks. We define two types of MSTs for temporal graphs, MSTa and MSTw, based on the optimization of time and cost, respectively. We propose efficient linear time algorithms for computing MSTa. We show that computing MSTw is much harder. We design efficient approximation algorithms based on a transformation to the Directed Steiner Tree problem (DST). Our solution also solves the classical DST problem with a better time complexity and the same approximation factor compared to the state-of-the-art algorithm. Our experiments on real temporal networks further verify the effectiveness of our algorithms. For MSTw, our solution is capable of shortening the runtime from 10 hours to 3 seconds.
【Keywords】: steiner tree; temporal graph; minimum spanning tree
【Paper Link】 【Pages】:431-444
【Authors】: Devora Berlowitz ; Sara Cohen ; Benny Kimelfeld
【Abstract】: The problem of enumerating (i.e., generating) all maximal cliques in a graph has received extensive treatment, due to the plethora of applications in various areas such as data mining, bioinformatics, network analysis and community detection. However, requiring the enumerated subgraphs to be full cliques is too restrictive in common real-life scenarios where "almost cliques" are equally useful. Hence, the notion of a k-plex, a clique relaxation that allows every node to be "missing" k neighbors, has been introduced. But this seemingly minor relaxation casts existing algorithms for clique enumeration inapplicable, for inherent reasons. This paper presents the first provably efficient algorithms, both for enumerating the maximal k-plexes and for enumerating the maximal connected k-plexes. Our algorithms run in polynomial delay for a constant k and incremental FPT delay when k is a parameter. The importance of such algorithms is in the areas mentioned above, as well as in new applications. Extensive experimentation over both real and synthetic datasets shows the efficiency of our algorithms, and their scalability with respect to graph size, density and choice of k, as well as their clear superiority over the state-of-the-art.
【Keywords】: enumeration; fixed-parameter tractability; maximal graph clique; maximal k-plex; polynomial delay
【Paper Link】 【Pages】:445-458
【Authors】: Zhiwei Zhang ; Jeffrey Xu Yu ; Lu Qin ; Zechao Shang
【Abstract】: Depth-First Search (DFS), which traverses a graph in the depth- first order, is one of the fundamental graph operations, and the result of DFS over all nodes in G is a spanning tree known as a DFS-Tree. There are many graph algorithms that need DFS such as connected component computation, topological sort, community detection, eulerian path computation, graph bipartiteness testing, planar graph testing, etc, because the in-memory DFS algorithm shows it can be done in linear time w.r.t. the size of G. However, given the fact that real-world graphs grow rapidly in the big data era, the in-memory DFS algorithm cannot be used to handle a large graph that cannot be entirely held in main memory. In this paper, we focus on I/O efficiency and study semi-external algorithms to DFS a graph G which is on disk. Here, like the existing semi-external algorithms, we assume that a spanning tree of G can be held in main memory and the remaining edges of G are kept on disk, and compute the DFS-Tree in main memory with which DFS can be identified. We propose novel divide & conquer algorithms to DFS over a graph G on disk. In brief, we divide a graph into several subgraphs, compute the DFS-Tree for each subgraph independently, and then merge them together to compute the DFS-Tree for the whole graph. With the global DFS-Tree computed we identify DFS. We discuss the valid division, that can lead to the correct DFS, and the challenges to do so. We propose two division algorithms, named Divide-Star and Divide-TD, and a merge algorithm. We conduct extensive experimental studies using four real massive datasets and several synthetic datasets to confirm the I/O efficiency of our approach.
【Keywords】: depth-first search; graph algorithm; i/o efficient
【Paper Link】 【Pages】:459-474
【Authors】: Lijun Chang ; Xuemin Lin ; Lu Qin ; Jeffrey Xu Yu ; Wenjie Zhang
【Abstract】: With the proliferation of graph applications, the problem of efficiently computing all $k$-edge connected components of a graph G for a user-given k has been recently investigated. In this paper, we study the problem of efficiently computing the steiner component with the maximum connectivity; that is, given a set q of query vertices in a graph G, we aim to find the maximum induced subgraph g of G such that g contains q and g has the maximum connectivity, where g is denoted as SMCC. To accommodate online query processing, we present an efficient algorithm based on a novel index such that the algorithm runs in linear time regarding the result size; thus, the algorithm is optimal since it needs at least linear time to output the result. Moreover, in this paper we also investigate variations of the above problem. We show that such a problem with the constraint that the size of the SMCC is not smaller than a given size can also be solved in linear time regarding the result size (thus, optimal). We also show that the problem of computing the connectivity (rather than the graph details) of SMCC can be solved in linear time regarding the query size (thus, optimal). To build the index, we extend the techniques in [7] to accommodate batch processing and computation sharing. To efficiently support the applications with graph updates, we also present novel increment techniques. Finally, we conduct extensive performance studies on large real and synthetic graphs, which demonstrate that our index-based algorithms significantly outperform baseline algorithms by several orders of magnitude and our indexing algorithms are efficient.
【Keywords】: dynamic graph; k-edge connected com- ponent; maximum spanning tree; steiner maximum-connected component
【Paper Link】 【Pages】:475-489
【Authors】: Saket Gurukar ; Sayan Ranu ; Balaraman Ravindran
【Abstract】: A fundamental problem in behavioral analysis of human interactions is to understand how communications unfold. In this paper, we study this problem by mining Communication motifs from dynamic interaction networks. A communication motif is a recurring subgraph that has a similar sequence of information flow. Mining communication motifs requires us to explore the exponential subgraph search space where existing techniques fail to scale. To tackle this scalability bottleneck, we develop a technique called COMMIT. COMMIT converts a dynamic graph into a database of sequences. Through careful analysis in the sequence space, only a small portion of the exponential search space is accessed to identify regions embedding communication motifs. Extensive experiments on three different social networks show COMMIT to be up to two orders of magnitude faster than baseline techniques. Furthermore, qualitative analysis demonstrate communication motifs to be effective in characterizing the recurring patterns of interactions while also revealing the role that the underlying social network plays in shaping human behavior.
【Keywords】: communication motifs; graph mining; interaction networks
【Paper Link】 【Pages】:491-503
【Authors】: Kaustubh Beedkar ; Rainer Gemulla
【Abstract】: We propose LASH, a scalable, distributed algorithm for mining sequential patterns in the presence of hierarchies. LASH takes as input a collection of sequences, each composed of items from some application-specific vocabulary. In contrast to traditional approaches to sequence mining, the items in the vocabulary are arranged in a hierarchy: both input sequences and sequential patterns may consist of items from different levels of the hierarchy. Such hierarchies naturally occur in a number of applications including mining natural-language text, customer transactions, error logs, or event sequences. LASH is the first parallel algorithm for mining frequent sequences with hierarchies; it is designed to scale to very large datasets. At its heart, LASH partitions the data using a novel, hierarchy-aware variant of item-based partitioning and subsequently mines each partition independently and in parallel using a customized mining algorithm called pivot sequence miner. LASH is amenable to a MapReduce implementation; we propose effective and efficient algorithms for both the construction and the actual mining of partitions. Our experimental study on large real-world datasets suggest good scalability and run-time efficiency.
【Keywords】: data mining; frequent sequence mining; hierarchies; mapreduce
【Paper Link】 【Pages】:505-517
【Authors】: Michael Cochez ; Hao Mou
【Abstract】: Many commonly used data-mining techniques utilized across research fields perform poorly when used for large data sets. Sequential agglomerative hierarchical non-overlapping clustering is one technique for which the algorithms' scaling properties prohibit clustering of a large amount of items. Besides the unfavorable time complexity of O(n2), these algorithms have a space complexity of O(n2), which can be reduced to O(n) if the time complexity is allowed to rise to O(n2 log2n). In this paper, we propose the use of locality-sensitive hashing combined with a novel data structure called twister tries to provide an approximate clustering for average linkage. Our approach requires only linear space. Furthermore, its time complexity is linear in the number of items to be clustered, making it feasible to apply it on a larger scale. We evaluate the approach both analytically and by applying it to several data sets.
【Keywords】: average linkage; hierarchical clustering; linear complexity; locality-sensitive hashing
【Paper Link】 【Pages】:519-530
【Authors】: Junhao Gan ; Yufei Tao
【Abstract】: DBSCAN is a popular method for clustering multi-dimensional objects. Just as notable as the method's vast success is the research community's quest for its efficient computation. The original KDD'96 paper claimed an algorithm with O(n log n) running time, where n is the number of objects. Unfortunately, this is a mis-claim; and that algorithm actually requires O(n2) time. There has been a fix in 2D space, where a genuine O(n log n)-time algorithm has been found. Looking for a fix for dimensionality d ≥ 3 is currently an important open problem. In this paper, we prove that for d ≥ 3, the DBSCAN problem requires Ω(n4/3) time to solve, unless very significant breakthroughs---ones widely believed to be impossible---could be made in theoretical computer science. This (i) explains why the community's search for fixing the aforementioned mis-claim has been futile for d ≥ 3, and (ii) indicates (sadly) that all DBSCAN algorithms must be intolerably slow even on moderately large n in practice. Surprisingly, we show that the running time can be dramatically brought down to O(n) in expectation regardless of the dimensionality d, as soon as slight inaccuracy in the clustering results is permitted. We formalize our findings into the new notion of ρ-approximate DBSCAN, which we believe should replace DBSCAN on big data due to the latter's computational intractability.
【Keywords】: algorithm; dbscan; density-based clustering
【Paper Link】 【Pages】:531-543
【Authors】: Azade Nazi ; Mahashweta Das ; Gautam Das
【Abstract】: The increasing popularity and widespread use of online review sites over the past decade has motivated businesses of all types to possess an expansive arsenal of user feedback (preferably positive) in order to mark their reputation and presence in the Web. Though a significant proportion of purchasing decisions today are driven by average numeric scores (e.g., movie rating in IMDB), detailed reviews are critical for activities such as buying an expensive digital SLR camera, reserving a vacation package, etc. Since writing a detailed review for a product (or, a service) is usually time-consuming and may not offer any incentive, the number of useful reviews available in the Web is far from many. The corpus of reviews available at our disposal for making informed decisions also suffers from spam and misleading content, typographical and grammatical errors, etc. In this paper, we address the problem of how to engage the lurkers (i.e., people who read reviews but never take time and effort to write one) to participate and write online reviews by systematically simplifying the reviewing task. Given a user and an item that she wants to review, the task is to identify the top-$k$ meaningful phrases (i.e., tags) from the set of all tags (i.e., available user feedback for items) that, when advised, would help her review an item easily. We refer to it as the TagAdvisor problem, and formulate it as a general-constrained optimization goal. Our framework is centered around three measures - relevance (i.e., how well the result set of tags describes an item to a user), coverage (i.e., how well the result set of tags covers the different aspects of an item), and polarity (i.e., how well sentiment is attached to the result set of tags) in order to help a user review an item satisfactorily. By adopting different definitions of coverage, we identify two concrete problem instances that enable a wide range of real-world scenarios. We show that these problems are NP-hard and develop practical algorithms with theoretical bounds to solve them efficiently. We conduct detailed experiments on synthetic and real data crawled from the web to validate the utility of our problem and effectiveness of our solutions.
【Keywords】: coverage; personalized tag advisor; polarity; relevance
【Paper Link】 【Pages】:545-560
【Authors】: Liping Peng ; Yanlei Diao
【Abstract】: Uncertain data management has become crucial to scientific applications. Recently, array databases have gained popularity for scientific data processing due to performance benefits. In this paper, we address uncertain data management in array databases, which may involve both value uncertainty within individual tuples and position uncertainty regarding where a tuple should belong in an array given uncertain dimension attributes. Our work defines the formal semantics of array operations under both value and position uncertainty. To address the new challenge raised by position uncertainty, we propose a suite of storage and evaluation strategies for array operations, with a focus on a new scheme that bounds the overhead of querying by strategically treating tuples with large variances via replication in storage. Results from real datasets show that for common workloads, our best-performing techniques outperform alternative methods based on state-of-the-art indexes by 1.7x to 4.3x for the Subarray operation and 1 to 2 orders of magnitude for Structure-Join, at only a small storage cost.
【Keywords】: array databases; store-multiple; uncertainty
【Paper Link】 【Pages】:561-576
【Authors】: Simon Razniewski ; Flip Korn ; Werner Nutt ; Divesh Srivastava
【Abstract】: In many applications including loosely coupled cloud databases, collaborative editing and network monitoring, data from multiple sources is regularly used for query answering. For reasons such as system failures, insufficient author knowledge or network issues, data may be temporarily unavailable or generally nonexistent. Hence, not all data needed for query answering may be available. In this paper, we propose a natural class of completeness patterns, expressed by selections on database tables, to specify complete parts of database tables. We then show how to adapt the operators of relational algebra so that they manipulate these completeness patterns to compute completeness patterns pertaining to query answers. Our proposed algebra is computationally sound and complete with respect to the information that the patterns provide. We show that stronger completeness patterns can be obtained by considering not only the schema but also the database instance and we extend the algebra to take into account this additional information. We develop novel techniques to efficiently implement the computation of completeness patterns on query answers and demonstrate their scalability on real data.
【Keywords】: data completeness; data quality
【Paper Link】 【Pages】:577-592
【Authors】: Peng Peng ; Raymond Chi-Wing Wong
【Abstract】: Multi-criteria decision making problem has been well studied for many years. One popular query for multi-criteria decision making is top-k queries which require each user to specify an exact utility function. In many cases, the utility function of a user is probabilistic and finding the distribution on the utility functions has been widely explored in the machine learning areas, such as user's recommender systems, Bayesian learning models and user's preference elicitation, for improving user's experience. Motivated by this, we propose a new type of queries called k-hit queries, which has not been studied before. Given a set D of tuples in the database, the distribution θ on utility functions and a positive integer k, we would like to select a set of k tuples from D in order to maximize the probability that at least one of tuples in the selection set is the favorite of a user. All applications for top-k queries can naturally be used in k-hit queries. In this paper, we present various interesting properties of k-hit queries. Besides, based on these properties, we propose a novel algorithm called k-hit_Alg for k-hit queries. Finally, we conducted comprehensive experiments to show that the performance of our proposed method, k hit_Alg, is superior compared with other existing algorithms which were originally used to answer other existing queries.
【Keywords】: algorithm; geometry; preference query; sampling; skyline query; top-k query
【Paper Link】 【Pages】:593-605
【Authors】: Furong Li ; Mong-Li Lee ; Wynne Hsu ; Wang-Chiew Tan
【Abstract】: To harness the rich amount of information available on the Web today, many organizations start to aggregate public (and private) data to derive new knowledge bases. A fundamental challenge in constructing an accurate integrated knowledge repository from different data sources is to understand how facts across different sources are related to one another over time. This challenge, referred to as the temporal record linkage problem, goes far beyond the traditional record linkage problem as it requires a fine-grained analysis of how two facts are temporally related if they both refer to the same entity. In this paper, we present a new solution for understanding how two facts may be temporally related and exploit the knowledge to profile how entities evolve over time. Our solution makes use of a novel transition model which captures sophisticated patterns of value transitions. Specifically, our transition model captures the probability that an entity may change to a particular attribute value after some time period. This transition model can be considered jointly with various source quality metrics to fine-tune how records should be temporally linked to entities. In particular, we showcase how the freshness of data sources can be built into a source-aware temporal matching algorithm that jointly considers the value transitions and the freshness of data sources to link temporal records to entities in the right time period. In this way, an increasingly complete and up-to-date entity profile can be derived as more and more temporal records are aggregated from different sources. Our suite of experimental results on real world datasets demonstrate that our proposed method is able to outperform the state-of-the-art techniques and build more complete profiles for entities by identifying their true matching temporal records at the right time period.
【Keywords】: entity profiling; information integration; record linkage; temporal data
【Paper Link】 【Pages】:607-618
【Authors】: Yiqing Huang ; Fangzhou Zhu ; Mingxuan Yuan ; Ke Deng ; Yanhua Li ; Bing Ni ; Wenyuan Dai ; Qiang Yang ; Jia Zeng
【Abstract】: We show that telco big data can make churn prediction much more easier from the $3$V's perspectives: Volume, Variety, Velocity. Experimental results confirm that the prediction performance has been significantly improved by using a large volume of training data, a large variety of features from both business support systems (BSS) and operations support systems (OSS), and a high velocity of processing new coming data. We have deployed this churn prediction system in one of the biggest mobile operators in China. From millions of active customers, this system can provide a list of prepaid customers who are most likely to churn in the next month, having $0.96$ precision for the top $50000$ predicted churners in the list. Automatic matching retention campaigns with the targeted potential churners significantly boost their recharge rates, leading to a big business value.
【Keywords】: big data; customer retention; telco churn prediction
【Paper Link】 【Pages】:619-630
【Authors】: Orri Erling ; Alex Averbuch ; Josep-Lluis Larriba-Pey ; Hassan Chafi ; Andrey Gubichev ; Arnau Prat-Pérez ; Minh-Duc Pham ; Peter A. Boncz
【Abstract】: The Linked Data Benchmark Council (LDBC) is now two years underway and has gathered strong industrial participation for its mission to establish benchmarks, and benchmarking practices for evaluating graph data management systems. The LDBC introduced a new choke-point driven methodology for developing benchmark workloads, which combines user input with input from expert systems architects, which we outline. This paper describes the LDBC Social Network Benchmark (SNB), and presents database benchmarking innovation in terms of graph query functionality tested, correlated graph generation techniques, as well as a scalable benchmark driver on a workload with complex graph dependencies. SNB has three query workloads under development: Interactive, Business Intelligence, and Graph Algorithms. We describe the SNB Interactive Workload in detail and illustrate the workload with some early results, as well as the goals for the two other workloads.
【Keywords】: benchmarking; graph databases; rdf databases
【Paper Link】 【Pages】:631-646
【Authors】: Frank Austin Nothaft ; Matt Massie ; Timothy Danford ; Zhao Zhang ; Uri Laserson ; Carl Yeksigian ; Jey Kottalam ; Arun Ahuja ; Jeff Hammerbacher ; Michael Linderman ; Michael J. Franklin ; Anthony D. Joseph ; David A. Patterson
【Abstract】: "Next generation" data acquisition technologies are allowing scientists to collect exponentially more data at a lower cost. These trends are broadly impacting many scientific fields, including genomics, astronomy, and neuroscience. We can attack the problem caused by exponential data growth by applying horizontally scalable techniques from current analytics systems to accelerate scientific processing pipelines. In this paper, we describe ADAM, an example genomics pipeline that leverages the open-source Apache Spark and Parquet systems to achieve a 28x speedup over current genomics pipelines, while reducing cost by 63%. From building this system, we were able to distill a set of techniques for implementing scientific analyses efficiently using commodity "big data" systems. To demonstrate the generality of our architecture, we then implement a scalable astronomy image processing system which achieves a 2.8--8.9x improvement over the state-of-the-art MPI-based system.
【Keywords】: design
【Paper Link】 【Pages】:647-658
【Authors】: Yue Wang ; Yingzhong Xu ; Yue Liu ; Jian Chen ; Songlin Hu
【Abstract】: Apache Hive has been widely used by Internet companies for big data analytics applications. It can provide the capability of compiling high-level languages into efficient MapReduce workflows, which frees users from complicated and time consuming programming. The popularity of Hive and its HiveQL-compatible systems like Impala and Shark attracts attentions from traditional enterprises as well. However, enterprise big data processing systems such as Smart Grid applications often have to migrate their RDBMS-based legacy applications to Hive rather than directly writing new logic in HiveQL. Considering their differences in syntax and cost model, manual translation from SQL in RDBMS to HiveQL is very difficult, error-prone, and often leads to poor performance. In this paper, we propose QMapper, a tool for automatically translating SQL into proper HiveQL. QMapper consists of a rule-based rewriter and a cost-based optimizer. The experiments based on the TPC-H benchmark demonstrate that, compared to manually rewritten Hive queries provided by Hive contributors, QMapper dramatically reduces the query latency on average. Our real world Smart Grid application also shows its efficiency.
【Keywords】: hive; join optimization; sql on hadoop; system migration
【Paper Link】 【Pages】:659
【Authors】: Jennifer Widom
【Abstract】: Being honored as the ACM Athena Lecturer has inspired me to reflect upon the research I've conducted over my career to date. Conventional wisdom says good things come in threes, so I've picked three of my favorite results to cover during the talk. For each one I'll explain the context and motivation, the result itself, and why it ranks as one of my favorites. The three results span foundations, implementation, and user-interface, and they represent three of my favorite research areas: semistructured data, data streams, and uncertain data.
【Keywords】: data streams; semistructured data; uncertain data
【Paper Link】 【Pages】:661
【Authors】: Laura M. Haas
【Abstract】: Integrating data has always been a challenge. The information management community has made great progress in tackling this challenge, both on the theory and the practice. But in the last ten years, the world has changed dramatically. New platforms, devices and applications have made huge volumes of heterogeneous data available at speeds never contemplated before, while the quality of the available data has if anything degraded. Unstructured and semi-structured formats and no-sql data stores undercut the old reliable tools of schema, forcing applications to deal with data at the instance level. Deep expertise in the data and domain, in the tools and systems for integration and analysis, in mathematics, computer science, and business are needed to discover insights from data, but rarely are all of these skills found in a single individual or even team. Meanwhile, the availability of all these data has raised expectations for rapid breakthroughs in many sciences, for quick solutions to business problems, and for ever more sophisticated applications that combine and analyze information to solve our daily needs. These expectations raise the bar for integration technology, while opening the door for it to play a broader role. Integration has always been a key player in handling data variety, for example, but now more than ever must deal with scale (in the number of types as well as in the volume and speed of data). While data cleansing has been one step of an integration pipeline, this technology must be leveraged throughout data integration, so that the integration process is better able to deal with the uncertainty in data, offering means to eliminate or reduce it, or, to elucidate it by linking important contextual information, such as provenance and usage. The complexity of today's data-driven challenges in fact suggests that the integration process should be context-aware, so that data sets may be combined differently depending on the proposed usage. In the Accelerated Discovery Lab, we support data scientists working with a broad range of data as they try to find the insights to solve problems of business or societal importance. Clearly, integration is essential to insight. However, integration has to be across more than just datasets and schemas, and it has to be done more dynamically and flexibly than the standard tools allow. It is needed at multiple levels: (1) to build rich (but flexible) collections of diverse data, (2) to tightly bind individual data points into entities, allowing deeper explorations and (3) to bring together data and context to enable re-use by users with differing expertise. We think of the environment we are building as an integration hub for data, people and applications. It allows users to import, explore and create data and knowledge, inspired by the work of others, while it captures the patterns of decision-making and the provenance of decisions. I will describe the environment we are creating, the advances in the field that enable it, and the challenges that remain.
【Keywords】: big data; data analytics; information integration
【Paper Link】 【Pages】:663-676
【Authors】: Simon Loesing ; Markus Pilman ; Thomas Etter ; Donald Kossmann
【Abstract】: Database scale-out is commonly implemented by partitioning data across several database instances. This approach, however, has several restrictions. In particular, partitioned databases are inflexible in large-scale deployments and assume a partition-friendly workload in order to scale. In this paper, we analyze an alternative architecture design for distributed relational databases that overcomes the limitations of partitioned databases. The architecture is based on two fundamental principles: We decouple query processing and transaction management from data storage, and we share data across query processing nodes. The combination of these design choices provides scalability, elasticity, and operational flexibility without making any assumptions on the workload. As a drawback, sharing data among multiple database nodes causes synchronization overhead. To address this limitation, we introduce techniques for scalable transaction processing in shared-data environments. Specifically, we describe mechanisms for efficient data access, concurrency control, and data buffering. In combination with new hardware trends, the techniques enable performance characteristics that top state-of-the-art partitioned databases.
【Keywords】: decoupled storage; optimistic concurrency control; shared-data; transaction processing
【Paper Link】 【Pages】:677-689
【Authors】: Thomas Neumann ; Tobias Mühlbauer ; Alfons Kemper
【Abstract】: Multi-Version Concurrency Control (MVCC) is a widely employed concurrency control mechanism, as it allows for execution modes where readers never block writers. However, most systems implement only snapshot isolation (SI) instead of full serializability. Adding serializability guarantees to existing SI implementations tends to be prohibitively expensive. We present a novel MVCC implementation for main-memory database systems that has very little overhead compared to serial execution with single-version concurrency control, even when maintaining serializability guarantees. Updating data in-place and storing versions as before-image deltas in undo buffers not only allows us to retain the high scan performance of single-version systems but also forms the basis of our cheap and fine-grained serializability validation mechanism. The novel idea is based on an adaptation of precision locking and verifies that the (extensional) writes of recently committed transactions do not intersect with the (intensional) read predicate space of a committing transaction. We experimentally show that our MVCC model allows very fast processing of transactions with point accesses as well as read-heavy transactions and that there is little need to prefer SI over full serializability any longer.
【Keywords】: multi-version concurrency control; mvcc; serializability
【Paper Link】 【Pages】:691-706
【Authors】: Hideaki Kimura
【Abstract】: Server hardware is about to drastically change. As typified by emerging hardware such as UC Berkeley's Firebox project and by Intel's Rack-Scale Architecture (RSA), next generation servers will have thousands of cores, large DRAM, and huge NVRAM. We analyze the characteristics of these machines and find that no existing database is appropriate. Hence, we are developing FOEDUS, an open-source, from-scratch database engine whose architecture is drastically different from traditional databases. It extends in-memory database technologies to further scale up and also allows transactions to efficiently manipulate data pages in both DRAM and NVRAM. We evaluate the performance of FOEDUS in a large NUMA machine (16 sockets and 240 physical cores) and find that FOEDUS achieves multiple orders of magnitude higher TPC-C throughput compared to H-Store with anti-caching.
【Keywords】: many-cores; nvram
【Paper Link】 【Pages】:707-722
【Authors】: Joy Arulraj ; Andrew Pavlo ; Subramanya Dulloor
【Abstract】: The advent of non-volatile memory (NVM) will fundamentally change the dichotomy between memory and durable storage in database management systems (DBMSs). These new NVM devices are almost as fast as DRAM, but all writes to it are potentially persistent even after power loss. Existing DBMSs are unable to take full advantage of this technology because their internal architectures are predicated on the assumption that memory is volatile. With NVM, many of the components of legacy DBMSs are unnecessary and will degrade the performance of data intensive applications. To better understand these issues, we implemented three engines in a modular DBMS testbed that are based on different storage management architectures: (1) in-place updates, (2) copy-on-write updates, and (3) log-structured updates. We then present NVM-aware variants of these architectures that leverage the persistence and byte-addressability properties of NVM in their storage and recovery methods. Our experimental evaluation on an NVM hardware emulator shows that these engines achieve up to 5.5X higher throughput than their traditional counterparts while reducing the amount of wear due to write operations by up to 2X. We also demonstrate that our NVM-aware recovery protocols allow these engines to recover almost instantaneously after the DBMS restarts.
【Keywords】: non-volatile memory; oltp; recovery; storage engines
【Paper Link】 【Pages】:731-745
【Authors】: Jun Zhang ; Graham Cormode ; Cecilia M. Procopiuc ; Divesh Srivastava ; Xiaokui Xiao
【Abstract】: Protecting the privacy of individuals in graph structured data while making accurate versions of the data available is one of the most challenging problems in data privacy. Most efforts to date to perform this data release end up mired in complexity, overwhelm the signal with noise, and are not effective for use in practice. In this paper, we introduce a new method which guarantees differential privacy. It specifies a probability distribution over possible outputs that is carefully defined to maximize the utility for the given input, while still providing the required privacy level. The distribution is designed to form a 'ladder', so that each output achieves the highest 'rung' (maximum probability) compared to less preferable outputs. We show how our ladder framework can be applied to problems of counting the number of occurrences of subgraphs, a vital objective in graph analysis, and give algorithms whose cost is comparable to that of computing the count exactly. Our experimental study confirms that our method outperforms existing methods for counting triangles and stars in terms of accuracy, and provides solutions for some problems for which no effective method was previously known. The results of our algorithms can be used to estimate the parameters of suitable graph models, allowing synthetic graphs to be sampled.
【Keywords】: differential privacy; local sensitivity; subgraph counting
【Paper Link】 【Pages】:747-762
【Authors】: Bin Yang ; Issei Sato ; Hiroshi Nakagawa
【Abstract】: Differential privacy provides a rigorous standard for evaluating the privacy of perturbation algorithms. It has widely been regarded that differential privacy is a universal definition that deals with both independent and correlated data and a differentially private algorithm can protect privacy against arbitrary adversaries. However, recent research indicates that differential privacy may not guarantee privacy against arbitrary adversaries if the data are correlated. In this paper, we focus on the private perturbation algorithms on correlated data. We investigate the following three problems: (1) the influence of data correlations on privacy; (2) the influence of adversary prior knowledge on privacy; and (3) a general perturbation algorithm that is private for prior knowledge of any subset of tuples in the data when the data are correlated. We propose a Pufferfish definition of privacy, called Bayesian differential privacy, by which the privacy level of a probabilistic perturbation algorithm can be evaluated even when the data are correlated and when the prior knowledge is incomplete. We present a Gaussian correlation model to accurately describe the structure of data correlations and analyze the Bayesian differential privacy of the perturbation algorithm on the basis of this model. Our results show that privacy is poorest for an adversary who has the least prior knowledge. We further extend this model to a more general one that considers uncertain prior knowledge.
【Keywords】: differential privacy; gaussian markov random field; optimization; output perturbation; private data analysis
【Paper Link】 【Pages】:763-777
【Authors】: Charalampos Mavroforakis ; Nathan Chenette ; Adam O'Neill ; George Kollios ; Ran Canetti
【Abstract】: Order-preserving encryption (OPE) schemes, whose ciphertexts preserve the natural ordering of the plaintexts, allow efficient range query processing over outsourced encrypted databases without giving the server access to the decryption key. Such schemes have recently received increased interest in both the database and the cryptographic communities. In particular, modular order-preserving encryption (MOPE), due to Boldyreva et al., is a promising extension that increases the security of the basic OPE by introducing a secret modular offset to each data value prior to encrypting it. However, executing range queries via MOPE in a naive way allows the adversary to learn this offset, negating any potential security gains of this approach. In this paper, we systematically address this vulnerability and show that MOPE can be used to build a practical system for executing range queries on encrypted data while providing a significant security improvement over the basic OPE. We introduce two new query execution algorithms for MOPE: our first algorithm is efficient if the user's query distribution is well-spread, while the second scheme is efficient even for skewed query distributions. Interestingly, our second algorithm achieves this efficiency by leaking the least-important bits of the data, whereas OPE is known to leak the most-important bits of the data. We also show that our algorithms can be extended to the case where the query distribution is adaptively learned online. We present new, appropriate security models for MOPE and use them to rigorously analyze the security of our proposed schemes. Finally, we design a system prototype that integrates our schemes on top of an existing database system and apply query optimization methods to execute SQL queries with range predicates efficiently. We provide a performance evaluation of our prototype under a number of different database and query distributions, using both synthetic and real datasets
【Keywords】: database encryption; database security model; order preserving encryption; range queries
【Paper Link】 【Pages】:779-794
【Authors】: Tristan Allard ; Georges Hébrail ; Florent Masseglia ; Esther Pacitti
【Abstract】: The advent of on-body/at-home sensors connected to personal devices leads to the generation of fine grain highly sensitive personal data at an unprecendent rate. However, despite the promises of large scale analytics there are obvious privacy concerns that prevent individuals to share their personnal data. In this paper, we propose Chiaroscuro, a complete solution for clustering personal data with strong privacy guarantees. The execution sequence produced by Chiaroscuro is massively distributed on personal devices, coping with arbitrary connections and disconnections. Chiaroscuro builds on our novel data structure, called Diptych, which allows the participating devices to collaborate privately by combining encryption with differential privacy. Our solution yields a high clustering quality while minimizing the impact of the differentially private perturbation. Chiaroscuro is both correct and secure. Finally, we provide an experimental validation of our approach on both real and synthetic sets of time-series.
【Keywords】: clustering; differential privacy; gossip; k-means; secure multi-party computation; sensors; time-series
【Paper Link】 【Pages】:795-810
【Authors】: Zhewei Wei ; Ge Luo ; Ke Yi ; Xiaoyong Du ; Ji-Rong Wen
【Abstract】: A persistent data structure, also known as a multiversion data structure in the database literature, is a data structure that preserves all its previous versions as it is updated over time. Every update (inserting, deleting, or changing a data record) to the data structure creates a new version, while all the versions are kept in the data structure so that any previous version can still be queried. Persistent data structures aim at recording all versions accurately, which results in a space requirement that is at least linear to the number of updates. In many of today's big data applications, in particular for high-speed streaming data, the volume and velocity of the data are so high that we cannot afford to store everything. Therefore, streaming algorithms have received a lot of attention in the research community, which use only sublinear space by sacrificing slightly on accuracy. All streaming algorithms work by maintaining a small data structure in memory, which is usually called a em sketch, summary, or synopsis. The sketch is updated upon the arrival of every element in the stream, thus is ephemeral, meaning that it can only answer queries about the current status of the stream. In this paper, we aim at designing persistent sketches, thereby giving streaming algorithms the ability to answer queries about the stream at any prior time.
【Keywords】: approximation; persistence; sketch
【Paper Link】 【Pages】:811-825
【Authors】: Qian Lin ; Beng Chin Ooi ; Zhengkui Wang ; Cui Yu
【Abstract】: Efficient and scalable stream joins play an important role in performing real-time analytics for many cloud applications. However, like in conventional database processing, online theta-joins over data streams are computationally expensive and moreover, being memory-based processing, they impose high memory requirement on the system. In this paper, we propose a novel stream join model, called join-biclique, which organizes a large cluster as a complete bipartite graph. Join-biclique has several strengths over state-of-the-art techniques, including memory-efficiency, elasticity and scalability. These features are essential for building efficient and scalable streaming systems. Based on join-biclique, we develop a scalable distributed stream join system, BiStream, over a large-scale commodity cluster. Specifically, BiStream is designed to support efficient full-history joins, window-based joins and online data aggregation. BiStream also supports adaptive resource management to dynamically scale out and down the system according to its application workloads. We provide both theoretical cost analysis and extensive experimental evaluations to evaluate the efficiency, elasticity and scalability of BiStream.
【Keywords】: data streams; distributed system; stream join
【Paper Link】 【Pages】:827-841
【Authors】: Shaoxu Song ; Aoqian Zhang ; Jianmin Wang ; Philip S. Yu
【Abstract】: Stream data are often dirty, for example, owing to unreliable sensor reading, or erroneous extraction of stock prices. Most stream data cleaning approaches employ a smoothing filter, which may seriously alter the data without preserving the original information. We argue that the cleaning should avoid changing those originally correct/clean data, a.k.a. the minimum change principle in data cleaning. To capture the knowledge about what is clean, we consider the (widely existing) constraints on the speed of data changes, such as fuel consumption per hour, or daily limit of stock prices. Guided by these semantic constraints, in this paper, we propose SCREEN, the first constraint-based approach for cleaning stream data. It is notable that existing data repair techniques clean (a sequence of) data as a whole and fail to support stream computation. To this end, we have to relax the global optimum over the entire sequence to the local optimum in a window. Rather than the commonly observed NP-hardness of general data repairing problems, our major contributions include: (1) polynomial time algorithm for global optimum, (2) linear time algorithm towards local optimum under an efficient Median Principle,(3) support on out-of-order arrivals of data points, and(4) adaptive window size for balancing repair accuracy and efficiency. Experiments on real datasets demonstrate that SCREEN can show significantly higher repair accuracy than the existing approaches such as smoothing.
【Keywords】: data repairing; speed constraints
【Paper Link】 【Pages】:843-857
【Authors】: Long Guo ; Dongxiang Zhang ; Guoliang Li ; Kian-Lee Tan ; Zhifeng Bao
【Abstract】: In this paper, we propose a new location-aware pub/sub system, Elaps, that continuously monitors moving users subscribing to dynamic event streams from social media and E-commerce applications. Users are notified instantly when there is a matching event nearby. To the best of our knowledge, Elaps is the first to take into account continuous moving queries against dynamic event streams. Like existing works on continuous moving query processing,Elaps employs the concept of safe region to reduce communication overhead. However, unlike existing works which assume data from publishers are static, updates to safe regions may be triggered by newly arrived events. In Elaps, we develop a concept called \textit{impact region} that allows us to identify whether a safe region is affected by newly arrived events. Moreover, we propose a novel cost model to optimize the safe region size to keep the communication overhead low. Based on the cost model, we design two incremental methods, iGM and idGM, for safe region construction. In addition, Elaps uses boolean expression, which is more expressive than keywords, to model user intent and we propose a novel index, BEQ-Tree, to handle spatial boolean expression matching. In our experiments, we use geo-tweets from Twitter and venues from Foursquare to simulate publishers and boolean expressions generated from AOL search log to represent users intentions. We test user movement in both synthetic trajectories and real taxi trajectories. The results show that Elaps can significantly reduce the communication overhead and disseminate events to users in real-time.
【Keywords】: continuous moving queries; dynamic event streams; pub/sub
【Paper Link】 【Pages】:859-864
【Authors】: Nick R. Katsipoulakis ; Cory Thoma ; Eric A. Gratta ; Alexandros Labrinidis ; Adam J. Lee ; Panos K. Chrysanthis
【Abstract】: Data Stream Management Systems (DSMS) are crucial for modern high-volume/high-velocity data-driven applications, necessitating a distributed approach to processing them. In addition, data providers often require certain levels of confidentiality for their data, especially in cases of user-generated data, such as those coming out of physical activity/health tracking devices (i.e., our motivating application). This demonstration will showcase Synefo, an infrastructure that enables elastic scaling of DSMS operators, and CryptStream, a framework that provides confidentiality and access controls for data streams while allowing computation on untrusted servers, fused as CE-Storm. We will demonstrate both systems working in tandem and also visualize their behavior over time under different scenarios.
【Keywords】: confidentiality; continuous queries; distributed data stream management system; elasticity
【Paper Link】 【Pages】:865-870
【Authors】: Benjamin Dietrich ; Torsten Grust
【Abstract】: We demonstrate a new incarnation of Habitat, an observational debugger for SQL. In observational debugging, users highlight parts of a presumably faulty query to observe the evaluation of SQL subexpressions and learn about the query's actual runtime behavior. The present version of Habitat has been redesigned from scratch and employs a query instrumentation technique that exclusively relies on the SQL facilities of the underlying RDBMS. We particularly shed light on new features like (1) the debugging of recursive SQL queries and (2) the observation of row groups (before and after aggregation). Habitat can turn any reasonably modern SQL:1999 RDBMS into its own language-level SQL debugger.
【Keywords】: debuggers; observational debugging; sql
【Paper Link】 【Pages】:871-876
【Authors】: Zhifeng Bao ; Yong Zeng ; H. V. Jagadish ; Tok Wang Ling
【Abstract】: Due to the intrinsic ambiguity of keyword queries, users usually need to reformulate their queries multiple times to get the desired information. Even worse, users either have no way to precisely specify their search intention, or have limited domain knowledge on the data to precisely express their search intention. Moreover, they may just have a general interest to explore the data by keyword query. Therefore, our goal is to design an exploratory search paradigm that is able to bring humans more actively into the search process, in order to meet various user information needs, ranging from simple lookup to learning and understanding of the data. Besides, keyword queries against data with structure, such as XML, can run into multiple difficulties: how to identify the search target; more types of ambiguity arise as a keyword can be part of the structure as well as content of data, etc. Effectively addressing these requires solutions to multiple challenges. While some have been addressed to some extent individually, there is no previous effort to develop a comprehensive system to meet these important user needs and meet all of these challenges. Therefore, we propose a framework called ClearMap that natively supports visualized exploratory search paradigm on XML data. In particular, we offer an interactive and visualized mechanism to present the outcome of the query, enable user to explore and manipulate the underlying data to either quickly find desired information or learn the relationship among data items, as well as provide interactive suggestions when their expected results do not exist in the data. A preliminary version of ClearMap and its source code are available for try at http://xmlclearmap.comp.nus.edu.sg.
【Keywords】: interaction; keyword search; usability
【Paper Link】 【Pages】:877-881
【Authors】: Daniel Scheibli ; Christian Dinse ; Alexander Boehm
【Abstract】: QE3D is a novel query plan visualization tool that aims at providing an intuitive and holistic view of distributed query plans executed by the SAP HANA database management system. In this demonstration, we show how its interactive, three-dimensional plan representation helps to understand and quickly identify hotspots in complex, real-world scenarios.
【Keywords】: parallel database; sql; sql query analysis; visualization
【Paper Link】 【Pages】:883-888
【Authors】: John Morcos ; Ziawasch Abedjan ; Ihab Francis Ilyas ; Mourad Ouzzani ; Paolo Papotti ; Michael Stonebraker
【Abstract】: While syntactic transformations require the application of a formula on the input values, such as unit conversion or date format conversions, semantic transformations, such as "zip code to city", require a look-up in some reference data. We recently presented DataXFormer, a system that leverages Web tables, Web forms, and expert sourcing to cover a wide range of transformations. In this demonstration, we present the user-interaction with DataXFormer and show scenarios on how it can be used to transform data and explore the effectiveness and efficiency of several approaches for transformation discovery, leveraging about 112 million tables and online sources.
【Keywords】: data enrichment; data integration; data transformation; deep web; web forms; web tables; wrapper
【Paper Link】 【Pages】:889-894
【Authors】: Yuanzhen Ji ; Hongjin Zhou ; Zbigniew Jerzak ; Anisoara Nica ; Gregor Hackenbroich ; Christof Fetzer
【Abstract】: Executing continuous queries over out-of-order data streams, where tuples are not ordered according to timestamps, is challenging; because high result accuracy and low result latency are two conflicting performance metrics. Although many applications allow trading exact query results for lower latency, they still expect the produced results to meet a certain quality requirement. However, none of existing disorder handling approaches have considered minimizing the result latency while meeting user-specified requirements on the quality of query results. In this demonstration, we showcase AQ-K-slack, an adaptive, buffer-based disorder handling approach, which supports executing sliding window aggregate queries over out-of-order data streams in a quality-driven manner. By adapting techniques from the field of sampling-based approximate query processing and control theory, AQ-K-slack dynamically adjusts the input buffer size at query runtime to minimize the result latency, while respecting a user-specified threshold on relative errors in produced query results. We demonstrate a prototype stream processing system, which extends SAP Event Stream Processor with the implementation of AQ-K-slack. Through an interactive interface, the audience will learn the effect of different factors, such as the aggregate function, the window specification, the result error threshold, and stream properties, on the latency and the accuracy of query results. Moreover, they can experience the effectiveness of AQ-K-slack in obtaining user-desired latency vs. result accuracy trade-offs, compared to naive disorder handling approaches that make extreme trade-offs. For instance, by scarifying 1% result accuracy, our system can reduce the result latency by 80% when compared to the state of the art.
【Keywords】: continuous queries; data stream processing; disorder handling; out-of-order data streams
【Paper Link】 【Pages】:895-900
【Authors】: Ioannis Mytilinis ; Ioannis Giannakopoulos ; Ioannis Konstantinou ; Katerina Doka ; Dimitrios Tsitsigkos ; Manolis Terrovitis ; Lampros Giampouras ; Nectarios Koziris
【Abstract】: The amount of social networking data that is being produced and consumed daily is huge and it is constantly increasing. A user's digital footprint coming from social networks or mobile devices, such as comments and check-ins contains valuable information about his preferences. The collection and analysis of such footprints using also information about the users' friends and their footprints offers many opportunities in areas such as personalized search, recommendations, etc. When the size of the collected data or the complexity of the applied methods increases, traditional storage and processing systems are not enough and distributed approaches are employed. In this work, we present MoDisSENSE, an open-source distributed platform that provides personalized search for points of interest and trending events based on the user's social graph by combining spatio-textual user generated data. The system is designed with scalability in mind, it is built using a combination of latest state-of-the art big data frameworks and its functionality is offered through easy to use mobile and web clients which support the most popular social networks. We give an overview of its architectural components and technologies and we evaluate its performance and scalability using different query types over various cluster sizes. Using the web or mobile clients, users are allowed to register themselves with their own social network credentials, perform socially enhanced queries for POIs, browse the results and explore the automatic blog creation functionality that is extracted by analyzing already collected GPS traces.
【Keywords】: recommendation systems; social mobile applications; spatio-textual data processing
【Paper Link】 【Pages】:901-906
【Authors】: Qiang Hu ; Qi Liu ; Xiaoli Wang ; Anthony K. H. Tung ; Shubham Goyal ; Jisong Yang
【Abstract】: We demonstrate a system, DocRicher, to enrich a text document with social media, that implicitly reference certain passages of it. The aim is to provide an automatic annotation interface to satisfy users' information need, without cumbersome queries to traditional search engines. The system consists of four components: text analysis, query construction, data assignment, and user feedback. Through text analysis, the system decomposes a text document into appropriate topical passages, of which each is represented using detected key phrases. By submitting combinations of these phrases as queries to social media systems, the relevant results are used to suggest new annotations, that are linked to the corresponding passages. We have built a user-friendly visualization tool for users to browse automatically recommended annotations on their reading documents. Users are either allowed to rate a recommended annotation by accepting it or not; or add a new annotation by manually highlighting texts and adding personal comments. Both these annotations are regarded as the ground truth to derive new queries for retrieving more relevant contents. We also apply data fusion to merge the query results from various contexts and retain most relevant ones.
【Keywords】: document enrichment; ranking; social media
【Paper Link】 【Pages】:907-912
【Authors】: Li-Yan Yuan ; Lengdong Wu ; Jia-Huai You ; Yan Chi
【Abstract】: We propose to demonstrate Rubato DB, a highly scalable NewSQL system, supporting various consistency levels from ACID to BASE for OLTP and big data applications. Rubato DB employs the staged grid architecture with a novel formula based protocol for distributed concurrency control. Our demonstration will present Rubato DB as one NewSQL database management system running on a collection of commodity servers against two of benchmark sets. The demo attendees can modify the configuration of system size, fine-tune the query workload, and visualize the performance on the fly by the graphical user interface. Attendees can experiment with various system scales, and thus grasp the potential scalability of Rubato DB, whose performance, with the increase of the number of servers used, can achieve a linear growth for both OLTP application with the strong consistency properties and key-value storage applications with the weak consistency properties.
【Keywords】: acid; architecture; base; concurrency control; scalability
【Paper Link】 【Pages】:913-918
【Authors】: Kai Zeng ; Sameer Agarwal ; Ankur Dave ; Michael Armbrust ; Ion Stoica
【Abstract】: Nearly 15 years ago, Hellerstein, Haas and Wang proposed online aggregation (OLA), a technique that allows users to (1) observe the progress of a query by showing iteratively refined approximate answers, and (2) stop the query execution once its result achieves the desired accuracy. In this demonstration, we present G-OLA, a novel mini-batch execution model that generalizes OLA to support general OLAP queries with arbitrarily nested aggregates using efficient delta maintenance techniques. We have implemented G-OLA in FluoDB, a parallel online query execution framework that is built on top of the Spark cluster computing framework that can scale to massive data sets. We will demonstrate FluoDB on a cluster of 100 machines processing roughly 10TB of real-world session logs from a video-sharing website. Using an ad optimization and an A/B testing based scenario, we will enable users to perform real-time data analysis via web-based query consoles and dashboards.
【Keywords】: online aggregation
【Paper Link】 【Pages】:919-922
【Authors】: Yasushi Sakurai ; Yasuko Matsubara ; Christos Faloutsos
【Abstract】: Given a large collection of time series, such as web-click logs, electric medical records and motion capture sensors, how can we efficiently and effectively find typical patterns? How can we statistically summarize all the sequences, and achieve a meaningful segmentation? What are the major tools for forecasting and outlier detection? Time-series data analysis is becoming of increasingly high importance, thanks to the decreasing cost of hardware and the increasing on-line processing capability. The objective of this tutorial is to provide a concise and intuitive overview of the most important tools that can help us find patterns in large-scale time-series sequences. We review the state of the art in four related fields: (1) similarity search and pattern discovery, (2) linear modeling and summarization, (3) non-linear modeling and forecasting, and (4) the extension of time-series mining and tensor analysis. The emphasis of the tutorial is to provide the intuition behind these powerful tools, which is usually lost in the technical literature, as well as to introduce case studies that illustrate their practical use.
【Keywords】: forecasting; pattern discovery; tensors; time-series
【Paper Link】 【Pages】:923-938
【Authors】: Xiaoyang Wang ; Ying Zhang ; Wenjie Zhang ; Xuemin Lin ; Muhammad Aamir Cheema
【Abstract】: In many domains such as computational geometry and database management, an object may be described by multiple instances (points). Then the distance (or similarity) between two objects is captured by the pair-wise distances among their instances. In the past, numerous nearest neighbor (NN) functions have been proposed to define the distance between objects with multiple instances and to identify the NN object. Nevertheless, considering that a user may not have a specific NN function in mind, it is desirable to provide her with a set of NN candidates. Ideally, the set of NN candidates must include every object that is NN for at least one of the NN functions and must exclude every non-promising object. However, no one has studied the problem of NN candidates computation from this perspective. Although some of the existing works aim at returning a set of candidate objects, they do not focus on the NN functions while computing the candidate objects. As a result, they either fail to include an NN object w.r.t. some NN functions or include a large number of unnecessary objects that have no potential to be the NN regardless of the NN functions. Motivated by this, we classify the existing NN functions for objects with multiple instances into three families by characterizing their key features. Then, we advocate three spatial dominance operators to compute NN candidates where each operator is optimal w.r.t. different coverage of NN functions. Efficient algorithms are proposed for the dominance check and corresponding NN candidates computation. Extensive empirical study on real and synthetic datasets shows that our proposed operators can significantly reduce the number of NN candidates. The comprehensive performance evaluation demonstrates the efficiency of our computation techniques.
【Keywords】: nn candidates; nn functions; spatial dominance
【Paper Link】 【Pages】:939-950
【Authors】: Farhan Tauheed ; Thomas Heinis ; Anastasia Ailamaki
【Abstract】: Simulations have become ubiquitous in many domains of science. Today scientists study natural phenomena by first building massive three-dimensional spatial models and then by simulating the models at discrete intervals of time to mimic the behavior of natural phenomena. One frequently occurring challenge during simulations is the repeated computation of spatial self-joins of the model at each simulation time step. The join is performed to access a group of neighboring spatial objects (groups of particles, molecules or cosmological objects) so that scientists can calculate the cumulative effect (like gravitational force) on an object. Computing a self-join even in memory, soon becomes a performance bottleneck in simulation applications. The problem becomes even worse as scientists continue to improve the precision of simulations by increasing the number as well as the size (3D extent) of the objects. This leads to an exponential increase in join selectivity that challenges the performance and scalability of state-of-the-art approaches. We propose THERMAL-JOIN, a novel spatial self-join algorithm for dynamic memory-resident workloads. The algorithm groups objects in spatial proximity together into hot spots. Hot spots minimize the cost of computing join as objects assigned to a hot spot are guaranteed to overlap with each other. Using a nested spatial grid, THERMAL-JOIN partitions and indexes the dataset to locate hot spots. With experiments we show that our approach provides a speedup between 8 to 12x compared to the state of the art and also scales as scientists improve the precision of their simulations.
【Keywords】: dynamic workload; high selectivity join; scientific data management; spatial self-join
【Paper Link】 【Pages】:951-965
【Authors】: Lu Chen ; Yunjun Gao ; Xinhan Li ; Christian S. Jensen ; Gang Chen ; Baihua Zheng
【Abstract】: Range queries in metric spaces have applications in many areas such as multimedia retrieval, computational biology, and location-based services, where metric uncertain data exists in different forms, resulting from equipment limitations, high-throughput sequencing technologies, privacy preservation, or others. In this paper, we represent metric uncertain data by using an object-level model and a bi-level model, respectively. Two novel indexes, the uncertain pivot B+-tree (UPB-tree) and the uncertain pivot B+-forest (UPB-forest), are proposed accordingly in order to support probabilistic range queries w.r.t. a wide range of uncertain data types and similarity metrics. Both index structures use a small set of effective pivots chosen based on a newly defined criterion, and employ the B+-tree(s) as the underlying index. By design, they are easy to be integrated into any existing DBMS. In addition, we present efficient metric probabilistic range query algorithms, which utilize the validation and pruning techniques based on our derived probability lower and upper bounds. Extensive experiments with both real and synthetic data sets demonstrate that, compared against existing state-of-the-art indexes for metric uncertain data, the UPB-tree and UPB-forest incur much lower construction costs, consume smaller storage spaces, and can support more efficient metric probabilistic range queries.
【Keywords】: index structure; metric space; range query; uncertain data
【Paper Link】 【Pages】:967-982
【Authors】: Sibo Wang ; Wenqing Lin ; Yi Yang ; Xiaokui Xiao ; Shuigeng Zhou
【Abstract】: A public transportation network can often be modeled as a timetable graph where (i) each node represents a station; and (ii) each directed edge (u,v) is associated with a timetable that records the departure (resp. arrival) time of each vehicle at station u (resp. v). Several techniques have been proposed for various types of route planning on timetable graphs, e.g., retrieving the route from a node to another with the shortest travel time. These techniques, however, either provide insufficient query efficiency or incur significant space overheads. This paper presents Timetable Labelling (TTL), an efficient indexing technique for route planning on timetable graphs. The basic idea of TTL is to associate each node $u$ with a set of labels, each of which records the shortest travel time from u to some other node v given a certain departure time from u; such labels would then be used during query processing to improve efficiency. In addition, we propose query algorithms that enable TTL to support three popular types of route planning queries, and investigate how we reduce the space consumption of TTL with advanced preprocessing and label compression methods. By conducting an extensive set of experiments on real world datasets, we demonstrate that TTL significantly outperforms the states of the art in terms of query efficiency, while incurring moderate preprocessing and space overheads.
【Keywords】: algorithm; path queries; transportation network
【Paper Link】 【Pages】:983-998
【Authors】: Aris Anagnostopoulos ; Luca Becchetti ; Adriano Fazzone ; Ida Mele ; Matteo Riondato
【Abstract】: Crowdsourcing is a computational paradigm whose distinctive feature is the involvement of human workers in key steps of the computation. It is used successfully to address problems that would be hard or impossible to solve for machines. As we highlight in this work, the exclusive use of nonexpert individuals may prove ineffective in some cases, especially when the task at hand or the need for accurate solutions demand some degree of specialization to avoid excessive uncertainty and inconsistency in the answers. We address this limitation by proposing an approach that combines the wisdom of the crowd with the educated opinion of experts. We present a computational model for crowdsourcing that envisions two classes of workers with different expertise levels. One of its distinctive features is the adoption of the threshold error model, whose roots are in psychometrics and which we extend from previous theoretical work. Our computational model allows to evaluate the performance of crowdsourcing algorithms with respect to accuracy and cost. We use our model to develop and analyze an algorithm for approximating the best, in a broad sense, of a set of elements. The algorithm uses naïve and expert workers to find an element that is a constant-factor approximation to the best. We prove upper and lower bounds on the number of comparisons needed to solve this problem, showing that our algorithm uses expert and naïve workers optimally up to a constant factor. Finally, we evaluate our algorithm on real and synthetic datasets using the CrowdFlower crowdsourcing platform, showing that our approach is also effective in practice.
【Keywords】: crowdsourcing; human computation; max algorithm; worker models
【Paper Link】 【Pages】:999-1014
【Authors】: Nguyen Quoc Viet Hung ; Duong Chi Thang ; Matthias Weidlich ; Karl Aberer
【Abstract】: In recent years, crowdsourcing has become essential in a wide range of Web applications. One of the biggest challenges of crowdsourcing is the quality of crowd answers as workers have wide-ranging levels of expertise and the worker community may contain faulty workers. Although various techniques for quality control have been proposed, a post-processing phase in which crowd answers are validated is still required. Validation is typically conducted by experts, whose availability is limited and who incur high costs. Therefore, we develop a probabilistic model that helps to identify the most beneficial validation questions in terms of both, improvement of result correctness and detection of faulty workers. Our approach allows us to guide the expert's work by collecting input on the most problematic cases, thereby achieving a set of high quality answers even if the expert does not validate the complete answer set. Our comprehensive evaluation using both real-world and synthetic datasets demonstrates that our techniques save up to 50% of expert efforts compared to baseline methods when striving for perfect result correctness. In absolute terms, for most cases, we achieve close to perfect correctness after expert input has been sought for only 20\% of the questions.
【Keywords】: crowdsourcing; expectation maximization; guiding user feedback; validation
【Paper Link】 【Pages】:1015-1030
【Authors】: Ju Fan ; Guoliang Li ; Beng Chin Ooi ; Kian-Lee Tan ; Jianhua Feng
【Abstract】: Crowdsourcing is widely accepted as a means for resolving tasks that machines are not good at. Unfortunately, Crowdsourcing may yield relatively low-quality results if there is no proper quality control. Although previous studies attempt to eliminate "bad" workers by using qualification tests, the accuracies estimated from qualifications may not be accurate, because workers have diverse accuracies across tasks. Thus, the quality of the results could be further improved by selectively assigning tasks to the workers who are well acquainted with the tasks. To this end, we propose an adaptive crowdsourcing framework, called iCrowd. iCrowd on-the-fly estimates accuracies of a worker by evaluating her performance on the completed tasks, and predicts which tasks the worker is well acquainted with. When a worker requests for a task, iCrowd assigns her a task, to which the worker has the highest estimated accuracy among all online workers. Once a worker submits an answer to a task, iCrowd analyzes her answer and adjusts estimation of her accuracies to improve subsequent task assignments. This paper studies the challenges that arise in iCrowd. The first is how to estimate diverse accuracies of a worker based on her completed tasks. The second is instant task assignment. We deploy iCrowd on Amazon Mechanical Turk, and conduct extensive experiments on real datasets. Experimental results show that iCrowd achieves higher quality than existing approaches.
【Keywords】: adaptive task assignment; crowdsourcing; quality control
【Paper Link】 【Pages】:1031-1046
【Authors】: Yudian Zheng ; Jiannan Wang ; Guoliang Li ; Reynold Cheng ; Jianhua Feng
【Abstract】: A crowdsourcing system, such as the Amazon Mechanical Turk (AMT), provides a platform for a large number of questions to be answered by Internet workers. Such systems have been shown to be useful to solve problems that are difficult for computers, including entity resolution, sentiment analysis, and image recognition. In this paper, we investigate the online task assignment problem: Given a pool of n questions, which of the k questions should be assigned to a worker? A poor assignment may not only waste time and money, but may also hurt the quality of a crowdsourcing application that depends on the workers' answers. We propose to consider quality measures (also known as evaluation metrics) that are relevant to an application during the task assignment process. Particularly, we explore how Accuracy and F-score, two widely-used evaluation metrics for crowdsourcing applications, can facilitate task assignment. Since these two metrics assume that the ground truth of a question is known, we study their variants that make use of the probability distributions derived from workers' answers. We further investigate online assignment strategies, which enables optimal task assignments. Since these algorithms are expensive, we propose solutions that attain high quality in linear time. We develop a system called the Quality-Aware Task Assignment System for Crowdsourcing Applications (QASCA) on top of AMT. We evaluate our approaches on five real crowdsourcing applications. We find that QASCA is efficient, and attains better result quality (of more than 8% improvement) compared with existing methods.
【Keywords】: crowdsourcing; online task assignment; quality control
【Paper Link】 【Pages】:1047-1062
【Authors】: Vasilis Verroios ; Peter Lofgren ; Hector Garcia-Molina
【Abstract】: Latency is a critical factor when using a crowdsourcing platform to solve a problem like entity resolution or sorting. In practice, most frameworks attempt to reduce latency by heuristically splitting a budget of questions into rounds, so that after each round the answers are analyzed and new questions are selected. We focus on one of the most extensively studied crowdsourcing operations, the MAX operation (finding the best element in a collection under human criteria), and we study the problem of budget allocation into rounds for this operation. We provide a polynomial-time dynamic-programming budget allocation algorithm that minimizes the latency when questions form tournaments in each round. Furthermore, we study the general case where questions can be asked in any arbitrary way in each round. Our theoretical results for the general case indicate that our approach is also optimal under certain worst and average-case scenarios. We compare our approach to alternatives on Amazon Mechanical Turk, where many of our theory assumptions do not necessarily hold. We find that our approach is also optimal in practice and achieves a notable improvement over alternatives in most cases.
【Keywords】: crowdsourcing budget allocation; crowdsourcing latency; crowdsourcing maximum operator; crowdsourcing sorting; crowdsourcing top-k
【Paper Link】 【Pages】:1063-1068
【Authors】: Petrie Wong ; Zhian He ; Ziqiang Feng ; Wenjian Xu ; Eric Lo
【Abstract】: Recently, Amazon has announced Redshift, a Parallel-Database-as-a Service (PDaaS). Redshift adopts the "virtual cluster" approach to implement multitenancy, which has the merit of hard isolation among tenants (i.e., tenants do not interfere even when sharing resources). However, that benefit comes with poor resource utilization due to the significant redundancy incurred in the resources. In this demonstration, we present Thrifty, a Parallel-Database-as-a-Service operated using the "shared-process" approach. Compared with Redshift, each tenant in Thrifty does not occupy an exclusive amount of resource but share the database processes together, leading to better resource utilization. To avoid contention among tenants, Thrifty uses a proper cluster design, a tenant placement scheme, and a query routing mechanism to achieve soft isolation. In the demonstration, an attendee will be invited to register with Thrifty as a tenant to rent a parallel database instance. Then the attendee will be allowed to view the dashboard of a Thrifty's administrator. Next, the attendee will be invited to control (e.g., increase) the workload of the tenant so as to see how Thrifty carries out online re- consolidation and elastic scaling.
【Keywords】: cloud databases; consolidation; database-as-a-service; multi-tenant databases; parallel databases
【Paper Link】 【Pages】:1069-1073
【Authors】: Dana Van Aken ; Djellel Eddine Difallah ; Andrew Pavlo ; Carlo Curino ; Philippe Cudré-Mauroux
【Abstract】: Benchmarking is an essential activity when choosing database products, tuning systems, and understanding the trade-offs of the underlying engines. But the workloads available for this effort are often restrictive and non-representative of the ever changing requirements of the modern database applications. We recently introduced OLTP-Bench, an extensible testbed for benchmarking relational databases that is bundled with 15 workloads. The key features that set this framework apart is its ability to tightly control the request rate and dynamically change the transaction mixture. This allows an administrator to compose complex execution targets that recreate real system loads, and opens the doors to new research directions involving tuning for special execution patterns and multi-tenancy. In this demonstration, we highlight OLTP-Bench's important features through the BenchPress game. It allows users to control the benchmark behavior in real time for multiple database management systems.
【Keywords】: benchmarking; configuration; experimentation; flappy bird; oltp; performance; tuning
【Paper Link】 【Pages】:1075-1080
【Authors】: V. M. Megler ; David Maier
【Abstract】: Prior work proposed "Data Near Here" (DNH), a data search engine for scientific archives that is modeled on Internet search engines. DNH performs a periodic, asynchronous scan of each dataset in an archive, extracting lightweight features that are combined to form a dataset summary. During a search, DNH assesses the similarity of the search terms to the summary features and returns to the user, at interactive timescales, a ranked list of datasets for further exploration and analysis. We will demonstrate the search capabilities and ancillary metadata-browsing features for an archive of observational oceanographic data. While comparing search terms to complete datasets might seem ideal, interactive search speed would be impossible with archives of realistic size. We include an analysis showing that our summary-based approach gives a reasonable approximation of such a "complete dataset" similarity measure.
【Keywords】: ranked data search; scientific data
【Paper Link】 【Pages】:1081-1086
【Authors】: Jules Chevalier ; Julien Subercaze ; Christophe Gravier ; Frédérique Laforest
【Abstract】: The Semantic Web has gained substantial momentum over the last decade. It contributes to the manifestation of knowledge from data, and leverages implicit knowledge through reasoning algorithms. The main drawbacks of current reasoning methods over ontologies are two-fold: first they struggle to provide scalability for large datasets, and second, the batch processing reasoners who provide the best scalability so far are unable to infer knowledge from evolving data. We contribute to solving these problems by introducing Slider, an efficient incremental reasoner. Slider goes a significant step beyond existing system, including i) performance, by more than a 70% improvement in average compared to the fastest reasoner available to the best of our knowledge, and ii) inferences on streams of semantic data, by using intrinsic features that are themselves streams-oriented. Slider is fragment agnostic and conceived to handle expanding data with a growing background knowledge base. It natively supports pdf and RDFS, and its architecture allows to extend it to more complex fragments with a minimal effort. In this demo a web-based interface allows the users to visualize the internal behaviour of Slider during the inference, to better understand its design and principles.
【Keywords】: incremental reasoning; streamed reasoning; web of data
【Paper Link】 【Pages】:1087-1092
【Authors】: Ashish Vulimiri ; Carlo Curino ; Philip Brighten Godfrey ; Thomas Jungblut ; Konstantinos Karanasos ; Jitendra Padhye ; George Varghese
【Abstract】: Many large organizations collect massive volumes of data each day in a geographically distributed fashion, at data centers around the globe. Despite their geographically diverse origin the data must be processed and analyzed as a whole to extract insight. We call the problem of supporting large-scale geo-distributed analytics Wide-Area Big Data (WABD). To the best of our knowledge, WABD is currently addressed by copying all the data to a central data center where the analytics are run. This approach consumes expensive cross-data center bandwidth and is incompatible with data sovereignty restrictions that are starting to take shape. We instead propose WANalytics, a system that solves the WABD problem by orchestrating distributed query execution and adjusting data replication across data centers in order to minimize bandwidth usage, while respecting sovereignty requirements. WANalytics achieves an up to 360x reduction in data transfer cost when compared to the centralized approach on both real Microsoft production workloads and standard synthetic benchmarks, including TPC-CH and Berkeley Big-Data. In this demonstration, attendees will interact with a live geo-scale multi-data center deployment of WANalytics, allowing them to experience the data transfer reduction our system achieves, and to explore how it dynamically adapts execution strategy in response to changes in the workload and environment.
【Keywords】: analytics; federation; geo-distribution; olap; sovereignty
【Paper Link】 【Pages】:1093-1098
【Authors】: Huayu Wu ; Jo-Anne Tan ; Wee Siong Ng ; Mingqiang Xue ; Wei Chen
【Abstract】: The tourism industry is a key economic driver for many cities. To understand tourists' traveling patterns can help both public and private relevant sectors design and improve their services to serve tourists better and get additional values from it. The existing approaches to discover tourists' traveling pattern focus on small sets of known tourists extracted from social media or other channels. The accuracy of the mining result cannot be guaranteed due to the small and bias set of samples. In this paper, we present our system FTT (Finding and Tracking Tourists) to identify tourists from public transport commuters in a city, and to further track their movements from one place to another. Our target is a large set of tourists and their trajectories extracted from public transport riding records, which more accurately represent the movements of general tourists. In particular, we design an iterative learning algorithm to find the tourists among public transport commuters, and provide interface to answer user queries on tourists' traveling patterns. The result will be visualized on top of a city map.
【Keywords】: data analytics; public transport; tourist finding; travel pattern
【Paper Link】 【Pages】:1099-1104
【Authors】: Haozhou Wang ; Kai Zheng ; Xiaofang Zhou ; Shazia Wasim Sadiq
【Abstract】: An increasing amount of motion history data, which is called trajectory, is being collected from different sources such as GPS-enabled mobile devices, surveillance cameras and social networks. However it is hard to store and manage trajectory data in traditional database systems, since its variable lengths and asynchronous sampling rates do not fit disk-based and tuple-oriented structures, which are the fundamental structures of traditional database systems. We implement a novel trajectory storage system that is motivated by the success of column store and recent development of in-memory based databases. In this storage design, we try to explore the potential opportunities, which can boost the performance of query processing for trajectory data. To achieve this, we partition the trajectories into frames as column-oriented storage in order to store the sample points of a moving object, which are aligned by the time interval, within the main memory. Furthermore, the frames can be highly compressed and well structured to increase the memory utilization ratio and reduce the CPU-cache missing. It is also easier for parallelizing data processing on the multi-core server since the frames are mutually independent.
【Keywords】: column-oriented data structure; in-memory; trajectory
【Paper Link】 【Pages】:1105-1110
【Authors】: Yonathan Perez ; Rok Sosic ; Arijit Banerjee ; Rohan Puttagunta ; Martin Raison ; Pararth Shah ; Jure Leskovec
【Abstract】: We present Ringo, a system for analysis of large graphs. Graphs provide a way to represent and analyze systems of interacting objects (people, proteins, webpages) with edges between the objects denoting interactions (friendships, physical interactions, links). Mining graphs provides valuable insights about individual objects as well as the relationships among them. In building Ringo, we take advantage of the fact that machines with large memory and many cores are widely available and also relatively affordable. This allows us to build an easy-to-use interactive high-performance graph analytics system. Graphs also need to be built from input data, which often resides in the form of relational tables. Thus, Ringo provides rich functionality for manipulating raw input data tables into various kinds of graphs. Furthermore, Ringo also provides over 200 graph analytics functions that can then be applied to constructed graphs. We show that a single big-memory machine provides a very attractive platform for performing analytics on all but the largest graphs as it offers excellent performance and ease of use as compared to alternative approaches. With Ringo, we also demonstrate how to integrate graph analytics with an iterative process of trial-and-error data exploration and rapid experimentation, common in data mining workloads.
【Keywords】: algorithms; graph analytics; graph processing; graphs; networks; performance
【Paper Link】 【Pages】:1111-1116
【Authors】: Robert Christensen ; Lu Wang ; Feifei Li ; Ke Yi ; Jun Tang ; Natalee Villa
【Abstract】: We present the STORM system to enable spatio-temporal online reasoning and management of large spatio-temporal data. STORM supports interactive spatio-temporal analytics through novel spatial online sampling techniques. Online spatio-temporal aggregation and analytics are then derived based on the online samples, where approximate answers with approximation quality guarantees can be provided immediately from the start of query execution. The quality of these online approximations improve over time. This demonstration proposal describes key ideas in the design of the STORM system, and presents the demonstration plan.
【Keywords】: spatial online analytics; spatial online sampling; storm
【Paper Link】 【Pages】:1117-1122
【Authors】: Jesús Camacho-Rodríguez ; Dario Colazzo ; Ioana Manolescu ; Juan A. M. Naranjo
【Abstract】: XQuery is a general-purpose programming language for processing semi-structured data, and as such, it is very expressive. As a consequence, optimizing and parallelizing complex analytics XQuery queries is still an open, challenging problem. We demonstrate PAXQuery, a novel system that parallelizes the execution of XQuery queries over large collections of XML documents. PAXQuery compiles a rich subset of XQuery into plans expressed in the PArallelization ConTracts (PACT) programming model. Thanks to this translation, the resulting plans are optimized and executed in a massively parallel fashion by the Apache Flink system. The result is a scalable system capable of querying massive amounts of XML data very efficiently, as proved by the experimental results we outline.
【Keywords】: xml data management; xquery parallelization; xquery processing
【Paper Link】 【Pages】:1123-1136
【Authors】: Ingo Müller ; Peter Sanders ; Arnaud Lacurie ; Wolfgang Lehner ; Franz Färber
【Abstract】: For decades researchers have studied the duality of hashing and sorting for the implementation of the relational operators, especially for efficient aggregation. Depending on the underlying hardware and software architecture, the specifically implemented algorithms, and the data sets used in the experiments, different authors came to different conclusions about which is the better approach. In this paper we argue that in terms of cache efficiency, the two paradigms are actually the same. We support our claim by showing that the complexity of hashing is the same as the complexity of sorting in the external memory model. Furthermore we make the similarity of the two approaches obvious by designing an algorithmic framework that allows to switch seamlessly between hashing and sorting during execution. The fact that we mix hashing and sorting routines in the same algorithmic framework allows us to leverage the advantages of both approaches and makes their similarity obvious. On a more practical note, we also show how to achieve very low constant factors by tuning both the hashing and the sorting routines to modern hardware. Since we observe a complementary dependency of the constant factors of the two routines to the locality of the input, we exploit our framework to switch to the faster routine where appropriate. The result is a novel relational aggregation algorithm that is cache-efficient---independently and without prior knowledge of input skew and output cardinality---, highly parallelizable on modern multi-core systems, and operating at a speed close to the memory bandwidth, thus outperforming the state-of-the-art by up to 3.7x.
【Keywords】: adaptive algorithm; aggregation; cache-efficient; group by; grouping; hashing; robust performance; shared-memory; sorting
【Paper Link】 【Pages】:1137-1151
【Authors】: Guoliang Li ; Jian He ; Dong Deng ; Jian Li
【Abstract】: In this paper we study similarity join and search on multi- attribute data. Traditional methods on single-attribute data have pruning power only on single attributes and cannot efficiently support multi-attribute data. To address this problem, we propose a prefix tree index which has holis- tic pruning ability on multiple attributes. We propose a cost model to quantify the prefix tree which can guide the prefix tree construction. Based on the prefix tree, we devise a filter-verification framework to support similarity search and join on multi-attribute data. The filter step prunes a large number of dissimilar results and identifies some candi- dates using the prefix tree and the verification step verifies the candidates to generate the final answer. For similar- ity join, we prove that constructing an optimal prefix tree is NP-complete and develop a greedy algorithm to achieve high performance. For similarity search, since one prefix tree cannot support all possible search queries, we extend the cost model to support similarity search and devise a budget-based algorithm to construct multiple high-quality prefix trees. We also devise a hybrid verification algorithm to improve the verification step. Experimental results show our method significantly outperforms baseline approaches.
【Keywords】: multi-attribute data; similarity join; similarity search
【Paper Link】 【Pages】:1153-1166
【Authors】: Eleni Petraki ; Stratos Idreos ; Stefan Manegold
【Abstract】: Great database systems performance relies heavily on index tuning, i.e., creating and utilizing the best indices depending on the workload. However, the complexity of the index tuning process has dramatically increased in recent years due to ad-hoc workloads and shortage of time and system resources to invest in tuning. This paper introduces holistic indexing, a new approach to automated index tuning in dynamic environments. Holistic indexing requires zero set-up and tuning effort, relying on adaptive index creation as a side-effect of query processing. Indices are created incrementally and partially;they are continuously refined as we process more and more queries. Holistic indexing takes the state-of-the-art adaptive indexing ideas a big step further by introducing the notion of a system which never stops refining the index space, taking educated decisions about which index we should incrementally refine next based on continuous knowledge acquisition about the running workload and resource utilization. When the system detects idle CPU cycles, it utilizes those extra cycles by refining the adaptive indices which are most likely to bring a benefit for future queries. Such idle CPU cycles occur when the system cannot exploit all available cores up to 100%, i.e., either because the workload is not enough to saturate the CPUs or because the current tasks performed for query processing are not easy to parallelize to the point where all available CPU power is exploited. In this paper, we present the design of holistic indexing for column-oriented database architectures and we discuss a detailed analysis against parallel versions of state-of-the-art indexing and adaptive indexing approaches. Holistic indexing is implemented in an open-source column-store DBMS. Our detailed experiments on both synthetic and standard benchmarks (TPC-H) and workloads (SkyServer) demonstrate that holistic indexing brings significant performance gains by being able to continuously refine the physical design in parallel to query processing, exploiting any idle CPU resources.
【Keywords】: holistic indexing; self-organization
【Paper Link】 【Pages】:1167-1182
【Authors】: Barzan Mozafari ; Eugene Zhen Ye Goh ; Dong Young Yoon
【Abstract】: A fundamental problem in database systems is choosing the best physical design, i.e., a small set of auxiliary structures that enable the fastest execution of future queries. Almost all commercial databases come with designer tools that create a number of indices or materialized views (together comprising the physical design) that they exploit during query processing. Existing designers are what we call nominal; that is, they assume that their input parameters are precisely known and equal to some nominal values. For instance, since future workload is often not known a priori, it is common for these tools to optimize for past workloads in hopes that future queries and data will be similar. In practice, however, these parameters are often noisy or missing. Since nominal designers do not take the influence of such uncertainties into account, they find designs that are sub-optimal and remarkably brittle. Often, as soon as the future workload deviates from the past, their overall performance falls off a cliff, leading to customer discontent and expensive redesigns. Thus, we propose a new type of database designer that is robust against parameter uncertainties, so that overall performance degrades more gracefully when future workloads deviate from the past. Users express their risk tolerance by deciding on how much nominal optimality they are willing to trade for attaining their desired level of robustness against uncertain situations. To the best of our knowledge, this paper is the first to adopt the recent breakthroughs in the theory of robust optimization to build a practical framework for solving some of the most fundamental problems in databases, replacing today's brittle designs with a principled world of robust designs that can guarantee predictable and consistent performance.
【Keywords】: database performance; physical design; robust optimization; workload resilience
【Paper Link】 【Pages】:1183-1198
【Authors】: Manas Joglekar ; Hector Garcia-Molina ; Aditya G. Parameswaran ; Christopher Re
【Abstract】: User Defined Function(UDFs) are used increasingly to augment query languages with extra, application dependent functionality. Selection queries involving UDF predicates tend to be expensive, either in terms of monetary cost or latency. In this paper, we study ways to efficiently evaluate selection queries with UDF predicates. We provide a family of techniques for processing queries at low cost while satisfying user-specified precision and recall constraints. Our techniques are applicable to a variety of scenarios including when selection probabilities of tuples are available beforehand, when this information is available but noisy, or when no such prior information is available. We also generalize our techniques to more complex queries. Finally, we test our techniques on real datasets, and show that they achieve significant savings in UDF evaluations of up to $80\%$, while incurring only a small reduction in accuracy.
【Keywords】: approximate query processing; user defined functions
【Paper Link】 【Pages】:1199-1214
【Authors】: Moria Bergman ; Tova Milo ; Slava Novgorodov ; Wang Chiew Tan
【Abstract】: As key decisions are often made based on information contained in a database, it is important for the database to be as complete and correct as possible. For this reason, many data cleaning tools have been developed to automatically resolve inconsistencies in databases. However, data cleaning tools provide only best-effort results and usually cannot eradicate all errors that may exist in a database. Even more importantly, existing data cleaning tools do not typically address the problem of determining what information is missing from a database. To overcome the limitations of existing data cleaning techniques, we present QOCO, a novel query-oriented system for cleaning data with oracles. Under this framework, incorrect (resp. missing) tuples are removed from (added to) the result of a query through edits that are applied to the underlying database, where the edits are derived by interacting with domain experts which we model as oracle crowds. We show that the problem of determining minimal interactions with oracle crowds to derive database edits for removing (adding) incorrect (missing) tuples to the result of a query is NP-hard in general and present heuristic algorithms that interact with oracle crowds. Finally, we implement our algorithms in our prototype system QOCO and show that it is effective and efficient through a comprehensive suite of experiments.
【Keywords】: crowdsourcing; data cleaning
【Paper Link】 【Pages】:1215-1230
【Authors】: Zuhair Khayyat ; Ihab F. Ilyas ; Alekh Jindal ; Samuel Madden ; Mourad Ouzzani ; Paolo Papotti ; Jorge-Arnulfo Quiané-Ruiz ; Nan Tang ; Si Yin
【Abstract】: Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing takes these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized joins operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
【Keywords】: cleansing abstraction; distributed data cleansing; distributed data repair; schema constraints
【Paper Link】 【Pages】:1231-1245
【Authors】: Xiaolan Wang ; Xin Luna Dong ; Alexandra Meliou
【Abstract】: A lot of systems and applications are data-driven, and the correctness of their operation relies heavily on the correctness of their data. While existing data cleaning techniques can be quite effective at purging datasets of errors, they disregard the fact that a lot of errors are systematic, inherent to the process that produces the data, and thus will keep occurring unless the problem is corrected at its source. In contrast to traditional data cleaning, in this paper we focus on data diagnosis: explaining where and how the errors happen in a data generative process. We develop a large-scale diagnostic framework called DATA X-RAY. Our contributions are three-fold. First, we transform the diagnosis problem to the problem of finding common properties among erroneous elements, with minimal domain-specific assumptions. Second, we use Bayesian analysis to derive a cost model that implements three intuitive principles of good diagnoses. Third, we design an efficient, highly-parallelizable algorithm for performing data diagnosis on large-scale data. We evaluate our cost model and algorithm using both real-world and synthetic data, and show that our diagnostic framework produces better diagnoses and is orders of magnitude more efficient than existing techniques.
【Keywords】: data cleaning; data profiling; error diagnosis
【Paper Link】 【Pages】:1247-1261
【Authors】: Xu Chu ; John Morcos ; Ihab F. Ilyas ; Mourad Ouzzani ; Paolo Papotti ; Nan Tang ; Yin Ye
【Abstract】: Classical approaches to clean data have relied on using integrity constraints, statistics, or machine learning. These approaches are known to be limited in the cleaning accuracy, which can usually be improved by consulting master data and involving experts to resolve ambiguity. The advent of knowledge bases KBs both general-purpose and within enterprises, and crowdsourcing marketplaces are providing yet more opportunities to achieve higher accuracy at a larger scale. We propose KATARA, a knowledge base and crowd powered data cleaning system that, given a table, a KB, and a crowd, interprets table semantics to align it with the KB, identifies correct and incorrect data, and generates top-k possible repairs for incorrect data. Experiments show that KATARA can be applied to various datasets and KBs, and can efficiently annotate data and suggest possible repairs.
【Keywords】: crowdsourcing; data cleaning; data quality; knowledge base
【Paper Link】 【Pages】:1263-1277
【Authors】: Sibo Wang ; Xiaokui Xiao ; Chun-Hee Lee
【Abstract】: Data deduplication stands as a building block for data integration and data cleaning. The state-of-the-art techniques focus on how to exploit crowdsourcing to improve the accuracy of deduplication. However, they either incur significant overheads on the crowd or offer inferior accuracy. This paper presents ACD, a new crowd-based algorithm for data deduplication. The basic idea of ACD is to adopt correlation clustering (which is a classic machine-based algorithm for data deduplication) under a crowd-based setting. We propose non-trivial techniques to reduce the time required in performing correlation clustering with the crowd, and devise methods to postprocess the results of correlation clustering for better accuracy of deduplication. With extensive experiments on the Amazon Mechanical Turk, we demonstrate that ACD outperforms the states of the art by offering a high precision of deduplication while incurring moderate crowdsourcing overheads.
【Keywords】: correlating clustering; crowdsourcing; data deduplication
【Paper Link】 【Pages】:1279-1294
【Authors】: Faisal Nawab ; Vaibhav Arora ; Divyakant Agrawal ; Amr El Abbadi
【Abstract】: Cross datacenter replication is increasingly being deployed to bring data closer to the user and to overcome datacenter outages. The extent of the influence of wide-area communication on serializable transactions is not yet clear. In this work, we derive a lower-bound on commit latency. The sum of the commit latency of any two datacenters is at least the Round-Trip Time (RTT) between them. We use the insights and lessons learned while deriving the lower-bound to develop a commit protocol, called Helios, that achieves low commit latencies. Helios actively exchanges transaction logs (history) between datacenters. The received logs are used to decide whether a transaction can commit or not. The earliest point in the received logs that is needed to commit a transaction is decided by Helios to ensure a low commit latency. As we show in the paper, Helios is theoretically able to achieve the lower-bound commit latency. Also, in a real-world deployment on five datacenters, Helios has a commit latency that is close to the optimal.
【Keywords】: cloud computing; geo-replication; multi-datacenter
【Paper Link】 【Pages】:1295-1309
【Authors】: Philip A. Bernstein ; Sudipto Das ; Bailu Ding ; Markus Pilman
【Abstract】: Scaling-out a database system typically requires partitioning the database across multiple servers. If applications do not partition perfectly, then transactions accessing multiple partitions end up being distributed, which has well-known scalability challenges. To address them, we describe a high-performance transaction mechanism that uses optimistic concurrency control on a multi-versioned tree-structured database stored in a shared log. The system scales out by adding servers, without partitioning the database. Our solution is modeled on the Hyder architecture, published by Bernstein, Reid, and Das at CIDR 2011. We present the design and evaluation of the first full implementation of that architecture. The core of the system is a log roll-forward algorithm, called meld, that does optimistic concurrency control. Meld is inherently sequential and is therefore the main bottleneck. Our main algorithmic contributions are optimizations to meld that significantly increase transaction throughput. They use a pipelined design that parallelizes meld onto multiple threads. The slowest pipeline stage is much faster than the original meld algorithm, yielding a 3x improvement of system throughput over the original meld algorithm.
【Keywords】: optimistic concurrency control; scale-out transaction processing
【Paper Link】 【Pages】:1311-1326
【Authors】: Sudip Roy ; Lucja Kot ; Gabriel Bender ; Bailu Ding ; Hossein Hojjat ; Christoph Koch ; Nate Foster ; Johannes Gehrke
【Abstract】: Datastores today rely on distribution and replication to achieve improved performance and fault-tolerance. But correctness of many applications depends on strong consistency properties--something that can impose substantial overheads, since it requires coordinating the behavior of multiple nodes. This paper describes a new approach to achieving strong consistency in distributed systems while minimizing communication between nodes. The key insight is to allow the state of the system to be inconsistent during execution, as long as this inconsistency is bounded and does not affect transaction correctness. In contrast to previous work, our approach uses program analysis to extract semantic information about permissible levels of inconsistency and is fully automated. We then employ a novel homeostasis protocol to allow sites to operate independently, without communicating, as long as any inconsistency is governed by appropriate treaties between the nodes. We discuss mechanisms for optimizing treaties based on workload characteristics to minimize communication, as well as a prototype implementation and experiments that demonstrate the benefits of our approach on common transactional benchmarks.
【Keywords】: coordination-free; geo-replication; program analysis
【Paper Link】 【Pages】:1327-1342
【Authors】: Peter Bailis ; Alan Fekete ; Michael J. Franklin ; Ali Ghodsi ; Joseph M. Hellerstein ; Ion Stoica
【Abstract】: The rise of data-intensive "Web 2.0" Internet services has led to a range of popular new programming frameworks that collectively embody the latest incarnation of the vision of Object-Relational Mapping (ORM) systems, albeit at unprecedented scale. In this work, we empirically investigate modern ORM-backed applications' use and disuse of database concurrency control mechanisms. Specifically, we focus our study on the common use of feral, or application-level, mechanisms for maintaining database integrity, which, across a range of ORM systems, often take the form of declarative correctness criteria, or invariants. We quantitatively analyze the use of these mechanisms in a range of open source applications written using the Ruby on Rails ORM and find that feral invariants are the most popular means of ensuring integrity (and, by usage, are over 37 times more popular than transactions). We evaluate which of these feral invariants actually ensure integrity (by usage, up to 86.9%) and which---due to concurrency errors and lack of database support---may lead to data corruption (the remainder), which we experimentally quantify. In light of these findings, we present recommendations for database system designers for better supporting these modern ORM programming patterns, thus eliminating their adverse effects on application integrity.
【Keywords】: application integrity; concurrency control; impedance mismatch; invariants; orms; ruby on rails
【Paper Link】 【Pages】:1343-1355
【Authors】: Markus Weimer ; Yingda Chen ; Byung-Gon Chun ; Tyson Condie ; Carlo Curino ; Chris Douglas ; Yunseong Lee ; Tony Majestro ; Dahlia Malkhi ; Sergiy Matusevych ; Brandon Myers ; Shravan Narayanamurthy ; Raghu Ramakrishnan ; Sriram Rao ; Russell Sears ; Beysim Sezgin ; Julia Wang
【Abstract】: Resource Managers like Apache YARN have emerged as a critical layer in the cloud computing system stack, but the developer abstractions for leasing cluster resources and instantiating application logic are very low-level. This flexibility comes at a high cost in terms of developer effort, as each application must repeatedly tackle the same challenges (e.g., fault-tolerance, task scheduling and coordination) and re-implement common mechanisms (e.g., caching, bulk-data transfers). This paper presents REEF, a development framework that provides a control-plane for scheduling and coordinating task-level (data-plane) work on cluster resources obtained from a Resource Manager. REEF provides mechanisms that facilitate resource re-use for data caching, and state management abstractions that greatly ease the development of elastic data processing work-flows on cloud platforms that support a Resource Manager service. REEF is being used to develop several commercial offerings such as the Azure Stream Analytics service. Furthermore, we demonstrate REEF development of a distributed shell application, a machine learning algorithm, and a port of the CORFU [4] system. REEF is also currently an Apache Incubator project that has attracted contributors from several instititutions.1 http://reef.incubator.apache.org
【Keywords】: big data; databases; distributed systems; hadoop; high performance computing; machine learning
【Paper Link】 【Pages】:1357-1369
【Authors】: Bikas Saha ; Hitesh Shah ; Siddharth Seth ; Gopal Vijayaraghavan ; Arun C. Murthy ; Carlo Curino
【Abstract】: The broad success of Hadoop has led to a fast-evolving and diverse ecosystem of application engines that are building upon the YARN resource management layer. The open-source implementation of MapReduce is being slowly replaced by a collection of engines dedicated to specific verticals. This has led to growing fragmentation and repeated efforts with each new vertical engine re-implementing fundamental features (e.g. fault-tolerance, security, stragglers mitigation, etc.) from scratch. In this paper, we introduce Apache Tez, an open-source framework designed to build data-flow driven processing runtimes. Tez provides a scaffolding and library components that can be used to quickly build scalable and efficient data-flow centric engines. Central to our design is fostering component re-use, without hindering customizability of the performance-critical data plane. This is in fact the key differentiator with respect to the previous generation of systems (e.g. Dryad, MapReduce) and even emerging ones (e.g. Spark), that provided and mandated a fixed data plane implementation. Furthermore, Tez provides native support to build runtime optimizations, such as dynamic partition pruning for Hive. Tez is deployed at Yahoo!, Microsoft Azure, LinkedIn and numerous Hortonworks customer sites, and a growing number of engines are being integrated with it. This confirms our intuition that most of the popular vertical engines can leverage a core set of building blocks. We complement qualitative accounts of real-world adoption with quantitative experimental evidence that Tez-based implementations of Hive, Pig, Spark, and Cascading on YARN outperform their original YARN implementation on popular benchmarks (TPC-DS, TPC-H) and production workloads.
【Keywords】: apache hadoop; big data; distributed data processing; open source
【Paper Link】 【Pages】:1371-1382
【Authors】: Molham Aref ; Balder ten Cate ; Todd J. Green ; Benny Kimelfeld ; Dan Olteanu ; Emir Pasalic ; Todd L. Veldhuizen ; Geoffrey Washburn
【Abstract】: The LogicBlox system aims to reduce the complexity of software development for modern applications which enhance and automate decision-making and enable their users to evolve their capabilities via a ``self-service'' model. Our perspective in this area is informed by over twenty years of experience building dozens of mission-critical enterprise applications that are in use by hundreds of large enterprises across industries such as retail, telecommunications, banking, and government. We designed and built LogicBlox to be the system we wished we had when developing those applications. In this paper, we discuss the design considerations behind the LogicBlox system and give an overview of its implementation, highlighting innovative aspects. These include: LogiQL, a unified and declarative language based on Datalog; the use of purely functional data structures; novel join processing strategies; advanced incremental maintenance and live programming facilities; a novel concurrency control scheme; and built-in support for prescriptive and predictive analytics.
【Keywords】: datalog; incremental maintenance; leapfrog triejoin; live programming; logicblox; logiql; predictive analytics; transaction repair
【Paper Link】 【Pages】:1383-1394
【Authors】: Michael Armbrust ; Reynold S. Xin ; Cheng Lian ; Yin Huai ; Davies Liu ; Joseph K. Bradley ; Xiangrui Meng ; Tomer Kaftan ; Michael J. Franklin ; Ali Ghodsi ; Matei Zaharia
【Abstract】: Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
【Keywords】: data warehouse; databases; hadoop; machine learning; spark
【Paper Link】 【Pages】:1403-1408
【Authors】: Semih Salihoglu ; Jaeho Shin ; Vikesh Khanna ; Ba Quan Truong ; Jennifer Widom
【Abstract】: We address the problem of debugging programs written for Pregel-like systems. After interviewing Giraph and GPS users, we developed Graft. Graft supports the debugging cycle that users typically go through: (1) Users describe programmatically the set of vertices they are interested in inspecting. During execution, Graft captures the context information of these vertices across supersteps. (2) Using Graft's GUI, users visualize how the values and messages of the captured vertices change from superstep to superstep,narrowing in suspicious vertices and supersteps. (3) Users replay the exact lines of the code vertex.compute() function that executed for the suspicious vertices and supersteps, by copying code that Graft generates into their development environments' line-by-line debuggers. Graft also has features to construct end-to-end tests for Giraph programs. Graft is open-source and fully integrated into Apache Giraph's main code base.
【Keywords】: debugging; distributed bulk synchronous parallel graph systems; distributed graph systems
【Paper Link】 【Pages】:1409-1414
【Authors】: Dongqing Xiao ; Armir Bashllari ; Tyler Menard ; Mohamed Y. Eltabakh
【Abstract】: In this paper, we demonstrate the InsightNotes system, a summary-based annotation management engine over relational databases. InsightNotes addresses the unique challenges that arise in modern applications (especially scientific applications) that rely on rich and large-scale repositories of curation and annotation information. In these applications, the number and size of the raw annotations may grow beyond what end-users and scientists can comprehend and analyze. InsightNotes overcomes these limitations by integrating mining and summarization techniques with the annotation management engine in novel ways. The objective is to create concise and meaningful representations of the raw annotations, called ''annotation summaries'', to be the basic unit of processing. The core functionalities of InsightNotes include: (1) Extensibility, where domain experts can define the summary types suitable for their application, (2) Incremental Maintenance, where the system efficiently maintains the annotation summaries under the continuous addition of new annotations, (3) Summary-Aware Query Processing and Propagation, where the execution engine and query operators are extended for manipulating and propagating the annotation summaries within the query pipeline under complex transformations, and (4) Zoom-in Query Processing, where end-users can interactively expand specific annotation summaries of interest and retrieve their detailed (raw) annotations. We will demonstrate the InsightNotes's features using a real-world annotated database from the ornithological domain (the science of studying birds). We will design an interactive demonstration that engage the audience in annotating the data, visualizing how annotations are summarized and propagated, and zooming-in when desired to retrieve more details.
【Keywords】: query processing; scientific annotations; summary-based annotation management
【Paper Link】 【Pages】:1415-1420
【Authors】: Anja Gruenheid ; Donald Kossmann ; Theodoros Rekatsinas ; Divesh Srivastava
【Abstract】: As the world evolves around us, so does the digital coverage of it. Events of diverse types, associated with different actors and various locations, are continuously captured by multiple information sources such as news articles, blogs, social media etc. day by day. In the digital world, these events are represented through information snippets that contain information on the involved entities, a description of the event, when the event occurred, etc. In our work, we observe that events (and their corresponding digital representations) are often inter-connected, i.e., they form stories which represent evolving relationships between events over time. Take as an example the plane crash in Ukraine in July 2014 which involved multiple entities such as "Ukraine", "Malaysia", and "Russia" and multiple events ranging from the actual crash to the incident investigation and the presentation of the investigator's findings. In this demonstration we present StoryPivot, a framework that helps its users to detect evolving stories in event datasets over time. To resolve stories, we differentiate between story identification, the problem of connecting events over time within a source, and story alignment, the problem of integrating stories across sources. The goal of this demonstration is to present an interactive exploration of both these problems and how events can be dynamically interpreted and put into context in real-world datasets.
【Keywords】: event management; event processing; story detection; story evolution
【Paper Link】 【Pages】:1421-1426
【Authors】: Alexander Ulrich ; Torsten Grust
【Abstract】: We demonstrate the insides and outs of a query compiler based on the flattening transformation, a translation technique designed by the programming language community to derive efficient data-parallel implementations from iterative programs. Flattening admits the straightforward formulation of intricate query logic including deeply nested loops over (possibly ordered) data or the construction of rich data structures. To demonstrate the level of expressiveness that can be achieved, we will bring a compiler frontend that accepts queries embedded into the Haskell programming language. Compilation via flattening takes places in a series of simple steps all of which will be made tangible by the demonstration. The final output is a program of lifted primitive operations which existing query engines can efficiently implement. We provide backends based on PostgreSQL and VectorWise to make this point however, most set-oriented or data-parallel engines could benefit from a flattening-based query compiler.
【Keywords】: flattening; list comprehensions; nested data parallelism
【Paper Link】 【Pages】:1427-1432
【Authors】: Martin Jergler ; Mohammad Sadoghi ; Hans-Arno Jacobsen
【Abstract】: Unlike traditional activity-flow-based models, data-centric workflows primarily focus on the data to drive a business. This enables the unification of operational management, concurrent process analytics, compliance with process or associated data constraints, and adaptability to changing environments. In this demonstration, we present D2Worm, a Distributed Data-centric Workflow Management system. D2Worm allows users to (1) graphically model data-centric workflows in a declarative fashion based on the Guard-Stage-Milestone (GSM) meta-model, (2) automatically compile the modelled workflow into several fine-granular workflow units (WFUs), and (3) deploy these WFUs on distributed infrastructures. A WFU is a system component that manages a subset of the workflow's data model and, at the same time, represents part of the global control flow by evaluating conditions over the data. WFUs communicate with each other over a publish/subscribe messaging infrastructure that allows the architecture to scale from a single node to dozens of machines distributed over different data-centers. In addition, D2Worm is able to (4) concurrently execute multiple workflow instances and monitor their behavior in real-time.
【Keywords】: data-centric workflows; publish/subscribe; workflow distribution
【Paper Link】 【Pages】:1433-1438
【Authors】: Yael Amsterdamer ; Anna Kukliansky ; Tova Milo
【Abstract】: The joint processing of general data, which can refer to objective data such as geographical locations, with individual data, which is related to the habits and opinions of individuals, is required in many real-life scenarios. For this purpose, crowd mining platforms combine searching knowledge bases for general data, with mining the crowd for individual, unrecorded data. Existing such platforms require queries to be stated in a formal language. To bridge the gap between naïve users, who are not familiar with formal query languages, and crowd mining platforms, we develop NL2CM, a prototype system which translates natural language (NL) questions into well-formed crowd mining queries. The mix of general and individual information needs raises unique challenges. In particular, the different types of needs must be identified and translated into separate query parts. To account for these challenges, we develop new, dedicated modules and embed them within the modular and easily extensible architecture of NL2CM. Some of the modules interact with the user during the translation process to resolve uncertainties and complete missing data. We demonstrate NL2CM by translating questions of the audience, in different domains, into NL2CM, a crowd mining query language which is based on SPARQL.
【Keywords】: crowd mining; individual and general data; natural language to query translation
【Paper Link】 【Pages】:1439-1443
【Authors】: Sergey Dudoladov ; Chen Xu ; Sebastian Schelter ; Asterios Katsifodimos ; Stephan Ewen ; Kostas Tzoumas ; Volker Markl
【Abstract】: Over the past years, parallel dataflow systems have been employed for advanced analytics in the field of data mining where many algorithms are iterative. These systems typically provide fault tolerance by periodically checkpointing the algorithm's state and, in case of failure, restoring a consistent state from a checkpoint. In prior work, we presented an optimistic recovery mechanism that in certain cases eliminates the need to checkpoint the intermediate state of an iterative algorithm. In case of failure, our mechanism uses a compensation function to transit the algorithm to a consistent state, from which the execution can continue and successfully converge. Since this recovery mechanism does not checkpoint any state, it achieves optimal failure-free performance while guaranteeing fault tolerance. In this paper, we demonstrate our recovery mechanism with the Apache Flink data processing engine. During our demonstration, attendees will be able to run graph algorithms and trigger failures to observe the algorithms recovering with compensation functions instead of checkpoints.
【Keywords】: fault-tolerance; iterative algorithms; optimistic recovery
【Paper Link】 【Pages】:1445-1450
【Authors】: Saliha Lallali ; Nicolas Anciaux ; Iulian Sandu Popa ; Philippe Pucheral
【Abstract】: The emerging Personal Could paradigm holds the promise of a Privacy-by-Design storage and computing platform where personal data remain under the individual's control while being shared by valuable applications. However, leaving the data management control to user's hands pushes the security issues to the user's platform. This demonstration presents a Secure Personal Cloud Platform relying on a query and access control engine embedded in a tamper resistant hardware device connected to the user's platform. The main difficulty lies in the design of an inverted document index and its related search and update algorithms capable of tackling the strong hardware constraints of these devices. We have implemented our engine on a real tamper resistant hardware device and present its capacity to regulate the access to a personal dataspace. The objective of this demonstration is to show (1) that secure hardware is a key enabler of the Personal Cloud paradigm and (2) that new embedded indexing and querying techniques can tackle the hardware constraints of tamper-resistant devices and provide scalable solutions for the Personal Cloud.
【Keywords】: access control; embedded search engine; personal cloud
【Paper Link】 【Pages】:1451-1456
【Authors】: Katerina Doka ; Nikolaos Papailiou ; Dimitrios Tsoumakos ; Christos Mantas ; Nectarios Koziris
【Abstract】: Big data analytics tools are steadily gaining ground at becoming indispensable to businesses worldwide. The complexity of the tasks they execute is ever increasing due to the surge in data and task heterogeneity. Current analytics platforms, while successful in harnessing multiple aspects of this ``data deluge", bind their efficacy to a single data and compute model and often depend on proprietary systems. However, no single execution engine is suitable for all types of computation and no single data store is suitable for all types of data. To this end, we demonstrate IReS, the Intelligent Resource Scheduler for complex analytics workflows executed over multi-engine environments. Our system models the cost and performance of the required tasks over the available platforms. IReS is then able to match distinct workflow parts to the execution and/or storage engine among the available ones in order to optimize with respect to a user-defined policy. During the demo, the attendees will be able to execute workflows that match real use cases and parametrize the input datasets and optimization policy. The underlying platform supports multiple compute and data engines, allowing the user to choose any subset of them. Through the inspection of the produced plan, its execution and the collection and presentation of numerous cost and performance metrics, the audience will experience first-hand how IReS takes advantage of heterogeneous runtimes and data stores and effectively models operator cost and performance for actual and diverse workflows.
【Keywords】: analytics workflows; big data; cost modelling; multi-engine optimization; profiling
【Paper Link】 【Pages】:1457-1462
【Authors】: Tilmann Rabl ; Manuel Danisch ; Michael Frank ; Sebastian Schindler ; Hans-Arno Jacobsen
【Abstract】: With the rapidly decreasing prices for storage and storage systems ever larger data sets become economical. While only few years ago only successful transactions would be recorded in sales systems, today every user interaction will be stored for ever deeper analysis and richer user modeling. This has led to the development of big data systems, which offer high scalability and novel forms of analysis. Due to the rapid development and ever increasing variety of the big data landscape, there is a pressing need for tools for testing and benchmarking. Vendors have little options to showcase the performance of their systems but to use trivial data sets like TeraSort or WordCount. Since customers' real data is typically subject to privacy regulations and rarely can be utilized, simplistic proof-of-concepts have to be used, leaving both, customers and vendors, unclear of the target use-case performance. As a solution, we present an automatic approach to data synthetization from existing data sources. Our system enables a fully automatic generation of large amounts of complex, realistic, synthetic data.
【Keywords】: data generator; dbsynth; pdgf
【Paper Link】 【Pages】:1463-1475
【Authors】: Claude Barthels ; Simon Loesing ; Gustavo Alonso ; Donald Kossmann
【Abstract】: Database systems running on a cluster of machines, i.e. rack-scale databases, are a common architecture for many large databases and data appliances. As the data movement across machines is often a significant bottleneck, these systems typically use a low-latency, high-throughput network such as InfiniBand. To achieve the necessary performance, parallel join algorithms must take advantage of the primitives provided by the network to speed up data transfer. In this paper we focus on implementing parallel in-memory joins using Remote Direct Memory Access (RDMA), a communication mechanism to transfer data directly into the memory of a remote machine. The results of this paper are, to our knowledge, the first detailed analysis of parallel hash joins using RDMA. To capture their behavior independently of the network characteristics, we develop an analytical model and test our implementation on two different types of networks. The experimental results show that the model is accurate and the resulting distributed join exhibits good performance.
【Keywords】: distributed join; distributed query processing; join processing with rdma; rack scale databases
【Paper Link】 【Pages】:1477-1492
【Authors】: Max Heimel ; Martin Kiefer ; Volker Markl
【Abstract】: Quickly and accurately estimating the selectivity of multidimensional predicates is a vital part of a modern relational query optimizer. The state-of-the art in this field are multidimensional histograms, which offer good estimation quality but are complex to construct and hard to maintain. Kernel Density Estimation (KDE) is an interesting alternative that does not suffer from these problems. However, existing KDE-based selectivity estimators can hardly compete with the estimation quality of state-of-the art methods. In this paper, we substantially expand the state-of-the-art in KDE-based selectivity estimation by improving along three dimensions: First, we demonstrate how to numerically optimize a KDE model, leading to substantially improved estimates. Second, we develop methods to continuously adapt the estimator to changes in both the database and the query workload. Finally, we show how to drastically improve the performance by pushing computations onto a GPU. We provide an implementation of our estimator and experimentally evaluate it on a variety of datasets and workloads, demonstrating that it efficiently scales up to very large model sizes, adapts itself to database changes, and typically outperforms the estimation quality of both existing Kernel Density Estimators as well as state-of-the-art multidimensional histograms.
【Keywords】: bandwidth selection; cardinality estimation; gpgpu; gpu-accelerated databases; gpu-assisted query optimization; graphics cards; kernel density estimator; opencl; query optimization; selectivity estimation; self-tuning databases
【Paper Link】 【Pages】:1493-1508
【Authors】: Orestis Polychroniou ; Arun Raghavan ; Kenneth A. Ross
【Abstract】: Analytical databases are continuously adapting to the underlying hardware in order to saturate all sources of parallelism. At the same time, hardware evolves in multiple directions to explore different trade-offs. The MIC architecture, one such example, strays from the mainstream CPU design by packing a larger number of simpler cores per chip, relying on SIMD instructions to fill the performance gap. Databases have been attempting to utilize the SIMD capabilities of CPUs. However, mainstream CPUs have only recently adopted wider SIMD registers and more advanced instructions, since they do not rely primarily on SIMD for efficiency. In this paper, we present novel vectorized designs and implementations of database operators, based on advanced SIMD operations, such as gathers and scatters. We study selections, hash tables, and partitioning; and combine them to build sorting and joins. Our evaluation on the MIC-based Xeon Phi co-processor as well as the latest mainstream CPUs shows that our vectorization designs are up to an order of magnitude faster than the state-of-the-art scalar and vector approaches. Also, we highlight the impact of efficient vectorization on the algorithmic design of in-memory database operators, as well as the architectural design and power efficiency of hardware, by making simple cores comparably fast to complex cores. This work is applicable to CPUs and co-processors with advanced SIMD capabilities, using either many simple cores or fewer complex cores.
【Keywords】: simd; vectorization; xeon phi
【Paper Link】 【Pages】:1509-1524
【Authors】: Yinan Li ; Craig Chasseur ; Jignesh M. Patel
【Abstract】: In-memory data analytic systems that use vertical bit-parallel scan methods generally use encoding techniques. We observe that in such environments, there is an opportunity to turn skew in both the data and predicate distributions (usually a problem for query processing) into a benefit that can be leveraged to encode the column values. This paper proposes a padded encoding scheme to address this opportunity. The proposed scheme creates encodings that map common attribute values to codes that can easily be distinguished from other codes by only examining a few bits in the full code. Consequently, scans on columns stored using the padded encoding scheme can safely prune the computation without examining all the bits in the code, thereby reducing the memory bandwidth and CPU cycles that are consumed when evaluating scan queries. Our padded encoding method results in a fixed-length encoding, as fixed-length encodings are easier to manage. However, the proposed padded encoding may produce longer (fixed-length) codes than those produced by popular order-preserving encoding methods, such as dictionary-based encoding. This additional space overhead has the potential to negate the gains from early pruning of the scan computation. However, as we demonstrate empirically, the additional space overhead is generally small, and the padded encoding scheme provides significant performance improvements.
【Keywords】: analytics; bit-parallel; encoding; scan; skew
【Paper Link】 【Pages】:1525-1537
【Authors】: Hui Li ; Sourav S. Bhowmick ; Jiangtao Cui ; Yunjun Gao ; Jianfeng Ma
【Abstract】: State-of-the-art classical influence maximization (IM) techniques are "competition-unaware" as they assume that a group (company) finds seeds (users) in a network independent of other groups who are also simultaneously interested in finding such seeds in the same network. However, in reality several groups often compete for the same market (e.g., Samsung, HTC, and Apple for the smart phone market) and hence may attempt to select seeds in the same network. This has led to increasing body of research in devising IM techniques for competitive networks. Despite the considerable progress made by these efforts toward finding seeds in a more realistic settings, unfortunately, they still make several unrealistic assumptions (e.g., a new company being aware of a rival's strategy, alternate seed selection, etc.) making their deployment impractical in real-world networks. In this paper, we propose a novel framework based on game theory to provide a more realistic solution to the IM problem in competitive networks by jettisoning these unrealistic assumptions. Specifically, we seek to find the "best" IM strategy (an algorithm or a mixture of algorithms) a group should adopt in the presence of rivals so that it can maximize its influence. As each group adopts some strategy, we model the problem as a game with each group as competitors and the expected influences under the strategies as payoffs. We propose a novel algorithm called GetReal to find each group's best solution by leveraging the competition between different groups. Specifically, it seeks to find whether there exist a Nash Equilibrium (NE) in a game, which guarantees that there exist an "optimal" strategy for each group. Our experimental study on real-world networks demonstrates the superiority of our solution in a more realistic environment.
【Keywords】: competitive network; game theory; influence maximization; nash equilibrium; pure and mixed strategies
【Paper Link】 【Pages】:1539-1554
【Authors】: Youze Tang ; Yanchen Shi ; Xiaokui Xiao
【Abstract】: Given a social network G and a positive integer k, the influence maximization problem asks for k nodes (in G) whose adoptions of a certain idea or product can trigger the largest expected number of follow-up adoptions by the remaining nodes. This problem has been extensively studied in the literature, and the state-of-the-art technique runs in O((k+l) (n+m) log n ε2) expected time and returns a (1-1 e-ε)-approximate solution with at least 1 - 1/n l probability. This paper presents an influence maximization algorithm that provides the same worst-case guarantees as the state of the art, but offers significantly improved empirical efficiency. The core of our algorithm is a set of estimation techniques based on martingales, a classic statistical tool. Those techniques not only provide accurate results with small computation overheads, but also enable our algorithm to support a larger class of information diffusion models than existing methods do. We experimentally evaluate our algorithm against the states of the art under several popular diffusion models, using real social networks with up to 1.4 billion edges. Our experimental results show that the proposed algorithm consistently outperforms the states of the art in terms of computation efficiency, and is often orders of magnitude faster.
【Keywords】: influence maximization; sampling
【Paper Link】 【Pages】:1555-1569
【Authors】: Zhiting Hu ; Junjie Yao ; Bin Cui ; Eric P. Xing
【Abstract】: How does online content propagate on social networks? Billions of users generate, consume, and spread tons of information every day. This unprecedented scale of dynamics becomes invaluable to reflect our zeitgeist. However, most present diffusion extraction works have only touched individual user level and cannot obtain comprehensive clues. This paper introduces a new approach, i.e., COmmunity Level Diffusion (COLD), to uncover and explore temporal diffusion. We model topics and communities in a unified latent framework, and extract inter-community influence dynamics. With a well-designed multi-component model structure and a parallel inference implementation on GraphLab, the COLD method is expressive while remaining efficient. The extracted community level patterns enable diffusion exploration from a new perspective. We leverage the compact yet robust representations to develop new prediction and analysis applications. Extensive experiments on large social datasets show significant improvement in prediction accuracy. We can also find communities play very different roles in diffusion processes depending on their interest. Our method guarantees high scalability with increasing data size.
【Keywords】: community detection; graph model; information diffusion
【Paper Link】 【Pages】:1571-1585
【Authors】: Kijung Shin ; Jinhong Jung ; Lee Sael ; U. Kang
【Abstract】: Given a large graph, how can we calculate the relevance between nodes fast and accurately? Random walk with restart (RWR) provides a good measure for this purpose and has been applied to diverse data mining applications including ranking, community detection, link prediction, and anomaly detection. Since calculating RWR from scratch takes long, various preprocessing methods, most of which are related to inverting adjacency matrices, have been proposed to speed up the calculation. However, these methods do not scale to large graphs because they usually produce large and dense matrices which do not fit into memory. In this paper, we propose BEAR, a fast, scalable, and accurate method for computing RWR on large graphs. BEAR comprises the preprocessing step and the query step. In the preprocessing step, BEAR reorders the adjacency matrix of a given graph so that it contains a large and easy-to-invert submatrix, and precomputes several matrices including the Schur complement of the submatrix. In the query step, BEAR computes the RWR scores for a given query node quickly using a block elimination approach with the matrices computed in the preprocessing step. Through extensive experiments, we show that BEAR significantly outperforms other state-of-the-art methods in terms of preprocessing and query speed, space efficiency, and accuracy.
【Keywords】: proximity; random walk with restart; relevance score
【Paper Link】 【Pages】:1587-1602
【Authors】: Natali Ruchansky ; Francesco Bonchi ; David García-Soriano ; Francesco Gullo ; Nicolas Kourtellis
【Abstract】: The Wiener index of a graph is the sum of all pairwise shortest-path distances between its vertices. In this paper we study the novel problem of finding a minimum Wiener connector: given a connected graph G=(V,E) and a set Q ⊆ V of query vertices, find a subgraph of G that connects all query vertices and has minimum Wiener index. We show that MIN WIENER CONNECTOR admits a polynomial-time (albeit impractical) exact algorithm for the special case where the number of query vertices is bounded. We show that in general the problem is NP-hard, and has no PTAS unless P = NP. Our main contribution is a constant-factor approximation algorithm running in time Õ(|Q||E|). A thorough experimentation on a large variety of real-world graphs confirms that our method returns smaller and denser solutions than other methods, and does so by adding to the query set Q a small number of ``important'' vertices (i.e., vertices with high centrality).
【Keywords】: approximation algorithms; graph mining; network design; wiener index
【Paper Link】 【Pages】:1603-1616
【Authors】: Senjuti Basu Roy ; Laks V. S. Lakshmanan ; Rui Liu
【Abstract】: There has been significant recent interest in the area of group recommendations, where, given groups of users of a recommender system, one wants to recommend top-$k$ items to a group that maximize the satisfaction of the group members, according to a chosen semantics of group satisfaction. Examples semantics of satisfaction of a recommended itemset to a group include the so-called least misery (LM) and aggregate voting (AV). We consider the complementary problem of how to form groups such that the users in the formed groups are most satisfied with the suggested top-k recommendations. We assume that the recommendations will be generated according to one of the two group recommendation semantics -- LM or AV. Rather than assuming groups are given, or rely on ad hoc group formation dynamics, our framework allows a strategic approach for forming groups of users in order to maximize satisfaction. We show that the problem is NP-hard to solve optimally under both semantics. Furthermore, we develop two efficient algorithms for group formation under LM and show that they achieve bounded absolute error. We develop efficient heuristic algorithms for group formation under AV. We validate our results and demonstrate the scalability and effectiveness of our group formation algorithms on two large real data sets.
【Keywords】: algorithm design; group recommendations; optimization methods
【Paper Link】 【Pages】:1617-1628
【Authors】: Nikos Armenatzoglou ; Huy Pham ; Vasilis Ntranos ; Dimitris Papadias ; Cyrus Shahabi
【Abstract】: Graph partitioning has attracted considerable attention due to its high practicality for real-world applications. It is particularly relevant to social networks because it enables the grouping of users into communities for market analysis and advertising purposes. In this paper, we introduce RMGP, a type of real-time multi-criteria graph partitioning for social networks that groups the users based on their connectivity and their similarity to a set of input classes. We consider RMGP as an on-line task, which may be frequently performed for different query parameters (e.g., classes). In order to overcome the serious performance issues associated with the large social graphs found in practice, we develop solutions based on a game theoretic framework. Specifically, we consider each user as a player, whose goal is to find the class that optimizes his objective function. We propose algorithms based on best-response dynamics, analyze their properties, and show their efficiency and effectiveness on real datasets under centralized and decentralized scenarios.
【Keywords】: game theory; graph partitioning; social networks
【Paper Link】 【Pages】:1629-1643
【Authors】: Jieying She ; Yongxin Tong ; Lei Chen
【Abstract】: Online event-based social network (EBSN) platforms are becoming popular these days. An important task of managing EBSNs is to arrange proper social events to interested users. Existing approaches usually assume that each user only attends one event or ignore location information. The overall utility of such strategy is limited in real world: 1) each user may attend multiple events; 2) attending multiple events will incur spatio-temporal conflicts and travel expenses. Thus, a more intelligent EBSN platform that provides personalized event planning for each participant is desired. In this paper, we first formally define the problem of Utility-aware Social Event-participant Planning (USEP), which is proven to be NP-hard. To solve the USEP problem, we first devise a greedy-based heuristic algorithm, which performs fast under certain circumstances but has no approximation guarantee. We then present a two-step approximation framework, which not only guarantees a 1/2-approximation ratio but also includes a series of optimization techniques to improve its space/time efficiency. Finally, we verify the efficiency and effectiveness of the proposed methods through extensive experiments on real and synthetic datasets.
【Keywords】: event planning; event-based social network
【Paper Link】 【Pages】:1645-1656
【Authors】: Xiangmin Zhou ; Lei Chen ; Yanchun Zhang ; Longbing Cao ; Guangyan Huang ; Chen Wang
【Abstract】: The creation of sharing communities has resulted in the astonishing increasing of digital videos, and their wide applications in the domains such as entertainment, online news broadcasting etc. The improvement of these applications relies on effective solutions for social user access to video data. This fact has driven the recent research interest in social recommendation in shared communities. Although certain effort has been put into video recommendation in shared communities, the contextual information on social users has not been well exploited for effective recommendation. In this paper, we propose an approach based on the content and social information of videos for the recommendation in sharing communities. Specifically, we first exploit a robust video cuboid signature together with the Earth Mover's Distance to capture the content relevance of videos. Then, we propose to identify the social relevance of clips using the set of users belonging to a video. We fuse the content relevance and social relevance to identify the relevant videos for recommendation. Following that, we propose a novel scheme called sub-community-based approximation together with a hash-based optimization for improving the efficiency of our solution. Finally, we propose an algorithm for efficiently maintaining the social updates in dynamic shared communities. The extensive experiments are conducted to prove the high effectiveness and efficiency of our proposed video recommendation approach.
【Keywords】: online video recommendation; social relevance
【Paper Link】 【Pages】:1657-1668
【Authors】: Shreya Prasad ; Arash Fard ; Vishrut Gupta ; Jorge Martinez ; Jeff LeFevre ; Vincent Xu ; Meichun Hsu ; Indrajit Roy
【Abstract】: A typical predictive analytics workflow will pre-process data in a database, transfer the resulting data to an external statistical tool such as R, create machine learning models in R, and then apply the model on newly arriving data. Today, this workflow is slow and cumbersome. Extracting data from databases, using ODBC connectors, can take hours on multi-gigabyte datasets. Building models on single-threaded R does not scale. Finally, it is nearly impossible to use R or other common tools, to apply models on terabytes of newly arriving data. We solve all the above challenges by integrating HP Vertica with Distributed R, a distributed framework for R. This paper presents the design of a high performance data transfer mechanism, new data-structures in Distributed R to maintain data locality with database table segments, and extensions to Vertica for saving and deploying R models. Our experiments show that data transfers from Vertica are 6x faster than using ODBC connections. Even complex predictive analysis on 100s of gigabytes of database tables can complete in minutes, and is as fast as in-memory systems like Spark running directly on a distributed file system.
【Keywords】: hp vertica; in-database; machine learning; r
【Paper Link】 【Pages】:1669-1681
【Authors】: Quoc Trung Tran ; Konstantinos Morfonios ; Neoklis Polyzotis
【Abstract】: Analyzing and understanding the characteristics of the incoming workload is crucial in unraveling trends and tuning the performance of a database system. In this work, we present Oracle Workload Intelligence (WI), a tool for workload modeling and mining, as our attempt to infer the processes that generate a given workload. WI consists of two main functionalities. First, WI derives a model that captures the main characteristics of the workload without overfitting, which makes it likely to generalize well to unseen instances of the workload. Such a model provides insights into the most frequent code paths in the application that drives the workload, and also enables optimizations inside the database system that target sequences of query statements. Second, WI can compare the models of different snapshots of the workload to detect whether the workload has changed. Such changes might indicate new trends, regressions, problems, or even security issues. We demonstrate the effectiveness of WI with an experimental study on synthetic workloads and customer-provided application benchmarks.
【Keywords】: significant pattern; workload intelligence; workload modeling
【Paper Link】 【Pages】:1683-1694
【Authors】: John Colgrove ; John D. Davis ; John Hayes ; Ethan L. Miller ; Cary Sandvig ; Russell Sears ; Ari Tamches ; Neil Vachharajani ; Feng Wang
【Abstract】: Although flash storage has largely replaced hard disks in consumer class devices, enterprise workloads pose unique challenges that have slowed adoption of flash in ``performance tier'' storage appliances. In this paper, we describe Purity, the foundation of Pure Storage's Flash Arrays, the first all-flash enterprise storage system to support compression, deduplication, and high-availability. Purity borrows techniques from modern database and key-value storage architectures, and introduces novel storage primitives that have wide applicability to data management systems. For instance, all writes in Purity are monotonic, and deletions are handled using an atomic predicate-based tuple elision primitive. Purity's redundancy mechanisms are optimized for SSD failure modes and performance characteristics, allowing for fast recovery from component failures and lower space overhead than the best hard disk systems. We built deduplication and data compression schemes atop these primitives. Flash changes storage capacity/performance tradeoffs: unlike disk-based systems, flash deployments are rarely performance bound. A single Purity appliance can provide over 7GiB/s of throughput on 32KiB random I/Os, even through multiple device failures, and while providing asynchronous off-site replication. Typical installations have 99.9% latencies under 1ms, and production arrays average 5.4x data reduction and 99.999% availability. Purity takes advantage of storage performance increasing more rapidly than computational performance to build a simpler (with respect to engineering, installation, and management) scale-up storage appliance that supports hundreds of terabytes of highly-available, high-performance storage. The resulting performance and capacity supports many customer deployments of multiple applications, including scale-out and parallel systems, such as MongoDB and Oracle RAC, on a single Purity appliance.
【Keywords】: deduplication; enterprise flash storage; high availability; log structured storage; scale up architectures; storage area networks
【Paper Link】 【Pages】:1695-1706
【Authors】: Pawel Terlecki ; Fei Xu ; Marianne Shaw ; Valeri Kim ; Richard Michael Grantham Wesley
【Abstract】: The rapid increase in data volumes and complexity of applied analytical tasks poses a big challenge for visualization solutions. It is important to keep the experience highly interactive, so that users stay engaged and can perform insightful data exploration. Query processing usually dominates the cost of visualization generation. Therefore, in order to achieve acceptable response times, one needs to utilize backend capabilities to the fullest and apply techniques, such as caching or prefetching. In this paper we discuss key data processing components in Tableau: the query processor, query caches, Tableau Data Engine [1, 2] and Data Server. Furthermore, we cover recent performance improvements related to the number and quality of remote queries, broader reuse of cached data, and application of inter and intra query parallelism.
【Keywords】: column store; concurrency; data server; data visualization; query batching; tableau data engine
【Paper Link】 【Pages】:1707-1711
【Authors】: Stratis D. Viglas
【Abstract】: Non-volatile memory promises to bridge the gap between main memory and secondary storage by offering a universal storage device. Its performance profile is unique in that its latency is close to main memory and it is byte addressable, but it exhibits asymmetric I/O in that writes are more expensive than reads. These properties imply that it cannot act as a drop-in replacement for either main-memory or disk. Therefore, we must revisit the salient aspects of data management in light of this new technology. In what follows we present the current work in the area with a view towards identifying the open problems and exposing the research opportunities. In particular, we address issues like: (a) incorporating non-volatile memory into the data management stack, (b) supporting transactions and ensuring persistence and recovery, and (c) query processing.
【Keywords】: non-volatile memory; performance; persistence; query processing; recovery
【Paper Link】 【Pages】:1713-1728
【Authors】: Xu Chu ; Yeye He ; Kaushik Chakrabarti ; Kris Ganjam
【Abstract】: It is well known today that pages on the Web contain a large number of content-rich relational tables. Such tables have been systematically extracted in a number of efforts to empower important applications such as table search and schema discovery. However, a significant fraction of relational tables are not embedded in the standard HTML table tags, and are thus difficult to extract. In particular, a large number of relational tables are known to be in a ``list'' form, which contains a list of clearly separated rows that are not separated into columns. In this work, we address the important problem of automatically extracting multi-column relational tables from such lists. Our key intuition lies in the simple observation that in correctly-extracted tables, values in the same column are coherent, both at a syntactic and at a semantic level. Using a background corpus of over 100 million tables crawled from the Web, we quantify semantic coherence based on a statistical measure of value co-occurrence in the same column from the corpus. We then model table extraction as a principled optimization problem -- we allocate tokens in each row sequentially to a fixed number of columns, such that the sum of coherence across all pairs of values in the same column is maximized. Borrowing ideas from $A^\star$ search and metric distance, we develop an efficient 2-approximation algorithm. We conduct large-scale table extraction experiments using both real Web data and proprietary enterprise spreadsheet data. Our approach considerably outperforms the state-of-the-art approaches in terms of quality, achieving over 90% F-measure across many cases.
【Keywords】: html lists; information extraction; table extraction; web tables
【Paper Link】 【Pages】:1729-1744
【Authors】: Jialu Liu ; Jingbo Shang ; Chi Wang ; Xiang Ren ; Jiawei Han
【Abstract】: Text data are ubiquitous and play an essential role in big data applications. However, text data are mostly unstructured. Transforming unstructured text into structured units (e.g., semantically meaningful phrases) will substantially reduce semantic ambiguity and enhance the power and efficiency at manipulating such data using database technology. Thus mining quality phrases is a critical research problem in the field of databases. In this paper, we propose a new framework that extracts quality phrases from text corpora integrated with phrasal segmentation. The framework requires only limited training but the quality of phrases so generated is close to human judgment. Moreover, the method is scalable: both computation time and required space grow linearly as corpus size increases. Our experiments on large text corpora demonstrate the quality and efficiency of the new method.
【Keywords】: phrasal segmentation; phrase mining
【Paper Link】 【Pages】:1745-1760
【Authors】: Immanuel Trummer ; Alon Y. Halevy ; Hongrae Lee ; Sunita Sarawagi ; Rahul Gupta
【Abstract】: Even with the recent developments in Web search of answering queries from structured data, search engines are still limited to queries with an objective answer, such as EUROPEAN CAPITALS or WOODY ALLEN MOVIES. However, many queries are subjective, such as SAFE CITIES, or CUTE ANIMALS. The underlying knowledge bases of search engines do not contain answers to these queries because they do not have a ground truth. We describe the Surveyor system that mines the dominant opinion held by authors of Web content about whether a subjective property applies to a given entity. The evidence on which SURVEYOR relies is statements extracted from Web text that either support the property or claim its negation. The key challenge that SURVEYOR faces is that simply counting the number of positive and negative statements does not suffice, because there are multiple hidden biases with which content tends to be authored on the Web. SURVEYOR employs a probabilistic model of how content is authored on the Web. As one example, this model accounts for correlations between the subjective property and the frequency with which it is mentioned on the Web. The parameters of the model are specialized to each property and entity type. Surveyor was able to process a large Web snapshot within a few hours, resulting in opinions for over 4~billion entity-property combinations. We selected a subset of 500 entity-property combinations and compared our results to the dominant opinion of a large number of Amazon Mechanical Turk (AMT) workers. The predictions of Surveyor match the results from AMT in 77\% of all cases (and 87\% for test cases where inter-worker agreement is high), significantly outperforming competing approaches.
【Keywords】: subjective properties; text mining; user behavior model
【Paper Link】 【Pages】:1761-1775
【Authors】: Wen Hua ; Kai Zheng ; Xiaofang Zhou
【Abstract】: Nowadays microblogging sites, such as Twitter and Chinese Sina Weibo, have established themselves as an invaluable information source, which provides a huge collection of manually-generated tweets with broad range of topics from daily life to breaking news. Entity linking is indispensable for understanding and maintaining such information, which in turn facilitates many real-world applications such as tweet clustering and classification, personalized microblog search, and so forth. However, tweets are short, informal and error-prone, rendering traditional approaches for entity linking in documents largely inapplicable. Recent work addresses this problem by utilising information from other tweets and linking entities in a batch manner. Nevertheless, the high computational complexity makes this approach infeasible for real-time applications given the high arrival rate of tweets. In this paper, we propose an efficient solution to link entities in tweets by analyzing their social and temporal context. Our proposed framework takes into consideration three features, namely entity popularity, entity recency, and user interest information embedded in social interactions to assist the entity linking task. Effective indexing structures along with incremental algorithms have also been developed to reduce the computation and maintenance costs of our approach. Experimental results based on real tweet datasets verify the effectiveness and efficiency of our proposals.
【Keywords】: entity popularity; entity recency; microblog entity linking; social temporal context; user interest
【Paper Link】 【Pages】:1777-1792
【Authors】: Nikolaos Papailiou ; Dimitrios Tsoumakos ; Panagiotis Karras ; Nectarios Koziris
【Abstract】: The pace at which data is described, queried and exchanged using the RDF specification has been ever increasing with the proliferation of Semantic Web. Minimizing SPARQL query response times has been an open issue for the plethora of RDF stores, yet SPARQL result caching techniques have not been extensively utilized. In this work we present a novel system that addresses graph-based, workload-adaptive indexing of large RDF graphs by caching SPARQL query results. At the heart of the system lies a SPARQL query canonical labelling algorithm that is used to uniquely index and reference SPARQL query graphs as well as their isomorphic forms. We integrate our canonical labelling algorithm with a dynamic programming planner in order to generate the optimal join execution plan, examining the utilization of both primitive triple indexes and cached query results. By monitoring cache requests, our system is able to identify and cache SPARQL queries that, even if not explicitly issued, greatly reduce the average response time of a workload. The proposed cache is modular in design, allowing integration with different RDF stores. Incorporating it to an open-source, distributed RDF engine that handles large scale RDF datasets, we prove that workload-adaptive caching can reduce average response times by up to two orders of magnitude and offer interactive response times for complex workloads and huge RDF datasets.
【Keywords】: adaptive indexing; caching; distributed databases; query planning; rdf; sparql
【Paper Link】 【Pages】:1793-1808
【Authors】: Medha Atre
【Abstract】: SPARQL basic graph pattern (BGP) (a.k.a. SQL inner-join) query optimization is a well researched area. However, optimization of OPTIONAL pattern queries (a.k.a. SQL left-outer-joins) poses additional challenges, due to the restrictions on the reordering of left-outer-joins. The occurrence of such queries tends to be as high as 50% of the total queries (e.g., DBPedia query logs). In this paper, we present Left Bit Right (LBR), a technique for well-designed nested BGP and OPTIONAL pattern queries. Through LBR, we propose a novel method to represent such queries using a graph of supernodes, which is used to aggressively prune the RDF triples, with the help of compressed indexes. We also propose novel optimization strategies -- first of a kind, to the best of our knowledge -- that combine together the characteristics of acyclicity of queries, minimality, and nullification, best-match operators. In this paper, we focus on OPTIONAL patterns without UNIONs or FILTERs, but we also show how UNIONs and FILTERs can be handled with our technique using a query rewrite. Our evaluation on RDF graphs of up to and over one billion triples, on a commodity laptop with 8 GB memory, shows that LBR can process well-designed low-selectivity complex queries up to 11 times faster compared to the state-of-the-art RDF column-stores as Virtuoso and MonetDB, and for highly selective queries, LBR is at par with them.
【Keywords】: compressed bitvectors.; left-outer-joins; query optimization; semi-joins; sparql optional patterns
【Paper Link】 【Pages】:1809-1824
【Authors】: Weiguo Zheng ; Lei Zou ; Xiang Lian ; Jeffrey Xu Yu ; Shaoxu Song ; Dongyan Zhao
【Abstract】: A challenging task in the natural language question answering (Q/A for short) over RDF knowledge graph is how to bridge the gap between unstructured natural language questions (NLQ) and graph-structured RDF data (GOne of the effective tools is the "template", which is often used in many existing RDF Q/A systems. However, few of them study how to generate templates automatically. To the best of our knowledge, we are the first to propose a join approach for template generation. Given a workload D of SPARQL queries and a set N of natural language questions, the goal is to find some pairs q, n, for q∈ D ∧ n ∈, N, where SPARQL query q is the best match for natural language question n. These pairs provide promising hints for automatic template generation. Due to the ambiguity of the natural languages, we model the problem above as an uncertain graph join task. We propose several structural and probability pruning techniques to speed up joining. Extensive experiments over real RDF Q/A benchmark datasets confirm both the effectiveness and efficiency of our approach.
【Keywords】: graph database; question answering; rdf
【Paper Link】 【Pages】:1825-1838
【Authors】: Shi Qiao ; Z. Meral Özsoyoglu
【Abstract】: As more RDF data management systems and RDF data querying techniques emerge, RDF benchmarks providing a controllable and comparable testing environment for applications are needed. To address the needs of diverse applications, we propose an application-specific framework, called RBench, to generate RDF benchmarks. RBench takes an RDF dataset from any application as a template, and generates a set of synthetic datasets with similar characteristics including graph structure and literal labels, for the required "size scaling factor" and the "degree scaling factor". RBench analyzes several features from the given RDF dataset, and uses them to reconstruct the new benchmark graph. A flexible query load generation process is then proposed according to the design of RBench. Efficiency and usability of RBench are demonstrated via experimental results.
【Keywords】: application-specific benchmarking
【Paper Link】 【Pages】:1839-1853
【Authors】: Ahmed El-Roby ; Ashraf Aboulnaga
【Abstract】: There has recently been an increase in the number of RDF knowledge bases published on the Internet. These rich RDF data sets can be useful in answering many queries, but much more interesting queries can be answered by integrating information from different data sets. This has given rise to research on automatically linking different RDF data sets representing different knowledge bases. This is challenging due to their scale and semantic heterogeneity. Various approaches have been proposed, but there is room for improving the quality of the generated links. In this paper, we present ALEX, a system that aims at improving the quality of links between RDF data sets by using feedback provided by users on the answers to linked data queries. ALEX starts with a set of candidate links obtained using any automatic linking algorithm. ALEX utilizes user feedback to discover new links that did not exist in the set of candidate links while preserving link precision. ALEX discovers these new links by finding links that are similar to a link approved by the user through feedback on queries. ALEX uses a Monte-Carlo reinforcement learning method to learn how to explore in the space of possible links around a given link. Our experiments on real-world data sets show that ALEX is efficient and significantly improves the quality of links.
【Keywords】: automatic linking; federated query processing; knoweldge bases; linked data; rdf; reinforcement learning
【Paper Link】 【Pages】:1855-1870
【Authors】: John Paparrizos ; Luis Gravano
【Abstract】: The proliferation and ubiquity of temporal data across many disciplines has generated substantial interest in the analysis and mining of time series. Clustering is one of the most popular data mining methods, not only due to its exploratory power, but also as a preprocessing step or subroutine for other techniques. In this paper, we present k-Shape, a novel algorithm for time-series clustering. k-Shape relies on a scalable iterative refinement procedure, which creates homogeneous and well-separated clusters. As its distance measure, k-Shape uses a normalized version of the cross-correlation measure in order to consider the shapes of time series while comparing them. Based on the properties of that distance measure, we develop a method to compute cluster centroids, which are used in every iteration to update the assignment of time series to clusters. To demonstrate the robustness of k-Shape, we perform an extensive experimental evaluation of our approach against partitional, hierarchical, and spectral clustering methods, with combinations of the most competitive distance measures. k-Shape outperforms all scalable approaches in terms of accuracy. Furthermore, k-Shape also outperforms all non-scalable (and hence impractical) combinations, with one exception that achieves similar accuracy results. However, unlike k-Shape, this combination requires tuning of its distance measure and is two orders of magnitude slower than k-Shape. Overall, k-Shape emerges as a domain-independent, highly accurate, and highly efficient clustering approach for time series with broad applications.
【Keywords】: centroid computation; clustering; distance measure; time-series data
【Paper Link】 【Pages】:1871-1886
【Authors】: Jingbo Zhou ; Anthony K. H. Tung
【Abstract】: It is useful to predict future values in time series data, for example when there are many sensors monitoring environments such as urban space. The Gaussian Process (GP) model is considered as a promising technique for this setting. However, the GP model requires too high a training cost to be tractable for large data. Though approximation methods have been proposed to improve GP's scalability, they usually can only capture global trends in the data and fail to preserve small-scale patterns, resulting in unsatisfactory performance. We propose a new method to apply the GP for sensor time series prediction. Instead of (eagerly) training GPs on entire datasets, we custom-build query-dependent GPs on small fractions of the data for each prediction request. Implementing this idea in practice at scale requires us to overcome two obstacles. On the one hand, a central challenge with such a semi-lazy learning model is the substantial model-building effort at kNN query time, which could lead to unacceptable latency. We propose a novel two-level inverted-like index to support kNN search using the DTW on the GPU, making such "just-in-time" query-dependent model construction feasible for real-time applications. On the other hand, several parameters should be tuned for each time series individually since different sensors have different data generating processes in diverse environments. Manually configuring the parameters is usually not feasible due to the large number of sensors. To address this, we devise an adaptive auto-tuning mechanism to automatically determine and dynamically adjust the parameters for each time series with little human assistance. Our method has the following strong points: (a) it can make prediction in real time without a training phase; (b) it can yield superior prediction accuracy; and (c) it can effectively estimate the analytical predictive uncertainty. To illustrate our points, we present SMiLer, a semi-lazy time series prediction system for sensors. Extensive experiments on real-world datasets demonstrate its effectiveness and efficiency. In particular, by devising a two-level inverted-like index on the GPU with an enhanced lower bound of the DTW, SMiLer accelerates the efficiency of kNN search by one order of magnitude over its baselines. The prediction accuracy of SMiLer is better than the state-of-the-art competitors (up to 10 competitors) with better estimation of predictive uncertainty.
【Keywords】: dtw; gaussian process; gpu; predictive analysis; semi-lazy learning; sensors; time series
【Paper Link】 【Pages】:1887-1901
【Authors】: Wen Sun ; Achille Fokoue ; Kavitha Srinivas ; Anastasios Kementsietsidis ; Gang Hu ; Guo Tong Xie
【Abstract】: We show that existing mature, relational optimizers can be exploited with a novel schema to give better performance for property graph storage and retrieval than popular noSQL graph stores. The schema combines relational storage for adjacency information with JSON storage for vertex and edge attributes. We demonstrate that this particular schema design has benefits compared to a purely relational or purely JSON solution. The query translation mechanism translates Gremlin queries with no side effects into SQL queries so that one can leverage relational query optimizers. We also conduct an empirical evaluation of our schema design and query translation mechanism with two existing popular property graph stores. We show that our system is 2-8 times better on query performance, and 10-30 times better in throughput on 4.3 billion edge graphs compared to existing stores.
【Keywords】: gremlin; property graphs; relational storage
【Paper Link】 【Pages】:1903-1916
【Authors】: Dayu Yuan ; Prasenjit Mitra ; Huiwen Yu ; C. Lee Giles
【Abstract】: Indices are commonly built into graph databases in order to support fast searches. Any given graph database and the distribution of queries will change over time. Therefore, the cost of processing queries using a static graph index increases because the index is built to optimize old snapshots of the database. There is growing research interest in determining how to update a graph index with the purpose of adapting to database and query changes. Updating features in a graph index is typically an NP-hard problem. In addition, because the features are chosen from a large number of frequent subgraphs, a multi-pass algorithm is not scalable to big datasets. In order to address this issue, we propose a time-efficient one-pass algorithm that is designed to update a graph index by scanning each frequent subgraph at most once. The algorithm replaces a feature with a new subgraph if the latter is ``better" than the former one. We use the branch and bound technique to skip subgraphs that cannot outperform any of the features in the graph index. We further use a decomposed index and reduce the space complexity from O(|G||Q|) to O(|G| + |Q|), where G is database graphs and Q is a query workload. Through the empirical study, we show that the one-pass algorithm is 5--100 times faster than all previous algorithms for updating graph indices. In addition, the one-pass algorithm guarantees the return of a close to optimum solution. Our experiments show that when the one-pass algorithm is used to update an index, the query-processing speed is $1$--$2$ times faster than that of other cutting-edge indices, i.e., the FGindex and the gIndex.
【Keywords】: graph database; graph mining
【Paper Link】 【Pages】:1917-1923
【Authors】: Anurag Gupta ; Deepak Agarwal ; Derek Tan ; Jakub Kulesza ; Rahul Pathak ; Stefano Stefani ; Vidhya Srinivasan
【Abstract】: Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse solution that makes it simple and cost-effective to efficiently analyze large volumes of data using existing business intelligence tools. Since launching in February 2013, it has been Amazon Web Service's (AWS) fastest growing service, with many thousands of customers and many petabytes of data under management. Amazon Redshift's pace of adoption has been a surprise to many participants in the data warehousing community. While Amazon Redshift was priced disruptively at launch, available for as little as $1000/TB/year, there are many open-source data warehousing technologies and many commercial data warehousing engines that provide free editions for development or under some usage limit. While Amazon Redshift provides a modern MPP, columnar, scale-out architecture, so too do many other data warehousing engines. And, while Amazon Redshift is available in the AWS cloud, one can build data warehouses using EC2 instances and the database engine of one's choice with either local or network-attached storage. In this paper, we discuss an oft-overlooked differentiating characteristic of Amazon Redshift -- simplicity. Our goal with Amazon Redshift was not to compete with other data warehousing engines, but to compete with non-consumption. We believe the vast majority of data is collected but not analyzed. We believe, while most database vendors target larger enterprises, there is little correlation in today's economy between data set size and company size. And, we believe the models used to procure and consume analytics technology need to support experimentation and evaluation. Amazon Redshift was designed to bring data warehousing to a mass market by making it easy to buy, easy to tune and easy to manage while also being fast and cost-effective.
【Keywords】: amazon redshift; columnar database:redshift:data warehousing:mpp
【Paper Link】 【Pages】:1925-1940
【Authors】: Mukund Deshpande ; Dhruva Ray ; Sameer Dixit ; Avadhoot Agasti
【Abstract】: The field of data analysis seeks to extract value from data for either business or scientific benefit. This field has seen a renewed interest with the advent of big data technologies and a new organizational role called data scientist. Even with the new found focus, the task of analyzing large amounts of data is still challenging and time-consuming. The essence of data analysis involves setting up data pipe-lines which consists of several operations that are chained together - starting from data collection, data quality checks, data integration, data analysis and data visualization (including the setting up of interaction paths in that visualization). In our opinion, the challenges stem from from the technology diversity at each stage of the data pipeline as well as the lack of process around the analysis. In this paper we present a platform that aims to significantly reduce the time it takes to build data pipelines. The platform attempts to achieve this in following ways. Allow the user to describe the entire data pipeline with a single language and idioms - all the way from data ingestion to insight expression (via visualization and end-user interaction). Provide a rich library of parts that allow users to quickly assemble a data analysis pipeline in the language. Allow for a collaboration model that allows multiple users to work together on a data analysis pipeline as well as leverage and extend prior work with minimal effort. We studied the efficacy of the platform for a data hackathon competition conducted in our organization. The hackathon provided us with a way to study the impact of the approach. Rich data pipelines which traditionally took weeks to build were constructed and deployed in hours. Consequently, we believe that the complexity of designing and running the data analysis pipeline can be significantly reduced; leading to a marked improvement in the productivity of data analysts/data scientists.
【Keywords】: big data; data analysis; data pipeline; data visualization
【Paper Link】 【Pages】:1941-1953
【Authors】: Immanuel Trummer ; Christoph Koch
【Abstract】: Query plans offer diverse tradeoffs between conflicting cost metrics such as execution time, energy consumption, or execution fees in a multi-objective scenario. It is convenient for users to choose the desired cost tradeoff in an interactive process, dynamically adding constraints and finally selecting the best plan based on a continuously refined visualization of optimal cost tradeoffs. Multi-objective query optimization (MOQO) algorithms must possess specific properties to support such an interactive process: First, they must be anytime algorithms, generating multiple result plan sets of increasing quality with low latency between consecutive results. Second, they must be incremental, meaning that they avoid regenerating query plans when being invoked several times for the same query but with slightly different user constraints. We present an incremental anytime algorithm for MOQO, analyze its complexity and show that it offers an attractive tradeoff between result update frequency, single invocation time complexity, and amortized time over multiple invocations. Those properties make it suitable to be used within an interactive query optimization process. We evaluate the algorithm in comparison with prior work on TPC-H queries; our implementation is based on the Postgres database management system.
【Keywords】: anytime; incremental; multi-objective; query optimization
【Paper Link】 【Pages】:1955-1967
【Authors】: Niccolo' Meneghetti ; Denis Mindolin ; Paolo Ciaccia ; Jan Chomicki
【Abstract】: Skylines assume that all attributes are equally important, as each dimension can always be traded off for another. Prioritized skylines (p-skylines) take into account non-compensatory preferences, where some dimensions are deemed more important than others, and trade-offs are constrained by the relative importance of the attributes involved. In this paper we show that querying using non-compensatory preferences is computationally efficient. We focus on preferences that are representable with p-expressions, and develop an efficient in-memory divide-and-conquer algorithm for answering p-skyline queries. Our algorithm is output-sensitive; this is very desirable in the context of preference queries, since the output is expected to be, on average, only a small fraction of the input. We prove that our method is well behaved in both the worst- and the average-case scenarios. Additionally, we develop a general framework for benchmarking p-skyline algorithms, showing how to sample prioritized preference relations uniformly, and how to highlight the effect of data correlation on performance. We conclude our study with extensive experimental results.
【Keywords】: algorithms; experimentation; p-skyline; pareto accumulation; performance; preference; preference query; prioritized accumulation; skyline
【Paper Link】 【Pages】:1969-1984
【Authors】: Arun Kumar ; Jeffrey F. Naughton ; Jignesh M. Patel
【Abstract】: Enterprise data analytics is a booming area in the data management industry. Many companies are racing to develop toolkits that closely integrate statistical and machine learning techniques with data management systems. Almost all such toolkits assume that the input to a learning algorithm is a single table. However, most relational datasets are not stored as single tables due to normalization. Thus, analysts often perform key-foreign key joins before learning on the join output. This strategy of learning after joins introduces redundancy avoided by normalization, which could lead to poorer end-to-end performance and maintenance overheads due to data duplication. In this work, we take a step towards enabling and optimizing learning over joins for a common class of machine learning techniques called generalized linear models that are solved using gradient descent algorithms in an RDBMS setting. We present alternative approaches to learn over a join that are easy to implement over existing RDBMSs. We introduce a new approach named factorized learning that pushes ML computations through joins and avoids redundancy in both I/O and computations. We study the tradeoff space for all our approaches both analytically and empirically. Our results show that factorized learning is often substantially faster than the alternatives, but is not always the fastest, necessitating a cost-based approach. We also discuss extensions of all our approaches to multi-table joins as well as to Hive.
【Keywords】: analytics; feature engineering; joins; machine learning
【Paper Link】 【Pages】:1985-2000
【Authors】: Yannis Katsis ; Kian Win Ong ; Yannis Papakonstantinou ; Kevin Keliang Zhao
【Abstract】: Prior Incremental View Maintenance (IVM) algorithms specify the view tuples that need to be modified by computing diff sets, which we call tuple-based diffs since a diff set contains one diff tuple for each to-be-modified view tuple. idIVM assumes the base tables have keys and performs IVM by computing ID-based diff sets that compactly identify the to-be-modified tuples through their IDs. This work makes the following contributions: (a) An ID-based IVM system for a large subset of SQL that includes the algebraic operators selection, join, grouping and aggregation, generalized projection involving functions, antisemijoin (and therefore negation/difference) and union. The system is based on a modular approach, allowing one to extend the supported language simply by adding one algebraic operator at-a-time, along with equations describing how ID-based changes are propagated through the operator. (b) An efficient algorithm that creates an IVM plan for a given view in four passes that are polynomial in the size of the view expression. (c) A formal analysis comparing the ID-based IVM algorithm to prior IVM approaches and analytically showing when one outperforms the other. (d) An experimental comparison of the ID-based IVM algorithm to prior IVM algorithms showing the superiority of the former in common use cases.
【Keywords】: incremental view maintenance; materialized views
【Paper Link】 【Pages】:2001-2016
【Authors】: Fotis Psallidas ; Bolin Ding ; Kaushik Chakrabarti ; Surajit Chaudhuri
【Abstract】: An enterprise information worker is often aware of a few example tuples that should be present in the output of the query. Query discovery systems have been developed to discover project-join queries that contain the given example tuples in their output. However, they require the output to exactly contain all the example tuples and do not perform any ranking. To address this limitation, we study the problem of efficiently discovering top-k project join queries which approximately contain the given example tuples in their output. We extend our algorithms to incrementally produce results as soon as the user finishes typing/modifying a cell. Our experiments on real-life and synthetic datasets show that our proposed solution is significantly more efficient compared with applying state-of-the-art algorithms.
【Keywords】: example spreadsheet; relevance ranking; sql query discovery
【Paper Link】 【Pages】:2017-2030
【Authors】: Karim Ibrahim ; Xiao Du ; Mohamed Y. Eltabakh
【Abstract】: Annotation management and data curation has been extensively studied in the context of relational databases. However, existing annotation management techniques share a common limitation, which is that they are all passive engines, i.e., they only manage the annotations obtained from external sources such as DB admins, domain experts, and curation tools. They neither learn from the available annotations nor exploit the annotations-to-data correlations to further enhance the quality of the annotated database. Delegating such crucial and complex tasks to end-users---especially under large-scale databases and annotation sets---is clearly the wrong choice. In this paper, we propose the Nebula system, an advanced and proactive annotation management engine in relational databases. Nebula complements the state-of-art techniques in annotation management by learning from the available annotations, analyzing their content and semantics, and understanding their correlations with the data. And then, Nebula proactively discovers and recommends potentially missing annotation-to-data attachments. We propose context-aware ranking and prioritization of the discovered attachments that take into account the relationships among the data tuples and their annotations. We also propose approximation techniques and expert-enabled verification mechanisms that adaptively maintain high-accuracy predictions while minimizing the experts' involvement. Nebula is realized on top of an existing annotation management engine, and experimentally evaluated to illustrate the effectiveness of the proposed techniques, and to demonstrate the potential gain in enhancing the quality of annotated databases.
【Keywords】: annotated database; keyword search; proactive annotation management
【Paper Link】 【Pages】:2031-2046
【Authors】: Ngai Meng Kou ; Leong Hou U ; Nikos Mamoulis ; Zhiguo Gong
【Abstract】: Peer reviewing is a standard process for assessing the quality of submissions at academic conferences and journals. A very important task in this process is the assignment of reviewers to papers. However, achieving an appropriate assignment is not easy, because all reviewers should have similar load and the subjects of the assigned papers should be consistent with the reviewers' expertise. In this paper, we propose a generalized framework for fair reviewer assignment. We first extract the domain knowledge from the reviewers' published papers and model this knowledge as a set of topics. Then, we perform a group assignment of reviewers to papers, which is a generalization of the classic Reviewer Assignment Problem (RAP), considering the relevance of the papers to topics as weights. We study a special case of the problem, where reviewers are to be found for just one paper (Journal Assignment Problem) and propose an exact algorithm which is fast in practice, as opposed to brute-force solutions. For the general case of having to assign multiple papers, which is too hard to be solved exactly, we propose a greedy algorithm that achieves a 1/2-approximation ratio compared to the exact solution. This is a great improvement compared to the 1/3-approximation solution proposed in previous work for the simpler coverage-based reviewer assignment problem, where there are no weights on topics. We theoretically prove the approximation bound of our solution and experimentally show that it is superior to the current state-of-the-art.
【Keywords】: group coverage; paper reviewer assignment; stage deepening greedy
【Paper Link】 【Pages】:2047-2061
【Authors】: Mingwang Tang ; Feifei Li ; Yufei Tao
【Abstract】: In online tracking, an observer S receives a sequence of values, one per time instance, from a data source that is described by a function f. A tracker T wants to continuously maintain an approximation that is within an error threshold of the value f(t) at any time instance t, with small communication overhead. This problem was recently formalized and studied, and a principled approach with optimal competitive ratio was proposed. This work extends the study of online tracking to a distributed setting, where a tracker T wants to track a function f that is computed from a set of functions f1 , . . . , fm from m distributed observers and respective data sources. This formulation finds numerous important and natural applications, e.g., sensor networks, distributed systems, measurement networks, and pub-sub systems. We formalize this problem and present effective online algorithms for various topologies of a distributed system/network for different aggregate functions. Experiments on large real data sets demonstrate the excellent performance of our methods in practice.
【Keywords】: distributed online tracking; online tracking; tracking
【Paper Link】 【Pages】:2063-2066
【Authors】: Xin Luna Dong ; Divesh Srivastava
【Abstract】: Large-scale knowledge repositories are becoming increasingly important as a foundation for enabling a wide variety of complex applications. In turn, building high-quality knowledge repositories critically depends on the technologies of knowledge curation and knowledge fusion, which share many similar goals with data integration, while facing even more challenges in extracting knowledge from both structured and unstructured data, across a large variety of domains, and in multiple languages. Our tutorial highlights the similarities and differences between knowledge management and data integration, and has two goals. First, we introduce the Database community to the techniques proposed for the problems of entity linkage and relation extraction by the Knowledge Management, Natural Language Processing, and Machine Learning communities. Second, we give a detailed survey of the work done by these communities in knowledge fusion, which is critical to discover and clean errors present in sources and the many mistakes made in the process of knowledge extraction from sources. Our tutorial is example driven and hopes to build bridges between the Database community and other disciplines to advance research in this important area.
【Keywords】: entity linkage; knowledge curation; knowledge fusion
【Paper Link】 【Pages】:2067-2068
【Authors】: Mansheng Yang ; Richard T. B. Ma
【Abstract】: Task migration happens when distributed data processing systems scale in real-time. To handle the task migration process more gracefully, we propose three task migration methods: (i) worker level migration, (ii) executor level migration, and (iii) executor level migration with reliable messaging. We implement our migration methods on Apache Storm. Our experiments show that, compared with Storm's original migration implementation, our methods significantly reduce the performance degradation and the number of task failures during each migration.
【Keywords】: distributed real-time data processing; storm; task migration
【Paper Link】 【Pages】:2069-2070
【Authors】: Oreoluwatomiwa O. Babarinsa ; Stratos Idreos
【Abstract】: As main-memory sizes have grown, data systems have been able to process entire large-scale data-sets in memory. However, because memory speeds have been not been keeping pace with CPU speeds, the cost of moving data into CPU caches has begun to dominate certain operations within in-memory data systems. Recent advances in hardware architectures point to near memory computation capabilities becoming possible soon. This allows us to rethink how database systems process queries and how they split computation across the various computational units. In this paper, we present JAFAR, a near data processing accelerator for pushing selects down to memory. Through a detailed simulation of JAFAR hardware we show it has the potential to provide up to 900\% improvement for select operations in modern column-stores.
【Keywords】: column-store database; near-data processing; processing-in-memory
【Paper Link】 【Pages】:2071-2072
【Authors】: Trevor Clinkenbeard ; Anisoara Nica
【Abstract】: The research presented in this paper analyzes different algorithms for scheduling a set of potentially interdependent jobs in order to minimize the total runtime, or makespan, when data communication costs are considered. On distributed file systems, such as the HDFS, files are spread across a cluster of machines. Once a request, such as a query, having as input the data in these files is translated into a set of jobs, these jobs must be scheduled across machines in the cluster. Jobs consume input files stored in the distributed file system or in cluster nodes, and produce output which is potentially consumed by future jobs. If a job needs a particular file as input, the job must either be executed on the same machine, or it must incur a time penalty to copy the file, increasing latency for the job. At the same time, independent jobs are ideally scheduled at the same time on different machines, in order to take advantage of parallelism. Both minimizing communication costs and maximizing parallelism serve to minimize the total runtime of a set of jobs. Furthermore, the problem gets more complex when certain jobs must wait for previous jobs to provide input, as is frequently the case when a set of jobs represents the steps of a query on a distributed database.
【Keywords】: distributed databases; job scheduling
【Paper Link】 【Pages】:2073-2074
【Authors】: Styliani Pantela ; Stratos Idreos
【Abstract】: Just-In-Time (JIT) compilation increasingly becomes a key technology for modern database systems. It allows the creation of code on-the-fly to perfectly match an active query. In the past, it has been argued that a query should be compiled to a single loop that performs all query actions, for example, all selects over all relevant columns. On the other hand, vectorization -- a common feature in modern data systems -- allows for better results by evaluating the query predicates sequentially in different tight for-loops. In this paper, we study JIT compilation for modern in-memory column-stores in detail and we show that, contrary to the common belief that vectorization outweighs the benefits of having one loop, there are cases in which creating a single loop is actually the optimal solution. In fact, deciding between multiple or a single loop is not a static decision; instead, it depends on (per column) query selectivity. We perform our experiments on a modern column-store prototype that supports vectorization and we show that, depending on selectivity, a different code layout is optimal. When a select operator is implemented with a no-branch design, for low selectivity creating multiple loops performs better than a single loop. A single tight loop performs better otherwise.
【Keywords】: column-stores; just-in-time code generation
【Paper Link】 【Pages】:2075-2076
【Authors】: Adam Perelman ; Christopher Ré
【Abstract】: Modern data analytics workloads frequently involve complex join queries where the pairwise-join-based algorithms used by most RDBMS engines are suboptimal. In this study, we explore two algorithms that are asymptotically faster than pairwise algorithms for a large class of queries. The first is Yannakakis' classical algorithm for acyclic queries. The second is a more recent algorithm which works for any query and which is optimal with respect to the worst-case size of the output. We introduce a query compiler, DunceCap, which uses these two algorithms and variations on them to produce optimal query plans, and find that these plans can outperform standard RDBMS algorithms as well as simple worst-case optimal algorithms by an order of magnitude on a variety of queries.
【Keywords】: algorithms; graph processing; relational joins
【Paper Link】 【Pages】:2077-2078
【Authors】: Susan Tu ; Christopher Ré
【Abstract】: Joins are central to data processing. However, traditional query plans for joins, which are based on choosing the order of pairwise joins, are provably suboptimal. They often perform poorly on cyclic graph queries, which have become increasingly important to modern data analytics. Other join algorithms exist: Yannakakis', for example, operates on acyclic queries in runtime proportional to the input size plus the output size \cite{yannakakis}. More recently, Ngo et al. published a join algorithm that is optimal on worst-case inputs \cite{worst}. My contribution is to explore query planning using these join algorithms. In our approach, every query plan can be viewed as a generalized hypertree decomposition (GHD). We score each GHD using the minimal fractional hypertree width, which Ngo et al. show allows us to bound its worst-case runtime. We benchmark our plans using datasets from the Stanford Large Network Dataset Collection \cite{dataset} and find that our performance compares favorably against that of LogicBlox, a commercial system that implements a worst-case optimal join algorithm.
【Keywords】: graph processing; joins; query planning