ACM SIGMOD Conference 2013: New York, NY, USA

Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22-27, 2013. ACM.

Paper Num: 150 || Session Num: 40

Research session 1: data analytics (3 papers)

1. Cumulon: optimizing statistical data analysis in the cloud.

【Paper Link】 【Pages】:1-12

【Authors】: Botong Huang ; Shivnath Babu ; Jun Yang

【Abstract】: We present Cumulon, a system designed to help users rapidly develop and intelligently deploy matrix-based big-data analysis programs in the cloud. Cumulon features a flexible execution model and new operators especially suited for such workloads. We show how to implement Cumulon on top of Hadoop/HDFS while avoiding limitations of MapReduce, and demonstrate Cumulon's performance advantages over existing Hadoop-based systems for statistical data analysis. To support intelligent deployment in the cloud according to time/budget constraints, Cumulon goes beyond database-style optimization to make choices automatically on not only physical operators and their parameters, but also hardware provisioning and configuration settings. We apply a suite of benchmarking, simulation, modeling, and search techniques to support effective cost-based optimization over this rich space of deployment plans.

【Keywords】: cloud; data parallelism; linear algebra; statistical computing

2. Shark: SQL and rich analytics at scale.

【Paper Link】 【Pages】:13-24

【Authors】: Reynold S. Xin ; Josh Rosen ; Matei Zaharia ; Michael J. Franklin ; Scott Shenker ; Ion Stoica

【Abstract】: Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100X faster than Apache Hive, and machine learning programs more than 100X faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine, and the fine-grained fault tolerance properties that such an engine provides. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL. The result is a system that matches the speedups reported for MPP analytic databases over MapReduce, while offering fault tolerance properties and complex analytics capabilities that they lack.

【Keywords】: data warehouse; databases; hadoop; machine learning; shark; spark

3. Parallel analytics as a service.

【Paper Link】 【Pages】:25-36

【Authors】: Petrie Wong ; Zhian He ; Eric Lo

【Abstract】: Recently, massively parallel processing relational database systems (MPPDBs) have gained much momentum in the big data analytics market. With the advent of hosted cloud computing, we envision that the offering of MPPDB-as-a-Service (MPPDBaaS) will become attractive for companies that have analytical tasks on only hundreds of gigabytes to a few tens of terabytes of data, because they can enjoy high-end parallel analytics at low cost. This paper presents Thrifty, a prototype implementation of MPPDB-as-a-Service. The major research issue is how to achieve a lower total cost of ownership by consolidating thousands of MPPDB tenants onto a shared hardware infrastructure, with a performance SLA that guarantees the tenants can obtain their query results as if they were executing their queries on dedicated machines. Thrifty achieves this goal by using a tenant-driven design that includes (1) a cluster design that carefully arranges the nodes in the cluster into groups and creates an MPPDB for each group of nodes, (2) a tenant placement strategy that assigns each tenant to several MPPDBs (for high-availability service through replication), and (3) a query routing algorithm that routes a tenant's query to the proper MPPDB at run-time. Experiments show that in an MPPDBaaS with 5000 tenants, where each tenant requests a 2- to 32-node MPPDB to query against 200GB to 3.2TB of data, Thrifty can serve all the tenants with a 99.9% performance SLA guarantee and a high-availability replication factor of 3, using only 18.7% of the nodes requested by the tenants.

【Keywords】: cloud databases; consolidation; database-as-a-service; multi-tenant databases; parallel databases

Research session 2: XML (3 papers)

4. MESSIAH: missing element-conscious SLCA nodes search in XML data.

【Paper Link】 【Pages】:37-48

【Authors】: Ba Quan Truong ; Sourav S. Bhowmick ; Curtis E. Dyreson ; Aixin Sun

【Abstract】: Keyword search for smallest lowest common ancestors (SLCAs) in XML data has been widely accepted as a meaningful way to identify matching nodes whose subtrees contain an input set of keywords. Although SLCA and its variants (e.g., MLCA) perform admirably in identifying matching nodes, surprisingly, they perform poorly for searches on irregular schemas that have missing elements, that is, (sub)elements that are optional, or appear in some instances of an element type but not all (e.g., a "population" subelement in a "city" element might be optional, appearing when the population is known and absent when it is unknown). In this paper, we generalize the SLCA search paradigm to support queries involving missing elements. Specifically, we propose a novel property called optionality resilience that specifies the desired behaviors of an XML keyword search (XKS) approach for queries involving missing elements. We present two variants of a novel algorithm called MESSIAH (Missing Element-conSciouS hIgh-quality SLCA searcH), which are optionality resilient on irregular documents. MESSIAH logically transforms an XML document into a minimal full document in which all missing elements are represented as empty elements, i.e., the irregular schema is made "regular", and then employs efficient strategies to identify partial and complete full SLCA nodes (SLCA nodes in the full document) from it. It generates the same SLCA nodes as any state-of-the-art approach when the query does not involve missing elements, but avoids irrelevant results when missing elements are involved. Our experimental study demonstrates the ability of MESSIAH to produce superior-quality search results.

【Keywords】: full slca; missing elements; optionality resilience; xml keyword search
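
The classic SLCA semantics that MESSIAH generalizes can be illustrated with Dewey-encoded nodes, where each node's label is the tuple of child positions from the root and the LCA of two nodes is their longest common prefix. The sketch below is a brute-force illustration of plain SLCA search, not of MESSIAH itself; the Dewey labels in the example are invented.

```python
from itertools import product

def lca(a, b):
    """Longest common prefix of two Dewey labels (tuples of child positions)."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]

def slca(keyword_lists):
    """SLCAs: candidate LCAs (one match node per keyword) that have no
    proper descendant among the candidates.  Brute force, for illustration
    only; real SLCA algorithms are far more efficient."""
    candidates = set()
    for combo in product(*keyword_lists):
        node = combo[0]
        for other in combo[1:]:
            node = lca(node, other)
        candidates.add(node)
    return {c for c in candidates
            if not any(d != c and d[:len(c)] == c for d in candidates)}
```

For example, with one keyword matching nodes (0,1) and (0,2,0) and another matching (0,2,1), the only SLCA is (0,2): the candidate (0,) is discarded because its descendant (0,2) is also a candidate.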

5. Indexing for subtree similarity-search.

【Paper Link】 【Pages】:49-60

【Authors】: Sara Cohen

【Abstract】: Given a tree Q and a large set of trees T = {T1,...,Tn}, the subtree similarity-search problem is that of finding the subtrees of trees among T that are most similar to Q, using the tree edit distance metric. Determining similarity using tree edit distance has been proven useful in a variety of application areas. While subtree similarity-search has been studied in the past, solutions required traversal of all of T, which poses a severe bottleneck in processing time, as T grows larger. This paper proposes the first index structure for subtree similarity-search, provided that the unit cost function is used. Extensive experimentation and comparison to previous work shows the huge improvement gained when using the proposed index structure and processing algorithm.

【Keywords】: edit distance; indexing

6. Discovering XSD keys from XML data.

【Paper Link】 【Pages】:61-72

【Authors】: Marcelo Arenas ; Jonny Daenen ; Frank Neven ; Martín Ugarte ; Jan Van den Bussche ; Stijn Vansummeren

【Abstract】: A great deal of research into the learning of schemas from XML data has been conducted in recent years to enable the automatic discovery of XML Schemas from XML documents when no schema, or only a low-quality one is available. Unfortunately, and in strong contrast to, for instance, the relational model, the automatic discovery of even the simplest of XML constraints, namely XML keys, has been left largely unexplored in this context. A major obstacle here is the unavailability of a theory on reasoning about XML keys in the presence of XML schemas, which is needed to validate the quality of candidate keys. The present paper embarks on a fundamental study of such a theory and classifies the complexity of several crucial properties concerning XML keys in the presence of an XSD, like, for instance, testing for consistency, boundedness, satisfiability, universality, and equivalence. Of independent interest, novel results are obtained related to cardinality estimation of XPath result sets. A mining algorithm is then developed within the framework of levelwise search. The algorithm leverages known discovery algorithms for functional dependencies in the relational model, but incorporates the above mentioned properties to assess and refine the quality of derived keys. An experimental study on an extensive body of real world XML data evaluating the effectiveness of the proposed algorithm is provided.

【Keywords】: key; mining; xml

Research session 3: transactions (3 papers)

7. A scalable lock manager for multicores.

【Paper Link】 【Pages】:73-84

【Authors】: Hyungsoo Jung ; Hyuck Han ; Alan David Fekete ; Gernot Heiser ; Heon Young Yeom

【Abstract】: Modern implementations of DBMS software are intended to take advantage of high core counts that are becoming common in high-end servers. However, we have observed that several database platforms, including MySQL, Shore-MT, and a commercial system, exhibit throughput collapse as load increases, even for a workload with little or no logical contention for locks. Our analysis of MySQL identifies latch contention within the lock manager as the bottleneck responsible for this collapse. We design a lock manager with reduced latching, implement it in MySQL, and show that it avoids the collapse and generally improves performance. Our efficient implementation of a lock manager is enabled by a staged allocation and de-allocation of locks. Locks are pre-allocated in bulk, so that the lock manager only has to perform simple list-manipulation operations during the acquire and release phases of a transaction. De-allocation of the lock data-structures is also performed in bulk, which enables the use of fast implementations of lock acquisition and release, as well as concurrent deadlock checking.

【Keywords】: concurrent programming; multicores; scalable lock manager

8. Controlled lock violation.

【Paper Link】 【Pages】:85-96

【Authors】: Goetz Graefe ; Mark Lillibridge ; Harumi A. Kuno ; Joseph Tucek ; Alistair C. Veitch

【Abstract】: In databases with a large buffer pool, a transaction may run in less time than it takes to log the transaction's commit record on stable storage. Such cases motivate a technique called early lock release: immediately after appending its commit record to the log buffer in memory, a transaction may release its locks. Thus, it cuts overall lock duration to a fraction and reduces lock contention accordingly. Early lock release also has its problems. The initial mention of early lock release was incomplete, the first detailed description and implementation was incorrect with respect to read-only transactions, and the most recent design initially had errors and still does not cover unusual lock modes such as "increment" locks. Thus, we set out to achieve the same goals as early lock release but with a different, simpler, and more robust approach. The resulting technique, controlled lock violation, requires no new theory, applies to any lock mode, promises less implementation effort and slightly less run-time effort, and also optimizes distributed transactions, e.g., in systems that rely on multiple replicas for high availability and high reliability. In essence, controlled lock violation retains locks until the transaction is durable but permits other transactions to violate its locks while flushing its commit log record to stable storage.

【Keywords】: concurrency; dbms; locking; transaction processing
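
The essence of controlled lock violation described above can be sketched in a few lines: once a committing transaction's commit record is in the log buffer, it keeps its locks but marks them violable, and a transaction that violates such a lock records a commit dependency that delays its own commit until the violated transaction is durable. This is a minimal single-threaded sketch, not the paper's implementation; all class and function names are invented, and lock modes, wait queues, and actual concurrency are omitted.

```python
class Lock:
    def __init__(self):
        self.holder = None        # transaction currently holding the lock
        self.violable = False     # True once the holder has logged its commit

class Txn:
    def __init__(self, name):
        self.name = name
        self.locks = []
        self.commit_deps = set()  # txns whose durability we must wait for
        self.durable = False      # commit record flushed to stable storage

def log_commit(txn):
    """Append the commit record to the in-memory log buffer: from now on
    the txn's locks may be violated, but they are not yet released."""
    for lk in txn.locks:
        lk.violable = True

def acquire(txn, lk):
    if lk.holder is None:
        lk.holder = txn
        txn.locks.append(lk)
        return True
    if lk.violable:
        # Controlled violation: take the lock early, but record a commit
        # dependency on the transaction whose lock is being violated.
        txn.commit_deps.add(lk.holder)
        lk.holder = txn
        txn.locks.append(lk)
        return True
    return False                  # genuine conflict: would have to wait

def can_commit(txn):
    """A violating txn may commit only after its dependencies are durable."""
    return all(dep.durable for dep in txn.commit_deps)
```

In this sketch, a second transaction is blocked before the first logs its commit, succeeds by violation afterwards, and then cannot commit until the first transaction becomes durable.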

9. X-FTL: transactional FTL for SQLite databases.

【Paper Link】 【Pages】:97-108

【Authors】: Woon-Hak Kang ; Sang-Won Lee ; Bongki Moon ; Gi-Hwan Oh ; Changwoo Min

【Abstract】: In the era of smartphones and mobile computing, many popular applications such as Facebook, Twitter, Gmail, and even the Angry Birds game manage their data using SQLite, mainly because of its development productivity and solid transactional support. For transactional atomicity, however, SQLite relies on less sophisticated but costlier page-oriented journaling mechanisms, which are often cited as the main cause of tardy responses in mobile applications. Flash memory does not allow data to be updated in place, so the copy-on-write strategy is adopted by most flash storage devices. In this paper, we propose X-FTL, a transactional flash translation layer (FTL) for SQLite databases. By offloading the burden of guaranteeing transactional atomicity from the host system to flash storage, and by taking advantage of the copy-on-write strategy used in modern FTLs, X-FTL drastically improves transactional throughput almost for free, without resorting to costly journaling schemes. We have implemented X-FTL on an SSD development board called OpenSSD, and minimally modified SQLite and the ext4 file system to make them compatible with the extended abstractions provided by X-FTL. We demonstrate the effectiveness of X-FTL using real and synthetic SQLite workloads for smartphone applications, the TPC-C benchmark for OLTP databases, and the FIO benchmark for file systems.

【Keywords】: copy-on-write; flash storage devices; flash translation layer; sqlite; transactional atomicity

Research session 4: data storage (3 papers)

10. Optimal splitters for temporal and multi-version databases.

【Paper Link】 【Pages】:109-120

【Authors】: Wangchao Le ; Feifei Li ; Yufei Tao ; Robert Christensen

【Abstract】: Temporal and multi-version databases are ideal candidates for a distributed store, which offers large storage space, and parallel and distributed processing power from a cluster of (commodity) machines. A key challenge is to achieve good load balancing for the storage and processing of these data, which is done by partitioning the database. We introduce the concept of optimal splitters for temporal and multi-version databases, which induce a partition of the input data set and guarantee that the size of the maximum bucket is minimized among all possible configurations, given a budget for the desired number of buckets. We design efficient methods for memory- and disk-resident data, respectively, and show that they significantly outperform competing baseline methods both theoretically and empirically on large real data sets.

【Keywords】: multi-version databases; optimal splitters; temporal data
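
The optimal-splitter objective (minimize the maximum bucket size for a given bucket budget, where a temporal record is counted in every bucket whose time range it intersects) can be sketched as a binary search on the answer with a greedy feasibility sweep. This is an illustrative brute-force sketch under the assumption that splitters are drawn from interval endpoints; the paper's actual methods are far more efficient.

```python
def stab(intervals, s, t):
    """Number of intervals intersecting the bucket range [s, t)."""
    return sum(1 for a, b in intervals if a < t and b >= s)

def feasible(intervals, boundaries, budget, cap):
    """Greedy sweep: extend each bucket as far right as possible while it
    holds at most `cap` intervals; succeed if <= budget buckets are used."""
    s, used = boundaries[0], 0
    while s < boundaries[-1]:
        nxt = None
        for t in boundaries:            # the stab count is monotone in t,
            if t > s and stab(intervals, s, t) <= cap:
                nxt = t                 # so keep the rightmost feasible t
        if nxt is None:
            return False
        s, used = nxt, used + 1
    return used <= budget

def optimal_max_bucket(intervals, budget):
    """Binary-search the smallest achievable maximum bucket size."""
    boundaries = sorted({p for iv in intervals for p in iv})
    lo, hi = 1, len(intervals)
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(intervals, boundaries, budget, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For instance, intervals (0,2), (1,3), (4,5) with a budget of two buckets admit a split at time 4 where each bucket holds two intervals, and no split does better because (0,2) and (1,3) overlap.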

11. Building an efficient RDF store over a relational database.

【Paper Link】 【Pages】:121-132

【Authors】: Mihaela A. Bornea ; Julian Dolby ; Anastasios Kementsietsidis ; Kavitha Srinivas ; Patrick Dantressangle ; Octavian Udrea ; Bishwaranjan Bhattacharjee

【Abstract】: Efficient storage and querying of RDF data is of increasing importance, due to the increased popularity and widespread acceptance of RDF on the web and in the enterprise. In this paper, we describe a novel storage and query mechanism for RDF which works on top of existing relational representations. Reliance on relational representations of RDF means that one can take advantage of 35+ years of research on efficient storage and querying, industrial-strength transaction support, locking, security, etc. However, there are significant challenges in storing RDF in relational systems, which include data sparsity and schema variability. We describe novel mechanisms to shred RDF into relational form, and novel query translation techniques to maximize the advantages of this shredded representation. We show that these mechanisms result in consistently good performance across multiple RDF benchmarks, even when compared with current state-of-the-art stores. This work provides the basis for RDF support in DB2 v.10.1.

【Keywords】: efficient storage; query optimization; rdf; sparql

12. Automatic synthesis of out-of-core algorithms.

【Paper Link】 【Pages】:133-144

【Authors】: Yannis Klonatos ; Andres Nötzli ; Andrej Spielmann ; Christoph Koch ; Viktor Kuncak

【Abstract】: We present a system for the automatic synthesis of efficient algorithms specialized for a particular memory hierarchy and a set of storage devices. The developer provides two independent inputs: 1) an algorithm that ignores memory hierarchy and external storage aspects; and 2) a description of the target memory hierarchy, including its topology and parameters. Our system is able to automatically synthesize memory-hierarchy and storage-device-aware algorithms out of those specifications, for tasks such as joins and sorting. The framework is extensible and allows developers to quickly synthesize custom out-of-core algorithms as new storage technologies become available.

【Keywords】: memory hierarchies; out-of-core algorithms; synthesis

Research session 5: schema matching and spatial databases I (3 papers)

13. InfoGather+: semantic matching and annotation of numeric and time-varying attributes in web tables.

【Paper Link】 【Pages】:145-156

【Authors】: Meihui Zhang ; Kaushik Chakrabarti

【Abstract】: Users often need to gather information about "entities" of interest. Recent efforts try to automate this task by leveraging the vast corpus of HTML tables; this is referred to as "entity augmentation". The accuracy of entity augmentation critically depends on semantic relationships between web tables as well as semantic labels of those tables. Current techniques work well for string-valued and static attributes but perform poorly for numeric and time-varying attributes. In this paper, we first build a semantic graph that (i) labels columns with unit, scale and timestamp information and (ii) computes semantic matches between columns even when the same numeric attribute is expressed in different units or scales. Second, we develop a novel entity augmentation API suited for numeric and time-varying attributes that leverages the semantic graph. Building the graph is challenging as such label information is often missing from the column headers. Our key insight is to leverage the wealth of tables on the web and infer label information from semantically matching columns of other web tables; this complements "local" extraction from column headers. However, this creates an interdependence between labels and semantic matches; we address this challenge by representing the task as a probabilistic graphical model that jointly discovers labels and semantic matches over all columns. Our experiments on real-life datasets show that (i) our semantic graph contains higher quality labels and semantic matches and (ii) entity augmentation based on the above graph has significantly higher precision and recall compared with the state-of-the-art.

【Keywords】: semantic matching; web table

14. Value invention in data exchange.

【Paper Link】 【Pages】:157-168

【Authors】: Patricia C. Arocena ; Boris Glavic ; Renée J. Miller

【Abstract】: The creation of values to represent incomplete information, often referred to as value invention, is central in data exchange. Within schema mappings, Skolem functions have long been used for value invention, as they permit a precise representation of missing information. Recent work on a powerful mapping language called second-order tuple generating dependencies (SO tgds) has drawn attention to the fact that the use of arbitrary Skolem functions can have negative computational and programmatic properties in data exchange. In this paper, we present two techniques for understanding when the Skolem functions needed to represent the correct semantics of incomplete information are computationally well-behaved. Specifically, we consider when the Skolem functions in second-order (SO) mappings have a first-order (FO) semantics and are therefore programmatically and computationally more desirable for use in practice. Our first technique, linearization, significantly extends the Nash, Bernstein and Melnik unskolemization algorithm by understanding when the sets of arguments of the Skolem functions in a mapping are related by set inclusion. We show that such a linear relationship leads to mappings that have FO semantics and are expressible in popular mapping languages, including source-to-target tgds and nested tgds. Our second technique uses source semantics, specifically functional dependencies (including keys), to transform SO mappings into equivalent FO mappings. We show that our algorithms are applicable to a strictly larger class of mappings than previous approaches, but more importantly, we present an extensive experimental evaluation that quantifies this difference (about 78% improvement) over an extensive schema mapping benchmark, and illustrates the applicability of our results on real mappings.

【Keywords】: data exchange; linearization; schema mappings; skolemization; unskolemization; value invention

15. Indexing methods for moving object databases: games and other applications.

【Paper Link】 【Pages】:169-180

【Authors】: Hanan Samet ; Jagan Sankaranarayanan ; Michael Auerbach

【Abstract】: Moving object databases arise in numerous applications such as traffic monitoring, crowd tracking, and games. They all require keeping track of objects that move, and thus the database of objects must be constantly updated. The cover fieldtree (more commonly known as the loose quadtree or the loose octree, depending on the dimension of the underlying space) is designed to overcome a drawback of spatial data structures that associate objects with their minimum enclosing quadtree (octree) cells: the size of these cells depends more on the position of the objects than on their size. In fact, the size of these cells may be as large as the entire space from which the objects are drawn. The loose quadtree (octree) overcomes this drawback by expanding the size of the space that is spanned by each quadtree (octree) cell c of width w by a cell expansion factor p (p>0), so that the expanded cell is of width (1+p)*w, and an object is associated with its minimum enclosing expanded quadtree (octree) cell. It is shown that for an object o with minimum bounding hypercube b of radius r (i.e., half the length of a side of the hypercube), the maximum possible width w of the minimum enclosing expanded quadtree cell c is just a function of r and p, and is independent of the position of o. Normalizing w via division by 2r enables calculating the range of possible expanded quadtree cell sizes as a function of p. For p >= 0.5 the range consists of just two values, and usually just one value for p >= 1. This makes updating very simple and fast: for p >= 0.5, there are at most two possible new cells associated with the moved object, and thus the update can be done in O(1) time. Experiments with random data showed that the update time to support motion in such an environment is minimized when p is infinitesimally less than 1, with as much as a one order of magnitude increase in the number of updates that can be handled vis-a-vis the p=0 case in a given unit of time. Similar results for updates were obtained for an N-body simulation, where improved query performance and scalability were also observed. Finally, to complement the paper, a video titled "Crates and Barrels" was produced, which is an N-body simulation of 14,000 objects. The video is available at http://www.youtube.com/watch?v=Sokq3FRGc0s. An applet illustrating the behavior of the loose quadtree was developed and is available at http://donar.umiacs.umd.edu/quadtree/rectangles/loosequad.html.

【Keywords】: cover fieldtree; game databases; game programming; loose octree; loose quadtree; moving objects; spatial data structures; spatial databases; spatial indexing
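
The key geometric fact in the abstract, that the minimum enclosing expanded cell width depends only on r and p, can be sketched as follows. This sketch assumes power-of-two cell widths, an object assigned to the cell containing its center, a radius r > 0, and a fit condition of p*w/2 >= r (the expansion adds p*w/2 of slack on each side of the cell); the exact cell-size range derived in the paper refines this.

```python
def expanded_cell_width(r, p):
    """Smallest power-of-two cell width w whose expanded cell (of width
    (1+p)*w, i.e. p*w/2 of extra slack on each side) encloses an object of
    bounding-hypercube radius r no matter where its center lies inside the
    cell.  The fit condition p*w/2 >= r is an assumption of this sketch."""
    w = 1.0
    while p * w / 2 < r:          # grow until the object fits
        w *= 2
    while p * (w / 2) / 2 >= r:   # shrink while the next size down still fits
        w /= 2
    return w
```

For p = 1 and r = 1 this yields w = 2, so the normalized width w/(2r) is 1, consistent with the abstract's observation that for p >= 1 there is usually a single possible cell size.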

Research session 6: graph connectivity (3 papers)

16. I/O efficient: computing SCCs in massive graphs.

【Paper Link】 【Pages】:181-192

【Authors】: Zhiwei Zhang ; Jeffrey Xu Yu ; Lu Qin ; Lijun Chang ; Xuemin Lin

【Abstract】: A strongly connected component (SCC) is a maximal subgraph of a directed graph G in which every pair of nodes is reachable from each other. With this property, a general directed graph can be represented by a directed acyclic graph (DAG) by contracting each SCC of G to a single node in the DAG. In many real applications that need graph pattern matching, topological sorting, or reachability query processing, the best way to deal with a general directed graph is to deal with its DAG representation. Therefore, finding all SCCs in a directed graph G is a critical operation. The existing in-memory algorithms based on depth-first search (DFS) can find all SCCs in linear time w.r.t. the size of a graph. However, when a graph cannot reside entirely in main memory, the existing external or semi-external algorithms to find all SCCs are limited in achieving high I/O efficiency. In this paper, we study new I/O-efficient semi-external algorithms to find all SCCs for a massive directed graph G that cannot reside in main memory entirely. To overcome the deficiency of the existing DFS-based semi-external algorithm that heavily relies on a total order, we explore a weak order, based on which we investigate new algorithms. We propose a new two-phase algorithm, comprising tree construction and tree search. In the tree construction phase, a spanning tree of G can be constructed in bounded sequential scans of G. In the tree search phase, the graph needs to be sequentially scanned only once to find all SCCs. In addition, we propose a new single-phase algorithm, which combines the tree construction and tree search phases into a single phase, with three new optimization techniques: early acceptance, early rejection, and batch processing. With the single-phase algorithm and the new optimization techniques, we can significantly reduce the number of I/Os and the CPU cost. We conduct extensive experimental studies using 4 real datasets, including a massive real dataset, and several synthetic datasets to confirm the I/O efficiency of our approaches.

【Keywords】: graph algorithm; i/o efficient; scc computing
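
For reference, the linear-time in-memory, DFS-based baseline that the paper's semi-external algorithms are designed to replace can be sketched as Kosaraju's two-pass algorithm (the graph is assumed to be an adjacency dict listing every node as a key):

```python
def sccs(graph):
    """Kosaraju's two-pass SCC algorithm: finish-time order on the original
    graph, then DFS on the reversed graph in reverse finish order."""
    # Pass 1: iterative DFS recording nodes in order of completion.
    visited, order = set(), []
    for root in graph:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(graph[root]))]
        while stack:
            node, it = stack[-1]
            advanced = False
            for nxt in it:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    advanced = True
                    break
            if not advanced:
                order.append(node)
                stack.pop()
    # Pass 2: collect components on the reversed graph.
    rev = {u: [] for u in graph}
    for u, vs in graph.items():
        for v in vs:
            rev.setdefault(v, []).append(u)
    seen, comps = set(), []
    for root in reversed(order):
        if root in seen:
            continue
        seen.add(root)
        comp, stack = [], [root]
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in rev.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps
```

The DFS here is exactly what breaks down when the graph exceeds main memory: its access pattern is random, which motivates the paper's scan-based tree construction and tree search phases.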

17. TF-Label: a topological-folding labeling scheme for reachability querying in a large graph.

【Paper Link】 【Pages】:193-204

【Authors】: James Cheng ; Silu Huang ; Huanhuan Wu ; Ada Wai-Chee Fu

【Abstract】: Reachability querying is a basic graph operation with numerous important applications in databases, network analysis, computational biology, software engineering, etc. Although many indexes have been proposed to answer reachability queries, most of them are only efficient for handling relatively small graphs. We propose TF-label, an efficient and scalable labeling scheme for processing reachability queries. TF-label is constructed based on a novel topological folding (TF) that recursively folds an input graph into half so as to reduce the label size, thus improving query efficiency. We show that TF-label is efficient to construct and propose efficient algorithms and optimization schemes. Our experiments verify that TF-label is significantly more scalable and efficient than the state-of-the-art methods in both index construction and query processing.

【Keywords】: graph indexing; graph querying; graph reachability

18. Efficiently computing k-edge connected components via graph decomposition.

【Paper Link】 【Pages】:205-216

【Authors】: Lijun Chang ; Jeffrey Xu Yu ; Lu Qin ; Xuemin Lin ; Chengfei Liu ; Weifa Liang

【Abstract】: Efficiently computing k-edge connected components in a large graph, G = (V, E), where V is the vertex set and E is the edge set, is a long-standing research problem. It is not only fundamental in graph analysis but also crucial in graph search optimization algorithms. Considering that existing techniques for computing k-edge connected components are quite time-consuming and unlikely to be scalable to large graphs, in this paper we first propose a novel graph decomposition paradigm that iteratively decomposes a graph G to compute its k-edge connected components, such that the number of drilling-down iterations h is bounded by the "depth" of the k-edge connected components nested together to form G, where h usually is a small integer in practice. Second, we devise a novel, efficient threshold-based graph decomposition algorithm, with time complexity O(l × |E|), to decompose a graph G at each iteration, where l usually is a small integer with l ≪ |V|. As a result, our algorithm for computing k-edge connected components significantly improves the time complexity of an existing state-of-the-art technique from O(|V|^2|E| + |V|^3 log |V|) to O(h × l × |E|). Finally, we conduct extensive performance studies on large real and synthetic graphs, which demonstrate that our techniques outperform the state-of-the-art solution by several orders of magnitude.

【Keywords】: graph decomposition; k-edge connected components; minimum cut

Research session 7: crowdsourcing (3 papers)

19. An online cost sensitive decision-making method in crowdsourcing systems.

【Paper Link】 【Pages】:217-228

【Authors】: Jinyang Gao ; Xuan Liu ; Beng Chin Ooi ; Haixun Wang ; Gang Chen

【Abstract】: Crowdsourcing has created a variety of opportunities for many challenging problems by leveraging human intelligence. For example, applications such as image tagging, natural language processing, and semantic-based information retrieval can exploit crowd-based human computation to supplement existing computational algorithms. Naturally, human workers in crowdsourcing solve problems based on their knowledge, experience, and perception. It is therefore not clear which problems can be better solved by crowdsourcing than by traditional machine-based methods alone, so a cost-sensitive quantitative analysis method is needed. In this paper, we design and implement a cost-sensitive method for crowdsourcing. We estimate the profit of a crowdsourcing job online, so that questions with no future profit from crowdsourcing can be terminated. Two models are proposed to estimate the profit of a crowdsourcing job, namely the linear value model and the generalized non-linear model. Using these models, the expected profit of obtaining new answers for a specific question is computed based on the answers already received. A question is terminated in real time if the marginal expected profit of obtaining more answers is not positive. We extend the method to publish a batch of questions in a HIT. We evaluate the effectiveness of our proposed method using two real-world jobs on AMT. The experimental results show that our proposed method outperforms all the state-of-the-art methods.

【Keywords】: crowdsourcing; decision-making
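
The stopping rule in the abstract, terminating a question when the marginal expected profit of one more answer is not positive, can be sketched under a simple majority-voting model. All modeling choices here (a fixed worker accuracy, a uniform prior, and a value linear in the confidence of the majority label) are illustrative assumptions, not the paper's estimators.

```python
def majority_confidence(yes, no, accuracy=0.7):
    """Posterior probability that the majority label is correct, assuming
    independent workers with fixed accuracy and a uniform prior."""
    k, m = max(yes, no), min(yes, no)
    p, q = accuracy, 1 - accuracy
    good = p ** k * q ** m        # likelihood if the majority label is true
    bad = q ** k * p ** m         # likelihood if it is false
    return good / (good + bad)

def should_continue(yes, no, value, cost, accuracy=0.7):
    """Ask for one more answer only while its expected marginal profit,
    value * (expected confidence gain) - cost, is positive."""
    cur = majority_confidence(yes, no, accuracy)
    p_true_yes = cur if yes >= no else 1 - cur
    p_yes = p_true_yes * accuracy + (1 - p_true_yes) * (1 - accuracy)
    nxt = (p_yes * majority_confidence(yes + 1, no, accuracy) +
           (1 - p_yes) * majority_confidence(yes, no + 1, accuracy))
    return value * (nxt - cur) - cost > 0
```

A tied question (one "yes", one "no") has positive expected gain and keeps collecting answers, whereas a unanimous 3-0 question has essentially zero expected confidence gain and is terminated.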

20. Leveraging transitive relations for crowdsourced joins.

【Paper Link】 【Pages】:229-240

【Authors】: Jiannan Wang ; Guoliang Li ; Tim Kraska ; Michael J. Franklin ; Jianhua Feng

【Abstract】: The development of crowdsourced query processing systems has recently attracted significant attention in the database community, and a variety of crowdsourced queries have been investigated. In this paper, we focus on the crowdsourced join query, which aims to utilize humans to find all pairs of matching objects from two collections. As a human-only solution is expensive, we adopt a hybrid human-machine approach which first uses machines to generate a candidate set of matching pairs, and then asks humans to label the pairs in the candidate set as either matching or non-matching. Given the candidate pairs, existing approaches will publish all pairs for verification to a crowdsourcing platform. However, they neglect the fact that the pairs satisfy transitive relations. For example, if o1 matches o2, and o2 matches o3, then we can deduce that o1 matches o3 without needing to crowdsource (o1, o3). To this end, we study how to leverage transitive relations for crowdsourced joins. We propose a hybrid transitive-relations and crowdsourcing labeling framework which aims to crowdsource the minimum number of pairs needed to label all the candidate pairs. We prove the optimal labeling order and devise a parallel labeling algorithm to efficiently crowdsource the pairs following that order. We evaluate our approaches in both a simulated environment and a real crowdsourcing platform. Experimental results show that our approaches with transitive relations can save much more money and time than existing methods, with only a small loss in result quality.

【Keywords】: crowdsourcing; entity resolution; join operator; transitive relations
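The positive-transitivity deduction described above can be sketched with a union-find structure. This is a minimal illustration, not the paper's framework: `ask_crowd` stands in for publishing a pair to the crowdsourcing platform, and only positive transitivity is exploited, whereas the paper also proves an optimal labeling order and batches questions.

```python
class UnionFind:
    """Union-find over object ids, used to apply positive transitivity."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def label_with_transitivity(candidate_pairs, ask_crowd):
    """Label candidate pairs, deducing a match whenever both objects already
    sit in the same cluster; only undecidable pairs go to the crowd."""
    uf = UnionFind()
    labels, asked = {}, 0
    for a, b in candidate_pairs:
        if uf.find(a) == uf.find(b):
            labels[(a, b)] = True       # deduced transitively, no question issued
        else:
            asked += 1
            match = ask_crowd(a, b)     # stand-in for publishing the pair as a HIT
            labels[(a, b)] = match
            if match:
                uf.union(a, b)
    return labels, asked
```

On the paper's example, the pair (o1, o3) is labeled without a crowd question once (o1, o2) and (o2, o3) come back as matches.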

21. Crowd mining.

Paper Link】 【Pages】:241-252

【Authors】: Yael Amsterdamer ; Yael Grossman ; Tova Milo ; Pierre Senellart

【Abstract】: Harnessing a crowd of Web users for data collection has recently become a widespread phenomenon. A key challenge is that human knowledge forms an open world, and it is thus difficult to know what kind of information we should be looking for. Classic databases have addressed this problem with data mining techniques that identify interesting data patterns. These techniques, however, are not suitable for the crowd. This is mainly due to properties of human memory, such as the tendency to remember simple trends and summaries rather than exact details. Following these observations, we develop here for the first time the foundations of crowd mining. We first define the formal settings. Based on these, we design a framework of generic components, used for choosing the best questions to ask the crowd and mining significant patterns from the answers. We suggest general implementations for these components, and test the resulting algorithm's performance on benchmarks that we designed for this purpose. Our algorithm consistently outperforms alternative baseline algorithms.

【Keywords】: association rule learning; crowd mining; crowdsourcing

Research session 8: social media 3

22. Efficient sentiment correlation for large-scale demographics.

Paper Link】 【Pages】:253-264

【Authors】: Mikalai Tsytsarau ; Sihem Amer-Yahia ; Themis Palpanas

【Abstract】: Analyzing sentiments of demographic groups is becoming important for the Social Web, where millions of users provide opinions on a wide variety of content. While several approaches exist for mining sentiments from product reviews or micro-blogs, little attention has been devoted to aggregating and comparing extracted sentiments for different demographic groups over time, such as 'Students in Italy' or 'Teenagers in Europe'. This problem demands efficient and scalable methods for sentiment aggregation and correlation, which account for the evolution of sentiment values, sentiment bias, and other factors associated with the special characteristics of web data. We propose a scalable approach for sentiment indexing and aggregation that works at multiple time granularities and uses incrementally updateable data structures for online operation. Furthermore, we describe efficient methods for computing meaningful sentiment correlations, which exploit pruning based on demographics and use top-k correlation compression techniques. We present an extensive experimental evaluation with both synthetic and real datasets, demonstrating the effectiveness of our pruning techniques and the efficiency of our solution.

【Keywords】: indexing; sentiment aggregation; sentiment correlation
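The incrementally updateable aggregation idea can be sketched as a per-(group, hour) sum/count index that absorbs each new opinion in O(1) and rolls coarser granularities up on demand. The class and cell layout below are our own simplification, not the paper's data structures.

```python
from collections import defaultdict

class SentimentIndex:
    """Finest-granularity cells keyed by (demographic group, hour);
    coarser windows (day, week, ...) are answered by rolling cells up."""
    def __init__(self):
        self.agg = defaultdict(lambda: [0.0, 0])   # (group, hour) -> [sum, count]

    def add(self, group, hour, sentiment):
        cell = self.agg[(group, hour)]             # O(1) online insert
        cell[0] += sentiment
        cell[1] += 1

    def mean(self, group, hours):
        """Aggregate sentiment of a group over any set of hours."""
        s = c = 0
        for h in hours:
            cell = self.agg.get((group, h))
            if cell:
                s += cell[0]
                c += cell[1]
        return s / c if c else None
```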

23. EBM: an entropy-based model to infer social strength from spatiotemporal data.

Paper Link】 【Pages】:265-276

【Authors】: Huy Pham ; Cyrus Shahabi ; Yan Liu

【Abstract】: The ubiquity of mobile devices and the popularity of location-based-services have generated, for the first time, rich datasets of people's location information at a very high fidelity. These location datasets can be used to study people's behavior - for example, social studies have shown that people who are seen together frequently at the same place and at the same time are most probably socially related. In this paper, we are interested in inferring these social connections by analyzing people's location information, which is useful in a variety of application domains from sales and marketing to intelligence analysis. In particular, we propose an entropy-based model (EBM) that not only infers social connections but also estimates the strength of social connections by analyzing people's co-occurrences in space and time. We examine two independent ways in which co-occurrences contribute to social strength: diversity and weighted frequency. In addition, we take the characteristics of each location into consideration in order to compensate for cases where only limited location information is available. We conducted extensive sets of experiments with real-world datasets including both people's location data and their social connections, where we used the latter as the ground-truth to verify the results of applying our approach to the former. We show that our approach outperforms the competitors.

【Keywords】: data mining; geospatial; social computing; social network; social strength; spatial; spatiotemporal
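The diversity component can be illustrated with a Shannon-entropy computation over a pair's co-occurrence locations: co-occurring in many distinct places signals a stronger tie than meeting repeatedly in one place. This sketch covers only that component; the paper combines it with weighted frequency and per-location characteristics.

```python
import math
from collections import Counter

def cooccurrence_diversity(cooccurrence_places):
    """Entropy-based diversity of a pair's co-occurrences.

    cooccurrence_places: list of location ids where the pair was seen together.
    Returns exp(H), which equals the number of distinct places when
    co-occurrences are spread evenly, and 1 when they all happen in one place.
    """
    counts = Counter(cooccurrence_places)
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(h)
```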

24. Online search of overlapping communities.

Paper Link】 【Pages】:277-288

【Authors】: Wanyun Cui ; Yanghua Xiao ; Haixun Wang ; Yiqi Lu ; Wei Wang

【Abstract】: A great deal of research has been conducted on modeling and discovering communities in complex networks. In most real life networks, an object often participates in multiple overlapping communities. In view of this, recent research has focused on mining overlapping communities in complex networks. The algorithms essentially materialize a snapshot of the overlapping communities in the network. This approach has three drawbacks, however. First, the mining algorithm uses the same global criterion to decide whether a subgraph qualifies as a community. In other words, the criterion is fixed and predetermined. But in reality, communities for different vertices may have very different characteristics. Second, it is costly, time consuming, and often unnecessary to find communities for an entire network. Third, the approach does not support dynamically evolving networks. In this paper, we focus on online search of overlapping communities, that is, given a query vertex, we find meaningful overlapping communities the vertex belongs to in an online manner. In doing so, each search can use community criterion tailored for the vertex in the search. To support this approach, we introduce a novel model for overlapping communities, and we provide theoretical guidelines for tuning the model. We present several algorithms for online overlapping community search and we conduct comprehensive experiments to demonstrate the effectiveness of the model and the algorithms. We also suggest many potential applications of our model and algorithms.

【Keywords】: community search; graph mining; social networks
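The query-centric flavor of online community search can be sketched with a k-core-style criterion: grow from the query vertex, then peel vertices whose degree inside the community drops below a per-query threshold. The criterion and peeling loop below are a simplification of our own; the paper's overlapping-community model and tuning guidelines are richer.

```python
from collections import deque

def community_search(adj, query, min_deg):
    """Return a connected community around `query` in which every vertex
    keeps at least min_deg neighbors inside the community, or an empty set
    if no such community contains the query vertex."""
    seen, q = {query}, deque([query])
    while q:                              # vertices reachable from the query
        u = q.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                q.append(v)
    changed = True
    while changed:                        # peel until the criterion holds everywhere
        changed = False
        for v in list(seen):
            if v != query and sum(w in seen for w in adj[v]) < min_deg:
                seen.discard(v)
                changed = True
    if sum(w in seen for w in adj[query]) < min_deg:
        return set()                      # the query vertex itself fails the criterion
    return seen
```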

Research session 9: systems, performance I 3

25. BitWeaving: fast scans for main memory data processing.

Paper Link】 【Pages】:289-300

【Authors】: Yinan Li ; Jignesh M. Patel

【Abstract】: This paper focuses on running scans in a main memory data processing system at "bare metal" speed. Essentially, this means that the system must aim to process data at or near the speed of the processor (the fastest component in most system configurations). Scans are common in main memory data processing environments, and with the state-of-the-art techniques it still takes many cycles per input tuple to apply simple predicates on a single column of a table. In this paper, we propose a technique called BitWeaving that exploits the parallelism available at the bit level in modern processors. BitWeaving operates on multiple bits of data in a single cycle, processing bits from different columns in each cycle. Thus, bits from a batch of tuples are processed in each cycle, allowing BitWeaving to drop the cycles per column to below one in some cases. BitWeaving comes in two flavors: BitWeaving/V, which looks like a columnar organization but at the bit level, and BitWeaving/H, which packs bits horizontally. In this paper we also develop the arithmetic framework that is needed to evaluate predicates using these BitWeaving organizations. Our experimental results show that both these methods produce significant performance benefits over the existing state-of-the-art methods, and in some cases improve performance by over an order of magnitude.

【Keywords】: analytics; bit-parallel; indexing; intra-cycle parallelism; storage organization
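The word-parallel predicate idea can be shown with the classic packed-compare trick: several small column codes share one machine word, each followed by a delimiter bit, and one subtraction evaluates "less than" for all of them at once because a borrow clears a slot's delimiter exactly when its code is smaller. This is a simplified illustration of the horizontal (BitWeaving/H-style) layout, not the paper's actual storage formats.

```python
W = 3          # payload bits per column code
SLOT = W + 1   # plus one delimiter bit per code

def pack(values):
    """Pack small column codes side by side into one integer 'word'."""
    word = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << W)
        word |= v << (i * SLOT)
    return word

def lt_mask(x_word, y_word, nslots):
    """Evaluate x_i < y_i for every packed slot with whole-word arithmetic:
    delimiter bits isolate the slots, so no per-value loop is needed."""
    H = 0
    for i in range(nslots):
        H |= 1 << (i * SLOT + W)      # delimiter bit of each slot
    r = (x_word | H) - y_word         # borrow clears the delimiter iff x_i < y_i
    return ~r & H                     # delimiter set where the predicate holds
```

For example, packing four 3-bit codes per word and testing them against a constant vector touches each word once instead of four times.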

26. Performance and resource modeling in highly-concurrent OLTP workloads.

Paper Link】 【Pages】:301-312

【Authors】: Barzan Mozafari ; Carlo Curino ; Alekh Jindal ; Samuel Madden

【Abstract】: Database administrators of Online Transaction Processing (OLTP) systems constantly face difficult questions. For example, "What is the maximum throughput I can sustain with my current hardware?", "How much disk I/O will my system perform if the requests per second double?", or "What will happen if the ratio of transactions in my system changes?". Resource prediction and performance analysis are both vital and difficult in this setting. Here the challenge is due to high degrees of concurrency, competition for resources, and complex interactions between transactions, all of which non-linearly impact performance. Although difficult, such analysis is a key component in enabling database administrators to understand which queries are eating up the resources, and how their system would scale under load. In this paper, we introduce our framework, called DBSeer, that addresses this problem by employing statistical models that provide resource and performance analysis and prediction for highly concurrent OLTP workloads. Our models are built on a small amount of training data from standard log information collected during normal system operation. These models are capable of accurately measuring several performance metrics, including resource consumption on a per-transaction-type basis, resource bottlenecks, and throughput at different load levels. We have validated these models on MySQL/Linux with numerous experiments on standard benchmarks (TPC-C) and real workloads (Wikipedia), observing high accuracy (within a few percent error) when predicting all of the above metrics.

【Keywords】: database-as-a-service; multi-tenancy; oltp; performance prediction; transactions
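The linear part of such resource models can be sketched as a least-squares fit of per-transaction-type costs from logged intervals: each interval's resource usage is approximated as the transaction-mix counts times unknown per-type costs. This is our own simplified stand-in; DBSeer's models also capture the non-linear regimes where contention dominates.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * x for a, x in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_per_type_costs(mix, usage):
    """Least-squares fit via the normal equations:
    usage[t] ~ sum_j mix[t][j] * cost_j, where mix[t][j] is how many
    transactions of type j ran in interval t."""
    k = len(mix[0])
    A = [[sum(row[i] * row[j] for row in mix) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * u for row, u in zip(mix, usage)) for i in range(k)]
    return solve(A, b)
```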

27. ODYS: an approach to building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS for higher-level functionality.

Paper Link】 【Pages】:313-324

【Authors】: Kyu-Young Whang ; Tae-Seob Yun ; Yeon-Mi Yeo ; Il-Yeol Song ; Hyuk-Yoon Kwon ; In-Joong Kim

【Abstract】: Recently, parallel search engines have been implemented based on scalable distributed file systems such as Google File System. However, we claim that building a massively-parallel search engine using a parallel DBMS can be an attractive alternative since it supports a higher-level (i.e., SQL-level) interface than that of a distributed file system for easy and less error-prone application development while providing scalability. Regarding higher-level functionality, we can draw a parallel with the traditional O/S file system vs. DBMS. In this paper, we propose a new approach to building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS. To estimate the performance, we propose a hybrid (i.e., analytic and experimental) performance model for the parallel search engine. We argue that the model can accurately estimate the performance of a massively-parallel (e.g., 300-node) search engine using the experimental results obtained from a small-scale (e.g., 5-node) one. We show that the estimation error between the model and the actual experiment is less than 2.13% by observing that the bulk of the query processing time is spent at the slave (vs. at the master and network) and by estimating the time spent at the slave based on actual measurement. Using our model, we demonstrate a commercial-level scalability and performance of our architecture. Our proposed system ODYS is capable of handling 1 billion queries per day (11,574 queries/sec) for 30 billion Web pages by using only 43,472 nodes with an average query response time of 194 ms. By using twice as many (86,944) nodes, ODYS can provide an average query response time of 148 ms. These results show that building a massively-parallel search engine using a parallel DBMS is a viable approach, with the advantage of supporting a high-level (i.e., DBMS-level), SQL-like programming interface.

【Keywords】: db-ir tight integration; massively-parallel search engine; parallel dbms

Research session 10: graph management 3

28. Massive graph triangulation.

Paper Link】 【Pages】:325-336

【Authors】: Xiaocheng Hu ; Yufei Tao ; Chin-Wan Chung

【Abstract】: This paper studies I/O-efficient algorithms for settling the classic triangle listing problem, whose solution is a basic operator in dealing with many other graph problems. Specifically, given an undirected graph G, the objective of triangle listing is to find all the cliques involving 3 vertices in G. The problem has been well studied in internal memory, but remains a difficult challenge when G does not fit in memory, forcing any algorithm to perform frequent I/O accesses. Although previous research has attempted to tackle the challenge, the state-of-the-art solutions rely on a set of crippling assumptions to guarantee good performance. Motivated by this, we develop a new algorithm that is provably I/O and CPU efficient at the same time, without making any assumption on the input G at all. The algorithm uses ideas drastically different from all the previous approaches, and outperformed the existing competitors by over an order of magnitude in our extensive experimentation.

【Keywords】: graph; i/o-efficient algorithm; triangle
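For reference, the in-memory baseline that I/O-efficient algorithms compete with can be written in a few lines: orient every edge from the lower- to the higher-ranked endpoint (by degree) and intersect out-neighborhoods, so each triangle is reported exactly once. The paper's contribution is doing this when the graph exceeds memory; this sketch assumes everything fits.

```python
from collections import defaultdict

def list_triangles(edges):
    """List each triangle of an undirected graph exactly once."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    rank = {v: (deg[v], v) for v in deg}   # degree order, ties broken by id
    out = defaultdict(set)
    for u, v in edges:
        a, b = (u, v) if rank[u] < rank[v] else (v, u)
        out[a].add(b)                      # orient edge toward the higher rank
    tris = []
    for u in out:
        for v in out[u]:
            for w in out[u] & out.get(v, set()):
                tris.append(tuple(sorted((u, v, w))))
    return sorted(tris)
```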

29. TurboISO: towards ultrafast and robust subgraph isomorphism search in large graph databases.

Paper Link】 【Pages】:337-348

【Authors】: Wook-Shin Han ; Jinsoo Lee ; Jeong-Hoon Lee

【Abstract】: Given a query graph q and a data graph g, the subgraph isomorphism search finds all occurrences of q in g and is considered one of the most fundamental query types for many real applications. While this problem is NP-hard, many algorithms have been proposed to solve it in a reasonable time for real datasets. However, a recent study has shown, through an extensive benchmark with various real datasets, that all existing algorithms have serious problems in their matching order selection. Furthermore, all algorithms blindly permute all possible mappings for query vertices, often leading to useless computations. In this paper, we present an efficient and robust subgraph search solution, called TurboISO, which is turbo-charged with two novel concepts, candidate region exploration and the combine and permute strategy (in short, Comb/Perm). The candidate region exploration identifies on-the-fly candidate subgraphs (i.e., candidate regions), which contain embeddings, and computes a robust matching order for each candidate region explored. The Comb/Perm strategy exploits the novel concept of the neighborhood equivalence class (NEC). Query vertices in the same NEC have identical sets of matching data vertices. During subgraph isomorphism search, Comb/Perm generates only combinations for each NEC instead of permuting all possible enumerations. Thus, if a chosen combination is determined to not contribute to a complete solution, all possible permutations for that combination will be safely pruned. Extensive experiments with many real datasets show that TurboISO consistently and significantly outperforms all competitors by up to several orders of magnitude.

【Keywords】: candidate region exploration; combine and permute strategy; neighborhood equivalence class; subgraph isomorphism

30. Fast exact shortest-path distance queries on large networks by pruned landmark labeling.

Paper Link】 【Pages】:349-360

【Authors】: Takuya Akiba ; Yoichi Iwata ; Yuichi Yoshida

【Abstract】: We propose a new exact method for shortest-path distance queries on large-scale networks. Our method precomputes distance labels for vertices by performing a breadth-first search from every vertex. Seemingly too obvious and too inefficient at first glance, the key ingredient introduced here is pruning during breadth-first searches. While we can still answer the correct distance for any pair of vertices from the labels, it surprisingly reduces the search space and sizes of labels. Moreover, we show that we can perform 32 or 64 breadth-first searches simultaneously exploiting bitwise operations. We experimentally demonstrate that the combination of these two techniques is efficient and robust on various kinds of large-scale real-world networks. In particular, our method can handle social networks and web graphs with hundreds of millions of edges, which are two orders of magnitude larger than the limits of previous exact methods, with comparable query time to those of previous methods.

【Keywords】: graphs; query processing; shortest paths
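The pruned-labeling idea translates almost directly into code: run a BFS from each vertex in turn, and stop expanding at any vertex whose distance is already answered by the labels built so far; a query then takes the minimum of `d(u, landmark) + d(landmark, v)` over shared landmarks. A minimal sketch for small unweighted graphs (the paper additionally orders vertices carefully and batches 32/64 BFSs with bitwise operations):

```python
from collections import deque

def build_labels(adj):
    """Pruned landmark labeling; adj maps each vertex to its neighbor list.
    Returns the labels and an exact-distance query function."""
    labels = {v: {} for v in adj}             # vertex -> {landmark: distance}

    def query(u, v):
        best = float('inf')
        for lm, du in labels[u].items():
            dv = labels[v].get(lm)
            if dv is not None and du + dv < best:
                best = du + dv
        return best

    for root in adj:                          # landmarks in the given vertex order
        dist = {root: 0}
        q = deque([root])
        while q:
            u = q.popleft()
            if query(root, u) <= dist[u]:
                continue                      # pruned: existing labels answer this pair
            labels[u][root] = dist[u]
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
    return labels, query
```

The pruning is what keeps label sets small: a BFS from a late landmark dies out quickly wherever earlier landmarks already cover the distances.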

Research session 11: text databases 3

31. Improving regular-expression matching on strings using negative factors.

Paper Link】 【Pages】:361-372

【Authors】: Xiaochun Yang ; Bin Wang ; Tao Qiu ; Yaoshu Wang ; Chen Li

【Abstract】: The problem of finding matches of a regular expression (RE) on a string exists in many applications such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each of them using an automaton. These techniques become inefficient when there are many candidate occurrences that need to be verified. In this paper we propose a novel technique that prunes candidates by utilizing negative factors, which are substrings that cannot appear in an answer. A main advantage of the technique is that it can be integrated with many existing algorithms to improve their efficiency significantly. We give a full specification of this technique. We develop an efficient algorithm that utilizes negative factors to prune candidates, then improve it by using bit operations to process negative factors in parallel. We show that negative factors, when used together with necessary factors (substrings that must appear in each answer), can achieve much better pruning power. We analyze the typically large space of negative factors and develop an algorithm for finding a small number of high-quality ones. We conducted a thorough experimental study of this technique on real data sets, including DNA sequences, proteins, and text documents, and show significant performance improvements when applying the technique to existing algorithms. For instance, it improved the search speed of the popular GNU grep tool by 11 to 74 times for text documents.

【Keywords】: long sequence; performance; regular expression
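The pruning principle can be demonstrated in isolation: a candidate span whose substring contains a negative factor cannot be an answer by definition, so the automaton never needs to run on it. The prefilter producing the `(start, end)` candidate spans is a hypothetical upstream step here, and Python's `re` stands in for the verification automaton.

```python
import re

def verify_candidates(text, pattern, candidates, negatives):
    """Verify candidate spans, skipping any whose substring contains a
    negative factor; returns the answers and how many automaton runs
    were actually needed."""
    rx = re.compile(pattern)
    answers, automaton_runs = [], 0
    for s, e in candidates:
        sub = text[s:e]
        if any(nf in sub for nf in negatives):
            continue                    # pruned with cheap substring tests
        automaton_runs += 1
        if rx.fullmatch(sub):
            answers.append((s, e))
    return answers, automaton_runs
```

For the pattern `ab+c`, the string `dc` is a negative factor: no match of `ab+c` can contain it, so any candidate containing `dc` is discarded before verification.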

32. String similarity measures and joins with synonyms.

Paper Link】 【Pages】:373-384

【Authors】: Jiaheng Lu ; Chunbin Lin ; Wei Wang ; Chen Li ; Haiyong Wang

【Abstract】: A string similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings "Sam" and "Samuel" can be considered similar. Most existing work that computes the similarity of two strings only considers syntactic similarities, e.g., number of common words or q-grams. While these are indeed indicators of similarity, there are many important cases where syntactically different strings can represent the same real-world object. For example, "Bill" is a short form of "William". Given a collection of predefined synonyms, the purpose of this paper is to exploit such knowledge to evaluate string similarity measures more effectively and efficiently, thereby boosting the quality of string matching. In particular, we first present an expansion-based framework to measure string similarities efficiently while considering synonyms. Because using synonyms in similarity measures is, while expressive, computationally expensive (NP-hard), we propose an efficient algorithm, called selective-expansion, which guarantees optimality in many real scenarios. We then study a novel indexing structure called SI-tree, which combines both signature and length filtering strategies, for efficient string similarity joins with synonyms. We develop an estimator to approximate the size of candidates to enable an online selection of signature filters to further improve the efficiency. This estimator provides strong low-error, high-confidence guarantees while requiring only logarithmic space and time costs, thus making our method attractive both in theory and in practice. Finally, the results from an empirical study of the algorithms verify the effectiveness and efficiency of our approach.

【Keywords】: filter estimation; similarity join; similarity search
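An expansion-based measure can be sketched as follows: each token set is expanded by every subset of applicable synonym rules, and the best Jaccard similarity over the expansions is taken. The exhaustive enumeration below is exactly what makes the problem expensive; the paper's selective-expansion algorithm exists to avoid it.

```python
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def expanded_similarity(s, t, synonyms):
    """Best Jaccard over all subsets of applicable (lhs, rhs) rewrite rules,
    each applied by adding the rhs token. Exponential on purpose: a brute-
    force reference, not the paper's selective-expansion algorithm."""
    def expansions(tokens):
        rules = [(l, r) for l, r in synonyms if l in tokens]
        outs = []
        for k in range(len(rules) + 1):
            for sub in combinations(rules, k):
                e = set(tokens)
                for l, r in sub:
                    e.add(r)
                outs.append(e)
        return outs
    return max(jaccard(es, et)
               for es in expansions(set(s.split()))
               for et in expansions(set(t.split())))
```

With the rule Bill ↔ William, "Bill Gates" and "William Gates" become identical after expansion, while plain Jaccard scores them at 1/3.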

33. Efficient top-k algorithms for approximate substring matching.

Paper Link】 【Pages】:385-396

【Authors】: Younghoon Kim ; Kyuseok Shim

【Abstract】: There is a wide range of applications that require querying a large database of texts to search for similar strings or substrings. Traditional approximate substring matching requires the user to specify a similarity threshold. Without top-k approximate substring matching, users have to repeatedly try different maximum distance thresholds when the proper threshold is unknown in advance. In this paper, we propose efficient algorithms for finding the top-k approximate substring matches with a given query string in a set of data strings. To reduce the number of expensive distance computations, the proposed algorithms utilize novel filtering techniques that take advantage of q-grams and available inverted q-gram indexes. We conduct extensive experiments with real-life data sets. Our experimental results confirm the effectiveness and scalability of our proposed algorithms.

【Keywords】: edit distance; inverted q-gram index; top-k approximate substring matching
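The filter-and-verify loop behind such top-k algorithms can be sketched with the classic q-gram count filter: one edit destroys at most q q-grams of the query, so a candidate whose q-gram overlap implies a distance lower bound no better than the current k-th best skips the edit-distance DP entirely. For simplicity this sketch ranks whole candidate strings; the paper's algorithms handle substrings of the data strings.

```python
import heapq
from collections import Counter

def edit_distance(a, b):
    """Standard O(|a||b|) dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def topk_matches(query, candidates, k, q=2):
    qg = Counter(qgrams(query, q))
    nq = max(len(query) - q + 1, 0)
    heap = []                                    # (-distance, candidate)
    for c in candidates:
        common = sum(min(n, Counter(qgrams(c, q))[g]) for g, n in qg.items())
        lower = max(0, -(-(nq - common) // q))   # ceil((nq - common) / q)
        if len(heap) == k and lower >= -heap[0][0]:
            continue                             # pruned: cannot beat the k-th best
        d = edit_distance(query, c)
        if len(heap) < k:
            heapq.heappush(heap, (-d, c))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, c))
    return sorted((-nd, c) for nd, c in heap)
```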

Research session 12: systems, performance II 3

34. Towards high-throughput gibbs sampling at scale: a study across storage managers.

Paper Link】 【Pages】:397-408

【Authors】: Ce Zhang ; Christopher Ré

【Abstract】: Factor graphs and Gibbs sampling are a popular combination for Bayesian statistical methods that are used to solve diverse problems including insurance risk models, pricing models, and information extraction. Given a fixed sampling method and a fixed amount of time, an implementation of a sampler that achieves a higher throughput of samples will achieve a higher quality than a lower-throughput sampler. We study how (and whether) traditional data processing choices about materialization, page layout, and buffer-replacement policy need to be changed to achieve high-throughput Gibbs sampling for factor graphs that are larger than main memory. We find that both new theoretical and new algorithmic techniques are required to understand the tradeoff space for each choice. On both real and synthetic data, we demonstrate that traditional baseline approaches may achieve two orders of magnitude lower throughput than an optimal approach. For a handful of popular tasks across several storage backends, including HBase and traditional Unix files, we show that our simple prototype achieves competitive (and sometimes better) throughput compared to specialized state-of-the-art approaches on factor graphs that are larger than main memory.

【Keywords】: gibbs sampling; scalability; storage manager
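The sampler whose throughput the paper studies looks like this in miniature: sweep over the variables, and resample each from its conditional given its factor-graph neighbors. The binary pairwise parameterization below is a toy of our own choosing; the paper's question is how to lay such graphs out on storage once they no longer fit in memory.

```python
import math
import random

def gibbs_sample(n_vars, factors, n_sweeps, seed=0):
    """Gibbs sampler over binary variables with pairwise log-linear factors:
    a factor (i, j, theta) contributes weight theta whenever x_i == x_j."""
    rng = random.Random(seed)
    x = [rng.choice([0, 1]) for _ in range(n_vars)]
    nbrs = [[] for _ in range(n_vars)]
    for i, j, th in factors:
        nbrs[i].append((j, th))
        nbrs[j].append((i, th))
    samples = []
    for _ in range(n_sweeps):
        for i in range(n_vars):
            # log-odds of x_i = 1 given the current values of its neighbors
            logit = sum(th if x[j] == 1 else -th for j, th in nbrs[i])
            x[i] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-logit)) else 0
        samples.append(list(x))
    return samples
```

With strong positive couplings, neighboring variables agree in most sweeps, which is the expected behavior for this model.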

35. Latch-free data structures for DBMS: design, implementation, and evaluation.

Paper Link】 【Pages】:409-420

【Authors】: Takashi Horikawa

【Abstract】: Multi-core CPUs are now common, and the number of CPU cores per chip continues to rise, so a server machine can easily contain an extremely high number of CPU cores. The CPU scalability of IT systems is thus attracting a considerable amount of research attention. Some systems, such as ACID-compliant DBMSs, are said to be difficult to scale, probably due to the mutual exclusion required to ensure data consistency. Possible countermeasures include latch-free (LF) data structures, an elemental technology that improves CPU scalability by eliminating the need for mutual exclusion. This paper investigates these LF data structures with a particular focus on their applicability and effectiveness. Some existing LF data structures (such as LF hash tables) have been adapted to PostgreSQL, one of the most popular open-source DBMSs. The performance improvement was evaluated with a benchmark program simulating real-world transactions. Measurement results obtained from state-of-the-art 80-core machines demonstrated that the LF data structures were effective for performance improvement in a many-core setting, increasing DBT-1 throughput by about 2.5 times. Although the poor performance of the original DBMS was due to a severe latch-related bottleneck and could also be improved by parameter tuning, it is of practical importance that the LF data structures provided the improvement without the deep understanding of target-system behavior that such tuning requires.

【Keywords】: benchmark program; bottleneck; critical section; database management system; formal verification; many-core system; performance tuning; scalability; symmetric multiprocessing

36. DBMS metrology: measuring query time.

Paper Link】 【Pages】:421-432

【Authors】: Sabah Currim ; Richard T. Snodgrass ; Young-Kyoon Suh ; Rui Zhang ; Matthew Wong Johnson ; Cheng Yi

【Abstract】: It is surprisingly hard to obtain accurate and precise measurements of the time spent executing a query. We review relevant process and overall measures obtainable from the Linux kernel and introduce a structural causal model relating these measures. A thorough correlational analysis provides strong support for this model. Using this model, we developed a timing protocol, which (1) performs sanity checks to ensure validity of the data, (2) drops some query executions via clearly motivated predicates, (3) drops some entire queries at a cardinality, again via clearly motivated predicates, (4) for those that remain, for each computes a single measured time by a carefully justified formula over the underlying measures of the remaining query executions, and (5) performs post-analysis sanity checks. The resulting query time measurement procedure, termed the Tucson Protocol, applies to proprietary and open-source DBMSes.

【Keywords】: accuracy; repeatability; tucson protocol
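The drop-and-aggregate steps of such a protocol can be sketched on their own: given repeated measurements of one query at one cardinality, discard runs that stray from the median by more than a tolerance (a crude sanity predicate of our own, not one of the paper's), and report a single time only when most runs survive.

```python
import statistics

def robust_time(times, max_spread=0.05):
    """Collapse repeated query timings into one number, or None if the
    measurements are too noisy to trust (mirroring the protocol's step of
    dropping an entire query at a cardinality)."""
    med = statistics.median(times)
    kept = [t for t in times if abs(t - med) <= max_spread * med]
    if len(kept) < len(times) // 2:
        return None                     # too noisy: discard this measurement
    return sum(kept) / len(kept)
```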

Research session 13: information extraction 3

37. Quality and efficiency for kernel density estimates in large data.

Paper Link】 【Pages】:433-444

【Authors】: Yan Zheng ; Jeffrey Jestes ; Jeff M. Phillips ; Feifei Li

【Abstract】: Kernel density estimates are important for a broad variety of applications. Their construction has been well-studied, but existing techniques are expensive on massive datasets and/or only provide heuristic approximations without theoretical guarantees. We propose randomized and deterministic algorithms with quality guarantees which are orders of magnitude more efficient than previous algorithms. Our algorithms do not require knowledge of the kernel or its bandwidth parameter and are easily parallelizable. We demonstrate how to implement our ideas in a centralized setting and in MapReduce, although our algorithms are applicable to any large-scale data processing framework. Extensive experiments on large real datasets demonstrate the quality, efficiency, and scalability of our techniques.

【Keywords】: distributed and parallel kde; kernel density estimate (kde)
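The simplest randomized baseline in this space is worth seeing concretely: a uniform random sample of the data is itself an approximate KDE, with error shrinking as the sample grows; the paper's algorithms reach a target error with far smaller summaries and with guarantees. A one-dimensional Gaussian-kernel sketch:

```python
import math
import random

def kde(points, bandwidth, x):
    """Gaussian kernel density estimate at x over a 1-D point set."""
    return sum(math.exp(-((x - p) / bandwidth) ** 2 / 2) for p in points) \
        / (len(points) * bandwidth * math.sqrt(2 * math.pi))

def approx_kde_sample(points, size, seed=0):
    """Random-sample summary: evaluating kde() on the sample approximates
    evaluating it on the full data set."""
    return random.Random(seed).sample(points, size)
```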

38. Efficient ad-hoc search for personalized PageRank.

Paper Link】 【Pages】:445-456

【Authors】: Yasuhiro Fujiwara ; Makoto Nakatsuji ; Hiroaki Shiokawa ; Takeshi Mishima ; Makoto Onizuka

【Abstract】: Personalized PageRank (PPR) has been successfully applied to various applications. In real applications, it is important to set PPR parameters in an ad-hoc manner when finding similar nodes, because of the dynamically changing nature of graphs. Interactive similarity search supports users in enhancing the efficacy of applications through interactive actions. Unfortunately, if the graph is large, interactive similarity search is infeasible due to its high computation cost. Previous PPR approaches cannot effectively handle interactive similarity search since they need precomputation or approximate computation of similarities. The goal of this paper is to efficiently find the top-k nodes with exact node ranking so as to effectively support interactive similarity search based on PPR. Our solution is Castanet. The key Castanet operations are to (1) iteratively estimate upper/lower bounding similarities, and (2) dynamically prune unnecessary nodes to obtain the top-k nodes in each iteration. Experiments show that our approach is much faster than existing approaches.

【Keywords】: graph database; personalized pagerank; top-k search
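The baseline computation Castanet improves on is plain power iteration for single-source PPR, followed by sorting for the top-k; a minimal sketch (the paper's contribution, bounding and pruning so the exact top-k emerges without converging every node, is not shown here):

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=50):
    """Single-source PPR by power iteration; alpha is the restart
    probability and dangling mass is returned to the source."""
    score = {v: 0.0 for v in adj}
    score[source] = 1.0
    for _ in range(iters):
        nxt = {v: (alpha if v == source else 0.0) for v in adj}
        for u, out in adj.items():
            if out:
                share = (1 - alpha) * score[u] / len(out)
                for w in out:
                    nxt[w] += share
            else:
                nxt[source] += (1 - alpha) * score[u]   # dangling mass restarts
        score = nxt
    return score

def topk_nodes(adj, source, k):
    s = personalized_pagerank(adj, source)
    return sorted(s, key=s.get, reverse=True)[:k]
```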

39. Provenance-based dictionary refinement in information extraction.

Paper Link】 【Pages】:457-468

【Authors】: Sudeepa Roy ; Laura Chiticariu ; Vitaly Feldman ; Frederick Reiss ; Huaiyu Zhu

【Abstract】: Dictionaries of terms and phrases (e.g. common person or organization names) are integral to information extraction systems that extract structured information from unstructured text. Using noisy or unrefined dictionaries may lead to many incorrect results even when highly precise and sophisticated extraction rules are used. In general, the results of the system depend on dictionary entries in arbitrarily complex ways, and removal of a set of entries can remove both correct and incorrect results. Further, any such refinement critically requires laborious manual labeling of the results. In this paper, we study the dictionary refinement problem and address the above challenges. Using provenance of the outputs in terms of the dictionary entries, we formalize an optimization problem of maximizing the quality of the system with respect to the refined dictionaries, study the complexity of this problem, and give efficient algorithms. We also propose solutions to address incomplete labeling of the results, where we estimate the missing labels assuming a statistical model. We conclude with a detailed experimental evaluation using several real-world extractors and competition datasets to validate our solutions. Beyond information extraction, our provenance-based techniques and solutions may find applications in view maintenance in general relational settings.

【Keywords】: information extraction; optimization; provenance; refinement
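The core optimization can be illustrated with a deliberately simplified provenance model: each dictionary entry independently produces a set of results, labels mark results correct or incorrect, and entries are greedily dropped while F1 on the labeled results improves. The greedy loop is our own stand-in; the paper handles results whose provenance spans several entries, gives complexity results, and copes with incomplete labels.

```python
def greedy_refine(entry_results, labels):
    """entry_results: entry -> set of result ids it produces;
    labels: result id -> True (correct) / False (incorrect).
    Returns the kept entries and the F1 they achieve."""
    all_results = set().union(*entry_results.values())
    total_pos = sum(labels[r] for r in all_results)

    def f1_of(entries):
        produced = set().union(*(entry_results[e] for e in entries)) if entries else set()
        tp = sum(labels[r] for r in produced)
        fp = len(produced) - tp
        fn = total_pos - tp
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    kept = set(entry_results)
    best = f1_of(kept)
    improved = True
    while improved:
        improved = False
        for e in sorted(kept):
            s = f1_of(kept - {e})
            if s > best:                  # dropping e removes more noise than signal
                best, kept, improved = s, kept - {e}, True
                break
    return kept, best
```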

Research session 14: query processing and optimization 3

40. CS2: a new database synopsis for query estimation.

Paper Link】 【Pages】:469-480

【Authors】: Feng Yu ; Wen-Chi Hou ; Cheng Luo ; Dunren Che ; Mengxia Zhu

【Abstract】: Fast and accurate estimations for complex queries are profoundly beneficial for large databases with heavy workloads. In this research, we propose a statistical summary for a database, called CS2 (Correlated Sample Synopsis), to provide rapid and accurate result size estimations for all queries with joins and arbitrary selections. Unlike the state-of-the-art techniques, CS2 does not completely rely on simple random samples, but mainly consists of correlated sample tuples that retain join relationships with less storage. We introduce a statistical technique, called reverse sample, and design a powerful estimator, called reverse estimator, to fully utilize correlated sample tuples for query estimation. We prove both theoretically and empirically that the reverse estimator is unbiased and accurate using CS2. Extensive experiments on multiple datasets show that CS2 is fast to construct and derives more accurate estimations than existing methods with the same space budget.

【Keywords】: database synopsis; query optimization; selectivity estimation
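The correlated-sampling intuition can be sketched for a single two-table join: sample tuples of R, follow the join relationship to their matching S tuples (the correlated part of the synopsis), and scale the joined count by |R| over the sample size. This is a simplification of our own; CS2's reverse estimator generalizes the idea to arbitrary join graphs and selections.

```python
import random
from collections import defaultdict

def correlated_sample_estimate(R, S, key_r, key_s, sample_size, seed=0):
    """Estimate |R join S| from a sample of R plus the S tuples that join
    with it. key_r/key_s extract the join attribute from a tuple."""
    sample = random.Random(seed).sample(R, sample_size)
    index = defaultdict(list)
    for t in S:
        index[key_s(t)].append(t)      # the correlated S-side of the synopsis
    joined = sum(len(index[key_r(t)]) for t in sample)
    return joined * len(R) / sample_size
```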

41. Branch-and-bound algorithm for reverse top-k queries.

Paper Link】 【Pages】:481-492

【Authors】: Akrivi Vlachou ; Christos Doulkeridis ; Kjetil Nørvåg ; Yannis Kotidis

【Abstract】: Top-k queries return to the user only the k best objects based on the individual user preferences and comprise an essential tool for rank-aware query processing. Assuming a stored data set of user preferences, reverse top-k queries have been introduced for retrieving the users that deem a given database object as one of their top-k results. Reverse top-k queries have already attracted significant interest in research, due to numerous real-life applications such as market analysis and product placement. Currently, the most efficient algorithm for computing the reverse top-k set is RTA. RTA has two main drawbacks when processing a reverse top-k query: (i) it needs to access all stored user preferences, and (ii) it cannot avoid executing a top-k query for each user preference that belongs to the result set. To address these limitations, in this paper, we identify useful properties for processing reverse top-k queries without accessing each user's individual preferences or executing a top-k query for each of them. We propose an intuitive branch-and-bound algorithm for processing reverse top-k queries efficiently and discuss novel optimizations to boost its performance. Our experimental evaluation demonstrates the efficiency of the proposed algorithm, which outperforms RTA by a large margin.

【Keywords】: branch-and-bound algorithm; reverse top-k query
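To make concrete what a reverse top-k query computes (a didactic brute force, not the RTA baseline or the paper's branch-and-bound algorithm; all names and toy data below are illustrative): with linear scoring functions, one can run a top-k check per stored preference vector, which is exactly the O(|preferences| × |objects|) cost profile the paper's algorithm is designed to beat.

```python
def topk_score(w, p):
    """Linear scoring function: score = w . p, lower ranks higher."""
    return sum(wi * pi for wi, pi in zip(w, p))

def reverse_topk(query, objects, preferences, k):
    """Return every preference vector whose top-k result contains `query`.

    Brute force for illustration: one full top-k check per preference,
    i.e., O(|preferences| * |objects|) score evaluations.
    """
    result = []
    for w in preferences:
        q_score = topk_score(w, query)
        # `query` is in w's top-k iff fewer than k objects score strictly better.
        better = sum(1 for p in objects if topk_score(w, p) < q_score)
        if better < k:
            result.append(w)
    return result

objects = [(1.0, 9.0), (9.0, 1.0), (5.0, 5.0), (2.0, 2.0)]
prefs = [(0.9, 0.1), (0.1, 0.9), (0.5, 0.5)]
print(reverse_topk((5.0, 5.0), objects, prefs, k=2))  # only the balanced user
```

The branch-and-bound approach in the paper prunes both loops: it avoids touching every preference and avoids a full top-k evaluation per candidate.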

42. On the correct and complete enumeration of the core search space.

【Paper Link】 【Pages】:493-504

【Authors】: Guido Moerkotte ; Pit Fender ; Marius Eich

【Abstract】: Reordering more than traditional joins (e.g. outerjoins, antijoins) requires some care, since not all reorderings are valid. To prevent invalid plans, two approaches have been described in the literature. We show that both approaches still produce invalid plans. We present three conflict detectors. All of them are (1) correct, i.e., prevent invalid plans, (2) easier to understand and implement than the previous (buggy) approaches, (3) more flexible in the sense that the restriction that all predicates must reject nulls is no longer required, and (4) extensible in the sense that it is easy to add new operators. Further, the last of our three approaches is complete, i.e., it allows for the generation of all valid plans within the core search space.

【Keywords】: join ordering; non-inner joins; query optimization

Research session 15: cloud computing 3

43. Trinity: a distributed graph engine on a memory cloud.

【Paper Link】 【Pages】:505-516

【Authors】: Bin Shao ; Haixun Wang ; Yatao Li

【Abstract】: Computations performed by graph algorithms are data driven, and require a high degree of random data access. Despite the great progress made in disk technology, disks still cannot provide the level of efficient random access required by graph computation. On the other hand, memory-based approaches usually do not scale due to the capacity limit of single machines. In this paper, we introduce Trinity, a general purpose graph engine over a distributed memory cloud. Through optimized memory management and network communication, Trinity supports fast graph exploration as well as efficient parallel computing. In particular, Trinity leverages graph access patterns in both online and offline computation to optimize memory and communication for best performance. These optimizations enable Trinity to support efficient online query processing and offline analytics on large graphs with just a few commodity machines. Furthermore, Trinity provides a high-level specification language called TSL for users to declare data schemas and communication protocols, which brings great ease of use for general purpose graph management and computing. Our experiments demonstrate Trinity's performance on both low-latency graph queries and high-throughput graph analytics over web-scale, billion-node graphs.

【Keywords】: distributed system; graph database; memory cloud

44. Characterizing tenant behavior for placement and crisis mitigation in multitenant DBMSs.

【Paper Link】 【Pages】:517-528

【Authors】: Aaron J. Elmore ; Sudipto Das ; Alexander Pucher ; Divyakant Agrawal ; Amr El Abbadi ; Xifeng Yan

【Abstract】: A multitenant database management system (DBMS) in the cloud must continuously monitor the trade-off between efficient resource sharing among multiple application databases (tenants) and their performance. Considering the scale of hundreds to thousands of tenants in such multitenant DBMSs, manual approaches for continuous monitoring are not tenable. A self-managing controller of a multitenant DBMS faces several challenges. For instance, how to characterize a tenant given its variety of workloads, how to reduce the impact of tenant colocation, and how to detect and mitigate a performance crisis where one or more tenants' desired service level objective (SLO) is not achieved. We present Delphi, a self-managing system controller for a multitenant DBMS, and Pythia, a technique to learn behavior through observation and supervision using DBMS-agnostic database level performance measures. Pythia accurately learns tenant behavior even when multiple tenants share a database process, learns good and bad tenant consolidation plans (or packings), and maintains a per-tenant history to detect behavior changes. Delphi detects performance crises, and leverages Pythia to suggest remedial actions, using a hill-climbing search algorithm to identify a new tenant placement strategy that mitigates SLO violations. Our evaluation using a variety of tenant types and workloads shows that Pythia can learn a tenant's behavior with more than 92% accuracy and learn the quality of packings with more than 86% accuracy. During a performance crisis, Delphi is able to reduce 99th percentile latencies by 80%, and can consolidate 45% more tenants than a greedy baseline, which balances tenant load without modeling tenant behavior.

【Keywords】: database consolidation; elastic data management; multitenancy; shared nothing architectures; tenant characterization

45. Minimal MapReduce algorithms.

【Paper Link】 【Pages】:529-540

【Authors】: Yufei Tao ; Wenqing Lin ; Xiaokui Xiao

【Abstract】: MapReduce has become a dominant parallel computing paradigm for big data, i.e., colossal datasets at the scale of tera-bytes or higher. Ideally, a MapReduce system should achieve a high degree of load balancing among the participating machines, and minimize the space usage, CPU and I/O time, and network transfer at each machine. Although these principles have guided the development of MapReduce algorithms, limited emphasis has been placed on enforcing serious constraints on the aforementioned metrics simultaneously. This paper presents the notion of minimal algorithm, that is, an algorithm that guarantees the best parallelization in multiple aspects at the same time, up to a small constant factor. We show the existence of elegant minimal algorithms for a set of fundamental database problems, and demonstrate their excellent performance with extensive experiments.

【Keywords】: big data; mapreduce; minimal algorithm
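For context, one standard building block behind balanced O(n/t) loads is TeraSort-style range partitioning with sampled boundaries, which the minimal-algorithm framework generalizes. This is a hedged single-process sketch, not code from the paper; the function names, `rho`, and the toy data are invented for illustration.

```python
import random
from bisect import bisect_right

def sample_boundaries(data, t, rho=0.1, seed=0):
    """Pick t-1 range boundaries from a Bernoulli(rho) sample of the keys.

    Splitting the sorted sample evenly gives each of the t machines
    roughly n/t keys with high probability.
    """
    rng = random.Random(seed)
    sample = sorted(x for x in data if rng.random() < rho)
    step = max(1, len(sample) // t)
    return [sample[i * step] for i in range(1, t)]

def partition(data, boundaries):
    """Route each key to the machine whose key range contains it."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for x in data:
        parts[bisect_right(boundaries, x)].append(x)
    return parts

data = list(range(10_000))
random.Random(1).shuffle(data)
parts = partition(data, sample_boundaries(data, t=4))
print([len(p) for p in parts])  # four roughly equal shares
```

In a real MapReduce job the sampling pass and the routing step run as separate rounds; here both are collapsed into one process to keep the sketch self-contained.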

Research session 16: data cleaning 3

46. NADEEF: a commodity data cleaning system.

【Paper Link】 【Pages】:541-552

【Authors】: Michele Dallachiesa ; Amr Ebaid ; Ahmed Eldawy ; Ahmed K. Elmagarmid ; Ihab F. Ilyas ; Mourad Ouzzani ; Nan Tang

【Abstract】: Despite the increasing importance of data quality and the rich theoretical and practical contributions in all aspects of data cleaning, there is no single end-to-end off-the-shelf solution to (semi-)automate the detection and the repairing of violations w.r.t. a set of heterogeneous and ad-hoc quality constraints. In short, there is no commodity platform similar to general purpose DBMSs that can be easily customized and deployed to solve application-specific data quality problems. In this paper, we present NADEEF, an extensible, generalized and easy-to-deploy data cleaning platform. NADEEF distinguishes between a programming interface and a core to achieve generality and extensibility. The programming interface allows the users to specify multiple types of data quality rules, which uniformly define what is wrong with the data and (possibly) how to repair it through writing code that implements predefined classes. We show that the programming interface can be used to express many types of data quality rules beyond the well known CFDs (FDs), MDs and ETL rules. Treating user implemented interfaces as black-boxes, the core provides algorithms to detect errors and to clean data. The core is designed in a way to allow cleaning algorithms to cope with multiple rules holistically, i.e. detecting and repairing data errors without differentiating between various types of rules. We showcase two implementations for core repairing algorithms. These two implementations demonstrate the extensibility of our core, which can also be replaced by other user-provided algorithms. Using real-life data, we experimentally verify the generality, extensibility, and effectiveness of our system.

【Keywords】: conditional functional dependency; data cleaning; etl; matching dependency

47. Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes.

【Paper Link】 【Pages】:553-564

【Authors】: Mohamed Yakout ; Laure Berti-Equille ; Ahmed K. Elmagarmid

【Abstract】: Various computational procedures or constraint-based methods for data repairing have been proposed over the last decades to identify errors and, when possible, correct them. However, these approaches have several limitations including the scalability and quality of the values to be used in replacement of the errors. In this paper, we propose a new data repairing approach that is based on maximizing the likelihood of replacement data given the data distribution, which can be modeled using statistical machine learning techniques. This is a novel approach combining machine learning and likelihood methods for cleaning dirty databases by value modification. We develop a quality measure of the repairing updates based on the likelihood benefit and the amount of changes applied to the database. We propose SCARE (SCalable Automatic REpairing), a systematic scalable framework that follows our approach. SCARE relies on a robust mechanism for horizontal data partitioning and a combination of machine learning techniques to predict the set of possible updates. Due to data partitioning, several updates can be predicted for a single record based on local views on each data partition. Therefore, we propose a mechanism to combine the local predictions and obtain accurate final predictions. Finally, we experimentally demonstrate the effectiveness, efficiency, and scalability of our approach on real-world datasets in comparison to recent data cleaning approaches.

【Keywords】: data cleaning; inconsistent data

48. Determining the relative accuracy of attributes.

【Paper Link】 【Pages】:565-576

【Authors】: Yang Cao ; Wenfei Fan ; Wenyuan Yu

【Abstract】: The relative accuracy problem is to determine, given tuples t1 and t2 that refer to the same entity e, whether t1[A] is more accurate than t2[A], i.e., whether t1[A] is closer to the true value of the A attribute of e than t2[A]. This has been a longstanding issue for data quality, and is challenging when the true values of e are unknown. This paper proposes a model for determining relative accuracy. (1) We introduce a class of accuracy rules and an inference system with a chase procedure, to deduce relative accuracy. (2) We identify and study several fundamental problems for relative accuracy. Given a set Ie of tuples pertaining to the same entity e and a set of accuracy rules, these problems are to decide whether the chase process terminates, is Church-Rosser, and leads to a unique target tuple te composed of the most accurate values from Ie for all the attributes of e. (3) We propose a framework for inferring accurate values with user interaction. (4) We provide algorithms underlying the framework, to find the unique target tuple te whenever possible; when there is not enough information to decide a complete te, we compute top-k candidate targets based on a preference model. (5) Using real-life and synthetic data, we experimentally verify the effectiveness and efficiency of our method.

【Keywords】: data accuracy; data cleaning

Research session 17: complex event processing 3

49. Photon: fault-tolerant and scalable joining of continuous data streams.

【Paper Link】 【Pages】:577-588

【Authors】: Rajagopal Ananthanarayanan ; Venkatesh Basker ; Sumit Das ; Ashish Gupta ; Haifeng Jiang ; Tianhao Qiu ; Alexey Reznichenko ; Deomid Ryabkov ; Manpreet Singh ; Shivakumar Venkataraman

【Abstract】: Web-based enterprises process events generated by millions of users interacting with their websites. Rich statistical data distilled from combining such interactions in near real-time generates enormous business value. In this paper, we describe the architecture of Photon, a geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency, where the streams may be unordered or delayed. The system fully tolerates infrastructure degradation and datacenter-level outages without any manual intervention. Photon guarantees that there will be no duplicates in the joined output (at-most-once semantics) at any point in time, that most joinable events will be present in the output in real-time (near-exact semantics), and exactly-once semantics eventually. Photon is deployed within Google Advertising System to join data streams such as web search queries and user clicks on advertisements. It produces joined logs that are used to derive key business metrics, including billing for advertisers. Our production deployment processes millions of events per minute at peak with an average end-to-end latency of less than 10 seconds. We also present challenges and solutions in maintaining large persistent state across geographically distant locations, and highlight the design principles that emerged from our experience.

【Keywords】: continuous streams; fault-tolerance; paxos; stream joining
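To make the core joining problem concrete (a toy single-machine sketch only; Photon's actual contribution is making this fault-tolerant across datacenters with Paxos-backed persistent state): join two unordered streams on a shared id, buffering early arrivals and suppressing duplicates. All class and method names below are hypothetical.

```python
class StreamJoiner:
    """In-memory join of two unordered event streams on a shared id,
    with duplicate suppression for at-most-once output."""

    def __init__(self):
        self.queries = {}         # query_id -> query event
        self.pending_clicks = {}  # clicks that arrived before their query
        self.emitted = set()      # ids already joined (at-most-once)

    def on_query(self, qid, event):
        self.queries[qid] = event
        if qid in self.pending_clicks and qid not in self.emitted:
            return self._emit(qid, event, self.pending_clicks.pop(qid))

    def on_click(self, qid, click):
        if qid in self.emitted:
            return None  # duplicate click: suppress
        if qid in self.queries:
            return self._emit(qid, self.queries[qid], click)
        self.pending_clicks[qid] = click  # click arrived before its query
        return None

    def _emit(self, qid, query, click):
        self.emitted.add(qid)
        return (qid, query, click)

j = StreamJoiner()
print(j.on_click("q1", "click1"))   # None: query not seen yet, click buffered
print(j.on_query("q1", "search1"))  # joined tuple emitted
print(j.on_click("q1", "click1"))   # None: duplicate suppressed
```

In Photon these dictionaries correspond to large persistent state replicated across geographically distant locations, which is where the hard problems (outages, replay, eventual exactly-once semantics) live.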

50. Utility-maximizing event stream suppression.

【Paper Link】 【Pages】:589-600

【Authors】: Di Wang ; Yeye He ; Elke A. Rundensteiner ; Jeffrey F. Naughton

【Abstract】: Complex Event Processing (CEP) has emerged as a technology for monitoring event streams in search of user specified event patterns. When a CEP system is deployed in sensitive environments the user may wish to mitigate leaks of private information while ensuring that useful nonsensitive patterns are still reported. In this paper we consider how to suppress events in a stream to reduce the disclosure of sensitive patterns while maximizing the detection of nonsensitive patterns. We first formally define the problem of utility-maximizing event suppression with privacy preferences, and analyze its computational hardness. We then design a suite of real-time solutions to solve this problem. Our first solution optimally solves the problem at the event-type level. The second solution, at the event-instance level, further optimizes the event-type level solution by exploiting runtime event distributions using advanced pattern match cardinality estimation techniques. Our user study and experimental evaluation over both real-world and synthetic event streams show that our algorithms are effective in maximizing utility yet still efficient enough to offer near real-time system responsiveness.

【Keywords】: cep; privacy; utility

51. ε-Matching: event processing over noisy sequences in real time.

【Paper Link】 【Pages】:601-612

【Authors】: Zheng Li ; Tingjian Ge ; Cindy X. Chen

【Abstract】: Regular expression matching over sequences in real time is a crucial task in complex event processing on data streams. Given that such data sequences are often noisy and errors have temporal and spatial correlations, performing regular expression matching effectively and efficiently is a challenging task. Instead of the traditional approach of learning a distribution of the stream first and then processing queries, we propose a new approach that efficiently does the matching based on an error model. In particular, our algorithms are based on the realistic Markov chain error model, and report all matching paths to trace relevant basic events that trigger the matching. This is much more informative than a single matching path. We also devise algorithms to efficiently return only top-k matching paths, and to handle negations in an extended regular expression. Finally, we conduct a comprehensive experimental study to evaluate our algorithms using real datasets.

【Keywords】: complex event processing; noisy stream; pattern matching

Research session 18: systems, performance III 3

52. Toward practical query pricing with QueryMarket.

【Paper Link】 【Pages】:613-624

【Authors】: Paraschos Koutris ; Prasang Upadhyaya ; Magdalena Balazinska ; Bill Howe ; Dan Suciu

【Abstract】: We develop a new pricing system, QueryMarket, for flexible query pricing in a data market based on an earlier theoretical framework (Koutris et al., PODS 2012). To build such a system, we show how to use an Integer Linear Programming formulation of the pricing problem for a large class of queries, even when pricing is computationally hard. Further, we leverage query history to avoid double charging when queries purchased over time have overlapping information, or when the database is updated. We then present a technique that fairly shares revenue when multiple sellers are involved. Finally, we implement our approach in a prototype and evaluate its performance on several query workloads.

【Keywords】: data pricing; integer linear programming

53. Generalized scale independence through incremental precomputation.

【Paper Link】 【Pages】:625-636

【Authors】: Michael Armbrust ; Eric Liang ; Tim Kraska ; Armando Fox ; Michael J. Franklin ; David A. Patterson

【Abstract】: Developers of rapidly growing applications must be able to anticipate potential scalability problems before they cause performance issues in production environments. A new type of data independence, called scale independence, seeks to address this challenge by guaranteeing a bounded amount of work is required to execute all queries in an application, independent of the size of the underlying data. While optimization strategies have been developed to provide these guarantees for the class of queries that are scale-independent when executed using simple indexes, there are important queries for which such techniques are insufficient. Executing these more complex queries scale-independently requires precomputation using incrementally-maintained materialized views. However, since this precomputation effectively shifts some of the query processing burden from execution time to insertion time, a scale-independent system must be careful to ensure that storage and maintenance costs do not threaten scalability. In this paper, we describe a scale-independent view selection and maintenance system, which uses novel static analysis techniques that ensure that created views do not themselves become scaling bottlenecks. Finally, we present an empirical analysis that includes all the queries from the TPC-W benchmark and validates our implementation's ability to maintain nearly constant high-quantile query and update latency even as an application scales to hundreds of machines.

【Keywords】: materialized view selection; scalability; scale independence

54. Simulation of database-valued markov chains using SimSQL.

【Paper Link】 【Pages】:637-648

【Authors】: Zhuhua Cai ; Zografoula Vagena ; Luis Leopoldo Perez ; Subramanian Arumugam ; Peter J. Haas ; Christopher M. Jermaine

【Abstract】: This paper describes the SimSQL system, which allows for SQL-based specification, simulation, and querying of database-valued Markov chains, i.e., chains whose value at any time step comprises the contents of an entire database. SimSQL extends the earlier Monte Carlo database system (MCDB), which permitted Monte Carlo simulation of static database-valued random variables. Like MCDB, SimSQL uses user-specified "VG functions" to generate the simulated data values that are the building blocks of a simulated database. The enhanced functionality of SimSQL is enabled by the ability to parametrize VG functions using stochastic tables, so that one stochastic database can be used to parametrize the generation of another stochastic database, which can parametrize another, and so on. Other key extensions include the ability to explicitly define recursive versions of a stochastic table and the ability to execute the simulation in a MapReduce environment. We focus on applying SimSQL to Bayesian machine learning.

【Keywords】: databases; machine learning; markov chains

Research session 19: privacy 3

55. Recursive mechanism: towards node differential privacy and unrestricted joins.

【Paper Link】 【Pages】:653-664

【Authors】: Shixi Chen ; Shuigeng Zhou

【Abstract】: Existing differential privacy (DP) studies mainly consider aggregation on data sets where each entry corresponds to a particular participant to be protected. In many situations, a user may pose a relational algebra query on a database with sensitive data, and desire differentially private aggregation on the result of the query. However, no existing work is able to release such aggregation when the query contains unrestricted join operations. This severely limits the applications of existing DP techniques because many data analysis tasks require unrestricted joins. One example is subgraph counting on a graph. Furthermore, existing methods for differentially private subgraph counting support only edge DP and are limited to very simple subgraphs. Until recently, whether any nontrivial graph statistics could be released with reasonable accuracy for arbitrary kinds of input graphs under node DP remained an open problem. In this paper, we propose a novel differentially private mechanism that supports unrestricted joins, to release an approximation of a linear statistic of the result of some positive relational algebra calculation over a sensitive database. The error bound of the approximate answer is roughly proportional to the empirical sensitivity of the query --- a new notion that measures the maximum possible change to the query answer when a participant withdraws its data from the sensitive database. For subgraph counting, our mechanism provides a solution to achieve node DP for any kind of subgraph.

【Keywords】: differential privacy; node differential privacy; query processing

56. PrivGene: differentially private model fitting using genetic algorithms.

【Paper Link】 【Pages】:665-676

【Authors】: Jun Zhang ; Xiaokui Xiao ; Yin Yang ; Zhenjie Zhang ; Marianne Winslett

【Abstract】: ε-differential privacy is rapidly emerging as the state-of-the-art scheme for protecting individuals' privacy in published analysis results over sensitive data. The main idea is to perform random perturbations on the analysis results, such that any individual's presence in the data has negligible impact on the randomized results. This paper focuses on analysis tasks that involve model fitting, i.e., finding the parameters of a statistical model that best fit the dataset. For such tasks, the quality of the differentially private results depends upon both the effectiveness of the model fitting algorithm, and the amount of perturbations required to satisfy the privacy guarantees. Most previous studies start from a state-of-the-art, non-private model fitting algorithm, and develop a differentially private version. Unfortunately, many model fitting algorithms require intensive perturbations to satisfy ε-differential privacy, leading to poor overall result quality. Motivated by this, we propose PrivGene, a general-purpose differentially private model fitting solution based on genetic algorithms (GA). PrivGene needs significantly less perturbation than previous methods, and it achieves higher overall result quality, even for model fitting tasks where GA is not the first choice without privacy considerations. Further, PrivGene performs the random perturbations using a novel technique called the enhanced exponential mechanism, which improves over the exponential mechanism by exploiting the special properties of model fitting tasks. As case studies, we apply PrivGene to three common analysis tasks involving model fitting: logistic regression, SVM classification, and k-means clustering. Extensive experiments using real data confirm the high result quality of PrivGene, and its superiority over existing methods.

【Keywords】: differential privacy; genetic algorithms; model fitting
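For reference, the standard (non-enhanced) exponential mechanism that PrivGene builds on can be sketched as follows; this is the textbook primitive, not the paper's enhanced variant, and the toy quality function and parameter values are illustrative.

```python
import math
import random

def exponential_mechanism(candidates, quality, sensitivity, epsilon, rng):
    """Standard exponential mechanism: sample a candidate with probability
    proportional to exp(epsilon * quality / (2 * sensitivity))."""
    weights = [math.exp(epsilon * quality(c) / (2.0 * sensitivity))
               for c in candidates]
    r = rng.random() * sum(weights)
    for c, w in zip(candidates, weights):
        r -= w
        if r <= 0:
            return c
    return candidates[-1]  # guard against floating-point underrun

# Toy use: privately pick the integer closest to a hidden target of 7.
rng = random.Random(0)
picks = [exponential_mechanism(range(10), lambda c: -abs(c - 7), 1.0, 10.0, rng)
         for _ in range(100)]
print(picks.count(7))  # a generous epsilon lets the true best win almost always
```

The enhanced variant in the paper reshapes this sampling step to waste less of the privacy budget on model-fitting quality functions; the proportional-to-exp(εq/2Δ) skeleton stays the same.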

57. Information preservation in statistical privacy and bayesian estimation of unattributed histograms.

【Paper Link】 【Pages】:677-688

【Authors】: Bing-Rong Lin ; Daniel Kifer

【Abstract】: In statistical privacy, utility refers to two concepts: information preservation -- how much statistical information is retained by a sanitizing algorithm, and usability -- how (and with how much difficulty) does one extract this information to build statistical models, answer queries, etc. Some scenarios incentivize a separation between information preservation and usability, so that the data owner first chooses a sanitizing algorithm to maximize a measure of information preservation and, afterward, the data consumers process the sanitized output according to their needs [22, 46]. We analyze a variety of utility measures and show that the average (over possible outputs of the sanitizer) error of Bayesian decision makers forms the unique class of utility measures that satisfy three axioms related to information preservation. The axioms are agnostic to Bayesian concepts such as subjective probabilities and hence strengthen support for Bayesian views in privacy research. In particular, this result connects information preservation to aspects of usability -- if the information preservation of a sanitizing algorithm should be measured as the average error of a Bayesian decision maker, shouldn't Bayesian decision theory be a good choice when it comes to using the sanitized outputs for various purposes? We put this idea to the test in the unattributed histogram problem where our decision-theoretic post-processing algorithm empirically outperforms previously proposed approaches.

【Keywords】: decision theory; differential privacy; privacy; utility

Research session 20: spatial databases II 3

58. Collective spatial keyword queries: a distance owner-driven approach.

【Paper Link】 【Pages】:689-700

【Authors】: Cheng Long ; Raymond Chi-Wing Wong ; Ke Wang ; Ada Wai-Chee Fu

【Abstract】: Recently, spatial keyword queries have become a hot topic in the literature. One example of these queries is the collective spatial keyword query (CoSKQ) which is to find a set of objects in the database such that it covers a set of given keywords collectively and has the smallest cost. Unfortunately, existing exact algorithms have severe scalability problems and existing approximate algorithms, though scalable, cannot guarantee near-to-optimal solutions. In this paper, we study the CoSKQ problem and address the above issues. Firstly, we consider the CoSKQ problem using an existing cost measurement called the maximum sum cost. This problem is called MaxSum-CoSKQ and is known to be NP-hard. We observe that the maximum sum cost of a set of objects is dominated by at most three objects which we call the distance owners of the set. Motivated by this, we propose a distance owner-driven approach which involves two algorithms: one is an exact algorithm which runs faster than the best-known existing algorithm by several orders of magnitude and the other is an approximate algorithm which improves the best-known constant approximation factor from 2 to 1.375. Secondly, we propose a new cost measurement called diameter cost and CoSKQ with this measurement is called Dia-CoSKQ. We prove that Dia-CoSKQ is NP-hard. With the same distance owner-driven approach, we design two algorithms for Dia-CoSKQ: one is an exact algorithm which is efficient and scalable and the other is an approximate algorithm which gives a √3-factor approximation. We conducted extensive experiments on real datasets which verified that the proposed exact algorithms are scalable and the proposed approximate algorithms return near-to-optimal solutions.

【Keywords】: distance owner-driven approach; spatial keyword querying
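To fix ideas, here is a deliberately exhaustive CoSKQ evaluation under a max-sum-style cost, taken here (as an assumption) to be the largest query-object distance plus the group's diameter, the form under which at most three "distance owners" can dominate the cost. It is exponential in the number of objects, precisely the blow-up the distance owner-driven algorithms avoid; names and toy data are illustrative.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

def maxsum_cost(q, group):
    """Assumed cost: largest query-object distance plus the group diameter."""
    qd = max(dist(q, loc) for loc, _ in group)
    diam = max((dist(a[0], b[0]) for a, b in combinations(group, 2)),
               default=0.0)
    return qd + diam

def coskq_bruteforce(q, query_kws, objects):
    """Try every keyword-covering subset and keep the cheapest.

    Exponential in |objects|; purely didactic.
    """
    best, best_cost = None, float("inf")
    for r in range(1, len(objects) + 1):
        for group in combinations(objects, r):
            if query_kws <= set().union(*(kws for _, kws in group)):
                c = maxsum_cost(q, group)
                if c < best_cost:
                    best, best_cost = group, c
    return best, best_cost

# Each object is (location, keyword set).
objects = [((1, 1), {"cafe"}), ((2, 1), {"atm"}), ((9, 9), {"cafe", "atm"})]
group, cost = coskq_bruteforce((0, 0), {"cafe", "atm"}, objects)
print(len(group), round(cost, 3))  # the two nearby objects beat the far all-in-one
```

The distance owner insight shrinks the search: instead of all subsets, only candidate triples of owners need to be enumerated, with the remaining members filled in around them.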

59. TOUCH: in-memory spatial join by hierarchical data-oriented partitioning.

【Paper Link】 【Pages】:701-712

【Authors】: Sadegh Nobari ; Farhan Tauheed ; Thomas Heinis ; Panagiotis Karras ; Stéphane Bressan ; Anastasia Ailamaki

【Abstract】: Efficient spatial joins are pivotal for many applications and particularly important for geographical information systems or for the simulation sciences where scientists work with spatial models. Past research has primarily focused on disk-based spatial joins; efficient in-memory approaches, however, are important for two reasons: a) main memory has grown so large that many datasets fit in it and b) the in-memory join is a very time-consuming part of all disk-based spatial joins. In this paper we develop TOUCH, a novel in-memory spatial join algorithm that uses hierarchical data-oriented space partitioning, thereby keeping both its memory footprint and the number of comparisons low. Our results show that TOUCH outperforms known in-memory spatial-join algorithms as well as in-memory implementations of disk-based join approaches. In particular, it has an order-of-magnitude advantage over the memory-demanding state of the art in terms of the number of comparisons (i.e., pairwise object comparisons), as well as execution time, while it is two orders of magnitude faster when compared to approaches with a similar memory footprint. Furthermore, TOUCH is more scalable than competing approaches as data density grows.

【Keywords】: indexing; scalable algorithms; spatial joins; touch

60. Finding time period-based most frequent path in big trajectory data.

【Paper Link】 【Pages】:713-724

【Authors】: Wuman Luo ; Haoyu Tan ; Lei Chen ; Lionel M. Ni

【Abstract】: The rise of GPS-equipped mobile devices has led to the emergence of big trajectory data. In this paper, we study a new path finding query which finds the most frequent path (MFP) during user-specified time periods in large-scale historical trajectory data. We refer to this query as time period-based MFP (TPMFP). Specifically, given a time period T, a source v_s and a destination v_d, TPMFP searches the MFP from v_s to v_d during T. Though there exist several proposals on defining MFP, they only consider a fixed time period. Most importantly, we find that none of them can well reflect people's common sense notion which can be described by three key properties, namely suffix-optimal (i.e., any suffix of an MFP is also an MFP), length-insensitive (i.e., MFP should not favor shorter or longer paths), and bottleneck-free (i.e., MFP should not contain infrequent edges). The TPMFP with the above properties will reveal not only common routing preferences of the past travelers, but also take the time effectiveness into consideration. Therefore, our first task is to give a TPMFP definition that satisfies the above three properties. Then, given the comprehensive TPMFP definition, our next task is to find TPMFP over huge amounts of trajectory data efficiently. Particularly, we propose efficient search algorithms together with novel indexes to speed up the processing of TPMFP. To demonstrate both the effectiveness and the efficiency of our approach, we conduct extensive experiments using a real dataset containing over 11 million trajectories.

【Keywords】: big trajectory data; path finding

Research session 21: data streams 3

61. Integrating scale out and fault tolerance in stream processing using operator state management.

【Paper Link】 【Pages】:725-736

【Authors】: Raul Castro Fernandez ; Matteo Migliavacca ; Evangelia Kalyvianaki ; Peter Pietzuch

【Abstract】: As users of "big data" applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the "pay-as-you-go" model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs, so systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results. Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the checkpointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures.

【Keywords】: fault tolerance; scalability; stateful stream processing
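A minimal sketch of the checkpoint/replay recovery and state-partitioning ideas described above, with toy stand-ins (`StatefulOperator`, `recover`, and `partition_state` are illustrative names, not the paper's API):

```python
import copy

class StatefulOperator:
    """Toy stateful operator (a word counter) whose internal state is
    exposed explicitly, in the spirit of the state management primitives."""
    def __init__(self, state=None):
        self.state = dict(state or {})

    def process(self, word):
        self.state[word] = self.state.get(word, 0) + 1

    def checkpoint(self):
        # In the real system this snapshot is backed up to an upstream VM.
        return copy.deepcopy(self.state)

def recover(checkpoint, unprocessed):
    """Restore checkpointed state on a fresh operator instance, then
    replay the tuples that arrived after the checkpoint."""
    op = StatefulOperator(checkpoint)
    for t in unprocessed:
        op.process(t)
    return op

def partition_state(state, n):
    """Scale out: split checkpointed state across n new instances by
    hashing keys (a simplistic stand-in for the paper's partitioning)."""
    parts = [{} for _ in range(n)]
    for k, v in state.items():
        parts[hash(k) % n][k] = v
    return parts
```

Replaying the unprocessed tuples on top of the restored checkpoint is what keeps query results unaffected by the failure.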

62. Quantiles over data streams: an experimental study.

Paper Link】 【Pages】:737-748

【Authors】: Lu Wang ; Ge Luo ; Ke Yi ; Graham Cormode

【Abstract】: A fundamental problem in data management and analysis is to generate descriptions of the distribution of data. It is most common to give such descriptions in terms of the cumulative distribution, which is characterized by the quantiles of the data. The design and engineering of efficient methods to find these quantiles has attracted much study, especially in the case where the data is described incrementally, and we must compute the quantiles in an online, streaming fashion. Yet while such algorithms have proved to be tremendously useful in practice, there has been limited formal comparison of the competing methods, and no comprehensive study of their performance. In this paper, we remedy this deficit by providing a taxonomy of different methods, and describe efficient implementations. In doing so, we propose and analyze variations that have not been explicitly studied before, yet which turn out to perform the best. To illustrate this, we provide detailed experimental comparisons demonstrating the tradeoffs between space, time, and accuracy for quantile computation.

【Keywords】: data stream algorithms; quantiles
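As a baseline against which specialized quantile sketches are compared, a uniform reservoir sample already yields an online quantile estimator. A minimal sketch (the algorithms surveyed in the paper are far more space- and accuracy-efficient):

```python
import bisect
import random

class ReservoirQuantile:
    """Streaming quantile estimation via reservoir sampling: keep a
    uniform sample of the stream and read quantiles off the sorted
    sample. A simple baseline, not one of the paper's sketches."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.n = 0                 # stream length seen so far
        self.sample = []           # kept sorted for quantile lookup
        self.rng = random.Random(seed)

    def add(self, x):
        self.n += 1
        if len(self.sample) < self.capacity:
            bisect.insort(self.sample, x)
        elif self.rng.random() < self.capacity / self.n:
            # Evict a random victim so each item survives with prob. k/n.
            self.sample.pop(self.rng.randrange(self.capacity))
            bisect.insort(self.sample, x)

    def query(self, phi):
        """Approximate the phi-quantile (0 <= phi <= 1)."""
        idx = min(int(phi * len(self.sample)), len(self.sample) - 1)
        return self.sample[idx]
```

When the reservoir is at least as large as the stream, the answers are exact; the interesting trade-offs begin once the sample is much smaller than the stream.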

63. An efficient query indexing mechanism for filtering geo-textual data.

Paper Link】 【Pages】:749-760

【Authors】: Lisi Chen ; Gao Cong ; Xin Cao

【Abstract】: Massive amounts of data that are geo-tagged and associated with text information are being generated at an unprecedented scale. Users may want to be notified of interesting geo-textual objects during a period of time. For example, a user may want to be informed when tweets containing the term "garage sale" are posted within 5 km of the user's home in the next 72 hours. In this paper, for the first time we study the problem of matching a stream of incoming Boolean Range Continuous queries over a stream of incoming geo-textual objects in real time. We develop a new system for addressing the problem. In particular, we propose a hybrid index, called IQ-tree, and novel cost models for managing a stream of incoming Boolean Range Continuous queries. We also propose algorithms for matching the queries with incoming geo-textual objects based on the index. Results of empirical studies with implementations of the proposed techniques demonstrate that the paper's proposals offer scalability and excellent performance.

【Keywords】: filtering; geo-textual data; query index; subscribing
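A naive matcher for one Boolean Range Continuous query against one incoming geo-textual object might look as follows; the point of an index like the IQ-tree is precisely to avoid this per-query scan. The field names are illustrative, not the paper's:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def matches(query, obj):
    """Check one Boolean Range Continuous query against one incoming
    geo-textual object: keyword predicate, query lifetime, spatial range."""
    if not query["terms"] <= obj["terms"]:      # Boolean keyword predicate
        return False
    if obj["time"] > query["expires"]:          # continuous-query lifetime
        return False
    d = haversine_km(query["lat"], query["lon"], obj["lat"], obj["lon"])
    return d <= query["radius_km"]              # range predicate
```

In the "garage sale" example, the query would carry `terms={"garage", "sale"}`, a 5 km radius, and a 72-hour expiry.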

Research session 22: distributed systems 3

64. Bolt-on causal consistency.

Paper Link】 【Pages】:761-772

【Authors】: Peter Bailis ; Ali Ghodsi ; Joseph M. Hellerstein ; Ion Stoica

【Abstract】: We consider the problem of separating consistency-related safety properties from availability and durability in distributed data stores via the application of a "bolt-on" shim layer that upgrades the safety of an underlying general-purpose data store. This shim provides the same consistency guarantees atop a wide range of widely deployed but often inflexible stores. As causal consistency is one of the strongest consistency models that remain available during system partitions, we develop a shim layer that upgrades eventually consistent stores to provide convergent causal consistency. Accordingly, we leverage widely deployed eventually consistent infrastructure as a common substrate for providing causal guarantees. We describe algorithms and shim implementations that are suitable for a large class of application-level causality relationships and evaluate our techniques using an existing, production-ready data store and with real-world explicit causality relationships.

【Keywords】: causal consistency; eventual consistency; separation of concerns
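The shim idea can be sketched with vector clocks: a write received from the eventually consistent store becomes visible only once all of its causal dependencies are already visible locally. A toy model (the paper's metadata and algorithms differ; `CausalShim` and its fields are invented for illustration):

```python
def dominates(vc_a, vc_b):
    """True iff vector clock vc_a >= vc_b componentwise."""
    keys = set(vc_a) | set(vc_b)
    return all(vc_a.get(k, 0) >= vc_b.get(k, 0) for k in keys)

class CausalShim:
    """Toy bolt-on layer: buffer writes and expose each one only after
    its causal dependencies (carried as a vector clock) are visible."""
    def __init__(self):
        self.visible = {}      # key -> value exposed to readers
        self.clock = {}        # merged vector clock of visible writes
        self.pending = []      # writes whose dependencies are not yet met

    def receive(self, key, value, deps, writer, seq):
        self.pending.append((key, value, deps, writer, seq))
        self._drain()

    def _drain(self):
        # Repeatedly apply any pending write whose dependencies are met;
        # applying one write may unblock others.
        progress = True
        while progress:
            progress = False
            for w in list(self.pending):
                key, value, deps, writer, seq = w
                if dominates(self.clock, deps):
                    self.visible[key] = value
                    self.clock[writer] = max(self.clock.get(writer, 0), seq)
                    self.pending.remove(w)
                    progress = True
```

A reply that causally depends on a post stays buffered until the post itself arrives, which is exactly the anomaly eventual consistency alone permits.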

65. RTP: robust tenant placement for elastic in-memory database clusters.

Paper Link】 【Pages】:773-784

【Authors】: Jan Schaffner ; Tim Januschowski ; Megan Kercher ; Tim Kraska ; Hasso Plattner ; Michael J. Franklin ; Dean Jacobs

【Abstract】: In the cloud services industry, a key issue for cloud operators is to minimize operational costs. In this paper, we consider algorithms that elastically contract and expand a cluster of in-memory databases depending on tenants' behavior over time while maintaining response time guarantees. We evaluate our tenant placement algorithms using traces obtained from one of SAP's production on-demand applications. Our experiments reveal that our approach lowers operating costs for the database cluster of this application by a factor of 2.2 to 10, measured in Amazon EC2 hourly rates, in comparison to the state of the art. In addition, we carefully study the trade-off between cost savings obtained by continuously migrating tenants and the robustness of servers towards load spikes and failures.

【Keywords】: cloud computing; data placement; fault tolerance; in-memory databases; multi tenancy

66. Inter-media hashing for large-scale retrieval from heterogeneous data sources.

Paper Link】 【Pages】:785-796

【Authors】: Jingkuan Song ; Yang Yang ; Yi Yang ; Zi Huang ; Heng Tao Shen

【Abstract】: In this paper, we present a new multimedia retrieval paradigm to support large-scale search of heterogeneous multimedia data. It is able to return results of different media types from heterogeneous data sources, e.g., using a query image to retrieve relevant text documents or images from different data sources. This utilizes the widely available data from different sources and caters to users' demand for a result list that simultaneously contains multiple types of data, giving a comprehensive understanding of the query's results. To enable large-scale inter-media retrieval, we propose a novel inter-media hashing (IMH) model to explore the correlations among multiple media types from different data sources and tackle the scalability issue. To this end, multimedia data from heterogeneous data sources are transformed into a common Hamming space, in which fast search can be easily implemented by XOR and bit-count operations. Furthermore, we integrate a linear regression model to learn hashing functions so that the hash codes for new data points can be efficiently generated. Experiments conducted on real-world large-scale multimedia datasets demonstrate the superiority of our proposed method compared with state-of-the-art techniques.

【Keywords】: hashing; heterogeneous data source; indexing; inter-media retrieval
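The common Hamming space makes search a matter of XOR and bit-count, and out-of-sample codes can come from learned linear projections. A hedged sketch: the weights below are arbitrary stand-ins, not projections learned by regression as in the paper.

```python
def hamming(a, b):
    """Hamming distance between two integer hash codes: XOR, then count bits."""
    return bin(a ^ b).count("1")

def nearest(query_code, codes):
    """Linear scan over (code, item) pairs in the common Hamming space;
    large systems index this, but XOR + bit-count is the core operation."""
    return min(codes, key=lambda item: hamming(query_code, item[0]))

def hash_code(features, weights):
    """Out-of-sample extension: pack the signs of linear projections
    into one integer code."""
    code = 0
    for w in weights:
        proj = sum(f * wi for f, wi in zip(features, w))
        code = (code << 1) | (1 if proj >= 0 else 0)
    return code
```

Because codes from images and text land in the same Hamming space, the same `nearest` scan serves cross-media queries.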

Research session 23: data mining 3

67. Mind the gap: large-scale frequent sequence mining.

Paper Link】 【Pages】:797-808

【Authors】: Iris Miliaraki ; Klaus Berberich ; Rainer Gemulla ; Spyros Zoupanos

【Abstract】: Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this paper, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called "gap constraints", which can be used to limit the output to a controlled set of frequent sequences. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of w-equivalency, which is a generalization of the notion of a "projected database" used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the context of text mining suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.

【Keywords】: data mining; frequent sequence mining; mapreduce
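Gap constraints can be illustrated with a small backtracking matcher: a pattern occurs in a sequence if consecutive pattern items are at most `max_gap` positions apart. This is only the support-counting predicate that such mining must respect, not MG-FSM's partitioning algorithm:

```python
def occurs_with_gap(seq, pattern, max_gap):
    """Does `pattern` occur as a subsequence of `seq` with at most
    `max_gap` items skipped between consecutive pattern items?
    Backtracking search over all candidate positions."""
    def match(pi, pos):
        if pi == len(pattern):
            return True
        # First pattern item may start anywhere; later items must land
        # within max_gap positions of the previous match.
        lo = 0 if pi == 0 else pos + 1
        hi = len(seq) if pi == 0 else min(pos + 2 + max_gap, len(seq))
        return any(seq[j] == pattern[pi] and match(pi + 1, j)
                   for j in range(lo, hi))
    return match(0, -1)

def support(db, pattern, max_gap):
    """Number of database sequences containing the pattern."""
    return sum(occurs_with_gap(s, pattern, max_gap) for s in db)
```

A pattern is frequent when its support meets a user-chosen threshold; tightening the gap from 1 to 0 can sharply reduce support.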

68. Reverse engineering complex join queries.

Paper Link】 【Pages】:809-820

【Authors】: Meihui Zhang ; Hazem Elmeleegy ; Cecilia M. Procopiuc ; Divesh Srivastava

【Abstract】: We study the following problem: Given a database D with schema G and an output table Out, compute a join query Q that generates Out from D. A simpler variant allows Q to return a superset of Out. This problem has numerous applications, both by itself, and as a building block for other problems. Related prior work imposes conditions on the structure of Q which are not always consistent with the application, but simplify computation. We discuss several natural SQL queries that do not satisfy these conditions and cannot be discovered by prior work. In this paper, we propose an efficient algorithm that discovers queries with arbitrary join graphs. A crucial insight is that any graph can be characterized by the combination of a simple structure, called a star, and a series of merge steps over the star. The merge steps define a lattice over graphs derived from the same star. This allows us to explore the set of candidate solutions in a principled way and quickly prune out a large number of infeasible graphs. We also design several optimizations that significantly reduce the running time. Finally, we conduct an extensive experimental study over a benchmark database and show that our approach is scalable and accurately discovers complex join queries.

【Keywords】: query join graph; query lattice; sql query discovery

69. A direct mining approach to efficient constrained graph pattern discovery.

Paper Link】 【Pages】:821-832

【Authors】: Feida Zhu ; Zequn Zhang ; Qiang Qu

【Abstract】: Despite the wealth of research on frequent graph pattern mining, how to efficiently mine the complete set of patterns satisfying given constraints still poses a huge challenge to existing algorithms, mainly due to an inherent bottleneck in the mining paradigm. In essence, mining requests with explicitly-specified constraints cannot be handled in a way that is direct and precise. In this paper, we propose a direct mining framework to solve the problem and illustrate our ideas in the context of a particular type of constrained frequent patterns --- the "skinny" patterns, which are graph patterns with a long backbone from which short twigs branch out. These patterns, which we formally define as l-long δ-skinny patterns, are able to reveal insightful spatial and temporal trajectory patterns in mobile data mining, information diffusion, adoption propagation, and many others. Based on the key concept of a canonical diameter, we develop SkinnyMine, an efficient algorithm to mine all the l-long δ-skinny patterns, guaranteeing both the completeness of our mining result and the unique generation of each target pattern. We also present a general direct mining framework together with two properties of reducibility and continuity for qualified constraints. Our experiments on both synthetic and real data demonstrate the effectiveness and scalability of our approach.

【Keywords】: constrained pattern mining; direct mining; frequent graph pattern mining; skinny pattern

Research session 24: road networks and trajectories 3

70. Calibrating trajectory data for similarity-based analysis.

Paper Link】 【Pages】:833-844

【Authors】: Han Su ; Kai Zheng ; Haozhou Wang ; Jiamin Huang ; Xiaofang Zhou

【Abstract】: Due to the prevalence of GPS-enabled devices and wireless communications technologies, spatial trajectories that describe the movement history of moving objects are being generated and accumulated at an unprecedented pace. Trajectory data in a database are intrinsically heterogeneous, as they represent discrete approximations of original continuous paths derived using different sampling strategies and different sampling rates. Such heterogeneity can have a negative impact on the effectiveness of trajectory similarity measures, which are the basis of many crucial trajectory processing tasks. In this paper, we pioneer a systematic approach to trajectory calibration, a process that transforms a heterogeneous trajectory dataset into one with (almost) unified sampling strategies. Specifically, we propose an anchor-based calibration system that aligns trajectories to a set of anchor points, which are fixed locations independent of trajectory data. After examining four different types of anchor points for the purpose of building a stable reference system, we propose a geometry-based calibration approach that considers the spatial relationship between anchor points and trajectories. Then a more advanced model-based calibration method is presented, which exploits the power of machine learning techniques to train inference models from historical trajectory data to improve calibration effectiveness. Finally, we conduct extensive experiments using real trajectory datasets to demonstrate the effectiveness and efficiency of the proposed calibration system.

【Keywords】: similarity measure; trajectory calibration; trajectory database
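The anchor-based idea can be caricatured in a few lines: snap each raw point to its nearest anchor and collapse consecutive repeats, so trajectories sampled at different rates map to comparable anchor sequences. A geometry-only sketch; the paper's model-based calibration is considerably more sophisticated:

```python
import math

def calibrate(trajectory, anchors):
    """Rewrite a raw trajectory (list of (x, y) points) as the sequence
    of nearest-anchor indices, collapsing consecutive repeats."""
    def nearest(p):
        return min(range(len(anchors)),
                   key=lambda i: math.dist(p, anchors[i]))
    out = []
    for p in trajectory:
        a = nearest(p)
        if not out or out[-1] != a:   # drop consecutive duplicates
            out.append(a)
    return out
```

A densely sampled and a sparsely sampled traversal of the same route then calibrate to overlapping anchor sequences, which is what makes similarity measures on them meaningful.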

71. On optimal worst-case matching.

Paper Link】 【Pages】:845-856

【Authors】: Cheng Long ; Raymond Chi-Wing Wong ; Philip S. Yu ; Minhao Jiang

【Abstract】: Bichromatic reverse nearest neighbor (BRNN) queries have been studied extensively in the literature of spatial databases. Given a set P of service-providers and a set O of customers, a BRNN query is to find which customers in O are "interested" in a given service-provider in P. Recently, it has been found that these queries lack consideration of the capacities of service-providers and the demands of customers. In order to address this issue, some spatial matching problems have been proposed, which, however, cannot be used for some real-life applications like emergency facility allocation, where the maximum matching cost (or distance) should be minimized. In this paper, we propose a new problem called Spatial Matching for Minimizing Maximum matching distance (SPM-MM). Then, we design two algorithms for SPM-MM, Threshold-Adapt and Swap-Chain. Threshold-Adapt is simple and easy to understand, but not scalable to large datasets due to its relatively high time/space complexity. Swap-Chain, which follows a fundamentally different idea from Threshold-Adapt, runs faster than Threshold-Adapt by orders of magnitude and uses significantly less memory. We conducted extensive empirical studies which verified the efficiency and scalability of Swap-Chain.

【Keywords】: bottleneck matching; optimal worst-case spatial matching
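The threshold idea (test candidate distance limits for feasibility) can be sketched as a binary search over candidate distances plus an augmenting-path bipartite matching. This is only in the spirit of the Threshold-Adapt baseline, much simplified: capacities and demands are ignored here.

```python
def perfect_matching_under(dists, limit):
    """Can every customer be matched to a distinct provider using only
    pairs with distance <= limit? Kuhn's augmenting-path matching.
    dists[c][p] is the customer-to-provider distance."""
    match = {}  # provider -> customer
    def try_assign(c, seen):
        for p, d in enumerate(dists[c]):
            if d <= limit and p not in seen:
                seen.add(p)
                # Take a free provider, or displace its current customer.
                if p not in match or try_assign(match[p], seen):
                    match[p] = c
                    return True
        return False
    return all(try_assign(c, set()) for c in range(len(dists)))

def min_max_matching_distance(dists):
    """Smallest D such that a perfect matching exists using only pairs
    with distance <= D: binary search over the candidate distances."""
    candidates = sorted({d for row in dists for d in row})
    lo, hi, best = 0, len(candidates) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if perfect_matching_under(dists, candidates[mid]):
            best = candidates[mid]
            hi = mid - 1
        else:
            lo = mid + 1
    return best
```

With distances [[1, 9], [2, 3]], pairing customer 0 with provider 0 and customer 1 with provider 1 gives the optimal maximum distance 3; the alternative pairing would force a distance of 9.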

72. Shortest path and distance queries on road networks: towards bridging theory and practice.

Paper Link】 【Pages】:857-868

【Authors】: Andy Diwen Zhu ; Hui Ma ; Xiaokui Xiao ; Siqiang Luo ; Youze Tang ; Shuigeng Zhou

【Abstract】: Given two locations s and t in a road network, a distance query returns the minimum network distance from s to t, while a shortest path query computes the actual route that achieves the minimum distance. These two types of queries find important applications in practice, and a plethora of solutions have been proposed in the past few decades. The existing solutions, however, are optimized for either practical or asymptotic performance, but not both. In particular, the techniques with enhanced practical efficiency are mostly heuristic-based, and they offer unattractive worst-case guarantees in terms of space and time. On the other hand, the methods that are worst-case efficient often entail prohibitive preprocessing or space overheads, which render them inapplicable to the large road networks (with millions of nodes) commonly used in modern map applications. This paper presents Arterial Hierarchy (AH), an index structure that narrows the gap between theory and practice in answering shortest path and distance queries on road networks. On the theoretical side, we show that, under a realistic assumption, AH answers any distance query in Õ(log α) time, where α = dmax/dmin, and dmax (resp. dmin) is the largest (resp. smallest) L∞ distance between any two nodes in the road network. In addition, any shortest path query can be answered in Õ(k + log α) time, where k is the number of nodes on the shortest path. On the practical side, we experimentally evaluate AH on a large set of real road networks with up to twenty million nodes, and we demonstrate that (i) AH outperforms the state of the art in terms of query time, and (ii) its space and pre-computation overheads are moderate.

【Keywords】: algorithms; road network; shortest path
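For contrast with index-based methods like AH, the no-preprocessing baseline is plain Dijkstra, which answers each query from scratch in roughly O((V + E) log V) time:

```python
import heapq

def dijkstra(graph, s, t):
    """Shortest path from s to t. graph: {u: [(v, weight), ...]} with
    nonnegative weights. Returns (distance, path)."""
    dist = {s: 0}
    prev = {}
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        if u == t:
            break                         # target settled; done
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if t not in dist:
        return float("inf"), []
    path, node = [t], t
    while node != s:
        node = prev[node]
        path.append(node)
    return dist[t], path[::-1]
```

Indexes such as AH spend preprocessing time and space precisely to avoid exploring the network like this on every query.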

Research session 25: security 2

73. Fine-grained disclosure control for app ecosystems.

Paper Link】 【Pages】:869-880

【Authors】: Gabriel Bender ; Lucja Kot ; Johannes Gehrke ; Christoph Koch

【Abstract】: The modern computing landscape contains an increasing number of app ecosystems, where users store personal data on platforms such as Facebook or smartphones. APIs enable third-party applications (apps) to utilize that data. A key concern associated with app ecosystems is the confidentiality of user data. In this paper, we develop a new model of disclosure in app ecosystems. In contrast with previous solutions, our model is data-derived and semantically meaningful. Information disclosure is modeled in terms of a set of distinguished security views. Each query is labeled with the precise set of security views that is needed to answer it, and these labels drive policy decisions. We explain how our disclosure model can be used in practice and provide algorithms for labeling conjunctive queries for the case of single-atom security views. We show that our approach is useful by demonstrating the scalability of our algorithms and by applying it to the real-world disclosure control system used by Facebook.

【Keywords】: app ecosystems; database security; view rewriting

74. Lightweight authentication of linear algebraic queries on data streams.

Paper Link】 【Pages】:881-892

【Authors】: Stavros Papadopoulos ; Graham Cormode ; Antonios Deligiannakis ; Minos N. Garofalakis

【Abstract】: We consider a stream outsourcing setting, where a data owner delegates the management of a set of disjoint data streams to an untrusted server. The owner authenticates his streams via signatures. The server processes continuous queries on the union of the streams for clients trusted by the owner. Along with the results, the server sends proofs of result correctness derived from the owner's signatures, which are easily verifiable by the clients. We design novel constructions for a collection of fundamental problems over streams represented as linear algebraic queries. In particular, our basic schemes authenticate dynamic vector sums and dot products, as well as dynamic matrix products. These techniques can be adapted for authenticating a wide range of important operations in streaming environments, including group by queries, joins, in-network aggregation, similarity matching, and event processing. All our schemes are very lightweight, and offer strong cryptographic guarantees derived from formal definitions and proofs. We experimentally confirm the practicality of our schemes.

【Keywords】: data integrity; data streams; query authentication

Research session 26: indexing 2

75. Column imprints: a secondary index structure.

Paper Link】 【Pages】:893-904

【Authors】: Lefteris Sidirourgos ; Martin L. Kersten

【Abstract】: Large scale data warehouses rely heavily on secondary indexes, such as bitmaps and b-trees, to limit access to slow IO devices. However, with the advent of large main memory systems, cache conscious secondary indexes are needed to also improve the transfer bandwidth between memory and cpu. In this paper, we introduce column imprint, a simple but efficient cache conscious secondary index. A column imprint is a collection of many small bit vectors, each indexing the data points of a single cacheline. An imprint is used during query evaluation to limit data access and thus minimize memory traffic. The compression for imprints is cpu friendly and exploits the empirical observation that data often exhibits local clustering or partial ordering as a side-effect of the construction process. Most importantly, column imprint compression remains effective and robust even in the case of unclustered data, where other state-of-the-art solutions fail. We conducted an extensive experimental evaluation to assess the applicability and the performance impact of column imprints. The storage overhead, when experimenting with real world datasets, is just a few percent over the size of the columns being indexed. The evaluation time for over 40000 range queries of varying selectivity revealed the efficiency of the proposed index compared to zonemaps and bitmaps with WAH compression.

【Keywords】: columnar databases; secondary index
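The core mechanism can be sketched directly: one small bit vector per cacheline of values, with a bit per histogram bin, lets a range query skip any cacheline whose imprint misses the query's bit mask. The bin edges and cacheline size below are illustrative, and compression is omitted:

```python
import bisect

def build_imprints(column, bins, cacheline=8):
    """One bit vector per cacheline: bit i is set iff some value in
    that cacheline falls into histogram bin i. `bins` holds the
    right edges of the bins."""
    imprints = []
    for start in range(0, len(column), cacheline):
        bv = 0
        for v in column[start:start + cacheline]:
            bv |= 1 << bisect.bisect_left(bins, v)
        imprints.append(bv)
    return imprints

def scan_candidates(imprints, bins, lo, hi):
    """Range query lo <= v <= hi: build the query's bit mask and return
    only the cachelines whose imprint intersects it; only those lines'
    actual values need to be fetched and checked."""
    b_lo = bisect.bisect_left(bins, lo)
    b_hi = bisect.bisect_left(bins, hi)
    mask = ((1 << (b_hi - b_lo + 1)) - 1) << b_lo
    return [i for i, bv in enumerate(imprints) if bv & mask]
```

The bitwise AND against the query mask is the cpu-friendly part: an entire cacheline is ruled in or out with one operation.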

76. DeltaNI: an efficient labeling scheme for versioned hierarchical data.

Paper Link】 【Pages】:905-916

【Authors】: Jan Finis ; Robert Brunel ; Alfons Kemper ; Thomas Neumann ; Franz Färber ; Norman May

【Abstract】: Main-memory database systems are emerging as the new backbone of business applications. Besides flat relational data representations also hierarchical ones are essential for these modern applications; therefore we devise a new indexing and versioning approach for hierarchies that is deeply integrated into the relational kernel. We propose the DeltaNI index as a versioned pendant of the nested intervals (NI) labeling scheme. The index is space- and time-efficient and yields a gapless, fixed-size integer NI labeling for each version while also supporting branching histories. In contrast to a naive NI labeling, it facilitates even complex updates of the tree structure. As many query processing techniques that work on top of the NI labeling have already been proposed, our index can be used as a building block for processing various kinds of queries. We evaluate the performance of the index on large inputs consisting of millions of nodes and thousands of versions. Thereby we show that DeltaNI scales well and can deliver satisfying performance for large business scenarios.

【Keywords】: database indexing; hierarchical data; hierarchy indexing; labeling schemes; multiversion indexing; nested intervals
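A static, unversioned nested-intervals labeling (the structure that DeltaNI versions) can be sketched as follows; ancestor tests then reduce to two integer comparisons. This is the plain NI scheme, not DeltaNI's delta representation:

```python
def label_tree(tree, root):
    """Assign each node a nested interval (lower, upper) via a DFS so
    that descendants' intervals nest strictly inside their ancestors'.
    tree: {node: [children, ...]}."""
    labels, counter = {}, [0]
    def visit(node):
        lower = counter[0]
        counter[0] += 1
        for child in tree.get(node, []):
            visit(child)
        counter[0] += 1
        labels[node] = (lower, counter[0])
    visit(root)
    return labels

def is_ancestor(labels, u, v):
    """Constant-time test: u is a proper ancestor of v iff v's interval
    nests inside u's."""
    ul, uu = labels[u]
    vl, vu = labels[v]
    return ul < vl and vu <= uu
```

The difficulty DeltaNI addresses is keeping such labels gapless and fixed-size under structural updates and across versions, which the naive scheme above cannot do without relabeling.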

Keynote addresses 2

77. Big data in capital markets.

Paper Link】 【Pages】:917-918

【Authors】: Alex Nazaruk ; Michael Rauchman

【Abstract】: Over the past decade, global securities markets have changed dramatically. The evolution of market structure, in combination with advances in computer technology, has led to the emergence of electronic securities trading. Securities transactions that used to be conducted in person and over the phone are now predominantly executed by automated trading systems. This has resulted in significant fragmentation of the markets, a vast increase in exchange volumes, and an even greater increase in the number of orders. In this talk we present and analyze the forces behind the wide proliferation of electronic securities trading in US stock and options markets. We also give a high-level introduction to electronic securities market structure. We discuss the trading objectives of different classes of market participants and analyze how their activity affects data volumes. We also present a typical securities trading firm's data flow and analyze the various types of data it uses in its trading operations. We close with the implications this "sea change" has on DBMS requirements in capital markets.

【Keywords】: big data; capital markets; securities; transactions

78. Managing database technology at enterprise scale.

Paper Link】 【Pages】:919-920

【Authors】: Paul Yaron

【Abstract】: Paul Yaron is responsible for Non-Mainframe Relational Database Architecture, Engineering and Strategy for JPMC globally. JP Morgan is a leading financial services firm with assets over $2 trillion; it operates 40 major datacenters around the globe, serving over 60 countries with over 250,000 employees. It partners with 170 regulators and manages 230 petabytes of data, and JPMC depends on over 23,000 database instances to service multiple business units. With a deployment of such scope, JPMC leverages solutions from most major database, security, and operating system vendors. This talk will discuss the challenges and strategies of managing the evolving ecosystem of "all data", from information security to internal virtualization strategies. Engineering reliable, globally scalable, and compliant data management solutions demands a model for proactively measuring the risk complexity of an ecosystem, enabling expert focus and potential proactive remediation. Research on quantitative measurement of database (or other) ecosystem entropy appears sparse. JPMC is looking to share its ideas in this space with the academic community, as such quantitative measures become increasingly important as ecosystems move from islands of single-tenant risk into multi-tenant risk clusters.

【Keywords】: big data; integration; security

Panel 1

79. We are drowning in a sea of least publishable units (LPUs).

Paper Link】 【Pages】:921-922

【Authors】: David J. DeWitt ; Ihab F. Ilyas ; Jeffrey F. Naughton ; Michael Stonebraker

【Abstract】: Our field is drowning in a sea of conference submissions. We assert that the sheer number of papers has begun to seriously hurt the quality of the work that the field is doing and that the field is going to implode unless we take action to remedy the situation. In order to improve the quality of the papers being published we must reduce the number being submitted. This will require a change in the culture of our field where "more" is being equated to "better" by both hiring and promotion committees. In this panel we will explore some ideas for correcting the situation.

【Keywords】: conference submissions; reviewing practices; tenure expectations

Tutorials 6

80. Rethinking eventual consistency.

Paper Link】 【Pages】:923-928

【Authors】: Philip A. Bernstein ; Sudipto Das

【Abstract】: There has been a resurgence of work on replicated, distributed database systems to meet the demands of intermittently-connected clients and of disaster-tolerant databases that span data centers. Many systems weaken the criteria for replica-consistency or isolation, and in some cases add new mechanisms, to improve partition-tolerance, availability, and performance. We present a framework for comparing these criteria and mechanisms, to help architects navigate through this complex design space.

【Keywords】: eventual consistency; replication

81. Workload management for big data analytics.

Paper Link】 【Pages】:929-932

【Authors】: Ashraf Aboulnaga ; Shivnath Babu

【Abstract】:

【Keywords】: analytics; mapreduce; parallel database systems; workload management

82. Knowledge harvesting in the big-data era.

Paper Link】 【Pages】:933-938

【Authors】: Fabian M. Suchanek ; Gerhard Weikum

【Abstract】: The proliferation of knowledge-sharing communities such as Wikipedia and the progress in scalable information extraction from Web and text sources have enabled the automatic construction of very large knowledge bases. Endeavors of this kind include projects such as DBpedia, Freebase, KnowItAll, ReadTheWeb, and YAGO. These projects provide automatically constructed knowledge bases of facts about named entities, their semantic classes, and their mutual relationships. They contain millions of entities and hundreds of millions of facts about them. Such world knowledge in turn enables cognitive applications and knowledge-centric services like disambiguating natural-language text, semantic search for entities and relations in Web and enterprise data, and entity-oriented analytics over unstructured contents. Prominent examples of how knowledge bases can be harnessed include the Google Knowledge Graph and the IBM Watson question answering system. This tutorial presents state-of-the-art methods, recent advances, research opportunities, and open challenges along this avenue of knowledge harvesting and its applications. Particular emphasis will be on the twofold role of knowledge bases for big-data analytics: using scalable distributed algorithms for harvesting knowledge from Web and text sources, and leveraging entity-centric knowledge for deeper interpretation of and better intelligence with Big Data.

【Keywords】: big data; entity recognition; information extraction; knowledge base; ontology; web contents

83. Machine learning for big data.

Paper Link】 【Pages】:939-942

【Authors】: Tyson Condie ; Paul Mineiro ; Neoklis Polyzotis ; Markus Weimer

【Abstract】: Statistical Machine Learning has undergone a phase transition from a pure academic endeavor to being one of the main drivers of modern commerce and science. Even more so, recent results such as those on tera-scale learning [1] and on very large neural networks [2] suggest that scale is an important ingredient in quality modeling. This tutorial introduces current applications, techniques and systems with the aim of cross-fertilizing research between the database and machine learning communities. The tutorial covers current large scale applications of Machine Learning, their computational model, and the workflow behind building them. Based on this foundation, we present the current state-of-the-art in systems support in the bulk of the tutorial. We also identify critical gaps in the state-of-the-art. This leads to the closing of the tutorial, where we introduce two sets of open research questions: better systems support for the already established use cases of Machine Learning, and support for recent advances in Machine Learning research.

【Keywords】: big data; databases; machine learning

84. Data management perspectives on business process management: tutorial overview.

Paper Link】 【Pages】:943-948

【Authors】: Richard Hull ; Jianwen Su ; Roman Vaculín

【Abstract】: Traditional approaches to Business Process Management (BPM) focus primarily on the process aspects, and treat the persistent data accessed and manipulated by the business processes as second class citizens. A recent approach to BPM, based on "business artifacts", is centered on a modeling framework that places data and process on an equal footing. The approach has been shown useful in various application domains, and one variant of business artifacts forms the basis of the emerging OMG Case Management Model and Notation (CMMN) standard. Research results have been developed around conceptual models, enterprise interoperation, business intelligence, and verification. This data-centric approach has the potential to provide the basis for a new generation of BPM technology in support of diverse applications, fueled by the insights into abstraction and data management that have been the hallmark of database research since the 1970s.

【Keywords】: bpm; business artifacts; business process management; data-centric; workflow

85. Data stream warehousing.

Paper Link】 【Pages】:949-952

【Authors】: Lukasz Golab ; Theodore Johnson

【Abstract】:

【Keywords】: data streams; data warehousing; real-time analytics

Demo session 1: data intensive applications 10

86. Data-driven neuroscience: enabling breakthroughs via innovative data management.

Paper Link】 【Pages】:953-956

【Authors】: Alexandros Stougiannis ; Mirjana Pavlovic ; Farhan Tauheed ; Thomas Heinis ; Anastasia Ailamaki

【Abstract】: Scientists in all disciplines increasingly rely on simulations to develop a better understanding of the subject they are studying. For example, the neuroscientists we collaborate with in the Blue Brain project have started to simulate the brain on a supercomputer. The level of detail of their models is unprecedented, as they model details at the subcellular level (e.g., neurotransmitters). This level of detail, however, also leads to a true data deluge, and the neuroscientists have only a few tools to efficiently analyze the data. This demonstration showcases three innovative spatial data management techniques that have had substantial impact on computational neuroscience and other disciplines, in that they make it possible to build, analyze, and simulate bigger and more detailed models. In particular, we demonstrate a tool that integrates three spatial data management techniques that have enabled breakthroughs in neuroscience: FLAT, which enables efficient querying of spatial data; SCOUT, which allows for fast exploration of spatial data; and HiDOP, which makes efficient data discovery possible.

【Keywords】: scientific data management; spatial databases; spatial indexing

87. TsingNUS: a location-based service system towards live city.

【Paper Link】 【Pages】:957-960

【Authors】: Guoliang Li ; Nan Zhang ; Ruicheng Zhong ; Sitong Liu ; Weihuang Huang ; Ju Fan ; Kian-Lee Tan ; Lizhu Zhou ; Jianhua Feng

【Abstract】: We present our system towards a live city, called TsingNUS, which aims to provide users with more user-friendly location-aware search experiences. TsingNUS crawls location-based user-generated content from the Web (e.g., Foursquare and Twitter), then cleans and integrates it to provide users with rich, well-structured data. TsingNUS provides three user-friendly search paradigms: location-aware instant search, location-aware similarity search, and direction-aware search. Instant search returns relevant answers instantly as users type in queries letter by letter, which significantly reduces typing effort. Location-aware similarity search enables fuzzy matching between queries and the underlying data, which can tolerate typing errors. These two features boost search performance and improve the experience for mobile users, who often misspell keywords due to the limitations of mobile keyboards. In addition, users have direction-aware search requirements in many applications. For example, a driver on the highway who wants to find the nearest gas station or restaurant needs answers that lie ahead in her driving direction. TsingNUS enables direction-aware search to address this problem and allows users to search in specific directions. Moreover, TsingNUS incorporates continuous search to efficiently support continuously moving queries in a client-server setting, which reduces the number of queries submitted to the server and the communication cost between client and server. We have implemented and deployed the system, which has been widely used and well received.

【Keywords】: direction-aware search; instant search; location-aware search

88. WOW: what the world of (data) warehousing can learn from the World of Warcraft.

【Paper Link】 【Pages】:961-964

【Authors】: René Müller ; Tim Kaldewey ; Guy M. Lohman ; John McPherson

【Abstract】: Although originally designed to accelerate pixel monsters, Graphics Processing Units (GPUs) have been used for some time as accelerators for selected database operations. However, to the best of our knowledge, no one has yet reported building a complete system that allows executing complex analytics queries, much less an entire data warehouse benchmark at realistic scale. In this demo, we showcase such a complete system prototype running on a high-end GPU paired with an IBM storage system that achieves >90% hardware efficiency. Our solution delivers sustained high throughput for business analytics queries in a realistic scenario, i.e., the Star Schema Benchmark at scale factor 1,000. Attendees can interact with our system through a graphical user interface on a tablet PC. They will be able to experience first-hand how queries that require processing more than six billion rows, or 100 GB of data, are answered in less than 20 seconds. The user interface allows submitting queries, live performance monitoring of the current query all the way down to the operator level, and viewing the result once the query completes.

【Keywords】: gpgpu; hardware acceleration; olap; query processing

89. Rule-based application development using Webdamlog.

【Paper Link】 【Pages】:965-968

【Authors】: Serge Abiteboul ; Emilien Antoine ; Gerome Miklau ; Julia Stoyanovich ; Jules Testard

【Abstract】: We present the WebdamLog system for managing distributed data on the Web in a peer-to-peer manner. We demonstrate the main features of the system through an application called Wepic for sharing pictures between attendees of the SIGMOD conference. Using Wepic, attendees will be able to share, download, rate, and annotate pictures in a highly decentralized manner. We show how WebdamLog handles the heterogeneity of the devices and services used to share data in such a Web setting. We exhibit the simple rules that define the Wepic application and show how easily it can be modified.

【Keywords】: datalog; declarative; peer to peer

90. Speeding up database applications with Pyxis.

【Paper Link】 【Pages】:969-972

【Authors】: Alvin Cheung ; Owen Arden ; Samuel Madden ; Andrew C. Myers

【Abstract】: We propose to demonstrate Pyxis, a system that optimizes database applications by pushing computation to the database server. Our system applies program analysis techniques to the application source code to determine pieces of application logic that should be moved to the database server to improve performance. This frees the developer from the need to understand the intricacies of database operations or learn a new programming language for stored procedures. In addition, by dynamically monitoring resource utilization on the database server, Pyxis can migrate computation between application and database in response to workload changes. Our previous experiments have shown that Pyxis can decrease latency by up to 3x for transactional applications, and improve throughput by up to 1.7x when compared to a standard implementation using embedded SQL statements in application logic. We will demonstrate these capabilities via a visualization of real-time performance as well as an interactive code partitioning tool we have developed.

【Keywords】: database applications; performance; program optimization

91. Peckalytics: analyzing experts and interests on Twitter.

【Paper Link】 【Pages】:973-976

【Authors】: Alex Cheng ; Nilesh Bansal ; Nick Koudas

【Abstract】: We provide a description of Peckalytics, its technology, and its functionality. Peckalytics processes the entire Twitter data stream in real time and provides a flexible search interface to identify experts in any topic area, as well as users with interests in any topic. It provides flexible analytics around sets of experts and their followers, as well as sets of users with specific interests. The system is implemented to scale to large data sizes. At the time of this writing, it operates on an archive of 30 billion tweets, with 220,000 new tweets crawled every minute. In addition to raw tweets, the social graph of users, and profile information, Peckalytics makes novel use of Twitter lists to assess the expertise of different users. Our aim is to facilitate the targeting and optimization of advertising campaigns on the Twitter platform.

【Keywords】: experts analysis; interest analysis; twitter analytics

92. Packing experiments for sharing and publication.

【Paper Link】 【Pages】:977-980

【Authors】: Fernando Seabra Chirigati ; Dennis Shasha ; Juliana Freire

【Abstract】: Reproducibility is a core component of the scientific process. Revisiting and reusing past results allow science to move forward - "standing on the shoulders of giants", as Newton once said. An impediment to the adoption of computational reproducibility is that authors find it difficult to generate a compendium that encompasses all the required components to correctly reproduce their experiments. Even when a compendium is available, reviewers and readers may have difficulties in verifying the results on platforms different from the ones where the experiments were originally run. As a step towards simplifying the process of creating reproducible experiments, we have developed ReproZip, a tool that automatically captures the provenance of experiments and packs all the necessary files, library dependencies and variables to reproduce the results. Reviewers can then unpack and run the experiments without having to install any additional software. We will demonstrate real use cases for ReproZip, how packages are created, and how reviewers can validate and explore experiments.

【Keywords】: computational reproducibility; provenance; reprozip

93. CHIC: a combination-based recommendation system.

【Paper Link】 【Pages】:981-984

【Authors】: Manasi Vartak ; Samuel Madden

【Abstract】: Current recommender systems are focused largely on recommending items based on similarity. For instance, Netflix can recommend movies similar to previously viewed movies, and Amazon can recommend items based on ratings of similar users. Although similarity-based recommendation works well for books and movies, it provides an incomplete solution for items such as clothing or furniture, which are inherently used in combination with other items of the same type, e.g., a shirt with pants, or a desk with a chair. As a result, the decision to buy a clothing or furniture item depends not only on the item itself, but also on how well it works with other items of that type. Recommending such items therefore requires a combination-based recommendation system that, given an item, can suggest interesting and diverse combinations containing that item. This problem is challenging because features affecting combination quality are often difficult to identify; quality, being a function of all items in the combination, cannot be computed independently; and there are an exponential number of combinations to explore. In this demonstration, we present CHIC, a first-of-its-kind, combination-based recommendation system for clothing. The audience will interact with our system through the CHIC mobile app, which allows the user to take a picture of a clothing item and instantly search for interesting combinations containing the item. The audience can also compete with CHIC to create alternate ensembles and compare quality. Finally, we highlight via visualizations the core modules of CHIC, including model building and our novel search and classification algorithm, C-Search.

【Keywords】: combination; recommendation

94. Noah: a dynamic ridesharing system.

【Paper Link】 【Pages】:985-988

【Authors】: Charles Tian ; Yan Huang ; Zhi Liu ; Favyen Bastani ; Ruoming Jin

【Abstract】: This demo presents Noah, a dynamic ridesharing system. Noah supports large-scale real-time ridesharing with service guarantees on road networks. Taxis and trip requests are dynamically matched. Unlike in traditional systems, a taxi can have more than one customer on board, provided that all waiting-time and service-time constraints of the trips are satisfied. Noah's real-time response relies on three main components: (1) a fast shortest path algorithm with caching on road networks; (2) fast dynamic matching algorithms to schedule ridesharing on the fly; and (3) a spatial indexing method for fast retrieval of moving taxis. Users will be able to submit requests from a smartphone and choose specific parameters, such as the number of taxis in the system, service constraints, and matching algorithms, to explore the internal functionality and implementation of Noah. The system analyzer will show system performance, including average waiting time, average detour percentage, average response time, and average level of sharing. Taxis, routes, and requests will be animated and visualized through the Google Maps API. The demo is based on the trips of 17,000 Shanghai taxis for one day (May 29, 2009); the dataset contains 432,327 trips. Each trip includes the starting and destination coordinates and the start time. An iPhone application allows users to submit a trip request to the Noah system during the demonstration.
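The abstract above lists shortest-path computation with caching as Noah's first component. As a generic illustration of that idea only, here is a minimal sketch; the graph format, function names, and caching scheme are assumptions for illustration, not Noah's actual implementation:

```python
import heapq

_cache = {}  # source node -> distance map, reused across repeated queries


def shortest_paths(graph, src):
    """Dijkstra single-source shortest paths with memoized results.

    graph: {node: [(neighbor, edge_weight), ...]} -- a hypothetical
    adjacency-list encoding of a road network.
    """
    if src in _cache:  # cache hit: a previous request already expanded src
        return _cache[src]
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:  # stale queue entry
            continue
        for v, w in graph.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    _cache[src] = dist
    return dist
```

In a real ridesharing setting the cache would need eviction and partial (goal-directed) searches; this sketch only shows why caching helps when many trip requests share pickup regions.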

【Keywords】: dynamic ridesharing; kinetic tree; mobile indexing; road networks

95. A query answering system for data with evolution relationships.

【Paper Link】 【Pages】:989-992

【Authors】: Siarhei Bykau ; Flavio Rizzolo ; Yannis Velegrakis

【Abstract】: Evolving data has attracted considerable research attention. Researchers have focused on modeling and querying schema- and instance-level structural changes, such as insertion, deletion, and modification of attributes. Databases with such functionality are known as temporal databases. A limitation of temporal databases is that they treat changes as independent events, while often the appearance (or elimination) of some structure in the database is the result of an evolution of some existing structure. We claim that maintaining the causal relationship between the two structures is of major importance, since it allows additional reasoning to be performed and answers to be generated for queries that previously had no answers. We present TrenDS, a system for exploiting the evolution relationships between structures in the database. In particular, our system combines different structures that are associated through evolution relationships into virtual structures to be used during query answering. The virtual structures define "possible" database instances, in a fashion similar to the possible worlds in probabilistic databases. TrenDS uses a query answering mechanism that allows queries to be answered over these possible databases without materializing them. Evaluating such queries raises many technical challenges, since it requires the discovery of Steiner forests on the evolution graphs.

【Keywords】: data evolution; graph algorithms; probabilistic databases

Demo session 2: data analysis and mining; privacy; security 11

96. GeoDeepDive: statistical inference using familiar data-processing languages.

【Paper Link】 【Pages】:993-996

【Authors】: Ce Zhang ; Vidhya Govindaraju ; Jackson Borchardt ; Tim Foltz ; Christopher Ré ; Shanan Peters

【Abstract】: We describe our proposed demonstration of GeoDeepDive, a system that helps geoscientists discover information and knowledge buried in the text, tables, and figures of geology journal articles. This requires solving a host of classical data management challenges including data acquisition (e.g., from scanned documents), data extraction, and data integration. SIGMOD attendees will see demonstrations of three aspects of our system: (1) an end-to-end system that is of a high enough quality to perform novel geological science, but is written by a small enough team so that each aspect can be manageably explained; (2) a simple feature engineering system that allows a user to write in familiar SQL or Python; and (3) the effect of different sources of feedback on result quality including expert labeling, distant supervision, traditional rules, and crowd-sourced data. Our prototype builds on our work integrating statistical inference and learning tools into traditional database systems. If successful, our demonstration will allow attendees to see that data processing systems that use machine learning contain many familiar data processing problems such as efficient querying, indexing, and supporting tools for database-backed websites, none of which are machine-learning problems, per se.

【Keywords】: demonstration; geoscience; statistical inference

97. Fact checking and analyzing the web.

【Paper Link】 【Pages】:997-1000

【Authors】: François Goasdoué ; Konstantinos Karanasos ; Yannis Katsis ; Julien Leblay ; Ioana Manolescu ; Stamatis Zampetakis

【Abstract】: Fact checking and data journalism are currently strong trends. The sheer amount of data at hand makes it difficult even for trained professionals to spot biased, outdated or simply incorrect information. We propose to demonstrate FactMinder, a fact checking and analysis assistance application. SIGMOD attendees will be able to analyze documents using FactMinder and experience how background knowledge and open data repositories help build insightful overviews of current topics.

【Keywords】: linked data; online fact checking; semantic annotations

98. Data mining algorithms as a service in the cloud exploiting relational database systems.

【Paper Link】 【Pages】:1001-1004

【Authors】: Carlos Ordonez ; Javier García-García ; Carlos Garcia-Alvarado ; Wellington Cabrera ; Veerabhadran Baladandayuthapani ; Mohammed S. Quraishi

【Abstract】: We present a novel cloud system based on DBMS technology, where data mining algorithms are offered as a service. A local DBMS connects to the cloud, and the cloud system returns computed data mining models as small relational tables that are archived and can be easily transferred, queried, and integrated with the client database. Unlike other analytic systems, our solution is not based on MapReduce. Our system avoids exporting large tables outside the local DBMS, and thus avoids transmitting large volumes of data to the cloud. The system offers three processing modes: local, cloud, and hybrid, and a linear cost model is used to choose the processing mode. In hybrid mode, processing is split between the local DBMS and the cloud DBMS. Our system has a job scheduler with FIFO, SJF, and RR policies to improve response time and deliver partial results early. The cloud DBMS performs dynamic job scheduling, model computation, and model archive management. Our system incorporates several optimizations: local data set summarization with sufficient statistics, sampling, caching matrices in RAM, and selectively transmitting small matrices back and forth. We show that in general the most efficient computing mechanism is hybrid processing: summarizing or sampling the data set in the local DBMS, transferring small matrices back and forth, and leaving mathematically complex methods as a task for the cloud DBMS.
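The abstract mentions a linear cost model for choosing among the local, cloud, and hybrid processing modes. As a toy sketch of that general idea, the following chooser compares three linear cost expressions; every name, coefficient, and the 0.1 summarization factor are invented for illustration and are not the paper's actual model:

```python
def choose_mode(n_rows, row_bytes, local_flops_cost, cloud_flops_cost,
                transfer_cost_per_byte, summary_bytes):
    """Pick the cheapest of three processing modes using a linear cost model.

    All arguments are hypothetical per-unit costs/sizes, not values from
    the paper. Returns (best_mode, cost_table).
    """
    data_bytes = n_rows * row_bytes
    costs = {
        # everything computed on the local DBMS
        "local": local_flops_cost * n_rows,
        # ship the full data set to the cloud, then compute there
        "cloud": transfer_cost_per_byte * data_bytes
                 + cloud_flops_cost * n_rows,
        # hybrid: cheap local summarization pass (assumed 10% of local work),
        # ship only small sufficient-statistics matrices, fit model in cloud
        "hybrid": (
            local_flops_cost * n_rows * 0.1
            + transfer_cost_per_byte * summary_bytes
            + cloud_flops_cost * summary_bytes
        ),
    }
    return min(costs, key=costs.get), costs
```

With large inputs the full-data transfer term dominates the cloud mode, which is consistent with the abstract's observation that hybrid processing is generally the most efficient.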

【Keywords】: algorithm; cloud; dbms; statistical model; udf

99. SONDY: an open source platform for social dynamics mining and analysis.

【Paper Link】 【Pages】:1005-1008

【Authors】: Adrien Guille ; Cécile Favre ; Hakim Hacid ; Djamel A. Zighed

【Abstract】: This paper describes SONDY, a tool for the analysis of trends and dynamics in online social network data. SONDY addresses two audiences: (i) end-users who want to explore social activity, and (ii) researchers who want to experiment with and compare mining techniques on social data. SONDY helps end-users like media analysts or journalists understand social network users' interests and activity by providing emerging-topic and event detection as well as network analysis functionality. To this end, the application offers visualizations such as interactive timelines that summarize information, and colored user graphs that reflect the structure of the network. SONDY also provides researchers an easy way to compare and evaluate recent techniques for mining social data, implement new algorithms, and extend the application without being concerned with how to make it accessible. In the demo, participants will be invited to explore information from several datasets of various sizes and origins (such as a dataset consisting of 7,874,772 messages published by 1,697,759 Twitter users during a period of 7 days) and apply the different functionalities of the platform in real time.

【Keywords】: network analysis; online social networks; topic detection

100. Interactive data mining with 3D-parallel-coordinate-trees.

【Paper Link】 【Pages】:1009-1012

【Authors】: Elke Achtert ; Hans-Peter Kriegel ; Erich Schubert ; Arthur Zimek

【Abstract】: Parallel coordinates are an established technique for visualizing high-dimensional data, in particular for data mining purposes. A major challenge is the ordering of axes, as any axis can have at most two neighbors when placed in parallel on a 2D plane. By extending this concept to a 3D visualization space, we can place several axes next to each other. However, finding a good arrangement does not necessarily become easier, as still not all axes can be arranged pairwise adjacently to each other. Here, we provide a tool to explore complex data sets using 3D-parallel-coordinate-trees, along with a number of approaches to arrange the axes.
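The abstract identifies axis arrangement as the central challenge. The paper's own 3D arrangement approaches are not described here; as background, one standard 2D heuristic is to greedily chain axes so that strongly correlated dimensions sit next to each other. A minimal sketch (all names hypothetical, assuming non-constant columns):

```python
import math


def correlation(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def greedy_axis_order(columns):
    """Chain axes greedily by absolute pairwise correlation.

    columns: {axis_name: [values]} -- hypothetical input format.
    Starts from the first column and repeatedly appends the remaining
    axis most correlated with the last one placed.
    """
    names = list(columns)
    order = [names.pop(0)]
    while names:
        last = order[-1]
        best = max(names,
                   key=lambda n: abs(correlation(columns[last], columns[n])))
        names.remove(best)
        order.append(best)
    return order
```

A 3D tree layout generalizes this: instead of one linear chain, an axis may have several correlated neighbors as children.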

【Keywords】: high-dimensional data; parallel coordinates; visualization

101. Stat!: an interactive analytics environment for big data.

【Paper Link】 【Pages】:1013-1016

【Authors】: Mike Barnett ; Badrish Chandramouli ; Robert DeLine ; Steven M. Drucker ; Danyel Fisher ; Jonathan Goldstein ; Patrick Morrison ; John C. Platt

【Abstract】: Exploratory analysis on big data requires us to rethink data management across the entire stack -- from the underlying data processing techniques to the user experience. We demonstrate Stat! -- a visualization and analytics environment that allows users to rapidly experiment with exploratory queries over big data. Data scientists can use Stat! to quickly converge on the correct query, while getting immediate feedback after processing only a fraction of the data. Stat! can work with multiple processing engines in the backend; in this demo, we use Stat! with the Microsoft StreamInsight streaming engine, which generates incremental early results to queries and refines these results as more data is processed. Stat! allows data scientists to explore data, dynamically compose multiple queries to generate streams of partial results, and display partial results in both textual and visual form.

【Keywords】: analytics; big data; interactive; tool; visualization

102. PARAS: interactive parameter space exploration for association rule mining.

【Paper Link】 【Pages】:1017-1020

【Authors】: Abhishek Mukherji ; Xika Lin ; Christopher R. Botaish ; Jason Whitehouse ; Elke A. Rundensteiner ; Matthew O. Ward ; Carolina Ruiz

【Abstract】: We demonstrate our PARAS technology for supporting interactive association mining at near real-time speeds. Key technical innovations of PARAS, in particular, stable region abstractions and rule redundancy management supporting novel parameter space-centric exploratory queries will be showcased. The audience will be able to interactively explore the parameter space view of rules. They will experience near real-time speeds achieved by PARAS for operations, such as comparing rule sets mined using different parameter values, that would otherwise take hours of computation and much manual investigation. Overall, we will demonstrate that the PARAS system provides a rich experience to data analysts through parameter tuning recommendations while significantly reducing the trial-and-error interactions.

【Keywords】: association rules; interestingness parameters; visual mining

103. STEM: a spatio-temporal miner for bursty activity.

【Paper Link】 【Pages】:1021-1024

【Authors】: Theodoros Lappas ; Marcos R. Vieira ; Dimitrios Gunopulos ; Vassilis J. Tsotras

【Abstract】: Burst identification has been extensively studied in the context of document streams, where a burst is generally exhibited when an unusually high frequency is observed for a term t. Previous works have focused exclusively on either temporal or spatial burstiness patterns. The former represents bursty timeframes within a single stream, while the latter characterizes sets of streams that simultaneously exhibit bursty behavior during a user-specified timeframe. Our previous work was the first to study the spatiotemporal burstiness of terms. In this context, a burstiness pattern consists of both a timeframe and a set of streams, both of which need to be identified automatically. In this paper we describe STEM (Spatio-TEmporal Miner), a system for finding spatiotemporal burstiness patterns in a collection of spatially distributed frequency streams. STEM implements the full functionality required to mine spatiotemporal burstiness patterns from virtually any collection of geostamped streams. Examples of such collections include document streams (e.g., online newspapers) and geo-aware microblogging platforms (e.g., Twitter). This paper describes the STEM system and discusses how its features can be accessed via a user-friendly interface.
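As a minimal illustration of the "unusually high frequency" notion of burstiness mentioned above, the following sketch flags timeframes whose frequency exceeds the stream's mean by k standard deviations. This is an assumed textbook-style simplification for a single stream, not STEM's spatiotemporal mining algorithm:

```python
def bursty_windows(freqs, k=2.0):
    """Return indices of timeframes where a term's frequency exceeds
    the stream mean by k standard deviations.

    freqs: per-timeframe frequencies of one term in one stream
    (hypothetical input format). Population std is used for simplicity.
    """
    n = len(freqs)
    mean = sum(freqs) / n
    var = sum((f - mean) ** 2 for f in freqs) / n
    threshold = mean + k * var ** 0.5
    return [i for i, f in enumerate(freqs) if f > threshold]
```

A spatiotemporal pattern, as described in the abstract, additionally requires finding the *set of streams* (locations) that are simultaneously bursty, which is what makes the mining problem substantially harder than this per-stream check.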

【Keywords】: spatiotemporal burstiness; streams

104. The farm: where pig scripts are bred and raised.

【Paper Link】 【Pages】:1025-1028

【Authors】: Craig P. Sayers ; Alkis Simitsis ; Georgia Koutrika ; Alejandro Guerrero Gonzalez ; David Tamez Cantu ; Meichun Hsu

【Abstract】: Even though scripting languages like Pig allow for simpler coding, performing analytics over Big Data using Map-Reduce engines remains challenging. To further assist developers, and support novice users, we offer "The Farm", a catalog of scriptable services supporting creation, discovery, composition, and optimized execution. Each Pig script added to The Farm becomes an executable service, with inputs and outputs defined by relation schemas. Those services are discoverable using natural language search, and composable using a drag-and-drop interface. To support efficient execution, composed services are automatically merged into a single executable script, which can then be run by a growing selection of platform-specific optimizers and interpreters.

【Keywords】: analytics; cloud services; composition; natural language

105. LinkIT: privacy preserving record linkage and integration via transformations.

【Paper Link】 【Pages】:1029-1032

【Authors】: Luca Bonomi ; Li Xiong ; James J. Lu

【Abstract】: We propose to demonstrate an open-source tool, LinkIT, for privacy-preserving record Linkage and Integration via data Transformations. LinkIT implements novel algorithms that support data transformations for linking sensitive attributes, and is designed to work with our previously developed tool, FRIL (Fine-grained Record Integration and Linkage), to provide a complete record linkage solution. LinkIT can also be used as a stand-alone secure transformation tool to link string records. The system uses a novel embedding technique based on frequent variable-length grams mined from original records with differential privacy, and utilizes a personalized threshold for performing linkage in the embedded space. Compared to the state-of-the-art secure transformation method [16], LinkIT guarantees stronger privacy with better scalability while achieving comparable utility results.

【Keywords】: privacy; record linkage; security

106. Secure database-as-a-service with Cipherbase.

【Paper Link】 【Pages】:1033-1036

【Authors】: Arvind Arasu ; Spyros Blanas ; Ken Eguro ; Manas Joglekar ; Raghav Kaushik ; Donald Kossmann ; Ravishankar Ramamurthy ; Prasang Upadhyaya ; Ramarathnam Venkatesan

【Abstract】: Data confidentiality is one of the main concerns for users of public cloud services. The key problem is protecting sensitive data from being accessed by cloud administrators, who have root privileges and can remotely inspect the memory and disk contents of the cloud servers. While encryption is the basic mechanism that can be leveraged to provide data confidentiality, providing an efficient database-as-a-service that can run on encrypted data raises several interesting challenges. In this demonstration we outline the functionality of Cipherbase --- a full-fledged SQL database system that supports the full generality of a database system while providing high data confidentiality. Cipherbase has a novel architecture that tightly integrates custom-designed trusted hardware for performing operations on encrypted data securely, such that an administrator cannot get access to any plaintext corresponding to sensitive data.

【Keywords】: encryption; privacy; security; trusted hardware

Demo session 3: database optimization; performance 11

107. DBalancer: distributed load balancing for NoSQL data-stores.

【Paper Link】 【Pages】:1037-1040

【Authors】: Ioannis Konstantinou ; Dimitrios Tsoumakos ; Ioannis Mytilinis ; Nectarios Koziris

【Abstract】: Unanticipated load spikes or skewed data access patterns may lead to severe performance degradation in data serving applications, a typical problem of distributed NoSQL data-stores. In these cases, load balancing is a necessary operation. In this demonstration, we present DBalancer, a generic distributed module that can be installed on top of a typical NoSQL data-store to provide an efficient and highly configurable load balancing mechanism. Balancing is performed through simple message exchanges and the typical data movement operations supported by most modern NoSQL data-stores. We present the system's architecture, describe its modules and their interaction in detail, and implement a suite of different algorithms on top of it. Through a web-based interactive GUI, we allow users to launch NoSQL clusters of various sizes, apply numerous skewed and dynamic workloads, and compare the implemented load balancing algorithms. Videos and graphs showcasing each algorithm's effect on a number of indicative performance and cost metrics will be created on the fly for every setup. By browsing the results of different executions, users will be able to grasp each algorithm's balancing mechanism and performance impact in a number of representative setups.

【Keywords】: cloud computing; load balancing; nosql

108. COCCUS: self-configured cost-based query services in the cloud.

【Paper Link】 【Pages】:1041-1044

【Authors】: Ioannis Konstantinou ; Verena Kantere ; Dimitrios Tsoumakos ; Nectarios Koziris

【Abstract】: Recently, a large number of pay-as-you-go data services have been offered over cloud infrastructures. Data service providers need appropriate and flexible query charging mechanisms and query optimization that take into consideration cloud operational expenses, pricing strategies, and user preferences. Yet, existing solutions are static and non-configurable. We demonstrate COCCUS, a modular system for cost-aware query execution, adaptive query charging, and optimization of cloud data services. The audience can set their queries along with their execution preferences and budget constraints, while COCCUS adaptively determines query charges and manages secondary data structures according to various economic policies. We demonstrate COCCUS's operation over centralized and shared-nothing CloudDBMS architectures on top of public and private IaaS clouds. The audience is enabled to set economic policies and execute various workloads through a comprehensive GUI. COCCUS's adaptability is showcased using real-time graphs depicting a number of key performance metrics.

【Keywords】: amortization; cloud computing; economy

109. Workload optimization using SharedDB.

【Paper Link】 【Pages】:1045-1048

【Authors】: Georgios Giannikis ; Darko Makreshanski ; Gustavo Alonso ; Donald Kossmann

【Abstract】: This demonstration presents SharedDB, an implementation of a relational database system capable of executing all SQL operators by sharing computation and resources across all running queries. SharedDB sidesteps the traditional query-at-a-time approach and executes queries in batches. Unlike previously proposed multi-query optimization techniques, in SharedDB queries do not have to contain common subexpressions in order to be part of the same batch, which allows for a higher degree of sharing. By sharing as much as possible, SharedDB avoids repeating parts of the computation that are common across all running queries. The goal of this demonstration is to show the ability of shared query execution to (a) answer complex and diverse workloads, and (b) reduce the interaction among concurrently executed queries that is observed in traditional systems and leads to performance deterioration and instability.

【Keywords】: main memory; shared query processing

110. SciQL: array data processing inside an RDBMS.

【Paper Link】 【Pages】:1049-1052

【Authors】: Ying Zhang ; Martin L. Kersten ; Stefan Manegold

【Abstract】: Scientific discoveries increasingly rely on the ability to efficiently grind massive amounts of experimental data using database technologies. To bridge the gap between the needs of the Data-Intensive Research fields and the current DBMS technologies, we have introduced SciQL (pronounced as 'cycle'). SciQL is the first SQL-based declarative query language for scientific applications with both tables and arrays as first-class citizens. It provides a seamless symbiosis of array, set, and sequence interpretations. A key innovation is the extension of the value-based grouping of SQL:2003 with structural grouping, i.e., grouping array elements based on their positions. This leads to a generalisation of window-based query processing with wide applicability in science domains. In this demo, we showcase a proof-of-concept implementation of SciQL in the relational database system MonetDB. First, with Conway's Game of Life implemented purely in SciQL queries, we demonstrate the storage of arrays in MonetDB as first-class citizens, and the execution of a comprehensive set of basic operations on arrays. Then, to show the usefulness of SciQL for real-world array data processing use cases, we demonstrate how various common image processing and remote sensing operations are executed as SciQL queries. The audience is invited to challenge SciQL with their own use cases.

【Keywords】: array database; array query language; scientific databases; sciql
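
The Game-of-Life part of the demo is a good example of structural grouping: each cell's next state is computed from the values at neighboring array positions. Below is a rough Python analogue of that window-based computation, not SciQL syntax; the toroidal (wrap-around) grid is an assumption of this sketch:

```python
def life_step(grid):
    """One Game-of-Life generation on a toroidal n x n grid.

    Each cell's next state is a positional ("structural") grouping of
    its 8 neighbors, the window-style array computation that the demo
    expresses declaratively in SciQL.
    """
    n = len(grid)

    def live_neighbors(r, c):
        return sum(
            grid[(r + dr) % n][(c + dc) % n]
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)
        )

    return [
        [1 if live_neighbors(r, c) == 3
             or (grid[r][c] == 1 and live_neighbors(r, c) == 2) else 0
         for c in range(n)]
        for r in range(n)
    ]

# A "blinker" flips between a horizontal and a vertical bar of three:
blinker = [[0] * 5 for _ in range(5)]
for c in (1, 2, 3):
    blinker[2][c] = 1
print(life_step(blinker))
```

In SciQL the same neighborhood access would be written as a structural GROUP BY over element positions rather than explicit index arithmetic.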

111. Iterative parallel data processing with stratosphere: an inside look.

Paper Link】 【Pages】:1053-1056

【Authors】: Stephan Ewen ; Sebastian Schelter ; Kostas Tzoumas ; Daniel Warneke ; Volker Markl

【Abstract】: Iterative algorithms occur in many domains of data analysis, such as machine learning or graph analysis. With increasing interest in running those algorithms on very large data sets, we see a need for new techniques to execute iterations in a massively parallel fashion. In prior work, we have shown how to extend and use a parallel data flow system to efficiently run iterative algorithms in a shared-nothing environment. Our approach supports the incremental processing nature of many of those algorithms. In this demonstration proposal we illustrate the process of implementing, compiling, optimizing, and executing iterative algorithms on Stratosphere using examples from graph analysis and machine learning. For the first step, we show the algorithm's code and a visualization of the produced data flow programs. The second step shows the optimizer's execution plan choices, while the last phase monitors the execution of the program, visualizing the state of the operators and additional metrics, such as per-iteration runtime and number of updates. To show that the data flow abstraction supports easy creation of custom programming APIs, we also present programs written against a lightweight Pregel API that is layered on top of our system with small programming effort.

【Keywords】: graph processing; iterative algorithms; machine learning; parallel databases; query execution; query optimization

112. CARTILAGE: adding flexibility to the Hadoop skeleton.

Paper Link】 【Pages】:1057-1060

【Authors】: Alekh Jindal ; Jorge-Arnulfo Quiané-Ruiz ; Samuel Madden

【Abstract】: Modern enterprises have to deal with a variety of analytical queries over very large datasets. In this respect, Hadoop has gained much popularity since it scales to thousands of nodes and terabytes of data. However, Hadoop suffers from poor performance, especially poor I/O performance. Several works have proposed alternative data storage for Hadoop in order to improve query performance. However, many of these works end up making deep changes to Hadoop or HDFS. As a result, they are (i) difficult for many users to adopt, and (ii) not compatible with future Hadoop releases. In this paper, we present CARTILAGE, a comprehensive data storage framework built on top of HDFS. CARTILAGE allows users full control over their data storage, including data partitioning, data replication, data layouts, and data placement. Furthermore, CARTILAGE can be layered on top of an existing HDFS installation. This means that Hadoop, as well as other query engines, can readily make use of CARTILAGE. We describe several use cases of CARTILAGE and propose to demonstrate its flexibility and efficiency through a set of novel scenarios.

【Keywords】: ease of use; flexible storage; hdfs; portability

113. Continuous outlier detection in data streams: an extensible framework and state-of-the-art algorithms.

Paper Link】 【Pages】:1061-1064

【Authors】: Dimitrios Georgiadis ; Maria Kontaki ; Anastasios Gounaris ; Apostolos N. Papadopoulos ; Kostas Tsichlas ; Yannis Manolopoulos

【Abstract】: Anomaly detection is an important data mining task, aiming at the discovery of elements that show significant diversion from the expected behavior; such elements are termed outliers. One of the most widely employed criteria for determining whether an element is an outlier is based on the number of neighboring elements within a fixed distance (R), against a fixed threshold (k). Such outliers are referred to as distance-based outliers and are the focus of this work. In this demo, we show both an extensible framework for outlier detection algorithms and specific outlier detection algorithms for the demanding case where outlier detection is continuously performed over a data stream. More specifically: (i) we demonstrate a novel flavor of an open-source, publicly available tool for Massive Online Analysis (MOA) that is endowed with capabilities to encapsulate algorithms that continuously detect outliers, and (ii) we present four online outlier detection algorithms. Two of these algorithms have been designed by the authors of this demo, with a view to improving on key aspects related to outlier mining, such as running time, flexibility and space requirements.

【Keywords】: continuous processing; data streams; metric space; outlier detection
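
The distance-based criterion above is simple to state: a point is an outlier if fewer than k other points lie within distance R of it. A minimal batch illustration in Python follows; the demonstrated algorithms maintain this count incrementally over a sliding window of the stream, which this naive sketch does not attempt:

```python
import math

def distance_based_outliers(points, R, k):
    """Flag points with fewer than k neighbors within distance R.

    Naive O(n^2) illustration of the distance-based outlier
    definition; continuous algorithms would update neighbor counts
    incrementally as stream elements arrive and expire.
    """
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= R
        )
        if neighbors < k:
            outliers.append(p)
    return outliers

# A tight cluster plus one far-away point:
data = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5)]
print(distance_based_outliers(data, R=1.0, k=2))  # [(5, 5)]
```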

114. FAST: differentially private real-time aggregate monitor with filtering and adaptive sampling.

Paper Link】 【Pages】:1065-1068

【Authors】: Liyue Fan ; Li Xiong ; Vaidy S. Sunderam

【Abstract】: Sharing aggregate statistics of private data can be of great value when data mining can be performed in real-time to understand important phenomena such as influenza outbreaks or traffic congestion. However, to date there have been no tools for releasing real-time aggregated data with differential privacy, a strong and provable privacy guarantee. We propose FAST, a real-time system that allows differentially private aggregate sharing and time-series analytics. FAST employs a set of novel, adaptive strategies to improve the utility of shared/released data while guaranteeing the user-specified level of differential privacy. We will demonstrate the challenges and our solutions in the context of prepared data sets as well as live participation data dynamically collected among the SIGMOD'13 attendees.

【Keywords】: differential privacy; estimation; sampling; time series
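
As a rough illustration of the mechanism underlying such releases, the sketch below perturbs a count series with Laplace noise under a uniform privacy-budget split. This is the generic textbook baseline, not FAST's adaptive filtering and sampling, and the function names are hypothetical:

```python
import math
import random

def laplace_noise(scale):
    # Sample Laplace(0, scale) via the inverse-CDF transform.
    u = random.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def private_counts(counts, epsilon):
    """Release a count series under epsilon-differential privacy.

    Uniform budget split across time steps (sequential composition);
    each count query has sensitivity 1, so the noise scale is
    1 / (per-step epsilon). FAST instead spends the budget adaptively
    and filters the noisy estimates.
    """
    per_step = epsilon / len(counts)
    scale = 1.0 / per_step
    return [c + laplace_noise(scale) for c in counts]

# Perturb three hourly counts with a total budget of 1.0:
print(private_counts([120, 98, 143], epsilon=1.0))
```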

115. Execution and optimization of continuous queries with cyclops.

Paper Link】 【Pages】:1069-1072

【Authors】: Harold Lim ; Shivnath Babu

【Abstract】: As the data collected by enterprises grows in scale, there is a growing trend of performing data analytics on large datasets. Batch processing systems that can handle petabyte scale of data, such as Hadoop, have flourished and gained traction in the industry. As the results of batch analytics have been used to continuously improve front-facing user experience, there is a growing interest in pushing the processing latency down. This trend has fueled a resurgence in the development and usage of execution engines that can process continuous queries. An important class of continuous queries is windowed aggregation queries. Such queries arise in a wide range of applications such as generating personalized content and results. Today, considerable manual effort goes into finding the most suitable execution engine for these queries and on tuning query performance on these engines. An ecosystem composed of multiple execution engines may be needed in order to run the overall query workload efficiently given the diverse set of requirements that arise in practice. Cyclops is a continuous query processing platform that manages and orchestrates windowed aggregation queries in an ecosystem composed of multiple continuous query execution engines. Cyclops employs a cost-based approach for picking the most suitable engine and plan for executing a given query. This demonstration first presents an interactive visualization of the rich execution plan space of windowed aggregation queries, which allows users to analyze and understand the differences among plans. The next part of the demonstration will drill down into the design of Cyclops. For a given query, we show the cost spectrum of query execution plans across three different execution engines---Esper, Storm, and Hadoop---as estimated by Cyclops.

【Keywords】: continuous queries; windowed aggregation queries

116. Less watts, more performance: an intelligent storage engine for data appliances.

Paper Link】 【Pages】:1073-1076

【Authors】: Louis Woods ; Jens Teubner ; Gustavo Alonso

【Abstract】: In this demonstration, we present Ibex, a novel storage engine featuring hybrid, FPGA-accelerated query processing. In Ibex, an FPGA is inserted along the path between the storage devices and the database engine. The FPGA acts as an intelligent storage engine supporting query off-loading from the query engine. Apart from significant performance improvements for many common SQL queries, the demo will show how Ibex reduces data movement, CPU usage, and overall energy consumption in database appliances.

【Keywords】: data appliance; energy; fpga; ibex; intelligent; storage

117. A demonstration of SQLVM: performance isolation in multi-tenant relational database-as-a-service.

Paper Link】 【Pages】:1077-1080

【Authors】: Vivek R. Narasayya ; Sudipto Das ; Manoj Syamala ; Surajit Chaudhuri ; Feng Li ; Hyunjung Park

【Abstract】: Sharing resources of a single database server among multiple tenants is common in multi-tenant Database-as-a-Service providers, such as Microsoft SQL Azure. Multi-tenancy enables cost reduction for the cloud service provider, which it can pass on as savings to the tenants. However, resource sharing can adversely affect a tenant's performance due to other tenants' workloads contending for shared resources. Service providers today do not provide any assurances to a tenant in terms of isolating its performance from other co-located tenants. SQLVM, a project at Microsoft Research, is an abstraction for performance isolation built on the promise of reserving key database server resources, such as CPU, I/O and memory, for each tenant. The key challenge is in supporting this abstraction within an RDBMS without statically allocating resources to tenants, while ensuring low overheads and scaling to large numbers of tenants. This demonstration will show how SQLVM can effectively isolate a tenant's performance from other tenant workloads co-located at the same database server. Our demonstration will use various scripted scenarios and a data collection and visualization framework to illustrate performance isolation using SQLVM.

【Keywords】: cloud computing; multitenancy; performance isolation; relational database-as-a-service; resource isolation

Demo session 4: graphs and networks; potpourris 11

118. SQUIN: a traversal based query execution system for the web of linked data.

Paper Link】 【Pages】:1081-1084

【Authors】: Olaf Hartig

【Abstract】: The World Wide Web (WWW) is currently evolving into a Web of Linked Data where content providers publish and link their data as they have done with hypertext for the last 20 years. We understand this emerging dataspace as a huge, distributed database which is, at best, partially known to query execution systems. To tap the full potential of the Web, such a system must be able to answer a query using data from initially unknown data sources. For this purpose, traditional query execution paradigms are unsuitable because they assume a fixed set of potentially relevant data sources beforehand. We demonstrate the query execution system SQUIN, which implements a novel query execution approach. The main idea is to integrate the traversal of data links into the result construction process. This approach allows the execution engine to discover potentially relevant data during query execution. In our demonstration, attendees can query the Web of Linked Data using SQUIN and, thus, learn about the new query execution approach. Furthermore, attendees can experience the suitability of the approach for Web applications by using a simple, Linked Data based mash-up implemented on top of SQUIN.

【Keywords】: link traversal based query execution; linked data; web of data

119. GRDB: a system for declarative and interactive analysis of noisy information networks.

Paper Link】 【Pages】:1085-1088

【Authors】: Walaa Eldin Moustafa ; Hui Miao ; Amol Deshpande ; Lise Getoor

【Abstract】: There is a growing interest in methods for analyzing data describing networks of all types, including biological, physical, social, and scientific collaboration networks. Typically the data describing these networks is observational, and thus noisy and incomplete; it is often at the wrong level of fidelity and abstraction for meaningful data analysis. This demonstration presents GrDB, a system that enables data analysts to write declarative programs to specify and combine different network data cleaning tasks, visualize the output, and engage in the process of decision review and correction if necessary. The declarative interface of GrDB makes it very easy to quickly write analysis tasks and execute them over data, while the visual component facilitates debugging the program and performing fine grained corrections.

【Keywords】: datalog; graph data; social network analysis

120. HiNGE: enabling temporal network analytics at scale.

Paper Link】 【Pages】:1089-1092

【Authors】: Udayan Khurana ; Amol Deshpande

【Abstract】: Much of the prior work on network analytics has been restricted to static networks, a primary reason being the lack of efficient temporal data management systems to store and query large dynamic network datasets. In this demonstration proposal, we present HiNGE (Historical Network/Graph Explorer), a system that enables interactive exploration and analytics over large evolving networks through visualization and node-centric metric computations. HiNGE is built on top of a distributed graph database system that stores the entire history of a network, and enables efficiently retrieving and analyzing multiple graph snapshots from arbitrary time points in the past. The cornerstone of our system is a novel hierarchical parallelizable index structure, called DeltaGraph, that enables compact recording of the historical trace of a network on disk, and supports efficient retrieval of historical snapshots for single-site or parallel processing. The other key component of our system is an in-memory graph data structure, called GraphPool, that can maintain hundreds of historical graph snapshots in main memory in a non-redundant manner. We demonstrate the efficiency and usability of our system at performing temporal analytics over large-scale dynamic networks.

【Keywords】: graph databases; historical data management; snapshot queries; social network analysis; temporal data management

121. Research-insight: providing insight on research by publication network analysis.

Paper Link】 【Pages】:1093-1096

【Authors】: Fangbo Tao ; Xiao Yu ; Kin Hou Lei ; George Brova ; Xiao Cheng ; Jiawei Han ; Rucha Kanade ; Yizhou Sun ; Chi Wang ; Lidan Wang ; Tim Weninger

【Abstract】: A database contains rich, inter-related, multi-typed data and information, forming one or a set of gigantic, interconnected, heterogeneous information networks. Much knowledge can be derived from such information networks if we systematically develop an effective and scalable database-oriented information network analysis technology. In this system demo, we take a computer science research publication network as an example, which is an information network derived from an integration of DBLP, other web-based information about researchers, and partially available citation data, and construct a Research-Insight system in order to demonstrate the power of database-oriented information network analysis. We show that nontrivial research insight can be obtained from such analysis, including (1) ranking, clustering, classification and similarity search of researchers, terms and venues for research subfields and themes, (2) recommending good researchers and good research papers to read or cite when conducting research on certain topics, (3) predicting potential collaborators for certain theme-oriented research, and (4) predicting advisor-advisee relationships and affiliation history based on historical research publications. Although some of these functions have been studied in recent research, effective and scalable realization of such functions in large networks still poses challenging research problems. Moreover, some functions are part of our ongoing research. By integrating these functionalities, Research-Insight may not only provide us with insightful recommendations in CS research but also help us gain insight on how to perform effective data mining in large databases.

【Keywords】: heterogeneous information network; recommendation system

122. QUBLE: blending visual subgraph query formulation with query processing on large networks.

Paper Link】 【Pages】:1097-1100

【Authors】: Ho Hoang Hung ; Sourav S. Bhowmick ; Ba Quan Truong ; Byron Choi ; Shuigeng Zhou

【Abstract】: In a previous paper, we laid out the vision of a novel graph query processing paradigm where instead of processing a visual query graph after its construction, it interleaves visual query formulation and processing by exploiting the latency offered by the GUI [4]. Our recent attempts at implementing this vision [4,6] show significant improvement in the system response time (SRT) for subgraph queries. However, these efforts are designed specifically for graph databases containing a large collection of small or medium-sized graphs. Consequently, their frequent-fragment-based, action-aware indexing schemes and query processing strategies are unsuitable for supporting subgraph queries on large networks containing thousands of nodes and edges. In this demonstration, we present a novel system called QUBLE (QUery Blender for Large nEtworks) to realize this paradigm on large networks. We demonstrate various innovative features of QUBLE and its promising performance.

【Keywords】: action-aware indexes; action-aware query processing; blending; frequent fragments; large networks; query formulation; small infrequent fragments; visual graph querying

123. StreamWorks: a system for dynamic graph search.

Paper Link】 【Pages】:1101-1104

【Authors】: Sutanay Choudhury ; Lawrence B. Holder ; George Chin Jr. ; Abhik Ray ; Sherman Beus ; John Feo

【Abstract】: Acting on time-critical events by processing ever growing social media, news or cyber data streams is a major technical challenge. Many of these data sources can be modeled as multi-relational graphs. Mining and searching for subgraph patterns in a continuous setting requires an efficient approach to incremental graph search. The goal of our work is to enable real-time search capabilities for graph databases. This demonstration will present a dynamic graph query system that leverages the structural and semantic characteristics of the underlying multi-relational graph.

【Keywords】: continuous queries; dynamic graph search; subgraph matching

124. Query processing on prefix trees live.

Paper Link】 【Pages】:1105-1108

【Authors】: Thomas Kissinger ; Benjamin Schlegel ; Dirk Habich ; Wolfgang Lehner

【Abstract】: Modern database systems have to process huge amounts of data and should provide results with low latency at the same time. To achieve this, data is nowadays typically held entirely in main memory, to benefit from its high bandwidth and low access latency that could never be reached with disks. Current in-memory databases are usually column stores that exchange columns or vectors between operators and suffer from a high tuple-reconstruction overhead. In this demonstration proposal, we present DexterDB, which implements our novel prefix-tree-based processing model that makes indexes the first-class citizens of the database system. The core idea is that each operator takes a set of indexes as input and builds a new index as output that is indexed on the attribute requested by the successive operator. With that, we are able to build composed operators, like the multi-way-select-join-group. Such operators speed up the processing of complex OLAP queries so that DexterDB outperforms state-of-the-art in-memory databases. Our demonstration focuses on the different optimization options for such query plans. Hence, we built an interactive GUI that connects to a DexterDB instance and allows the manipulation of query optimization parameters. The generated query plans and important execution statistics are visualized to help the visitor understand our processing model.

【Keywords】: in-memory query processing; indexing; prefix trees

125. xPAD: a platform for analytic data flows.

Paper Link】 【Pages】:1109-1112

【Authors】: Alkis Simitsis ; Kevin Wilkinson ; Petar Jovanovic

【Abstract】: As enterprises become more automated, real-time, and data-driven, they need to integrate new data sources and specialized processing engines. The traditional business intelligence architecture of Extract-Transform-Load (ETL) flows, followed by querying, reporting, and analytic operations, is being generalized to analytic data flows that utilize a variety of data types and operations. These complicated flows are difficult to design, implement and maintain since they span a variety of systems. Additionally, new design requirements may be imposed such as design for fault-tolerance, freshness, maintainability, sampling, etc. To reduce development time and maintenance costs, automation is needed. We present xPAD, our platform to manage analytic data flows. xPAD enables flow design. We show how these designs can be optimized, not just for performance, but for other objectives as well. xPAD is engine-agnostic. We show how it can generate executable code for a number of execution engines. It can also import existing flows from other engines and optimize those flows. In that way, it can transform a flow written for one engine into an optimized flow for a different engine. In our demonstration, we will also use various example flows to show optimization for different objectives and comparison of flow execution on different engines.

【Keywords】: analytics; code generation; data flows; optimization

126. PBS at work: advancing data management with consistency metrics.

Paper Link】 【Pages】:1113-1116

【Authors】: Peter Bailis ; Shivaram Venkataraman ; Michael J. Franklin ; Joseph M. Hellerstein ; Ion Stoica

【Abstract】: A large body of recent work has proposed analytical and empirical techniques for quantifying the data consistency properties of distributed data stores. In this demonstration, we begin to explore the wide range of new database functionality they enable, including dynamic query tuning, consistency SLAs, monitoring, and administration. Our demonstration will exhibit how both application programmers and database administrators can leverage these features. We describe three major application scenarios and present a system architecture for supporting them. We also describe our experience in integrating Probabilistically Bounded Staleness (PBS) predictions into Cassandra, a popular NoSQL store, and sketch a demo platform that will allow SIGMOD attendees to experience the importance and applicability of real-time consistency metrics.

【Keywords】: auto-tuning; eventual consistency; monitoring; prediction
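
The kind of prediction PBS makes can be sketched with a small Monte Carlo simulation. The model below is a deliberate simplification (exponential write-propagation delays with an assumed 10 ms mean, a hypothetical function name), whereas PBS fits measured per-hop latency distributions:

```python
import random
from math import comb

def pbs_fresh_read_probability(n, w, r, t_ms, trials=10_000):
    """Monte Carlo sketch of PBS-style staleness prediction.

    A write to n replicas commits once w replicas acknowledge it.
    A read issued t_ms later contacts r random replicas and is
    fresh if at least one of them already stores the write.
    """
    total = 0.0
    for _ in range(trials):
        delays = sorted(random.expovariate(1 / 10.0) for _ in range(n))
        commit = delays[w - 1]                 # acked by w replicas
        have = sum(1 for d in delays if d <= commit + t_ms)
        # Probability that all r contacted replicas miss the write:
        miss = comb(n - have, r) / comb(n, r)
        total += 1.0 - miss
    return total / trials

# With a full read quorum (r + w > n), every read is fresh:
print(pbs_fresh_read_probability(n=3, w=1, r=3, t_ms=0.0))  # 1.0
```

Waiting longer after the write (larger t_ms) or using larger quorums raises the predicted freshness, which is the trade-off PBS quantifies for eventually consistent stores.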

127. The power of data use management in action.

Paper Link】 【Pages】:1117-1120

【Authors】: Prasang Upadhyaya ; Nick R. Anderson ; Magdalena Balazinska ; Bill Howe ; Raghav Kaushik ; Ravishankar Ramamurthy ; Dan Suciu

【Abstract】: In this demonstration, we showcase a database management system extended with a new type of component that we call a Data Use Manager (DUM). The DUM enables DBAs to attach policies to data loaded into the DBMS. It then monitors how users query the data, flags potential policy violations, recommends possible fixes, and supports offline analysis of user activities related to data policies. The demonstration uses real healthcare data.

【Keywords】: access control; data use management

128. CTrace: semantic comparison of multi-granularity process traces.

Paper Link】 【Pages】:1121-1124

【Authors】: Qing Liu ; Kerry Taylor ; Xiang Zhao ; Geoffrey Squire ; Xuemin Lin ; Corne Kloppers ; Richard Miller

【Abstract】: A process trace describes the processes taken in a workflow to generate a particular result. Given many process traces, each with a large amount of very low-level information, it is a challenge to make process traces meaningful to different users. It is even more challenging to compare two complex process traces that are generated by heterogeneous systems and have different levels of granularity. We present CTrace, a system that (1) lets users explore the conceptual abstraction of large process traces with different levels of granularity, and (2) provides semantic comparison among traces in which both structural and semantic similarity are considered. These functions are underpinned by a novel notion of multi-granularity process trace and efficient multi-granularity similarity comparison algorithms.

【Keywords】: multi-granularity; semantics; similarity

Industry session 1: big data I 3

129. The big data ecosystem at LinkedIn.

Paper Link】 【Pages】:1125-1134

【Authors】: Roshan Sumbaly ; Jay Kreps ; Sam Shah

【Abstract】: The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich and active ecosystem. This paper presents LinkedIn's Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. In particular, we present our solutions to the ``last mile'' issues in providing a rich developer ecosystem. This includes easy ingress from and egress to online systems, and managing workflows as production processes. A key characteristic of our solution is that these distributed system concerns are completely abstracted away from researchers. For example, deploying data back into the online system is simply a 1-line Pig command that a data scientist can add to the end of their script. We also present case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digesting to descriptive analytical dashboards for our members.

【Keywords】: big data; data mining; data pipeline; hadoop; machine learning; offline processing

130. On brewing fresh espresso: LinkedIn's distributed data serving platform.

Paper Link】 【Pages】:1135-1146

【Authors】: Lin Qiao ; Kapil Surlaker ; Shirshanka Das ; Tom Quiggle ; Bob Schulman ; Bhaskar Ghosh ; Antony Curtis ; Oliver Seeliger ; Zhen Zhang ; Aditya Auradkar ; Chris Beaver ; Gregory Brandt ; Mihir Gandhi ; Kishore Gopalakrishna ; Wai Ip ; Swaroop Jagadish ; Shi Lu ; Alexander Pachev ; Aditya Ramesh ; Abraham Sebastian ; Rupa Shanbhag ; Subbu Subramaniam ; Yun Sun ; Sajid Topiwala ; Cuong Tran ; Jemiah Westerman ; David Zhang

【Abstract】: Espresso is a document-oriented distributed data serving platform that has been built to address LinkedIn's requirements for a scalable, performant, source-of-truth primary store. It provides a hierarchical document model, transactional support for modifications to related documents, real-time secondary indexing, on-the-fly schema evolution and a timeline-consistent change-capture stream. This paper describes the motivation and design principles involved in building Espresso, the data model and capabilities exposed to clients, details of the replication and secondary indexing implementation and presents a set of experimental results that characterize the performance of the system along various dimensions. When we set out to build Espresso, we chose to apply best practices in industry, already published works in research and our own internal experience with different consistency models. Along the way, we built a novel generic distributed cluster management framework, a partition-aware change-capture pipeline and a high-performance inverted index implementation.

【Keywords】: distributed key-document store; fault-tolerance

131. Fast data in the era of big data: Twitter's real-time related query suggestion architecture.

Paper Link】 【Pages】:1147-1158

【Authors】: Gilad Mishne ; Jeff Dalton ; Zhenghua Li ; Aneesh Sharma ; Jimmy Lin

【Abstract】: We present the architecture behind Twitter's real-time related query suggestion and spelling correction service. Although these tasks have received much attention in the web search literature, the Twitter context introduces a real-time "twist": after significant breaking news events, we aim to provide relevant results within minutes. This paper provides a case study illustrating the challenges of real-time data processing in the era of "big data". We tell the story of how our system was built twice: our first implementation was built on a typical Hadoop-based analytics stack, but was later replaced because it did not meet the latency requirements necessary to generate meaningful real-time results. The second implementation, which is the system deployed in production today, is a custom in-memory processing engine specifically designed for the task. This experience taught us that the current typical usage of Hadoop as a "big data" platform, while great for experimentation, is not well suited to low-latency processing, and points the way to future work on data analytics platforms that can handle "big" as well as "fast" data.

【Keywords】: hadoop; log analysis; mapreduce

Industry session 2: enterprise data management 3

132. Enhancements to SQL server column stores.

Paper Link】 【Pages】:1159-1168

【Authors】: Per-Åke Larson ; Cipri Clinciu ; Campbell Fraser ; Eric N. Hanson ; Mostafa Mokhtar ; Michal Nowakiewicz ; Vassilis Papadimos ; Susan L. Price ; Srikumar Rangarajan ; Remus Rusanu ; Mayukh Saubhasik

【Abstract】: SQL Server 2012 introduced two innovations targeted at data warehousing workloads: column store indexes and batch (vectorized) processing mode. Together they greatly improve performance of typical data warehouse queries, routinely by 10X and in some cases by 100X or more. The main limitations of the initial version are addressed in the upcoming release. Column store indexes are updatable and can be used as the base storage for a table. The repertoire of batch mode operators has been expanded, existing operators have been improved, and query optimization has been enhanced. This paper gives an overview of SQL Server's column stores and batch processing, in particular the enhancements introduced in the upcoming release.

【Keywords】: column store; columnar storage; data warehousing; index; olap

133. Query containment in entity SQL.

Paper Link】 【Pages】:1169-1172

【Authors】: Guillem Rull ; Philip A. Bernstein ; Ivo Garcia dos Santos ; Yannis Katsis ; Sergey Melnik ; Ernest Teniente

【Abstract】: We describe a software architecture we have developed for a constructive containment checker of Entity SQL queries defined over extended ER schemas expressed in Microsoft's Entity Data Model. Our application of interest is compilation of object-to-relational mappings for Microsoft's ADO.NET Entity Framework, which has been shipping since 2007. The supported language includes several features that have been individually addressed in the past but, to the best of our knowledge, have not been addressed all at once before. Moreover, when embarking on an implementation, we found no guidance in the literature on how to modularize the software or apply published algorithms to a commercially supported language. This paper reports on our experience in addressing these real-world challenges.

【Keywords】: canonical instance; entity data model; query; query containment

134. Timeline index: a unified data structure for processing queries on temporal data in SAP HANA.

Paper Link】 【Pages】:1173-1184

【Authors】: Martin Kaufmann ; Amin Amiri Manjili ; Panagiotis Vagenas ; Peter M. Fischer ; Donald Kossmann ; Franz Färber ; Norman May

【Abstract】: Managing temporal data is becoming increasingly important for many applications. Several database systems already support the time dimension, but provide only a few temporal operators, which also often exhibit poor performance characteristics. On the academic side, a large number of algorithms and data structures have been proposed, but they often address only a subset of these temporal operators. In this paper, we develop the Timeline Index as a novel, unified data structure that efficiently supports temporal operators such as temporal aggregation, time travel, and temporal joins. As the Timeline Index is independent of the physical order of the data, it provides flexibility in physical design; e.g., it supports any kind of compression scheme, which is crucial for main-memory column stores. Our experiments show that the Timeline Index has predictable performance and beats state-of-the-art approaches significantly, sometimes by orders of magnitude.

【Keywords】: algorithm; index; temporal data; temporal operator
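The core idea behind such an index can be illustrated with a small sketch: each row's validity interval becomes a pair of +1/-1 events sorted by time, and a time-travel query (how many rows were visible at time t) reduces to a prefix-sum lookup. This is a minimal illustration of the event-list idea, not SAP HANA's actual Timeline Index implementation.

```python
import bisect

class TimelineSketch:
    """Toy event-list index over row validity intervals [start, end)."""
    def __init__(self, intervals):
        events = []
        for start, end in intervals:
            events.append((start, +1))   # row becomes visible
            events.append((end, -1))     # row is invalidated
        events.sort()
        self.times = [t for t, _ in events]
        # prefix sums of the visibility deltas
        self.prefix = []
        running = 0
        for _, delta in events:
            running += delta
            self.prefix.append(running)

    def visible_count(self, t):
        """Number of rows visible at time t (a time-travel count query)."""
        i = bisect.bisect_right(self.times, t)
        return self.prefix[i - 1] if i else 0
```

Because the index is built from events rather than from the physical row order, the underlying columns can be stored and compressed independently — the property the abstract highlights for main-memory column stores.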

Industry session 3: big data II and web 3

135. LinkBench: a database benchmark based on the Facebook social graph.

【Paper Link】 【Pages】:1185-1196

【Authors】: Timothy G. Armstrong ; Vamsi Ponnekanti ; Dhruba Borthakur ; Mark Callaghan

【Abstract】: Database benchmarks are an important tool for database researchers and practitioners that ease the process of making informed comparisons between different database hardware, software and configurations. Large scale web services such as social networks are a major and growing database application area, but currently there are few benchmarks that accurately model web service workloads. In this paper we present a new synthetic benchmark called LinkBench. LinkBench is based on traces from production databases that store "social graph" data at Facebook, a major social network. We characterize the data and query workload in many dimensions, and use the insights gained to construct a realistic synthetic benchmark. LinkBench provides a realistic and challenging test for persistent storage of social and web service data, filling a gap in the available tools for researchers, developers and administrators.

【Keywords】: database benchmarks; database workload analysis; hbase; mysql; social networks
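The basic mechanism behind a synthetic benchmark of this kind is drawing operations from a weighted mix measured on production traces. The sketch below shows only that mechanism; the weights are illustrative placeholders, and LinkBench itself additionally models data distributions (node/edge degree, payload sizes), not just operation frequencies.

```python
import random

def workload_generator(op_mix, n_ops, seed=0):
    """Draw n_ops operations from a weighted operation mix.

    op_mix: dict mapping operation name -> relative frequency.
    """
    rng = random.Random(seed)
    ops, weights = zip(*op_mix.items())
    return [rng.choices(ops, weights)[0] for _ in range(n_ops)]

# Hypothetical mix (not Facebook's measured numbers):
example_mix = {"get_link": 70, "add_link": 20, "delete_link": 10}
```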

136. BigBench: towards an industry standard benchmark for big data analytics.

【Paper Link】 【Pages】:1197-1208

【Authors】: Ahmad Ghazal ; Tilmann Rabl ; Minqing Hu ; Francois Raab ; Meikel Poess ; Alain Crolotte ; Hans-Arno Jacobsen

【Abstract】: There is a tremendous interest in big data by academia, industry and a large user base. Several commercial and open source providers unleashed a variety of products to support big data storage and processing. As these products mature, there is a need to evaluate and compare the performance of these systems. In this paper, we present BigBench, an end-to-end big data benchmark proposal. The underlying business model of BigBench is a product retailer. The proposal covers a data model and synthetic data generator that addresses the variety, velocity and volume aspects of big data systems containing structured, semi-structured and unstructured data. The structured part of the BigBench data model is adopted from the TPC-DS benchmark, which is enriched with semi-structured and unstructured data components. The semi-structured part captures registered and guest user clicks on the retailer's website. The unstructured data captures product reviews submitted online. The data generator designed for BigBench provides scalable volumes of raw data based on a scale factor. The BigBench workload is designed around a set of queries against the data model. From a business perspective, the queries cover the different categories of big data analytics proposed by McKinsey. From a technical perspective, the queries are designed to span three different dimensions based on data sources, query processing types and analytic techniques. We illustrate the feasibility of BigBench by implementing it on the Teradata Aster Database. The test includes generating and loading a 200 Gigabyte BigBench data set and testing the workload by executing the BigBench queries (written using Teradata Aster SQL-MR) and reporting their response times.

【Keywords】: benchmarking; big data; map reduce

137. Building, maintaining, and using knowledge bases: a report from the trenches.

【Paper Link】 【Pages】:1209-1220

【Authors】: Omkar Deshpande ; Digvijay S. Lamba ; Michel Tourn ; Sanjib Das ; Sri Subramaniam ; Anand Rajaraman ; Venky Harinarayan ; AnHai Doan

【Abstract】: A knowledge base (KB) contains a set of concepts, instances, and relationships. Over the past decade, numerous KBs have been built, and used to power a growing array of applications. Despite this flurry of activities, however, surprisingly little has been published about the end-to-end process of building, maintaining, and using such KBs in industry. In this paper we describe such a process. In particular, we describe how we build, update, and curate a large KB at Kosmix, a Bay Area startup, and later at WalmartLabs, a development and research lab of Walmart. We discuss how we use this KB to power a range of applications, including query understanding, Deep Web search, in-context advertising, event monitoring in social media, product search, social gifting, and social mining. Finally, we discuss how the KB team is organized, and the lessons learned. Our goal with this paper is to provide a real-world case study, and to contribute to the emerging direction of building, maintaining, and using knowledge bases for data management applications.

【Keywords】: data integration; human curation; information extraction; knowledge base; social media; taxonomy; wikipedia

138. Query processing on smart SSDs: opportunities and challenges.

【Paper Link】 【Pages】:1221-1230

【Authors】: Jaeyoung Do ; Yang-Suk Kee ; Jignesh M. Patel ; Chanik Park ; Kwanghyun Park ; David J. DeWitt

【Abstract】: Data storage devices are getting "smarter." Smart Flash storage devices (a.k.a. "Smart SSD") are on the horizon and will package CPU processing and DRAM storage inside a Smart SSD, and make that available to run user programs inside a Smart SSD. The focus of this paper is on exploring the opportunities and challenges associated with exploiting this functionality of Smart SSDs for relational analytic query processing. We have implemented an initial prototype of Microsoft SQL Server running on a Samsung Smart SSD. Our results demonstrate that significant performance and energy gains can be achieved by pushing selected query processing components inside the Smart SSDs. We also identify various changes that SSD device manufacturers can make to increase the benefits of using Smart SSDs for data processing applications, and also suggest possible research opportunities for the database community.

【Keywords】: smart ssd

139. Micro adaptivity in Vectorwise.

【Paper Link】 【Pages】:1231-1242

【Authors】: Bogdan Raducanu ; Peter A. Boncz ; Marcin Zukowski

【Abstract】: Performance of query processing functions in a DBMS can be affected by many factors, including the hardware platform, data distributions, predicate parameters, compilation method, algorithmic variations and the interactions between these. Given that there are often different function implementations possible, there is a latent performance diversity which represents both a threat to performance robustness if ignored (as is usual now) and an opportunity to increase the performance if one would be able to use the best performing implementation in each situation. Micro Adaptivity, proposed here, is a framework that keeps many alternative function implementations (flavors) in a system. It uses a learning algorithm to choose the most promising flavor potentially at each function call, guided by the actual costs observed so far. We argue that Micro Adaptivity both increases performance robustness, and saves development time spent in finding and tuning heuristics and cost model thresholds in query optimization. In this paper, we (i) characterize a number of factors that cause performance diversity between primitive flavors, (ii) describe an ε-greedy learning algorithm that casts the flavor selection into a multi-armed bandit problem, and (iii) describe the software framework for Micro Adaptivity that we implemented in the Vectorwise system. We provide micro-benchmarks, and an overall evaluation on TPC-H, showing consistent improvements.

【Keywords】: adaptive; query processing; self tuning
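The ε-greedy flavor selection the abstract describes can be sketched as a standard multi-armed bandit: with probability ε pick a random flavor (explore), otherwise pick the flavor with the lowest observed mean cost (exploit). This is a textbook ε-greedy sketch, not Vectorwise's actual framework, which also handles per-call context and cost measurement.

```python
import random

class EpsilonGreedyFlavorPicker:
    """Choose among alternative implementations ('flavors') of one
    primitive, learning from observed per-call costs."""
    def __init__(self, n_flavors, epsilon=0.1, rng=None):
        self.counts = [0] * n_flavors
        self.total_cost = [0.0] * n_flavors
        self.epsilon = epsilon
        self.rng = rng or random.Random()

    def choose(self):
        untried = [i for i, c in enumerate(self.counts) if c == 0]
        if untried:                       # try every flavor at least once
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))   # explore
        means = [t / c for t, c in zip(self.total_cost, self.counts)]
        return min(range(len(means)), key=means.__getitem__)  # exploit

    def record(self, flavor, cost):
        """Feed back the measured cost of running a flavor."""
        self.counts[flavor] += 1
        self.total_cost[flavor] += cost
```

The bandit framing replaces hand-tuned cost-model thresholds: instead of predicting which flavor is fastest, the system measures it on the live workload.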

140. Hekaton: SQL server's memory-optimized OLTP engine.

【Paper Link】 【Pages】:1243-1254

【Authors】: Cristian Diaconu ; Craig Freedman ; Erik Ismert ; Per-Åke Larson ; Pravin Mittal ; Ryan Stonecipher ; Nitin Verma ; Mike Zwilling

【Abstract】: Hekaton is a new database engine optimized for memory resident data and OLTP workloads. Hekaton is fully integrated into SQL Server; it is not a separate system. To take advantage of Hekaton, a user simply declares a table memory optimized. Hekaton tables are fully transactional and durable and accessed using T-SQL in the same way as regular SQL Server tables. A query can reference both Hekaton tables and regular tables and a transaction can update data in both types of tables. T-SQL stored procedures that reference only Hekaton tables can be compiled into machine code for further performance improvements. The engine is designed for high concurrency. To achieve this it uses only latch-free data structures and a new optimistic, multiversion concurrency control technique. This paper gives an overview of the design of the Hekaton engine and reports some experimental results.

【Keywords】: compilation to native code; lock-free data structures; main-memory databases; multiversion concurrency control; oltp; optimistic concurrency control; sql server
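The multiversion part of such a design can be illustrated with the basic visibility rule: each row version carries a begin and an end timestamp, and a reader at logical time rt sees exactly the version with begin <= rt < end. This is a hedged, simplified sketch of generic MVCC visibility; Hekaton's actual rules (e.g. for versions created by uncommitted transactions) are more involved.

```python
INF = float("inf")

def visible_version(versions, rt):
    """Return the payload of the row version visible at read time rt.

    versions: list of (begin_ts, end_ts, payload); the current version
    has end_ts = INF. Returns None if no version is visible.
    """
    for begin, end, payload in versions:
        if begin <= rt < end:
            return payload
    return None
```

Because readers pick a version by timestamp instead of locking the row, reads never block writers — the property that lets the engine avoid latches entirely.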

Industry session 5: big data III and more 3

141. Split query processing in polybase.

【Paper Link】 【Pages】:1255-1266

【Authors】: David J. DeWitt ; Alan Halverson ; Rimma V. Nehme ; Srinath Shankar ; Josep Aguilar-Saborit ; Artin Avanes ; Miro Flasza ; Jim Gramling

【Abstract】: This paper presents Polybase, a feature of SQL Server PDW V2 that allows users to manage and query data stored in a Hadoop cluster using the standard SQL query language. Unlike other database systems that provide only a relational view over HDFS-resident data through the use of an external table mechanism, Polybase employs a split query processing paradigm in which SQL operators on HDFS-resident data are translated into MapReduce jobs by the PDW query optimizer and then executed on the Hadoop cluster. The paper describes the design and implementation of Polybase along with a thorough performance evaluation that explores the benefits of employing a split query processing paradigm for executing queries that involve both structured data in a relational DBMS and unstructured data in Hadoop. Our results demonstrate that while the use of a split-based query execution paradigm can improve the performance of some queries by as much as 10X, one must employ a cost-based query optimizer that considers a broad set of factors when deciding whether or not it is advantageous to push a SQL operator to Hadoop. These factors include the selectivity factor of the predicate, the relative sizes of the two clusters, and whether or not their nodes are co-located. In addition, differences in the semantics of the Java and SQL languages must be carefully considered in order to avoid altering the expected results of a query.

【Keywords】: hadoop; hdfs; parallel database systems; split query execution
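The pushdown decision the abstract describes — weighing predicate selectivity, relative cluster sizes, and data movement — can be sketched as a toy cost comparison. All constants and the cost formulas below are illustrative assumptions, not Polybase's actual optimizer model.

```python
def push_to_hadoop(selectivity, hdfs_rows, row_bytes,
                   hadoop_nodes, pdw_nodes,
                   net_mb_per_s=100.0, mr_startup_s=20.0,
                   scan_mb_per_s=50.0):
    """Toy decision: run a selection as a MapReduce job on Hadoop and
    ship only qualifying rows, vs. importing all HDFS data into PDW
    and filtering there. Returns True if pushing looks cheaper."""
    data_mb = hdfs_rows * row_bytes / 1e6
    # Option 1: push the selection; pay MR startup, a parallel scan,
    # and network transfer of only the qualifying fraction.
    push_cost = (mr_startup_s
                 + data_mb / (scan_mb_per_s * hadoop_nodes)
                 + selectivity * data_mb / net_mb_per_s)
    # Option 2: import everything, then scan in parallel inside PDW.
    import_cost = (data_mb / net_mb_per_s
                   + data_mb / (scan_mb_per_s * pdw_nodes))
    return push_cost < import_cost
```

Even this toy model reproduces the paper's qualitative finding: pushing wins on large inputs with selective predicates, but the fixed MapReduce startup cost makes importing cheaper for small inputs.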

142. Petabyte scale databases and storage systems at Facebook.

【Paper Link】 【Pages】:1267-1268

【Authors】: Dhruba Borthakur

【Abstract】: At Facebook, we use various types of databases and storage systems to satisfy the needs of different applications. The solutions built around these data store systems have a common set of requirements: they have to be highly scalable, maintenance costs should be low and they have to perform efficiently. We use a sharded mySQL+memcache solution to support real-time access of tens of petabytes of data and we use TAO to provide consistency of this web-scale database across geographical distances. We use the Haystack data store for storing the 3 billion new photos we host every week. We use Apache Hadoop to mine intelligence from 100 petabytes of click logs and combine it with the power of Apache HBase to store all Facebook Messages. This paper describes the reasons why each of these databases is appropriate for its workload and the design decisions and tradeoffs that were made while implementing these solutions. We touch upon the consistency, availability and partition tolerance of each of these solutions. We touch upon the reasons why some of these systems need ACID semantics and other systems do not. We describe the techniques we have used to map the Facebook Graph Database into a set of relational tables. We speak of how we plan to do big-data deployments across geographical locations and our requirements for a new breed of pure-memory and pure-SSD based transactional database. Esteemed researchers in the Database Management community have benchmarked query latencies on Hive/Hadoop to be less performant than a traditional Parallel DBMS. We describe why these benchmarks are insufficient for Big Data deployments and why we continue to use Hadoop/Hive. We present an alternate set of benchmark techniques that measure capacity of a database, the value/byte in that database and the efficiency of inbuilt crowd-sourcing techniques to reduce administration costs of that database.

【Keywords】: bigdata; facebook

143. Incremental mapping compilation in an object-to-relational mapping system.

【Paper Link】 【Pages】:1269-1280

【Authors】: Philip A. Bernstein ; Marie Jacob ; Jorge Pérez ; Guillem Rull ; James F. Terwilliger

【Abstract】: In an object-to-relational mapping system (ORM), mapping expressions explain how to expose relational data as objects and how to store objects in tables. If mappings are sufficiently expressive, then it is possible to define lossy mappings. If a user updates an object, stores it in the database based on a lossy mapping, and then retrieves the object from the database, the user might get a different result than the updated state of the object; that is, the mapping might not "roundtrip." To avoid this, the ORM should validate that user-defined mappings roundtrip the data. However, this problem is NP-hard, so mapping validation can be very slow for large or complex mappings. We circumvent this problem by developing an incremental compiler for OR mappings. Given a validated mapping, a modification to the object schema is compiled into incremental modifications of the mapping. We define the problem formally, present algorithms to solve it for Microsoft's Entity Framework, and report on an implementation. For some mappings, incremental compilation is over 100 times faster than a full mapping compilation, in one case dropping from 8 hours to 50 seconds.

【Keywords】: incremental compilation; object-to-relational mapping
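The roundtripping property at the heart of this paper can be stated concretely: storing an object through the mapping and retrieving it back must yield the original. The sketch below checks this on sample objects only — a necessary condition, whereas validating it for *all* instances is the NP-hard problem the paper's incremental compiler works around. The `store`/`retrieve` functions here are hypothetical stand-ins for a mapping's two directions, not Entity Framework APIs.

```python
def roundtrips(store, retrieve, sample_objects):
    """Check that retrieve(store(obj)) == obj for each sample object.

    store:    maps an object to its relational representation.
    retrieve: maps the relational representation back to an object.
    """
    return all(retrieve(store(obj)) == obj for obj in sample_objects)

# A lossy mapping: 'store' drops the 'age' field, so it cannot roundtrip.
lossy_store = lambda o: {"name": o["name"]}
lossy_retrieve = lambda r: {"name": r["name"], "age": None}

# A lossless mapping: identity in both directions.
id_store = lambda o: dict(o)
id_retrieve = lambda r: dict(r)
```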

Undergraduate research 6

144. FriendRouter: real-time path finder in social networks.

【Paper Link】 【Pages】:1281-1282

【Authors】: Wladston Viana ; Mirella M. Moro

【Abstract】: Online social networks have become a platform for running and optimizing classical algorithms. Here, we introduce a tool for finding paths between social network users in real-time, a task that classical solutions are not tailored for.

【Keywords】: path search algorithms; social networks
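One plausible real-time approach to the path-finding task this abstract describes is bidirectional BFS, which expands frontiers from both users and meets in the middle — the paper's exact method may differ. A minimal sketch over an adjacency-dict graph:

```python
from collections import deque

def _build(parents, node):
    """Walk a parent map from node back to its search root."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path

def shortest_path(graph, src, dst):
    """Bidirectional BFS; returns a shortest src->dst path or None."""
    if src == dst:
        return [src]
    fwd, bwd = {src: None}, {dst: None}   # parent maps per direction
    qf, qb = deque([src]), deque([dst])
    while qf and qb:
        # always expand the smaller frontier
        if len(qf) <= len(qb):
            q, parents, other = qf, fwd, bwd
        else:
            q, parents, other = qb, bwd, fwd
        for _ in range(len(q)):
            u = q.popleft()
            for v in graph.get(u, ()):
                if v in parents:
                    continue
                parents[v] = u
                if v in other:            # frontiers met: stitch the path
                    left = _build(fwd, v)[::-1]   # src .. v
                    right = _build(bwd, v)[1:]    # v's chain to dst, minus v
                    return left + right
                q.append(v)
    return None
```

On social graphs, expanding both endpoints roughly halves the search depth, which is what makes sub-second interactive queries feasible where a one-sided BFS would touch far too many users.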

145. Adaptive log compression for massive log data.

【Paper Link】 【Pages】:1283-1284

【Authors】: Robert Christensen ; Feifei Li

【Abstract】: We present a novel adaptive log compression scheme. Results show 30% improvement on compression ratios over existing approaches.

【Keywords】: adaptive log compression; log compression; log data management
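A generic way to illustrate adaptive compression is to try several codecs per chunk and keep whichever output is smallest. This is only a baseline sketch of the general idea; the authors' scheme exploits structure specific to log records and is not shown here.

```python
import bz2
import lzma
import zlib

CODECS = {"zlib": zlib.compress, "bz2": bz2.compress,
          "lzma": lzma.compress}
DECODECS = {"zlib": zlib.decompress, "bz2": bz2.decompress,
            "lzma": lzma.decompress}

def compress_adaptively(chunks):
    """For each chunk, pick the codec with the smallest output.

    Returns a list of (codec_name, compressed_bytes) pairs; the name
    tag is what makes decompression possible later.
    """
    out = []
    for chunk in chunks:
        name, blob = min(((n, f(chunk)) for n, f in CODECS.items()),
                         key=lambda pair: len(pair[1]))
        out.append((name, blob))
    return out
```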

146. BUZZARD: a NUMA-aware in-memory indexing system.

【Paper Link】 【Pages】:1285-1286

【Authors】: Lukas M. Maas ; Thomas Kissinger ; Dirk Habich ; Wolfgang Lehner

【Abstract】: With the availability of large main memory capacities, in-memory index structures have become an important component of modern data management platforms. Current research even suggests index-based query processing as an alternative or supplement for traditional tuple-at-a-time processing models. However, while simple sequential scan operations can fully exploit the high bandwidth provided by main memory, indexes are mainly latency bound and spend most of their time waiting for memory accesses. Considering current hardware trends, the problem of high memory latency is further exacerbated as modern shared-memory multiprocessors with non-uniform memory access (NUMA) become increasingly common. On those NUMA platforms, the execution time of index operations is dominated by memory access latency that increases dramatically when accessing memory on remote sockets. Therefore, good index performance can only be achieved through careful optimization of the index structure to the given topology. BUZZARD is a NUMA-aware in-memory indexing system. Using adaptive data partitioning techniques, BUZZARD distributes a prefix-tree-based index across the NUMA system and hands off incoming requests to worker threads located on each partition's respective NUMA node. This approach reduces the number of remote memory accesses to a minimum and improves cache utilization. In addition, all indexes inside BUZZARD are only accessed by their respective owner, eliminating the need for synchronization primitives like compare-and-swap.

【Keywords】: in-memory; indexing; numa; prefix trees
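The owner-based partitioning the abstract describes can be reduced to a tiny sketch: keys are routed to the partition that owns their prefix, and since each partition is only ever touched by its owning worker, no latches or compare-and-swap are needed. The routing-by-first-character rule below is a stand-in assumption; BUZZARD uses prefix trees and pins each worker to its partition's NUMA node, which plain Python cannot express.

```python
class PartitionedIndex:
    """Prefix-routed, single-owner partitions (synchronization-free
    by construction: one worker per partition)."""
    def __init__(self, n_partitions):
        self.parts = [dict() for _ in range(n_partitions)]

    def _owner(self, key):
        # Illustrative routing rule: partition by the key's first char.
        return (ord(key[0]) if key else 0) % len(self.parts)

    def put(self, key, value):
        self.parts[self._owner(key)][key] = value

    def get(self, key):
        return self.parts[self._owner(key)].get(key)
```

Routing by prefix also gives locality for free: keys that share a prefix land on the same partition, so a range of related lookups stays on one NUMA node.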

147. Resa: realtime elastic streaming analytics in the cloud.

【Paper Link】 【Pages】:1287-1288

【Authors】: Tian Tan ; Richard T. B. Ma ; Marianne Winslett ; Yin Yang ; Yong Yu ; Zhenjie Zhang

【Abstract】: We propose Resa, a novel framework for robust, elastic and realtime stream processing in the cloud. In addition to traditional functionalities of streaming and cloud systems, Resa provides (i) a novel mechanism that handles dynamic additions and removals of nodes in an operator, and (ii) a node re-assignment scheme that minimizes output latency using a queuing model. We have implemented Resa on top of Twitter Storm. Experiments using real data demonstrate the effectiveness and efficiency of Resa.

【Keywords】: cloud; migration; resource allocation; stream

148. Natural language question answering over RDF data.

【Paper Link】 【Pages】:1289-1290

【Authors】: Ruizhe Huang ; Lei Zou

【Abstract】: As more and more RDF data becomes available, such as DBpedia, Yago and Freebase, it is desired to provide users with simple interfaces to access the datasets. Although the SPARQL query language is a standard way to query RDF data, it remains tedious and difficult even for expert users because of the formality of the language and the complexity of the underlying schema of RDF data. An ideal system should allow users to express queries in their own languages. In this work, we propose a methodology to translate natural language questions into SPARQL queries, which can be answered by existing RDF engines and fulfill users' information needs.

【Keywords】: natural language; question answering; rdf; sparql
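To make the translation task concrete, here is a toy template for a single question shape ("Who is the X of Y?"). Real NL-to-SPARQL systems — including, presumably, the one in this paper — parse the question into a semantic structure and map words to actual ontology URIs; the bare `<entity>`/`<predicate>` IRIs below are illustrative placeholders only.

```python
import re

def question_to_sparql(question):
    """Translate one question template into a SPARQL query string.

    Handles only 'Who is the <predicate> of <entity>?'; returns None
    for anything else.
    """
    m = re.match(r"who is the (\w+) of (\w+)\?", question.lower())
    if not m:
        return None
    predicate, entity = m.groups()
    return f"SELECT ?x WHERE {{ <{entity}> <{predicate}> ?x }}"
```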

149. Mobile interaction and query optimization in a protein-ligand data analysis system.

【Paper Link】 【Pages】:1291-1292

【Authors】: Marvin Lapeine ; Katherine G. Herbert ; Emily Hill ; Nina M. Goodey

【Abstract】: With current trends in integrating phylogenetic analysis into pharma research, computing systems that integrate the two areas can help the drug discovery field. DrugTree is a tool that overlays ligand data on a protein-motivated phylogenetic tree. While initial tests of DrugTree are successful, it has been noticed that there are a number of lags when querying the tree. Due to the interleaving nature of the data, query optimization can become problematic since the data is obtained from multiple sources, integrated, and then presented to the user with the ligand data imposed upon the phylogenetic analysis layer. This poster presents our initial methodologies for addressing these query optimization issues. Our approach applies standard techniques as well as novel mechanisms to help improve response time.

【Keywords】: bioinformatics; data integration; pharmaceutical informatics; phylogenetics; query processing

New researcher symposium 1

150. SIGMOD 2013 new researcher symposium.

【Paper Link】 【Pages】:1293-1294

【Authors】: Anish Das Sarma ; Xin Luna Dong

【Abstract】:

【Keywords】: data management career; job search; paper writing; research