ACM SIGMOD Conference 2012: Scottsdale, AZ, USA

Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012. ACM 【DBLP Link】

Paper Num: 112 || Session Num: 28

Distributed and parallel databases 3

1. Calvin: fast distributed transactions for partitioned database systems.

【Paper Link】 【Pages】:1-12

【Authors】: Alexander Thomson ; Thaddeus Diamond ; Shu-Chun Weng ; Kun Ren ; Philip Shao ; Daniel J. Abadi

【Abstract】: Many distributed storage systems achieve high data access throughput via partitioning and replication, each system with its own advantages and tradeoffs. In order to achieve high scalability, however, today's systems generally reduce transactional support, disallowing single transactions from spanning multiple partitions. Calvin is a practical transaction scheduling and data replication layer that uses a deterministic ordering guarantee to significantly reduce the normally prohibitive contention costs associated with distributed transactions. Unlike previous deterministic database system prototypes, Calvin supports disk-based storage, scales near-linearly on a cluster of commodity machines, and has no single point of failure. By replicating transaction inputs rather than effects, Calvin is also able to support multiple consistency levels---including Paxos-based strong consistency across geographically distant replicas---at no cost to transactional throughput.

【Keywords】: determinism; distributed database systems; replication; transaction processing
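
To make the "replicate inputs, then execute deterministically" idea concrete, here is a minimal sketch in Python. It is not Calvin itself: the single-partition key-value state, the trivial transaction format, and the stand-in for a Paxos-agreed ordering are all illustrative assumptions.

```python
# Minimal sketch of input replication with deterministic execution (not Calvin itself).
# Each replica applies the same ordered batch of transaction inputs, so replica state
# converges without replicating effects or running a per-transaction commit protocol.

class Replica:
    def __init__(self):
        self.store = {}

    def execute(self, ordered_batch):
        # Deterministic execution: same order + deterministic logic => same final state.
        for op, key, arg in ordered_batch:
            if op == "put":
                self.store[key] = arg
            elif op == "add":
                self.store[key] = self.store.get(key, 0) + arg

class Sequencer:
    def __init__(self, replicas):
        self.replicas = replicas

    def submit_batch(self, txns):
        ordered = list(txns)              # agree on one global order (stand-in for Paxos)
        for r in self.replicas:           # "replicate" the inputs, not the effects
            r.execute(ordered)

if __name__ == "__main__":
    replicas = [Replica(), Replica(), Replica()]
    Sequencer(replicas).submit_batch([("put", "x", 1), ("add", "x", 4), ("put", "y", 7)])
    assert all(r.store == {"x": 5, "y": 7} for r in replicas)
    print(replicas[0].store)
```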

2. Advanced partitioning techniques for massively distributed computation.

【Paper Link】 【Pages】:13-24

【Authors】: Jingren Zhou ; Nicolas Bruno ; Wei Lin

【Abstract】: An increasing number of companies rely on distributed data storage and processing over large clusters of commodity machines for critical business decisions. Although plain MapReduce systems provide several benefits, they carry certain limitations that impact developer productivity and optimization opportunities. Higher level programming languages plus conceptual data models have recently emerged to address such limitations. These languages offer a single machine programming abstraction and are able to perform sophisticated query optimization and apply efficient execution strategies. In massively distributed computation, data shuffling is typically the most expensive operation and can lead to serious performance bottlenecks if not done properly. An important optimization opportunity in this environment is that of judicious placement of repartitioning operators and choice of alternative implementations. In this paper we discuss advanced partitioning strategies, their implementation, and how they are integrated in the Microsoft Scope system. We show experimentally that our approach significantly improves performance for a large class of real-world jobs.

【Keywords】: distributed computation; partitioning; query optimization; scope

3. SkewTune: mitigating skew in mapreduce applications.

【Paper Link】 【Pages】:25-36

【Authors】: YongChul Kwon ; Magdalena Balazinska ; Bill Howe ; Jerome A. Rolia

【Abstract】: We present an automatic skew mitigation approach for user-defined MapReduce programs and present SkewTune, a system that implements this approach as a drop-in replacement for an existing MapReduce implementation. Such a solution must meet three key challenges: (a) require no extra input from the user yet work for all MapReduce applications, (b) be completely transparent, and (c) impose minimal overhead if there is no skew. The SkewTune approach addresses these challenges and works as follows: When a node in the cluster becomes idle, SkewTune identifies the task with the greatest expected remaining processing time. The unprocessed input data of this straggling task is then proactively repartitioned in a way that fully utilizes the nodes in the cluster and preserves the ordering of the input data so that the original output can be reconstructed by concatenation. We implement SkewTune as an extension to Hadoop and evaluate its effectiveness using several real applications. The results show that SkewTune can significantly reduce job runtime in the presence of skew and adds little to no overhead in the absence of skew.

【Keywords】: mapreduce; partitioning; skew; user-defined operator
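
The repartitioning step described in the abstract can be sketched as follows: when a worker becomes idle, pick the task with the largest expected remaining time and split its unprocessed input into contiguous ranges so the per-range outputs can later be concatenated in order. The data structures and the even-split heuristic below are illustrative assumptions, not SkewTune's actual planner.

```python
# Toy sketch of skew mitigation by repartitioning a straggler's remaining input.
# Contiguous ranges preserve input order, so per-range outputs can simply be
# concatenated to reconstruct the original task's output.

def pick_straggler(tasks):
    # tasks: {task_id: expected_remaining_seconds}
    return max(tasks, key=tasks.get)

def split_remaining_input(records, num_idle_nodes):
    # Split the unprocessed records into contiguous, roughly equal ranges.
    n = len(records)
    chunk = -(-n // num_idle_nodes)          # ceiling division
    return [records[i:i + chunk] for i in range(0, n, chunk)]

if __name__ == "__main__":
    remaining = {"map_03": 40.0, "map_07": 310.0, "map_11": 12.5}
    straggler = pick_straggler(remaining)            # -> "map_07"
    unprocessed = list(range(1000))                  # stand-in for its leftover input
    ranges = split_remaining_input(unprocessed, num_idle_nodes=4)
    # Order is preserved: concatenating the ranges gives back the original input.
    assert [x for r in ranges for x in r] == unprocessed
    print(straggler, [len(r) for r in ranges])
```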

Indexing and physical database design I 3

4. Parallel main-memory indexing for moving-object query and update workloads.

【Paper Link】 【Pages】:37-48

【Authors】: Darius Sidlauskas ; Simonas Saltenis ; Christian S. Jensen

【Abstract】: We are witnessing a proliferation of Internet-worked, geo-positioned mobile devices such as smartphones and personal navigation devices. Likewise, location-related services that target the users of such devices are proliferating. Consequently, server-side infrastructures are needed that are capable of supporting the location-related query and update workloads generated by very large populations of such moving objects. This paper presents a main-memory indexing technique that aims to support such workloads. The technique, called PGrid, uses a grid structure that is capable of exploiting the parallelism offered by modern processors. Unlike earlier proposals that maintain separate structures for updates and queries, PGrid allows both long-running queries and rapid updates to operate on a single data structure and thus offers up-to-date query results. Because PGrid does not rely on creating snapshots, it avoids the stop-the-world problem that occurs when workload processing is interrupted to perform such snapshotting. Its concurrency control mechanism relies instead on hardware-assisted atomic updates as well as object-level copying, and it treats updates as non-divisible operations rather than as combinations of deletions and insertions; thus, the query semantics guarantee that no objects are missed in query results. Empirical studies demonstrate that PGrid scales near-linearly with the number of hardware threads on four modern multi-core processors. Since both updates and queries are processed on the same current data-store state, PGrid outperforms snapshot-based techniques in terms of both query freshness and CPU cycle-wise efficiency.

【Keywords】: parallelism; spatio-temporal indexing

5. Divergent physical design tuning for replicated databases.

【Paper Link】 【Pages】:49-60

【Authors】: Mariano P. Consens ; Kleoni Ioannidou ; Jeff LeFevre ; Neoklis Polyzotis

【Abstract】: We introduce divergent designs as a novel tuning paradigm for database systems that employ replication. A divergent design installs a different physical configuration (e.g., indexes and materialized views) with each database replica, specializing replicas for different subsets of the workload. At runtime, queries are routed to the subset of the replicas configured to yield the most efficient execution plans. When compared to uniformly designed replicas, divergent replicas can potentially execute their subset of the queries significantly faster, and their physical configurations could be initialized and maintained (updated) in less time. However, the specialization of divergent replicas limits the ability to load-balance the workload at runtime. We formalize the divergent design problem, characterize the properties of good designs, and analyze the complexity of identifying the optimal divergent design. Our paradigm captures the trade-off between load balancing among all n replicas vs. load balancing among m ≤ n specialized replicas. We develop an effective algorithm (leveraging single-node-tuning functionality) to compute good divergent designs for all the points of this trade-off. Experimental results validate the effectiveness of the algorithm and demonstrate that divergent designs can substantially improve workload performance.

【Keywords】: divergence; physical design tuning; replicated databases

6. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems.

【Paper Link】 【Pages】:61-72

【Authors】: Andrew Pavlo ; Carlo Curino ; Stanley B. Zdonik

【Abstract】: The advent of affordable, shared-nothing computing systems portends a new class of parallel database management systems (DBMS) for on-line transaction processing (OLTP) applications that scale without sacrificing ACID guarantees [7, 9]. The performance of these DBMSs is predicated on the existence of an optimal database design that is tailored for the unique characteristics of OLTP workloads. Deriving such designs for modern DBMSs is difficult, especially for enterprise-class OLTP systems, since they impose extra challenges: the use of stored procedures, the need for load balancing in the presence of time-varying skew, complex schemas, and deployments with larger numbers of partitions. To this purpose, we present a novel approach to automatically partitioning databases for enterprise-class OLTP systems that significantly extends the state of the art by: (1) minimizing the number of distributed transactions, while concurrently mitigating the effects of temporal skew in both the data distribution and accesses, (2) extending the design space to include replicated secondary indexes, (3) organically handling stored procedure routing, and (4) scaling of schema complexity, data size, and number of partitions. This effort builds on two key technical contributions: an analytical cost model that can be used to quickly estimate the relative coordination cost and skew for a given workload and a candidate database design, and an informed exploration of the huge solution space based on large neighborhood search. To evaluate our methods, we integrated our database design tool with a high-performance parallel, main memory DBMS and compared our methods against both popular heuristics and a state-of-the-art research prototype [17]. Using a diverse set of benchmarks, we show that our approach improves throughput by up to a factor of 16x over these other approaches.

【Keywords】: h-store; kb; oltp; parallel; shared-nothing

Data cleaning and integration 3

7. Sample-driven schema mapping.

【Paper Link】 【Pages】:73-84

【Authors】: Li Qian ; Michael J. Cafarella ; H. V. Jagadish

【Abstract】: End-users increasingly find the need to perform light-weight, customized schema mapping. State-of-the-art tools provide powerful functions to generate schema mappings, but they usually require an in-depth understanding of the semantics of multiple schemas and their correspondences, and are thus not suitable for users who are technically unsophisticated or when a large number of mappings must be performed. We propose a system for sample-driven schema mapping. It automatically constructs schema mappings, in real time, from user-input sample target instances. Because the user does not have to provide any explicit attribute-level match information, she is isolated from the possibly complex structure and semantics of both the source schemas and the mappings. In addition, the user never has to master any operations specific to schema mappings: she simply types data values into a spreadsheet-style interface. As a result, the user can construct mappings with a much lower cognitive burden. In this paper we present Mweaver, a prototype sample-driven schema mapping system. It employs novel algorithms that enable the system to obtain desired mapping results while meeting interactive response performance requirements. We show the results of a user study that compares Mweaver with two state-of-the-art mapping tools across several mapping tasks, both real and synthetic. These suggest that the Mweaver system enables users to perform practical mapping tasks in about 1/5th the time needed by the state-of-the-art tools.

【Keywords】: data integration; sample-driven; schema mapping; usability

8. Can we beat the prefix filtering?: an adaptive framework for similarity join and search.

【Paper Link】 【Pages】:85-96

【Authors】: Jiannan Wang ; Guoliang Li ; Jianhua Feng

【Abstract】: As two important operations in data cleaning, similarity join and similarity search have attracted much attention recently. Existing methods to support similarity join usually adopt a prefix-filtering-based framework. They select a prefix of each object and prune object pairs whose prefixes have no overlap. We observe that prefix lengths have a significant effect on performance: different prefix lengths lead to significantly different performance, and prefix filtering does not always achieve high performance. To address this problem, in this paper we propose an adaptive framework to support similarity join. We propose a cost model to judiciously select an appropriate prefix for each object. To efficiently select prefixes, we devise effective indexes. We extend our method to support similarity search. Experimental results show that our framework beats the prefix-filtering-based framework and achieves high efficiency.

【Keywords】: adaptive framework; cost model; prefix filtering; similarity join; similarity search
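
As background for the adaptive framework above, a fixed-prefix filter for a Jaccard similarity join can be sketched as follows. The global token ordering and the prefix-length formula for threshold tau are standard prefix-filtering ingredients; the paper's cost-based, per-object prefix selection is not shown.

```python
# Sketch of the classic prefix-filtering framework for a Jaccard similarity join.
# Tokens get one global order (here: rarest first); two sets whose prefixes share no
# token cannot reach the similarity threshold, so such pairs are pruned unverified.
import math
from collections import Counter, defaultdict

def jaccard(a, b):
    return len(a & b) / len(a | b)

def similarity_join(records, tau):
    freq = Counter(t for r in records for t in set(r))
    canon = [sorted(set(r), key=lambda t: (freq[t], t)) for r in records]
    prefix_len = lambda toks: len(toks) - math.ceil(tau * len(toks)) + 1

    index = defaultdict(set)                       # token -> ids having it in their prefix
    for rid, toks in enumerate(canon):
        for t in toks[:prefix_len(toks)]:
            index[t].add(rid)

    candidates = set()
    for rid, toks in enumerate(canon):
        for t in toks[:prefix_len(toks)]:
            candidates.update((rid, other) for other in index[t] if other > rid)

    # Verify only the surviving candidate pairs.
    return [(i, j) for i, j in candidates
            if jaccard(set(canon[i]), set(canon[j])) >= tau]

if __name__ == "__main__":
    data = [["data", "cleaning", "join"], ["data", "cleaning", "search"],
            ["graph", "mining", "join"]]
    print(similarity_join(data, tau=0.5))          # -> [(0, 1)]
```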

9. InfoGather: entity augmentation and attribute discovery by holistic matching with web tables.

【Paper Link】 【Pages】:97-108

【Authors】: Mohamed Yakout ; Kris Ganjam ; Kaushik Chakrabarti ; Surajit Chaudhuri

【Abstract】: The Web contains a vast corpus of HTML tables, specifically entity attribute tables. We present three core operations, namely entity augmentation by attribute name, entity augmentation by example and attribute discovery, that are useful for "information gathering" tasks (e.g., researching for products or stocks). We propose to use a web table corpus to perform them automatically. We require the operations to have high precision and coverage, have fast (ideally interactive) response times and be applicable to any arbitrary domain of entities. The naive approach that attempts to directly match the user input with the web tables suffers from poor precision and coverage. Our key insight is that we can achieve much higher precision and coverage by considering indirectly matching tables in addition to the directly matching ones. The challenge is to be robust to spuriously matched tables: we address it by developing a holistic matching framework based on topic sensitive pagerank and an augmentation framework that aggregates predictions from multiple matched tables. We propose a novel architecture that leverages preprocessing in MapReduce to achieve extremely fast response times at query time. Our experiments on real-life datasets and 573M web tables show that our approach has (i) significantly higher precision and coverage and (ii) four orders of magnitude faster response times compared with the state-of-the-art approach.

【Keywords】: augmentation; data integration; page rank; web application; web tables

Query processing and optimization 3

10. Interactive regret minimization.

【Paper Link】 【Pages】:109-120

【Authors】: Danupon Nanongkai ; Ashwin Lall ; Atish Das Sarma ; Kazuhisa Makino

【Abstract】: We study the notion of regret ratio proposed by Nanongkai et al. [19] to deal with multi-criteria decision making in database systems. The regret minimization query proposed in [19] was shown to have features of both skyline and top-k: it does not need information from the user but still controls the output size. While this approach is suitable for obtaining a reasonably small regret ratio, it is still open whether one can make the regret ratio arbitrarily small. Moreover, it remains open whether reasonable questions can be asked to the users in order to improve efficiency of the process. In this paper, we study the problem of minimizing regret ratio when the system is enhanced with interaction. We assume that when presented with a set of tuples the user can tell which tuple is most preferred. Under this assumption, we develop the problem of interactive regret minimization where we fix the number of questions and tuples per question that we can display, and aim at minimizing the regret ratio. We try to answer two questions in this paper: (1) How much does interaction help? That is, how much can we improve the regret ratio when there are interactions? (2) How efficient can interaction be? In particular, we measure how many questions we have to ask the user in order to make her regret ratio small enough. We answer both questions from both theoretical and practical standpoints. For the first question, we show that interaction can reduce the regret ratio almost exponentially. To do this, we prove a lower bound for the previous approach (thereby resolving an open problem from [19]), and develop an almost-optimal upper bound that makes the regret ratio exponentially smaller. Our experiments also confirm that, in practice, interactions help in improving the regret ratio by many orders of magnitude. For the second question, we prove that when our algorithm shows a reasonable number of points per question, it only needs a few questions to make the regret ratio small. Thus, interactive regret minimization seems to be a necessary and sufficient way to deal with multi-criteria decision making in database systems.

【Keywords】: regret minimization; skyline; top-k
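
For intuition, the regret ratio of a displayed set S relative to a database D under linear utilities can be estimated by sampling weight vectors. The sketch below is only a sampling-based approximation of that quantity (the paper reasons about the exact maximum over all linear utilities), and all names are illustrative.

```python
# Sketch: estimating the (maximum) regret ratio of a small displayed set S versus
# the full database D under linear utility functions, by sampling weight vectors.
import random

def utility(weights, point):
    return sum(w * x for w, x in zip(weights, point))

def estimated_regret_ratio(database, shown, dims, samples=10_000, seed=0):
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(samples):
        w = [rng.random() for _ in range(dims)]           # a random linear utility
        best_db = max(utility(w, p) for p in database)
        best_shown = max(utility(w, p) for p in shown)
        if best_db > 0:
            worst = max(worst, (best_db - best_shown) / best_db)
    return worst

if __name__ == "__main__":
    db = [(0.9, 0.1), (0.1, 0.9), (0.6, 0.6), (0.5, 0.2)]
    shown = [(0.6, 0.6)]                                   # a single representative tuple
    print(round(estimated_regret_ratio(db, shown, dims=2), 3))
```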

11. MCJoin: a memory-constrained join for column-store main-memory databases.

【Paper Link】 【Pages】:121-132

【Authors】: Steven Keith Begley ; Zhen He ; Yi-Ping Phoebe Chen

【Abstract】: There exists a need for high performance, read-only main-memory database systems for OLAP-style application scenarios. Most of the existing works in this area are centered around the domain of column-store databases, which are particularly well suited to OLAP-style scenarios and have been shown to overcome the memory bottleneck issues that have been found to hinder the more traditional row-store database systems. One of the main database operations these systems are focused on optimizing is the JOIN operation. However, all these existing systems use join algorithms that are designed with the unrealistic assumption that there is unlimited temporary memory available to perform the join. In contrast, we propose a Memory Constrained Join algorithm (MCJoin) which is both high performing and also performs all of its operations within a tight given memory constraint. Extensive experimental results show that MCJoin outperforms a naive memory constrained version of the state-of-the-art Radix-Clustered Hash Join algorithm in all of the situations tested, with margins of up to almost 500%.

【Keywords】: column store; hash join; main memory

12. Holistic optimization by prefetching query results.

【Paper Link】 【Pages】:133-144

【Authors】: Karthik Ramachandra ; S. Sudarshan

【Abstract】: In this paper we address the problem of optimizing performance of database/web-service backed applications by means of automatically prefetching query results. Prefetching has been performed in earlier work based on predicting query access patterns; however such prediction is often of limited value, and can perform unnecessary prefetches. There has been some earlier work on program analysis and rewriting to automatically insert prefetch requests; however, such work has been restricted to rewriting of single procedures. In many cases, the query is in a procedure which does not offer much scope for prefetching within the procedure; in contrast, our approach can perform prefetching in a calling procedure, even when the actual query is in a called procedure, thereby greatly improving the benefits due to prefetching. Our approach does not perform any intrusive changes to the source code, and places prefetch instructions at the earliest possible points while avoiding wasteful prefetches. We have incorporated our techniques into a tool for holistic optimization called DBridge, to prefetch query results in Java programs that use JDBC. Our tool can be easily extended to handle Hibernate API calls as well as Web service requests. Our experiments on several real world applications demonstrate the applicability and significant performance gains due to our techniques.

【Keywords】: holistic query optimization
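
The core idea, issuing a query asynchronously at the earliest point where its parameters are known so that the result is ready when a called procedure consumes it, can be sketched with a thread pool. The slow run_query stand-in and the call structure below are illustrative assumptions; DBridge itself rewrites Java/JDBC programs rather than Python.

```python
# Sketch of prefetching a query result from a calling procedure: the query is
# submitted as soon as its parameter is known, and the called procedure only
# blocks if the result has not arrived yet.
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(customer_id):
    time.sleep(0.5)                       # stand-in for a database/web-service call
    return {"id": customer_id, "balance": 42}

def report(customer_future):
    # Called procedure: consumes the result; it no longer issues the query itself.
    data = customer_future.result()       # blocks only if the prefetch is not done
    return f"customer {data['id']}: balance {data['balance']}"

def main(customer_id):
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(run_query, customer_id)   # prefetch at the earliest point
        time.sleep(0.5)                             # unrelated work overlaps the query
        return report(fut)

if __name__ == "__main__":
    start = time.time()
    print(main(7), f"({time.time() - start:.2f}s instead of ~1s)")
```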

Social networks and graph databases I 3

13. Managing large dynamic graphs efficiently.

【Paper Link】 【Pages】:145-156

【Authors】: Jayanta Mondal ; Amol Deshpande

【Abstract】: There is an increasing need to ingest, manage, and query large volumes of graph-structured data arising in applications like social networks, communication networks, biological networks, and so on. Graph databases that can explicitly reason about the graphical nature of the data, that can support flexible schemas and node-centric or edge-centric analysis and querying, are ideal for storing such data. However, although there is much work on single-site graph databases and on efficiently executing different types of queries over large graphs, to date there is little work on understanding the challenges in distributed graph databases, needed to handle the large scale of such data. In this paper, we propose the design of an in-memory, distributed graph data management system aimed at managing a large-scale dynamically changing graph, and supporting low-latency query processing over it. The key challenge in a distributed graph database is that, partitioning a graph across a set of machines inherently results in a large number of distributed traversals across partitions to answer even simple queries. We propose aggressive replication of the nodes in the graph for supporting low-latency querying, and investigate three novel techniques to minimize the communication bandwidth and the storage requirements. First, we develop a hybrid replication policy that monitors node read-write frequencies to dynamically decide what data to replicate, and whether to do eager or lazy replication. Second, we propose a clustering-based approach to amortize the costs of making these replication decisions. Finally, we propose using a fairness criterion to dictate how replication decisions should be made. We provide both theoretical analysis and efficient algorithms for the optimization problems that arise. We have implemented our framework as a middleware on top of the open-source CouchDB key-value store. We evaluate our system on a social graph, and show that our system is able to handle very large graphs efficiently, and that it reduces the network bandwidth consumption significantly.

【Keywords】: feed delivery; graph databases; replication; social networks

14. Query preserving graph compression.

【Paper Link】 【Pages】:157-168

【Authors】: Wenfei Fan ; Jianzhong Li ; Xin Wang ; Yinghui Wu

【Abstract】: It is common to find graphs with millions of nodes and billions of edges in, e.g., social networks. Queries on such graphs are often prohibitively expensive. These motivate us to propose query preserving graph compression, to compress graphs relative to a class Λ of queries of users' choice. We compute a small graph Gr from a graph G such that (a) for any query Q ∈ Λ, Q(G) = Q'(Gr), where Q' ∈ Λ can be efficiently computed from Q; and (b) any algorithm for computing Q(G) can be directly applied to evaluating Q' on Gr as is. That is, while we cannot lower the complexity of evaluating graph queries, we reduce data graphs while preserving the answers to all the queries in Λ. To verify the effectiveness of this approach, (1) we develop compression strategies for two classes of queries: reachability and graph pattern queries via (bounded) simulation. We show that graphs can be efficiently compressed via a reachability equivalence relation and graph bisimulation, respectively, while preserving query answers. (2) We provide techniques for maintaining the compressed graph Gr in response to changes ΔG to the original graph G. We show that the incremental maintenance problems are unbounded for the two classes of queries, i.e., their costs are not a function of the size of ΔG and changes in Gr. Nevertheless, we develop incremental algorithms that depend only on ΔG and Gr, independent of G, i.e., we do not have to decompress Gr to propagate the changes. (3) Using real-life data, we experimentally verify that our compression techniques could reduce graphs on average by 95% for reachability and 57% for graph pattern matching, and that our incremental maintenance algorithms are efficient.

【Keywords】: graph compression; pattern queries; reachability queries

15. SCARAB: scaling reachability computation on large graphs.

【Paper Link】 【Pages】:169-180

【Authors】: Ruoming Jin ; Ning Ruan ; Saikat Dey ; Jeffrey Xu Yu

【Abstract】: Most of the existing reachability indices perform well on small- to medium-size graphs, but reach a scalability bottleneck around one million vertices/edges. As graphs become increasingly large, scalability is quickly becoming the major research challenge for the reachability computation today. Can we construct indices which scale to graphs with tens of millions of vertices and edges? Can the existing reachability indices which perform well on moderate-size graphs be scaled to very large graphs? In this paper, we propose SCARAB (standing for SCAlable ReachABility), a unified reachability computation framework: it not only can scale the existing state-of-the-art reachability indices, which otherwise could only be constructed and work on moderate size graphs, but also can help speed up the online query answering approaches. Our experimental results demonstrate that SCARAB performs well on graphs with millions of vertices/edges and is also much faster than GRAIL, the state-of-the-art scalability index approach.

【Keywords】: reachability backbone; reachability join test; scalable reachability

Data visualization, error reporting 3

16. Skimmer: rapid scrolling of relational query results.

【Paper Link】 【Pages】:181-192

【Authors】: Manish Singh ; Arnab Nandi ; H. V. Jagadish

【Abstract】: A relational database often yields a large set of tuples as the result of a query. Users browse this result set to find the information they require. If the result set is large, there may be many pages of data to browse. Since results comprise tuples of alphanumeric values that have few visual markers, it is hard to browse the data quickly, even if it is sorted. In this paper, we describe the design of a system for browsing relational data by scrolling through it at a high speed. Rather than showing the user a fast changing blur, the system presents the user with a small number of representative tuples. Representative tuples are selected to provide a "good impression" of the query result. We show that the information loss to the user is limited, even at high scrolling speeds, and that our algorithms can pick good representatives fast enough to provide for real-time, high-speed scrolling over large datasets.

【Keywords】: fast browsing; scrolling history; tuple sampling

17. Efficient spatial sampling of large geographical tables.

【Paper Link】 【Pages】:193-204

【Authors】: Anish Das Sarma ; Hongrae Lee ; Hector Gonzalez ; Jayant Madhavan ; Alon Y. Halevy

【Abstract】: Large-scale map visualization systems play an increasingly important role in presenting geographic datasets to end users. Since these datasets can be extremely large, a map rendering system often needs to select a small fraction of the data to visualize them in a limited space. This paper addresses the fundamental challenge of thinning: determining appropriate samples of data to be shown on specific geographical regions and zoom levels. Other than the sheer scale of the data, the thinning problem is challenging because of a number of other reasons: (1) data can consist of complex geographical shapes, (2) rendering of data needs to satisfy certain constraints, such as data being preserved across zoom levels and adjacent regions, and (3) after satisfying the constraints, an optimal solution needs to be chosen based on objectives such as maximality, fairness, and importance of data. This paper formally defines and presents a complete solution to the thinning problem. First, we express the problem as an integer programming formulation that efficiently solves thinning for desired objectives. Second, we present more efficient solutions for maximality, based on DFS traversal of a spatial tree. Third, we consider the common special case of point datasets, and present an even more efficient randomized algorithm. Finally, we have implemented all techniques from this paper in Google Maps visualizations of Fusion Tables, and we describe a set of experiments that demonstrate the tradeoffs among the algorithms.

【Keywords】: data visualization; geographical databases; indexing; maps; query processing; spatial sampling
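
A minimal flavor of grid-based thinning with the zoom-consistency constraint (a point visible at zoom level z stays visible at deeper levels) looks as follows; the "at most k points per cell" rule and the cell sizing are illustrative stand-ins for the paper's constrained optimization formulation.

```python
# Sketch: thin a point dataset per zoom level so that each grid cell shows at most
# k points, and a point kept at zoom z is also kept at every deeper zoom level.
from collections import defaultdict

def cell(point, zoom):
    # At zoom z the unit square is split into 2^z x 2^z cells (coords assumed in [0, 1)).
    n = 2 ** zoom
    x, y = point
    return (int(x * n), int(y * n))

def thin(points, max_zoom, k):
    visible = {z: [] for z in range(max_zoom + 1)}
    kept_so_far = []                                   # points chosen at coarser zooms
    for z in range(max_zoom + 1):
        counts = defaultdict(int)
        for p in kept_so_far:                          # zoom consistency: keep them again
            counts[cell(p, z)] += 1
            visible[z].append(p)
        for p in points:
            if p in kept_so_far:
                continue
            c = cell(p, z)
            if counts[c] < k:                          # room left in this cell
                counts[c] += 1
                visible[z].append(p)
                kept_so_far.append(p)
    return visible

if __name__ == "__main__":
    pts = [(0.10, 0.10), (0.12, 0.11), (0.13, 0.09), (0.80, 0.70)]
    for zoom, shown in thin(pts, max_zoom=2, k=1).items():
        print(zoom, shown)
```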

18. Declarative error management for robust data-intensive applications.

【Paper Link】 【Pages】:205-216

【Authors】: Carl-Christian Kanne ; Vuk Ercegovac

【Abstract】: We present an approach to declaratively manage run-time errors in data-intensive applications. When large volumes of raw data meet complex third-party libraries, deterministic run-time errors become likely, and existing query processors typically stop without returning a result when a run-time error occurs. The ability to degrade gracefully in the presence of run-time errors, and partially execute jobs, is typically limited to specific operators such as bulkloading. We generalize this concept to all operators of a query processing system, introducing a novel data type "partial result with errors" and corresponding operators. We show how to extend existing error-unaware operators to support this type, and as an added benefit, eliminate side-effect based error reporting. We use declarative specifications of acceptable results to control the semantics of error-aware operators. We have incorporated our approach into a declarative query processing system, which compiles the language constructs into instrumented execution plans for clusters of machines. We experimentally validate that the instrumentation overhead is below 20% in microbenchmarks, and not detectable when running I/O-intensive workloads.

【Keywords】: big data; error handling; exception handling; fault tolerance

Storage systems, query processing and optimization 3

19. bLSM: a general purpose log structured merge tree.

【Paper Link】 【Pages】:217-228

【Authors】: Russell Sears ; Raghu Ramakrishnan

【Abstract】: Data management workloads are increasingly write-intensive and subject to strict latency SLAs. This presents a dilemma: Update-in-place systems have unmatched latency but poor write throughput. In contrast, existing log structured techniques improve write throughput but sacrifice read performance and exhibit unacceptable latency spikes. We begin by presenting a new performance metric: read fanout, and argue that, together with read and write amplification, it better characterizes real-world indexes than approaches such as asymptotic analysis and price/performance. We then present bLSM, a Log Structured Merge (LSM) tree with the advantages of B-Trees and log structured approaches: (1) Unlike existing log structured trees, bLSM has near-optimal read and scan performance, and (2) its new "spring and gear" merge scheduler bounds write latency without impacting throughput or allowing merges to block writes for extended periods of time. It does this by ensuring merges at each level of the tree make steady progress without resorting to techniques that degrade read performance. We use Bloom filters to improve index performance, and find a number of subtleties arise. First, we ensure reads can stop after finding one version of a record. Otherwise, frequently written items would incur multiple B-Tree lookups. Second, many applications check for existing values at insert. Avoiding the seek performed by the check is crucial.

【Keywords】: log structured merge tree; merge scheduling; read amplification; read fanout; write amplification
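
The read-path behavior highlighted above (consult a Bloom filter per level, probe levels from newest to oldest, and stop at the first version found) can be sketched like this; the naive Bloom filter and the in-memory level structure are deliberately simplified stand-ins.

```python
# Sketch of an LSM-style read path: probe levels from newest to oldest, skip a level
# when its Bloom filter says "definitely absent", and stop at the first version found.
import hashlib

class NaiveBloom:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.array = bits, hashes, bytearray(bits)

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array[p] = 1

    def might_contain(self, key):
        return all(self.array[p] for p in self._positions(key))

class Level:
    def __init__(self, entries):
        self.data = dict(entries)
        self.bloom = NaiveBloom()
        for k in self.data:
            self.bloom.add(k)

def lsm_get(memtable, levels, key):
    if key in memtable:                      # newest data first
        return memtable[key]
    for level in levels:                     # C1, C2, ... oldest last
        if not level.bloom.might_contain(key):
            continue                         # avoid a pointless on-disk lookup
        if key in level.data:
            return level.data[key]           # stop at the first (newest) version
    return None

if __name__ == "__main__":
    memtable = {"k1": "v1-new"}
    levels = [Level({"k1": "v1-old", "k2": "v2"}), Level({"k3": "v3"})]
    print(lsm_get(memtable, levels, "k1"), lsm_get(memtable, levels, "k3"))
```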

20. Skeleton automata for FPGAs: reconfiguring without reconstructing.

【Paper Link】 【Pages】:229-240

【Authors】: Jens Teubner ; Louis Woods ; Chongling Nie

【Abstract】: While the performance opportunities of field-programmable gate arrays (FPGAs) for high-volume query processing are well-known, system makers still have to compromise between desired query expressiveness and high compilation effort. The cost of the latter is the primary limitation in building efficient FPGA/CPU hybrids. In this work we report on an FPGA-based stream processing engine that does not have this limitation. We provide a hardware implementation of XML projection [14] that can be reconfigured in less than a microsecond, yet supports a rich and expressive dialect of XPath. By performing XML projection in the network, we can fully leverage its filtering effect and improve XQuery performance by several factors. These improvements are made possible by a new design approach for FPGA acceleration, called skeleton automata. Skeleton automata separate the structure of finite-state automata from their semantics. Since individual queries only affect the latter, with our approach query workload changes can be accommodated fast and with high expressiveness.

【Keywords】: fpga; projection; skeleton automaton; xml; xquery

21. NoDB: efficient query execution on raw data files.

【Paper Link】 【Pages】:241-252

【Authors】: Ioannis Alagiannis ; Renata Borovica ; Miguel Branco ; Stratos Idreos ; Anastasia Ailamaki

【Abstract】: As data collections become larger and larger, data loading evolves into a major bottleneck. Many applications already avoid using database systems, e.g., scientific data analysis and social networks, due to the complexity and the increased data-to-query time. For such applications data collections keep growing fast, even on a daily basis, and we are already in the era of data deluge where we have much more data than what we can move, store, let alone analyze. Our contribution in this paper is the design and roadmap of a new paradigm in database systems, called NoDB, which does not require data loading while still maintaining the whole feature set of a modern database system. In particular, we show how to make raw data files a first-class citizen, fully integrated with the query engine. Through our design and lessons learned by implementing the NoDB philosophy over a modern DBMS, we discuss the fundamental limitations as well as the strong opportunities that such a research path brings. We identify performance bottlenecks specific for in situ processing, namely the repeated parsing and tokenizing overhead and the expensive data type conversion costs. To address these problems, we introduce an adaptive indexing mechanism that maintains positional information to provide efficient access to raw data files, together with a flexible caching structure. Our implementation over PostgreSQL, called PostgresRaw, is able to avoid the loading cost completely, while matching the query performance of plain PostgreSQL and even outperforming it in many cases. We conclude that NoDB systems are feasible to design and implement over modern database architectures, bringing an unprecedented positive effect in usability and performance.

【Keywords】: adaptive loading; in situ querying; positional map
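
The positional-map idea, remembering byte offsets of fields discovered while scanning a raw CSV file so later queries can seek directly to them instead of re-parsing, can be sketched as follows; the file layout and API are illustrative, not PostgresRaw's.

```python
# Sketch of a positional map over a raw CSV file: the first pass tokenizes the file
# and records the byte offset of each (row, column); later queries seek straight to
# the needed field instead of re-parsing whole lines.
import os
import tempfile

def build_positional_map(path, delimiter=","):
    offsets = []                                   # offsets[row][col] -> byte offset
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            row, col_off = [], pos
            for field in line.rstrip(b"\r\n").split(delimiter.encode()):
                row.append(col_off)
                col_off += len(field) + 1          # +1 for the delimiter
            offsets.append(row)
            pos += len(line)
    return offsets

def read_field(path, offsets, row, col, delimiter=","):
    with open(path, "rb") as f:
        f.seek(offsets[row][col])
        chunk = f.read(256)                        # small read window around the field
        return chunk.split(delimiter.encode())[0].split(b"\n")[0].decode()

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
        tmp.write("id,name,score\n1,alice,9.5\n2,bob,7.1\n")
        path = tmp.name
    pmap = build_positional_map(path)
    print(read_field(path, pmap, row=2, col=1))    # -> "bob"
    os.remove(path)
```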

Data streams and sensor networks 3

22. High-performance complex event processing over XML streams.

【Paper Link】 【Pages】:253-264

【Authors】: Barzan Mozafari ; Kai Zeng ; Carlo Zaniolo

【Abstract】: Much research attention has been given to delivering high-performance systems that are capable of complex event processing (CEP) in a wide range of applications. However, many current CEP systems focus on efficiently processing data with a simple structure, and are otherwise limited in their ability to efficiently support complex continuous queries on structured or semi-structured information. Yet XML streams represent a very popular form of data exchange, comprising large portions of social network and RSS feeds, financial records, configuration files, and similar applications requiring advanced CEP queries. In this paper, we present the XSeq language and system that support CEP on XML streams, via an extension of XPath that is both powerful and amenable to an efficient implementation. Specifically, the XSeq language extends XPath with natural operators to express sequential and Kleene-* patterns over XML streams, while remaining highly amenable to efficient implementation. XSeq is designed to take full advantage of recent advances in the field of automata on Visibly Pushdown Automata (VPA), where higher expressive power can be achieved without compromising efficiency (whereas the amenability to efficient implementation was not demonstrated in XPath extensions previously proposed). We illustrate XSeq's power for CEP applications through examples from different domains, and provide formal results on its expressiveness and complexity. Finally, we present several optimization techniques for XSeq queries. Our extensive experiments indicate that XSeq brings outstanding performance to CEP applications: two orders of magnitude improvement are obtained over the same queries executed in general-purpose XML engines.

【Keywords】: complex event processing; visibly pushdown automata; xml

23. Prediction-based geometric monitoring over distributed data streams.

【Paper Link】 【Pages】:265-276

【Authors】: Nikos Giatrakos ; Antonios Deligiannakis ; Minos N. Garofalakis ; Izchak Sharfman ; Assaf Schuster

【Abstract】: Many modern streaming applications, such as online analysis of financial, network, sensor and other forms of data, are inherently distributed in nature. An important query type that is the focal point in such application scenarios concerns actuation queries, where proper action is dictated based on a trigger condition placed upon the current value that a monitored function receives. Recent work studies the problem of (non-linear) sophisticated function tracking in a distributed manner. The main concept behind the geometric monitoring approach proposed there is for each distributed site to perform the function monitoring over an appropriate subset of the input domain. In the current work, we examine whether the distributed monitoring mechanism can become more efficient, in terms of the number of communicated messages, by extending the geometric monitoring framework to utilize prediction models. We initially describe a number of local estimators (predictors) that are useful for the applications that we consider and which have already been shown particularly useful in past work. We then demonstrate the feasibility of incorporating predictors in the geometric monitoring framework and show that prediction-based geometric monitoring in fact generalizes the original geometric monitoring framework. We propose a large variety of different prediction-based monitoring models for the distributed threshold monitoring of complex functions. Our extensive experimentation with a variety of real data sets, functions and parameter settings indicates that our approaches can provide significant communication savings ranging between two times and up to three orders of magnitude, compared to the transmission cost of the original monitoring framework.

【Keywords】: continuous distributed monitoring; data streams; prediction models
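
The prediction idea on its own, each site and the coordinator sharing a predictor and the site communicating only when reality drifts too far from it, can be sketched as follows. The linear predictor and the scalar threshold are illustrative; the geometric safe-zone test of the actual framework is not reproduced here.

```python
# Sketch: a monitored site suppresses messages while its true local value stays close
# to a prediction that the coordinator can compute on its own (here: a linear model
# extrapolating from the last sync). Only violations trigger communication.

class PredictiveSite:
    def __init__(self, threshold):
        self.threshold = threshold
        self.base_value, self.velocity, self.base_time = 0.0, 0.0, 0

    def _predicted(self, t):
        return self.base_value + self.velocity * (t - self.base_time)

    def observe(self, t, value):
        """Return a sync message if the prediction is violated, else None."""
        if abs(value - self._predicted(t)) <= self.threshold:
            return None                                  # prediction still holds; stay silent
        self.velocity = (value - self.base_value) / max(t - self.base_time, 1)
        self.base_value, self.base_time = value, t
        return {"time": t, "value": value, "velocity": self.velocity}

if __name__ == "__main__":
    site = PredictiveSite(threshold=2.0)
    stream = [(1, 0.9), (2, 2.1), (3, 3.0), (4, 9.5), (5, 12.0), (6, 30.0)]
    messages = [m for t, v in stream if (m := site.observe(t, v))]
    print(f"{len(messages)} messages for {len(stream)} observations:", messages)
```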

24. Online windowed subsequence matching over probabilistic sequences.

【Paper Link】 【Pages】:277-288

【Authors】: Zheng Li ; Tingjian Ge

【Abstract】: Windowed subsequence matching over deterministic strings has been studied in previous work in the contexts of knowledge discovery, data mining, and molecular biology. However, we observe that in these applications, as well as in data stream monitoring, complex event processing, and time series data processing in which streams can be mapped to strings, the strings are often noisy and probabilistic. We study this problem in the online setting where efficiency is paramount. We first formulate the query semantics, and propose an exact algorithm. Then we propose a randomized approximation algorithm that is faster and, at the same time, provably accurate. Moreover, we devise a filtering algorithm to further enhance the efficiency with an optimization technique that is adaptive to sequence stream contents. Finally, we propose algorithms for patterns with negations. In order to verify the algorithms, we conduct a systematic empirical study using three real datasets and some synthetic datasets.

【Keywords】: online; subsequence matching; uncertain sequence

Mobile databases 3

25. MaskIt: privately releasing user context streams for personalized mobile applications.

【Paper Link】 【Pages】:289-300

【Authors】: Michaela Götz ; Suman Nath ; Johannes Gehrke

【Abstract】: The rise of smartphones equipped with various sensors has enabled personalization of various applications based on user contexts extracted from sensor readings. At the same time it has raised serious concerns about the privacy of user contexts. In this paper, we present MASKIT, a technique to filter a user context stream that provably preserves privacy. The filtered context stream can be released to applications or be used to answer their queries. Privacy is defined with respect to a set of sensitive contexts specified by the user. MASKIT limits what adversaries can learn from the filtered stream about the user being in a sensitive context - even if the adversaries are powerful and have knowledge about the filtering system and temporal correlations in the context stream. At the heart of MASKIT is a privacy check deciding whether to release or suppress the current user context. We present two novel privacy checks and explain how to choose the one with the higher utility for a user. Our experiments on real smartphone context traces of 91 users demonstrate the high utility of MASKIT.

【Keywords】: data privacy; mobile applications

26. Authenticating location-based services without compromising location privacy.

【Paper Link】 【Pages】:301-312

【Authors】: Haibo Hu ; Jianliang Xu ; Qian Chen ; Ziwei Yang

【Abstract】: The popularity of mobile social networking services (mSNSs) is propelling more and more businesses, especially those in retailing and marketing, into mobile and location-based forms. To address the trust issue, the service providers are expected to deliver their location-based services in an authenticatable manner, so that the correctness of the service results can be verified by the client. However, existing works on query authentication cannot preserve the privacy of the data being queried, which are sensitive user locations when it comes to location-based services and mSNSs. In this paper, we address this challenging problem by proposing a comprehensive solution that preserves unconditional location privacy when authenticating range queries. Three authentication schemes for R-tree and grid-file indexes, together with two optimization techniques, are developed. Cost models, security analysis, and experimental results consistently show the effectiveness, reliability and robustness of the proposed schemes under various system settings and query workloads.

【Keywords】: lbs; privacy-preserving; query authentication

27. Effective caching of shortest paths for location-based services.

【Paper Link】 【Pages】:313-324

【Authors】: Jeppe Rishede Thomsen ; Man Lung Yiu ; Christian S. Jensen

【Abstract】: Web search is ubiquitous in our daily lives. Caching has been extensively used to reduce the computation time of the search engine and reduce the network traffic beyond a proxy server. Another form of web search, known as online shortest path search, is popular due to advances in geo-positioning. However, existing caching techniques are ineffective for shortest path queries. This is due to several crucial differences between web search results and shortest path results, in relation to query matching, cache item overlapping, and query cost variation. Motivated by this, we identify several properties that are essential to the success of effective caching for shortest path search. Our cache exploits the optimal subpath property, which allows a cached shortest path to answer any query with source and target nodes on the path. We utilize statistics from query logs to estimate the benefit of caching a specific shortest path, and we employ a greedy algorithm for placing beneficial paths in the cache. Also, we design a compact cache structure that supports efficient query matching at runtime. Empirical results on real datasets confirm the effectiveness of our proposed techniques.

【Keywords】: caching; shortest path
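
The optimal-subpath property the cache relies on, that any contiguous piece of a cached shortest path is itself a shortest path, makes the lookup very simple, as sketched below; the benefit-based placement policy from the paper (which paths to keep) is not modeled.

```python
# Sketch of a shortest-path cache exploiting the optimal subpath property: a cached
# path p answers any query (s, t) where s appears before t on p, by returning the
# corresponding contiguous slice of p.

class PathCache:
    def __init__(self):
        self.paths = []                      # each entry: list of node ids on a shortest path
        self.positions = []                  # node -> index on that path, for O(1) lookup

    def add(self, path):
        self.paths.append(path)
        self.positions.append({node: i for i, node in enumerate(path)})

    def lookup(self, source, target):
        for path, pos in zip(self.paths, self.positions):
            if source in pos and target in pos and pos[source] <= pos[target]:
                return path[pos[source]:pos[target] + 1]   # a subpath is also shortest
        return None                           # cache miss: fall back to the road network

if __name__ == "__main__":
    cache = PathCache()
    cache.add(["a", "b", "c", "d", "e"])      # one previously computed shortest path
    print(cache.lookup("b", "d"))             # hit: ['b', 'c', 'd']
    print(cache.lookup("e", "a"))             # miss (wrong direction on this cached path)
```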

Data analytics 3

28. Towards a unified architecture for in-RDBMS analytics.

【Paper Link】 【Pages】:325-336

【Authors】: Xixuan Feng ; Arun Kumar ; Benjamin Recht ; Christopher Ré

【Abstract】: The increasing use of statistical data analysis in enterprise applications has created an arms race among database vendors to offer ever more sophisticated in-database analytics. One challenge in this race is that each new statistical technique must be implemented from scratch in the RDBMS, which leads to a lengthy and complex development process. We argue that the root cause for this overhead is the lack of a unified architecture for in-database analytics. Our main contribution in this work is to take a step towards such a unified architecture. A key benefit of our unified architecture is that performance optimizations for analytics techniques can be studied generically instead of in an ad hoc, per-technique fashion. In particular, our technical contributions are theoretical and empirical studies of two key factors that we found impact performance: the order in which data is stored, and the parallelization of computations on a single-node multicore RDBMS. We demonstrate the feasibility of our architecture by integrating several popular analytics techniques into two commercial and one open-source RDBMS. Our architecture requires changes to only a few dozen lines of code to integrate a new statistical technique. We then compare our approach with the native analytics tools offered by the commercial RDBMSes on various analytics tasks, and validate that our approach achieves competitive or higher performance, while still achieving the same quality.

【Keywords】: analytics; convex programming; incremental gradient descent; user-defined aggregate
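
The architectural idea, expressing an incremental gradient method in the initialize/transition/merge/finalize shape of a user-defined aggregate so that the RDBMS drives it during table scans, can be sketched as follows; the least-squares example, learning rate, and function names are illustrative, not the paper's exact interface.

```python
# Sketch: incremental gradient descent for least squares written in the
# user-defined-aggregate shape (initialize / transition per tuple / merge / finalize),
# which lets an RDBMS drive the computation during ordinary table scans.

def initialize(dim, lr=0.05):
    return {"w": [0.0] * dim, "lr": lr, "n": 0}

def transition(state, x, y):
    # One incremental gradient step on tuple (x, y) for the loss (w.x - y)^2 / 2.
    err = sum(wi * xi for wi, xi in zip(state["w"], x)) - y
    state["w"] = [wi - state["lr"] * err * xi for wi, xi in zip(state["w"], x)]
    state["n"] += 1
    return state

def merge(a, b):
    # Combine states from parallel scan threads by weighted model averaging (a heuristic).
    n = max(a["n"] + b["n"], 1)
    w = [(wa * a["n"] + wb * b["n"]) / n for wa, wb in zip(a["w"], b["w"])]
    return {"w": w, "lr": a["lr"], "n": n}

def finalize(state):
    return state["w"]

if __name__ == "__main__":
    table = [((1.0, x / 100.0), 3.0 + 2.0 * (x / 100.0)) for x in range(100)]  # y = 3 + 2x
    state = initialize(dim=2)
    for _ in range(20):                      # each epoch corresponds to one scan of the table
        for x, y in table:
            state = transition(state, x, y)  # tuple order matters, as the paper studies
    print([round(w, 2) for w in finalize(state)])  # approaches [3.0, 2.0]
```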

29. Tiresias: the database oracle for how-to queries.

【Paper Link】 【Pages】:337-348

【Authors】: Alexandra Meliou ; Dan Suciu

【Abstract】: How-To queries answer fundamental data analysis questions of the form: "How should the input change in order to achieve the desired output". As a Reverse Data Management problem, the evaluation of how-to queries is harder than their "forward" counterpart: hypothetical, or what-if queries. In this paper, we present Tiresias, the first system that provides support for how-to queries, allowing the definition and integrated evaluation of a large set of constrained optimization problems, specifically Mixed Integer Programming problems, on top of a relational database system. Tiresias generates the problem variables, constraints and objectives by issuing standard SQL statements, allowing for its integration with any RDBMS. The contributions of this work are the following: (a) we define how-to queries using possible world semantics, and propose the specification language TiQL (for Tiresias Query Language) based on simple extensions to standard Datalog. (b) We define translation rules that generate a Mixed Integer Program (MIP) from TiQL specifications, which can be solved using existing tools. (c) Tiresias implements powerful "data-aware" optimizations that are beyond the capabilities of modern MIP solvers, dramatically improving the system performance. (d) Finally, an extensive performance evaluation on the TPC-H dataset demonstrates the effectiveness of these optimizations, particularly highlighting the ability to apply divide-and-conquer methods to break MIP problems into smaller instances.

【Keywords】: constrained optimization; data-driven optimization; tiql; tiresias

30. GUPT: privacy preserving data analysis made easy.

【Paper Link】 【Pages】:349-360

【Authors】: Prashanth Mohan ; Abhradeep Thakurta ; Elaine Shi ; Dawn Song ; David E. Culler

【Abstract】: It is often highly valuable for organizations to have their data analyzed by external agents. However, any program that computes on potentially sensitive data risks leaking information through its output. Differential privacy provides a theoretical framework for processing data while protecting the privacy of individual records in a dataset. Unfortunately, it has seen limited adoption because of the loss in output accuracy, the difficulty in making programs differentially private, lack of mechanisms to describe the privacy budget in a programmer's utilitarian terms, and the challenging requirement that data owners and data analysts manually distribute the limited privacy budget between queries. This paper presents the design and evaluation of a new system, GUPT, that overcomes these challenges. Unlike existing differentially private systems such as PINQ and Airavat, it guarantees differential privacy to programs not developed with privacy in mind, makes no trust assumptions about the analysis program, and is secure to all known classes of side-channel attacks. GUPT uses a new model of data sensitivity that degrades privacy of data over time. This enables efficient allocation of different levels of privacy for different user applications while guaranteeing an overall constant level of privacy and maximizing the utility of each application. GUPT also introduces techniques that improve the accuracy of output while achieving the same level of privacy. These approaches enable GUPT to easily execute a wide variety of data analysis programs while providing both utility and privacy.

【Keywords】: algorithms; data mining; differential privacy; security
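
GUPT builds on the sample-and-aggregate pattern: run the untrusted analysis on disjoint blocks of the data and release a noisy combination of the per-block outputs. The sketch below shows only that skeleton with an averaging combiner and Laplace noise; the block count, output clamping range, and noise calibration are simplifying assumptions and should not be read as GUPT's exact mechanism or as a vetted differential-privacy implementation.

```python
# Sketch of sample-and-aggregate: split the data into disjoint blocks, run the
# privacy-unaware analysis on each block, clamp each block output to a known range,
# and release the noisy average of the block outputs.
import math
import random

def laplace_noise(scale, rng):
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def sample_and_aggregate(data, analysis, epsilon, output_range, blocks=10, seed=0):
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    chunks = [shuffled[i::blocks] for i in range(blocks)]          # disjoint blocks

    lo, hi = output_range
    outputs = [min(max(analysis(chunk), lo), hi) for chunk in chunks]

    # Changing one input record changes one block's clamped output, so the average of
    # the block outputs moves by at most (hi - lo) / blocks.
    sensitivity = (hi - lo) / blocks
    return sum(outputs) / blocks + laplace_noise(sensitivity / epsilon, rng)

if __name__ == "__main__":
    ages = [23, 35, 47, 29, 52, 41, 38, 60, 27, 33, 45, 50] * 5
    noisy_mean = sample_and_aggregate(ages, analysis=lambda b: sum(b) / len(b),
                                      epsilon=0.5, output_range=(0, 100))
    print(round(noisy_mean, 2))
```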

Crowdsourcing, uncertainty in databases 3

31. CrowdScreen: algorithms for filtering data with humans.

【Paper Link】 【Pages】:361-372

【Authors】: Aditya G. Parameswaran ; Hector Garcia-Molina ; Hyunjung Park ; Neoklis Polyzotis ; Aditya Ramesh ; Jennifer Widom

【Abstract】: Given a large set of data items, we consider the problem of filtering them based on a set of properties that can be verified by humans. This problem is commonplace in crowdsourcing applications, and yet, to our knowledge, no one has considered the formal optimization of this problem. (Typical solutions use heuristics to solve the problem.) We formally state a few different variants of this problem. We develop deterministic and probabilistic algorithms to optimize the expected cost (i.e., number of questions) and expected error. We experimentally show that our algorithms provide definite gains with respect to other strategies. Our algorithms can be applied in a variety of crowdsourcing scenarios and can form an integral part of any query processor that uses human computation.

【Keywords】: crowdsourcing; filtering; human computation; predicates

32. Local structure and determinism in probabilistic databases.

【Paper Link】 【Pages】:373-384

【Authors】: Theodoros Rekatsinas ; Amol Deshpande ; Lise Getoor

【Abstract】: While extensive work has been done on evaluating queries over tuple-independent probabilistic databases, query evaluation over correlated data has received much less attention even though the support for correlations is essential for many natural applications of probabilistic databases, e.g., information extraction, data integration, computer vision, etc. In this paper, we develop a novel approach for efficiently evaluating probabilistic queries over correlated databases where correlations are represented using a factor graph, a class of graphical models widely used for capturing correlations and performing statistical inference. Our approach exploits the specific values of the factor parameters and the determinism in the correlations, collectively called local structure, to reduce the complexity of query evaluation. Our framework is based on arithmetic circuits, factorized representations of probability distributions that can exploit such local structure. Traditionally, arithmetic circuits are generated following a compilation process and can not be updated directly. We introduce a generalization of arithmetic circuits, called annotated arithmetic circuits, and a novel algorithm for updating them, which enables us to answer probabilistic queries efficiently. We present a comprehensive experimental analysis and show speed-ups of at least one order of magnitude in many cases.

【Keywords】: arithmetic circuits; probabilistic databases; query processing

33. So who won?: dynamic max discovery with the crowd.

【Paper Link】 【Pages】:385-396

【Authors】: Stephen Guo ; Aditya G. Parameswaran ; Hector Garcia-Molina

【Abstract】: We consider a crowdsourcing database system that may cleanse, populate, or filter its data by using human workers. Just like a conventional DB system, such a crowdsourcing DB system requires data manipulation functions such as select, aggregate, maximum, average, and so on, except that now it must rely on human operators (that for example compare two objects) with very different latency, cost and accuracy characteristics. In this paper, we focus on one such function, maximum, that finds the highest ranked object or tuple in a set. In particular, we study two problems: given a set of votes (pairwise comparisons among objects), how do we select the maximum? And how do we improve our estimate by requesting additional votes? We show that in a crowdsourcing DB system, the optimal solution to both problems is NP-Hard. We then provide heuristic functions to select the maximum given evidence, and to select additional votes. We experimentally evaluate our functions to highlight their strengths and weaknesses.

【Keywords】: aggregation; crowdsourcing; human computation; max; voting
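
As a baseline for the heuristics the abstract mentions, selecting a maximum from noisy pairwise votes can be sketched with a simple win-fraction score and a naive rule for picking the next question; both are generic illustrations, not the paper's proposed functions.

```python
# Sketch: pick the likely "max" object from crowd-sourced pairwise votes by ranking
# objects on their fraction of won comparisons, and choose the next question as the
# least-asked pair (a deliberately naive vote-selection rule).
from collections import defaultdict

def max_from_votes(votes):
    """votes: iterable of (winner, loser) pairs from workers (possibly noisy)."""
    wins, appearances = defaultdict(int), defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return max(appearances, key=lambda o: wins[o] / appearances[o])

def next_question(votes, objects):
    counts = defaultdict(int)
    for winner, loser in votes:
        counts[frozenset((winner, loser))] += 1
    pairs = [(a, b) for i, a in enumerate(objects) for b in objects[i + 1:]]
    return min(pairs, key=lambda p: counts[frozenset(p)])

if __name__ == "__main__":
    votes = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "B"), ("A", "D"), ("D", "B")]
    print(max_from_votes(votes))                       # -> "A"
    print(next_question(votes, ["A", "B", "C", "D"]))  # -> ("C", "D"), no votes yet
```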

Top-k query processing and optimization 3

34. Processing a large number of continuous preference top-k queries.

【Paper Link】 【Pages】:397-408

【Authors】: Albert Yu ; Pankaj K. Agarwal ; Jun Yang

【Abstract】: Given a set of objects, each with multiple numeric attributes, a (preference) top-k query retrieves the k objects with the highest scores according to a user preference, defined as a linear combination of attribute values. We consider the problem of processing a large number of continuous top-k queries, each with its own preference. When objects or user preferences change, the query results must be updated. We present a dynamic index that supports the reverse top-k query, which is of independent interest. Combining this index with another one for top-k queries, we develop a scalable solution for processing many continuous top-k queries that exploits the clusteredness in user preferences. We also define an approximate version of the problem and present a solution significantly more efficient than the exact one with little loss in accuracy.

【Keywords】: continuous top-k queries; linear preference; reverse top-k queries

35. Optimal top-k generation of attribute combinations based on ranked lists.

【Paper Link】 【Pages】:409-420

【Authors】: Jiaheng Lu ; Pierre Senellart ; Chunbin Lin ; Xiaoyong Du ; Shan Wang ; Xinxing Chen

【Abstract】: In this work, we study a novel query type, called top-k,m queries. Suppose we are given a set of groups and each group contains a set of attributes, each of which is associated with a ranked list of tuples, with ID and score. All lists are ranked in decreasing order of the scores of tuples. We are interested in finding the best combinations of attributes, each combination involving one attribute from each group. More specifically, we want the top-k combinations of attributes according to the corresponding top-m tuples with matching IDs. This problem has a wide range of applications from databases to search engines on traditional and non-traditional types of data (relational data, XML, text, etc.). We show that a straightforward extension of an optimal top-k algorithm, the Threshold Algorithm (TA), has shortcomings in solving the top-k,m problem, as it needs to compute a large number of intermediate results for each combination and reads more inputs than needed. To overcome this weakness, we provide here, for the first time, a provably instance-optimal algorithm and further develop optimizations for efficient query evaluation to reduce computational and memory costs and the number of accesses. We demonstrate experimentally the scalability and efficiency of our algorithms over three real applications.

【Keywords】: keyword search; package search; top-k querying

36. Top-k bounded diversification.

【Paper Link】 【Pages】:421-432

【Authors】: Piero Fraternali ; Davide Martinenghi ; Marco Tagliasacchi

【Abstract】: This paper investigates diversity queries over objects embedded in a low-dimensional vector space. An interesting case is provided by spatial Web objects, which are produced in great quantity by location-based services that let users attach content to places, and arise also in trip planning, news analysis, and real estate scenarios. The targeted queries aim at retrieving the best set of objects relevant to given user criteria and well distributed over a region of interest. Such queries are a particular case of diversified top-k queries, for which existing methods are too costly, as they evaluate diversity by accessing and scanning all relevant objects, even if only a small subset is needed. We therefore introduce Space Partitioning and Probing (SPP), an algorithm that minimizes the number of accessed objects while finding exactly the same result as MMR, the most popular diversification algorithm. SPP belongs to a family of algorithms that rely only on score-based and distance-based access methods, which are available in most geo-referenced Web data sources, and do not require retrieving all the relevant objects. Experiments show that SPP significantly reduces the number of accessed objects while incurring a very low computational overhead.

【Keywords】: aggregation; diversification; ranking; scoring; top-k

Temporal and graph databases 3

37. Temporal alignment.

Paper Link】 【Pages】:433-444

【Authors】: Anton Dignös ; Michael H. Böhlen ; Johann Gamper

【Abstract】: In order to process interval timestamped data, the sequenced semantics has been proposed. This paper presents a relational algebra solution that provides native support for the three properties of the sequenced semantics: snapshot reducibility, extended snapshot reducibility, and change preservation. We introduce two temporal primitives, a temporal splitter and a temporal aligner, and define rules that use these primitives to reduce the operators of a temporal algebra to their nontemporal counterparts. Our solution supports the three properties of the sequenced semantics through interval adjustment and timestamp propagation. We have implemented the temporal primitives and reduction rules in the kernel of PostgreSQL to provide native database support for processing interval timestamped data. The support is comprehensive and includes outer joins, antijoins, and aggregations with predicates and functions over the time intervals of argument relations. The implementation and an empirical evaluation confirm the effectiveness and scalability of our solution, which leverages existing database query optimization techniques.
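
The interval-adjustment idea behind a temporal splitter can be pictured with the small Python sketch below, which splits each tuple's interval at the boundaries of overlapping intervals from another relation. This is a simplified reading of the primitive, not the paper's exact definition or its PostgreSQL implementation.

    def temporal_split(r, s):
        # Split each tuple of r at the boundaries of overlapping s intervals, so every
        # output interval lies fully inside or fully outside each s interval (simplified).
        out = []
        for val, (b, e) in r:
            cuts = {b, e}
            for _, (sb, se) in s:
                if sb < e and se > b:                       # overlapping interval
                    cuts.update(p for p in (sb, se) if b < p < e)
            points = sorted(cuts)
            out.extend((val, (points[i], points[i + 1])) for i in range(len(points) - 1))
        return out

    r = [("x", (1, 10))]
    s = [("y", (3, 6)), ("z", (8, 12))]
    print(temporal_split(r, s))
    # [('x', (1, 3)), ('x', (3, 6)), ('x', (6, 8)), ('x', (8, 10))]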

【Keywords】: sequenced semantics; temporal databases

38. A highway-centric labeling approach for answering distance queries on large sparse graphs.

Paper Link】 【Pages】:445-456

【Authors】: Ruoming Jin ; Ning Ruan ; Yang Xiang ; Victor E. Lee

【Abstract】: The distance query, which asks the length of the shortest path from a vertex u to another vertex v, has applications ranging from link analysis, semantic web and other ontology processing, to social network operations. Here, we propose a novel labeling scheme, referred to as Highway-Centric Labeling, for answering distance queries in a large sparse graph. It empowers the distance labeling with a highway structure and leverages a novel bipartite set cover framework/algorithm. Highway-centric labeling achieves smaller labeling size than the state-of-the-art 2-hop labeling, both theoretically and empirically. It also offers both exact distances and approximate distances with bounded accuracy. A detailed experimental evaluation on both synthetic and real datasets demonstrates that highway-centric labeling can outperform the state-of-the-art distance computation approaches in terms of both index size and query time.
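
For background, hub-labeling schemes answer a distance query by intersecting the two vertices' label sets. The sketch below shows the plain 2-hop lookup that highway-centric labeling improves upon; the labels are hand-made for illustration, not produced by the paper's construction.

    def two_hop_distance(labels, u, v):
        # labels[x] maps hub -> distance from x to that hub; the query answer is the
        # best distance through any shared hub.
        common = labels[u].keys() & labels[v].keys()
        if not common:
            return float("inf")
        return min(labels[u][h] + labels[v][h] for h in common)

    labels = {                      # hand-made labels (illustrative)
        "a": {"a": 0, "h": 2},
        "b": {"b": 0, "h": 1},
        "c": {"c": 0, "h": 4, "b": 2},
    }
    print(two_hop_distance(labels, "a", "b"))   # 3, via hub h
    print(two_hop_distance(labels, "a", "c"))   # 6, via hub h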

【Keywords】: bipartite set cover; distance query; highway-centric labeling

39. Efficient processing of distance queries in large graphs: a vertex cover approach.

Paper Link】 【Pages】:457-468

【Authors】: James Cheng ; Yiping Ke ; Shumo Chu ; Carter Cheng

【Abstract】: We propose a novel disk-based index for processing single-source shortest path or distance queries. The index is useful in a wide range of important applications (e.g., network analysis, route planning, etc.). Our index is a tree-structured index constructed based on the concept of vertex cover. We propose an I/O-efficient algorithm to construct the index when the input graph is too large to fit in main memory. We give a detailed analysis of the I/O and CPU complexity for both index construction and query processing, and verify the efficiency of our index for query processing on massive real-world graphs.

【Keywords】: disk-based shortest-path index; distance graph; distance queries; single-source shortest paths; vertex cover

Information retrieval and text mining 3

Paper Link】 【Pages】:469-480

【Authors】: Mingyang Zhang ; Nan Zhang ; Gautam Das

【Abstract】: Many enterprise websites provide search engines to facilitate customer access to their underlying documents or data. With the web interface of such a search engine, a customer can specify one or a few keywords that he/she is interested in, and the search engine returns a list of documents/tuples matching the user-specified keywords, sorted by an often-proprietary scoring function. It was traditionally believed that, because of its highly restrictive interface (i.e., keyword search only, no SQL-style queries), such a search engine serves its purpose of answering individual keyword-search queries without disclosing big-picture aggregates over the data, which, as we shall show in the paper, may raise significant privacy concerns for the enterprise. Nonetheless, recent work on sampling and aggregate estimation over a search engine's corpus through its keyword-search interface challenges this traditional belief. In this paper, we consider a novel problem of suppressing sensitive aggregates for enterprise search engines while maintaining the quality of answers provided to individual keyword-search queries. We demonstrate the effectiveness and efficiency of our novel techniques through theoretical analysis and extensive experimental studies.

【Keywords】: aggregate suppression; sampling; search engine

41. Probase: a probabilistic taxonomy for text understanding.

Paper Link】 【Pages】:481-492

【Authors】: Wentao Wu ; Hongsong Li ; Haixun Wang ; Kenny Qili Zhu

【Abstract】: Knowledge is indispensable to understanding. The ongoing information explosion highlights the need to enable machines to better understand electronic text in human language. Much work has been devoted to creating universal ontologies or taxonomies for this purpose. However, none of the existing ontologies has the needed depth and breadth for universal understanding. In this paper, we present a universal, probabilistic taxonomy that is more comprehensive than any existing one. It contains 2.7 million concepts harnessed automatically from a corpus of 1.68 billion web pages. Unlike traditional taxonomies that treat knowledge as black and white, it uses probabilities to model the inconsistent, ambiguous, and uncertain information it contains. We present details of how the taxonomy is constructed, its probabilistic modeling, and its potential applications in text understanding.

【Keywords】: knowledgebase; taxonomy; text understanding

42. Optimizing index for taxonomy keyword search.

Paper Link】 【Pages】:493-504

【Authors】: Bolin Ding ; Haixun Wang ; Ruoming Jin ; Jiawei Han ; Zhongyuan Wang

【Abstract】: Query substitution is an important problem in information retrieval. Much work focuses on how to find substitutes for any given query. In this paper, we study how to efficiently process a keyword query whose substitutes are defined by a given taxonomy. This problem is challenging because each term in a query can have a large number of substitutes, and the original query can be rewritten into any of their combinations. We propose to build an additional index (besides the inverted index) to efficiently process queries. For a query workload, we formulate an optimization problem which chooses the additional index structure, aiming at minimizing the query evaluation cost under given index space constraints. We show the NP-hardness of the problem, and propose a pseudo-polynomial time algorithm using dynamic programming, as well as a (1/4)(1 - 1/e)-approximation algorithm to solve the problem. Experimental results show that, with only 10% additional index space, our approach can greatly reduce the query evaluation cost.

【Keywords】: index; keyword search; materialization; taxonomy

Social networks and graph databases II 3

43. A model-based approach to attributed graph clustering.

Paper Link】 【Pages】:505-516

【Authors】: Zhiqiang Xu ; Yiping Ke ; Yi Wang ; Hong Cheng ; James Cheng

【Abstract】: Graph clustering, also known as community detection, is a long-standing problem in data mining. However, with the proliferation of rich attribute information available for objects in real-world graphs, how to leverage both structural and attribute information for clustering attributed graphs becomes a new challenge. Most existing works take a distance-based approach, proposing various distance measures to combine structural and attribute information. In this paper, we consider an alternative view and propose a model-based approach to attributed graph clustering. We develop a Bayesian probabilistic model for attributed graphs. The model provides a principled and natural framework for capturing both the structural and attribute aspects of a graph, while avoiding the artificial design of a distance measure. Clustering with the proposed model can be transformed into a probabilistic inference problem, for which we devise an efficient variational algorithm. Experimental results on large real-world datasets demonstrate that our method significantly outperforms the state-of-the-art distance-based attributed graph clustering method.

【Keywords】: attributed graph clustering; bayesian method; model-based clustering

44. Towards effective partition management for large graphs.

Paper Link】 【Pages】:517-528

【Authors】: Shengqi Yang ; Xifeng Yan ; Bo Zong ; Arijit Khan

【Abstract】: Searching and mining large graphs today is critical to a variety of application domains, ranging from community detection in social networks to de novo genome sequence assembly. Scalable processing of large graphs requires careful partitioning and distribution of graphs across clusters. In this paper, we investigate the problem of managing large-scale graphs in clusters and study the access characteristics of local graph queries such as breadth-first search, random walk, and SPARQL queries, which are popular in real applications. These queries exhibit strong access locality, and therefore require specific data partitioning strategies. In this work, we propose a Self-Evolving Distributed Graph Management Environment (Sedge) to minimize inter-machine communication during graph query processing on multiple machines. In order to improve query response time and throughput, Sedge introduces a two-level partition management architecture with complementary primary partitions and dynamic secondary partitions. These two kinds of partitions are able to adapt in real time to changes in the query workload. Sedge also includes a set of workload analyzing algorithms whose time complexity is linear or sublinear in the graph size. Empirical results show that it significantly improves distributed graph processing on today's commodity clusters.

【Keywords】: distributed computing; graph; graph query processing; partitioning; rdf

45. TreeSpan: efficiently computing similarity all-matching.

Paper Link】 【Pages】:529-540

【Authors】: Gaoping Zhu ; Xuemin Lin ; Ke Zhu ; Wenjie Zhang ; Jeffrey Xu Yu

【Abstract】: Given a query graph q and a data graph G, computing all occurrences of q in G, namely exact all-matching, is fundamental in graph data analysis with a wide spectrum of real applications. It is challenging since even finding one occurrence of q in G (subgraph isomorphism test) is NP-Complete. In many real applications, however, exploratory queries from users are often too inaccurate to express their real demands. In this paper, we study the problem of efficiently computing all approximate occurrences of q in G. Particularly, we study the problem of efficiently retrieving all matches of q in G with the number of possible missing edges bounded by a given threshold θ, namely similarity all-matching. The problem of similarity all-matching is harder than the problem of exact all-matching since it covers exact all-matching as the special case θ = 0. In this paper, we develop a novel paradigm to conduct similarity all-matching. Specifically, we propose to use a minimal set QT of spanning trees in q to cover all connected subgraphs q' of q missing at most θ edges; that is, each q' is spanned by a spanning tree in QT. Then, we conduct exact all-matching for each spanning tree in QT to induce all similarity matches. A rigorous theoretical analysis shows that our new search paradigm significantly reduces the number of exact all-matching invocations compared to existing techniques. To further speed up the computation, we develop new filtering, computation sharing, and search ordering techniques. Our comprehensive experiments on both real and synthetic datasets demonstrate that our techniques outperform the state-of-the-art technique by 7 orders of magnitude.

【Keywords】: graph; similarity all-matching

Indexing and physical database design II 3

46. Locality-sensitive hashing scheme based on dynamic collision counting.

Paper Link】 【Pages】:541-552

【Authors】: Junhao Gan ; Jianlin Feng ; Qiong Fang ; Wilfred Ng

【Abstract】: Locality-Sensitive Hashing (LSH) and its variants are well-known methods for solving the c-approximate NN search problem in high-dimensional space. Traditionally, several LSH functions are concatenated to form a "static" compound hash function for building a hash table. In this paper, we propose to use a base of m single LSH functions to construct "dynamic" compound hash functions, and define a new LSH scheme called Collision Counting LSH (C2LSH). If the number of LSH functions under which a data object o collides with a query object q is greater than a pre-specified collision threshold l, then o can be regarded as a good candidate for the c-approximate NN of q. This is the basic idea of C2LSH. Our theoretical studies show that, by appropriately choosing the size of the LSH function base m and the collision threshold l, C2LSH can provide a guarantee on query quality. Notably, the parameter m is not affected by the dimensionality of data objects, which makes C2LSH especially good for high-dimensional NN search. Experimental studies based on synthetic datasets and four real datasets show that C2LSH outperforms the state-of-the-art method LSB-forest in high-dimensional space.
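
A minimal Python sketch of the collision-counting idea, assuming standard p-stable LSH functions of the form h(x) = floor((a·x + b) / w) for Euclidean data; the parameter values m, l, and w below are illustrative, not the values derived in the paper.

    import random

    def make_lsh(dim, w):
        # One p-stable LSH function h(x) = floor((a.x + b) / w)  (standard form, assumed here).
        a = [random.gauss(0, 1) for _ in range(dim)]
        b = random.uniform(0, w)
        return lambda x: int((sum(ai * xi for ai, xi in zip(a, x)) + b) // w)

    def collision_candidates(data, q, funcs, l):
        # Keep objects colliding with the query under at least l of the m LSH functions.
        qh = [h(q) for h in funcs]
        out = []
        for oid, x in data.items():
            collisions = sum(1 for h, hq in zip(funcs, qh) if h(x) == hq)
            if collisions >= l:
                out.append(oid)
        return out

    random.seed(0)
    funcs = [make_lsh(dim=2, w=4.0) for _ in range(20)]      # base of m = 20 functions (illustrative)
    data = {"near": (1.0, 1.0), "far": (50.0, -40.0)}
    print(collision_candidates(data, q=(1.2, 0.8), funcs=funcs, l=12))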

【Keywords】: dynamic collision counting; locality sensitive hashing

47. Efficient external-memory bisimulation on DAGs.

Paper Link】 【Pages】:553-564

【Authors】: Jelle Hellings ; George H. L. Fletcher ; Herman J. Haverkort

【Abstract】: In this paper we introduce the first efficient external-memory algorithm to compute the bisimilarity equivalence classes of a directed acyclic graph (DAG). DAGs are commonly used to model data in a wide variety of practical applications, ranging from XML documents and data provenance models, to web taxonomies and scientific workflows. In the study of efficient reasoning over massive graphs, the notion of node bisimilarity plays a central role. For example, grouping together bisimilar nodes in an XML data set is the first step in many sophisticated approaches to building indexing data structures for efficient XPath query evaluation. To date, however, only internal-memory bisimulation algorithms have been investigated. As the size of real-world DAG data sets often exceeds available main memory, storage in external memory becomes necessary. Hence, there is a practical need for an efficient approach to computing bisimulation in external memory. Our general algorithm has a worst-case IO-complexity of O(Sort(|N| + |E|)), where |N| and |E| are the numbers of nodes and edges, resp., in the data graph and Sort(n) is the number of accesses to external memory needed to sort an input of size n. We also study specializations of this algorithm to common variations of bisimulation for tree-structured XML data sets. We empirically verify efficient performance of the algorithms on graphs and XML documents having billions of nodes and edges, and find that the algorithms can process such graphs efficiently even when very limited internal memory is available. The proposed algorithms are simple enough for practical implementation and use, and open the door for further study of external-memory bisimulation algorithms. To this end, the full open-source C++ implementation has been made freely available.

【Keywords】: bisimulation; external memory; graphs; i/o

48. Materialized view selection for XQuery workloads.

Paper Link】 【Pages】:565-576

【Authors】: Asterios Katsifodimos ; Ioana Manolescu ; Vasilis Vassalos

【Abstract】: The efficient processing of XQuery still poses significant challenges. A particularly effective technique to improve XQuery processing performance consists of using materialized views to answer queries. In this work, we consider the problem of choosing the best views to materialize within a given space budget in order to improve the performance of a query workload. The paper is the first to address the view selection problem for queries and views with value joins and multiple return nodes. The challenges we face stem from the expressive power and features of both the query and view languages and from the size of the search space of candidate views to materialize. While the general problem has prohibitive complexity, we propose and study a heuristic algorithm and demonstrate its superior performance compared to the state of the art.

【Keywords】: materialized views; view selection; xml

Keynote addresses 2

49. Analytic database technologies for a new kind of user: the data enthusiast.

Paper Link】 【Pages】:577-578

【Authors】: Pat Hanrahan

【Abstract】: Analytics enables businesses to increase the efficiency of their activities and ultimately increase their profitability. As a result, it is one of the fastest growing segments of the database industry. There are two usages of the word analytics. The first refers to a set of algorithms and technologies, inspired by data mining, computational statistics, and machine learning, for supporting statistical inference and prediction. The second is equally important: analytical thinking. Analytical thinking is a structured approach to reasoning and decision making based on facts and data. Most of the recent work in the database community has focused on the first, the algorithmic and systems problems. The people behind these advances comprise a new generation of data scientists who have either the mathematical skills to develop advanced statistical models, or the computer skills to develop or implement scalable systems for processing large, complex datasets. The second aspect of analytics -- supporting the analytical thinker -- although equally important and challenging, has received much less attention. In this talk, I will describe recent advances in making both forms of analytics accessible to a broader range of people, whom I call data enthusiasts. A data enthusiast is an educated person who believes that data can be used to answer a question or solve a problem. These people are not mathematicians or programmers, and know only a bit of statistics. I'll review recent work on building easy-to-use, yet powerful, visual interfaces for working with data, and the analytical database technology needed to support these interfaces.

【Keywords】: analytic database technology

50. Symbiosis in scale out networking and data management.

Paper Link】 【Pages】:579-580

【Authors】: Amin Vahdat

【Abstract】: This talk highlights the symbiotic relationship between data management and networking through a study of two seemingly independent trends in the traditionally separate communities: large-scale data processing and software defined networking. First, data processing at scale increasingly runs across hundreds or thousands of servers. We show that balancing network performance with computation and storage is a prerequisite to both efficient and scalable data processing. We illustrate the need for scale out networking in support of data management through a case study of TritonSort, currently the record holder for several sorting benchmarks, including GraySort and JouleSort. Our TritonSort experience shows that disk-bound workloads require 10 Gb/s provisioned bandwidth to keep up with modern processors while emerging flash workloads require 40 Gb/s fabrics at scale. We next argue for the need to apply data management techniques to enable Software Defined Networking (SDN) and Scale Out Networking. SDN promises the abstraction of a single logical network fabric rather than a collection of thousands of individual boxes. In turn, scale out networking allows network capacity (ports, bandwidth) to be expanded incrementally, rather than by wholesale fabric replacement. However, SDN requires an extensible model of both static and dynamic network properties and the ability to deliver dynamic updates to a range of network applications in a fault tolerant and low latency manner. Doing so in networking environments where updates are typically performed by timer-based broadcasts and models are specified as comma-separated text files processed by one-off scripts presents interesting challenges. For example, consider an environment where applications from routing to traffic engineering to monitoring to intrusion/anomaly detection all essentially boil down to inserting, triggering and retrieving updates to/from a shared, extensible data store.

【Keywords】: scale out networking; software defined networking

Tutorials 6

51. Mob data sourcing.

Paper Link】 【Pages】:581-584

【Authors】: Daniel Deutch ; Tova Milo

【Abstract】: Crowdsourcing is an emerging paradigm that harnesses a mass of users to perform various types of tasks. We focus in this tutorial on a particular form of crowdsourcing, namely crowd (or mob) datasourcing whose goal is to obtain, aggregate or process data. We overview crowd datasourcing solutions in various contexts, explain the need for a principled solution, describe advances towards achieving such a solution, and highlight remaining gaps.

【Keywords】: crowdsourcing; declarative systems

52. Managing and mining large graphs: patterns and algorithms.

Paper Link】 【Pages】:585-588

【Authors】: Christos Faloutsos ; U. Kang

【Abstract】: Graphs are everywhere: social networks, the World Wide Web, biological networks, and many more. The sizes of graphs are growing at an unprecedented rate, spanning millions and billions of nodes and edges. What are the patterns in large graphs, spanning gigabytes and terabytes, and heading toward petabytes? What are the best tools, and how can they help us solve graph mining problems? How do we scale up algorithms for handling graphs with billions of nodes and edges? These are exactly the goals of this tutorial. We start with the patterns in real-world static, weighted, and dynamic graphs. Then we describe important tools for large graph mining, including singular value decomposition and Hadoop. Finally, we conclude with the design and implementation of scalable graph mining algorithms on Hadoop. This tutorial is complementary to the related tutorial "Managing and Mining Large Graphs: Systems and Implementations".

【Keywords】: graph mining; hadoop; mapreduce

53. Managing and mining large graphs: systems and implementations.

Paper Link】 【Pages】:589-592

【Authors】: Bin Shao ; Haixun Wang ; Yanghua Xiao

【Abstract】: We are facing challenges at all levels, ranging from infrastructures to programming models, for managing and mining large graphs. Many graph algorithms are ad hoc in the sense that each of them assumes that the underlying graph data can be organized in a certain way that maximizes its performance. In other words, there is no standard graph system on top of which graph algorithms are developed and optimized. In response to this situation, many graph systems have been proposed recently. In this tutorial, we discuss several representative systems. Still, we focus on providing perspectives from a variety of standpoints on the goals and the means for developing a general-purpose graph system. We highlight the challenges posed by graph data, the constraints of architectural design, the different types of application needs, and the power of different programming models that support such needs. This tutorial is complementary to the related tutorial "Managing and Mining Large Graphs: Patterns and Algorithms".

【Keywords】: distributed graph system; graph database; memory cloud; nosql

54. Computational reproducibility: state-of-the-art, challenges, and database research opportunities.

Paper Link】 【Pages】:593-596

【Authors】: Juliana Freire ; Philippe Bonnet ; Dennis Shasha

【Abstract】: Computational experiments have become an integral part of the scientific method, but reproducing, archiving, and querying them is still a challenge. The first barrier to wider adoption is the fact that it is hard both for authors to derive a compendium that encapsulates all the components needed to reproduce a result and for reviewers to verify the results. In this tutorial, we will present a series of guidelines and, through hands-on examples, review existing tools that help authors create reproducible results. We will also outline open problems and new directions for database-related research having to do with querying computational experiments.

【Keywords】: computational reproducibility; reproducible publications

55. Database techniques for linked data management.

Paper Link】 【Pages】:597-600

【Authors】: Andreas Harth ; Katja Hose ; Ralf Schenkel

【Abstract】: Linked Data refers to data published in accordance with a number of principles rooted in web standards. In the past few years we have witnessed a tremendous growth in Linked Data publishing on the web, leading to tens of billions of data items published online. Querying the data is a key functionality required to make use of the wealth of rich interlinked data. The goal of the tutorial is to introduce, motivate, and detail techniques for querying heterogeneous structured data from across the web. Our tutorial aims to introduce database researchers and practitioners to the new publishing paradigm on the web, and show how the abundance of data published as Linked Data can serve as fertile ground for database research and experimentation. As such, the tutorial focuses on applying database techniques to processing Linked Data, such as optimized indexing and query processing methods in the centralized setting as well as distributed approaches for querying. At the same time, we make the connection from Linked Data best practices to established technologies in distributed databases and the concept of Dataspaces and show differences as well as commonalities between the fields.

【Keywords】: dataspaces; distributed databases; linked data; query processing; rdf; semantic web; sparql

56. Differential privacy in data publication and analysis.

Paper Link】 【Pages】:601-606

【Authors】: Yin Yang ; Zhenjie Zhang ; Gerome Miklau ; Marianne Winslett ; Xiaokui Xiao

【Abstract】: Data privacy has been an important research topic in the security, theory, and database communities in the last few decades. However, many existing studies have restrictive assumptions regarding the adversary's prior knowledge, meaning that they preserve individuals' privacy only when the adversary has rather limited background information about the sensitive data, or only uses certain kinds of attacks. Recently, differential privacy has emerged as a new paradigm for privacy protection with very conservative assumptions about the adversary's prior knowledge. Since its proposal, differential privacy has been gaining attention in many fields of computer science, and is considered among the most promising paradigms for privacy-preserving data publication and analysis. In this tutorial, we will motivate its introduction as a replacement for other paradigms, present the basics of the differential privacy model from a database perspective, describe the state of the art in differential privacy research, explain the limitations and shortcomings of differential privacy, and discuss open problems for future research.

【Keywords】: data analysis; differential privacy; privacy-preserving data publication; query processing

Demonstrations group A: information extraction, search, performance, and clouds 10

57. Automatic web-scale information extraction.

Paper Link】 【Pages】:609-612

【Authors】: Philip Bohannon ; Nilesh N. Dalvi ; Yuval Filmus ; Nori Jacoby ; Sathiya Keerthi ; Alok Kirpal

【Abstract】: In this demonstration, we showcase the technologies that we are building at Yahoo! for Web-scale Information Extraction. Given any new Website, containing semi-structured information about a pre-specified set of schemas, we show how to populate objects in the corresponding schema by automatically extracting information from the Website.

【Keywords】: domain-centric; information extraction; web-scale

58. Just-in-time information extraction using extraction views.

Paper Link】 【Pages】:613-616

【Authors】: Amr El-Helw ; Mina H. Farid ; Ihab F. Ilyas

【Abstract】:

【Keywords】: extraction views; information extraction; text databases

Paper Link】 【Pages】:617-620

【Authors】: Cody Hansen ; Feifei Li

【Abstract】: In many database applications, search is still executed via form-based query interfaces, which are then translated into SQL statements to find matching records. Ranking is usually not implemented unless users have explicitly indicated how to rank the matching records, e.g., in the ascending order of year. Often, this approach is neither intuitive nor user friendly (especially with many search fields in a query form). It also requires application developers to design schema-specific query forms and develop specific programs that understand these forms. In this work, we propose to demonstrate the ColumbuScout system, which aims at quickly building and deploying a local search engine over one or more large databases. The ColumbuScout system adopts a search-engine-style approach for searches over local databases. It introduces its own indexing structures and storage designs to improve its overall efficiency and scalability. We will demonstrate that it is simple for application developers to deploy ColumbuScout over any database, and that ColumbuScout is able to support search-engine-style search over large databases efficiently and effectively.

【Keywords】: indexing; local search engines; ranking; recommendation; search in big data

Paper Link】 【Pages】:621-624

【Authors】: Behzad Golshan ; Theodoros Lappas ; Evimaria Terzi

【Abstract】: When working on a new project, researchers need to devote a significant amount of time and effort to surveying the relevant literature. This is required in order to gain expertise, evaluate the significance of their work and gain useful insights about a particular scientific domain. While necessary, relevant-work search is also a time-consuming and arduous process, requiring the continuous participation of the user. In this work, we introduce Sofia Search, a tool that fully automates the search and retrieval of the literature related to a topic. Given a seed of papers submitted by the user, Sofia Search searches the Web for candidate related papers, evaluates their relevance to the seed and downloads them for the user. The tool also provides modules for the evaluation and ranking of authors and papers, in the context of the retrieved papers. In the demo, we will demonstrate the functionality of our tool, by allowing users to use it via a simple and intuitive interface.

【Keywords】: related-work search

61. RACE: real-time applications over cloud-edge.

Paper Link】 【Pages】:625-628

【Authors】: Badrish Chandramouli ; Joris Claessens ; Suman Nath ; Ivo Santos ; Wenchao Zhou

【Abstract】: The Cloud-Edge topology - where multiple smart edge devices such as phones are connected to one another via the Cloud - is becoming ubiquitous. We demonstrate RACE, a novel framework and system for specifying and efficiently executing distributed real-time applications in the Cloud-Edge topology. RACE uses LINQ for StreamInsight to succinctly express a diverse suite of useful real-time applications. Further, it exploits the processing power of edge devices and the Cloud to partition and execute such queries in a distributed manner. RACE features a novel cost-based optimizer that efficiently finds the optimal placement, minimizing global communication cost while handling multi-level join queries and asymmetric network links.

【Keywords】: cloud; optimization; query; smartphones; streams

62. Partiqle: an elastic SQL engine over key-value stores.

Paper Link】 【Pages】:629-632

【Authors】: Jun'ichi Tatemura ; Oliver Po ; Wang-Pin Hsiung ; Hakan Hacigümüs

【Abstract】: The demo features Partiqle, a SQL engine over key-value stores, as a relational alternative to the recent procedural approaches for supporting OLTP workloads elastically. Based on our microsharding framework [12], it employs a declarative specification, called transaction classes, of the constraints applied to the transactions in a workload. We demonstrate the use of a transaction class in the design and analysis of OLTP workloads. We then demonstrate live scaling of our fully functioning system on a server cluster.

【Keywords】: cloud computing; elasticity; oltp; scalability

63. JustMyFriends: full SQL, full transactional amenities, and access privacy.

Paper Link】 【Pages】:633-636

【Authors】: Arthur Meacham ; Dennis Shasha

【Abstract】: A major obstacle to using Cloud services for many enterprises is the fear that the data will be stolen. Bringing the Cloud in-house is an incomplete solution to the problem because that implies that data center personnel as well as myriad repair personnel must be trusted. An ideal security solution would be to share data among precisely the people who should see it ("my friends") and nobody else. Encryption might seem to be an easy answer. Each friend could download the data, update it perhaps, and return it to a shared untrusted repository. But such a solution permits no concurrency and therefore no real sharing. JustMyFriends ensures sharing among friends without revealing unencrypted data to anyone outside of a circle of trust. In fact, non-friends (such as system administrators) see only encrypted blobs being added to a persistent store. JustMyFriends allows data sharing and full transactions. It supports the use of all SQL including stored procedures, updates, and arbitrary queries. Additionally, it provides full access privacy, preventing the host from discovering patterns or correlations in the user's data access behavior. The demonstration will show how friends in an unnamed government agency can coordinate the management of a spy network in a transactional fashion. Demo visitors will be able to play the roles of station chiefs and/or troublemakers. As station chiefs, they will write their own transactions and queries, log out, and log in. As troublemakers, visitors will be able to play the role of a curious observer, kill client processes, and in general try to disrupt the system.

【Keywords】: cloud; database; outsourcing; privacy; security

64. Dynamic optimization of generalized SQL queries with horizontal aggregations.

Paper Link】 【Pages】:637-640

【Authors】: Carlos Ordonez ; Javier García-García ; Zhibo Chen

【Abstract】: SQL presents limitations for returning aggregations as tables with a horizontal layout. A user generally needs to write separate queries and data definition statements to combine transposition with aggregation. With that motivation in mind, we introduce horizontal aggregations, a complementary class of aggregations to traditional (vertical) SQL aggregations. The SQL syntax extension is minimal and it significantly enhances the expressive power and ease of use of SQL. Our proposed SQL extension blurs the boundary between row values and column names. We present a prototype query optimizer that can evaluate arbitrary nested queries combining filtering, joins, and both classes of aggregations. Horizontal aggregations have many applications in ad-hoc querying, OLAP cube processing, and data mining. We demonstrate that query optimization of horizontal aggregations introduces new research challenges.
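
To make the horizontal layout concrete outside SQL, the small pure-Python sketch below computes the kind of pivoted aggregate that the proposed extension expresses in one statement; the sample schema and the SUM aggregate are assumptions for illustration.

    from collections import defaultdict

    def horizontal_sum(rows, group_by, pivot, measure):
        # Aggregate `measure` per `group_by`, transposing distinct `pivot` values into columns.
        out = defaultdict(dict)
        for r in rows:
            key, col = r[group_by], r[pivot]
            out[key][col] = out[key].get(col, 0) + r[measure]
        return dict(out)

    sales = [                                        # sample schema (illustrative)
        {"store": "S1", "quarter": "Q1", "amount": 100},
        {"store": "S1", "quarter": "Q2", "amount": 150},
        {"store": "S2", "quarter": "Q1", "amount": 80},
    ]
    # One row per store, one column per quarter -- a horizontal layout of SUM(amount).
    print(horizontal_sum(sales, "store", "quarter", "amount"))
    # {'S1': {'Q1': 100, 'Q2': 150}, 'S2': {'Q1': 80}}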

【Keywords】: aggregation; data set; sql

65. ConsAD: a real-time consistency anomalies detector.

Paper Link】 【Pages】:641-644

【Authors】: Kamal Zellag ; Bettina Kemme

【Abstract】: In this demonstration, we present ConsAD, a tool that detects consistency anomalies for arbitrary multi-tier applications that use lower levels of isolation than serializability. As the application is running, ConsAD detects and quantifies anomalies indicating exactly the transactions and data items involved. Furthermore, it classifies the detected anomalies into patterns showing the business methods involved as well as their occurrence frequency. ConsAD can guide designers to either choose an isolation level for which their application shows few anomalies or change their transaction design to avoid the anomalies. Its graphical interface shows detailed information about detected anomalies as they occur and analyzes their patterns as well as their distribution.

【Keywords】: consistency; multi-tier architectures; serializability

66. Interactive performance monitoring of a composite OLTP and OLAP workload.

Paper Link】 【Pages】:645-648

【Authors】: Anja Bog ; Kai Sachs ; Hasso Plattner

【Abstract】: Online transaction processing (OLTP) and online analytical processing (OLAP) are thought of as two separate domains, despite sharing the same business data to operate on. This separation is the result of performance impairments encountered in the past when both ran on the same system; as the workloads became ever more sophisticated, they led to contradictory optimizations in database design. Recent developments in hardware and database systems are bringing forth research prototypes supporting mixed OLTP and OLAP workloads, challenging this separation. At the same time, new benchmarks are proposed to assess these mixed-workload systems. In the demonstration, we show an interactive performance monitor and benchmark driver developed for the Composite Benchmark for Transaction Processing and Reporting. The performance monitor allows us to directly determine the impact of changing shares within the workload and to interactively assess behavioral characteristics of different database systems under changing mixed-workload conditions.

【Keywords】: database benchmarking; mixed oltp and olap workload; performance monitoring

Demonstrations group B: social- or user-centered 10

67. Sindbad: a location-based social networking system.

Paper Link】 【Pages】:649-652

【Authors】: Mohamed Sarwat ; Jie Bao ; Ahmed Eldawy ; Justin J. Levandoski ; Amr Magdy ; Mohamed F. Mokbel

【Abstract】: This demo presents Sindbad, a location-based social networking system. Sindbad supports three new services beyond traditional social networking services, namely location-aware news feed, location-aware recommender, and location-aware ranking. These new services consider not only social relevance for their users but also spatial relevance. Since location-aware social networking systems have to deal with large numbers of users, large numbers of messages, and user mobility, efficiency and scalability are important issues. To this end, Sindbad encapsulates its three main services inside the query processing engine of PostgreSQL. The usage and internal functionality of Sindbad, implemented with PostgreSQL and the Google Maps API, are demonstrated through user (i.e., web/phone) and system analyzer GUI interfaces, respectively.

【Keywords】: news feed; recommender; recommender systems; social; social networking; spatial message; spatial rating

68. MAQSA: a system for social analytics on news.

Paper Link】 【Pages】:653-656

【Authors】: Sihem Amer-Yahia ; Samreen Anjum ; Amira Ghenai ; Aysha Siddique ; Sofiane Abbar ; Sam Madden ; Adam Marcus ; Mohammed El-Haddad

【Abstract】: We present MAQSA, a system for social analytics on news. MAQSA provides an interactive topic-centric dashboard that summarizes news articles and social activity (e.g., comments and tweets) around them. MAQSA helps editors and publishers in newsrooms understand user engagement and audience sentiment evolution on various topics of interest. It also helps news consumers explore public reaction on articles relevant to a topic and refine their exploration via related entities, topics, articles and tweets. Given a topic, e.g., "Gulf Oil Spill," or "The Arab Spring", MAQSA combines three key dimensions: time, geographic location, and topic to generate a detailed activity dashboard around relevant articles. The dashboard contains an annotated comment timeline and a social graph of comments. It utilizes commenters' locations to build maps of comment sentiment and topics by region of the world. Finally, to facilitate exploration, MAQSA provides listings of related entities, articles, and tweets. It algorithmically processes large collections of articles and tweets, and enables the dynamic specification of topics and dates for exploration. In this demo, participants will be invited to explore the social dynamics around articles on oil spills, the Libyan revolution, and the Arab Spring. In addition, participants will be able to define and explore their own topics dynamically.

【Keywords】: interactive exploration; sentiment analysis; social media

69. Surfacing time-critical insights from social media.

Paper Link】 【Pages】:657-660

【Authors】: Bogdan Alexe ; Mauricio A. Hernández ; Kirsten Hildrum ; Rajasekar Krishnamurthy ; Georgia Koutrika ; Meenakshi Nagarajan ; Haggai Roitman ; Michal Shmueli-Scheuer ; Ioana Roxana Stanoi ; Chitra Venkatramani ; Rohit Wagle

【Abstract】: We propose to demonstrate an end-to-end framework for leveraging time-sensitive and critical social media information for businesses. More specifically, we focus on identifying, structuring, integrating, and exposing timely insights that are essential to marketing services and monitoring reputation over social media. Our system includes components for information extraction from text, entity resolution and integration, analytics, and a user interface.

【Keywords】: entity integration; entity resolution; information extraction; social media

Paper Link】 【Pages】:661-664

【Authors】: Silviu Maniu ; Bogdan Cautis

【Abstract】: We demonstrate the Taagle system for top-k retrieval in social tagging systems (also known as folksonomies). The general setting is the following: users form a weighted social network, which may reflect friendship, similarity, or trust; items from a public pool of items (e.g., URLs, blogs, photos, documents) are tagged by users with keywords; users search for the top-k items having certain tags. Going beyond a classic search paradigm where data is decoupled from the users querying it, users can now act both as producers and seekers of information. Hence finding the most relevant items in response to a query should be done in a network-aware manner: items tagged by users who are closer (more similar) to the seeker should be given more weight than items tagged by distant users. We illustrate with Taagle novel algorithms and a general approach that has the potential to scale to current applications, in an online context where the social network, the tagging data, and even the seekers' search ingredients can change at any moment. We also illustrate possible design choices for providing users a fully personalized and customizable search interface. Through this interface, they can calibrate how social proximity is computed (for example, with respect to similarity in tagging actions), how much weight the social score of tagging actions should have in the result build-up, or the criteria by which the user network should be explored. In order to further reduce running time, seekers are given the possibility to choose between exact and approximate answers, and can benefit from cached results of previous queries (materialized views).
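
The network-aware scoring idea can be sketched as follows: an item's score for a seeker is the proximity-weighted contribution of the users who tagged it with the query tag. The linear weighting and the toy data below are assumptions, not Taagle's exact scoring function.

    def social_topk(taggings, proximity, seeker, tag, k):
        # taggings: list of (user, item, tag); proximity[(seeker, user)]: weight in [0, 1].
        # Score each item by the summed proximity of users who tagged it with `tag` (illustrative).
        scores = {}
        for user, item, t in taggings:
            if t == tag:
                w = proximity.get((seeker, user), 0.0)
                scores[item] = scores.get(item, 0.0) + w
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

    taggings = [("u1", "url1", "python"), ("u2", "url1", "python"), ("u3", "url2", "python")]
    proximity = {("me", "u1"): 0.9, ("me", "u2"): 0.4, ("me", "u3"): 0.1}
    print(social_topk(taggings, proximity, seeker="me", tag="python", k=2))
    # url1 (tagged by closer users) ranks above url2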

【Keywords】: social applications; social search; threshold algorithms

71. PrefDB: bringing preferences closer to the DBMS.

Paper Link】 【Pages】:665-668

【Authors】: Anastasios Arvanitis ; Georgia Koutrika

【Abstract】: In this demonstration we present a preference-aware relational query answering system, termed PrefDB. The key novelty of PrefDB is the use of an extended relational data model and algebra that allow expressing different flavors of preferential queries. Furthermore, unlike existing approaches that either treat the DBMS as a black box or require modifications of the database core, PrefDB's hybrid implementation enables operator-level query optimizations without being obtrusive to the database engine. We showcase the flexibility and efficiency of PrefDB using PrefDBAdmin, a graphical tool that we have built aiming at assisting application designers in the task of building, testing and tuning queries with preferences.

【Keywords】: preferences; query personalization

72. Auto-completion learning for XML.

Paper Link】 【Pages】:669-672

【Authors】: Serge Abiteboul ; Yael Amsterdamer ; Tova Milo ; Pierre Senellart

【Abstract】: Editing an XML document manually is a complicated task. While many XML editors exist on the market, we argue that some important functionalities are missing in all of them. Our goal is to make the editing task simpler and faster. We present ALEX (Auto-completion Learning Editor for XML), an editor that assists users by providing intelligent auto-completion suggestions. These suggestions are adapted to the user's needs, simply by feeding ALEX a set of example XML documents to learn from. The suggestions are also guaranteed to be compliant with a given XML schema, possibly including integrity constraints. To fulfill this challenging goal, we rely on novel theoretical foundations, by us and others, which are combined here in a system for the first time.

【Keywords】: auto-completion; editor; learning; schema; xml

73. Logos: a system for translating queries into narratives.

Paper Link】 【Pages】:673-676

【Authors】: Andreas Kokkalis ; Panagiotis Vagenas ; Alexandros Zervakis ; Alkis Simitsis ; Georgia Koutrika ; Yannis E. Ioannidis

【Abstract】: This paper presents Logos, a system that provides natural language translations for relational queries expressed in SQL. Our translation mechanism is based on a graph-based approach to the query translation problem. We represent various forms of structured queries as directed graphs and we annotate the graph edges with template labels using an extensible template mechanism. Logos uses different graph traversal strategies for efficiently exploring these graphs and composing textual query descriptions. The audience may interactively explore Logos using various database schemata and issuing either sample or ad hoc queries.

【Keywords】: natural language; query translation; sql queries

74. PAnG: finding patterns in annotation graphs.

Paper Link】 【Pages】:677-680

【Authors】: Philip Anderson ; Andreas Thor ; Joseph Benik ; Louiqa Raschid ; Maria-Esther Vidal

【Abstract】: Annotation graph datasets are a natural representation of scientific knowledge. They are common in the life sciences and health sciences, where concepts such as genes, proteins or clinical trials are annotated with controlled vocabulary terms from ontologies. We present a tool, PAnG (Patterns in Annotation Graphs), that is based on a complementary methodology of graph summarization and dense subgraphs. The elements of a graph summary correspond to a pattern and its visualization can provide an explanation of the underlying knowledge. Scientists can use PAnG to develop hypotheses and for exploration.

【Keywords】: dense subgraphs; graph summarization; link prediction

75. VizDeck: self-organizing dashboards for visual analytics.

Paper Link】 【Pages】:681-684

【Authors】: Alicia Key ; Bill Howe ; Daniel Perry ; Cecilia R. Aragon

【Abstract】: We present VizDeck, a web-based tool for exploratory visual analytics of unorganized relational data. Motivated by collaborations with domain scientists who search for complex patterns in hundreds of data sources simultaneously, VizDeck automatically recommends appropriate visualizations based on the statistical properties of the data and adopts a card game metaphor to help organize the recommended visualizations into interactive visual dashboard applications in seconds with zero programming. The demonstration allows users to derive, share, and permanently store their own dashboard from hundreds of real science datasets using a production system deployed at the University of Washington.

【Keywords】: visualization

76. Kaizen: a semi-automatic index advisor.

Paper Link】 【Pages】:685-688

【Authors】: Ivo Jimenez ; Huascar Sanchez ; Quoc Trung Tran ; Neoklis Polyzotis

【Abstract】: Index tuning, i.e., selecting indexes that are appropriate for the workload to obtain good system performance, is a crucial task for database administrators. Administrators rely on automated index advisors for this task, but existing advisors work either offline, requiring a priori knowledge of the workload, or online, taking the administrator out of the picture and assuming total control of the index tuning task. Semi-automatic index tuning is a new paradigm that achieves a middle ground: the advisor analyzes the workload online and provides recommendations tailored to the current workload, and the administrator is able to provide feedback to refine future recommendations. In this demonstration we present Kaizen, an index tuning tool that implements semi-automatic tuning.

【Keywords】: feedback; index advisor; semi-automatic

Demonstrations group C: analytics 10

77. Shark: fast data analysis using coarse-grained distributed memory.

Paper Link】 【Pages】:689-692

【Authors】: Cliff Engle ; Antonio Lupher ; Reynold Xin ; Matei Zaharia ; Michael J. Franklin ; Scott Shenker ; Ion Stoica

【Abstract】: Shark is a research data analysis system built on a novel coarse-grained distributed shared-memory abstraction. Shark marries query processing with deep data analysis, providing a unified system for easy data manipulation using SQL and pushing sophisticated analysis closer to data. It scales to thousands of nodes in a fault-tolerant manner. Shark can answer queries 40X faster than Apache Hive and run machine learning programs 25X faster than MapReduce programs in Apache Hadoop on large datasets.

【Keywords】: data warehouse; databases; machine learning; resilient distributed dataset; shark; spark

78. Exploiting MapReduce-based similarity joins.

Paper Link】 【Pages】:693-696

【Authors】: Yasin N. Silva ; Jason M. Reed

【Abstract】: Cloud enabled systems have become a crucial component to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join, which retrieves all data pairs whose distances are smaller than a pre-defined threshold ε. Even though multiple algorithms and implementation techniques have been proposed for Similarity Joins, very little work has addressed the study of Similarity Joins for cloud systems. This paper presents MRSimJoin, a multi-round MapReduce based algorithm to efficiently solve the Similarity Join problem. MRSimJoin efficiently partitions and distributes the data until the subsets are small enough to be processed in a single node. The proposed algorithm is general enough to be used with data that lies in any metric space. We have implemented MRSimJoin in Hadoop, a highly used open-source cloud system. We show how this operation can be used in multiple real-world data analysis scenarios with multiple data types and distance functions. Particularly, we show the use of MRSimJoin to identify similar images represented as feature vectors, and similar publications in a bibliographic database. We also show how MRSimJoin scales in each scenario when important parameters, e.g., ε, data size and number of cluster nodes, increase. We demonstrate the execution of MRSimJoin queries using an Amazon Elastic Compute Cloud (EC2) cluster.
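
As a reference point for the operation's semantics (not the multi-round partitioning algorithm itself), here is a naive in-memory similarity self-join; the Euclidean metric and the value of ε are assumptions, since the algorithm works in any metric space.

    import math

    def similarity_join(points, eps):
        # Naive all-pairs baseline: return all pairs of point IDs at distance <= eps.
        ids = list(points)
        pairs = []
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                if math.dist(points[ids[i]], points[ids[j]]) <= eps:
                    pairs.append((ids[i], ids[j]))
        return pairs

    pts = {"p1": (0.0, 0.0), "p2": (0.5, 0.1), "p3": (9.0, 9.0)}   # toy data (illustrative)
    print(similarity_join(pts, eps=1.0))   # [('p1', 'p2')]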

【Keywords】: hadoop; mapreduce; similarity join

79. GLADE: big data analytics made easy.

Paper Link】 【Pages】:697-700

【Authors】: Yu Cheng ; Chengjie Qin ; Florin Rusu

【Abstract】: We present GLADE, a scalable distributed system for large scale data analytics. GLADE takes analytical functions expressed through the User-Defined Aggregate (UDA) interface and executes them efficiently on the input data. The entire computation is encapsulated in a single class which requires the definition of four methods. The runtime takes the user code and executes it right near the data by taking full advantage of the parallelism available inside a single machine as well as across a cluster of computing nodes. The demonstration has two goals. First, it presents the architecture of GLADE and how processing is done by using a series of analytical functions. Second, it compares GLADE with two different classes of systems for data analytics: a relational database (PostgreSQL) enhanced with UDAs and Map-Reduce (Hadoop). We show how the analytical functions are coded into each of these systems (for Map-Reduce, we use both Java code as well as Pig Latin) and compare their expressiveness, scalability, and running time efficiency.
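
To illustrate the shape of such an aggregate, here is a minimal Python sketch of an average UDA; the method names Init/Accumulate/Merge/Terminate follow the common UDA convention and are an assumption about GLADE's exact interface.

    class AverageUDA:
        # A user-defined aggregate computing an average; a runtime can run Accumulate on
        # local chunks in parallel and Merge the partial states (method names assumed).
        def Init(self):
            self.total, self.count = 0.0, 0

        def Accumulate(self, value):
            self.total += value
            self.count += 1

        def Merge(self, other):
            self.total += other.total
            self.count += other.count

        def Terminate(self):
            return self.total / self.count if self.count else None

    # Simulate two workers processing separate chunks, then merging their states.
    a, b = AverageUDA(), AverageUDA()
    a.Init(); b.Init()
    for v in (10, 20):
        a.Accumulate(v)
    for v in (30,):
        b.Accumulate(v)
    a.Merge(b)
    print(a.Terminate())   # 20.0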

【Keywords】: gla; hadoop; uda

80. ReStore: reusing results of MapReduce jobs in pig.

Paper Link】 【Pages】:701-704

【Authors】: Iman Elghandour ; Ashraf Aboulnaga

【Abstract】: Analyzing large scale data has become an important activity for many organizations, and is now facilitated by the MapReduce programming and execution model and its implementations, most notably Hadoop. Query languages such as Pig Latin, Hive, and Jaql make it simpler for users to express complex analysis tasks, and the compilers of these languages translate these complex tasks into workflows of MapReduce jobs. Each job in these workflows reads its input from the distributed file system used by the MapReduce system (e.g., HDFS in the case of Hadoop) and produces output that is stored in this distributed file system. This output is then read as input by the next job in the workflow. The current practice is to delete these intermediate results from the distributed file system at the end of executing the workflow. It would be more useful if these intermediate results can be stored and reused in future workflows. We demonstrate ReStore, an extension to Pig that enables it to manage storage and reuse of intermediate results of the MapReduce workflows executed in the Pig data analysis system. ReStore matches input workflows of MapReduce jobs with previously executed jobs and rewrites these workflows to reuse the stored results of the matched jobs. ReStore also creates additional reuse opportunities by materializing and reserving the output of query execution operators that are executed within a MapReduce job. In this demonstration we showcase the MapReduce jobs and sub-jobs recommended by ReStore for a given Pig query, the rewriting of input queries to reuse stored intermediate results, and a what-if analysis of the effectiveness of reusing stored outputs of previously executed jobs.

【Keywords】: data reuse; mapreduce; pig latin

81. Clydesdale: structured data processing on hadoop.

Paper Link】 【Pages】:705-708

【Authors】: Andrey Balmin ; Tim Kaldewey ; Sandeep Tata

【Abstract】: There have been several recent proposals modifying Hadoop, radically changing the storage organization or query processing techniques to obtain good performance for structured data processing. We will showcase Clydesdale, a research prototype for structured data processing on Hadoop that can achieve dramatic performance improvements over existing solutions, without any changes to the underlying MapReduce implementation. Clydesdale achieves this through a novel synthesis of several techniques from the database literature and carefully adapting them to the Hadoop environment. On the star schema benchmark, we show that Clydesdale is on average 38x faster than Hive, the dominant approach for structured data processing on Hadoop today. To the best of our knowledge, Clydesdale is the fastest solution for processing workloads on structured data sets that fit a star schema on Hadoop. Attendees will be able to run queries on the data from the star schema benchmark on a remote Hadoop cluster with Clydesdale and Hive installed, and get a breakdown of the time taken to execute the query. Attendees will also be able to pose their own queries using ClyQL -- a novel embedded DSL in Scala that can be used to rapidly prototype star join queries. With this demonstration, we hope to convince the attendees that unlike previously thought, Hadoop can indeed efficiently support structured data processing.

【Keywords】: hadoop; mapreduce; query processing; star joins; structured data processing

82. Tiresias: a demonstration of how-to queries.

Paper Link】 【Pages】:709-712

【Authors】: Alexandra Meliou ; Yisong Song ; Dan Suciu

【Abstract】: In this demo, we will present Tiresias, the first how-to query engine. How-to queries represent fundamental data analysis questions of the form: "How should the input change in order to achieve the desired output?" They exemplify an important Reverse Data Management problem: solving constrained optimization problems over data residing in a DBMS. Tiresias, named after the mythical oracle of Thebes, has complex inner workings, but includes a simple interface that allows users to load datasets and interactively design optimization problems by simply selecting actions, key performance indicators, and objectives. The user choices are translated into a declarative query, which is then processed by Tiresias and translated into a Mixed Integer Program; we then use an MIP solver to find a solution. The solution is then presented to the user as an interactive data instance. The user can provide feedback by rejecting certain tuples and/or values. Then, based on the user feedback, Tiresias automatically refines the how-to query and presents a new set of results.

【Keywords】: constrained optimization; tiql; tiresias

83. AstroShelf: understanding the universe through scalable navigation of a galaxy of annotations.

Paper Link】 【Pages】:713-716

【Authors】: Panayiotis Neophytou ; Roxana Gheorghiu ; Rebecca Hachey ; Timothy Luciani ; Di Bao ; Alexandros Labrinidis ; G. Elisabeta Marai ; Panos K. Chrysanthis

【Abstract】: This demo presents AstroShelf, our on-going effort to enable astrophysicists to collaboratively investigate celestial objects using data originating from multiple sky surveys, hosted at different sites. The AstroShelf platform combines database, data stream, workflow, and visualization technologies to provide a means for querying and displaying telescope images (in a Google Sky manner), visualizing spectrum data, and managing annotations. In addition to the user interface, AstroShelf supports a programmatic interface (available as a web service), which allows astrophysicists to incorporate functionality from AstroShelf in their own programs. A key feature is Live Annotations: the detection and delivery of events or annotations to users in real time, based on their profiles. We demonstrate the capabilities of AstroShelf through real end-user exploration scenarios (with participation from "stargazers" in the audience), in the presence of simulated annotation workloads executed through web services.

【Keywords】: annotations; astronomy; big data; continuous workflows; scientific databases; visualization

84. OPAvion: mining and visualization in large graphs.

Paper Link】 【Pages】:717-720

【Authors】: Leman Akoglu ; Duen Horng Chau ; U. Kang ; Danai Koutra ; Christos Faloutsos

【Abstract】: Given a large graph with millions or billions of nodes and edges, like a who-follows-whom Twitter graph, how do we scalably compute its statistics, summarize its patterns, spot anomalies, visualize and make sense of it? We present OPAvion, a graph mining system that provides a scalable, interactive workflow to accomplish these analysis tasks. OPAvion consists of three modules: (1) The Summarization module (Pegasus) operates off-line on massive, disk-resident graphs and computes graph statistics, like PageRank scores, connected components, degree distribution, triangles, etc.; (2) The Anomaly Detection module (OddBall) uses graph statistics to mine patterns and spot anomalies, such as nodes with many contacts but few interactions with them (possibly telemarketers); (3) The Interactive Visualization module (Apolo) lets users incrementally explore the graph, starting with their chosen nodes or the flagged anomalous nodes; then users can expand to the nodes' vicinities, label them into categories, and thus interactively navigate the interesting parts of the graph. In our demonstration, we invite our audience to interact with OPAvion and try out its core capabilities on the Stack Overflow Q&A graph that describes over 6 million questions and answers among 650K users.
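
The "many contacts but few interactions" signal mentioned above can be pictured with a small Python sketch; the threshold-based rule below is only illustrative and is not OddBall's actual power-law-based method.

```python
# A minimal sketch of one anomaly signal from the abstract: flag nodes whose
# total interaction weight is low relative to their number of neighbors.
def contacts_vs_interactions(adjacency, min_degree=10, ratio_threshold=1.5):
    """adjacency: {node: {neighbor: interaction_count}} for a weighted graph."""
    flagged = []
    for node, neighbors in adjacency.items():
        degree = len(neighbors)
        weight = sum(neighbors.values())
        if degree >= min_degree and weight / degree < ratio_threshold:
            flagged.append(node)                 # telemarketer-like behavior
    return flagged
```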

【Keywords】: anomaly detection; cloud computing; visualization

85. CloudAlloc: a monitoring and reservation system for compute clusters.

Paper Link】 【Pages】:721-724

【Authors】: Enrico Iori ; Alkis Simitsis ; Themis Palpanas ; Kevin Wilkinson ; Stavros Harizopoulos

【Abstract】: Cloud computing has emerged as a promising environment capable of providing flexibility, scalability, elasticity, fail-over mechanisms, high availability, and other important features to applications. Compute clusters are relatively easy to create and use, but tools to effectively share cluster resources are lacking. CloudAlloc addresses this problem and schedules workloads to cluster resources using allocation algorithms that can be easily changed according to the objectives of the enterprise. It also monitors resource utilization and thus, provides accountability for actual usage. CloudAlloc is a lightweight, flexible, easy-to-use tool for cluster resource allocation that has also proved useful as a research platform. We demonstrate its features and also discuss its allocation algorithms that minimize power usage. CloudAlloc was implemented and is in use at HP Labs.

【Keywords】: cloud computing; monitoring system; reservation system

86. TIRAMOLA: elastic nosql provisioning through a cloud management platform.

Paper Link】 【Pages】:725-728

【Authors】: Ioannis Konstantinou ; Evangelos Angelou ; Dimitrios Tsoumakos ; Christina Boumpouka ; Nectarios Koziris ; Spyros Sioutas

【Abstract】: NoSQL databases focus on analytical processing of large scale datasets, offering increased scalability over commodity hardware. One of their strongest features is elasticity, which allows for fairly portioned premiums and high-quality performance. Yet, the process of adaptive expansion and contraction of resources usually involves a lot of manual effort, often requiring the definition of the conditions for scaling up or down to be provided by the users. To date, there exists no open-source system for automatic resizing of NoSQL clusters. In this demonstration, we present TIRAMOLA, a modular, cloud-enabled framework for monitoring and adaptively resizing NoSQL clusters. Our system incorporates a decision-making module which allows for optimal cluster resize actions in order to maximize any quantifiable reward function provided together with life-long adaptation to workload or infrastructural changes. The audience will be able to initiate HBase clusters of various sizes and apply varying workloads through multiple YCSB clients. The attendees will be able to watch, in real-time, the system perform automatic VM additions and removals as well as how cluster performance metrics change relative to the optimization parameters of their choice.
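
A minimal Python sketch of the decision step, under the simplifying assumption of a one-step lookahead over add/remove/no-op actions and a user-supplied reward function; the real module is built around a Markov decision process learned from monitoring data.

```python
# Pick the resize action that maximizes a quantifiable reward; the predictor
# and reward function below are illustrative placeholders.
def choose_resize_action(cluster_size, predict_metrics, reward_fn,
                         min_nodes=4, max_nodes=32):
    """predict_metrics(n) returns expected metrics for a cluster of n nodes."""
    candidates = {"add": cluster_size + 1, "remove": cluster_size - 1, "no-op": cluster_size}
    feasible = {a: n for a, n in candidates.items() if min_nodes <= n <= max_nodes}
    return max(feasible, key=lambda a: reward_fn(feasible[a], predict_metrics(feasible[a])))

# illustrative reward: throughput per node minus a latency penalty
reward = lambda n, m: m["throughput"] / n - 0.1 * m["latency"]
predict = lambda n: {"throughput": 500.0 * n, "latency": 400.0 / n}
print(choose_resize_action(8, predict, reward))      # -> "add" under this toy model
```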

【Keywords】: automatic cluster resize; cloud monitoring; elasticity; markov decision process; nosql; open-source

Databases in the cloud 3

87. Amazon dynamoDB: a seamlessly scalable non-relational database service.

Paper Link】 【Pages】:729-730

【Authors】: Swaminathan Sivasubramanian

【Abstract】: The reliability and scalability of an application are dependent on how its application state is managed. Running applications at massive scale requires datastores that can scale seamlessly across thousands of servers and can deal with various failure modes such as server failures, datacenter failures and network partitions. The goal of Amazon DynamoDB is to eliminate this complexity and operational overhead for our customers by offering a seamlessly scalable database service. In this talk, I will talk about how developers can build applications on DynamoDB without having to deal with the complexity of operating a large scale database.

【Keywords】: distributed databases; non-relational databases; scalability

88. Efficient transaction processing in SAP HANA database: the end of a column store myth.

Paper Link】 【Pages】:731-742

【Authors】: Vishal Sikka ; Franz Färber ; Wolfgang Lehner ; Sang Kyun Cha ; Thomas Peh ; Christof Bornhövd

【Abstract】: The SAP HANA database is the core of SAP's new data management platform. The overall goal of the SAP HANA database is to provide a generic but powerful system for different query scenarios, both transactional and analytical, on the same data representation within a highly scalable execution environment. Within this paper, we highlight the main features that differentiate the SAP HANA database from classical relational database engines. Therefore, we outline the general architecture and design criteria of the SAP HANA database in a first step. In a second step, we challenge the common belief that column store data structures are only superior in analytical workloads and not well suited for transactional workloads. We outline the concept of record life cycle management to use different storage formats for the different stages of a record. We not only discuss the general concept but also dive into some of the details of how to efficiently propagate records through their life cycle and move database entries from write-optimized to read-optimized storage formats. In summary, the paper aims at illustrating how the SAP HANA database is able to efficiently work in analytical as well as transactional workload environments.
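
The record life-cycle idea can be pictured with a toy Python sketch in which inserts land in a write-optimized delta list and a merge step rebuilds a dictionary-encoded, read-optimized main store; this is a schematic illustration, not HANA's actual storage layout.

```python
# A minimal sketch of delta/main record life-cycle management for one column.
class LifecycleColumn:
    def __init__(self):
        self.delta = []                            # write-optimized: append-only list
        self.main_dict, self.main_ids = [], []     # read-optimized: sorted dictionary + codes

    def insert(self, value):
        self.delta.append(value)

    def merge(self):
        """Move delta entries into the dictionary-encoded main store."""
        values = [self.main_dict[c] for c in self.main_ids] + self.delta
        self.main_dict = sorted(set(values))
        index = {v: i for i, v in enumerate(self.main_dict)}
        self.main_ids = [index[v] for v in values]
        self.delta = []

    def scan_equal(self, value):
        """Queries see both stores: count matches in main and delta."""
        code = {v: i for i, v in enumerate(self.main_dict)}.get(value)
        return self.main_ids.count(code) + self.delta.count(value)
```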

【Keywords】: column store; sap hana; transaction processing

89. Walnut: a unified cloud object store.

Paper Link】 【Pages】:743-754

【Authors】: Jianjun Chen ; Chris Douglas ; Michi Mutsuzaki ; Patrick Quaid ; Raghu Ramakrishnan ; Sriram Rao ; Russell Sears

【Abstract】: Walnut is an object-store being developed at Yahoo! with the goal of serving as a common low-level storage layer for a variety of cloud data management systems including Hadoop (a MapReduce system), MObStor (a multimedia serving system), and PNUTS (an extended key-value serving system). Thus, a key performance challenge is to meet the latency and throughput requirements of the wide range of workloads commonly observed across these diverse systems. The motivation for Walnut is to leverage a carefully optimized low-level storage system, with support for elasticity and high-availability, across all of Yahoo!'s data clouds. This would enable sharing of hardware resources across hitherto siloed clouds of different types, offering greater potential for intelligent load balancing and efficient elastic operation, and simplify the operational tasks related to data storage. In this paper, we discuss the motivation for unifying different storage clouds, describe the requirements of a common storage layer, and present the Walnut design, which uses a quorum-based replication protocol and one-hop direct client access to the data in most regular operations. A unique contribution of Walnut is its hybrid object strategy, which efficiently supports both small and large objects. We present experiments based on both synthetic and real data traces, showing that Walnut works well over a wide range of workloads, and can indeed serve as a common low-level storage layer across a range of cloud systems.
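
The quorum-based replication mentioned above can be illustrated with a generic majority-quorum sketch in Python (R + W > N); this is textbook quorum logic, not Walnut's actual protocol or API.

```python
# A minimal sketch of quorum reads and writes over N replicas.
class QuorumStore:
    def __init__(self, num_replicas=3, w=2, r=2):
        assert r + w > num_replicas            # overlap guarantees a fresh read
        self.replicas = [dict() for _ in range(num_replicas)]
        self.w, self.r = w, r
        self.version = 0

    def put(self, key, value):
        self.version += 1
        for replica in self.replicas[: self.w]:    # pretend only W acks arrive
            replica[key] = (self.version, value)

    def get(self, key):
        answers = [rep[key] for rep in self.replicas[-self.r:] if key in rep]
        return max(answers)[1] if answers else None

store = QuorumStore()
store.put("photo:1", "v1")
print(store.get("photo:1"))                    # "v1", via the overlapping replica
```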

【Keywords】: cloud storage; hybrid object store; paxos-based replication

Social media and crowdsourcing 3

90. The value of social media data in enterprise applications.

Paper Link】 【Pages】:755-756

【Authors】: Shivakumar Vaithyanathan

【Abstract】: Social media is an interactive vehicle for communication accessed on a daily basis by hundreds of millions of people. Unlike conventional media, which is a one-way street for information exchange, social media enables people to write content as well as provide feedback and recommend content to other users. There are multiple enterprise applications, such as customer retention, new customer acquisition, campaign management and lead generation that can significantly benefit from the consumer insights hidden in the massive amounts of social media content. Defining, extracting and representing entities such as people, organization and products, and their inter-relationships enables the building of comprehensive consumer profiles that can be leveraged in enterprise applications. Building these social media profiles requires a combination of text and entity analytics, while the utilization of such profiles makes heavy use of statistical models and machine learning. In this talk I will briefly describe the work in progress at IBM Research - Almaden on how such consumer insights, both at the level of an individual and at the level of appropriate micro-segments, can be used in enterprise applications in companies ranging from movie studios to financial services and insurance companies. I will also provide a brief overview of text, entity and statistical modeling tools that can operate in a distributed fashion over very large amounts of data.

【Keywords】: entity resolution; hadoop; information extraction; large scale machine learning; social media analytics; text analytics

91. Anatomy of a gift recommendation engine powered by social media.

Paper Link】 【Pages】:757-764

【Authors】: Yannis Pavlidis ; Madhusudan Mathihalli ; Indrani Chakravarty ; Arvind Batra ; Ron Benson ; Ravi Raj ; Robert Yau ; Mike McKiernan ; Venky Harinarayan ; Anand Rajaraman

【Abstract】: More and more people conduct their shopping online [1], especially during the holiday season [2]. Shopping online offers a lot of convenience, including the luxury of shopping from home, the ease of research, better prices, and in many cases access to unique products not available in stores. One of the facets of shopping is gifting. Gifting may be the act of giving a present to somebody because of an event (e.g., birthday) or occasion (e.g., house warming party). People may also treat themselves or loved ones to a gift. Regardless of the occasion or the reason for gifting, there is often one common denominator: delight the receiver. The pursuit of delight can cause a great deal of stress and also be extremely time consuming as many people today either already have everything or have easy access to everything. The @WalmartLabs Gift Recommendation Engine and its first application, Shopycat, which is a gift finder application on Facebook, aim to make finding the right and "wow" gifts much easier and quicker than ever before, by taking into account social media interactions. In this paper we will begin by describing the Shopycat Social Gift Finder Facebook application. Next, we describe the components of the engine. Finally, we discuss the metrics used to evaluate the engine. Building such a gift recommendation engine raises many challenges, in inferring user interests, computing the giftability of a product and an interest, and processing the big and fast data associated with social media. We briefly discuss our solutions to these challenges. Overall, our gift recommendation engine is an example that illustrates social commerce, a powerful emerging trend in e-commerce, and a major focus of @WalmartLabs.

【Keywords】: data integration; gift; information extraction; recommendation engine; semantic analysis; social genome; social media

92. Designing a scalable crowdsourcing platform.

Paper Link】 【Pages】:765-766

【Authors】: Chris Van Pelt ; Alex Sorokin

【Abstract】: Computers are extremely efficient at crawling, storing and processing huge volumes of structured data. They are great at exploiting link structures to generate valuable knowledge. Yet there are plenty of data processing tasks that are difficult today. Labeling sentiment, moderating images, and mining structured content from the web are still too hard for computers. Automated techniques can get us a long way in some of those, but human intelligence is required when an accurate decision is ultimately important. In many cases that decision is easy for people and can be made quickly - in a few seconds to a few minutes. By creating millions of simple online tasks we create a distributed computing machine. By shipping the tasks to millions of contributors around the globe, we make this human computer available 24/7 to make important decisions about your data. In this talk, I will describe our approach to designing CrowdFlower - a scalable crowdsourcing platform - as it evolved over the last 4 years. We think about crowdsourcing in terms of Quality, Cost and Speed. They are the ultimate design objectives of a human computer. Unfortunately, we can't have all 3. A general price-constrained task requiring 99.9% accuracy and 10 minute turnaround is not possible today. I will discuss design decisions behind CrowdFlower that allow us to pursue any two of these objectives. I will briefly present examples of common crowdsourced tasks and tools built into the platform to make the design of complex tasks easy, tools such as CrowdFlower Markup Language (CML). Quality control is the single most important challenge in Crowdsourcing. To enable an unidentified crowd of people to produce meaningful work, we must be certain that we can filter out bad contributors and produce high quality output. Initially we only used consensus. As the diversity and size of our crowd grew, so did the number of people attempting fraud. CrowdFlower developed "Gold standard" to block attempts at fraud. The use of gold allowed us to train contributors for the details of specific domains. By defining expected responses for a subset of the work and providing explanations of why a given response was expected, we are able to distribute tasks to an ever-expanding anonymous workforce without sacrificing quality.
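
A minimal Python sketch of gold-standard quality control as described above: contributors are graded on hidden test units with known answers, low-accuracy contributors are filtered out, and the remaining judgments are aggregated by majority vote. Function names and thresholds are illustrative, not CrowdFlower's implementation.

```python
from collections import Counter

def trusted_contributors(gold_answers, judgments, min_accuracy=0.7):
    """gold_answers: {unit_id: answer}; judgments: {worker: {unit_id: answer}}."""
    trusted = set()
    for worker, answers in judgments.items():
        graded = [(answers[u] == a) for u, a in gold_answers.items() if u in answers]
        if graded and sum(graded) / len(graded) >= min_accuracy:
            trusted.add(worker)
    return trusted

def aggregate(judgments, trusted):
    """Majority vote over the judgments of trusted workers only."""
    votes = {}
    for worker in trusted:
        for unit, answer in judgments[worker].items():
            votes.setdefault(unit, Counter())[answer] += 1
    return {unit: counter.most_common(1)[0][0] for unit, counter in votes.items()}
```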

【Keywords】: crowdsourcing; data; workforce

Modern RDBMSs 3

93. Query optimization in microsoft SQL server PDW.

Paper Link】 【Pages】:767-776

【Authors】: Srinath Shankar ; Rimma V. Nehme ; Josep Aguilar-Saborit ; Andrew Chung ; Mostafa Elhemali ; Alan Halverson ; Eric Robinson ; Mahadevan Sankara Subramanian ; David J. DeWitt ; César A. Galindo-Legaria

【Abstract】: In recent years, Massively Parallel Processors have increasingly been used to manage and query vast amounts of data. Dramatic performance improvements are achieved through distributed execution of queries across many nodes. Query optimization for such systems is a challenging and important problem. In this paper we describe the Query Optimizer inside the SQL Server Parallel Data Warehouse product (PDW QO). We leverage existing QO technology in Microsoft SQL Server to implement a cost-based optimizer for distributed query execution. By properly abstracting metadata we can readily reuse existing logic for query simplification, space exploration and cardinality estimation. Unlike earlier approaches that simply parallelize the best serial plan, our optimizer considers a rich space of execution alternatives, and picks one based on a cost-model for the distributed execution environment. The result is a high-quality, effective query optimizer for distributed query processing in an MPP.

【Keywords】: query optimization

94. F1: the fault-tolerant distributed RDBMS supporting google's ad business.

Paper Link】 【Pages】:777-778

【Authors】: Jeff Shute ; Mircea Oancea ; Stephan Ellner ; Ben Handy ; Eric Rollins ; Bart Samwel ; Radek Vingralek ; Chad Whipkey ; Xin Chen ; Beat Jegerlehner ; Kyle Littlefield ; Phoenix Tong

【Abstract】: Many of the services that are critical to Google's ad business have historically been backed by MySQL. We have recently migrated several of these services to F1, a new RDBMS developed at Google. F1 implements rich relational database features, including a strictly enforced schema, a powerful parallel SQL query engine, general transactions, change tracking and notification, and indexing, and is built on top of a highly distributed storage system that scales on standard hardware in Google data centers. The store is dynamically sharded, supports transactionally-consistent replication across data centers, and is able to handle data center outages without data loss. The strong consistency properties of F1 and its storage system come at the cost of higher write latencies compared to MySQL. Having successfully migrated a rich customer-facing application suite at the heart of Google's ad business to F1, with no downtime, we will describe how we restructured schema and applications to largely hide this increased latency from external users. The distributed nature of F1 also allows it to scale easily and to support significantly higher throughput for batch workloads than a traditional RDBMS. With F1, we have built a novel hybrid system that combines the scalability, fault tolerance, transparent sharding, and cost benefits so far available only in "NoSQL" systems with the usability, familiarity, and transactional guarantees expected from an RDBMS.

【Keywords】: business applications; mysql; query engines; relational schema; scalability; sql; synchronous replication; transactional consistency

95. Oracle in-database hadoop: when mapreduce meets RDBMS.

Paper Link】 【Pages】:779-790

【Authors】: Xueyuan Su ; Garret Swart

【Abstract】: Big data is the tar sands of the data world: vast reserves of raw gritty data whose valuable information content can only be extracted at great cost. MapReduce is a popular parallel programming paradigm well suited to the programmatic extraction and analysis of information from these unstructured Big Data reserves. The Apache Hadoop implementation of MapReduce has become an important player in this market due to its ability to exploit large networks of inexpensive servers. The increasing importance of unstructured data has led to the interest in MapReduce and its Apache Hadoop implementation, which has led to the interest of data processing vendors in supporting this programming style. Oracle RDBMS has had support for the MapReduce paradigm for many years through the mechanism of user defined pipelined table functions and aggregation objects. However, such support has not been Hadoop source compatible. Native Hadoop programs needed to be rewritten before becoming usable in this framework. The ability to run Hadoop programs inside the Oracle database provides a versatile solution to database users, allowing them to use programming skills they may already possess and to exploit the growing Hadoop eco-system. In this paper, we describe a prototype of Oracle In-Database Hadoop that supports the running of native Hadoop applications written in Java. This implementation executes Hadoop applications using the efficient parallel capabilities of the Oracle database and a subset of the Apache Hadoop infrastructure. This system's target audience includes both SQL and Hadoop users. We discuss the architecture and design, and in particular, demonstrate how MapReduce functionalities are seamlessly integrated within SQL queries. We also share our experience in building such a system within the Oracle database and follow-on topics that we think are promising areas for exploration.

【Keywords】: hadoop; mapreduce; parallel query execution

Big data 3

96. TAO: how facebook serves the social graph.

Paper Link】 【Pages】:791-792

【Authors】: Venkateshwaran Venkataramani ; Zach Amsden ; Nathan Bronson ; George Cabrera III ; Prasad Chakka ; Peter Dimov ; Hui Ding ; Jack Ferris ; Anthony Giardullo ; Jeremy Hoon ; Sachin Kulkarni ; Nathan Lawrence ; Mark Marchukov ; Dmitri Petrov ; Lovro Puzar

【Abstract】: Over 800 million people around the world share their social interactions with friends on Facebook, providing a rich body of information referred to as the social graph. In this talk, I describe how we model and serve this graph. Our model uses typed nodes (fbobjects) and edges (associations) to express the relationships and actions that happen on Facebook. We access the graph via a simple API that provides queries over the set of same-typed associations leaving an object. We have found this API to be both sufficiently expressive and amenable to a scalable implementation. In the last segment of the talk I describe the design of TAO, our graph data store. TAO is a distributed implementation of the fbobject and association API that has been serving production traffic at Facebook for more than 2 years.
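
A toy Python sketch of an object-and-association style API in the spirit of the description above; the class and method names are illustrative and do not reflect TAO's actual interface.

```python
# A minimal sketch of typed objects, typed directed associations, and a query
# over the same-typed associations leaving an object.
class GraphStore:
    def __init__(self):
        self.objects = {}          # id -> (otype, data)
        self.assocs = {}           # (id1, atype) -> list of (time, id2, data)

    def object_add(self, oid, otype, data):
        self.objects[oid] = (otype, data)

    def assoc_add(self, id1, atype, id2, time, data=None):
        self.assocs.setdefault((id1, atype), []).append((time, id2, data))

    def assoc_range(self, id1, atype, limit=10):
        """Most recent `limit` associations of type atype leaving id1."""
        edges = sorted(self.assocs.get((id1, atype), []),
                       key=lambda e: e[0], reverse=True)
        return edges[:limit]

g = GraphStore()
g.object_add("u1", "user", {"name": "alice"})
g.object_add("p9", "post", {"text": "hello"})
g.assoc_add("u1", "authored", "p9", time=1234)
print(g.assoc_range("u1", "authored"))
```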

【Keywords】: distributed systems; facebook; social graph; tao

97. Large-scale machine learning at twitter.

Paper Link】 【Pages】:793-804

【Authors】: Jimmy Lin ; Alek Kolcz

【Abstract】: The success of data-driven solutions to difficult problems, along with the dropping costs of storing and processing massive amounts of data, has led to growing interest in large-scale machine learning. This paper presents a case study of Twitter's integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform. We begin with an overview of this platform, which handles "traditional" data warehousing and business intelligence tasks for the organization. The core of this work lies in recent Pig extensions to provide predictive analytics capabilities that incorporate machine learning, focused specifically on supervised classification. In particular, we have identified stochastic gradient descent techniques for online learning and ensemble methods as being highly amenable to scaling out to large amounts of data. In our deployed solution, common machine learning tasks such as data sampling, feature generation, training, and testing can be accomplished directly in Pig, via carefully crafted loaders, storage functions, and user-defined functions. This means that machine learning is just another Pig script, which allows seamless integration with existing infrastructure for data management, scheduling, and monitoring in a production environment, as well as access to rich libraries of user-defined functions and the materialized output of other scripts.
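
The online-learning component can be pictured with a short Python sketch of logistic regression trained by stochastic gradient descent over sparse examples, the kind of learner the paper wraps in Pig storage functions and user-defined functions; the sketch is illustrative, not Twitter's implementation.

```python
import math

def sgd_logistic_regression(stream, num_features, lr=0.1):
    """stream yields (features, label) with sparse features as (index, value) pairs
    and label in {0, 1}; one SGD update per example."""
    w = [0.0] * num_features
    for features, label in stream:
        z = sum(w[i] * x for i, x in features)       # sparse dot product
        p = 1.0 / (1.0 + math.exp(-z))
        g = p - label                                 # gradient of log loss w.r.t. z
        for i, x in features:
            w[i] -= lr * g * x
    return w

# toy usage: two sparse training examples
model = sgd_logistic_regression([([(0, 1.0)], 1), ([(1, 1.0)], 0)], num_features=2)
print(model)
```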

【Keywords】: ensembles; logistic regression; online learning; stochastic gradient descent

98. Recurring job optimization in scope.

Paper Link】 【Pages】:805-806

【Authors】: Nicolas Bruno ; Sameer Agarwal ; Srikanth Kandula ; Bing Shi ; Ming-Chuan Wu ; Jingren Zhou

【Abstract】:

【Keywords】: distributed computation; query optimization; recurring jobs; scope; statistics

Data integration and analytics 3

99. Dynamic workload driven data integration in tableau.

Paper Link】 【Pages】:807-816

【Authors】: Kristi Morton ; Ross Bunker ; Jock D. Mackinlay ; Robert Morton ; Chris Stolte

【Abstract】: Tableau is a commercial business intelligence (BI) software tool that supports interactive, visual analysis of data. Armed with a visual interface to data and a focus on usability, Tableau enables a wide audience of end-users to gain insight into their datasets. The user experience is a fluid process of interaction in which exploring and visualizing data takes just a few simple drag-and-drop operations (no programming or DB experience necessary). In this context of exploratory, ad-hoc visual analysis, we describe a novel approach to integrating large, heterogeneous data sources. We present a new feature in Tableau called data blending, which gives users the ability to create data visualization mashups from structured, heterogeneous data sources dynamically without any upfront integration effort. Users can author visualizations that automatically integrate data from a variety of sources, including data warehouses, data marts, text files, spreadsheets, and data cubes. Because our data blending system is workload driven, we are able to bypass many of the pain-points and uncertainty in creating mediated schemas and schema-mappings in current pay-as-you-go integration systems.

【Keywords】: data integration; visualization

100. Finding related tables.

Paper Link】 【Pages】:817-828

【Authors】: Anish Das Sarma ; Lujun Fang ; Nitin Gupta ; Alon Y. Halevy ; Hongrae Lee ; Fei Wu ; Reynold Xin ; Cong Yu

【Abstract】: We consider the problem of finding related tables in a large corpus of heterogeneous tables. Detecting related tables provides users a powerful tool for enhancing their tables with additional data and enables effective reuse of available public data. Our first contribution is a framework that captures several types of relatedness, including tables that are candidates for joins and tables that are candidates for union. Our second contribution is a set of algorithms for detecting related tables that can be either unioned or joined. We describe a set of experiments that demonstrate that our algorithms produce highly related tables. We also show that we can often improve the results of table search by pulling up tables that are ranked much lower based on their relatedness to top-ranked tables. Finally, we describe how to scale up our algorithms and show the results of running them on a corpus of over a million tables extracted from Wikipedia.
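
Two of the relatedness signals in the framework can be caricatured with a Python sketch based purely on column-name overlap; the actual algorithms use much richer evidence (value distributions, subject columns, web context), so this is only a simplified illustration.

```python
# A minimal sketch: union candidates share most of their schema, while join
# candidates share a linking column between otherwise different schemas.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def union_candidate(schema1, schema2, threshold=0.8):
    return jaccard(schema1, schema2) >= threshold

def join_candidate(schema1, schema2):
    shared = set(schema1) & set(schema2)
    return bool(shared) and not union_candidate(schema1, schema2)

print(union_candidate(["city", "country", "population"], ["city", "country", "population", "year"]))
print(join_candidate(["city", "population"], ["city", "mayor", "founded"]))
```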

【Keywords】: data integration; related tables; web tables

101. Optimizing analytic data flows for multiple execution engines.

Paper Link】 【Pages】:829-840

【Authors】: Alkis Simitsis ; Kevin Wilkinson ; Malú Castellanos ; Umeshwar Dayal

【Abstract】: Next generation business intelligence involves data flows that span different execution engines, contain complex functionality like data/text analytics, machine learning operations, and need to be optimized against various objectives. Creating correct analytic data flows in such an environment is a challenging task and is both labor-intensive and time-consuming. Optimizing these flows is currently an ad-hoc process where the result is largely dependent on the abilities and experience of the flow designer. Our previous work addressed analytic flow optimization for multiple objectives over a single execution engine. This paper focuses on optimizing flows for a single objective, namely performance, over multiple execution engines. We consider flows that span a DBMS, a Map-Reduce engine, and an orchestration engine (e.g., an ETL tool or scripting language). This configuration is emerging as a common paradigm used to combine analysis of unstructured data with analysis of structured data (e.g., NoSQL plus SQL). We present flow transformations that model data shipping, function shipping, and operation decomposition and we describe how flow graphs are generated for multiple engines. Performance results for various configurations demonstrate the benefit of optimization.

【Keywords】: bi; data flows; databases; map-reduce; optimization

Query processing and war stories 3

102. CloudRAMSort: fast and efficient large-scale distributed RAM sort on shared-nothing cluster.

Paper Link】 【Pages】:841-850

【Authors】: Changkyu Kim ; Jongsoo Park ; Nadathur Satish ; Hongrae Lee ; Pradeep Dubey ; Jatin Chhugani

【Abstract】: Sorting is a fundamental kernel used in many database operations. The total memory available across cloud computers is now sufficient to store even hundreds of terabytes of data in-memory. Applications requiring high-speed data analysis typically use in-memory sorting. The two most important factors in designing a high-speed in-memory sorting system are the single-node sorting performance and inter-node communication. In this paper, we present CloudRAMSort, a fast and efficient system for large-scale distributed sorting on shared-nothing clusters. CloudRAMSort performs multi-node optimizations by carefully overlapping computation with inter-node communication. The system uses a dynamic multi-stage random sampling approach for improved load-balancing between nodes. CloudRAMSort maximizes per-node efficiency by exploiting modern architectural features such as multiple cores and SIMD (Single-Instruction Multiple Data) units. This holistic combination results in the highest sorting performance on distributed shared-nothing platforms. CloudRAMSort sorts 1 Terabyte (TB) of data in 4.6 seconds on a 256-node Xeon X5680 cluster called the Intel Endeavor system. CloudRAMSort also performs well on heavily skewed input distributions, sorting 1 TB of data generated using a Zipf distribution in less than 5 seconds. We also provide a detailed analytical model that accurately projects (within avg. 7%) the performance of CloudRAMSort with varying tuple sizes and interconnect bandwidths. Our analytical model serves as a useful tool to analyze performance bottlenecks on current systems and project performance with future architectural advances. With architectural trends of increasing number of cores, bandwidth, SIMD width, cache-sizes, and interconnect bandwidth, we believe CloudRAMSort would be the system of choice for distributed sorting of large-scale in-memory data of current and future systems.
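
The sampling-based partitioning can be pictured with a Python sketch of a sample sort: nodes contribute key samples, splitters define key ranges, tuples are shuffled to nodes by range, and locally sorted buckets concatenate into a global order. The real system additionally overlaps communication with computation and uses SIMD-optimized local sorts; this sketch only shows the partitioning skeleton.

```python
import bisect, random

def distributed_sort(partitions):
    """partitions: one list of keys per node; returns per-node sorted buckets
    whose concatenation is globally sorted."""
    sample = [k for part in partitions for k in random.sample(part, min(8, len(part)))]
    num_nodes = len(partitions)
    splitters = sorted(sample)[:: max(1, len(sample) // num_nodes)][1:num_nodes]
    shuffled = [[] for _ in range(num_nodes)]
    for part in partitions:                       # "shuffle" keys to their range owner
        for key in part:
            shuffled[bisect.bisect_right(splitters, key)].append(key)
    return [sorted(bucket) for bucket in shuffled]

parts = [[5, 3, 9, 1], [8, 2, 7, 4], [6, 0, 11, 10]]
print(distributed_sort(parts))
```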

【Keywords】: cloud computing; distributed; many core; multi core; ram sort; simd; sorting

103. Adaptive optimizations of recursive queries in teradata.

Paper Link】 【Pages】:851-860

【Authors】: Ahmad Ghazal ; Dawit Yimam Seid ; Alain Crolotte ; Mohammed Al-Kateb

【Abstract】: Recursive queries were introduced as part of ANSI SQL 99 to support processing of hierarchical data typical of air flight schedules, bill-of-materials, data cube dimension hierarchies, and ancestor-descendant information (e.g. XML data stored in relations). Recently, recursive queries have also found extensive use in web data analysis such as social network and click stream data. Teradata implemented recursive queries in V2R6 using static plans whereby a query is executed in multiple iterations, each iteration corresponding to one level of the recursion. Such a static planning strategy may not be optimal since the demographics of intermediate results from recursive iterations often vary to a great extent. Gathering feedback at each iteration could address this problem by providing size estimates to the optimizer which, in turn, can produce an execution plan for the next iteration. However, such a full feedback scheme suffers from lack of pipelining and the inability to exploit global optimizations across the different recursion iterations. In this paper, we propose adaptive optimization techniques that avoid the issues with static as well as full feedback optimization approaches. Our approach employs a mix of multi-iteration pre-planning and dynamic feedback techniques which are generally applicable to any recursive query implementation in an RDBMS. We also validated the effectiveness of our proposed techniques by conducting experiments on a prototype implementation using real-life social network data from the FriendFeed online blogging service.
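
A Python sketch of iteration-by-iteration evaluation of a simple recursive query (transitive closure) shows where per-iteration feedback becomes available; the comment marks the point at which an adaptive optimizer could re-plan, which a static plan cannot exploit. The code is illustrative only and is not Teradata's implementation.

```python
def transitive_closure(edges):
    """edges: a set of (source, target) pairs; one loop iteration per recursion level."""
    result = set(edges)
    frontier = set(edges)
    while frontier:
        # Feedback point: len(frontier) is known here, so an optimizer could
        # choose a different join strategy (e.g. hash vs. nested-loop) per level.
        frontier = {(a, d) for (a, b) in frontier for (c, d) in edges if b == c} - result
        result |= frontier
    return result

print(transitive_closure({(1, 2), (2, 3), (3, 4)}))
```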

【Keywords】: database; optimization; recursion; sampling

104. From x100 to vectorwise: opportunities, challenges and things most researchers do not think about.

Paper Link】 【Pages】:861-862

【Authors】: Marcin Zukowski ; Peter A. Boncz

【Abstract】: In 2008 a group of researchers behind the X100 database kernel created Vectorwise: a spin-off which together with the Actian corporation (previously Ingres) worked on bringing this technology to the market. Today, Vectorwise is a popular product and one of the examples of conversion of a research prototype into successful commercial software. We describe here some of the interesting aspects of the work performed by the Vectorwise development team in the process, and discuss the opportunities and challenges resulting from the decision of integrating a prototype-quality kernel with Ingres, an established commercial product. We also discuss how requirements coming from real-life scenarios sometimes clashed with design choices and simplifications often found in research projects, and how the Vectorwise team addressed some of them.

【Keywords】: actian; dbms; vectorwise; x100

Undergraduate poster competition 8

105. Declarative web application development: encapsulating dynamic JavaScript widgets (abstract only).

Paper Link】 【Pages】:863

【Authors】: Robert Bolton ; David Ing ; Christopher Rebert ; Kristina Lam Thai

【Abstract】: The development of modern, highly interactive AJAX Web applications that enable dynamic visualization of data requires writing a great deal of tedious "plumbing code" to interface data between browser-based DOM and AJAX components, the application server, and the SQL database. Worse, each of these layers utilizes a different language. Further, much code is needed to keep the page and application states in sync using an imperative paradigm, which hurts simplicity. These factors result in a frustrating experience for today's Web developer. The FORWARD Project aims to alleviate this frustration by enabling pages that are "rendered views", in the SQL sense of "view". Our work in the project has led to a highly declarative approach whereby JavaScript/AJAX UI widgets automatically render views over the application state (database + session data + page data) without requiring the developer to tediously code how changes to the application state lead to invocation of the components' update methods. In contrast to conventional Web application development approaches, a FORWARD application involves only two languages, both declarative: an extended version of SQL, and an XML-based language for configuration and orchestration. The framework automatically handles efficient exchange of user input and changes to the underlying data, and updates the application state accordingly. The developer does not need to write any JavaScript or explicit updating code themselves. On the client side, FORWARD "units" wrap widgets using JavaScript to collect user input, directly display data, and reflect server-side updates to the data. On the server side, units contain Java code necessary to expose their functionality to the FORWARD framework and define their XML configuration representation. Our demo consists of a dynamically rendered webpage which internally uses AJAX to update a Google Maps widget that shows location markers for current Groupon deals in a specified area. It will illustrate that our SQL-driven approach makes this kind of rich dynamic webpage easy to write, with significant improvements in simplicity, brevity, and development time, while still providing the quality experience expected from top AJAX components. The amount of "plumbing code" is significantly reduced, enhancing the experience of AJAX Web application developers.

【Keywords】: ajax; sql; view maintenance

106. Towards scalable summarization and visualization of large text corpora (abstract only).

Paper Link】 【Pages】:863

【Authors】: Tyler Sliwkanich ; Douglas Schneider ; Aaron Yong ; Mitchell Home ; Denilson Barbosa

【Abstract】: Society is awash with problems requiring the analysis of vast quantities of text and data. From detecting flu trends out of twitter conversations to finding scholarly works answering specific questions, we rely more and more on computers to process text for us. Text analytics is the application of computational, mathematical, and statistical models to derive information from large quantities of data coming primarily as text. Our project provides fast and effective text-analytics tools for large document collections, such as the blogosphere. We use natural language processing and database techniques to extract, collect, analyze, visualize, and archive information extracted from text. We focus on discovering relationships between entities (people, places, organizations, etc.) mentioned in one or more sources (blog posts or news articles). We built a custom solution using mostly off-the-shelf, open-source tools to provide a scalable platform for users to search and analyze large text corpora. Currently, we provide two main outlets for users to discover these relations: (1) full-text search over the documents and (2) graph visualizations of the entities and their relationships. This provides the user with succinct and easily digestible information gleaned from the corpus as a whole. For example, we can easily pose queries like which companies were bought by Google? as entity:google relation:bought. The extracted data is stored on a combination of the noSQL database CouchDB and Apache's Lucene. This combination is justified as our work-flow consists of offline batch insertions with almost no updates. Because we support specialized queries, we can forgo the flexibility of traditional SQL solutions and materialize all necessary indices, which are used to quickly query large amounts of de-normalized data using MapReduce. Lucene provides a flexible and powerful query syntax to yield relevant ranked results to the user. Moreover, its indices are synchronized by a process subscribed to the list of database changes published by CouchDB. The graph visualizations rely on CouchDB's ability to export the data in any format: we currently use a customized graph visualization relying on XML data. Finally, we use memcached to further improve the performance, especially for queries involving popular entities.

【Keywords】: information extraction; text analytics

107. Reducing cache misses in hash join probing phase by pre-sorting strategy (abstract only).

Paper Link】 【Pages】:864

【Authors】: Gi-Hwan Oh ; Jae-Myung Kim ; Woon-Hak Kang ; Sang-Won Lee

【Abstract】: Recently, several studies on multi-core cache-aware hash joins have been carried out [Kim09VLDB, Blanas11SIGMOD]. In particular, the work of Blanas has shown that a rather simple no-partitioning hash join can outperform the work of Kim. Meanwhile, this simple but best-performing hash join still experiences severe cache misses in the probing phase. Because the key values of tuples in the outer relation are not sorted or clustered, each outer record has a different hashed key value and thus accesses a different hash bucket. Since the hash table of the inner table is usually much larger than the CPU cache, the reference to the inner table's hash bucket by each outer record is highly likely to incur a cache miss. To reduce cache misses in the probing phase, we propose a new join algorithm, Sorted Probing (SP for short), which pre-sorts the hashed key values of the outer table so that accesses to the hash buckets of the inner table have strong temporal locality, thus minimizing cache misses during the probing phase. As the sorting technique, we used the cache-aware AlphaSort, which extracts from each record of the data set its key and a pointer and then sorts the resulting (key, rec_ptr) pairs. For performance evaluation, we used two hash join algorithms from Blanas' work, no partitioning (NP) and independent partitioning (IP), in a standard C++ program provided by Blanas. We also implemented AlphaSort and added it before the probing phase of NP and IP, calling the resulting algorithms NP+SP and IP+SP. For a synthetic workload, IP+SP outperforms all other algorithms, by up to 30%.
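
A minimal Python sketch of the pre-sorting idea, using plain lists of (key, payload) tuples for the two relations: sorting the outer tuples by the hash of their key clusters probes to the same bucket, which is the temporal-locality effect SP exploits. AlphaSort's key/pointer extraction is only hinted at in a comment; this is not the paper's C++ implementation.

```python
def build_hash_table(inner):
    """Build phase: hash the (smaller) inner relation on its join key."""
    table = {}
    for key, payload in inner:
        table.setdefault(key, []).append(payload)
    return table

def sorted_probe_join(inner, outer):
    """Probe phase with pre-sorting of the outer relation by hash(key)."""
    table = build_hash_table(inner)
    # AlphaSort-style optimization would sort (hash(key), record pointer) pairs;
    # here we simply sort the outer tuples by the hash of their key.
    for key, outer_payload in sorted(outer, key=lambda t: hash(t[0])):
        for inner_payload in table.get(key, ()):
            yield key, inner_payload, outer_payload

inner = [(1, "a"), (2, "b")]
outer = [(2, "x"), (1, "y"), (2, "z")]
print(list(sorted_probe_join(inner, outer)))
```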

【Keywords】: cache miss; hash join; sorted probing

108. DP-tree: indexing multi-dimensional data under differential privacy (abstract only).

Paper Link】 【Pages】:864

【Authors】: Shangfu Peng ; Yin Yang ; Zhenjie Zhang ; Marianne Winslett ; Yong Yu

【Abstract】: ε-differential privacy (ε-DP) is a strong and rigorous scheme for protecting individuals' privacy while releasing useful statistical information. The main idea is to inject random noise into the results of statistical queries, such that the existence of any single record has negligible impact on the distributions of query results. The accuracy of such randomized results depends heavily upon the query processing technique, which has been an active research topic in recent years. So far, most existing methods focus on 1-dimensional queries. The only work that handles multi-dimensional query processing under ε-DP is [1], which indexes the sensitive data using variants of the quad-tree and the k-d-tree. As we point out in this paper, these structures are inherently suboptimal for answering queries under ε-DP. Consequently, the solutions in [1] suffer from several serious drawbacks, including limited and unstable query accuracy, as well as bias towards certain types of queries. Motivated by this, we propose the DP-tree, a novel index structure for multi-dimensional query processing under ε-DP that eliminates the problems encountered by the methods in [1]. Further, we show that the effectiveness of the DP-tree can be improved using statistical information about the query workload. Extensive experiments using real and synthetic datasets confirm that the DP-tree achieves significantly higher query accuracy than existing methods. Interestingly, an adaptation of the DP-tree also outperforms previous 1D solutions in their restricted scope, by large margins.
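
For readers unfamiliar with the setting, the following Python sketch shows the general pattern of publishing hierarchical spatial counts under ε-DP, with Laplace noise and the privacy budget split evenly across levels; it illustrates the baseline quad-tree style of approach referenced above, not the DP-tree algorithm itself.

```python
import math, random

def laplace_noise(scale):
    """Sample from Laplace(0, scale) via inverse-CDF (count queries have sensitivity 1)."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_quadtree_counts(points, bbox, depth, epsilon):
    """Publish noisy counts for a quadtree of `depth` levels; epsilon is split
    evenly per level (a simplification of budget allocation). Boxes are half-open,
    so points on the outer max edge are dropped in this toy version."""
    eps_per_level = epsilon / depth
    def recurse(pts, box, level):
        node = {"box": box, "count": len(pts) + laplace_noise(1.0 / eps_per_level)}
        if level < depth - 1:
            x0, y0, x1, y1 = box
            xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
            children = [(x0, y0, xm, ym), (xm, y0, x1, ym),
                        (x0, ym, xm, y1), (xm, ym, x1, y1)]
            node["children"] = [
                recurse([p for p in pts if b[0] <= p[0] < b[2] and b[1] <= p[1] < b[3]],
                        b, level + 1)
                for b in children]
        return node
    return recurse(points, bbox, 0)

print(noisy_quadtree_counts([(1, 1), (3, 2), (2, 3)], (0, 0, 4, 4), depth=2, epsilon=1.0))
```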

【Keywords】: consistency; differential privacy; multi-dimensional index; workload-aware query processing

109. Temporal provenance discovery in micro-blog message streams (abstract only).

Paper Link】 【Pages】:864

【Authors】: Zijun Xue ; Junjie Yao ; Bin Cui

【Abstract】: Recent years have witnessed a flourishing increase in micro-blog message applications. Prominent examples include Twitter, Facebook's status updates, and Sina Weibo in China. Messages in these applications are short (140 characters per message) and easy to create, and the subscription and re-sharing features make them easy to propagate. Micro-blog applications provide abundant information that reflects world-scale user interests and the social pulse, but this precious corpus is also noisy and fast-changing, which hinders effective understanding and management. In this work, we propose a micro-blog provenance model to capture temporal connections among micro-blog messages. Here, provenance refers to data origin identification and transformation logging, which has proved of great value in recent database and workflow systems. The provenance model represents the development trail of messages and their changes explicitly. We select various types of connections in micro-blog applications to identify the provenance. To cope with the real-time micro-message deluge, we discuss a novel message grouping approach to encode and maintain the provenance information. A summary index structure enables efficient provenance updating: we collect incoming messages and compare them against an in-memory index to associate them with related ones. The closely related messages form a virtual provenance representation at a coarse granularity, and memory contents are periodically dumped to disk. In the actual implementation, we also introduce several adaptive pruning strategies to improve the efficiency of provenance discovery, using temporal decay and granularity levels to filter out low-probability messages. In the demonstration, we show the usefulness of provenance information for rich query retrieval and dynamic message tracking for effective message organization. The real-time collection approach shows advantages over several baselines. Experiments conducted on a real dataset verify the effectiveness and efficiency of our provenance approach: the partial-indexing strategy and the other restriction strategies maintain accuracy at 90% and a return rate of 60% with reasonably low memory usage. This is the first work towards provenance-based indexing support for micro-blog platforms.

【Keywords】: indexing; micro-blog; provenance

110. SigSpot: mining significant anomalous regions from time-evolving networks (abstract only).

Paper Link】 【Pages】:865

【Authors】: Misael Mongiovì ; Petko Bogdanov ; Razvan Ranca ; Ambuj K. Singh ; Evangelos E. Papalexakis ; Christos Faloutsos

【Abstract】: Anomaly detection in dynamic networks has a rich gamut of application domains, such as road networks, communication networks and water distribution networks. An anomalous event, such as a traffic accident, denial of service attack or a chemical spill, can cause a local shift from normal behavior in the network state that persists over an interval of time. Detecting such anomalous regions of network and time extent in large real-world networks is a challenging task. Existing anomaly detection techniques focus on either the time series associated with individual network edges or on global anomalies that affect the entire network. In order to detect anomalous regions, one needs to consider both the time and the affected network substructure jointly, which brings forth computational challenges due to the combinatorial nature of possible solutions. We propose the problem of mining all Significant Anomalous Regions (SAR) in time-evolving networks that asks for the discovery of connected temporal subgraphs comprised of edges that significantly deviate from normal in a persistent manner. We propose an optimal Baseline algorithm for the problem and an efficient approximation, called SIGSPOT. Compared to Baseline, SIGSPOT is up to one order of magnitude faster in real data, while achieving less than 10% average relative error rate. In synthetic datasets it is more than 30 times faster than Baseline with 94% accuracy and solves efficiently large instances that are infeasible (more than 10 hours running time) for Baseline. We demonstrate the utility of SIGSPOT for inferring accidents on road networks and study its scalability when detecting anomalies in social, transportation and synthetic evolving networks, spanning up to 1GB.

【Keywords】: anomaly detection; dynamic network; significant anomalous regions

111. VRRC: web based tool for visualization and recommendation on co-authorship network (abstract only).

Paper Link】 【Pages】:865

【Authors】: Eduardo M. Barbosa ; Mirella M. Moro ; Giseli Rabello Lopes ; José Palazzo Moreira de Oliveira

【Abstract】: Scientific studies are usually developed by contributions from different researchers. Analyzing such collaborations is often necessary, for example, when evaluating the quality of a research group. Also, identifying new partnership possibilities within a set of researchers is frequently desired, for example, when looking for partners in foreign countries. Both analysis and identification are not easy tasks, and are usually done manually. This work presents VRRC, a new approach for visualizing recommendations of people within a co-authorship network (i.e., a graph in which nodes represent researchers and edges represent their co-authorships). VRRC input is a publication list from which it extracts the co-authorships. VRRC then recommends which relations could be created or intensified based on metrics designed for evaluating co-authorship networks. Finally, VRRC provides brand new ways to visualize not only the final recommendations but also the intermediate interactions within the network, including: a complete representation of the co-authorship network; an overview of the evolution of collaborations over time; and the recommendations for each researcher to initiate or intensify cooperation. Some visualizations are interactive, allowing users to filter data by time frame and highlight specific collaborations. The contributions of our work, compared to the state of the art, can be summarized as follows: (i) VRRC can be applied to any co-authorship network, it provides both net and recommendation visualizations, it is a Web-based tool and it allows easy sharing of the created visualizations (existing tools do not offer all these features together); (ii) VRRC establishes graphical representations to ease the visualization of its results (traditional approaches present the recommendation results through simple lists or charts); and (iii) with VRRC, the user can identify not only new possible collaborations but also existing cooperation that can be intensified (current recommendation approaches only indicate new collaborations). This work was partially supported by CNPq, Brazil.

【Keywords】: co-authorship networks; recommendation; visualization

112. Fast sampling word correlations of high dimensional text data (abstract only).

Paper Link】 【Pages】:866

【Authors】: Frank Rosner ; Alexander Hinneburg ; Martin Gleditzsch ; Matthias Priebe ; Andreas Both

【Abstract】: Finding correlated words in large document collections is an important ingredient for text analytics. The naïve approach computes the correlations of each word against all other words and filters for highly correlated word pairs. Clearly, this quadratic method cannot be applied to real world scenarios with millions of documents and words. Our main contribution is to transform the task of finding highly correlated word pairs into a word clustering problem that is efficiently solved by locality sensitive hashing (LSH). A key insight of our new method is to note that the empirical Pearson correlation between two words is the cosine of the angle between the centered versions of their word vectors. The angle can be approximated by an LSH scheme. Although centered word vectors are not sparse, the computation of the LSH hash functions can exploit the inherent sparsity of the word data. This leads to an efficient way to detect collisions between centered word vectors having a small angle and therefore provides a fast algorithm to sample highly correlated word pairs. Our new method based on LSH improves run time complexity of the enhanced naïve algorithm. This algorithm reduces the dimensionality of the word vectors using random projection and approximates correlations by computing cosine similarity on the reduced and centered word vectors. However, this method still has quadratic run time. Our new method replaces the filtering for high correlations in the naïve algorithm with finding hash collisions, which can be done by sorting the hash values of the word vectors. We evaluate the scalability of our new algorithm to large text collections.
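
A minimal NumPy sketch of the signed-random-projection idea under stated assumptions: the word vectors are held as a dense (words × documents) matrix, centering each word vector makes cosine similarity equal to Pearson correlation, and words whose bit signatures collide become candidate correlated pairs. The paper's sparsity-exploiting hash computation is not reproduced here.

```python
import numpy as np

def simhash_signatures(word_vectors, num_bits=32, seed=0):
    """word_vectors: (num_words, num_docs) matrix. One bit per random hyperplane
    applied to the centered vectors (SimHash-style LSH for cosine/correlation)."""
    rng = np.random.default_rng(seed)
    centered = word_vectors - word_vectors.mean(axis=1, keepdims=True)
    planes = rng.standard_normal((word_vectors.shape[1], num_bits))
    return (centered @ planes) > 0

def candidate_pairs(signatures):
    """Words whose full signatures collide are candidates for high correlation."""
    buckets = {}
    for idx, sig in enumerate(signatures):
        buckets.setdefault(sig.tobytes(), []).append(idx)
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                yield members[i], members[j]

vectors = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 1.0, 0.0]])
print(list(candidate_pairs(simhash_signatures(vectors, num_bits=8))))
```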

【Keywords】: locality sensitive hashing; pearson correlation; text mining; word correlations