Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010. ACM 【DBLP Link】
【Paper Link】 【Pages】:1-2
【Authors】: Jon M. Kleinberg
【Abstract】:
【Keywords】: information access; news media; real-time search; social networks
【Paper Link】 【Pages】:3-14
【Authors】: Marcus Fontoura ; Suhas Sadanandan ; Jayavel Shanmugasundaram ; Sergei Vassilvitskii ; Erik Vee ; Srihari Venkatesan ; Jason Y. Zien
【Abstract】: The problem of efficiently evaluating a large collection of complex Boolean expressions - beyond simple conjunctions and Disjunctive/Conjunctive Normal Forms (DNF/CNF) - occurs in many emerging online advertising applications such as advertising exchanges and automatic targeting. The simple solution of normalizing complex Boolean expressions to DNF or CNF form, and then using existing methods for evaluating such expressions is not always effective because of the exponential blow-up in the size of expressions due to normalization. We thus propose a novel method for evaluating complex expressions, which leverages existing techniques for evaluating leaf-level conjunctions, and then uses a bottom-up evaluation technique to only process the relevant parts of the complex expressions that contain the matching conjunctions. We develop two such bottom-up evaluation techniques, one based on Dewey IDs and another based on mapping Boolean expressions to one-dimensional intervals. Our experimental evaluation based on data obtained from an online advertising exchange shows that the proposed techniques are efficient and scalable, both with respect to space usage as well as evaluation time.
【Keywords】: boolean expressions; dewey; interval; pub/sub
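To illustrate the bottom-up idea in the abstract above, here is a minimal Python sketch (ours, not the paper's Dewey-ID or interval technique): leaf-level conjunctions are assumed to have been matched first, e.g. by an inverted index, and only then is the complex expression tree evaluated, avoiding normalization to DNF/CNF.

```python
# Hypothetical sketch: leaf conjunctions are matched first, then the
# complex Boolean expression is evaluated from those matches instead of
# being normalized to DNF/CNF. Class and field names are ours.

class Node:
    def __init__(self, op=None, children=None, leaf_id=None):
        self.op = op            # "AND" / "OR" for internal nodes, None for leaves
        self.children = children or []
        self.leaf_id = leaf_id  # identifier of a leaf-level conjunction

def evaluate(node, matched_leaves):
    """Evaluate a complex expression given the set of leaf conjunctions
    that matched the input event."""
    if node.op is None:
        return node.leaf_id in matched_leaves
    results = (evaluate(c, matched_leaves) for c in node.children)
    return all(results) if node.op == "AND" else any(results)

# (A OR B) AND C
expr = Node("AND", [Node("OR", [Node(leaf_id="A"), Node(leaf_id="B")]),
                    Node(leaf_id="C")])
```

Note this sketch recurses over the whole tree; the paper's contribution is precisely to touch only the relevant parts.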
【Paper Link】 【Pages】:15-26
【Authors】: Quoc Trung Tran ; Chee-Yong Chan
【Abstract】: One useful feature that is missing from today's database systems is an explain capability that enables users to seek clarifications on unexpected query results. There are two types of unexpected query results that are of interest: the presence of unexpected tuples, and the absence of expected tuples (i.e., missing tuples). Clearly, it would be very helpful to users if they could pose follow-up why and why-not questions to seek clarifications on, respectively, unexpected and expected (but missing) tuples in query results. While the why questions can be addressed by applying established data provenance techniques, the problem of explaining the why-not questions has received very little attention. There are currently two explanation models proposed for why-not questions. The first model explains a missing tuple t in terms of modifications to the database such that t appears in the query result wrt the modified database. The second model explains by identifying the data manipulation operator in the query evaluation plan that is responsible for excluding t from the result. In this paper, we propose a new paradigm for explaining a why-not question that is based on automatically generating a refined query whose result includes both the original query's result as well as the user-specified missing tuple(s). In contrast to the existing explanation models, our approach goes beyond merely identifying the "culprit" query operator responsible for the missing tuple(s) and is useful for applications where it is not appropriate to modify the database to obtain missing tuples.
【Keywords】: query explanation; query refinement; why-not questions
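A toy Python example of the refinement paradigm described above (not the paper's actual algorithm): relaxing the bounds of a conjunctive range query so a user-specified missing tuple qualifies. Since relaxation only enlarges the result, the original query's answers are preserved, as the paradigm requires.

```python
# Toy sketch (names and scenario are ours): widen the bounds of a
# conjunctive range query just enough to admit the missing tuple.

def refine(upper_bounds, lower_bounds, missing_tuple):
    new_upper = {a: max(v, missing_tuple[a]) for a, v in upper_bounds.items()}
    new_lower = {a: min(v, missing_tuple[a]) for a, v in lower_bounds.items()}
    return new_upper, new_lower

# "Why is hotel h not in my result?"  Original query: price <= 100, rating >= 4
up, lo = refine({"price": 100}, {"rating": 4}, {"price": 120, "rating": 3})
# Refined query: price <= 120, rating >= 3 -- h now appears, and every
# tuple satisfying the original bounds still satisfies the refined ones.
```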
【Paper Link】 【Pages】:27-38
【Authors】: Feng Zhao ; Gautam Das ; Kian-Lee Tan ; Anthony K. H. Tung
【Abstract】: Computing preference queries has received a lot of attention in the database community. It is common that the user is unsure of his/her preference, so care must be taken to elicit the preference of the user correctly. In this paper, we propose to elicit the preferred ordering of a user by utilizing skyline objects as the representatives of the possible ordering. We introduce the notion of order-based representative skylines which selects representatives based on the orderings that they represent. To further facilitate preference exploration, a hierarchical clustering algorithm is applied to compute a denogram on the skyline objects. By coupling the hierarchical clustering with visualization techniques, we allow users to refine their preference weight settings by browsing the hierarchy. Extensive experiments were conducted and the results validate the feasibility and the efficiency of our approach.
【Keywords】: preference query; sampling; skyline query; visualization
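For readers unfamiliar with skylines, a minimal Python sketch of the skyline objects the abstract builds on (smaller is better in every dimension); the order-based representative selection and the hierarchical clustering are beyond this sketch.

```python
# Minimal skyline computation: a point is in the skyline if no other
# point dominates it (is at least as good everywhere, better somewhere).

def dominates(p, q):
    """p dominates q if p <= q in every dimension and p < q in some."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

pts = [(1, 5), (3, 3), (5, 1), (4, 4), (2, 6)]
# skyline(pts) keeps (1, 5), (3, 3), (5, 1); the rest are dominated.
```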
【Paper Link】 【Pages】:39-50
【Authors】: Tobias Emrich ; Hans-Peter Kriegel ; Peer Kröger ; Matthias Renz ; Andreas Züfle
【Abstract】: Fast query processing of complex objects, e.g. spatial or uncertain objects, depends on efficient spatial pruning of objects' approximations, which are typically minimum bounding rectangles (MBRs). In this paper, we propose a novel effective and efficient criterion to determine the spatial topology between multi-dimensional rectangles. Given three rectangles R, A, and B in a multi-dimensional space, the task is to determine whether A is definitely closer to R than B. This domination relation is used in many applications to perform spatial pruning. Traditional techniques apply spatial pruning based on minimal and maximal distance. These techniques however show significant deficiencies in terms of effectiveness. We prove that our decision criterion is correct, complete, and efficient to compute even for high dimensional databases. In addition, we tackle the problem of computing the number of objects dominating an object o. The challenge here is to incorporate objects that only partially dominate o. In this work we will show how to detect such partial domination topology by using a modified version of our decision criterion. We propose strategies for conservatively and progressively estimating the total number of objects dominating an object. Our experiments show that the new pruning criterion, albeit very general and widely applicable, significantly outperforms current state-of-the-art pruning criteria.
【Keywords】: knn; pruning; rectangles; rknn; spatial
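For context, a Python sketch of the traditional MinDist/MaxDist pruning the abstract improves upon: A definitely dominates B with respect to R if even A's farthest point is closer to R than B's nearest point. Rectangle encoding and names are ours.

```python
# Baseline domination test via MinDist/MaxDist (the traditional criterion,
# not the paper's new one). A rectangle is a list of per-dimension
# (lower, upper) interval pairs.

def min_dist(R, X):
    d = 0.0
    for (rl, ru), (xl, xu) in zip(R, X):
        gap = max(rl - xu, xl - ru, 0.0)   # 0 if intervals overlap
        d += gap * gap
    return d ** 0.5

def max_dist(R, X):
    d = 0.0
    for (rl, ru), (xl, xu) in zip(R, X):
        far = max(ru - xl, xu - rl)        # farthest point pair in this dim
        d += far * far
    return d ** 0.5

def definitely_closer(R, A, B):
    """A is definitely closer to R than B (conservative MBR test)."""
    return max_dist(R, A) < min_dist(R, B)

R = [(0, 1), (0, 1)]
A = [(1, 2), (0, 1)]   # adjacent to R
B = [(5, 6), (0, 1)]   # far from R
```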
【Paper Link】 【Pages】:51-62
【Authors】: Haiquan Chen ; Wei-Shinn Ku ; Haixun Wang ; Min-Te Sun
【Abstract】: Radio Frequency Identification (RFID) technologies are used in many applications for data collection. However, raw RFID readings are usually of low quality and may contain many anomalies. An ideal solution for RFID data cleansing should address the following issues. First, in many applications, duplicate readings (by multiple readers simultaneously or by a single reader over a period of time) of the same object are very common. The solution should take advantage of the resulting data redundancy for data cleaning. Second, prior knowledge about the readers and the environment (e.g., prior data distribution, false negative rates of readers) may help improve data quality and remove data anomalies, and a desired solution must be able to quantify the degree of uncertainty based on such knowledge. Third, the solution should take advantage of given constraints in target applications (e.g., the number of objects in the same location cannot exceed a given value) to elevate the accuracy of data cleansing. There are a number of existing RFID data cleansing techniques. However, none of them supports all the aforementioned features. In this paper we propose a Bayesian inference based approach for cleaning RFID raw data. Our approach takes full advantage of data redundancy. To capture the likelihood, we design an n-state detection model and formally prove that the 3-state model can maximize the system performance. Moreover, in order to sample from the posterior, we devise a Metropolis-Hastings sampler with Constraints (MH-C), which incorporates constraint management to clean RFID raw data with high efficiency and accuracy. We validate our solution with a common RFID application and demonstrate the advantages of our approach through extensive simulations.
【Keywords】: data cleaning; probabilistic algorithms; spatio-temporal databases; uncertainty
【Paper Link】 【Pages】:63-74
【Authors】: Henning Köhler ; Xiaofang Zhou ; Shazia Wasim Sadiq ; Yanfeng Shu ; Kerry L. Taylor
【Abstract】: We investigate the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes. Our focus is on identifying attribute pairs whose value sets overlap, a pre-condition for typical joins over such attributes. However, real-world data sets are often 'dirty', especially when integrating data from different sources. To deal with this issue, we propose new similarity measures between sets of strings, which not only consider set based similarity, but also similarity between strings instances. To make the measures effective, we develop efficient algorithms for distributed sample creation and similarity computation. Test results show that for dirty data our measures are more accurate for measuring value overlap than existing sample-based methods, but we also observe that there is a clear tradeoff between accuracy and speed. This motivates a two-stage filtering approach, with both measures operating on the same samples.
【Keywords】: database integration; sampling; schema matching
【Paper Link】 【Pages】:75-86
【Authors】: Chris Mayfield ; Jennifer Neville ; Sunil Prabhakar
【Abstract】: Real-world databases often contain syntactic and semantic errors, in spite of integrity constraints and other safety measures incorporated into modern DBMSs. We present ERACER, an iterative statistical framework for inferring missing information and correcting such errors automatically. Our approach is based on belief propagation and relational dependency networks, and includes an efficient approximate inference algorithm that is easily implemented in standard DBMSs using SQL and user defined functions. The system performs the inference and cleansing tasks in an integrated manner, using shrinkage techniques to infer correct values accurately even in the presence of dirty data. We evaluate the proposed methods empirically on multiple synthetic and real-world data sets. The results show that our framework achieves accuracy comparable to a baseline statistical method using Bayesian networks with exact inference. However, our framework has wider applicability than the Bayesian network baseline, due to its ability to reason with complex, cyclic relational dependencies.
【Keywords】: approximate inference; discrete convolution; linear regression; outlier detection; relational dependency network
【Paper Link】 【Pages】:87-98
【Authors】: Aditya G. Parameswaran ; Georgia Koutrika ; Benjamin Bercovitz ; Hector Garcia-Molina
【Abstract】: We study recommendations in applications where there are temporal patterns in the way items are consumed or watched. For example, a student who has taken the Advanced Algorithms course is more likely to be interested in Convex Optimization, but a student who has taken Convex Optimization need not be interested in Advanced Algorithms in the future. Similarly, a person who has purchased the Godfather I DVD on Amazon is more likely to purchase Godfather II sometime in the future (though it is not strictly necessary to watch/purchase Godfather I beforehand). We propose a precedence mining model that estimates the probability of future consumption based on past behavior. We then propose Recsplorer: a suite of recommendation algorithms that exploit the precedence information. We evaluate our algorithms, as well as traditional recommendation ones, using a real course planning system. We use existing transcripts to evaluate how well the algorithms perform. In addition, we augment our experiments with a user study on the live system where users rate their recommendations.
【Keywords】: precedence mining; recommendations; temporality
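A hypothetical sketch of the precedence-mining estimate described above: from ordered histories (e.g. transcripts), estimate the probability that item b is consumed at some point after item a. The counting scheme and names are ours, not necessarily the paper's model.

```python
# Estimate P(b consumed later | a consumed earlier) from ordered histories.
from collections import Counter

def precedence_probabilities(histories):
    pair_count, item_count = Counter(), Counter()
    for h in histories:
        seen = set()
        for item in h:
            for prev in seen:
                pair_count[(prev, item)] += 1   # prev strictly precedes item
            seen.add(item)
        item_count.update(seen)                 # histories containing each item
    return {ab: c / item_count[ab[0]] for ab, c in pair_count.items()}

courses = [["algorithms", "convex_opt"],
           ["algorithms", "convex_opt"],
           ["algorithms", "databases"],
           ["convex_opt"]]
p = precedence_probabilities(courses)
# Algorithms -> Convex Optimization gets probability 2/3; the reverse
# direction gets no mass, matching the asymmetry in the abstract's example.
```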
【Paper Link】 【Pages】:99-110
【Authors】: Fang Wei
【Abstract】: Efficient shortest path query answering in large graphs is enjoying a growing number of applications, such as ranked keyword search in databases, social networks, ontology reasoning and bioinformatics. A shortest path query on a graph finds the shortest path for the given source and target vertices in the graph. Current techniques for efficient evaluation of such queries are based on the pre-computation of compressed Breadth First Search trees of the graph. However, they suffer from scalability drawbacks. To address these problems, we propose TEDI, an indexing and query processing scheme for shortest path query answering. TEDI is based on the tree decomposition methodology. The graph is first decomposed into a tree in which each node (a.k.a. bag) contains more than one vertex from the graph. The shortest paths are stored in such bags, and these local paths together with the tree are the components of the index of the graph. Based on this index, a bottom-up operation can be executed to find the shortest path for any given source and target vertices. Our experimental results show that TEDI offers orders-of-magnitude performance improvement over existing approaches in index construction time, index size and query answering.
【Keywords】: graphs; indexing; shortest path; tree decomposition
【Paper Link】 【Pages】:111-122
【Authors】: Changjiu Jin ; Sourav S. Bhowmick ; Xiaokui Xiao ; James Cheng ; Byron Choi
【Abstract】: Given a graph database D and a query graph g, an exact subgraph matching query asks for the set S of graphs in D that contain g as a subgraph. Queries of this type find important applications in several domains such as bioinformatics and chemoinformatics, where users are generally not familiar with complex graph query languages. Consequently, user-friendly visual interfaces which support query graph construction can reduce the burden of data retrieval for these users. Existing techniques for subgraph matching queries built on top of such a visual framework are designed to optimize the time required in retrieving the result set S from D, assuming that the whole query graph has been constructed. This leads to sub-optimal system response time as the query processing is initiated only after the user has finished drawing the query graph. In this paper, we take the first step towards exploring a novel graph query processing paradigm, where instead of processing a query graph after its construction, it interleaves visual query construction and processing to improve system response time. To realize this, we present an algorithm called GBLENDER that prunes false results and prefetches partial query results by exploiting the latency offered by the visual query formulation. It employs a novel action-aware indexing scheme that exploits users' interaction characteristics with visual interfaces to support efficient retrieval. Extensive experiments on both real and synthetic datasets demonstrate the effectiveness and efficiency of our solution.
【Keywords】: frequent subgraphs; graph databases; graph indexing; infrequent subgraphs; prefetching; visual query formulation
【Paper Link】 【Pages】:123-134
【Authors】: Ruoming Jin ; Hui Hong ; Haixun Wang ; Ning Ruan ; Yang Xiang
【Abstract】: Our world today is generating huge amounts of graph data such as social networks, biological networks, and the semantic web. Many of these real-world graphs are edge-labeled graphs, i.e., each edge has a label that denotes the relationship between the two vertices connected by the edge. A fundamental research problem on these labeled graphs is how to handle the label-constraint reachability query: Can vertex u reach vertex v through a path whose edge labels are constrained by a set of labels? In this work, we introduce a novel tree-based index framework which utilizes the directed maximal weighted spanning tree algorithm and sampling techniques to maximally compress the generalized transitive closure for the labeled graphs. An extensive experimental evaluation on both real and synthetic datasets demonstrates the efficiency of our approach in answering label-constraint reachability queries.
【Keywords】: generalized transitive closure; hoeffding and bernstein bounds; label-constraint reachability; maximal directed spanning tree
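To make the query type concrete, a naive baseline in Python: BFS that only follows edges whose labels lie in the allowed set. The paper's tree-based index exists precisely to answer such queries without this online traversal; the graph encoding here is ours.

```python
# Naive label-constraint reachability: can u reach v via edges whose
# labels are all in `allowed`? graph: {vertex: [(neighbor, label), ...]}.
from collections import deque

def lcr(graph, u, v, allowed):
    seen, queue = {u}, deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        for y, lbl in graph.get(x, []):
            if lbl in allowed and y not in seen:
                seen.add(y)
                queue.append(y)
    return False

g = {"a": [("b", "friend"), ("c", "boss")],
     "b": [("c", "friend")]}
# a reaches c through friend edges (via b) and through a direct boss edge,
# but b cannot reach a at all.
```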
【Paper Link】 【Pages】:135-146
【Authors】: Grzegorz Malewicz ; Matthew H. Austern ; Aart J. C. Bik ; James C. Dehnert ; Ilan Horn ; Naty Leiser ; Grzegorz Czajkowski
【Abstract】: Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs - in some cases billions of vertices, trillions of edges - poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
【Keywords】: distributed computing; graph algorithms
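A minimal single-machine Python sketch of the vertex-centric model the abstract describes (this is the Pregel paper): synchronized supersteps in which each vertex reads messages from the previous step, updates its state, and sends messages along its outgoing edges, halting when no vertex is active. Here every vertex propagates the maximum value; the distributed machinery is, of course, elided.

```python
# Toy vertex-centric supersteps: propagate the maximum value over a
# directed graph. edges: {vertex: [neighbors]}, values: {vertex: value}.

def supersteps(edges, values):
    inbox = {v: [] for v in values}
    active = set(values)
    while active:
        outbox = {v: [] for v in values}
        next_active = set()
        for v in active:
            new = max([values[v]] + inbox[v])
            if new != values[v] or not inbox[v]:   # changed, or first step
                values[v] = new
                for w in edges.get(v, []):
                    outbox[w].append(new)          # message for next step
                    next_active.add(w)
        inbox, active = outbox, next_active        # barrier between steps
    return values

# On the cycle 1 -> 2 -> 3 -> 1, every vertex converges to the maximum, 7.
```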
【Paper Link】 【Pages】:147-158
【Authors】: Shimin Chen ; Phillip B. Gibbons ; Suman Nath
【Abstract】: Online aggregation is a promising solution to achieving fast early responses for interactive ad-hoc queries that compute aggregates on a large amount of data. Essential to the success of online aggregation is a good non-blocking join algorithm that enables both (i) high early result rates with statistical guarantees and (ii) fast end-to-end query times. We analyze existing non-blocking join algorithms and find that they all provide sub-optimal early result rates, and those with fast end-to-end times achieve them only by further sacrificing their early result rates. We propose a new non-blocking join algorithm, Partitioned expanding Ripple Join (PR-Join), which achieves considerably higher early result rates than previous non-blocking joins, while also delivering fast end-to-end query times. PR-Join performs separate, ripple-like join operations on individual hash partitions, where the width of a ripple expands multiplicatively over time. This contrasts with the non-partitioned, fixed-width ripples of Block Ripple Join. Assuming, as in previous non-blocking join studies, that the input relations are in random order, PR-Join ensures representative early results that are amenable to statistical guarantees. We show both analytically and with real-machine experiments that PR-Join achieves over an order of magnitude higher early result rates than previous non-blocking joins. We also discuss the benefits of using a flash-based SSD for temporary storage, showing that PR-Join can then achieve close to optimal end-to-end performance. Finally, we consider the joining of finite data streams that arrive over time, and find that PR-Join achieves similar or higher result rates than RPJ, the state-of-the-art algorithm specialized for that domain.
【Keywords】: data warehouse; fast early result; finite data stream; non-blocking join; online aggregation; pr-join; statistical guarantee
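A toy Python illustration of the ripple-join idea underlying PR-Join: consume both (randomly ordered) inputs incrementally, joining each newly read tuple against everything seen so far, so join results, and hence running aggregate estimates, appear long before either input is exhausted. This sketch uses a fixed ripple width and no partitioning; PR-Join's multiplicatively expanding, per-partition ripples are its contribution.

```python
# Fixed-width ripple join on the first attribute; every joining pair is
# produced exactly once, incrementally.

def ripple_join(r, s, width=1):
    seen_r, seen_s, out = [], [], []
    i = j = 0
    while i < len(r) or j < len(s):
        new_r, i = r[i:i + width], min(i + width, len(r))
        new_s, j = s[j:j + width], min(j + width, len(s))
        for a in new_r:                       # new R tuples vs. seen S
            out += [(a, b) for b in seen_s if a[0] == b[0]]
        seen_r += new_r
        for b in new_s:                       # new S tuples vs. all seen R
            out += [(a, b) for a in seen_r if a[0] == b[0]]
        seen_s += new_s
    return out

r = [(1, "x"), (2, "y"), (3, "z")]
s = [(2, "u"), (1, "v")]
```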
【Paper Link】 【Pages】:159-170
【Authors】: Thanh T. L. Tran ; Liping Peng ; Boduo Li ; Yanlei Diao ; Anna Liu
【Abstract】: Uncertain data streams, where data is incomplete, imprecise, and even misleading, have been observed in many environments. Feeding such data streams to existing stream systems produces results of unknown quality, which is of paramount concern to monitoring applications. In this paper, we present the PODS system that supports stream processing for uncertain data naturally captured using continuous random variables. PODS employs a unique data model that is flexible and allows efficient computation. Built on this model, we develop evaluation techniques for complex relational operators, i.e., aggregates and joins, by exploring advanced statistical theory and approximation. Evaluation results show that our techniques can achieve high performance while satisfying accuracy requirements, and significantly outperform a state-of-the-art sampling method. A case study further shows that our techniques can enable a tornado detection system (for the first time) to produce detection results at stream speed and with much improved quality.
【Keywords】: data models; relational processing; uncertain data streams
【Paper Link】 【Pages】:171-182
【Authors】: Abdullah Mueen ; Suman Nath ; Jie Liu
【Abstract】: We consider the problem of computing all-pair correlations in a warehouse containing a large number (e.g., tens of thousands) of time-series (or, signals). The problem arises in automatic discovery of patterns and anomalies in data intensive applications such as data center management, environmental monitoring, and scientific experiments. However, with existing techniques, solving the problem for a large stream warehouse is extremely expensive, due to the problem's inherent quadratic I/O and CPU complexities. We propose novel algorithms, based on Discrete Fourier Transformation (DFT) and graph partitioning, to reduce the end-to-end response time of an all-pair correlation query. To minimize I/O cost, we partition a massive set of input signals into smaller batches such that caching the signals one batch at a time maximizes data reuse and minimizes disk I/O. To reduce CPU cost, we propose two approximation algorithms. Our first algorithm efficiently computes approximate correlation coefficients of similar signal pairs within a given error bound. The second algorithm efficiently identifies, without any false positives or negatives, all signal pairs with correlations above a given threshold. For many real applications, our approximate solutions are as useful as corresponding exact solutions, due to our strict error guarantees. However, compared to the state-of-the-art exact algorithms, our algorithms are up to 17x faster for several real datasets.
【Keywords】: correlation matrix; discrete fourier transform
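A sketch of the DFT-based pruning ingredient: for z-normalized series of length n, corr(x, y) = 1 - d(x, y)^2 / (2n), and by Parseval's theorem the distance computed from only the first few (unitary) Fourier coefficients lower-bounds d, hence upper-bounds the correlation. That bound is enough to discard clearly uncorrelated pairs cheaply. Pure-Python DFT; the constants and names are ours, not the paper's.

```python
# Upper-bound the Pearson correlation of two series from k DFT coefficients.
import cmath

def znorm(x):
    n = len(x)
    m = sum(x) / n
    s = (sum((v - m) ** 2 for v in x) / n) ** 0.5
    return [(v - m) / s for v in x]

def dft(x, k):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            / n ** 0.5 for f in range(k)]      # 1/sqrt(n): distance-preserving

def corr_upper_bound(x, y, k=3):
    n = len(x)
    fx, fy = dft(znorm(x), k), dft(znorm(y), k)
    d2 = sum(abs(a - b) ** 2 for a, b in zip(fx, fy))  # partial sum <= d(x,y)^2
    return 1 - d2 / (2 * n)
```

If the bound falls below the query threshold, the pair can be skipped without computing the exact correlation.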
【Paper Link】 【Pages】:183-194
【Authors】: Peng Wang ; Haixun Wang ; Majin Liu ; Wei Wang
【Abstract】: Recently, much study has been directed toward summarizing event data, in the hope that the summary will lead us to a better understanding of the system that generates the events. However, instead of offering a global picture of the system, the summaries obtained by most current approaches are piecewise, each describing an isolated snapshot of the system. We argue that the best summary, both in terms of its minimal description length and its interpretability, is the one obtained with an understanding of the internal dynamics of the system. Such understanding includes, for example, what the internal states of the system are, and how the system alternates among these states. In this paper, we adopt an algorithmic approach to event data summarization. More specifically, we use a hidden Markov model to describe the event generation process. We show that summarizing events based on the learned hidden Markov model achieves short description length and high interpretability. Experiments show that our approach is both efficient and effective.
【Keywords】: event summarization; hidden markov model; minimal description length
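Once an HMM over event types has been learned, the standard way to recover which internal state generated which events is Viterbi decoding; a compact Python version follows. The two-state model and its numbers are made up for illustration, not taken from the paper.

```python
# Viterbi: most likely hidden-state sequence for an observed event sequence.

def viterbi(obs, states, start, trans, emit):
    # path: state -> (best path ending in that state, its probability)
    path = {s: ([s], start[s] * emit[s][obs[0]]) for s in states}
    for o in obs[1:]:
        path = {s: max(((p + [s], pr * trans[p[-1]][s] * emit[s][o])
                        for p, pr in path.values()), key=lambda t: t[1])
                for s in states}
    return max(path.values(), key=lambda t: t[1])[0]

states = ["normal", "burst"]
start = {"normal": 0.8, "burst": 0.2}
trans = {"normal": {"normal": 0.9, "burst": 0.1},
         "burst": {"normal": 0.3, "burst": 0.7}}
emit = {"normal": {"ok": 0.9, "err": 0.1},
        "burst": {"ok": 0.2, "err": 0.8}}
# A run of "err" events is best explained by a stay in the "burst" state.
```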
【Paper Link】 【Pages】:195-206
【Authors】: Jerzy Tyszkiewicz
【Abstract】: Spreadsheets are among the most commonly used applications for data management and analysis. Perhaps they are even among the most widely used computer applications of all kinds. However, the spreadsheet paradigm of computation still lacks sufficient analysis. In this paper we demonstrate that a spreadsheet can play the role of a relational database engine, without any use of macros or built-in programming languages, merely by utilizing spreadsheet formulas. We achieve that by implementing all operators of relational algebra by means of spreadsheet functions. Given a definition of a database in SQL, it is therefore possible to construct a spreadsheet workbook with empty worksheets for data tables and worksheets filled with formulas for queries. From then on, when the user enters, alters or deletes data in the data worksheets, the formulas in query worksheets automatically compute the actual results of the queries. Thus, the spreadsheet serves as data storage and executes SQL queries, and therefore acts as a relational database engine. The paper is based on Microsoft Excel (TM), but our constructions work in other spreadsheet systems, too. We present a number of performance tests conducted in the beta version of Excel 2010. Their conclusion is that the performance is sufficient for a desktop database with a couple thousand rows.
【Keywords】: performance; relational algebra; relational databases; spreadsheets; sql
【Paper Link】 【Pages】:207-218
【Authors】: Hyun Jin Moon ; Carlo Curino ; Carlo Zaniolo
【Abstract】: The problem of archiving and querying the history of a database is made more complex by the fact that, along with the database content, the database schema also evolves with time. Indeed, archival quality can only be guaranteed by storing past database contents using the schema versions under which they were originally created. This causes major usability and scalability problems in the preservation, retrieval and querying of databases with intense evolution histories, i.e., hundreds of schema versions. This scenario is common in web information systems and scientific databases that frequently accumulate that many versions in just a few years. Our system, Archival Information Management System (AIMS), solves this usability issue by letting users write queries against a chosen schema version and then performing for the users the rewriting and execution of queries on all appropriate schema versions. AIMS achieves scalability by using (i) an advanced storage strategy based on relational technology and attribute-level timestamping of the history of the database content, (ii) suitable temporal indexing and clustering techniques, and (iii) novel temporal query optimizations. In particular, with AIMS we introduce a novel technique called CoalNesT that achieves unprecedented performance when temporally coalescing tuples fragmented by schema changes. Extensive experiments show that the performance and scalability thus achieved greatly exceed those obtained by previous approaches. The AIMS technology is easily deployed by plugging into existing DBMS replication technologies, leading to very low overhead; moreover, by decoupling the logical and physical layers it provides multiple query interfaces, from the basic archive & query features considered in the upcoming SQL standards, to the much richer temporal XML/XQuery capabilities proposed by researchers.
【Keywords】:
【Paper Link】 【Pages】:219-230
【Authors】: Wolfgang Gatterbauer ; Dan Suciu
【Abstract】: In massively collaborative projects such as scientific or community databases, users often need to agree or disagree on the content of individual data items. On the other hand, trust relationships often exist between users, allowing them to accept or reject other users' beliefs by default. As those trust relationships become complex, however, it becomes difficult to define and compute a consistent snapshot of the conflicting information. Previous solutions to a related problem, the update reconciliation problem, are dependent on the order in which the updates are processed and, therefore, do not guarantee a globally consistent snapshot. This paper proposes the first principled solution to the automatic conflict resolution problem in a community database. Our semantics is based on the certain tuples of all stable models of a logic program. While evaluating stable models in general is well known to be hard, even for very simple logic programs, we show that the conflict resolution problem admits a PTIME solution. To the best of our knowledge, ours is the first PTIME algorithm that allows conflict resolution in a principled way. We further discuss extensions to negative beliefs and prove that some of these extensions are hard. This work is done in the context of the BeliefDB project at the University of Washington, which focuses on the efficient management of conflicts in community databases.
【Keywords】: community database; consistency; data exchange; stable model
【Paper Link】 【Pages】:231-242
【Authors】: Dimitris Tsirogiannis ; Stavros Harizopoulos ; Mehul A. Shah
【Abstract】: Rising energy costs in large data centers are driving an agenda for energy-efficient computing. In this paper, we focus on the role of database software in affecting, and, ultimately, improving the energy efficiency of a server. We first characterize the power-use profiles of database operators under different configuration parameters. We find that common database operations can exercise the full dynamic power range of a server, and that the CPU power consumption of different operators, for the same CPU utilization, can differ by as much as 60%. We also find that for these operations CPU power does not vary linearly with CPU utilization. We then experiment with several classes of database systems and storage managers, varying parameters that span from different query plans to compression algorithms and from physical layout to CPU frequency and operating system scheduling. Contrary to what recent work has suggested, we find that within a single node intended for use in scale-out (shared-nothing) architectures, the most energy-efficient configuration is typically the highest performing one. We explain under which circumstances this is not the case, and argue that these circumstances do not warrant a retargeting of database system optimization goals. Further, our results reveal opportunities for cross-node energy optimizations and point out directions for new scale-out architectures.
【Keywords】: cpu power; database server; energy efficiency; power consumption; ssd
【Paper Link】 【Pages】:243-254
【Authors】: Zhengdao Xu ; Hans-Arno Jacobsen
【Abstract】: Applications ranging from location-based services to multi-player online gaming require continuous query support to monitor, track, and detect events of interest among sets of moving objects. Examples are alerting capabilities for detecting whether the distance, the travel cost, or the travel time among a set of moving objects exceeds a threshold. These types of queries are driven by continuous streams of location updates, simultaneously evaluated over many queries. In this paper, we define three types of proximity relations that induce location constraints to model continuous spatio-temporal queries among sets of moving objects in road networks. Our focus lies on evaluating a large number of continuous queries simultaneously. We introduce a novel moving object indexing technique that together with a novel road network partitioning scheme restricts computations within the partial road network. These techniques reduce query processing overhead by more than 95%. Experiments over real-world data sets show that our approach is twenty times faster than a baseline algorithm.
【Keywords】: constraint processing; location constraint; location query; location-based services; publish/subscribe; road networks
【Paper Link】 【Pages】:255-266
【Authors】: Zaiben Chen ; Heng Tao Shen ; Xiaofang Zhou ; Yu Zheng ; Xing Xie
【Abstract】: Trajectory search has long been an attractive and challenging topic, and it has spawned various interesting applications in spatio-temporal databases. In this work, we study a new problem of searching trajectories by locations, where the query is only a small set of locations, with or without an order specified, and the target is to find the k Best-Connected Trajectories (k-BCT) from a database such that the k-BCT best connect the designated locations geographically. Different from conventional trajectory search, which looks for similar trajectories w.r.t. shape or other criteria by using a sample query trajectory, we focus on the goodness of connection provided by a trajectory to the specified query locations. This new query can benefit users in many novel applications such as trip planning. In our work, we first define a new similarity function for measuring how well a trajectory connects the query locations, with both spatial distance and order constraint being considered. Upon the observation that the number of query locations is normally small (e.g. 10 or less), since it is impractical for a user to input too many locations, we analyze the feasibility of using a general-purpose spatial index to achieve efficient k-BCT search, based on a simple Incremental k-NN based Algorithm (IKNN). The IKNN effectively prunes and refines trajectories by using the devised lower bound and upper bound of similarity. Our contributions mainly lie in adapting the best-first and depth-first k-NN algorithms to the basic IKNN properly, and more importantly ensuring efficiency in both search effort and memory usage. An in-depth study on the adaptation and its efficiency is provided. Further optimization is also presented to accelerate the IKNN algorithm. Finally, we verify the efficiency of the algorithm by extensive experiments.
【Keywords】: efficiency; locations; optimization; trajectory search
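The abstract does not reproduce the similarity function in full; as an illustrative sketch (ignoring the order constraint the paper also supports), one plausible form scores a trajectory by summing, over the query locations, a decaying function of the distance to the nearest trajectory point:

```python
import math

def similarity(query_locations, trajectory):
    """Score how well a trajectory connects a set of query locations.

    Illustrative sketch only: for each query location, take the distance
    to the nearest trajectory point and sum the contributions
    e^{-distance}, so closer matches score higher.  The paper's exact
    function (with the order constraint) may differ.
    """
    score = 0.0
    for qx, qy in query_locations:
        nearest = min(math.hypot(qx - tx, qy - ty) for tx, ty in trajectory)
        score += math.exp(-nearest)
    return score

# A trajectory passing through both query locations beats one that does not.
q = [(0.0, 0.0), (5.0, 5.0)]
t_good = [(0.0, 0.0), (2.5, 2.5), (5.0, 5.0)]
t_bad = [(10.0, 0.0), (10.0, 5.0)]
```

An upper and lower bound on such a score over partially examined trajectories is what lets the IKNN prune candidates early.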
【Paper Link】 【Pages】:267-278
【Authors】: Mirco Stern ; Klemens Böhm ; Erik Buchmann
【Abstract】: While join processing in wireless sensor networks has received a lot of attention recently, current solutions do not work well for continuous queries. In those networks, however, continuous queries are the rule. To minimize the communication costs of join processing, it is important not to ship non-joining tuples. In order to know which tuples do not join, prior work has proposed a precomputation step. For continuous queries, however, repeating the precomputation for each execution is unnecessary and ignores the fact that data tends to be temporally correlated. In this paper, we present a filtering approach for the processing of continuous join queries. We propose to keep the filters and to maintain them. The problems are determining the sizes of the filters and deciding which filters to update. Simplistic approaches result in poor performance. We show how to compute solutions that are optimal. Experiments on real-world sensor data indicate that our method performs close to a theoretical optimum and consistently outperforms state-of-the-art join approaches.
【Keywords】: join processing; sensor networks
【Paper Link】 【Pages】:279-290
【Authors】: Nikos Giatrakos ; Yannis Kotidis ; Antonios Deligiannakis ; Vasilis Vassalos ; Yannis Theodoridis
【Abstract】: Wireless sensor networks are becoming increasingly popular for a variety of applications. Users are frequently faced with the surprising discovery that readings produced by the sensing elements of their motes are often contaminated with outliers. Outlier readings can severely affect applications that rely on timely and reliable sensory data in order to provide the desired functionality. As a consequence, there is a recent trend to explore how techniques that identify outlier values can be applied to sensory data cleaning. Unfortunately, most of these approaches incur an overwhelming communication overhead, which limits their practicality. In this paper we introduce an in-network outlier detection framework, based on locality sensitive hashing, extended with a novel boosting process as well as efficient load balancing and comparison pruning mechanisms. Our method trades off bandwidth for accuracy in a straightforward manner and supports many intuitive similarity metrics.
【Keywords】: outliers; sensor networks
【Paper Link】 【Pages】:291-302
【Authors】: Ruiwen Chen ; Yongyi Mao ; Iluju Kiringa
【Abstract】: Under the tuple-level uncertainty paradigm, we formalize the use of a novel graphical model, Generator-Recognizer Network (GRN), as a model of probabilistic databases. The GRN modeling framework is capable of representing a much wider range of tuple dependency structure. We show that a GRN representation of a probabilistic database may undergo transitions induced by imposing constraints or evaluating queries. We formalize procedures for these two types of transitions such that the resulting graphical models after transitions remain as GRNs. This formalism makes GRN a self-contained modeling framework and a closed representation system for probabilistic databases - a property that is lacking in most existing models. In addition, we show that exploiting the transitional mechanisms allows a systematic approach to constructing GRNs for arbitrary probabilistic data at arbitrary stages. Advantages of GRNs in query evaluation are also demonstrated.
【Keywords】: bayesian network; generator-recognizer network; graphical model; probabilistic database; uncertainty
【Paper Link】 【Pages】:303-314
【Authors】: Xiang Lian ; Lei Chen ; Shaoxu Song
【Abstract】: Efficient and effective manipulation of probabilistic data has become increasingly important recently due to many real applications that involve data uncertainty. This is especially crucial when probabilistic data collected from different sources disagree with each other and incur inconsistencies. In order to accommodate such inconsistencies and enable consistent query answering (CQA), in this paper, we propose the all-possible-repair semantics in the context of inconsistent probabilistic databases, which formalizes the repairs on the database as repair worlds via a graph representation. In turn, the CQA problem can be converted into one in the so-called repaired possible worlds (w.r.t. both repair worlds and possible worlds). We investigate a series of consistent queries in inconsistent probabilistic databases, including consistent range queries, join, and top-k queries, which, however, need to deal with an exponential number of the repaired possible worlds at high cost. To tackle the efficiency problem of CQA, we propose efficient approaches for retrieving consistent query answers, including effective pruning methods to filter out false positives. Extensive experiments have been conducted to demonstrate the efficiency and effectiveness of our approaches.
【Keywords】: all-possible-repairs semantics; consistent query answering; inconsistent probabilistic database; repair world; repaired possible world
【Paper Link】 【Pages】:315-326
【Authors】: Yinian Qi ; Rohit Jain ; Sarvjeet Singh ; Sunil Prabhakar
【Abstract】: The probabilistic threshold query (PTQ) is one of the most common queries in uncertain databases, where all results satisfying the query with probabilities that meet the threshold requirement are returned. PTQ is used widely in nearest-neighbor queries, range queries, ranking queries, etc. In this paper, we investigate the general PTQ for arbitrary SQL queries that involve selections, projections and joins. The uncertain database model that we use is one that combines both attribute and tuple uncertainty as well as correlations between arbitrary attribute sets. We address the PTQ optimization problem that aims at improving the efficiency of PTQ query execution by enabling alternative query plan enumeration for optimization. We propose general optimization rules as well as rules specifically for selections, projections and joins. We introduce a threshold operator (τ-operator) to the query plan and show it is generally desirable to push down the τ-operator as much as possible.
【Keywords】: probabilistic data; query optimization; threshold queries; uncertain data
【Paper Link】 【Pages】:327-338
【Authors】: Jeffrey Jestes ; Feifei Li ; Zhepeng Yan ; Ke Yi
【Abstract】: Edit distance based string similarity join is a fundamental operator in string databases. Increasingly, many applications in data cleaning, data integration, and scientific computing have to deal with fuzzy information in string attributes. Despite the intensive efforts devoted in processing (deterministic) string joins and managing probabilistic data respectively, modeling and processing probabilistic strings is still a largely unexplored territory. This work studies the string join problem in probabilistic string databases, using the expected edit distance (EED) as the similarity measure. We first discuss two probabilistic string models to capture the fuzziness in string values in real-world applications. The string-level model is complete, but may be expensive to represent and process. The character-level model has a much more succinct representation when uncertainty in strings only exists at certain positions. Since computing the EED between two probabilistic strings is prohibitively expensive, we have designed efficient and effective pruning techniques that can be easily implemented in existing relational database engines for both models. Extensive experiments on real data have demonstrated order-of-magnitude improvements of our approaches over the baseline.
【Keywords】: approximate string queries; probabilistic strings; string joins
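Under the string-level model described in the abstract, a probabilistic string is a distribution over possible strings, and the EED is the probability-weighted average of pairwise edit distances. The direct, exponential-cost computation that the paper's pruning techniques avoid can be sketched as:

```python
def edit_distance(a, b):
    # Standard Levenshtein dynamic program with a rolling row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def expected_edit_distance(ps1, ps2):
    """EED between two string-level probabilistic strings, each given as a
    list of (string, probability) pairs.  Enumerates all pairs of possible
    worlds, so it is only practical for small inputs -- which is exactly
    why the paper develops pruning techniques."""
    return sum(p1 * p2 * edit_distance(s1, s2)
               for s1, p1 in ps1 for s2, p2 in ps2)
```

The character-level model replaces the explicit world list with per-position distributions, giving a far more succinct representation when uncertainty is localized.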
【Paper Link】 【Pages】:339-350
【Authors】: Changkyu Kim ; Jatin Chhugani ; Nadathur Satish ; Eric Sedlar ; Anthony D. Nguyen ; Tim Kaldewey ; Victor W. Lee ; Scott A. Brandt ; Pradeep Dubey
【Abstract】: In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous computing power by integrating multiple cores, each with wide vector units. There has been much work to exploit modern processor architectures for database primitives like scan, sort, join and aggregation. However, unlike other primitives, tree search presents significant challenges due to irregular and unpredictable data accesses in tree traversal. In this paper, we present FAST, an extremely fast architecture-sensitive layout of the index tree. FAST is a binary tree logically organized to optimize for architecture features like page size, cache line size, and SIMD width of the underlying hardware. FAST eliminates the impact of memory latency, and exploits thread-level and data-level parallelism on both CPUs and GPUs to achieve 50 million (CPU) and 85 million (GPU) queries per second, 5X (CPU) and 1.7X (GPU) faster than the best previously reported performance on the same architectures. FAST supports efficient bulk updates by rebuilding index trees in less than 0.1 seconds for datasets as large as 64M keys, and naturally integrates compression techniques, overcoming the memory bandwidth bottleneck and achieving a 6X performance improvement over uncompressed index search for large keys on CPUs.
【Keywords】: compression; cpu; data-level parallelism; gpu; thread-level parallelism; tree search
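FAST's actual layout blocks the tree hierarchically by SIMD width, cache line, and page size. As a much-simplified illustration of the underlying idea (storing a binary search tree contiguously in one array so traversal is pointer-free and index arithmetic replaces pointer chasing), consider this toy implicit breadth-first layout:

```python
def build_bfs_layout(sorted_keys):
    """Lay out sorted keys as an implicit breadth-first binary tree
    (node i has children 2i+1 and 2i+2) in one contiguous array.
    A toy stand-in for FAST's hierarchical page/cache-line/SIMD blocking."""
    n = len(sorted_keys)
    tree = [None] * n
    it = iter(sorted_keys)

    def fill(i):
        # In-order traversal of the implicit tree consumes keys in order,
        # so the array satisfies the binary-search-tree property.
        if i < n:
            fill(2 * i + 1)
            tree[i] = next(it)
            fill(2 * i + 2)

    fill(0)
    return tree

def search(tree, key):
    """Pointer-free traversal: the next node index is computed, not loaded."""
    i = 0
    while i < len(tree):
        if tree[i] == key:
            return True
        i = 2 * i + 1 + (key > tree[i])  # left child if smaller, right if larger
    return False
```

The real system goes much further (latency hiding, SIMD comparisons, compression), but the contiguous, arithmetic-addressable layout is the common starting point.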
【Paper Link】 【Pages】:351-362
【Authors】: Nadathur Satish ; Changkyu Kim ; Jatin Chhugani ; Anthony D. Nguyen ; Victor W. Lee ; Daehyun Kim ; Pradeep Dubey
【Abstract】: Sort is a fundamental kernel used in many database operations. In-memory sorts are now feasible; sort performance is limited by compute flops and main memory bandwidth rather than I/O. In this paper, we present a competitive analysis of comparison and non-comparison based sorting algorithms on two modern architectures - the latest CPU and GPU architectures. We propose novel CPU radix sort and GPU merge sort implementations which are 2X faster than previously published results. We perform a fair comparison of the algorithms using these best performing implementations on both architectures. While radix sort is faster on current architectures, the gap narrows from CPU to GPU architectures. Merge sort performs better than radix sort for sorting keys of large sizes - such keys will be required to accommodate the increasing cardinality of future databases. We present analytical models for analyzing the performance of our implementations in terms of architectural features such as core count, SIMD and bandwidth. Our obtained performance results are successfully predicted by our models. Our analysis points to merge sort winning over radix sort on future architectures due to its efficient utilization of SIMD and low bandwidth utilization. We simulate a 64-core platform with varying SIMD widths under constant bandwidth-per-core constraints, and show that for large data sizes of 2^40 (one trillion records), merge sort performance on large key sizes is up to 3X better than radix sort for large SIMD widths on future architectures. Therefore, merge sort should be the sorting method of choice for future databases.
【Keywords】: buffer; databases; many-core; merge; merge network; performance; radix; simd; sorting; tlp
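The CPU radix sort the paper proposes is a buffered, SIMD-optimized variant; the pass structure it builds on is the textbook least-significant-digit radix sort, one byte per pass:

```python
def radix_sort(keys, key_bytes=4):
    """LSD radix sort on non-negative integers, one byte per pass.
    Textbook version only: the paper's implementation adds software
    buffering and SIMD to make each pass bandwidth-friendly, but the
    number of passes (key width / digit width) is the same, which is
    why radix sort's cost grows with key size while merge sort's does not."""
    for shift in range(0, 8 * key_bytes, 8):
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[(k >> shift) & 0xFF].append(k)
        # Stable concatenation preserves order from earlier (lower) digits.
        keys = [k for b in buckets for k in b]
    return keys
```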
【Paper Link】 【Pages】:363-374
【Authors】: Yi-Reun Kim ; Kyu-Young Whang ; Il-Yeol Song
【Abstract】: Flash memory is widely used as the secondary storage in lightweight computing devices due to its outstanding advantages over magnetic disks. Flash memory has many access characteristics different from those of magnetic disks, and how to take advantage of them is becoming an important research issue. There are two existing approaches to storing data into flash memory: page-based and log-based. The former has good performance for read operations, but poor performance for write operations. In contrast, the latter has good performance for write operations when updates are light, but poor performance for read operations. In this paper, we propose a new method of storing data, called page-differential logging, for flash-based storage systems that solves the drawbacks of the two methods. The primary characteristics of our method are: (1) writing only the difference (which we define as the page-differential) between the original page in flash memory and the up-to-date page in memory; (2) computing and writing the page-differential only once at the time the page needs to be reflected into flash memory. This contrasts with existing page-based methods, which write the whole page including both changed and unchanged parts of data, and with log-based ones, which keep track of the history of all the changes in a page. Our method allows existing disk-based DBMSs to be reused as flash-based DBMSs just by modifying the flash memory driver, i.e., it is DBMS-independent. Experimental results show that the proposed method is superior in I/O performance, except for some special cases, to existing ones. Specifically, it improves the performance of various mixes of read-only and update operations by 0.5 (the special case when all transactions are read-only on updated pages) to 3.4 times over the page-based method and by 1.6 to 3.1 times over the log-based one for synthetic data of approximately 1 Gbytes. The TPC-C benchmark also shows improvement of the I/O time over existing methods by 1.2 to 6.1 times. This result indicates the effectiveness of our method under (semi-)real workloads. We note that the performance advantage of our method can be further enhanced up to twofold by obviating the need to write to the spare area of the page a second time.
【Keywords】: dbms-independent; flash memory; page-differential
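The core of page-differential logging, writing only the changed byte ranges of a page and reconstructing the full page on read, can be sketched as follows (a simplified model; the actual system manages differentials within flash pages and their spare areas):

```python
def page_differential(original, current):
    """Compute the byte ranges that changed between the original flash
    page and the up-to-date in-memory page.  Returns a list of
    (offset, new_bytes) runs; only this differential would be written
    to flash, not the whole page."""
    assert len(original) == len(current)
    diffs, i, n = [], 0, len(original)
    while i < n:
        if original[i] != current[i]:
            j = i
            while j < n and original[j] != current[j]:
                j += 1
            diffs.append((i, current[i:j]))
            i = j
        else:
            i += 1
    return diffs

def apply_differential(original, diffs):
    """Reconstruct the up-to-date page from the base page plus its
    differential, as a reader of the flash store would."""
    page = bytearray(original)
    for offset, data in diffs:
        page[offset:offset + len(data)] = data
    return bytes(page)
```

When few bytes change, the differential is far smaller than the page, which is the source of the write savings over page-based storage; unlike log-based storage, a read needs only the base page and one differential, not a change history.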
【Paper Link】 【Pages】:375-386
【Authors】: Rajendra Shinde ; Ashish Goel ; Pankaj Gupta ; Debojyoti Dutta
【Abstract】: Similarity search methods are widely used as kernels in various data mining and machine learning applications, including those in computational biology and web search/clustering. Nearest neighbor search (NNS) algorithms are often used to retrieve similar entries, given a query. While there exist efficient techniques for exact query lookup using hashing, similarity search using exact nearest neighbors suffers from a "curse of dimensionality", i.e. for high-dimensional spaces, the best known solutions offer little improvement over brute-force search and thus are unsuitable for large-scale streaming applications. Fast solutions to the approximate NNS problem include Locality Sensitive Hashing (LSH) based techniques, which need storage polynomial in n with exponent greater than 1, and query time sublinear, but still polynomial, in n, where n is the size of the database. In this work we present a new technique for solving the approximate NNS problem in Euclidean space using a Ternary Content Addressable Memory (TCAM), which needs near-linear space and has O(1) query time. In fact, this method also works around the best known lower bounds in the cell probe model for the query time using a data structure near-linear in the size of the database. TCAMs are high-performance associative memories widely used in networking applications such as address lookups and access control lists. A TCAM can query for a bit vector within a database of ternary vectors, where every bit position represents 0, 1 or *. The * is a wild card representing either a 0 or a 1. We leverage TCAMs to design a variant of LSH, called Ternary Locality Sensitive Hashing (TLSH), wherein we hash database entries represented by vectors in the Euclidean space into {0,1,*}. By using the added functionality of a TLSH scheme with respect to the * character, we solve an instance of the approximate nearest neighbor problem with 1 TCAM access and storage nearly linear in the size of the database.
We validate our claims with extensive simulations using both real world (Wikipedia) as well as synthetic (but illustrative) datasets. We observe that using a TCAM of width 288 bits, it is possible to solve the approximate NNS problem on a database of size 1 million points with high accuracy. Finally, we design an experiment with TCAMs within an enterprise ethernet switch (Cisco Catalyst 4500) to validate that TLSH can be used to perform 1.5 million queries per second per 1Gb/s port. We believe that this work can open new avenues in very high speed data mining.
【Keywords】: locality sensitive hashing; nearest neighbor search; similarity search; tcam
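The exact TLSH construction is not reproduced in the abstract; the sketch below illustrates the general idea of a ternary hash code, where coordinates whose random projection falls near the hash boundary are emitted as the wildcard `*`, and matching is TCAM-style (a `*` on either side matches anything). The margin-based rule here is an illustrative assumption, not necessarily the paper's scheme:

```python
import random

def make_ternary_hasher(dim, num_bits, margin, seed=0):
    """Build a ternary LSH-style hash function: each output position is
    the sign of a random projection <r, x>, except that projections with
    |<r, x>| < margin emit '*' (don't care), mimicking how TLSH exploits
    TCAM wildcards for points near a hash boundary.  Illustrative only."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]

    def hash_point(x):
        code = []
        for r in planes:
            dot = sum(ri * xi for ri, xi in zip(r, x))
            code.append('*' if abs(dot) < margin else ('1' if dot > 0 else '0'))
        return ''.join(code)

    return hash_point

def ternary_match(query_code, db_code):
    """TCAM-style match: '*' in either code matches anything."""
    return all(q == d or q == '*' or d == '*'
               for q, d in zip(query_code, db_code))
```

A real TCAM evaluates `ternary_match` against every stored entry in a single hardware lookup, which is where the O(1) query time comes from.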
【Paper Link】 【Pages】:387-398
【Authors】: Partha Pratim Talukdar ; Zachary G. Ives ; Fernando C. N. Pereira
【Abstract】: Scientific data offers some of the most interesting challenges in data integration today. Scientific fields evolve rapidly and accumulate masses of observational and experimental data that needs to be annotated, revised, interlinked, and made available to other scientists. From the perspective of the user, this can be a major headache as the data they seek may initially be spread across many databases in need of integration. Worse, even if users are given a solution that integrates the current state of the source databases, new data sources appear with new data items of interest to the user. Here we build upon recent ideas for creating integrated views over data sources using keyword search techniques, ranked answers, and user feedback [32] to investigate how to automatically discover when a new data source has content relevant to a user's view - in essence, performing automatic data integration for incoming data sets. The new architecture accommodates a variety of methods to discover related attributes, including label propagation algorithms from the machine learning community [2] and existing schema matchers [11]. The user may provide feedback on the suggested new results, helping the system repair any bad alignments or increase the cost of including a new source that is not useful. We evaluate our approach on actual bioinformatics schemas and data, using state-of-the-art schema matchers as components. We also discuss how our architecture can be adapted to more traditional settings with a mediated schema.
【Keywords】: data integration; keyword search; machine learning; schema alignment; schema matching; user feedback
【Paper Link】 【Pages】:399-410
【Authors】: Nicoleta Preda ; Gjergji Kasneci ; Fabian M. Suchanek ; Thomas Neumann ; Wenjun Yuan ; Gerhard Weikum
【Abstract】: The proliferation of knowledge-sharing communities and the advances in information extraction have enabled the construction of large knowledge bases using the RDF data model to represent entities and relationships. However, as the Web and its latently embedded facts evolve, a knowledge base can never be complete and up-to-date. On the other hand, a rapidly increasing suite of Web services provide access to timely and high-quality information, but this is encapsulated by the service interface. We propose to leverage the information that could be dynamically obtained from Web services in order to enrich RDF knowledge bases on the fly whenever the knowledge base does not suffice to answer a user query. To this end, we develop a sound framework for appropriately generating queries to encapsulated Web services and efficient algorithms for query execution and result integration. The query generator composes sequences of function calls based on the available service interfaces. As Web service calls are expensive, our method aims to reduce the number of calls in order to retrieve results with sufficient recall. Our approach is fully implemented in a complete prototype system named ANGIE. The user can query and browse the RDF knowledge base as if it already contained all the facts from the Web services. This data, however, is gathered and integrated on the fly, transparently to the user. We demonstrate the viability and efficiency of our approach in experiments based on real-life data provided by popular Web services.
【Keywords】: binding patterns; information integration; knowledge bases; query mediation; rdf; semantics; warehousing
【Paper Link】 【Pages】:411-422
【Authors】: Hatem A. Mahmoud ; Ashraf Aboulnaga
【Abstract】: A data integration system offers a single interface to multiple structured data sources. Many application contexts (e.g., searching structured data on the web) involve the integration of large numbers of structured data sources. At web scale, it is impractical to use manual or semi-automatic data integration methods, so a pay-as-you-go approach is more appropriate. A pay-as-you-go approach entails using a fully automatic approximate data integration technique to provide an initial data integration system (i.e., an initial mediated schema, and initial mappings from source schemas to the mediated schema), and then refining the system as it gets used. Previous research has investigated automatic approximate data integration techniques, but all existing techniques require the schemas being integrated to belong to the same conceptual domain. At web scale, it is impractical to classify schemas into domains manually or semi-automatically, which limits the applicability of these techniques. In this paper, we present an approach for clustering schemas into domains without any human intervention and based only on the names of attributes in the schemas. Our clustering approach deals with uncertainty in assigning schemas to domains using a probabilistic model. We also propose a query classifier that determines, for a given keyword query, the most relevant domains to this query. We experimentally demonstrate the effectiveness of our schema clustering and query classification techniques.
【Keywords】: classification; clustering; data integration
【Paper Link】 【Pages】:423-434
【Authors】: Jeffrey Pound ; Ihab F. Ilyas ; Grant E. Weddell
【Abstract】: Automated extraction of structured data from Web sources often leads to large heterogeneous knowledge bases (KB), with data and schema items numbering in the hundreds of thousands or millions. Formulating information needs with conventional structured query languages is difficult due to the sheer size of schema information available to the user. We address this challenge by proposing a new query language that blends keyword search with structured query processing over large information graphs with rich semantics. Our formalism for structured queries based on keywords combines the flexibility of keyword search with the expressiveness of structured queries. We propose a solution to the resulting disambiguation problem caused by introducing keywords as primitives in a structured query language. We show how expressions in our proposed language can be rewritten using the vocabulary of the web-extracted KB, and how different possible rewritings can be ranked based on their syntactic relationship to the keywords in the query as well as their semantic coherence in the underlying KB. An extensive experimental study demonstrates the efficiency and effectiveness of our approach. Additionally, we show how our query language fits into QUICK, an end-to-end information system that integrates web-extracted data graphs with full-text search. In this system, the rewritten query describes an arbitrary topic of interest for which corresponding entities, and documents relevant to the entities, are efficiently retrieved.
【Keywords】: entity search; query disambiguation; semantic search; structured keyword queries
【Paper Link】 【Pages】:435-446
【Authors】: Bin Cui ; Anthony K. H. Tung ; Ce Zhang ; Zhe Zhao
【Abstract】: The emergence of social media as a crucial paradigm has posed new challenges to the research and industry communities, where media are designed to be disseminated through social interaction. Recent literature has noted the generality of multiple features in the social media environment, such as textual, visual and user information. However, most of the studies employ only a relatively simple mechanism to merge the features rather than fully exploit feature correlation for social media applications. In this paper, we propose a novel approach to fusing multiple features and their correlations for similarity evaluation. Specifically, we first build a Feature Interaction Graph (FIG) by taking features as nodes and the correlations between them as edges. Then, we employ a probabilistic model based on Markov Random Field to describe the graph for similarity measure between multimedia objects. Using that, we design an efficient retrieval algorithm for large social media data. Further, we integrate temporal information into the probabilistic model for social media recommendation. We evaluate our approach using a large real-life corpus collected from Flickr, and the experimental results indicate the superiority of our proposed method over state-of-the-art techniques.
【Keywords】: feature fusion; recommendation; search; social media
【Paper Link】 【Pages】:447-458
【Authors】: James Cheng ; Yiping Ke ; Ada Wai-Chee Fu ; Jeffrey Xu Yu ; Linhong Zhu
【Abstract】: Maximal clique enumeration (MCE) is a fundamental problem in graph theory and has important applications in many areas such as social network analysis and bioinformatics. The problem is extensively studied; however, the best existing algorithms require memory space linear in the size of the input graph. This has become a serious concern in view of the massive volume of today's fast-growing network graphs. Since MCE requires random access to different parts of a large graph, it is difficult to divide the graph into smaller parts and process one part at a time, because either the result may be incorrect and incomplete, or merging the results from different parts incurs a huge cost. We propose a novel notion, H-graph, which defines the core of a network and extends to encompass the neighborhood of the core for MCE computation. We propose the first external-memory algorithm for MCE (ExtMCE) that uses the H-graph to bound the memory usage. We prove both the correctness and completeness of the result computed by ExtMCE. Extensive experiments verify that ExtMCE efficiently processes large networks that cannot fit in memory. We also show that the H-graph captures important properties of the network; thus, updating the maximal cliques in the H-graph retains the most essential information, with a low update cost, when it is infeasible to perform updates on the entire network.
【Keywords】: h*-graph; h-index; massive networks; maximal clique enumeration; scale-free networks
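For reference, the classic in-memory algorithm for maximal clique enumeration, whose linear-in-graph-size memory requirement motivates ExtMCE, is Bron-Kerbosch with pivoting:

```python
def maximal_cliques(adj):
    """Bron-Kerbosch with pivoting over an adjacency map
    (vertex -> set of neighbors).  Enumerates every maximal clique
    exactly once, but needs the whole graph in memory -- the constraint
    the paper's external-memory ExtMCE removes."""
    cliques = []

    def expand(R, P, X):
        # R: current clique; P: candidate extensions; X: already-explored.
        if not P and not X:
            cliques.append(frozenset(R))
            return
        # Pivot on the vertex covering the most candidates to cut branching.
        pivot = max(P | X, key=lambda u: len(adj[u] & P))
        for v in list(P - adj[pivot]):
            expand(R | {v}, P & adj[v], X & adj[v])
            P.remove(v)
            X.add(v)

    expand(set(), set(adj), set())
    return cliques
```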
【Paper Link】 【Pages】:459-470
【Authors】: James Cheng ; Ada Wai-Chee Fu ; Jia Liu
【Abstract】: Serious concerns on privacy protection in social networks have been raised in recent years; however, research in this area is still in its infancy. The problem is challenging due to the diversity and complexity of graph data, on which an adversary can use many types of background knowledge to conduct an attack. One popular type of attack, as studied by pioneering work [2], is the use of embedded subgraphs. We follow this line of work and identify two realistic targets of attacks, namely, NodeInfo and LinkInfo. Our investigations show that k-isomorphism, or anonymization by forming k pairwise isomorphic subgraphs, is both sufficient and necessary for the protection. The problem is shown to be NP-hard. We devise a number of techniques to enhance the anonymization efficiency while retaining the data utility. A compound vertex ID mechanism is also introduced for privacy preservation over multiple data releases. The satisfactory performance on a number of real datasets, including HEP-Th, EUemail and LiveJournal, illustrates that the high symmetry of social networks is very helpful in mitigating the difficulty of the problem.
【Keywords】: data publishing; graph isomorphism; privacy preservation; social networks; structural attacks
【Paper Link】 【Pages】:471-482
【Authors】: Emiran Curtmola ; Alin Deutsch ; K. K. Ramakrishnan ; Divesh Srivastava
【Abstract】: We propose a novel privacy-preserving distributed infrastructure in which data resides only with the publishers owning it. The infrastructure disseminates user queries to publishers, who answer them at their own discretion. The infrastructure enforces a publisher k-anonymity guarantee, which prevents leakage of information about which publishers are capable of answering a certain query. Given the virtual nature of the global data collection, we study the challenging problem of efficiently locating publishers in the community that contain data items matching a specified query. We propose a distributed index structure, UQDT, that is organized as a union of Query Dissemination Trees (QDTs), and realized on an overlay (i.e., logical) network infrastructure. Each QDT has data publishers as its leaf nodes, and overlay network nodes as its internal nodes; each internal node routes queries to publishers, based on a summary of the data advertised by publishers in its subtrees. We experimentally evaluate design tradeoffs, and demonstrate that UQDT can maximize throughput by preventing any overlay network node from becoming a bottleneck.
【Keywords】: distributed infrastructure; k-anonymity; load balancing; online communities; privacy aware; privacy preservation; publisher k-anonymity; query dissemination; user censorship resistant
【Paper Link】 【Pages】:483-494
【Authors】: John Cieslewicz ; Kenneth A. Ross ; Kyoho Satsumi ; Yang Ye
【Abstract】: To take full advantage of the parallelism offered by a multi-core machine, one must write parallel code. Writing parallel code is difficult. Even when one writes correct code, there are numerous performance pitfalls. For example, an unrecognized data hotspot could mean that all threads effectively serialize their access to the hotspot, and throughput is dramatically reduced. Previous work has demonstrated that database operations suffer from such hotspots when naively implemented to run in parallel on a multi-core processor. In this paper, we aim to provide a generic framework for performing certain kinds of concurrent database operations in parallel. The formalism is similar to user-defined aggregates and Google's MapReduce in that users specify certain functions for parts of the computation that need to be performed over large volumes of data. We provide infrastructure that allows multiple threads on a multi-core machine to concurrently perform read and write operations on shared data structures, automatically mitigating hotspots and other performance hazards. Our goal is not to squeeze the last drop of performance out of a particular platform. Rather, we aim to provide a framework within which a programmer can, without detailed knowledge of concurrent and parallel programming, develop code that efficiently utilizes a multi-core machine.
【Keywords】: contention; parallel programming
【Paper Link】 【Pages】:495-506
【Authors】: Rares Vernica ; Michael J. Carey ; Chen Li
【Abstract】: In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory on each node. We also propose solutions for the case where, even if we use the most fine-grained partitioning, the data still does not fit in the main memory of a node. We report results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.
【Keywords】: mapreduce; set-similarity join
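A key building block for set-similarity joins of this kind is prefix filtering: under a fixed global token order, two sets can reach a Jaccard threshold t only if their prefixes (of length |x| - ceil(t*|x|) + 1) share a token. A single-node sketch of a prefix-filtered self-join follows; the paper distributes this idea across MapReduce stages, and the details here are illustrative:

```python
import math

def jaccard(a, b):
    return len(a & b) / len(a | b)

def ssjoin(records, threshold):
    """Self set-similarity join with prefix filtering.  Each record is a
    set of tokens; returns pairs (i, j), i < j, with Jaccard >= threshold."""
    # Assign each token a fixed global rank.  Any consistent order
    # preserves correctness; ordering by ascending frequency is the
    # usual heuristic for minimizing candidates.
    order = {}
    for rec in records:
        for tok in rec:
            order.setdefault(tok, len(order))

    index = {}      # token -> ids of records whose prefix contains it
    results = set()
    for rid, rec in enumerate(records):
        toks = sorted(rec, key=order.get)
        # Prefix length for Jaccard threshold t: |x| - ceil(t*|x|) + 1.
        prefix_len = len(toks) - math.ceil(threshold * len(toks)) + 1
        candidates = set()
        for tok in toks[:prefix_len]:
            candidates.update(index.get(tok, []))
            index.setdefault(tok, []).append(rid)
        for other in candidates:
            if jaccard(records[rid], records[other]) >= threshold:
                results.add((other, rid))
    return results
```

In the MapReduce version, the token-order computation, the prefix-token grouping, and the candidate verification become the three stages the abstract describes.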
【Paper Link】 【Pages】:507-518
【Authors】: Kristi Morton ; Magdalena Balazinska ; Dan Grossman
【Abstract】: Time-oriented progress estimation for parallel queries is a challenging problem that has received only limited attention. In this paper, we present ParaTimer, a new type of time-remaining indicator for parallel queries. Several parallel data processing systems exist. ParaTimer targets environments where declarative queries are translated into ensembles of MapReduce jobs. ParaTimer builds on previous techniques and makes two key contributions. First, it estimates the progress of queries that translate into directed acyclic graphs of MapReduce jobs, where jobs on different paths can execute concurrently (unlike prior work that looked at sequences only). For such queries, we use a new type of critical-path-based progress-estimation approach. Second, ParaTimer handles a variety of real systems challenges such as failures and data skew. To handle unexpected changes in query execution times due to runtime condition changes, ParaTimer provides users with not only one but with a set of time-remaining estimates, each one corresponding to a different carefully selected scenario. We implement our estimator in the Pig system and demonstrate its performance on experiments running on a real, small-scale cluster.
【Keywords】: mapreduce; parallel database; progress estimation
【Paper Link】 【Pages】:519-530
【Authors】: Subi Arumugam ; Alin Dobra ; Christopher M. Jermaine ; Niketan Pansare ; Luis Leopoldo Perez
【Abstract】: Since the 1970s, database systems have been "compute-centric". When a computation needs the data, it requests the data, and the data are pulled through the system. We believe that this is problematic for two reasons. First, requests for data naturally incur high latency as the data are pulled through the memory hierarchy, and second, it makes it difficult or impossible for multiple queries or operations that are interested in the same data to amortize the bandwidth and latency costs of their data access. In this paper, we describe a purely push-based research prototype database system called DataPath. DataPath is "data-centric". In DataPath, queries do not request data. Instead, data are automatically pushed onto processors, where they are then processed by any interested computation. We show experimentally on a multi-terabyte benchmark that this basic design principle makes for a very lean and fast database system.
【Keywords】: algorithms
【Paper Link】 【Pages】:531-542
【Authors】: Surajit Chaudhuri ; Hongrae Lee ; Vivek R. Narasayya
【Abstract】: Parameterized queries are commonly used in database applications. In a parameterized query, the same SQL statement is potentially executed multiple times with different parameter values. In today's DBMSs the query optimizer typically chooses a single execution plan that is reused for multiple instances of the same query. A key problem is that even if a plan with low average cost across instances is chosen, its variance can be high, which is undesirable in many production settings. In this paper, we describe techniques for selecting a plan that can better address the trade-off between the average and variance of cost across instances of a parameterized query. We show how to efficiently compute the skyline in the average-variance cost space. We have implemented our techniques on top of a commercial DBMS. We present experimental results on benchmark and real-world decision support queries.
【Keywords】: query optimizer; variance; workload
【Paper Link】 【Pages】:543-554
【Authors】: Sándor Héman ; Marcin Zukowski ; Niels J. Nes ; Lefteris Sidirourgos ; Peter A. Boncz
【Abstract】: In this paper we investigate techniques that allow for on-line updates to columnar databases, leaving intact their high read-only performance. Rather than keeping differential structures organized by the table key values, the core proposition of this paper is that this can better be done by keeping track of the tuple positions of the modifications. Not only does this minimize the computational overhead of merging differences into read-only queries, it also makes the differential structure oblivious to the values of the order keys, allowing it to avoid disk I/O for retrieving the order keys in read-only queries that otherwise do not need them - a crucial advantage for a column store. We describe a new data structure for maintaining such positional updates, called the Positional Delta Tree (PDT), and describe detailed algorithms for PDT/column merging, updating PDTs, and using PDTs in transaction management. In experiments with a columnar DBMS, we perform microbenchmarks on PDTs, and show in a TPC-H workload that PDTs allow quick on-line updates, yet significantly reduce their performance impact on read-only queries compared with classical value-based differential methods.
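A much-simplified sketch of merging positional deltas into a column scan is shown below. The paper's PDT is a tree structure supporting efficient update and lookup; here a flat dictionary stands in for it, at most one operation per position is handled, and positions refer to the stable base column. The point illustrated is that the merge never consults any sort-key values.

```python
def merge_positional_deltas(column, deltas):
    """column: list of values in stable tuple order.
    deltas: list of (pos, op, value) triples on base-column positions,
    op in {'ins', 'del', 'mod'}, at most one op per position.
    Returns the merged column image."""
    d = {p: (op, v) for p, op, v in deltas}
    out = []
    for pos, val in enumerate(column):
        op, v = d.get(pos, (None, None))
        if op == 'ins':
            out.append(v)        # new tuple inserted before this position
            out.append(val)
        elif op == 'del':
            pass                 # tuple at this position was deleted
        elif op == 'mod':
            out.append(v)        # value replaced in place
        else:
            out.append(val)      # untouched tuple
    return out
```

Because lookups key on position rather than order-key value, a scan of a column that is not the sort key needs no extra I/O to apply the deltas.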
【Keywords】: columnstores; pdt; positional; updates
【Paper Link】 【Pages】:555-566
【Authors】: Leong Hou U ; Nikos Mamoulis ; Klaus Berberich ; Srikanta J. Bedathur
【Abstract】: We propose and study a new ranking problem in versioned databases. Consider a database of versioned objects which have different valid instances along a history (e.g., documents in a web archive). Durable top-k search finds the set of objects that are consistently in the top-k results of a query (e.g., a keyword query) throughout a given time interval (e.g., from June 2008 to May 2009). Existing work on temporal top-k queries mainly focuses on finding the most representative top-k elements within a time interval. Such methods are not readily applicable to durable top-k queries. To address this need, we propose two techniques that compute the durable top-k result. The first is adapted from the classic top-k rank aggregation algorithm NRA. The second technique is based on a shared execution paradigm and is more efficient than the first approach. In addition, we propose a special indexing technique for archived data. The index, coupled with a space partitioning technique, improves performance even further. We use data from Wikipedia and the Internet Archive to demonstrate the efficiency and effectiveness of our solutions.
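The problem statement can be captured by the naive baseline that the paper's NRA-based and shared-execution techniques improve on: compute each snapshot's top-k independently and intersect. The `score(version, obj)` interface is an illustrative stand-in for per-version relevance scores.

```python
def durable_topk(versions, score, k):
    """versions: list of snapshots, each a list of object ids.
    score(vi, obj) gives obj's score in snapshot vi.
    Returns the objects that appear in the top-k of *every* snapshot
    in the interval -- the durable top-k result, computed naively."""
    durable = None
    for vi, objs in enumerate(versions):
        topk = set(sorted(objs, key=lambda o: -score(vi, o))[:k])
        durable = topk if durable is None else durable & topk
    return durable
```

The baseline scores every object in every version; the paper's contribution is avoiding exactly that via shared execution and indexing.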
【Keywords】: document archives; temporal queries; top-k search
【Paper Link】 【Pages】:567-578
【Authors】: Yupeng Fu ; Keith Kowalczykowski ; Kian Win Ong ; Yannis Papakonstantinou ; Kevin Keliang Zhao
【Abstract】: While Ajax-based programming enables faster performance and higher interface quality than pure server-side programming, it is demanding and error-prone, as each action that partially updates the page requires custom, ad-hoc code. The problem is exacerbated by distributed programming between the browser and server, where the developer uses JavaScript to access the page state and Java/SQL for the database. The FORWARD framework simplifies the development of Ajax pages by treating them as rendered views, where the developer declares a view using an extension of SQL and page units, which map to the view and render the data in the browser. Such a declarative approach leads to significantly less code, as the framework automatically solves performance optimization problems that the developer would otherwise hand-code. Since pages are fueled by views, FORWARD leverages years of database research on incremental view maintenance by creating optimization techniques appropriately extended for the needs of pages (nesting, variability, ordering), thereby achieving performance comparable to hand-coded JavaScript/Java applications.
【Keywords】: ajax; sql; view maintenance
【Paper Link】 【Pages】:579-590
【Authors】: Donald Kossmann ; Tim Kraska ; Simon Loesing
【Abstract】: Cloud computing promises a number of advantages for the deployment of data-intensive applications. One important promise is reduced cost with a pay-as-you-go business model. Another promise is (virtually) unlimited throughput by adding servers if the workload increases. This paper lists alternative architectures to effect cloud computing for database applications and reports on the results of a comprehensive evaluation of existing commercial cloud services that have adopted these architectures. The focus of this work is on transaction processing (i.e., read and update workloads), rather than analytics or OLAP workloads, which have recently gained a great deal of attention. The results are surprising in several ways. Most importantly, it seems that all major vendors have adopted a different architecture for their cloud services. As a result, the cost and performance of the services vary significantly depending on the workload.
【Keywords】: benchmark; cloud computing; cloud db; cloud provider; cost; performance evaluation; transaction processing
【Paper Link】 【Pages】:591-602
【Authors】: Jinbao Wang ; Sai Wu ; Hong Gao ; Jianzhong Li ; Beng Chin Ooi
【Abstract】: Providing scalable database services is an essential requirement for extending many existing applications to the Cloud platform. Due to the diversity of applications, database services on the Cloud must support large-scale data analytical jobs and highly concurrent OLTP queries. Most existing work focuses on specific types of applications. To provide an integrated framework, we are designing a new system, epiC, as our solution to next-generation database systems. In epiC, indexes play an important role in improving overall performance. Different types of indexes are built to provide efficient query processing for different applications. In this paper, we propose RT-CAN, a multi-dimensional indexing scheme in epiC. RT-CAN integrates a CAN-based routing protocol [23] with an R-tree-based indexing scheme to support efficient multi-dimensional query processing in a Cloud system. RT-CAN organizes storage and compute nodes into an overlay structure based on an extended CAN protocol. In our proposal, we make a simple assumption that each compute node uses an R-tree-like indexing structure to index the data that are locally stored. We propose a query-conscious cost model that selects beneficial local R-tree nodes for publishing. By keeping the number of persistently connected nodes small and maintaining a global multi-dimensional search index, we can locate the compute nodes that may contain the answer within a few hops, making the scheme scalable in terms of data volume and number of compute nodes. Experiments on Amazon's EC2 show that our proposed routing protocol and indexing scheme are robust, efficient and scalable.
【Keywords】: cloud; index; query processing
【Paper Link】 【Pages】:603-614
【Authors】: Evan P. C. Jones ; Daniel J. Abadi ; Samuel Madden
【Abstract】: Database partitioning is a technique for improving the performance of distributed OLTP databases, since "single partition" transactions that access data on one partition do not need coordination with other partitions. For workloads that are amenable to partitioning, some argue that transactions should be executed serially on each partition without any concurrency at all. This strategy makes sense for a main memory database where there are no disk or user stalls, since the CPU can be fully utilized and the overhead of traditional concurrency control, such as two-phase locking, can be avoided. Unfortunately, many OLTP applications have some transactions which access multiple partitions. This introduces network stalls in order to coordinate distributed transactions, which will limit the performance of a database that does not allow concurrency. In this paper, we compare two low overhead concurrency control schemes that allow partitions to work on other transactions during network stalls, yet have little cost in the common case when concurrency is not needed. The first is a light-weight locking scheme, and the second is an even lighter-weight type of speculative concurrency control that avoids the overhead of tracking reads and writes, but sometimes performs work that eventually must be undone. We quantify the range of workloads over which each technique is beneficial, showing that speculative concurrency control generally outperforms locking as long as there are few aborts or few distributed transactions that involve multiple rounds of communication. On a modified TPC-C benchmark, speculative concurrency control can improve throughput relative to the other schemes by up to a factor of two.
【Keywords】: concurrency control; distributed databases
【Paper Link】 【Pages】:615-626
【Authors】: Wenchao Zhou ; Micah Sherr ; Tao Tao ; Xiaozhou Li ; Boon Thau Loo ; Yun Mao
【Abstract】: Network accountability, forensic analysis, and failure diagnosis are becoming increasingly important for network management and security. Such capabilities often utilize network provenance - the ability to issue queries over network meta-data. For example, network provenance may be used to trace the path a message traverses on the network as well as to determine how message data were derived and which parties were involved in its derivation. This paper presents the design and implementation of ExSPAN, a generic and extensible framework that achieves efficient network provenance in a distributed environment. We utilize the database notion of data provenance to "explain" the existence of any network state, providing a versatile mechanism for network provenance. To achieve such flexibility at Internet-scale, ExSPAN uses declarative networking in which network protocols can be modeled as continuous queries over distributed streams and specified concisely in a declarative query language. We extend existing data models for provenance developed in database literature to enable distribution at Internet-scale, and investigate numerous optimization techniques to maintain and query distributed network provenance efficiently. The ExSPAN prototype is developed using RapidNet, a declarative networking platform based on the emerging ns-3 toolkit. Experiments over a simulated network and an actual deployment in a testbed environment demonstrate that our system supports a wide range of distributed provenance computations efficiently, resulting in significant reductions in bandwidth costs compared to traditional approaches.
【Keywords】: declarative networking; distributed query processing; provenance
【Paper Link】 【Pages】:627-638
【Authors】: Yohan J. Roh ; Jae Ho Kim ; Yon Dohn Chung ; Jin Hyun Son ; Myoung-Ho Kim
【Abstract】: Histograms have been widely used for fast estimation of query result sizes in query optimization. In this paper, we propose a new histogram method, called the Skew-Tolerant Histogram (STHistogram), for two- or three-dimensional geographic data objects, which are used in many real-world applications. The proposed method provides significantly enhanced accuracy in a robust manner, even for data sets with highly skewed distributions. Our method detects hotspots present in various parts of a data set and exploits them in organizing histogram buckets. For this purpose, we first define the concept of a hotspot, and provide an algorithm that efficiently extracts hotspots from the given data set. Then, we present our histogram construction method that utilizes hotspot information. We also describe how to estimate query result sizes by using the proposed histogram. We show through extensive performance experiments that the proposed method provides better performance than other existing methods.
【Keywords】: histograms; query optimization; spatial databases
【Paper Link】 【Pages】:639-650
【Authors】: Yufei Tao ; Ke Yi ; Cheng Sheng ; Jian Pei ; Feifei Li
【Abstract】: Quantiles are a crucial type of order statistics in databases. Extensive research has been focused on maintaining a space-efficient structure for approximate quantile computation as the underlying dataset is updated. The existing solutions, however, are designed to support only the current, most-updated, snapshot of the dataset. Queries on the past versions of the data cannot be answered. This paper studies the problem of historical quantile search. The objective is to enable ε-approximate quantile retrieval on any snapshot of the dataset in history. The problem is very important in analyzing the evolution of a distribution, monitoring the quality of services, query optimization in temporal databases, and so on. We present the first formal results in the literature. First, we prove a novel theoretical lower bound on the space cost of supporting ε-approximate historical quantile queries. The bound reveals the fundamental difference between answering quantile queries about the past and those about the present time. Second, we propose a structure for finding ε-approximate historical quantiles, and show that it consumes more space than the lower bound by only a square-logarithmic factor. Extensive experiments demonstrate that in practice our technique performs much better than predicted by theory. In particular, the quantiles it returns are remarkably more accurate than the theoretical precision guarantee.
【Keywords】: approximation; lower bound; quantile
【Paper Link】 【Pages】:651-662
【Authors】: Sai Wu ; Beng Chin Ooi ; Kian-Lee Tan
【Abstract】: In this paper, we propose an online aggregation system called COSMOS (Continuous Sampling for Multiple queries in an Online aggregation System), to process multiple aggregate queries efficiently. In COSMOS, a dataset is first scrambled so that sequentially scanning the dataset gives rise to a stream of random samples for all queries. Moreover, COSMOS organizes queries into a dissemination graph to exploit the dependencies across queries. In this way, aggregates of queries closer to the root (the source of data flow) can potentially be used to compute the aggregates of descendent/dependent queries. COSMOS applies a statistical approach to combine answers from ancestor nodes to generate the online aggregates for a node. COSMOS also offers a partitioning strategy to further salvage intermediate answers. We have implemented COSMOS and conducted an extensive experimental study in PostgreSQL. Our results on the TPC-H benchmark show the efficiency and effectiveness of COSMOS.
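The core estimator behind online aggregation over a scrambled scan can be sketched as follows. This is a minimal illustration of the unbiased scaled partial sum only; COSMOS's multi-query sharing and dissemination graph are not modeled, and the batching parameter is illustrative.

```python
import random

def online_sum(dataset, batch=2, seed=0):
    """Yield running SUM estimates while sequentially scanning a
    pre-scrambled dataset. After seeing m of n rows, the scaled
    partial sum (n/m) * sum_seen is an unbiased estimate of the
    total, because the scramble makes any prefix a random sample."""
    data = list(dataset)
    random.Random(seed).shuffle(data)   # the one-time "scramble"
    n, seen_sum = len(data), 0.0
    for m, v in enumerate(data, 1):
        seen_sum += v
        if m % batch == 0 or m == n:
            yield m, n / m * seen_sum
```

By the end of the scan the estimate converges to the exact aggregate; intermediate yields give progressively tighter approximate answers.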
【Keywords】: dissemination graph; online aggregation; random; sampling
【Paper Link】 【Pages】:663-674
【Authors】: Carl-Christian Kanne ; Guido Moerkotte
【Abstract】: Virtually all histograms store for each bucket the number of distinct values it contains and their average frequency. In this paper, we question this paradigm. We start out by investigating the estimation precision of three commercial database systems, which also follow the above paradigm. It turns out that huge errors are quite common. We then introduce new bucket types and investigate their accuracy when building optimal histograms with them. The results are ambiguous: there is no clear winner among the bucket types. At this point, we (1) switch to heterogeneous histograms, where different buckets of the same histogram may be of different types, and (2) design more bucket types. The nice consequence of introducing heterogeneous histograms is that we can guarantee decent upper error bounds while requiring far less space than homogeneous histograms.
【Keywords】: cardinality estimation; histograms
【Paper Link】 【Pages】:675-686
【Authors】: Bhargav Kanagal ; Amol Deshpande
【Abstract】: In this paper, we address the problem of scalably evaluating conjunctive queries over correlated probabilistic databases containing tuple or attribute uncertainties. Like previous work, we adopt a two-phase approach where we first compute lineages of the output tuples, and then compute the probabilities of the lineage formulas. However unlike previous work, we allow for arbitrary and complex correlations to be present in the data, captured via a forest of junction trees. We observe that evaluating even read-once (tree structured) lineages (e.g., those generated by hierarchical conjunctive queries), polynomially computable over tuple independent probabilistic databases, is #P-complete for lightly correlated probabilistic databases like Markov sequences. We characterize the complexity of exact computation of the probability of the lineage formula on a correlated database using a parameter called lwidth (analogous to the notion of treewidth). For lineages that result in low lwidth, we compute exact probabilities using a novel message passing algorithm, and for lineages that induce large lwidths, we develop approximate Monte Carlo algorithms to estimate the result probabilities. We scale our algorithms to very large correlated probabilistic databases using the previously proposed INDSEP data structure. To mitigate the complexity of lineage evaluation, we develop optimization techniques to process a batch of lineages by sharing computation across formulas, and to exploit any independence relationships that may exist in the data. Our experimental study illustrates the benefits of using our algorithms for processing lineage formulas over correlated probabilistic databases.
【Keywords】: conjunctive queries; indexing; junction trees; lineage; probabilistic databases
【Paper Link】 【Pages】:687-698
【Authors】: Luis Leopoldo Perez ; Subi Arumugam ; Christopher M. Jermaine
【Abstract】: MCDB is a prototype database system for managing stochastic models for uncertain data. In this paper, we study the problem of how to use MCDB to answer statistical queries that search for database objects which satisfy some filter condition with greater (or less than) a user-specified probability. For example: "Which packages will arrive late with > 5% probability?" "Which regions will see more than a 2% decline in sales with > 50% probability?" "What items will be out of stock by Friday with > 20% probability?" We consider both the systems aspects and the statistical aspects of the problem.
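Such threshold queries reduce to per-object Monte Carlo estimation, which can be sketched as below. The `simulate` and `cond` callables are hypothetical stand-ins for an MCDB stochastic model and filter condition; the paper's actual contribution concerns doing this efficiently and with statistical guarantees inside the database.

```python
import random

def prob_filter(objects, simulate, cond, threshold, trials=2000, seed=1):
    """For each object, estimate P(cond holds) by Monte Carlo over the
    stochastic model simulate(rng, obj), and keep objects whose
    estimated probability exceeds threshold."""
    rng = random.Random(seed)
    kept = []
    for obj in objects:
        hits = sum(cond(simulate(rng, obj)) for _ in range(trials))
        if hits / trials > threshold:
            kept.append(obj)
    return kept
```

For instance, with objects carrying a true lateness probability, the filter keeps those whose estimated probability of being late exceeds the user's threshold.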
【Keywords】: algorithms
【Paper Link】 【Pages】:699-710
【Authors】: Kai Zheng ; Gabriel Pui Cheong Fung ; Xiaofang Zhou
【Abstract】: The K-Nearest Neighbor search (kNN) problem has been investigated extensively in the past due to its broad range of applications. In this paper we study this problem in the context of fuzzy objects that have indeterministic boundaries. Fuzzy objects play an important role in many areas, such as biomedical image databases and GIS. Existing research on fuzzy objects mainly focuses on modelling basic fuzzy object types and operations, leaving the processing of more advanced queries such as kNN query untouched. In this paper, we propose two new kinds of kNN queries for fuzzy objects, Ad-hoc kNN query (AKNN) and Range kNN query (RKNN), to find the k nearest objects qualifying at a probability threshold or within a probability range. For efficient AKNN query processing, we optimize the basic best-first search algorithm by deriving more accurate approximations for the distance function between fuzzy objects and the query object. To improve the performance of RKNN search, effective pruning rules are developed to significantly reduce the search space and further speed up the candidate refinement process. The efficiency of our proposed algorithms as well as the optimization techniques are verified with an extensive set of experiments using both synthetic and real datasets.
【Keywords】: fuzzy database; nearest neighbor query; probabilistic database
【Paper Link】 【Pages】:711-722
【Authors】: Zhuowei Bao ; Susan B. Davidson ; Sanjeev Khanna ; Sudeepa Roy
【Abstract】: We develop a compact and efficient reachability labeling scheme for answering provenance queries on workflow runs that conform to a given specification. Even though a workflow run can be structurally more complex and can be arbitrarily larger than the specification due to fork (parallel) and loop executions, we show that a compact reachability labeling for a run can be efficiently computed using the fact that it originates from a fixed specification. Our labeling scheme is optimal in the sense that it uses labels of logarithmic length, runs in linear time, and answers any reachability query in constant time. Our approach is based on using the reachability labeling for the specification as an effective skeleton for designing the reachability labeling for workflow runs. We also demonstrate empirically the effectiveness of our skeleton-based labeling approach.
【Keywords】: labeling; provenance; reachability; workflow
【Paper Link】 【Pages】:723-734
【Authors】: William R. Marczak ; Shan Shan Huang ; Martin Bravenboer ; Micah Sherr ; Boon Thau Loo ; Molham Aref
【Abstract】: We present SecureBlox, a declarative system that unifies a distributed query processor with a security policy framework. SecureBlox decouples security concerns from system specification, allowing easy reconfiguration of a system's security properties to suit a given execution environment. Our implementation of SecureBlox is a series of extensions to LogicBlox, an emerging commercial Datalog-based platform for enterprise software systems. SecureBlox enhances LogicBlox to enable distribution and static meta-programmability, and makes novel use of existing LogicBlox features such as integrity constraints. SecureBlox allows meta-programmability via BloxGenerics - a language extension for compile-time code generation based on the security requirements and trust policies of the deployed environment. We present and evaluate detailed use-cases in which SecureBlox enables diverse applications, including an authenticated declarative routing protocol with encrypted advertisements and an authenticated and encrypted parallel hash join operation. Our results demonstrate SecureBlox's abilities to specify and implement a wide range of different security constructs for distributed systems as well as to enable tradeoffs between performance and security.
【Keywords】: datalog; distributed query processing; secure data management
【Paper Link】 【Pages】:735-746
【Authors】: Vibhor Rastogi ; Suman Nath
【Abstract】: We propose the first differentially private aggregation algorithm for distributed time-series data that offers good practical utility without any trusted server. This addresses two important challenges in participatory data-mining applications where (i) individual users collect temporally correlated time-series data (such as location traces, web history, personal health data), and (ii) an untrusted third-party aggregator wishes to run aggregate queries on the data. To ensure differential privacy for time-series data despite the presence of temporal correlation, we propose the Fourier Perturbation Algorithm (FPAk). Standard differential privacy techniques perform poorly for time-series data: to answer n queries, such techniques can add noise of Θ(n) to each query answer, making the answers practically useless if n is large. Our FPAk algorithm perturbs the Discrete Fourier Transform of the query answers. For answering n queries, FPAk improves the expected error from Θ(n) to roughly Θ(k), where k is the number of Fourier coefficients that can (approximately) reconstruct all the n query answers. Our experiments show that k << n for many real-life data-sets, resulting in a huge error improvement for FPAk. To deal with the absence of a trusted central server, we propose the Distributed Laplace Perturbation Algorithm (DLPA) to add noise in a distributed way in order to guarantee differential privacy. To the best of our knowledge, DLPA is the first distributed differentially private algorithm that can scale with a large number of users: DLPA outperforms the only other distributed solution for differential privacy proposed so far, by reducing the computational load per user from O(U) to O(1), where U is the number of users.
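A toy rendition of the Fourier-perturbation idea follows: add Laplace noise to the first k DFT coefficients, zero the rest, and invert. The noise calibration here is deliberately simplified and does not reproduce the paper's exact sensitivity analysis; a real deployment must calibrate the scale to the query set.

```python
import cmath
import math
import random

def fpa_k(series, k, epsilon, seed=0):
    """Perturb a length-n series by noising its first k DFT
    coefficients (scale growing with sqrt(k) rather than n) and
    reconstructing from them alone. Simplified calibration."""
    n = len(series)
    rng = random.Random(seed)

    def lap(scale):
        # Inverse-CDF sampling of a zero-mean Laplace variate.
        u = rng.random() - 0.5
        sign = 1.0 if u >= 0 else -1.0
        return -scale * sign * math.log(1.0 - 2.0 * abs(u))

    # Forward DFT of the n query answers.
    F = [sum(series[t] * cmath.exp(-2j * cmath.pi * f * t / n)
             for t in range(n)) for f in range(n)]
    # Noise the first k coefficients; drop the rest.
    scale = math.sqrt(k) / epsilon          # simplified noise calibration
    Fk = [F[f] + complex(lap(scale), lap(scale)) if f < k else 0j
          for f in range(n)]
    # Inverse DFT back to the time domain.
    return [(sum(Fk[f] * cmath.exp(2j * cmath.pi * f * t / n)
                 for f in range(n)) / n).real for t in range(n)]
```

When the series is smooth, a small k already reconstructs it well, which is why moving the perturbation to the frequency domain cuts the error from Θ(n) to roughly Θ(k).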
【Keywords】: distributed noise addition; output perturbation; participatory data mining; private data analysis; time-series data
【Paper Link】 【Pages】:747-758
【Authors】: Wai Kit Wong ; Nikos Mamoulis ; David Wai-Lok Cheung
【Abstract】: Most previous research on privacy-preserving data publishing, based on the k-anonymity model, has followed the simplistic approach of homogeneously giving the same generalized value to all quasi-identifiers within a partition. We observe that the anonymization error can be reduced if we follow a non-homogeneous generalization approach for groups of size larger than k. Such an approach would allow tuples within a partition to take different generalized quasi-identifier values. Anonymization following this model is not trivial, as its direct application can easily violate k-anonymity. In addition, non-homogeneous generalization allows for additional types of attack, which should be considered in the process. We provide a methodology for verifying whether a non-homogeneous generalization violates k-anonymity. Then, we propose a technique that generates a non-homogeneous generalization for a partition and show that its result satisfies k-anonymity; applied straightforwardly, however, it can compromise privacy if the attacker knows the anonymization algorithm. Based on this, we propose a randomization method that prevents this type of attack and show that it does not compromise k-anonymity. Non-homogeneous generalization can be used on top of any existing partitioning approach to improve its utility. In addition, we show that a new partitioning technique tailored for non-homogeneous generalization can further improve quality. A thorough experimental evaluation demonstrates that our methodology greatly improves the utility of anonymized data in practice.
【Keywords】: anonymization; non-homogeneous generalization; privacy
【Paper Link】 【Pages】:759-770
【Authors】: Hazem Elmeleegy ; Mourad Ouzzani ; Ahmed K. Elmagarmid ; Ahmad M. Abusalah
【Abstract】: Peer-to-peer data integration - a.k.a. Peer Data Management Systems (PDMSs) - promises to extend the classical data integration approach to the Internet scale. Unfortunately, some challenges remain before realizing this promise. One of the biggest challenges is preserving the privacy of the exchanged data while passing through several intermediate peers. Another challenge is protecting the mappings used for data translation. Protecting the privacy without being unfair to any of the peers is yet a third challenge. This paper presents a novel query answering protocol in PDMSs to address these challenges. The protocol employs a technique based on noise selection and insertion to protect the query results, and a commutative encryption-based technique to protect the mappings and ensure fairness among peers. An extensive security analysis of the protocol shows that it is resilient to several possible types of attacks. We implemented the protocol within an established PDMS: the Hyperion system. We conducted an experimental study using real data from the healthcare domain. The results show that our protocol manages to achieve its privacy and fairness goals, while maintaining query processing time at the interactive level.
【Keywords】: fairness; mappings; peer data management systems; peer-to-peer data integration; privacy
【Paper Link】 【Pages】:771-782
【Authors】: Nikos Sarkas ; Stelios Paparizos ; Panayiotis Tsaparas
【Abstract】: Queries asked on web search engines often target structured data, such as commercial products, movie showtimes, or airline schedules. However, surfacing relevant results from such data is a highly challenging problem, due to the unstructured language of web queries and the imposing scalability and speed requirements of web search. In this paper, we discover latent structured semantics in web queries and produce Structured Annotations for them. We consider an annotation as a mapping of a query to a table of structured data and attributes of this table. Given a collection of structured tables, we present a fast and scalable tagging mechanism for obtaining all possible annotations of a query over these tables. However, we observe that for a given query, only a few are sensible given the user's needs. We thus propose a principled probabilistic scoring mechanism, using a generative model, for assessing the likelihood of a structured annotation, and we define a dynamic threshold for filtering out misinterpreted query annotations. Our techniques are completely unsupervised, obviating the need for costly manual labeling effort. We evaluated our techniques using real-world queries and data and present promising experimental results.
【Keywords】: keyword search; structured data; web
【Paper Link】 【Pages】:783-794
【Authors】: Arvind Arasu ; Michaela Götz ; Raghav Kaushik
【Abstract】: We consider the problem of learning a record matching package (classifier) in an active learning setting. In active learning, the learning algorithm picks the set of examples to be labeled, unlike the more traditional passive learning setting, where a user selects the labeled examples. Active learning is important for record matching, since manually identifying a suitable set of labeled examples is difficult. Previous algorithms that use active learning for record matching have serious limitations: the packages they learn lack quality guarantees, and the algorithms do not scale to large input sizes. We present new algorithms for this problem that overcome these limitations. Our algorithms are fundamentally different from traditional active learning approaches, and are designed from the ground up to exploit problem characteristics specific to record matching. We include a detailed experimental evaluation on real-world data demonstrating the effectiveness of our algorithms.
【Keywords】: active learning; data cleaning; record matching
【Paper Link】 【Pages】:795-806
【Authors】: Anish Das Sarma ; Alpa Jain ; Divesh Srivastava
【Abstract】: Information extraction systems are increasingly being used to mine structured information from unstructured text documents. A commonly used unsupervised technique is to build iterative information extraction (IIE) systems that learn task-specific rules, called patterns, to generate the desired tuples. Oftentimes, output from an information extraction system may contain unexpected results, which may be due to an incorrect pattern, an incorrect tuple, or both. In such scenarios, users and developers of the extraction system could greatly benefit from an investigation tool that can quickly help them reason about and repair the output. In this paper, we develop an approach for interactive post-extraction investigation for IIE systems. We formalize three important phases of this investigation, namely, explaining the IIE result, diagnosing the influential and problematic components, and repairing the output of an information extraction system. We show how to characterize the execution of an IIE system and build a suite of algorithms to answer questions pertaining to each of these phases. We experimentally evaluate our proposed approach across several domains over a Web corpus of about 500 million documents. We show that our approach effectively enables post-extraction investigation, while maximizing the gain from user and developer interaction.
【Keywords】: debugging; diagnose; explain; information extraction; interactive investigation; repair
【Paper Link】 【Pages】:807-818
【Authors】: Eli Cortez ; Altigran Soares da Silva ; Marcos André Gonçalves ; Edleno Silva de Moura
【Abstract】: Information extraction by text segmentation (IETS) applies to cases in which data values of interest are organized in implicit semi-structured records available in textual sources (e.g., postal addresses, bibliographic information, ads). It is an important practical problem that has been frequently addressed in the recent literature. In this paper we introduce ONDUX (On Demand Unsupervised Information Extraction), a new unsupervised probabilistic approach for IETS. Like other unsupervised IETS approaches, ONDUX relies on information available in pre-existing data to associate segments in the input string with attributes of a given domain. Unlike other approaches, we rely on very effective matching strategies instead of explicit learning strategies. The effectiveness of this matching strategy is also exploited to disambiguate the extraction of certain attributes through a reinforcement step that explores sequencing and positioning of attribute values learned on demand from test data, with no previous human-driven training, a feature unique to ONDUX. This gives ONDUX a high degree of flexibility and results in superior effectiveness, as demonstrated by the experimental evaluation we report with textual sources from different domains, in which ONDUX is compared with a state-of-the-art IETS approach.
【Keywords】: data management; information extraction; text segmentation
【Paper Link】 【Pages】:819-830
【Authors】: Mohan Yang ; Haixun Wang ; Lipyeow Lim ; Min Wang
【Abstract】: An increasing number of applications operate on data obtained from the Web. These applications typically maintain local copies of the web data to avoid network latency in data accesses. As the data on the Web evolves, it is critical that the local copy be kept up-to-date. Data freshness is one of the most important data quality issues, and has been extensively studied for various applications including web crawling. However, web crawling is focused on obtaining as many raw web pages as possible. Our applications, on the other hand, are interested in specific content from specific data sources. Knowing the content or the semantics of the data enables us to differentiate data items based on their importance and volatility, which are key factors that impact the design of the data synchronization strategy. In this work, we formulate the concept of content freshness, and present a novel approach that maintains content freshness with the least amount of web communication. Specifically, we assume data is accessible through a general keyword search interface, and we form keyword queries based on their selectivity, as well as their contribution to the content freshness of the local copy. Experiments show the effectiveness of our approach compared with several naive methods for keeping data fresh.
【Keywords】: content freshness; synchronization strategy; web crawling
【Paper Link】 【Pages】:831-842
【Authors】: Adam Silberstein ; Jeff Terrace ; Brian F. Cooper ; Raghu Ramakrishnan
【Abstract】: Near real-time event streams are becoming a key feature of many popular web applications. Many web sites allow users to create a personalized feed by selecting one or more event streams they wish to follow. Examples include Twitter and Facebook, which allow a user to follow other users' activity, and iGoogle and My Yahoo, which allow users to follow selected RSS streams. How can we efficiently construct a web page showing the latest events from a user's feed? Constructing such a feed must be fast so the page loads quickly, yet the feed must reflect recent updates to the underlying event streams. The wide fanout of popular streams (those with many followers) and high skew (fanout and update rates vary widely) make it difficult to scale such applications. We associate feeds with consumers and event streams with producers. We demonstrate that the best performance results from selectively materializing each consumer's feed: events from high-rate producers are retrieved at query time, while events from lower-rate producers are materialized in advance. A formal analysis of the problem shows the surprising result that we can minimize global cost by making local decisions about each producer/consumer pair, based on the ratio between a given producer's update rate (how often an event is added to the stream) and a given consumer's view rate (how often the feed is viewed). Our experimental results, using Yahoo!'s web-scale database PNUTS, show that this hybrid strategy results in the lowest system load (and hence improves scalability) under a variety of workloads.
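The local push-vs-pull decision the abstract above describes can be sketched in a few lines. This is a toy illustration, not the paper's actual cost model: the function names, the example rates, and the simple threshold rule are assumptions for exposition.

```python
def should_materialize(update_rate, view_rate):
    """Push (pre-materialize) a producer's events into a consumer's feed
    only if the feed is viewed at least as often as the producer posts;
    otherwise pull the producer's events at query time."""
    return view_rate >= update_rate

def plan_feed(pairs):
    """pairs: iterable of (producer, update_rate, view_rate) triples."""
    return {producer: ("push" if should_materialize(u, v) else "pull")
            for producer, u, v in pairs}

plan = plan_feed([("celebrity", 100.0, 2.0),   # posts far more often than viewed
                  ("friend", 0.5, 2.0)])       # viewed more often than they post
# plan["celebrity"] == "pull", plan["friend"] == "push"
```

The point of the local rule is that each producer/consumer pair can be planned independently, which is what makes the strategy scale.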
【Keywords】: social networks; view maintenance
【Paper Link】 【Pages】:843-854
【Authors】: Senjuti Basu Roy ; Sihem Amer-Yahia ; Ashish Chawla ; Gautam Das ; Cong Yu
【Abstract】: Nowadays, online shopping has become a daily activity. Web users purchase a variety of items ranging from books to electronics. The large supply of online products calls for sophisticated techniques to help users explore available items. We propose to build composite items which associate a central item with a set of packages, formed by satellite items, and help users explore them. For example, a user shopping for an iPhone (i.e., the central item) with a price budget can be presented with both the iPhone and a package of other items that match well with the iPhone (e.g., {Belkin case, Bose sounddock, Kroo USB cable}) as a composite item, whose total price is within the user's budget. We define and study the problem of effective construction and exploration of large sets of packages associated with a central item, and design and implement efficient algorithms for solving the problem in two stages: summarization, a technique which picks k representative packages for each central item; and visual effect optimization, which helps the user find diverse composite items quickly by minimizing overlap between packages presented to the user in a ranked order. We conduct an extensive set of experiments on Yahoo! Shopping data sets to demonstrate the efficiency and effectiveness of our algorithms.
【Keywords】: composite item construction; e-commerce application; np-hard problems
【Paper Link】 【Pages】:855-866
【Authors】: Arjun Dasgupta ; Xin Jin ; Bradley Jewell ; Nan Zhang ; Gautam Das
【Abstract】: Many websites provide restrictive form-like interfaces which allow users to execute search queries on the underlying hidden databases. In this paper, we consider the problem of estimating the size of a hidden database through its web interface. We propose novel techniques which use a small number of queries to produce unbiased estimates with small variance. These techniques can also be used for approximate query processing over hidden databases. We present theoretical analysis and extensive experiments to illustrate the effectiveness of our approach.
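One classic way to obtain unbiased size estimates through a top-k search interface is a random drill-down: keep restricting the query until it no longer overflows, then weight the returned count by the inverse probability of the path taken. The sketch below is a toy version under stated assumptions (boolean attributes, a `query(assignment)` interface that reports overflow); it illustrates the style of estimator, not the paper's exact algorithm.

```python
import random

def estimate_size(query, attrs, trials=400, seed=7):
    """Random drill-down estimator: fix attribute values one at a time
    until the interface stops overflowing, then weight the returned
    count by the inverse probability (2**depth) of that drill-down path.
    query(assignment) must return (num_matches, overflow)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        assignment, depth = {}, 0
        while True:
            matches, overflow = query(assignment)
            if not overflow:
                total += matches * (2 ** depth)  # Horvitz-Thompson weight
                break
            assignment[attrs[depth]] = rng.choice([0, 1])
            depth += 1
    return total / trials

# Toy "hidden database": 10 tuples over three boolean attributes,
# reachable only through a top-k (k = 2) search interface.
DB = [dict(zip("abc", cell)) for cell in
      [(0, 0, 0), (0, 0, 0), (0, 0, 1), (0, 0, 1), (0, 1, 0),
       (0, 1, 0), (1, 0, 0), (1, 0, 0), (1, 1, 1), (1, 1, 1)]]

def top_k_interface(assignment, k=2):
    n = sum(all(t[a] == v for a, v in assignment.items()) for t in DB)
    return (min(n, k), n > k)

est = estimate_size(top_k_interface, "abc")  # close to the true size, 10
```

Each terminating drill-down node is reached with probability 2^-depth, so the weighted counts average out to the true database size.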
【Keywords】: aggregate query processing; hidden databases
【Paper Link】 【Pages】:867-878
【Authors】: Arijit Khan ; Xifeng Yan ; Kun-Lung Wu
【Abstract】: Mining graph patterns in large networks is critical to a variety of applications such as malware detection and biological module discovery. However, frequent subgraphs are often ineffective at capturing the associations existing in these applications, due to the complexity of isomorphism testing and the inelastic pattern definition. In this paper, we introduce the proximity pattern, which is a significant departure from the traditional concept of frequent subgraphs. Defined as a set of labels that co-occur in neighborhoods, a proximity pattern blurs the boundary between itemset and structure. It relaxes the rigid structure constraint of frequent subgraphs, while introducing connectivity to frequent itemsets. Therefore, it can benefit from both: efficient mining in itemsets and structure proximity from graphs. We developed two models to define proximity patterns. The second one, called Normalized Probabilistic Association (NmPA), is able to transform a complex graph mining problem into a simplified probabilistic itemset mining problem, which can be solved efficiently by a modified FP-tree algorithm, called pFP. NmPA and pFP are evaluated on real-life social and intrusion networks. Empirical results show that our approach not only finds interesting patterns that are ignored by existing approaches, but also achieves high performance for finding proximity patterns in large-scale graphs.
【Keywords】: association; graph; mining; pattern
【Paper Link】 【Pages】:879-890
【Authors】: Ning Jin ; Calvin Young ; Wei Wang
【Abstract】: Discriminative subgraphs are widely used to define the feature space for graph classification in large graph databases. Several scalable approaches have been proposed to mine discriminative subgraphs. However, their intensive computation needs prevent them from mining large databases. We propose an efficient method GAIA for mining discriminative subgraphs for graph classification in large databases. Our method employs a novel subgraph encoding approach to support an arbitrary subgraph pattern exploration order and explores the subgraph pattern space in a process resembling biological evolution. In this manner, GAIA is able to find discriminative subgraph patterns much faster than other algorithms. Additionally, we take advantage of parallel computing to further improve the quality of resulting patterns. In the end, we employ sequential coverage to generate association rules as graph classifiers using patterns mined by GAIA. Extensive experiments have been performed to analyze the performance of GAIA and to compare it with two other state-of-the-art approaches. GAIA outperforms the other approaches both in terms of classification accuracy and runtime efficiency.
【Keywords】: graph classification; graph mining
【Paper Link】 【Pages】:891-902
【Authors】: Yufei Tao ; Cheng Sheng ; Jianzhong Li
【Abstract】: An (edge) hidden graph is a graph whose edges are not explicitly given. Detecting the presence of an edge requires expensive edge-probing queries. We consider the k most connected vertex problem on hidden bipartite graphs. Specifically, given a bipartite graph G with independent vertex sets B and W, the goal is to find the k vertices in B with the largest degrees using the minimum number of queries. This problem can be regarded as a top-k extension of a semi-join, and is encountered in many applications in practice (e.g., top-k spatial join with arbitrarily complex join predicates). If B and W have n and m vertices respectively, the number of queries needed to solve the problem is nm in the worst case. This, however, is a pessimistic estimate of how many queries are necessary on practical data. In fact, on some easy inputs, the problem can be efficiently settled with only km + n queries, which is significantly lower than nm for k << n. The huge difference between km + n and nm makes it interesting to design an adaptive algorithm that is guaranteed to achieve the best possible performance on every input G. We give such an algorithm, and prove that it is instance optimal among a broad class of solutions. This means that, for any G, our algorithm can perform more queries than the optimal solution (which is currently unknown) by only a constant factor, which can be shown to be at most 2. Extensive experiments demonstrate that, in practice, the number of queries required by our technique is far less than nm, and agrees with our theoretical findings very well.
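A simple (not instance-optimal) way to exploit adaptivity is to always probe an unprobed edge of the candidate with the largest degree upper bound, where the upper bound is confirmed edges plus unprobed slots. The sketch below conveys the flavor of the problem under assumed names; it is not the paper's algorithm.

```python
import heapq

def top_k_connected(B, W, probe, k):
    """Repeatedly probe an unprobed edge of the candidate whose degree
    upper bound (confirmed edges + unprobed slots) is largest; a fully
    probed candidate that still tops the heap is a confirmed winner."""
    known = {b: 0 for b in B}          # confirmed degree (lower bound)
    nxt = {b: 0 for b in B}            # index of the next w in W to probe
    heap = [(-len(W), b) for b in B]   # max-heap on the upper bound
    heapq.heapify(heap)
    result = []
    while len(result) < k:
        _, b = heapq.heappop(heap)
        if nxt[b] == len(W):           # degree is exact and >= every
            result.append(b)           # other upper bound: emit it
            continue
        known[b] += int(probe(b, W[nxt[b]]))
        nxt[b] += 1
        heapq.heappush(heap, (-(known[b] + len(W) - nxt[b]), b))
    return result

EDGES = {("b1", "w1"), ("b1", "w2"), ("b2", "w1")}
probe = lambda b, w: (b, w) in EDGES   # one expensive edge-probing query
top2 = top_k_connected(["b1", "b2", "b3"], ["w1", "w2"], probe, k=2)
# top2 == ["b1", "b2"]
```

On easy inputs this style of best-first probing resolves the winners without examining all nm vertex pairs.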
【Keywords】: bipartite graph; competitive analysis; instance optimality; maximum degree
【Paper Link】 【Pages】:903-914
【Authors】: Haichuan Shang ; Xuemin Lin ; Ying Zhang ; Jeffrey Xu Yu ; Wei Wang
【Abstract】: Substructure similarity search is to retrieve graphs that approximately contain a given query graph. It has many applications, e.g., detecting similar functions among chemical compounds. The problem is challenging as even testing subgraph containment between two graphs is NP-complete. Hence, existing techniques adopt the filtering-and-verification framework with the focus on developing effective and efficient techniques to remove non-promising graphs. Nevertheless, existing filtering techniques may still be unable to effectively remove many "low" quality candidates. To resolve this, in this paper we propose a novel indexing technique, GrafD-Index, to index graphs according to their "distances" to features. We characterize a tight condition under which the distance-based triangle inequality holds. We then develop lower and upper bounding techniques that exploit the GrafD-Index to (1) prune non-promising graphs and (2) include graphs whose similarities are guaranteed to exceed the given similarity threshold. Considering that the verification phase is not well studied and plays the dominant role in the whole process, we devise efficient algorithms to verify candidates. A comprehensive experiment using real datasets demonstrates that our proposed methods significantly outperform existing methods.
【Keywords】: graph database; similarity search
【Paper Link】 【Pages】:915-926
【Authors】: Zhenjie Zhang ; Marios Hadjieleftheriou ; Beng Chin Ooi ; Divesh Srivastava
【Abstract】: Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database cleaning, biological sequence analysis, and more. While a large number of dissimilarity measures on strings have been proposed, edit distance is the most popular choice in a wide spectrum of applications. Existing indexing techniques for similarity search queries based on edit distance, e.g., approximate selection and join queries, rely mostly on n-gram signatures coupled with inverted list structures. These techniques are tailored for specific query types only, and their performance remains unsatisfactory especially in scenarios with strict memory constraints or frequent data updates. In this paper we propose the Bed-tree, a B+-tree based index structure for evaluating all types of similarity queries on edit distance and normalized edit distance. We identify the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries. Three transformations are proposed that capture different aspects of information inherent in strings, enabling efficient pruning during the search process on the tree. Compared to state-of-the-art methods on string similarity search, the Bed-tree is a complete solution that meets the requirements of all applications, providing high scalability and fast response time.
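Index structures for edit-distance search typically prune with cheap lower bounds before running the expensive dynamic-programming verification. The sketch below shows two standard filters (the length filter and the classic q-gram count filter, which uses the fact that one edit destroys at most q q-grams); it illustrates the general pruning idea, not the Bed-tree's specific string-to-integer transformations.

```python
from collections import Counter

def qgrams(s, q=2):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def may_match(s, t, tau, q=2):
    """Return False only when ed(s, t) > tau is certain, so survivors
    still need an exact edit-distance check. Combines the length filter
    with the q-gram count filter."""
    if abs(len(s) - len(t)) > tau:
        return False
    shared = sum((Counter(qgrams(s, q)) & Counter(qgrams(t, q))).values())
    # Strings within distance tau share at least this many q-grams.
    return shared >= max(len(s), len(t)) - q + 1 - tau * q
```

A tree index applies bounds of this kind per subtree, so whole groups of strings are pruned with one comparison instead of one check per string.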
【Keywords】: approximate join; edit distance; range query; similarity search; string; top-k query
【Paper Link】 【Pages】:927-938
【Authors】: Parag Agrawal ; Arvind Arasu ; Raghav Kaushik
【Abstract】: Prior work has identified set based comparisons as a useful primitive for supporting a wide variety of similarity functions in record matching. Accordingly, various techniques have been proposed to improve the performance of set similarity lookups. However, this body of work focuses almost exclusively on symmetric notions of set similarity. In this paper, we study the indexing problem for the asymmetric Jaccard containment similarity function that is an error-tolerant variation of set containment. We enhance this similarity function to also account for string transformations that reflect synonyms such as "Bob" and "Robert" referring to the same first name. We propose an index structure that builds inverted lists on carefully chosen token-sets and a lookup algorithm using our index that is sensitive to the output size of the query. Our experiments over real life data sets show the benefits of our techniques. To our knowledge, this is the first paper that studies the indexing problem for Jaccard containment in the presence of string transformations.
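For reference, Jaccard containment is asymmetric: it measures what fraction of the query set is covered by the record, so a short query can be fully contained in a long record. A minimal sketch, with a toy synonym table standing in for the paper's string transformations (the table and names are assumptions for illustration):

```python
def jaccard_containment(query, record):
    """Asymmetric containment: fraction of query tokens in the record."""
    return len(query & record) / len(query) if query else 1.0

# Toy transformation table: each token maps to its accepted variants.
SYNONYMS = {"bob": {"robert"}, "robert": {"bob"}}

def containment_with_transforms(query, record, synonyms=SYNONYMS):
    """Count a query token as matched if the record contains it or any
    transformed variant of it."""
    hits = sum(1 for t in query
               if t in record or synonyms.get(t, set()) & record)
    return hits / len(query) if query else 1.0
```

With transformations, {"bob", "smith"} is fully contained in {"robert", "smith", "jr"}, while plain containment scores it only 0.5.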
【Keywords】: data cleaning; indexing; jaccard containment; transformations
【Paper Link】 【Pages】:939-950
【Authors】: Oguzhan Ozmen ; Kenneth Salem ; Jiri Schindler ; Steve Daniel
【Abstract】: The performance of a database system depends strongly on the layout of database objects, such as indexes or tables, onto the underlying storage devices. A good layout will both balance the I/O workload generated by the database system and avoid the performance-degrading interference that can occur when concurrently accessed objects are stored on the same volume. In current practice, layout is typically guided by heuristics and rules of thumb, such as separating indexes and tables or striping all objects across all of the available storage devices. However, these guidelines may give poor results. In this paper, we address the problem of generating an optimized layout of a given set of database objects. Our layout optimizer goes beyond generic guidelines by making use of a description of the database system's I/O activity. We formulate the layout problem as a non-linear programming (NLP) problem and use the I/O description as input to an NLP solver. Our layout optimization technique, which is incorporated into a database layout advisor, identifies a layout that both balances load and avoids interference. We evaluate experimentally the efficacy of our approach and demonstrate that it can quickly identify non-trivial optimized layouts.
【Keywords】: layout; physical design; storage systems
【Paper Link】 【Pages】:951-962
【Authors】: Grigoris Karvounarakis ; Zachary G. Ives ; Val Tannen
【Abstract】: Many advanced data management operations (e.g., incremental maintenance, trust assessment, debugging schema mappings, keyword search over databases, or query answering in probabilistic databases), involve computations that look at how a tuple was produced, e.g., to determine its score or existence. This requires answers to queries such as, "Is this data derivable from trusted tuples?"; "What tuples are derived from this relation?"; or "What score should this answer receive, given initial scores of the base tuples?". Such questions can be answered by consulting the provenance of query results. In recent years there has been significant progress on formal models for provenance. However, the issues of provenance storage, maintenance, and querying have not yet been addressed in an application-independent way. In this paper, we adopt the most general formalism for tuple-based provenance, semiring provenance. We develop a query language for provenance, which can express all of the aforementioned types of queries, as well as many more; we propose storage, processing and indexing schemes for data provenance in support of these queries; and we experimentally validate the feasibility of provenance querying and the benefits of our indexing techniques across a variety of application classes and queries.
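The appeal of the semiring formalism is that one stored provenance expression can be reinterpreted under different semirings to answer different questions. A minimal sketch (the nested-tuple encoding and all names here are assumptions, not the paper's storage scheme):

```python
def eval_prov(expr, assign, plus, times):
    """Evaluate a provenance expression under a commutative semiring.
    expr is a base-tuple id or a nested tuple ('+', a, b) / ('*', a, b);
    assign maps base-tuple ids to semiring values."""
    if isinstance(expr, tuple):
        op, a, b = expr
        x = eval_prov(a, assign, plus, times)
        y = eval_prov(b, assign, plus, times)
        return plus(x, y) if op == '+' else times(x, y)
    return assign[expr]

# Provenance of an answer tuple derived in two ways: t1*t2 + t1*t3.
prov = ('+', ('*', 't1', 't2'), ('*', 't1', 't3'))

# Counting semiring (N, +, *): number of derivations under bag semantics.
count = eval_prov(prov, {'t1': 1, 't2': 1, 't3': 1},
                  lambda x, y: x + y, lambda x, y: x * y)

# Trust semiring (bool, or, and): is the answer derivable from trusted tuples?
trusted = eval_prov(prov, {'t1': True, 't2': False, 't3': True},
                    lambda x, y: x or y, lambda x, y: x and y)
```

Here `count` is 2 (two derivations) and `trusted` is True, because one derivation avoids the untrusted tuple t2; swapping in other semirings answers score or probability queries from the same expression.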
【Keywords】: annotation; data provenance; query language; query processing
【Paper Link】 【Pages】:963-968
【Authors】: Paul G. Brown
【Abstract】: SciDB [4, 3] is a new open-source data management system intended primarily for use in application domains that involve very large (petabyte) scale array data; for example, scientific applications such as astronomy, remote sensing and climate modeling, bio-science information management, risk management systems in financial applications, and the analysis of web log data. In this talk we will describe our set of motivating examples and use them to explain the features of SciDB. We then briefly give an overview of the project 'in flight', explaining our novel storage manager, array data model, query language, and extensibility frameworks.
【Keywords】: acm sigmod industrial proceedings scientific data management
【Paper Link】 【Pages】:969-974
【Authors】: Yu Xu ; Pekka Kostamaa ; Like Gao
【Abstract】: Teradata's parallel DBMS has been successfully deployed in large data warehouses over the last two decades for large-scale business analysis in various industries, over data sets ranging from a few terabytes to multiple petabytes. However, due to the explosive data volume increase in recent years at some customer sites, some data, such as web logs and sensor data, are not managed by Teradata EDW (Enterprise Data Warehouse), partially because it is very expensive to load such extremely large volumes of data into an RDBMS, especially when the data are not frequently used to support important business decisions. Recently the MapReduce programming paradigm, started by Google and made popular by the open-source Hadoop implementation with major support from Yahoo!, has been gaining rapid momentum in both academia and industry as another way of performing large-scale data analysis. By now most data warehouse researchers and practitioners agree that both the parallel DBMS and MapReduce paradigms have advantages and disadvantages for various business applications, and thus both paradigms are going to coexist for a long time [16]. In fact, a large number of Teradata customers, especially those in the e-business and telecom industries, have seen an increasing need to perform BI over both data stored in Hadoop and data in Teradata EDW. Hadoop and Teradata EDW have one thing in common: data in both systems are partitioned across multiple nodes for parallel computing, which creates integration optimization opportunities not possible for DBMSs running on a single node. In this paper we describe our three efforts towards tight and efficient integration of Hadoop and Teradata EDW.
【Keywords】: data load; hadoop; mapreduce; parallel computing; parallel dbms; shared nothing
【Paper Link】 【Pages】:975-986
【Authors】: Spyros Blanas ; Jignesh M. Patel ; Vuk Ercegovac ; Jun Rao ; Eugene J. Shekita ; Yuanyuan Tian
【Abstract】: The MapReduce framework is increasingly being used to analyze large volumes of data. One important type of data analysis done with MapReduce is log processing, in which a click-stream or an event log is filtered, aggregated, or mined for patterns. As part of this analysis, the log often needs to be joined with reference data such as information about users. Although there have been many studies examining join algorithms in parallel and distributed DBMSs, the MapReduce framework is cumbersome for joins. MapReduce programmers often use simple but inefficient algorithms to perform joins. In this paper, we describe crucial implementation details of a number of well-known join strategies in MapReduce, and present a comprehensive experimental comparison of these join techniques on a 100-node Hadoop cluster. Our results provide insights that are unique to the MapReduce platform and offer guidance on when to use a particular join algorithm on this platform.
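The baseline strategy in studies like this, the repartition (reduce-side) join, can be simulated in a few lines: the map phase tags each record with its source relation and emits it under the join key, the shuffle groups by key, and the reduce phase forms the cross product within each group. This is a single-process Python simulation under assumed record shapes, not Hadoop code.

```python
from collections import defaultdict

def repartition_join(logs, users, key="uid"):
    """Reduce-side join of a log relation with a reference relation."""
    # Map: tag each record with its source and emit (join key, record).
    emitted = [(rec[key], ("L", rec)) for rec in logs]
    emitted += [(rec[key], ("U", rec)) for rec in users]
    # Shuffle: group by key (the framework does this in real MapReduce).
    groups = defaultdict(list)
    for k, tagged in emitted:
        groups[k].append(tagged)
    # Reduce: buffer each group and join the two sources within it.
    out = []
    for tagged in groups.values():
        ls = [r for tag, r in tagged if tag == "L"]
        us = [r for tag, r in tagged if tag == "U"]
        out.extend({**u, **l} for u in us for l in ls)
    return out

logs = [{"uid": 1, "url": "/a"}, {"uid": 1, "url": "/b"}, {"uid": 2, "url": "/c"}]
users = [{"uid": 1, "name": "ann"}]
joined = repartition_join(logs, users)  # inner join: uid 2 has no user row
```

The buffering in the reduce phase is exactly what makes this strategy memory-hungry, which is why alternatives such as map-side (broadcast) joins matter when the reference relation is small.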
【Keywords】: analytics; hadoop; join processing; mapreduce
【Paper Link】 【Pages】:987-998
【Authors】: Sudipto Das ; Yannis Sismanis ; Kevin S. Beyer ; Rainer Gemulla ; Peter J. Haas ; John McPherson
【Abstract】: Many modern enterprises are collecting data at the most detailed level possible, creating data repositories ranging from terabytes to petabytes in size. The ability to apply sophisticated statistical analysis methods to this data is becoming essential for marketplace competitiveness. This need to perform deep analysis over huge data repositories poses a significant challenge to existing statistical software and data management systems. On the one hand, statistical software provides rich functionality for data analysis and modeling, but can handle only limited amounts of data; e.g., popular packages like R and SPSS operate entirely in main memory. On the other hand, data management systems - such as MapReduce-based systems - can scale to petabytes of data, but provide insufficient analytical functionality. We report our experiences in building Ricardo, a scalable platform for deep analytics. Ricardo is part of the eXtreme Analytics Platform (XAP) project at the IBM Almaden Research Center, and rests on a decomposition of data-analysis algorithms into parts executed by the R statistical analysis system and parts handled by the Hadoop data management system. This decomposition attempts to minimize the transfer of data across system boundaries. Ricardo contrasts with previous approaches, which try to get along with only one type of system, and allows analysts to work on huge datasets from within a popular, well supported, and powerful analysis environment. Because our approach avoids the need to re-implement either statistical or data-management functionality, it can be used to solve complex problems right now.
【Keywords】: analytics; cloud; data management; hadoop; r; statistics
【Paper Link】 【Pages】:999-1002
【Authors】: Michael Moricz ; Yerbolat Dosbayev ; Mikhail Berlyant
【Abstract】: In recent years social networking has enjoyed a significant increase in popularity. The main reason behind this surge is the social experience of connecting content to people and connecting people with other people. Knowing, seeing, and hearing what our friends and like-minded people feel, listen to, or upload is an unparalleled experience. As in real life, finding good friends is not easy without the help of good recommendations. In this Industry Talk paper we present the MySpace friend recommendation algorithm named People You May Know. We also comment on both the quality and the effectiveness of the algorithms.
【Keywords】: friend recommendation; long tail; personalized recommendation; recommendation; recommender systems; social recommendations
【Paper Link】 【Pages】:1003-1012
【Authors】: Deepak Agarwal ; Datong Chen ; Long-ji Lin ; Jayavel Shanmugasundaram ; Erik Vee
【Abstract】: We propose a method for forecasting high-dimensional data (hundreds of attributes, trillions of attribute combinations) for a duration of several months. Our motivating application is guaranteed display advertising, a multi-billion dollar industry, whereby advertisers can buy targeted (high-dimensional) user visits from publishers many months or even years in advance. Forecasting high-dimensional data is challenging because of the many possible attribute combinations that need to be forecast. To address this issue, we propose a method whereby only a sub-set of attribute combinations are explicitly forecast and stored, while the other combinations are dynamically forecast on-the-fly using high-dimensional attribute correlation models. We evaluate various attribute correlation models, from simple models that assume the independence of attributes to more sophisticated sample-based models that fully capture the correlations in a high-dimensional space. Our evaluation using real-world display advertising data sets shows that fully capturing high-dimensional correlations leads to significant forecast accuracy gains. A variant of the proposed method has been implemented in the context of Yahoo!'s guaranteed display advertising system.
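The simplest correlation model mentioned above, attribute independence, forecasts a combination's volume as the total volume times the product of each attribute's marginal probability. A toy sketch (the numbers and names are invented for illustration; the paper's sample-based models replace the independence assumption):

```python
def forecast_independent(total, marginals, combo):
    """Forecast the count for an attribute combination by multiplying the
    total volume with each attribute's marginal probability, i.e. by
    assuming attributes are independent."""
    p = 1.0
    for attr, value in combo.items():
        p *= marginals[attr][value] / total
    return total * p

marginals = {"gender": {"F": 400, "M": 600},
             "region": {"CA": 100, "NY": 200, "other": 700}}
fcast = forecast_independent(1000, marginals, {"gender": "F", "region": "CA"})
# 1000 * 0.4 * 0.1 = 40.0 forecast visits for female users from CA
```

Only the per-attribute marginals need to be stored and forecast explicitly; any of the trillions of attribute combinations can then be estimated on the fly, at the cost of ignoring correlations.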
【Keywords】: forecasting
【Paper Link】 【Pages】:1013-1020
【Authors】: Ashish Thusoo ; Zheng Shao ; Suresh Anthony ; Dhruba Borthakur ; Namit Jain ; Joydeep Sen Sarma ; Raghotham Murthy ; Hao Liu
【Abstract】: Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook - both engineering and non-engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets. These features range from simple reporting applications like Insights for the Facebook Advertisers, to more advanced kinds such as friend recommendations. In order to support this diversity of use cases on the ever-increasing amount of data, a flexible infrastructure that scales up in a cost-effective manner is critical. We have leveraged, authored and contributed to a number of open source technologies in order to address these requirements at Facebook. These include Scribe, Hadoop and Hive, which together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook. In this paper we will present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our design choices, the capabilities of this solution, the challenges that we face in day-to-day operations, and the future capabilities and improvements that we are working on.
【Keywords】: analytics; data discovery; data warehouse; distributed file system; distributed systems; facebook; hadoop; hive; log aggregation; map-reduce; resource sharing; scalability; scribe
【Paper Link】 【Pages】:1021-1024
【Authors】: David G. Campbell ; Gopal Kakivaya ; Nigel Ellis
【Abstract】: Cloud SQL Server is an Internet-scale relational database service which is currently used by Microsoft-delivered services and also offered directly as a fully relational database service known as "SQL Azure". One of the principal design objectives of Cloud SQL Server was to provide true SQL support with full ACID transactions within controlled-scale "consistency domains", and a relaxed degree of consistency across consistency domains that would be viable for clusters of thousands of nodes. In this paper, we describe the implementation of Cloud SQL Server with an emphasis on this core design principle.
【Keywords】: cloud database; extreme scale
【Paper Link】 【Pages】:1025-1036
【Authors】: Zhen Hua Liu ; Thomas Baby ; Sukhendu Chakraborty ; Junyan Ding ; Anguel Novoselsky ; Vikas Arora
【Abstract】: RDBMS provides best performance for querying structured data that starts out with a well-defined schema. However, such a 'schema first, data later' approach does not work for unstructured data or data without much structure. Therefore, RDBMS typically stores such data without any schema in LOB columns (for example, Character Large Object (CLOB) or Binary Large Object (BLOB) columns) and provides Information-Retrieval (IR) style, keyword-based search capability over these LOB columns. Lately, XML as a native datatype (XMLType) in RDBMS has been introduced via the SQL/XML standard. Semi-structured data with or without any schema can be stored into such XMLType columns, and XQuery provides query capability over them. In particular, XQuery full text specification provides the capability of searching keywords within document context. Such full context-aware text search capability is more powerful than pure keyword search, since the user can now provide fine-grained context in which the keywords should occur. However, XML with XQuery full text searching requires that the user first convert her text data into XML and store them into XMLType column. Such massive physical data migration with possible loss of document fidelity and its potential impact on existing production environments are often expensive enough that users are reluctant to adopt the XML/XQuery approach. In this paper, we propose a pay-as-you-go architecture to provide XML text view over LOB columns, so that user can take advantage of context-aware full-text search capability adaptively. This adaptive architecture includes a novel XML text index that can be created over the LOB column where the content is stored. The XML text index supports an XML text view over LOB data on top of which XQuery full-text search capability is feasible. Such an adaptive index/view approach provides least intrusion over existing data, as it requires no physical data migration. 
We describe the design and challenges of building such an adaptive XML text index. Furthermore, we argue that the pay-as-you-go approach provides an integration bridge between the structured relational world and the text-oriented document world, and fulfills the primary motivation for XML in the database.
【Keywords】: information retrieval; keyword search; text index; tree index; xml; xml index; xquery; xquery full text
【Paper Link】 【Pages】:1037-1046
【Authors】: Ilya Taranov ; Ivan Shcheklein ; Alexander Kalinin ; Leonid Novak ; Sergei D. Kuznetsov ; Roman Pastukhov ; Alexander Boldakov ; Denis Turdakov ; Konstantin Antipin ; Andrey Fomichev ; Peter Pleshachkov ; Pavel Velikhov ; Nikolai Zavaritski ; Maxim Grinev ; Maria P. Grineva ; Dmitry Lizorkin
【Abstract】: We present a native XML database management system, Sedna, which is implemented from scratch as a full-featured database management system for storing large amounts of XML data. We believe that the key contribution of this system is an improved schema-based clustering storage strategy efficient for both XML querying and updating, and powered by a novel memory management technique. We position our approach with respect to state-of-the-art methods.
【Keywords】: data multiversioning; memory management; native xml storage; numbering scheme; xml database; xquery
【Paper Link】 【Pages】:1047-1056
【Authors】: Scott M. Meyer ; Jutta Degener ; John Giannandrea ; Barak Michener
【Abstract】: Current relational databases require that a database schema exist prior to data entry and require manual optimization for best performance. We describe the query optimization techniques used by graphd, the schema-last, automatically indexed tuple-store which supports freebase.com, a large world-writable database. Graphd is a log-structured store with a query optimizer based on a functional operator tree over the domain of sorted integer sets which accumulate naturally as tuples are appended to the store. We demonstrate that a set-based optimizer can deliver performance that is roughly comparable to traditional RDBMS query optimization techniques applied to a fixed schema.
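The set-based evaluation the abstract describes can be illustrated with a small, hypothetical model (Python stands in for graphd's internals; all names here are invented for illustration): an append-only tuple log whose per-(predicate, value) ID lists accumulate in sorted order for free, so a conjunctive query reduces to intersecting those sets.

```python
from collections import defaultdict

class TupleStore:
    """Append-only log of (subject, predicate, value) tuples.

    Each append assigns the next integer ID, so the postings for each
    (predicate, value) pair accumulate in sorted order naturally --
    the property a set-based optimizer exploits.
    """
    def __init__(self):
        self.log = []                   # tuple ID -> (subject, predicate, value)
        self.index = defaultdict(list)  # (predicate, value) -> sorted tuple IDs

    def append(self, subject, predicate, value):
        tid = len(self.log)
        self.log.append((subject, predicate, value))
        self.index[(predicate, value)].append(tid)  # IDs arrive in order
        return tid

    def query(self, *constraints):
        """Return subjects satisfying every (predicate, value) constraint,
        by intersecting the per-constraint subject sets."""
        subject_sets = []
        for pred, val in constraints:
            ids = self.index.get((pred, val), [])
            subject_sets.append({self.log[i][0] for i in ids})
        if not subject_sets:
            return set()
        result = subject_sets[0]
        for s in subject_sets[1:]:
            result &= s
        return result
```

The point of the sketch is only that the index structure falls out of the append order, with no schema declared up front.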
【Keywords】: database; graph store; object-oriented; query optimization; schema-last; triple-store; tuple-store
【Paper Link】 【Pages】:1057-1060
【Authors】: Len Seligman ; Peter Mork ; Alon Y. Halevy ; Kenneth P. Smith ; Michael J. Carey ; Kuang Chen ; Chris Wolf ; Jayant Madhavan ; Akshay Kannan ; Doug Burdick
【Abstract】: OpenII (openintegration.org) is a collaborative effort to create a suite of open-source tools for information integration (II). The project is leveraging the latest developments in II research to create a platform on which integration tools can be built and further research conducted. In addition to a scalable, extensible platform, OpenII includes industrial-strength components developed by MITRE, Google, UC-Irvine, and UC-Berkeley that interoperate through a common repository in order to solve II problems. Components of the toolkit have been successfully applied to several large-scale US government II challenges.
【Keywords】: data exchange; information integration
【Paper Link】 【Pages】:1061-1066
【Authors】: Hector Gonzalez ; Alon Y. Halevy ; Christian S. Jensen ; Anno Langen ; Jayant Madhavan ; Rebecca Shapley ; Warren Shen ; Jonathan Goldberg-Kidon
【Abstract】: It has long been observed that database management systems focus on traditional business applications, and that few people use a database management system outside their workplace. Many have wondered what it will take to enable the use of data management technology by a broader class of users and for a much wider range of applications. Google Fusion Tables represents an initial answer to the question of what data management functionality focused on enabling new users and applications would look like in today's computing environment. This paper characterizes such users and applications and highlights the resulting principles, such as seamless Web integration, emphasis on ease of use, and incentives for data sharing, that underlie the design of Fusion Tables. We describe key novel features, such as support for data acquisition, collaboration, visualization, and web publishing.
【Keywords】: cloud services; collaboration; visualization
【Paper Link】 【Pages】:1067-1068
【Authors】: Christopher R. Stolte
【Abstract】: Easy-to-use visual interfaces to data can broadly expand the audience for databases. Domain experts rather than database experts can engage in rapid-fire Q&A sessions with the data. Visual interfaces can provide a medium for story-telling, debate, and conversations about the data. They can also put new and challenging demands on the capabilities of traditional relational databases. In this talk, I will describe our formal language-based approach to visual analysis and how the use of a formal language enables us to build user experiences that more effectively support the process of analysis. Tableau's VizQL algebra is a declarative language for succinctly describing visual representations of data and analytics operations on the data. A VizQL statement compiles into the SQL or MDX queries necessary to generate the view and into the graphical commands to render the interactive view of the data. Our easy-to-use drag-and-drop user experiences for analysis and visual interface authoring are built on top of VizQL. In addition to supporting the process of analysis, a formal language-based approach provides a basis for reasoning about the structure of views and the space of possible views. This in turn enables the development of powerful new analytic capabilities, such as automatic presentation of structured data, visual authoring of statistical models, and view-based calculation, which we demonstrate. I will also discuss the challenges we have faced in getting relational databases "in the wild" to effectively support visual analysis for the average business or scientific user. The challenges range from the technical to the political. Traditional relational databases, both for OLTP and OLAP, often require sophisticated data modeling and data management expertise, optimize for performance based on known workloads, and are designed for scaling to large database sizes (e.g., PB or TB) on clusters of machines rather than for reducing analytic latency on limited hardware. I will describe our approaches to building a database focused on providing interactive query performance on tens or hundreds of millions of rows of data with little or no data modeling (physical or logical) and running on a typical knowledge worker's desktop machine. Finally, I will discuss the changing landscape of interfaces to databases. The original interface to the database was transactional in focus: many users read and make atomic changes to a small number of rows in a large database. In recent decades, powerful analytic use cases have emerged, focused on the study and analysis of massive amounts of data by relatively small numbers of power users. The emergence of easily authored visual interfaces to public and private data will enable a new style of database usage: millions of users performing analytics on thousands of data sets, all hosted in the cloud, with usage following the familiar long-tail distribution. Everyone will become an author, and all interfaces will enable analytics.
【Keywords】: visual analytics
【Paper Link】 【Pages】:1069-1080
【Authors】: Vinayak R. Borkar ; Michael J. Carey ; Sebu Koleth ; Alexander Kotopoulis ; Kautul Mehta ; Joshua Spiegel ; Sachin Thatte ; Till Westmann
【Abstract】: The AquaLogic Data Services Platform (ALDSP) is a middleware platform developed at BEA Systems for building services, referred to as data services, that integrate, access, and manipulate information coming from multiple heterogeneous sources of data (including databases, files, and other services). ALDSP uses functions that produce and consume XML to model heterogeneous information sources, so the integration logic for data access services in ALDSP is specified declaratively using the XQuery language. A challenge that we faced in developing ALDSP was providing effective graphical tooling to help data service developers to develop information integration queries. In this paper, we describe the graphical XQuery Editor (XQE) that resulted from our attempt to tackle this challenge. XQE handles the full XQuery language and provides a robust two-way editing experience involving both graphical and source views of each query. XQE is novel in being the first commercial graphical XQuery editor to support both of these features.
【Keywords】: data services; graphical user interface; information integration; service-oriented architecture; xml; xquery
【Paper Link】 【Pages】:1081-1092
【Authors】: Sailesh Krishnamurthy ; Michael J. Franklin ; Jeffrey Davis ; Daniel Farina ; Pasha Golovko ; Alan Li ; Neil Thombre
【Abstract】: Continuous analytics systems that enable query processing over streams of data have emerged as key solutions for dealing with massive data volumes and demands for low latency. These systems have been heavily influenced by an assumption that data streams can be viewed as sequences of data that arrive more or less in order. The reality, however, is that streams are often not so well behaved, and disruptions of various sorts are endemic. We argue, therefore, that stream processing needs a fundamental rethink and advocate a unified approach toward continuous analytics over discontinuous streaming data. Our approach is based on a simple insight: using techniques inspired by data-parallel query processing, queries can be performed over independent sub-streams with arbitrary time ranges in parallel, generating partial results. The consolidation of the partial results of each sub-stream can then be deferred to the time at which the results are actually used, on an on-demand basis. In this paper, we describe how the Truviso Continuous Analytics system implements this type of order-independent processing. Not only does the approach provide the first real solution to the problem of processing streaming data that arrives arbitrarily late, it also serves as a critical building block for solutions to a host of hard problems such as parallelism, recovery, transactional consistency, high availability, failover, and replication.
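The deferred-consolidation idea can be made concrete with a minimal sketch for a commutative aggregate (a per-key count); the class and method names below are hypothetical, not Truviso's API:

```python
from collections import defaultdict

class OrderIndependentCount:
    """Count events per key over a stream whose tuples may arrive
    arbitrarily late.  Each sub-stream (batch) is aggregated
    independently into a partial result; consolidation across the
    partials is deferred until a result is actually requested."""
    def __init__(self):
        self.partials = []  # one partial aggregate per processed sub-stream

    def process_substream(self, events):
        # events: iterable of (timestamp, key); arrival order within or
        # across batches does not matter for a commutative aggregate
        partial = defaultdict(int)
        for _ts, key in events:
            partial[key] += 1
        self.partials.append(partial)

    def result(self):
        # on-demand consolidation: merge all partial results
        merged = defaultdict(int)
        for partial in self.partials:
            for key, n in partial.items():
                merged[key] += n
        return dict(merged)
```

A late-arriving batch simply becomes one more partial to merge; no partial ever has to wait for stragglers before being computed.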
【Keywords】: continuous queries; order; out-of-order; streams
【Paper Link】 【Pages】:1093-1104
【Authors】: Alain Biem ; Eric Bouillet ; Hanhua Feng ; Anand Ranganathan ; Anton Riabov ; Olivier Verscheure ; Haris N. Koutsopoulos ; Carlos Moran
【Abstract】: With the widespread adoption of location tracking technologies like GPS, the domain of intelligent transportation services has seen growing interest in the last few years. Services in this domain make use of real-time location-based data from a variety of sources, combine this data with static location-based data such as maps and points-of-interest databases, and provide useful information to end-users. Some of the major challenges in this domain include i) scalability, in terms of processing large volumes of real-time and static data; ii) extensibility, in terms of being able to add new kinds of analyses on the data rapidly; and iii) user interaction, in terms of being able to support different kinds of one-time and continuous queries from the end-user. In this paper, we demonstrate the use of IBM InfoSphere Streams, a scalable stream processing platform, for tackling these challenges. We describe a prototype system that generates dynamic, multi-faceted views of transportation information for the city of Stockholm, using real vehicle GPS and road-network data. The system also continuously derives current traffic statistics, and provides useful value-added information such as shortest-time routes from real-time observed and inferred traffic conditions. Our performance experiments illustrate the scalability of the system. For instance, our system can process over 120,000 incoming GPS points per second, combine them with a map containing over 600,000 links, continuously generate different kinds of traffic statistics, and answer user queries.
【Keywords】: geostreaming; stream processing; transportation
【Paper Link】 【Pages】:1105-1110
【Authors】: Malú Castellanos ; Song Wang ; Umeshwar Dayal ; Chetan Gupta
【Abstract】: Emerging business intelligence (BI) applications aim to provide situational awareness, i.e., information about real-world events that might affect the business operations of an enterprise. For instance, an enterprise might want to know whether customers are posting positive or negative comments about a new product it has just introduced, or whether some natural disaster affects its contracted suppliers. It is difficult to develop such applications today because they require extracting and correlating facts from multiple streaming and stored data sources, typically including unstructured data, which is not well supported by current BI platforms. In this paper, we describe SIE-OBI, a system that we are developing to enable the development and execution of such applications. We describe the novel features of this system, including a declarative interface for rapidly developing such applications, and a platform for optimizing and executing them. We illustrate its applicability through two use cases.
【Keywords】: business intelligence; correlation; information extraction; situational awareness; streaming data; unstructured data
【Paper Link】 【Pages】:1111-1114
【Authors】: Azza Abouzied ; Kamil Bajda-Pawlikowski ; Jiewen Huang ; Daniel J. Abadi ; Avi Silberschatz
【Abstract】: HadoopDB is a hybrid of MapReduce and DBMS technologies, designed to meet the growing demand for analyzing massive datasets on very large clusters of machines. Our previous work has shown that HadoopDB approaches parallel databases in performance while still yielding the scalability and fault tolerance of MapReduce-based systems. In this demonstration, we focus on HadoopDB's flexible architecture and versatility with two real-world application scenarios: a semantic web data application for protein sequence analysis and a business data warehousing application based on TPC-H. The demonstration offers a thorough walk-through of how to easily build applications on top of HadoopDB.
【Keywords】: hadoop; hadoopdb; hive; mapreduce; parallel database; semantic web; tpc-h; uniprot
【Paper Link】 【Pages】:1115-1118
【Authors】: Tyson Condie ; Neil Conway ; Peter Alvaro ; Joseph M. Hellerstein ; John Gerth ; Justin Talbot ; Khaled Elmeleegy ; Russell Sears
【Abstract】: MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, the output of each MapReduce task and job is materialized to disk before it is consumed. In this demonstration, we describe a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We demonstrate a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see "early returns" from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing. HOP retains the fault tolerance properties of Hadoop, and can run unmodified user-defined MapReduce programs.
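The "early returns" idea (consuming map output as it is produced, rather than after a full materialization barrier) can be sketched for a word count; the function names and batch structure here are illustrative, not HOP's actual API:

```python
def map_phase(records, mapper):
    # lazily yield key-value pairs as each record is mapped,
    # instead of materializing the full map output to disk first
    for record in records:
        yield from mapper(record)

def online_wordcount(batches):
    """Fold pipelined map output into a running aggregate and snapshot
    it after every batch: the partial results a user would see early."""
    running, snapshots = {}, []
    tokenize = lambda line: ((word, 1) for word in line.split())
    for batch in batches:
        for word, n in map_phase(batch, tokenize):
            running[word] = running.get(word, 0) + n
        snapshots.append(dict(running))  # partial result, visible early
    return snapshots
```

Each snapshot is an online-aggregation estimate that converges to the final answer once all input has been consumed.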
【Keywords】: mapreduce
【Paper Link】 【Pages】:1119-1122
【Authors】: Chaokun Wang ; Jianmin Wang ; Xuemin Lin ; Wei Wang ; Haixun Wang ; Hongsong Li ; Wanpeng Tian ; Jun Xu ; Rui Li
【Abstract】: Near duplicate detection benefits many applications, e.g., on-line news selection over the Web by keyword search. The purpose of this demo is to show the design and implementation of MapDupReducer, a MapReduce based system capable of detecting near duplicates over massive datasets efficiently.
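One common shape for MapReduce-based near-duplicate detection (a hedged sketch of the general technique, not MapDupReducer's actual algorithm) is to group candidate documents on shared shingles in the map phase and verify pairs with Jaccard similarity in the reduce phase:

```python
from itertools import combinations

def shingles(text, k=3):
    # the set of k-word shingles of a document
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)} or {text}

def map_candidates(docs, k=3):
    # emit (shingle, doc_id): documents sharing a shingle become candidates
    for doc_id, text in docs.items():
        for sh in shingles(text, k):
            yield sh, doc_id

def reduce_verify(mapped, docs, threshold=0.6, k=3):
    # group candidates by shingle, then verify with Jaccard similarity
    buckets = {}
    for sh, doc_id in mapped:
        buckets.setdefault(sh, set()).add(doc_id)
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            sa, sb = shingles(docs[a], k), shingles(docs[b], k)
            if len(sa & sb) / len(sa | sb) >= threshold:
                pairs.add((a, b))
    return pairs
```

Because only documents that collide on some shingle are ever compared, the expensive pairwise verification is confined to small candidate buckets.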
【Keywords】: mapreduce; near duplicate detection
【Paper Link】 【Pages】:1123-1126
【Authors】: Rishan Chen ; Xuetian Weng ; Bingsheng He ; Mao Yang
【Abstract】: As the study of graphs, such as web and social graphs, becomes increasingly popular, the requirements of efficiency and programming flexibility of large graph processing tasks challenge existing tools. We propose to demonstrate Surfer, a large graph processing engine designed to execute in the cloud. Surfer provides two basic primitives for programmers - MapReduce and propagation. MapReduce, originally developed by Google, processes different key-value pairs in parallel, and propagation is an iterative computational pattern that transfers information along the edges from a vertex to its neighbors in the graph. These two primitives are complementary in graph processing. MapReduce is suitable for processing flat data structures, such as vertex-oriented tasks, and propagation is optimized for edge-oriented tasks on partitioned graphs. To further improve the programmability of large graph processing, Surfer consists of a small set of high level building blocks that use these two primitives. Developers may also construct custom building blocks. Surfer further provides a GUI (Graphical User Interface) using which developers can visually create large graph processing tasks. Surfer transforms a task into an execution plan composed of MapReduce and propagation operations. It then automatically applies various optimizations to improve the efficiency of distributed execution. Surfer also provides a visualization tool to monitor the detailed execution dynamics of the execution plan to show the interesting tradeoffs between MapReduce and propagation. We demonstrate our system in two ways: first, we demo the ease-of-programming features of the system; second, we show the efficiency of the system with a series of applications on a social network. We find that Surfer is simple to use and is highly efficient for large graph-based tasks.
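The propagation primitive described above (each vertex transfers information along its out-edges, and each vertex combines what it receives) can be sketched generically; the signature below is invented for illustration, not Surfer's API, and it is instantiated here to compute BFS hop counts:

```python
def propagate(vertices, edges, transfer, combine, values, rounds):
    """Generic propagation: in every round, each vertex sends
    transfer(value) along its out-edges, and each vertex combines
    its current value with the messages it received."""
    for _ in range(rounds):
        inbox = {v: [] for v in vertices}
        for src, dst in edges:
            inbox[dst].append(transfer(values[src]))
        values = {v: combine(values[v], inbox[v]) for v in vertices}
    return values
```

Instantiated with transfer = d + 1 and combine = min, this computes single-source shortest hop counts, an edge-oriented task of the kind the abstract says propagation is optimized for.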
【Keywords】: distributed systems; graph processing; mapreduce; propagation
【Paper Link】 【Pages】:1127-1130
【Authors】: Salvatore Ruggieri ; Dino Pedreschi ; Franco Turini
【Abstract】: Discrimination discovery in databases consists in finding unfair practices against minorities which are hidden in a dataset of historical decisions. The DCUBE system implements the approach of [5], which is based on classification rule extraction and analysis, by centering the analysis phase around an Oracle database. The proposed demonstration guides the audience through the legal issues about discrimination hidden in data, and through several legally-grounded analyses to unveil discriminatory situations. The SIGMOD attendees will freely pose complex discrimination analysis queries over the database of extracted classification rules, once they are presented with the database relational schema, a few ad-hoc functions and procedures, and several snippets of SQL queries for discrimination discovery.
【Keywords】: classification rules; discrimination
【Paper Link】 【Pages】:1131-1134
【Authors】: Chun Kit Chui ; Ben Kao ; Eric Lo ; David W. Cheung
【Abstract】: The Sequence OLAP (S-OLAP) system is a novel online analytical processing system for analyzing sequence data. S-OLAP supports "pattern-based" grouping and aggregation on sequence data - a very powerful concept and capability that is not supported by traditional OLAP systems. It also supports several new OLAP operations that are specific to sequence data analysis. The query processing techniques documented in [1] have been implemented in our S-OLAP engine for efficient query processing. The system also provides users with a friendly graphical interface for query construction and result visualization. Query parameters can be interactively refined and the results are updated in real-time so as to facilitate the exploratory analysis of sequence data.
【Keywords】: olap; sequence data
【Paper Link】 【Pages】:1135-1138
【Authors】: Venkatesh Raghavan ; Elke A. Rundensteiner
【Abstract】: We demonstrate ProgXe, a practical approach to supporting Multi-Criteria Decision Support (MCDS) applications that need to report results as they are being generated, enabling the user to make competitive decisions. ProgXe makes the execution of MCDS queries involving skylines over joins non-blocking by progressively generating results early and often. The demonstration highlights key features of our progressive execution framework, which optimizes for early output generation by: (1) evaluating the query at multiple levels of abstraction, and (2) exploiting the skyline knowledge gained from both the input and the mapped output spaces. The audience will be able to submit MCDS queries. We provide visualization tools that enable the user to make quick decisions and compare alternative techniques, and provide the capability to fine-tune the query predicates based on the early output results.
【Keywords】: progressive result generation; skyline over join queries
【Paper Link】 【Pages】:1139-1142
【Authors】: María Pérez Catalán ; Ismael Sanz ; Rafael Berlanga Llavori
【Abstract】: In this demonstration we present XTaGe (XML Tester and Generator), a flexible tool for the creation of complex XML collections. XTaGe focuses on XML collections with complex structural constraints and domain-specific characteristics, which would be very difficult or impossible to replicate using existing XML generators. It addresses the limitations of current XML generators by providing a highly extensible framework to introduce controlled variability in XML structures.
【Keywords】: heterogeneous schemas; xml data generation
【Paper Link】 【Pages】:1143-1146
【Authors】: Barzan Mozafari ; Kai Zeng ; Carlo Zaniolo
【Abstract】: A strong interest is emerging in SQL extensions for sequence patterns using Kleene-closure expressions. This burst of interest from both the research community and the commercial world is due to the many database and data stream applications made possible by these extensions, including financial services, RFID-based inventory management, and electronic health systems. In this demo we will present the K*SQL system, which represents a major step forward in this area. K*SQL supports a more expressive language that allows for generalized Kleene-closure queries and also achieves the expressive power of the nested word model, which greatly expands the application domain to include XML queries, software trace analysis, and genomics. We first introduce the core features of our language for expressing complex pattern queries over both relational and XML data. We then overview the architecture of our unifying engine and its user-friendly interfaces, and present several K*SQL queries from stock market, XML, software trace analysis, and genomic applications.
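The flavor of a Kleene-closure sequence pattern (A B+ style: one event satisfying A followed by one or more events satisfying B) can be shown with a small greedy matcher; this is an illustration of the concept only, not K*SQL's nested-word evaluation engine:

```python
def match_pattern(events, steps):
    """Match a sequence pattern against every start position in events.
    Each step is (predicate, kleene): a kleene step greedily absorbs one
    or more consecutive events satisfying its predicate."""
    matches, n = [], len(events)
    for start in range(n):
        i, bound, ok = start, [], True
        for pred, kleene in steps:
            if kleene:
                run = []
                while i < n and pred(events[i]):
                    run.append(events[i])
                    i += 1
                if not run:           # B+ requires at least one event
                    ok = False
                    break
                bound.extend(run)
            else:
                if i < n and pred(events[i]):
                    bound.append(events[i])
                    i += 1
                else:
                    ok = False
                    break
        if ok:
            matches.append(bound)
    return matches
```

For instance, the pattern "an up tick followed by one or more down ticks" binds different-length runs of falls depending on where the match starts.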
【Keywords】: kleene-closure; pattern matching; sequence queries; sql; xpath
【Paper Link】 【Pages】:1147-1150
【Authors】: Pranav Vaidya ; Jaehwan John Lee ; Francis Bowen ; Yingzi Du ; Chandima H. Nadungodage ; Yuni Xia
【Abstract】: Numerous monitoring applications such as traffic control systems, border patrol monitoring, and person locater services generate a large number of multimedia data streams that need to be analyzed and processed using image processing and data stream management techniques in order to detect significant events of interest or abnormal conditions. Such multimedia monitoring systems usually have high-bandwidth characteristics, stricter real-time deadlines, and high accuracy requirements. In an attempt to meet all of these requirements, we have designed Symbiote - a distributed Reconfigurable Logic Assisted multimedia Data Stream Management System (RLADSMS) at Indiana University Purdue University, Indianapolis (IUPUI) that provides hardware accelerated data stream processing using Field Programmable Gate Arrays (FPGAs).
【Keywords】: data stream management systems; fpgas; hardware accelerator
【Paper Link】 【Pages】:1151-1154
【Authors】: Di Yang ; Zhenyu Guo ; Zaixian Xie ; Elke A. Rundensteiner ; Matthew O. Ward
【Abstract】: We will demonstrate our system, called ViStream, which supports interactive visual exploration of neighbor-based patterns [7] in data streams. ViStream not only applies innovative multi-query strategies to compute a broad range of popular patterns, such as clusters and outliers, in a highly efficient manner, but also provides a rich set of visual interfaces and interactions to enable real-time pattern exploration. With ViStream, analysts can easily interact with pattern mining processes by navigating along time horizons, abstraction levels, and parameter spaces, and thus better understand the phenomena of interest.
【Keywords】: pattern mining; streaming data; visual interaction
【Paper Link】 【Pages】:1155-1158
【Authors】: Michael Mathioudakis ; Nick Koudas
【Abstract】: We present TwitterMonitor, a system that performs trend detection over the Twitter stream. The system identifies emerging topics (i.e. 'trends') on Twitter in real time and provides meaningful analytics that synthesize an accurate description of each topic. Users interact with the system by ordering the identified trends using different criteria and submitting their own description for each trend. We discuss the motivation for trend detection over social media streams and the challenges that lie therein. We then describe our approach to trend detection, as well as the architecture of TwitterMonitor. Finally, we lay out our demonstration scenario.
【Keywords】: social media analysis; trend detection
【Paper Link】 【Pages】:1159-1162
【Authors】: René Müller ; Jens Teubner ; Gustavo Alonso
【Abstract】: Field-programmable gate arrays (FPGAs) are a promising technology that can be used in database systems. In this demonstration we show Glacier, a library and a compiler that can be employed to implement streaming queries as hardware circuits on FPGAs. Glacier consists of a library of compositional hardware modules that represent stream processing operators. Given a query execution plan, the compiler instantiates the corresponding components and wires them up to a digital circuit. The goal of this demo is to show the flexibility of the compositional approach.
【Keywords】: fpga; query compilation; stream processing
【Paper Link】 【Pages】:1163-1166
【Authors】: Hilit Achiezra ; Konstantin Golenberg ; Benny Kimelfeld ; Yehoshua Sagiv
【Abstract】: A system for keyword search on data graphs is demonstrated on two challenging datasets: the large DBLP and Mondial (which is highly cyclic and has a complex schema). The system supports search, exploration, and question answering. The demonstration shows how the system copes with the main challenges in keyword search on data graphs. In particular, the system generates answers efficiently and completely (i.e., it does not miss answers). It has an effective ranking mechanism that also takes into account redundancies among answers. Finally, the system uses a novel technique for displaying multi-node subtrees in a compact graphical form that facilitates quick and easy understanding, which is essential for effective browsing of the answers.
【Keywords】: information retrieval on graphs; keyword search on graphs; redundancy elimination
【Paper Link】 【Pages】:1167-1170
【Authors】: Mark Sifer ; Jian Lin ; Yutaka Watanobe ; Subhash Bhalla
【Abstract】: We demonstrate a system that integrates a novel OLAP component with a keyword search engine, to support querying over sparse and ragged corpus data. The key contribution of our system is the integration of dynamically selected point sets such as search results with OLAP querying over aggregated data. During the demonstration, participants will be able to enter a keyword search; observe the returned list of result files; observe distributional features such as outliers and clusters of results in the corpus in multiple dimension views; and select and partition corpus slices in the OLAP component to narrow search results. Participants will be able to experience not just the individual querying features of our system, but the way that they work together to facilitate smooth interaction sequences that combine OLAP and keyword search querying.
【Keywords】: faceted search; olap; user interface
【Paper Link】 【Pages】:1171-1174
【Authors】: Sanjay Agrawal ; Kaushik Chakrabarti ; Surajit Chaudhuri ; Venkatesh Ganti ; Arnd Christian König ; Dong Xin
【Abstract】: Many web queries seek information about named entities (such as products or people). Web search engines federate such entity-oriented queries to relevant structured databases; the results of those searches are then returned to the user along with web search results. Current federated approaches have two limitations: (i) they often fail to return important results for a broad class of such entity-oriented queries and (ii) the information they return per entity is often inadequate. In this paper, we present the Query Portals system that addresses these limitations. The Query Portals system dynamically generates a portal for an entity-oriented query. It first provides an overview of the relevant entities and further allows users to drill down to gather additional information on these entities. Our architecture uses a judicious combination of pre-processing and query time techniques so that the query portal can be generated efficiently.
【Keywords】: entity search; information extraction; portals; vertical search
【Paper Link】 【Pages】:1175-1178
【Authors】: Luciano Barbosa ; Hoa Nguyen ; Thanh Hoang Nguyen ; Ramesh Pinnamaneni ; Juliana Freire
【Abstract】: We present DeepPeep (http://www.deeppeep.org), a new system for discovering, organizing and analyzing Web forms. DeepPeep allows users to explore the entry points to hidden-Web sites whose contents are out of reach for traditional search engines. Besides demonstrating important features of DeepPeep and describing the infrastructure we used to build the system, we will show how this infrastructure can be used to create form collections and form search engines for different domains. We also present the analysis component of DeepPeep which allows users to explore and visualize information in form repositories, helping them not only to better search and understand forms in different domains, but also to refine the form gathering process.
【Keywords】: focused crawling; hidden web; learning classifiers; search engines; web forms
【Paper Link】 【Pages】:1179-1182
【Authors】: Kenneth P. Smith ; Craig Bonaceto ; Chris Wolf ; Beth Yost ; Michael Morse ; Peter Mork ; Doug Burdick
【Abstract】: Large, dynamic, and ad-hoc organizations must frequently initiate data integration and sharing efforts with insufficient awareness of how organizational data sources are related. Decision makers need to reason about data model interactions much as they do about data instance interactions in OLAP: at multiple levels of granularity. We demonstrate an integrated environment for exploring schema similarity across multiple resolutions. Users visualize and interact with clusters of related schemas using a tool named Affinity. Within any cluster, users may drill down to examine the extent and content of schema overlap. Further drill-down enables users to explore fine-grained element-level correspondences between two selected schemas.
【Keywords】: schema similarity exploration
【Paper Link】 【Pages】:1183-1186
【Authors】: Ioannis Alagiannis ; Debabrata Dash ; Karl Schnaitter ; Anastasia Ailamaki ; Neoklis Polyzotis
【Abstract】: Tuning tools attempt to configure a database to achieve optimal performance for a given workload. Selecting an optimal set of physical structures is computationally hard, since it involves searching a vast space of possible configurations. Commercial DBMSs offer tools that address this problem. The usefulness of such tools, however, is limited by their dependence on greedy heuristics, the need for a priori (offline) knowledge of the workload, and the lack of an optimal materialization schedule to get the best out of the suggested design features. Moreover, open-source DBMSs do not provide any automated tuning tools. This demonstration introduces a comprehensive physical designer for the PostgreSQL open-source DBMS. The tool suggests design features for both offline and online workloads. It provides close-to-optimal index suggestions for a given workload by modeling the problem as a combinatorial optimization problem and solving it with sophisticated, mature solvers. It also determines the interactions between indexes to suggest an effective materialization strategy for the selected indexes. The tool is interactive: it allows the database administrator (DBA) to suggest a set of candidate features and shows their benefits and interactions visually. For the demonstration we use large real-world scientific datasets and query workloads.
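To give the flavor of the underlying selection problem (which the tool itself hands to mature combinatorial solvers), here is a deliberately simplified greedy heuristic over hypothetical candidate indexes, each annotated with an estimated size and estimated workload benefit; the numbers and names are invented for illustration:

```python
def select_indexes(candidates, budget):
    """Pick candidate indexes to maximize estimated workload benefit
    within a storage budget.  Greedy by benefit density (benefit per
    unit of size), a knapsack-style heuristic -- real tools solve the
    combinatorial problem more carefully, accounting for index
    interactions."""
    chosen, used = [], 0
    ranked = sorted(candidates, key=lambda c: c["benefit"] / c["size"],
                    reverse=True)
    for idx in ranked:
        if used + idx["size"] <= budget:
            chosen.append(idx["name"])
            used += idx["size"]
    return chosen, used
```

The index-interaction aspect the abstract mentions is exactly what this greedy sketch ignores: the benefit of one index can change once another is materialized.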
【Keywords】: continuous tuning; index interaction; physical design tuning
【Paper Link】 【Pages】:1187-1190
【Authors】: Sreeram Balakrishnan ; Vivian Chu ; Mauricio A. Hernández ; Howard Ho ; Rajasekar Krishnamurthy ; Shixia Liu ; Jan Pieper ; Jeffrey S. Pierce ; Lucian Popa ; Christine Robson ; Lei Shi ; Ioana Roxana Stanoi ; Edison L. Ting ; Shivakumar Vaithyanathan ; Huahai Yang
【Abstract】: The primary goal of the Midas project is to build a system that enables easy and scalable integration of unstructured and semi-structured information present across multiple data sources. As a first step in this direction, we have built a system that extracts and integrates information from regulatory filings submitted to the U.S. Securities and Exchange Commission (SEC) and the Federal Deposit Insurance Corporation (FDIC). Midas creates a repository of entities, events, and relationships by extracting, conceptualizing, integrating, and aggregating data from unstructured and semi-structured documents. This repository enables applications to use the extracted and integrated data in a variety of ways including mashups with other public data and complex risk analysis.
【Keywords】: data cleansing; financial data; information extraction; information integration
【Paper Link】 【Pages】:1191-1194
【Authors】: James F. Terwilliger ; Philip A. Bernstein ; Adi Unnithan
【Abstract】: Schema evolution is an unavoidable consequence of the application development lifecycle. The two primary schemas in an application, the client conceptual object model and the persistent database model, must co-evolve or risk quality, stability, and maintainability issues. We present MoDEF, an extension to Visual Studio that supports automatic evolution of object-relational mapping artifacts in the Microsoft Entity Framework. When starting with a valid mapping between client and store, MoDEF translates changes made to a client model into incremental changes to the store as an upgrade script, along with a new valid mapping to the new store. MoDEF mines the existing mapping for mapping patterns which MoDEF reuses for new client artifacts.
【Keywords】: model management; o-r mapping; schema evolution
【Paper Link】 【Pages】:1195-1198
【Authors】: Matteo Magnani ; Danilo Montesi
【Abstract】: In this paper we describe a demo concerning the management of uncertain schemata. Many works have studied the problem of representing uncertainty on attribute values or tuples, such as the fact that a value is 10 with probability .3 or 20 with probability .7, leading to the implementation of probabilistic database management systems. In our demo we deal with the representation of uncertainty about the meta-data, i.e., about the meaning of these values. Using our system it is possible to create alternative probabilistic schemata on a database, execute queries over uncertain schemata, and verify how this additional information is stored in an underlying relational database and how queries are executed.
【Keywords】: schema; uncertainty; us-sql
【Paper Link】 【Pages】:1199-1202
【Authors】: Franz Graf ; Hans-Peter Kriegel ; Matthias Renz ; Matthias Schubert
【Abstract】: Modern maps provide a variety of information about roads and their surrounding landscape, allowing navigation systems to go beyond simple shortest path computation. In this demo, we show how the concept of skyline queries can be successfully adapted to routing problems considering multiple road attributes. In particular, we demonstrate how to compute several pareto-optimal paths which contain optimal results for a variety of user preferences. The PAROS system has two main purposes. The first is to calculate the route skyline for a starting point and a destination. Our demonstrator visualizes the result set for up to three road attributes. To this end, we provide a dual view on the computed skyline paths. The first view displays the result paths on the road map itself. The second view describes the result paths in the property space, displaying the trade-off between the underlying criteria. Thus, a user can browse through the results in order to find the path that best fits their personal preferences. The second component of our system supports analysis tasks. In this component, we illustrate the functionality of the underlying route skyline algorithm by providing benchmark information about processing time and the search space visited during route skyline computation.
【Keywords】: gis; performance; query processing; road network; route finding; skyline query
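The route-skyline idea above rests on Pareto dominance over path attributes. The following minimal sketch illustrates that core filter; the attribute tuples and the brute-force loop are illustrative only and do not reflect the actual PAROS algorithm, which prunes the search space during route computation:

```python
# A candidate path is a tuple of cost attributes to minimize,
# e.g. (distance, travel_time, fuel) -- names are hypothetical.

def dominates(a, b):
    """a dominates b if a is no worse in every attribute and
    strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(paths):
    """Keep only the pareto-optimal paths: those not dominated
    by any other candidate."""
    return [p for p in paths
            if not any(dominates(q, p) for q in paths if q is not p)]

paths = [(10, 5, 3), (8, 7, 4), (12, 4, 2), (9, 6, 5), (8, 8, 4)]
# (8, 8, 4) is dominated by (8, 7, 4); the other four are incomparable.
print(skyline(paths))
```

Each surviving tuple represents a distinct trade-off among the criteria, which is exactly what the demo's property-space view lets the user browse.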
【Paper Link】 【Pages】:1203-1206
【Authors】: Zhenhui Li ; Ming Ji ; Jae-Gil Lee ; Lu An Tang ; Yintao Yu ; Jiawei Han ; Roland Kays
【Abstract】: With the maturity of GPS, wireless, and Web technologies, increasing amounts of movement data collected from various moving objects, such as animals, vehicles, mobile devices, and climate radars, have become widely available. Analyzing such data has broad applications, e.g., in ecological study, vehicle control, mobile communication management, and climatological forecast. However, few data mining tools are available for flexible and scalable analysis of massive-scale moving object data. Our system, MoveMine, is designed for sophisticated moving object data mining by integrating several attractive functions including moving object pattern mining and trajectory mining. We explore both state-of-the-art and novel techniques in implementing the selected functions. A user-friendly interface is provided to facilitate interactive exploration of mining results and flexible tuning of the underlying methods. Since MoveMine is tested on multiple kinds of real data sets, it helps users carry out versatile analyses on such data, and it helps researchers recognize the value and limitations of current techniques as well as potential future directions in moving object data mining.
【Keywords】: moving objects; pattern/trajectory mining; visualization
【Paper Link】 【Pages】:1207-1210
【Authors】: Michael Armbrust ; Stephen Tu ; Armando Fox ; Michael J. Franklin ; David A. Patterson ; Nick Lanham ; Beth Trushkowsky ; Jesse Trutna
【Abstract】: Large-scale websites are increasingly moving from relational databases to distributed key-value stores for high request rate, low latency workloads. Often this move is motivated not only by key-value stores' ability to scale simply by adding more hardware, but also by the easy-to-understand, predictable performance they provide for all operations. While this data model works well, lookups are only done by primary key. More complex queries require onerous, explicit index management and imperative data lookups by the developer. We demonstrate PIQL, a Performance Insightful Query Language that allows developers to express many of the queries found on these websites, while still providing strict bounds on the number of I/O operations for any query.
【Keywords】: databases
【Paper Link】 【Pages】:1211-1214
【Authors】: Mianwei Zhou ; Tao Cheng ; Kevin Chen-Chuan Chang
【Abstract】: Witnessing the richness of data in document content and many ad-hoc efforts for finding such data, we propose a Data-oriented Content Query System (DoCQS), which is oriented towards fine-granularity data of all types by searching directly into document content. DoCQS uses the relational model as the underlying data model, and offers a powerful and flexible Content Query Language (CQL) to adapt to diverse query demands. In this demonstration, we show how to model various search tasks by CQL statements, and how the system architecture efficiently supports the CQL execution. Our online demo of the system is available at http://wisdm.cs.uiuc.edu/demos/docqs/.
【Keywords】: content query; cql; data retrieval; docqs; entity search; relational data model
【Paper Link】 【Pages】:1215-1218
【Authors】: Manasi Vartak ; Venkatesh Raghavan ; Elke A. Rundensteiner
【Abstract】: In many business and consumer applications, queries have cardinality constraints. However, current database systems provide minimal support for cardinality assurance. Consequently, users must adopt a cumbersome trial-and-error approach to find queries that are close to the original query but also attain the desired cardinality. In this demonstration, we present QRelX, a novel framework to automatically generate alternate queries that meet the cardinality and closeness criteria. QRelX employs an innovative query space transformation strategy, proximity-based search, and incremental cardinality estimation to efficiently find alternate queries. Our demonstration is an interactive game that allows the audience to compete with QRelX via manual query refinement. We illustrate the importance of cardinality assurance through real-time comparisons between manual refinement and QRelX. We also highlight the novelty of our solution by visualizing the core algorithms of QRelX.
【Keywords】: cardinality assurance; query refinement
【Paper Link】 【Pages】:1219-1222
【Authors】: Matias Bjørling ; Lionel Le Folgoc ; Ahmed Mseddi ; Philippe Bonnet ; Luc Bouganim ; Björn Þór Jónsson
【Abstract】: It is amazingly easy to get meaningless results when measuring flash devices, partly because of the peculiarity of flash memory, but primarily because their behavior is determined by layers of complex, proprietary, and undocumented software and hardware. In this demonstration, we share the lessons we learnt developing the uFlip benchmark and conducting experiments with a wide range of flash devices. We illustrate the problems that are actual obstacles to sound performance and energy measurements, and we show how to mitigate the effects of these problems. We also present the uFlip web site and its on-line visualization tool that should help the research community investigate flash device behavior.
【Keywords】: benchmarking; energy measurements; flash devices; methodology; ssd; uflip
【Paper Link】 【Pages】:1223-1226
【Authors】: Mohamed Yakout ; Ahmed K. Elmagarmid ; Jennifer Neville ; Mourad Ouzzani
【Abstract】: Improving data quality is a time-consuming, labor-intensive and often domain-specific operation. Existing data repair approaches are either fully automated or not efficient in interactively involving the users. We present a demo of GDR, a Guided Data Repair system that uses a novel approach to efficiently involve the user alongside automatic data repair techniques to reach better data quality as quickly as possible. Specifically, GDR generates data repairs and acquires feedback on those repairs that would be most beneficial in improving the data quality. GDR quantifies the data quality benefit of generated repairs by combining mechanisms from decision theory and active learning. Based on these benefit scores, groups of repairs are ranked and displayed to the user. User feedback is used to train a machine learning component to eventually replace the user in deciding on the validity of a suggested repair. We describe how the generated repairs are ranked and displayed to the user in a "useful-looking" way and demonstrate how data quality can be effectively improved with minimal feedback from the user.
【Keywords】: data cleaning; data quality; data repair; interactive system
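The benefit-based ranking described in the GDR abstract can be pictured as ordering candidate repairs by an expected-quality-gain score. The sketch below is entirely illustrative: the score (probability a repair is correct times the number of violations it resolves) and all field names are assumptions, not GDR's actual model:

```python
# Each candidate repair targets one cell and carries two hypothetical
# measures: p_correct (predicted probability the repair is right) and
# impact (constraint violations it would resolve).
repairs = [
    {"cell": ("t1", "zip"),  "value": "47906",     "p_correct": 0.9, "impact": 3},
    {"cell": ("t2", "city"), "value": "Lafayette", "p_correct": 0.6, "impact": 5},
    {"cell": ("t3", "zip"),  "value": "47907",     "p_correct": 0.4, "impact": 1},
]

def benefit(r):
    # Expected quality gain of applying (and confirming) this repair.
    return r["p_correct"] * r["impact"]

# Show the user the highest-benefit repairs first, so each round of
# feedback improves data quality as much as possible.
ranked = sorted(repairs, key=benefit, reverse=True)
print([r["cell"] for r in ranked])
```

In the real system the scores would be updated as user feedback retrains the learning component; here they are static for clarity.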
【Paper Link】 【Pages】:1227-1230
【Authors】: Georgios Giannikis ; Philipp Unterbrunner ; Jeremy Meyer ; Gustavo Alonso ; Dietmar Fauser ; Donald Kossmann
【Abstract】: This demonstration presents Crescando, an implementation of a distributed relational table that guarantees predictable response time on unpredictable workloads. In Crescando, data is stored in main memory and accessed via full-table scans. By using scans instead of index lookups, Crescando overcomes the read-write contention in index structures and eliminates the scalability issues that exist in traditional index-based systems. Crescando is specifically designed to process a large number of queries in parallel, allowing high query rates. The goal of this demonstration is to show the ability of Crescando to a) quickly answer arbitrary user-generated queries, and b) execute a large number of queries and updates in parallel, while providing strict response time and data freshness guarantees.
【Keywords】: main memory; parallel database
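Crescando's central idea, answering many queries with one shared full-table scan instead of per-query index lookups, can be sketched in a few lines. The table contents and predicates below are made up; the point is that a single pass serves the whole query batch, so response time depends on table size rather than on index contention:

```python
# Toy in-memory table; rows and attributes are hypothetical.
table = [
    {"id": 1, "city": "Zurich", "price": 120},
    {"id": 2, "city": "Geneva", "price": 80},
    {"id": 3, "city": "Zurich", "price": 95},
]

# A batch of pending queries: (query_id, predicate) pairs.
queries = [
    ("q1", lambda r: r["city"] == "Zurich"),
    ("q2", lambda r: r["price"] < 100),
]

def shared_scan(table, queries):
    """One full-table scan evaluates every pending query: each row is
    tested against all predicates in the batch."""
    results = {qid: [] for qid, _ in queries}
    for row in table:               # exactly one pass, regardless of batch size
        for qid, pred in queries:
            if pred(row):
                results[qid].append(row["id"])
    return results

print(shared_scan(table, queries))  # {'q1': [1, 3], 'q2': [2, 3]}
```

Because every query sees a bounded, predictable scan time, this model yields the response-time and freshness guarantees the demonstration highlights; the real system additionally partitions the table across cores and machines.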
【Paper Link】 【Pages】:1231-1234
【Authors】: Vamsidhar Thummala ; Shivnath Babu
【Abstract】: iTuned is a tool that takes a SQL workload as input and recommends good settings for database configuration parameters such as buffer pool sizes, multi-programming level, and number of I/O daemons. iTuned also provides response-surface and sensitivity-analysis plots that help the end-user analyze the impact of each parameter. iTuned has the following novel features: (i) a technique called Adaptive Sampling that proactively brings in appropriate data through planned experiments to find high-impact parameters and high-performance parameter settings, (ii) an executor that supports on-line experiments in production database environments through a cycle-stealing paradigm that places minimal overhead on the production workload, and (iii) portability across different database systems. This demonstration will focus on the interesting use-case scenarios of iTuned, and show the effectiveness of recommended configuration settings on popular database benchmarks.
【Keywords】: configuration; tuning; visualization
【Paper Link】 【Pages】:1235-1238
【Authors】: Nicolas Anciaux ; Luc Bouganim ; Yanli Guo ; Philippe Pucheral ; Jean-Jacques Vandewalle ; Shaoyi Yin
【Abstract】: An increasing amount of personal data is automatically gathered on servers by administrations, hospitals and private companies, while several security surveys highlight the failure of database servers to keep confidential data really private. The advent of powerful secure tokens, combining the security of smart card microcontrollers with the storage capacity of NAND Flash chips, introduces a credible alternative to the systematic centralization of personal data. By embedding a full-fledged database server in such a device, an individual can now store her personal data in her own secure token, kept under her control, and never disclose her private data in the clear to the untrusted outside world. This demonstration shows the benefit of the proposed approach in terms of privacy protection and pervasiveness through a healthcare scenario. This scenario is extracted from a field experiment where medical folders embedded in secure tokens are used to improve the coordination of medical care at home for elderly people. The demonstration also highlights interesting features of the embedded DBMS engine introduced to tackle the secure token's strong hardware constraints.
【Keywords】: privacy; secure device; storage model
【Paper Link】 【Pages】:1239-1242
【Authors】: Mohamed Nabeel ; Ning Shang ; John Zage ; Elisa Bertino
【Abstract】: We propose to demonstrate Mask, the first system addressing the seemingly-unsolvable problem of how to selectively share contents among a group of users based on access control policies expressed as conditions against the identity attributes of these users while at the same time assuring the privacy of these identity attributes from the content publisher. Mask consists of three entities: a Content Publisher, Users referred to as Subscribers, and Identity Providers that issue certified identity attributes. The content publisher specifies access control policies against identity attributes of subscribers indicating which conditions the identity attributes of a subscriber must verify in order for this subscriber to access a document or a subdocument. The main novelty of Mask is that, even though the publisher is able to match the identity attributes of the subscribers against its own access control policies, the publisher does not learn the values of the identity attributes of the subscribers; the privacy of the authorized subscribers is thus preserved. Based on the specified access control policies, documents are divided into subdocuments and the subdocuments having different access control policies are encrypted with different keys. Subscribers derive the keys corresponding to the subdocuments they are authorized to access. Key distribution in Mask is supported by a novel group key management protocol by which subscribers can reconstruct the decryption keys from the subscription information they receive from the publisher. The publisher however does not learn which decryption keys each subscriber is able to reconstruct. In this demonstration, we show our system using a healthcare scenario.
【Keywords】: access control; broadcast systems; group key management; identity; privacy
【Paper Link】 【Pages】:1243-1246
【Authors】: Yasin N. Silva ; Ahmed M. Aly ; Walid G. Aref ; Per-Åke Larson
【Abstract】: The identification and processing of similarities in the data play a key role in multiple application scenarios. Several types of similarity-aware operations have been studied in the literature. However, in most of the previous work, similarity-aware operations are studied in isolation from other regular or similarity-aware operations. Furthermore, most of the previous research in the area considers a standalone implementation, i.e., without any integration with a database system. In this demonstration we present SimDB, a similarity-aware database management system. SimDB supports multiple similarity-aware operations as first-class database operators. We describe the architectural changes to implement the similarity-aware operators. In particular, we present the way conventional operators' implementation machinery is extended to support similarity-aware operators. We also show how these operators interact with other similarity-aware and regular operators. Moreover, we show the effectiveness of multiple equivalence rules that can be used to extend cost-based query optimization to the case of similarity-aware operations.
【Keywords】: similarity group-by; similarity join; similarity-aware query processing and optimization
【Paper Link】 【Pages】:1247-1250
【Authors】: Justin J. Levandoski ; Mohamed F. Mokbel ; Mohamed E. Khalefa ; Venkateshwar R. Korukanti
【Abstract】: This demonstration presents FlexPref, a framework implemented inside the DBMS query processor that enables efficient and extensible preference query processing. FlexPref provides query processing support inside the database engine for a wide array of preference evaluation methods (e.g., skyline, top-k, k-dominance, k-frequency) in a single extensible code base. Integration with FlexPref is simple, involving the registration of only three functions that capture the essence of the preference method. Once integrated, the preference method "lives" at the core of the database, enabling the efficient execution of preference queries involving common database operations (e.g., selection, join). Functionality of FlexPref, implemented inside PostgreSQL, is demonstrated through the implementation and use of several state-of-the-art preference methods in a real application scenario.
【Keywords】: preference queries; preference query processing
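The register-a-few-functions design FlexPref describes can be illustrated with a small plugin registry: a preference method supplies its comparison logic, and a generic evaluation loop does the rest. This is a loose sketch, not FlexPref's actual three-function C interface, and skyline is used as the example method:

```python
# Registry mapping a preference-method name to its pairwise
# "does u make t redundant?" callback.
REGISTRY = {}

def register(name, dominates):
    """Plug a preference method into the generic engine."""
    REGISTRY[name] = dominates

def preference_query(method, tuples):
    """Generic evaluation loop shared by every registered method:
    keep the tuples no other tuple makes redundant."""
    dom = REGISTRY[method]
    return [t for t in tuples
            if not any(dom(u, t) for u in tuples if u is not t)]

def skyline_dominates(a, b):
    """Skyline's callback: a dominates b when a is no worse everywhere
    and differs somewhere (minimization assumed)."""
    return all(x <= y for x, y in zip(a, b)) and a != b

register("skyline", skyline_dominates)
# (3, 3) is dominated by (2, 2); the rest are incomparable trade-offs.
print(preference_query("skyline", [(1, 9), (2, 2), (3, 3), (4, 1)]))
```

Adding another method (say, top-k with a scoring callback and a truncation step) would mean registering new callbacks rather than touching the engine, which is the extensibility point the abstract makes.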
【Paper Link】 【Pages】:1251-1252
【Authors】: Jiawei Han ; Yizhou Sun ; Xifeng Yan ; Philip S. Yu
【Abstract】: Most people consider a database merely a data repository that supports data storage and retrieval. Actually, a database contains rich, inter-related, multi-typed data and information, forming one or a set of gigantic, interconnected, heterogeneous information networks. Much knowledge can be derived from such information networks if we systematically develop an effective and scalable database-oriented information network analysis technology. In this tutorial, we introduce database-oriented information network analysis methods and demonstrate how information networks can be used to improve data quality and consistency, facilitate data integration, and generate interesting knowledge. This tutorial presents an organized picture on how to turn a database into one or a set of organized heterogeneous information networks, how information networks can be used for data cleaning, data consolidation, and data quality improvement, how to discover various kinds of knowledge from information networks, how to perform OLAP in information networks, and how to transform database data into knowledge by information network analysis. Moreover, we present interesting case studies on real datasets, including DBLP and Flickr, and show how interesting and organized knowledge can be generated from database-oriented information networks.
【Keywords】: data mining; information networks; link analysis
【Paper Link】 【Pages】:1253-1254
【Authors】: Carlos Ordonez ; Javier García-García
【Abstract】: Data mining remains an important research area in database systems. We present a review of processing alternatives, storage mechanisms, algorithms, data structures and optimizations that enable data mining on large data sets. We focus on the computation of well-known multidimensional statistical and machine learning models. We pay particular attention to SQL and MapReduce as two competing technologies for large scale processing. We conclude with a summary of solved major problems and open research issues.
【Keywords】: dbms; mapreduce; sql; statistical model; udf
【Paper Link】 【Pages】:1255-1256
【Authors】: Divesh Srivastava ; Suresh Venkatasubramanian
【Abstract】: We explore the use of information theory as a tool to express and quantify notions of information content and information transfer for representing and analyzing data, using examples from database design, data integration and data anonymization. We also examine the computational challenges associated with information-theoretic primitives, indicating how they might be computed efficiently.
【Keywords】: data anonymization; data integration; database design; entropy; estimation; kl divergence; mutual information
【Paper Link】 【Pages】:1257-1258
【Authors】: Laura Chiticariu ; Yunyao Li ; Sriram Raghavan ; Frederick Reiss
【Abstract】: Information extraction (IE) - the problem of extracting structured information from unstructured text - has become an increasingly important topic in recent years. A SIGMOD 2006 tutorial [3] outlined challenges and opportunities for the database community to advance the state of the art in information extraction, and posed the following grand challenge: "Can we build a System R for information extraction?" Our tutorial gives an overview of progress the database community has made towards meeting this challenge. In particular, we start by discussing design requirements in building an enterprise IE system. We then survey recent technological advances towards addressing these requirements, broadly categorized as: (1) languages for specifying extraction programs in a declarative way, thus allowing database-style performance optimizations; (2) infrastructure needed to ensure scalability; and (3) development support for enterprise IE systems. Finally, we outline several open challenges and opportunities for the database community to further advance the state of the art in enterprise IE systems. The tutorial is intended for students and researchers interested in information extraction and its applications, and assumes no prior knowledge of the area.
【Keywords】: accuracy; declarative; enterprise applications; information extraction; scalability; usability
【Paper Link】 【Pages】:1259-1260
【Authors】: Sihem Amer-Yahia ; AnHai Doan ; Jon M. Kleinberg ; Nick Koudas ; Michael J. Franklin
【Abstract】:
【Keywords】: cloud computing; crowd/cloud systems; data integration; social computing