30th IEEE International Conference on Data Engineering, ICDE 2014, Chicago, IL, USA, March 31 - April 4, 2014. IEEE Computer Society 【DBLP Link】
【Paper Link】 【Pages】:1
【Authors】: Anastasia Ailamaki
【Abstract】: The amount of data collected in the last two years is higher than the amount of data collected since the dawn of time. Businesses are drowning in data, and need several months of ETL processing just to prepare it for querying. Domain scientists collect data much faster than it can be transformed into valuable information, and are often forced into hasty decisions about which parts to discard, potentially throwing away valuable data before it has been fully exploited. The reason is that query processing, the mechanism for squeezing information out of data, becomes slower as datasets grow larger. At the same time, the continuously increasing number of hardware contexts slows processing down further, as keeping all cores busy with useful computation is difficult. Today's query engines harness only a fraction of the potential of new hardware platforms.
【Keywords】:
【Paper Link】 【Pages】:2
【Authors】: Amit P. Sheth
【Abstract】: Big Data has captured a lot of interest in industry, with anticipation of better decisions, efficient organizations, and many new jobs. Much of the emphasis is on the challenges of the four V's of Big Data: Volume, Variety, Velocity, and Veracity, and on technologies that handle volume, including storage and computational techniques to support analysis (Hadoop, NoSQL, MapReduce, etc.). However, the most important feature of Big Data, its raison d'être, is none of these four V's, but value. In this talk, I will put forward the concept of Smart Data, realized by extracting value from a variety of data, and show how Smart Data for the growing variety (e.g., social, sensor/IoT, health care) of Big Data enables a much larger class of applications that can benefit not just large companies but each individual. This requires organized ways to harness and overcome the four V-challenges. In particular, we will need to utilize metadata, employ semantics and intelligent processing, and go beyond traditional reliance on ML and NLP.
【Keywords】:
【Paper Link】 【Pages】:3-14
【Authors】: Pei Lee ; Laks V. S. Lakshmanan ; Evangelos E. Milios
【Abstract】: Dynamic networks are commonly found in the current web age. In scenarios like social networks and social media, dynamic networks are noisy, large-scale, and fast-evolving. In this paper, we focus on the cluster evolution tracking problem on highly dynamic networks, with clear application to event evolution tracking. Several previous works address data stream clustering using a node-by-node approach for maintaining clusters. However, handling bulk updates, i.e., a subgraph at a time, is critical for achieving acceptable performance over very large, highly dynamic networks. We propose a subgraph-by-subgraph incremental tracking framework for cluster evolution in this paper. To effectively illustrate the techniques in our framework, we consider the event evolution tracking task in social streams as an application, where a social stream and an event are modeled as a dynamic post network and a dynamic cluster, respectively. Monitoring through a fading time window, we introduce a skeletal graph to summarize the information in the dynamic network, and formalize cluster evolution patterns using a group of primitive evolution operations and their algebra. Two incremental computation algorithms are developed to maintain clusters and track evolution patterns as time rolls on and the network evolves. Our detailed experimental evaluation on large Twitter datasets demonstrates that our framework can effectively track the complete set of cluster evolution patterns from highly dynamic networks on the fly.
【Keywords】: algebra; evolutionary computation; learning (artificial intelligence); network theory (graphs); pattern clustering; social networking (online); Twitter datasets; algebra; bulk update handling; cluster evolution patterns; cluster evolution tracking problem; cluster maintenance; data stream clustering; dynamic networks; event evolution tracking; fading time window; incremental cluster evolution tracking; network data; node-by-node approach; primitive evolution operations; skeletal graph; social media; social networks; social stream; subgraph-by-subgraph incremental tracking framework; Clustering algorithms; Fading; Heuristic algorithms; Monitoring; Noise; Robustness; Twitter
【Paper Link】 【Pages】:15-27
【Authors】: Guanhua Yan
【Abstract】: Data clustering is a basic technique for knowledge discovery and data mining. As the volume of data grows significantly, data clustering becomes computationally prohibitive and resource demanding, and sometimes it is necessary to outsource these tasks to third-party experts who specialize in data clustering. The goal of this work is to develop techniques that find common ground among experts' opinions on data clustering, which may be biased due to the features or algorithms used in clustering. Our work differs from the large body of existing approaches to consensus clustering, as we do not require that all data objects be grouped into clusters. Rather, our work is motivated by real-world applications that demand high confidence in how data objects, if they are selected, are grouped together. We formulate the problem rigorously and show that it is NP-complete. We further develop a lightweight technique based on finding a maximum independent set in a 3-uniform hypergraph to select data objects that do not form conflicts among experts' opinions. We apply our proposed method to a real-world malware dataset with hundreds of thousands of instances to find malware clusters based on how multiple major AV (Anti-Virus) software products classify these samples. Our work offers a new direction for consensus clustering by striking a balance between clustering quality and the number of data objects chosen to be clustered.
【Keywords】: computational complexity; computer viruses; data mining; graph theory; pattern clustering; 3-uniform hypergraph; AV software; NP-complete; antivirus software; clustering quality; common ground; consensus clustering; data clustering; data mining; data objects; expert opinions; knowledge discovery; malware analysis; malware clusters; Feature extraction
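As an illustrative aside, the maximum-independent-set idea in the abstract above can be sketched greedily: treat each conflict among experts' opinions as a hyperedge over three objects, and repeatedly drop the object involved in the most unresolved conflicts. This is a toy sketch under assumed inputs (the conflict triples and object names below are invented, and the paper's actual selection technique may differ):

```python
# Hypothetical sketch: select objects free of clustering conflicts by
# greedily approximating a maximum independent set in a 3-uniform
# conflict hypergraph. Conflict triples here are toy data, not the
# paper's construction.
from collections import Counter

def independent_objects(objects, conflict_triples):
    """Drop the object appearing in the most unresolved conflict
    triples until no triple is fully contained in the survivors."""
    alive = set(objects)
    triples = [set(t) for t in conflict_triples]
    while True:
        unresolved = [t for t in triples if t <= alive]
        if not unresolved:
            return alive
        counts = Counter(x for t in unresolved for x in t)
        worst, _ = counts.most_common(1)[0]
        alive.discard(worst)

objs = ["a", "b", "c", "d", "e"]
conflicts = [("a", "b", "c"), ("a", "c", "d")]
print(sorted(independent_objects(objs, conflicts)))
```

The surviving objects can then be clustered with high confidence, since no expert disagreement triple remains among them.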
【Paper Link】 【Pages】:28-39
【Authors】: Hao Huang ; Yunjun Gao ; Kevin Chiew ; Lei Chen ; Qinming He
【Abstract】: Mining arbitrary shaped clusters in large data sets is an open challenge in data mining. Various approaches to this problem have been proposed, most with high time complexity. To save computational cost, some algorithms try to shrink a data set to a smaller number of representative data examples. However, their user-defined shrinking ratios may significantly affect the clustering performance. In this paper, we present CLASP, an effective and efficient algorithm for mining arbitrary shaped clusters. It automatically shrinks the size of a data set while effectively preserving the shape information of clusters in the data set with representative data examples. Then, it adjusts the positions of these representative data examples to enhance their intrinsic relationship and make the cluster structures more clear and distinct for clustering. Finally, it performs agglomerative clustering to identify the cluster structures with the help of a mutual k-nearest-neighbors-based similarity metric called Pk. Extensive experiments on both synthetic and real data sets are conducted, and the results verify the effectiveness and efficiency of our approach.
【Keywords】: data mining; pattern clustering; CLASP; agglomerative clustering; arbitrary shaped clusters mining; cluster structures; clustering performance; computational cost; data mining; data set size; k-nearest neighbors-based similarity metric; large data sets; real data sets; shape information; synthetic data sets; time complexity; user-defined shrinking ratios; Algorithm design and analysis; Clustering algorithms; Data mining; Educational institutions; Eigenvalues and eigenfunctions; Shape; Symmetric matrices
【Paper Link】 【Pages】:40-51
【Authors】: Feng Li ; M. Tamer Özsu ; Gang Chen ; Beng Chin Ooi
【Abstract】: It is widely recognized that OLTP and OLAP queries have different data access patterns, processing needs, and requirements. Hence, OLTP queries and OLAP queries are typically handled by two different systems, and the data are periodically extracted from the OLTP system, transformed, and loaded into the OLAP system for data analysis. As enterprises become aware that big data can provide useful insights from vast amounts of data, effective and timely decisions derived from real-time analytics grow important. It is therefore desirable to provide real-time OLAP querying support, where OLAP queries read the latest data while OLTP queries create new versions. In this paper, we propose R-Store, a scalable distributed system that supports real-time OLAP by extending the MapReduce framework. We extend an open-source distributed key/value system, HBase, as the underlying storage system that stores the data cube and real-time data. When real-time data are updated, they are streamed to a streaming MapReduce system, named Hstreaming, which updates the cube on an incremental basis. Based on the metadata stored in the storage system, the data cube, the OLTP database, or both are used by the MapReduce jobs for OLAP queries. We propose techniques to efficiently scan the real-time data in the storage system, and design an adaptive algorithm to process real-time queries based on our proposed cost model. The main objectives are to ensure the freshness of answers and low processing latency. Experiments conducted on the TPC-H data set demonstrate the effectiveness and efficiency of our approach.
【Keywords】: Big Data; data analysis; data mining; distributed processing; meta data; public domain software; query processing; storage management; HBase; Hstreaming; MapReduce framework; OLAP queries; OLTP database; OLTP queries; R-Store; TPC-H data set; adaptive algorithm; big data; data access patterns; data analysis; data cube storage; metadata; open source distributed key-value system; real-time analytics; real-time data storage; scalable distributed system; storage system; Compaction; Computer architecture; Data models; Distributed databases; Educational institutions; Maintenance engineering; Real-time systems
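The incremental cube maintenance described in the abstract above can be illustrated with a minimal in-memory sketch (this is not R-Store's implementation; the dimension names, the sum-only measure, and the dict-backed cells are all invented for illustration):

```python
# Toy sketch: maintain a sum cube incrementally as OLTP-style updates
# stream in, so OLAP reads see fresh aggregates without a full
# recompute. Every update touches all aggregation granularities.
from collections import defaultdict
from itertools import combinations

def group_keys(dims):
    # yield a cell key for every cuboid (subset of dimensions)
    names = sorted(dims)
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            yield tuple((n, dims[n]) for n in subset)

class SumCube:
    def __init__(self):
        self.cells = defaultdict(float)

    def apply(self, dims, delta):
        for key in group_keys(dims):
            self.cells[key] += delta

    def query(self, **dims):
        return self.cells.get(tuple(sorted(dims.items())), 0.0)

cube = SumCube()
cube.apply({"region": "EU", "product": "p1"}, 10.0)
cube.apply({"region": "EU", "product": "p2"}, 5.0)
print(cube.query(region="EU"))  # 15.0
```

In a real system each `apply` would be a streamed delta and each cuboid a distributed table; the sketch only shows why incremental deltas keep OLAP answers fresh.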
【Paper Link】 【Pages】:52-63
【Authors】: Peter Alvaro ; Neil Conway ; Joseph M. Hellerstein ; David Maier
【Abstract】: Distributed consistency is perhaps the most discussed topic in distributed systems today. Coordination protocols can ensure consistency, but in practice they cause undesirable performance unless used judiciously. Scalable distributed architectures avoid coordination whenever possible, but undercoordinated systems can exhibit behavioral anomalies under fault, which are often extremely difficult to debug. This raises significant challenges for distributed system architects and developers. In this paper we present BLAZES, a cross-platform program analysis framework that (a) identifies program locations that require coordination to ensure consistent executions, and (b) automatically synthesizes application-specific coordination code that can significantly outperform general-purpose techniques. We present two case studies, one using annotated programs in the Twitter Storm system, and another using the Bloom declarative language.
【Keywords】: distributed processing; program diagnostics; BLAZES program; Bloom declarative language; Twitter Storm system; annotated programs; application-specific coordination code synthesis; coordination protocols; cross-platform program analysis framework; distributed consistency; distributed programs; distributed systems; program location identification; scalable distributed architectures; Fault tolerance; Fault tolerant systems; Semantics; Servers; Storms; Topology; Twitter
【Paper Link】 【Pages】:64-75
【Authors】: Jiewen Huang ; Kartik Venkatraman ; Daniel J. Abadi
【Abstract】: Greedy algorithms for subgraph pattern matching operations are often sufficient when the graph data set can be held in memory on a single machine. However, as graph data sets increasingly expand and require external storage and partitioning across a cluster of machines, more sophisticated query optimization techniques become critical to avoid explosions in query latency. In this paper, we introduce several query optimization techniques for distributed graph pattern matching. These techniques include (1) a System-R style dynamic programming-based optimization algorithm that considers both linear and bushy plans, (2) a cycle detection-based algorithm that leverages cycles to reduce intermediate result set sizes, and (3) a computation reusing technique that eliminates redundant query execution and data transfer over the network. Experimental results show that these algorithms can lead to an order of magnitude improvement in query performance.
【Keywords】: distributed processing; dynamic programming; pattern matching; query processing; cycle detection based algorithm; distributed graph pattern matching; dynamic programming; external storage; greedy algorithms; query latency; query optimization techniques; subgraph pattern matching operations; Data models; Distributed databases; Heuristic algorithms; Optimization; Partitioning algorithms; Pattern matching; Query processing
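To ground the System-R style dynamic programming mentioned in the abstract above, here is a minimal sketch that enumerates only left-deep join orders with a made-up cost model (the paper also considers bushy plans, and real cost estimation over partitioned graph data is far richer than this stand-in):

```python
# Hedged sketch of System-R style DP over join orders: for each subset
# of relations, keep the cheapest plan; extend subsets one relation at
# a time. The cost model below is an invented placeholder.
from itertools import combinations

def best_left_deep_plan(relations, card, join_cost):
    """relations: names; card: per-relation cardinality;
    join_cost(left_set, right_rel): assumed cost model."""
    best = {frozenset([r]): (card[r], (r,)) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in combinations(relations, size):
            s = frozenset(subset)
            for r in s:
                rest = s - {r}
                if rest not in best:
                    continue
                cost = best[rest][0] + join_cost(rest, r)
                if s not in best or cost < best[s][0]:
                    best[s] = (cost, best[rest][1] + (r,))
    return best[frozenset(relations)]

card = {"A": 100, "B": 10, "C": 1000}
cost, order = best_left_deep_plan(
    list(card), card,
    lambda left, r: sum(card[x] for x in left) * card[r] // 100)
print(cost, order)
```

The DP table has one entry per subset, which is exactly why sharing computation across subplans beats greedy ordering when intermediate results vary widely in size.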
【Paper Link】 【Pages】:76-87
【Authors】: Lei Cao ; Di Yang ; Qingyang Wang ; Yanwei Yu ; Jiayuan Wang ; Elke A. Rundensteiner
【Abstract】: The discovery of distance-based outliers from huge volumes of streaming data is critical for modern applications ranging from credit card fraud detection to moving object monitoring. In this work, we propose the first general framework to handle the three major classes of distance-based outliers in streaming environments, including the traditional distance-threshold-based and the nearest-neighbor-based definitions. Our LEAP framework encompasses two general optimization principles applicable across all three outlier types. First, our “minimal probing” principle uses a lightweight probing operation to gather minimal yet sufficient evidence for outlier detection. This principle overturns the state-of-the-art methodology that requires routinely conducting expensive complete neighborhood searches to identify outliers. Second, our “lifespan-aware prioritization” principle leverages the temporal relationships among stream data points to prioritize the processing order among them during the probing process. Guided by these two principles, we design an outlier detection strategy which is proven to be optimal in the CPU cost needed to determine the outlier status of any data point during its entire life. Our comprehensive experimental studies, using both synthetic and real streaming data, demonstrate that our methods are 3 orders of magnitude faster than state-of-the-art methods for a rich diversity of scenarios tested, and yet scale to high-dimensional streaming data.
【Keywords】: data handling; pattern recognition; CPU costs; LEAP framework; credit card fraud detection; distance-based outliers; distance-threshold; general framework; general optimization principles; high dimensional streaming data; high-volume data streams; lifespan-aware prioritization principle; lightweight probing operation; minimal probing principle; moving object monitoring; nearest-neighbor-based definitions; neighborhood searches; probing process; scalable distance-based outlier detection; stream data points; streaming environments; Monitoring; Optimization
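The "minimal probing" idea from the abstract above can be conveyed with a simplified distance-threshold check (this is not LEAP itself; the window, radius, and neighbor count below are toy values): stop scanning the window as soon as enough neighbors are found, instead of completing a full range search.

```python
# Sketch of minimal probing for distance-threshold outliers: a point is
# an inlier once k neighbors within `radius` are seen, so probing can
# stop early rather than examining the whole window.
def is_outlier(p, window, radius, k):
    found = 0
    for q in window:
        if q is p:
            continue
        if sum((a - b) ** 2 for a, b in zip(p, q)) <= radius ** 2:
            found += 1
            if found >= k:  # minimal sufficient evidence gathered
                return False
    return True

window = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print([is_outlier(p, window, 0.5, 2) for p in window])
```

For dense regions the early exit fires after a handful of comparisons, which is the intuition behind replacing complete neighborhood searches with lightweight probes.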
【Paper Link】 【Pages】:88-99
【Authors】: Xuan Hong Dang ; Ira Assent ; Raymond T. Ng ; Arthur Zimek ; Erich Schubert
【Abstract】: We consider the problem of outlier detection and interpretation. While most existing studies focus on the first problem, we simultaneously address the equally important challenge of outlier interpretation. We propose an algorithm that uncovers outliers in subspaces of reduced dimensionality in which they are well discriminated from regular objects while at the same time retaining the natural local structure of the original data to ensure the quality of outlier explanation. Our algorithm takes a mathematically appealing approach from the spectral graph embedding theory and we show that it achieves the globally optimal solution for the objective of subspace learning. By using a number of real-world datasets, we demonstrate its appealing performance not only w.r.t. the outlier detection rate but also w.r.t. the discriminative human-interpretable features. This is the first approach to exploit discriminative features for both outlier detection and interpretation, leading to better understanding of how and why the hidden outliers are exceptional.
【Keywords】: data analysis; data mining; statistical distributions; discriminative human-interpretable features; outlier detection; outlier explanation quality; outlier interpretation; spectral graph embedding theory; Data mining; Eigenvalues and eigenfunctions; Face; Feature extraction; Linear programming; Sparse matrices; Vectors
【Paper Link】 【Pages】:100-111
【Authors】: Mourad Khayati ; Michael H. Böhlen ; Johann Gamper
【Abstract】: Real world applications that deal with time series data often rely on matrix decomposition techniques, such as the Singular Value Decomposition (SVD). The Centroid Decomposition (CD) approximates the Singular Value Decomposition, but does not scale to long time series because of the quadratic space complexity of the sign vector computation. In this paper, we propose a greedy algorithm, termed Scalable Sign Vector (SSV), to efficiently determine sign vectors for CD applications with long time series, i.e., where the number of rows (observations) is much larger than the number of columns (time series). The SSV algorithm starts with a sign vector consisting of only 1s and iteratively changes the sign of the element that maximizes the benefit. The space complexity of the SSV algorithm is linear in the length of the time series. We provide proofs for the scalability, the termination and the correctness of the SSV algorithm. Experiments with real world hydrological time series and data sets from the UCR repository validate the analytical results and show the scalability of SSV.
【Keywords】: approximation theory; data handling; greedy algorithms; mathematics computing; singular value decomposition; time series; vectors; CD; SSV algorithm; SVD; UCR repository; approximation; greedy algorithm; hydrological time series; matrix decomposition techniques; memory-efficient centroid decomposition; quadratic space complexity; scalable sign vector; sign vector computation; singular value decomposition; time series data; Complexity theory; Loading; Matrix decomposition; Runtime; Singular value decomposition; Time series analysis; Vectors
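The greedy sign-vector search described in the abstract above can be sketched in a few lines. This is an illustrative reading, not the paper's exact SSV algorithm: start from the all-ones vector z, repeatedly flip the sign whose flip most increases ||X^T z||^2, and maintain v = X^T z incrementally so that space stays linear in the series length.

```python
# Illustrative greedy sign-vector search in the spirit of SSV (a sketch
# under the update rule: flipping z_i changes ||v||^2 by
# 4 * (||x_i||^2 - z_i * (x_i . v)), where x_i is row i of X).
def greedy_sign_vector(X):
    n, m = len(X), len(X[0])
    z = [1.0] * n
    v = [sum(X[i][j] for i in range(n)) for j in range(m)]  # X^T z
    while True:
        best_i, best_gain = None, 0.0
        for i in range(n):
            dot = sum(X[i][j] * v[j] for j in range(m))
            norm2 = sum(x * x for x in X[i])
            gain = 4.0 * (norm2 - z[i] * dot)
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:  # no flip improves ||v||^2: done
            return z
        for j in range(m):
            v[j] -= 2.0 * z[best_i] * X[best_i][j]
        z[best_i] = -z[best_i]

X = [[1.0, 0.0], [-1.0, 0.1], [1.0, 0.0]]
print(greedy_sign_vector(X))
```

Each flip strictly increases ||v||^2 over a finite sign space, so the loop terminates; only v (length m) and z (length n) are stored, matching the linear-space claim.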
【Paper Link】 【Pages】:112-123
【Authors】: Afroza Sultana ; Naeemul Hassan ; Chengkai Li ; Jun Yang ; Cong Yu
【Abstract】: We study the novel problem of finding new, prominent situational facts, which are emerging statements about objects that stand out within certain contexts. Many such facts are newsworthy, e.g., an athlete's outstanding performance in a game, or a viral video's impressive popularity. Effective and efficient identification of these facts assists journalists in reporting, one of the main goals of computational journalism. Technically, we consider an ever-growing table of objects with dimension and measure attributes. A situational fact is a “contextual” skyline tuple that stands out against historical tuples in a context, specified by a conjunctive constraint involving dimension attributes, when a set of measure attributes are compared. New tuples are constantly added to the table, reflecting events happening in the real world. Our goal is to discover constraint-measure pairs that qualify a new tuple as a contextual skyline tuple, and to discover them quickly before the event becomes yesterday's news. A brute-force approach requires exhaustive comparison with every tuple, under every constraint, and in every measure subspace. We design algorithms in response to these challenges using three corresponding ideas: tuple reduction, constraint pruning, and sharing computation across measure subspaces. We also adopt a simple prominence measure to rank the discovered facts when they are numerous. Experiments over two real datasets validate the effectiveness and efficiency of our techniques.
【Keywords】: data mining; information retrieval; learning (artificial intelligence); brute-force approach; computational journalism; conjunctive constraint; constraint pruning; constraint-measure pairs; contextual skyline tuple; dimension attributes; fact identification; historical tuple; incremental discovery; measure attributes; prominent situational facts; sharing computation; tuple reduction; Algorithm design and analysis; Context; Databases; Extraterrestrial measurements; Games; Lattices; Media
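The skyline test at the core of the abstract above is easy to state concretely. A minimal sketch (the attribute names and "bigger is better" orientation are invented for illustration; the paper's algorithms avoid this brute-force scan):

```python
# A new tuple is a contextual skyline tuple if no historical tuple in
# the same context dominates it on the chosen measure subspace.
def dominates(a, b, measures):
    """a dominates b: >= on every measure, > on at least one."""
    return all(a[m] >= b[m] for m in measures) and \
           any(a[m] > b[m] for m in measures)

def is_skyline(new, history, context, measures):
    peers = [t for t in history
             if all(t[k] == v for k, v in context.items())]
    return not any(dominates(t, new, measures) for t in peers)

history = [
    {"team": "X", "points": 30, "rebounds": 12},
    {"team": "X", "points": 41, "rebounds": 8},
]
new = {"team": "X", "points": 35, "rebounds": 11}
print(is_skyline(new, history, {"team": "X"}, ["points", "rebounds"]))
```

The combinatorics the paper tackles come from running this test under every conjunctive constraint and every measure subspace, which is why tuple reduction and constraint pruning matter.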
【Paper Link】 【Pages】:124-135
【Authors】: Odysseas Papapetrou ; Minos N. Garofalakis
【Abstract】: Distributed skyline computation is important for a wide range of application domains, from distributed and web-based systems to ISP-network monitoring and distributed databases. The problem is particularly challenging in dynamic distributed settings, where the goal is to efficiently monitor a continuous skyline query over a collection of distributed streams. All existing work relies on the assumption of a single point of reference for object attributes/dimensions, i.e., objects may be vertically or horizontally partitioned, but the accurate value of each dimension for each object is always maintained by a single site. This assumption is unrealistic for several distributed monitoring applications, where object information is fragmented over a set of distributed streams (each monitored by a different site) and needs to be aggregated (e.g., averaged) across several sites. Furthermore, it is frequently useful to define skyline dimensions through complex functions over the aggregated objects, which raises further challenges for dealing with object fragmentation. In this paper, we present the first known distributed approach for continuous fragmented skylines, namely distributed monitoring of skylines over complex functions of fragmented multi-dimensional objects. We also propose several optimizations, including a new technique based on random-walk models for adaptively determining the most efficient monitoring strategy for each object. A thorough experimental study with synthetic and real-life data sets verifies the effectiveness of our approach, demonstrating order-of-magnitude improvements in communication costs compared to the only available centralized solution.
【Keywords】: distributed processing; query processing; continuous fragmented skylines; continuous skyline query monitoring; distributed skyline computation; distributed streams; fragmented multidimensional objects; object attributes; object dimensions; object fragmentation; order-of-magnitude improvements; random-walk models; skyline dimensions; skyline distributed monitoring; Aggregates; Distributed databases; IP networks; Maintenance engineering; Monitoring; Optimization; Vectors
【Paper Link】 【Pages】:136-147
【Authors】: Bin Yang ; Chenjuan Guo ; Christian S. Jensen ; Manohar Kaul ; Shuo Shang
【Abstract】: Different uses of a road network call for the consideration of different travel costs: in route planning, travel time and distance are typically considered, and greenhouse gas (GHG) emissions are increasingly being considered. Further, travel costs such as travel time and GHG emissions are time-dependent and uncertain. To support such uses, we propose techniques that enable the construction of a multi-cost, time-dependent, uncertain graph (MTUG) model of a road network based on GPS data from vehicles that traversed the road network. Based on the MTUG, we define stochastic skyline routes that consider multiple costs and time-dependent uncertainty, and we propose efficient algorithms to retrieve stochastic skyline routes for a given source-destination pair and a start time. Empirical studies with three road networks in Denmark and a substantial GPS data set offer insight into the design properties of the MTUG and the efficiency of the stochastic skyline routing algorithms.
【Keywords】: Global Positioning System; data handling; directed graphs; road traffic; traffic engineering computing; Denmark; GHG emissions; GPS data; Global Positioning Systems; MTUG model; greenhouse gas emission; multicost time-dependent uncertain graph model; road network; source-destination pair; stochastic skyline route planning; time-varying uncertainty; travel costs; travel distance; travel time; Context; Global Positioning System; Random variables; Roads; Routing; Stochastic processes; Uncertainty
【Paper Link】 【Pages】:148-159
【Authors】: Anders Skovsgaard ; Darius Sidlauskas ; Christian S. Jensen
【Abstract】: With the rapidly increasing deployment of Internet-connected, location-aware mobile devices, very large and increasing amounts of geo-tagged and timestamped user-generated content, such as microblog posts, are being generated. We present indexing, update, and query processing techniques that are capable of providing the top-k terms seen in posts in a user-specified spatio-temporal range. The techniques enable interactive response times in the millisecond range in a realistic setting where the arrival rate of posts exceeds today's average tweet arrival rate by a factor of 4-10. The techniques adaptively maintain the most frequent items at various spatial and temporal granularities. They extend existing frequent item counting techniques to maintain exact counts rather than approximations. An extensive empirical study with a large collection of geo-tagged tweets shows that the proposed techniques enable online aggregation and query processing at scale in realistic settings.
【Keywords】: Internet; indexing; mobile computing; query processing; social networking (online); Internet-connected device; frequent item counting techniques; geo-tagged content; geo-tagged tweets; indexing technique; interactive response times; location-aware mobile devices; online aggregation; query processing techniques; scalable top-k spatio-temporal term querying; spatial granularities; temporal granularities; timestamped user-generated content; update technique; user-specified spatio-temporal range; Aggregates; Heuristic algorithms; Indexing; Merging; Monitoring; Radiation detectors
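The exact-count, range-queryable term index described in the abstract above can be illustrated with a toy grid sketch (the cell size, term data, and rectangular-range semantics below are invented; the paper's index additionally handles temporal granularities and adaptive maintenance):

```python
# Toy sketch: keep exact term counts per spatial grid cell and answer
# a top-k terms query over a rectangular range by merging cell
# counters, rather than approximating frequent items.
from collections import Counter
import heapq

class GridTermIndex:
    def __init__(self, cell=1.0):
        self.cell = cell
        self.cells = {}

    def add(self, x, y, terms):
        key = (int(x // self.cell), int(y // self.cell))
        self.cells.setdefault(key, Counter()).update(terms)

    def top_k(self, x1, y1, x2, y2, k):
        total = Counter()
        for (cx, cy), counts in self.cells.items():
            if x1 <= cx * self.cell < x2 and y1 <= cy * self.cell < y2:
                total += counts
        return heapq.nlargest(k, total.items(), key=lambda kv: kv[1])

idx = GridTermIndex()
idx.add(0.2, 0.3, ["coffee", "rain"])
idx.add(0.8, 0.1, ["coffee"])
idx.add(5.0, 5.0, ["sun"])
print(idx.top_k(0, 0, 2, 2, 1))
```

Merging per-cell counters at query time is what lets a single index serve arbitrary user-specified spatial ranges with exact counts.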
【Paper Link】 【Pages】:172-183
【Authors】: Amr Magdy ; Mohamed F. Mokbel ; Sameh Elnikety ; Suman Nath ; Yuxiong He
【Abstract】: This paper presents Mercury, a system for real-time support of top-k spatio-temporal queries on microblogs, where users are able to browse recent microblogs near their locations. With high arrival rates of microblogs, Mercury ensures real-time query response within a tight memory-constrained environment. Mercury bounds its search space to include only those microblogs that have arrived within certain spatial and temporal boundaries, in which only the top-k microblogs, according to a spatio-temporal ranking function, are returned in the search results. Mercury employs: (a) a scalable dynamic in-memory index structure that is capable of digesting all incoming microblogs, (b) an efficient query processor that exploits the in-memory index through spatio-temporal pruning techniques that reduce the number of visited microblogs to return the final answer, (c) an index size tuning module that dynamically finds and adjusts the minimum index size to ensure that incoming queries will be answered accurately, and (d) a load shedding technique that trades a slight decrease in query accuracy for significant storage savings. Extensive experimental results based on a real-time Twitter Firehose feed and actual locations of Bing search queries show that Mercury supports high arrival rates of up to 64K microblogs/second and an average query latency of 4 msec.
【Keywords】: Internet; Web sites; query processing; real-time systems; Bing search queries; Mercury; in-memory index structure; load shedding technique; memory-constrained spatiotemporal real-time search; microblog browsing; query processor; real-time Twitter Firehose feed; real-time query response; real-time support system; top-k microblogs; Accuracy; Indexing; Keyword search; Memory management; Real-time systems; Twitter
【Paper Link】 【Pages】:184-195
【Authors】: Wenfei Fan ; Xin Wang ; Yinghui Wu
【Abstract】: Answering queries using views has proven an effective technique for querying relational and semistructured data. This paper investigates this issue for graph pattern queries based on (bounded) simulation, which have been increasingly used in, e.g., social network analysis. We propose a notion of pattern containment to characterize graph pattern matching using graph pattern views. We show that a graph pattern query can be answered using a set of views if and only if the query is contained in the views. Based on this characterization we develop efficient algorithms to answer graph pattern queries. In addition, we identify three problems associated with graph pattern containment. We show that these problems range from quadratic-time to NP-complete, and provide efficient algorithms for containment checking (approximation when the problem is intractable). Using real-life data and synthetic data, we experimentally verify that these methods are able to efficiently answer graph pattern queries on large social graphs, by using views.
【Keywords】: computational complexity; graph theory; pattern matching; query processing; containment checking; graph pattern matching; graph pattern query answering; graph pattern views; large social graphs; pattern containment; quadratic-time to NP-complete; relational data querying; semistructured data querying; Approximation algorithms; Approximation methods; Artificial intelligence; Complexity theory; Pattern matching; Social network services; XML
【Paper Link】 【Pages】:196-207
【Authors】: Paul W. Olsen ; Alan G. Labouseur ; Jeong-Hyon Hwang
【Abstract】: Many of today's applications can benefit from the discovery of the most central entities in real-world networks. This paper presents a new technique that efficiently finds the k most central entities in terms of closeness centrality. Instead of computing the centrality of each entity independently, our technique shares intermediate results between centrality computations. Since the cost of each centrality computation may vary substantially depending on the choice of the previous computation, our technique schedules centrality computations in a manner that minimizes the estimated completion time. This technique also updates, with negligible overhead, an upper bound on the centrality of every entity. Using this information, our technique proactively skips entities that cannot belong to the final answer. This paper presents evaluation results for actual networks to demonstrate the benefits of our technique.
【Keywords】: information retrieval; network theory (graphs); scheduling; centrality computations; real-world networks; top-k closeness centrality search; upper bound; Approximation methods; Educational institutions; Equations; Heuristic algorithms; Measurement; Schedules; Upper bound
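For context on the abstract above, here is the exact baseline the paper improves on: one BFS per node to compute closeness centrality, then a top-k selection. This sketch runs on a tiny invented unweighted graph and omits the paper's contributions (shared intermediate results, scheduling, and upper-bound skipping):

```python
# Baseline: exact top-k closeness centrality via one BFS per node on
# an unweighted, connected toy graph. Closeness of s is
# (reachable - 1) / sum of shortest-path distances from s.
from collections import deque
import heapq

def closeness(adj, s):
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return (len(dist) - 1) / sum(dist.values()) if len(dist) > 1 else 0.0

def top_k_closeness(adj, k):
    return heapq.nlargest(k, ((closeness(adj, s), s) for s in adj))

adj = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"], "d": ["b"]}
print(top_k_closeness(adj, 1))
```

The quadratic-in-nodes BFS cost of this baseline is what makes sharing work between centrality computations and skipping provably non-top-k nodes worthwhile.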
【Paper Link】 【Pages】:208-219
【Authors】: Zhiwei Zhang ; Lu Qin ; Jeffrey Xu Yu
【Abstract】: As an important branch of big data processing, big graph processing is becoming increasingly popular in recent years. Strongly connected component (SCC) computation is a fundamental graph operation on directed graphs, where an SCC is a maximal subgraph S of a directed graph G in which every pair of nodes is reachable from each other in S. By contracting each SCC into a node, a large general directed graph can be represented by a small directed acyclic graph (DAG). In the literature, there are I/O efficient semi-external algorithms to compute all SCCs of a graph G, by assuming that all nodes of a graph G can fit in the main memory. However, many real graphs are large and even the nodes cannot reside entirely in the main memory. In this paper, we study new I/O efficient external algorithms to find all SCCs for a directed graph G whose nodes cannot fit entirely in the main memory. To overcome the deficiency of the existing external graph contraction based approach that usually cannot stop in finite iterations, and the external DFS based approach that will generate a large number of random I/Os, we explore a new contraction-expansion based approach. In the graph contraction phase, instead of contracting the whole graph as the contraction based approach, we only contract the nodes of a graph, which are much more selective. The contraction phase stops when all nodes of the graph can fit in the main memory, such that the semi-external algorithm can be used in SCC computation. In the graph expansion phase, as the graph is expanded in the reverse order as it is contracted, the SCCs of all nodes in the graph are computed. Both graph contraction phase and graph expansion phase use only I/O efficient sequential scans and external sorts of nodes/edges in the graph. Our algorithm leverages the efficiency of the semi-external SCC computation algorithm and usually stops in a small number of iterations. 
We further optimize our approach by reducing the number of nodes and edges of the contracted graph in each iteration. We conduct extensive experimental studies using both real and synthetic web-scale graphs to confirm the I/O efficiency of our approaches.
【Keywords】: Big Data; directed graphs; DAG; I-O efficient SCC computing; I-O efficient semiexternal algorithm; I-O efficient sequential scans; Web-scale graph; big data processing; big graph processing; contraction-expansion based approach; directed acyclic graph; external DFS based approach; external graph contraction based approach; general directed graph; graph contraction phase; graph expansion phase; strongly connected component computation; Algorithm design and analysis; Approximation algorithms; Computational modeling; Contracts; Memory management; Pattern matching; Social network services
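The semi-external phase that the abstract relies on ultimately reduces to an in-memory SCC computation once the contracted node set fits in memory. As a rough illustration only (this is a standard textbook routine, not the paper's contraction-expansion algorithm), Kosaraju's two-pass DFS computes SCCs of an in-memory directed graph:

```python
from collections import defaultdict

def kosaraju_scc(edges):
    """In-memory SCC computation via Kosaraju's two-pass DFS (iterative)."""
    graph, rgraph = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        graph[u].append(v)
        rgraph[v].append(u)
        nodes.update((u, v))

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    visited, order = set(), []
    for s in nodes:
        if s in visited:
            continue
        visited.add(s)
        stack = [(s, iter(graph[s]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph in reverse finishing order;
    # each tree discovered is one strongly connected component.
    assigned, sccs = set(), []
    for s in reversed(order):
        if s in assigned:
            continue
        assigned.add(s)
        comp, stack = [], [s]
        while stack:
            node = stack.pop()
            comp.append(node)
            for nxt in rgraph[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        sccs.append(comp)
    return sccs
```

Contracting each returned component into a single node yields the DAG representation the abstract mentions.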
【Paper Link】 【Pages】:220-231
【Authors】: Nguyen Quoc Viet Hung ; Nguyen Thanh Tam ; Zoltán Miklós ; Karl Aberer ; Avigdor Gal ; Matthias Weidlich
【Abstract】: Schema matching is the process of establishing correspondences between the attributes of database schemas for data integration purposes. Although several automatic schema matching tools have been developed, their results are often incomplete or erroneous. To obtain a correct set of correspondences, a human expert is usually required to validate the generated correspondences. We analyze this reconciliation process in a setting where a number of schemas need to be matched, in the presence of consistency expectations about the network of attribute correspondences. We develop a probabilistic model that helps to identify the most uncertain correspondences, thus allowing us to guide the expert's work and collect expert input about the most problematic cases. As the availability of such experts is often limited, we develop techniques that can construct a set of good quality correspondences with a high probability, even if the expert does not validate all the necessary correspondences. We demonstrate the efficiency of our techniques through extensive experimentation using real-world datasets.
【Keywords】: data integration; pattern matching; probability; consistency expectations; data integration; database schemas; pay-as-you-go reconciliation; probabilistic model; reconciliation process; schema matching networks; Approximation methods; Computational modeling; Data integration; Databases; Measurement uncertainty; Probabilistic logic; Uncertainty
【Paper Link】 【Pages】:232-243
【Authors】: Floris Geerts ; Giansalvatore Mecca ; Paolo Papotti ; Donatello Santoro
【Abstract】: We address the challenging and open problem of bringing together two crucial activities in data integration and data quality, i.e., transforming data using schema mappings, and fixing conflicts and inconsistencies using data repairing. This problem is made complex by several factors. First, schema mappings and data repairing have traditionally been considered as separate activities, and research has progressed in a largely independent way in the two fields. Second, the elegant formalizations and the algorithms that have been proposed for both tasks have had mixed fortune in scaling to large databases. In the paper, we introduce a very general notion of a mapping and cleaning scenario that incorporates a wide variety of features, such as user interventions. We develop a new semantics for these scenarios that represents a conservative extension of previous semantics for schema mappings and data repairing. Based on the semantics, we introduce a chase-based algorithm to compute solutions. Appropriate care is devoted to developing a scalable implementation of the chase algorithm. To the best of our knowledge, this is the first general and scalable proposal in this direction.
【Keywords】: data handling; chase-based algorithm; cleaning scenario; data cleaning; data integration; data quality; data repairing; data transformation; mapping notion; schema mappings; user interventions; Cleaning; Databases; Hospitals; Maintenance engineering; Semantics; Standards
【Paper Link】 【Pages】:244-255
【Authors】: Maksims Volkovs ; Fei Chiang ; Jaroslaw Szlichta ; Renée J. Miller
【Abstract】: In declarative data cleaning, data semantics are encoded as constraints and errors arise when the data violates the constraints. Various forms of statistical and logical inference can be used to reason about and repair inconsistencies (errors) in data. Recently, unified approaches that repair both errors in data and errors in semantics (the constraints) have been proposed. However, both data-only approaches and unified approaches are by and large static in that they apply cleaning to a single snapshot of the data and constraints. We introduce a continuous data cleaning framework that can be applied to dynamic data and constraint environments. Our approach permits both the data and its semantics to evolve and suggests repairs based on the accumulated evidence to date. Importantly, our approach uses not only the data and constraints as evidence, but also considers the past repairs chosen and applied by a user (user repair preferences). We introduce a repair classifier that predicts the type of repair needed to resolve an inconsistency, and that learns from past user repair preferences to recommend more accurate repairs in the future. Our evaluation shows that our techniques achieve high prediction accuracy and generate high quality repairs. Of independent interest, our work makes use of a set of data statistics that are shown to be sensitive to predicting particular repair types.
【Keywords】: data handling; pattern classification; statistical analysis; constraint environments; continuous data cleaning framework; data repair inconsistencies; data semantics; data statistics; data-only approach; declarative data cleaning; logical inference; repair classifier; statistical inference; user repair preferences; Accuracy; Cleaning; Databases; Maintenance engineering; Remuneration; Semantics
【Paper Link】 【Pages】:256-267
【Authors】: Peng Wang ; Chinya V. Ravishankar
【Abstract】: Text-based search queries reveal user intent to the search engine, compromising privacy. Topical Intent Obfuscation (TIO) is a promising new approach to preserving user privacy. TIO masks topical intent by mixing real user queries with dummy queries matching various different topics. Dummy queries are generated using a Dummy Query Generation Algorithm (DGA). We demonstrate various shortcomings in current TIO schemes, and show how to correct them. Current schemes assume that DGA details are unknown to the adversary. We argue that this is a flawed assumption, and show how DGA details can be used to construct efficient attacks on TIO schemes, using an iterative DGA as an example. Our extensive experiments on real data sets show that our attacks can flag up to 80% of dummy queries. We also propose HDGA, a new DGA that we prove to be immune to the attacks based on DGA semantics that we describe.
【Keywords】: data privacy; iterative methods; query processing; search engines; DGA semantics; HDGA; TIO schemes; dummy queries; dummy query generation algorithm; iterative DGA; keyword search; real user queries; search engine; text-based search queries; topical intent masking; topical intent obfuscation; user privacy preservation; Engines
【Paper Link】 【Pages】:268-279
【Authors】: Jianxin Li ; Chengfei Liu ; Md. Saiful Islam
【Abstract】: Recent years have witnessed an unprecedented proliferation of social media, e.g., millions of blog posts, micro-blog posts, and social networks on the Internet. This kind of social media data can be modeled as a large graph where nodes represent the entities and edges represent relationships between entities of the social media. Discovering keyword-based correlated networks in these large graphs is an important primitive in data analysis, helping users focus on the information they care about in the large graph. In this paper, we propose and define the problem of keyword-based correlated network computation over a massive graph. To do this, we first present a novel tree data structure that only maintains the shortest path between any two graph nodes, by which the massive graph can be equivalently transformed into a tree data structure for addressing our proposed problem. After that, we design efficient algorithms to build the transformed tree data structure from a graph offline and compute the γ-bounded keyword matched subgraphs based on the pre-built tree data structure on the fly. To further improve the efficiency, we propose weighted shingle-based approximation approaches to measure the correlation among a large number of γ-bounded keyword matched subgraphs. Finally, we develop a merge-sort based approach to efficiently generate the correlated networks. Our extensive experiments demonstrate the efficiency of our algorithms in reducing time and space costs. The experimental results also justify the effectiveness of our method in discovering correlated networks from three real datasets.
【Keywords】: Internet; data analysis; graph theory; merging; social networking (online); sorting; tree data structures; γ-bounded keyword matched subgraphs; Internet; data analysis; graph nodes shortest path; keyword-based correlated network computation; large social media; massive graph; merge-sort based approach; social networks; tree data structure; weighted shingle-based approximation approach; Blogs; Correlation; Keyword search; Measurement; Media; Social network services; Tree data structures; Correlated Networks; Keyword Query; Large Graph; Social Media
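The shingle-based approximation mentioned in the abstract estimates correlation between many matched subgraphs without exact pairwise comparison. An unweighted MinHash sketch of the Jaccard similarity between two node sets — a simplification of the paper's weighted variant, shown here purely to convey the idea — looks like this:

```python
import hashlib

def minhash_signature(items, num_hashes=64):
    """One minimum hash value per seeded hash function; similar sets
    tend to agree on many of these minima."""
    return [min(int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)
                for x in items)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature positions is an unbiased
    estimator of the Jaccard similarity of the underlying sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Comparing fixed-length signatures instead of full node sets is what makes correlating a large number of subgraphs tractable.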
【Paper Link】 【Pages】:280-291
【Authors】: Zhongyuan Wang ; Haixun Wang ; Zhirui Hu
【Abstract】: Head and modifier detection is an important problem for applications that handle short texts such as search queries, ads keywords, titles, captions, etc. In many cases, short texts such as search queries do not follow grammar rules, and existing approaches for head and modifier detection are coarse-grained, domain specific, and/or require labeling of large amounts of training data. In this paper, we introduce a semantic approach for head and modifier detection. We first obtain a large number of instance level head-modifier pairs from search log. Then, we develop a conceptualization mechanism to generalize the instance level pairs to concept level. Finally, we derive weighted concept patterns that are concise, accurate, and have strong generalization power in head and modifier detection. Furthermore, we identify a subset of modifiers that we call constraints. Constraints are usually specific and not negligible as far as the intent of the short text is concerned, while non-constraint modifiers are more subjective. The mechanism we developed has been used in production for search relevance and ads matching. We use extensive experiment results to demonstrate the effectiveness of our approach.
【Keywords】: data mining; generalisation (artificial intelligence); query processing; text analysis; ads matching; conceptualization mechanism; constraint detection; generalization power; grammar rules; head detection; instance level head-modifier pairs; modifier detection; search log; search queries; search relevance; semantic approach; short text handling; weighted concept patterns; Companies; Educational institutions; Grammar; Head; Magnetic heads; Pattern matching; Taxonomy
【Paper Link】 【Pages】:292-303
【Authors】: Sungsu Lim ; Seungwoo Ryu ; Sejeong Kwon ; Kyomin Jung ; Jae-Gil Lee
【Abstract】: In this paper, for overlapping community detection, we propose a novel framework of the link-space transformation that transforms a given original graph into a link-space graph. Its unique idea is to consider topological structure and link similarity separately using two distinct types of graphs: the line graph and the original graph. For topological structure, each link of the original graph is mapped to a node of the link-space graph, which enables us to discover overlapping communities using non-overlapping community detection algorithms as in the line graph. For link similarity, it is calculated on the original graph and carried over into the link-space graph, which enables us to keep the original structure on the transformed graph. Thus, our transformation, by combining these two advantages, facilitates overlapping community detection as well as improves the resulting quality. Based on this framework, we develop the algorithm LinkSCAN that performs structural clustering on the link-space graph. Moreover, we propose the algorithm LinkSCAN* that enhances the efficiency of LinkSCAN by sampling. Extensive experiments were conducted using the LFR benchmark networks as well as some real-world networks. The results show that our algorithms achieve higher accuracy, quality, and coverage than the state-of-the-art algorithms.
【Keywords】: Web sites; graph theory; pattern clustering; LFR benchmark networks; LinkSCAN* algorithm; line graph; link similarity; link-space graph; link-space transformation; original graph; overlapping community detection; real-world networks; sampling; structural clustering; topological structure; Clustering algorithms; Communities; Detection algorithms; Educational institutions; Kernel; Lifting equipment; Social network services
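The link-space graph builds on the classical line-graph transformation: each link becomes a node, and two link-nodes are connected when the links share an endpoint. The paper additionally weights these connections by link similarity computed on the original graph; this minimal sketch omits the weights and shows only the structural mapping:

```python
from itertools import combinations

def line_graph(edges):
    """Map each undirected link to a node; connect link-nodes whose
    links share an endpoint in the original graph."""
    incident = {}
    for e in edges:
        u, v = sorted(e)
        for node in (u, v):
            incident.setdefault(node, []).append((u, v))
    lg_edges = set()
    for links in incident.values():
        for a, b in combinations(links, 2):
            lg_edges.add((min(a, b), max(a, b)))  # canonical pair order
    return lg_edges
```

Running a non-overlapping clustering (such as structural clustering, as LinkSCAN does) on the link-nodes then yields overlapping node communities, since a node inherits the community of every link incident to it.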
【Paper Link】 【Pages】:304-315
【Authors】: Weiren Yu ; Xuemin Lin ; Wenjie Zhang
【Abstract】: SimRank is an arresting measure of node-pair similarity based on hyperlinks. It iteratively follows the concept that two nodes are similar if they are referenced by similar nodes. Real graphs are often large, and links constantly evolve with small changes over time. This paper considers fast incremental computations of SimRank on link-evolving graphs. The prior approach [12] to this issue factorizes the graph via a singular value decomposition (SVD) first, and then incrementally maintains this factorization for link updates at the expense of exactness. Consequently, all node-pair similarities are estimated in O(r⁴n²) time on a graph of n nodes, where r is the target rank of the low-rank approximation, which is not negligibly small in practice. In this paper, we propose a novel fast incremental paradigm. (1) We characterize the SimRank update matrix ΔS, in response to every link update, via a rank-one Sylvester matrix equation. By virtue of this, we devise a fast incremental algorithm computing similarities of n² node-pairs in O(Kn²) time for K iterations. (2) We also propose an effective pruning technique capturing the “affected areas” of ΔS to skip unnecessary computations, without loss of exactness. This can further accelerate the incremental SimRank computation to O(K(nd+|AFF|)) time, where d is the average in-degree of the old graph, and |AFF| (≤ n²) is the size of the “affected areas” in ΔS; in practice, |AFF| ≪ n². Our empirical evaluations verify that our algorithm (a) outperforms the best known link-update algorithm [12], and (b) runs much faster than its batch counterpart when link updates are small.
【Keywords】: approximation theory; computational complexity; graph theory; information retrieval; learning (artificial intelligence); singular value decomposition; O(Kn²) time; O(r⁴n²) time estimation; SVD; SimRank update matrix; fast incremental SimRank; fast incremental algorithm; fast incremental paradigm; graph factorization; hyperlinks; link updates; link-evolving graphs; link-update algorithm; low-rank approximation; node-pair similarities; node-pair similarity measure; pruning technique; rank-one Sylvester matrix equation; singular value decomposition; Accuracy; Approximation methods; Equations; Heuristic algorithms; Matrix converters; Matrix decomposition; Vectors
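For reference, the batch computation that incremental methods compete against iterates S ← C·WᵀSW with a unit diagonal, where W is the column-normalized adjacency matrix. A dense NumPy sketch of this baseline (not the paper's ΔS/Sylvester-equation algorithm, whose point is precisely to avoid recomputing S from scratch per link update):

```python
import numpy as np

def simrank(adj, C=0.6, K=10):
    """Naive batch SimRank: iterate S <- C * W^T S W, then reset the
    diagonal to 1. adj[i][j] = 1 iff there is an edge i -> j."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    indeg = A.sum(axis=0)
    # Column-normalize; columns of in-degree-0 nodes stay all-zero.
    W = np.divide(A, indeg, out=np.zeros_like(A), where=indeg > 0)
    S = np.eye(n)
    for _ in range(K):
        S = C * (W.T @ S @ W)
        np.fill_diagonal(S, 1.0)
    return S
```

Each dense iteration costs O(n³), which is why maintaining only the update matrix ΔS per link change pays off on large evolving graphs.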
【Paper Link】 【Pages】:316-327
【Authors】: Samamon Khemmarat ; Lixin Gao
【Abstract】: The task of obtaining the items highly-relevant to a given set of query items is a basis for various applications, such as recommendation and prediction. A family of path-based relevance metrics, which quantify item relevance based on the paths in a given item graph, have been shown to be effective in capturing the relevance in many applications. Despite their effectiveness, path-based relevance normally requires time-consuming iterative computation. We propose an approach to obtain the top-k most relevant items for a given query item set quickly. Our approach can obtain the top-k items without having to compute converged scores. The approach is designed for a distributed environment, which makes it scale for massive graphs having hundreds of millions of nodes. Our experimental results show that the proposed approach can produce the result 20 to 50 times faster than a previously proposed approach and can scale well with both the size of input and the number of machines used in the computation.
【Keywords】: graph theory; query processing; fast top-k path-based relevance query; item relevance; massive graphs; path-based relevance metrics; time-consuming iterative computation; Adsorption; Equations; Joining processes; Mathematical model; Measurement; Upper bound; Vectors
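Personalized PageRank is a representative member of the path-based relevance family the abstract refers to: an item's score aggregates the damped mass reaching it over paths from the query set. A plain power-iteration sketch follows — without the paper's early-termination bounds or distribution, and with illustrative names throughout:

```python
def ppr_topk(graph, seeds, k, alpha=0.15, iters=30):
    """graph: node -> list of out-neighbors. Propagates mass from the
    seed (query) set with restart probability alpha, then returns the
    k highest-scoring non-seed nodes."""
    restart = {s: 1.0 / len(seeds) for s in seeds}
    score = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart.get(n, 0.0) for n in graph}
        for u, outs in graph.items():
            if not outs or u not in score:
                continue
            share = (1 - alpha) * score[u] / len(outs)
            for v in outs:
                nxt[v] = nxt.get(v, alpha * restart.get(v, 0.0)) + share
        score = nxt
    ranked = sorted(score, key=score.get, reverse=True)
    return [n for n in ranked if n not in seeds][:k]
```

Running the iteration to convergence is exactly the cost the paper avoids: it bounds scores so the top-k order can be fixed after far fewer propagation steps.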
【Paper Link】 【Pages】:328-339
【Authors】: Inci Cetindil ; Jamshid Esmaelnezhad ; Taewoo Kim ; Chen Li
【Abstract】: Instant search is an emerging information-retrieval paradigm in which a system finds answers to a query instantly while a user types in keywords character-by-character. Fuzzy search further improves user search experiences by finding relevant answers with keywords similar to query keywords. A main computational challenge in this paradigm is the high-speed requirement, i.e., each query needs to be answered within milliseconds to achieve an instant response and a high query throughput. At the same time, we also need good ranking functions that consider the proximity of keywords to compute relevance scores. In this paper, we study how to integrate proximity information into ranking in instant-fuzzy search while achieving efficient time and space complexities. We adapt existing solutions on proximity ranking to instant-fuzzy search. A naïve solution computes all answers and then ranks them, but it cannot meet the high-speed requirement on large data sets when there are many answers; early-termination techniques have therefore been studied to compute relevant answers efficiently. To overcome the space and time limitations of these solutions, we propose an approach that focuses on common phrases in the data and queries, assuming records with these phrases are ranked higher. We study how to index these phrases and develop an incremental-computation algorithm for efficiently segmenting a query into phrases and computing relevant answers. We conducted a thorough experimental study on real data sets to show the tradeoffs between time, space, and quality of these solutions.
【Keywords】: computational complexity; fuzzy set theory; query formulation; query processing; early-termination techniques; incremental-computation algorithm; information-retrieval paradigm; instant-fuzzy search; proximity ranking; query answering; query segmentation; space complexity; time complexity; Dictionaries; Heart; Indexing; Servers; Surgery; Tumors
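The instant-fuzzy matching semantics can be stated compactly: a keyword matches if some prefix of it is within edit distance τ of the characters typed so far. A brute-force Python sketch of this semantics — production systems instead traverse a trie maintaining active nodes per keystroke, which this sketch does not attempt:

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def instant_fuzzy(dictionary, typed, tau=1):
    """Words having some prefix within edit distance tau of the typed
    characters; only prefix lengths within tau of len(typed) can qualify."""
    hits = []
    for w in dictionary:
        lengths = range(max(0, len(typed) - tau),
                        min(len(w), len(typed) + tau) + 1)
        if min(edit_distance(typed, w[:k]) for k in lengths) <= tau:
            hits.append(w)
    return hits
```

The restriction of candidate prefix lengths to len(typed) ± τ is the basic observation that makes character-by-character fuzzy matching feasible.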
【Paper Link】 【Pages】:340-351
【Authors】: Dong Deng ; Guoliang Li ; Shuang Hao ; Jiannan Wang ; Jianhua Feng
【Abstract】: String similarity join is an essential operation in data integration. The era of big data calls for scalable algorithms to support large-scale string similarity joins. In this paper, we study scalable string similarity joins using MapReduce. We propose a MapReduce-based framework, called MASSJOIN, which supports both set-based similarity functions and character-based similarity functions. We extend the existing partition-based signature scheme to support set-based similarity functions. We utilize the signatures to generate key-value pairs. To reduce the transmission cost, we merge key-value pairs to significantly reduce the number of key-value pairs, from cubic to linear complexity, while not sacrificing the pruning power. To improve the performance, we incorporate “light-weight” filter units into the key-value pairs which can be utilized to prune a large number of dissimilar pairs without significantly increasing the transmission cost. Experimental results on real-world datasets show that our method significantly outperformed state-of-the-art approaches.
【Keywords】: Big Data; computational complexity; cost reduction; data integration; string matching; MASSJOIN; MapReduce-based framework; MassJoin; big data; character-based similarity functions; cubic complexity; data integration; key-value pairs; large-scale string similarity join; light-weight filter units; linear complexity; mapreduce-based method; partition-based signature scheme; scalable algorithm; scalable string similarity joins; set-based similarity functions; transmission cost reduction; Erbium; Filtering; Open systems
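The signature idea the abstract extends is easiest to see in the classical prefix filter for Jaccard joins: under a fixed global token order, two records with similarity ≥ t must share a token among their first |r| − ⌈t·|r|⌉ + 1 tokens. A single-machine sketch of that candidate-generation step (MASSJOIN distributes an extended variant of it as MapReduce key-value pairs; this is not the paper's algorithm):

```python
import math
from collections import defaultdict

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def prefix_filter_join(records, threshold):
    """All record-id pairs with Jaccard >= threshold, found by indexing
    only each record's prefix tokens."""
    order = {}                        # global token order (first-seen here;
    for r in records:                 # rarest-first would prune better)
        for tok in r:
            order.setdefault(tok, len(order))
    index, results = defaultdict(list), []
    for rid, r in enumerate(records):
        toks = sorted(set(r), key=order.get)
        plen = len(toks) - math.ceil(threshold * len(toks)) + 1
        cands = set()
        for tok in toks[:plen]:
            cands.update(index[tok])  # earlier records sharing a prefix token
            index[tok].append(rid)
        for cid in sorted(cands):
            if jaccard(records[cid], r) >= threshold:
                results.append((cid, rid))
    return results
```

Only records colliding on a prefix token are ever verified, which is the pruning power the merged key-value pairs must preserve.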
【Paper Link】 【Pages】:352-363
【Authors】: Ian Rae ; Alan Halverson ; Jeffrey F. Naughton
【Abstract】: Every major open-source and commercial RDBMS offers some form of support for full-text search using inverted indexes. When providing this support, some developers have implemented specialized indexes that adapt techniques from the Information Retrieval (IR) community to work in a database setting, while others have opted to rely on the standard relational query engine to process inverted index lookups. This choice is an important one, since the storage formats and algorithms used can vary greatly between a specialized index and a relational index, but these alternatives have not been thoroughly compared in the same system. Our work explores the differences in implementation and performance of three representative environments for an in-RDBMS inverted index: an in-RDBMS IR engine, a row-oriented relational query engine, and a column-oriented relational query engine. We found that a specialized IR engine integrated into the RDBMS can provide more than an order of magnitude speedup over both the row- and column-oriented relational query engines for conjunctive and phrase queries. For warm queries, this advantage is largely algorithmic, and we show that by using ZigZag merge join to accelerate conjunctive and phrase query processing, relational inverted indexes can provide performance comparable to a specialized in-RDBMS IR engine with no change to the underlying storage format. Compression and index format, in contrast, have more impact on cold queries, where the IR and column-oriented engines are able to outperform the row-oriented engine, even with ZigZag merge join.
【Keywords】: database indexing; query processing; relational databases; IR community; ZigZag merge join; column-oriented engines; column-oriented relational query engine; commercial RDBMS; conjunctive queries; database setting; full-text search; in-RDBMS IR engine; in-RDBMS inverted indexes; information retrieval; inverted index lookups; open-source RDBMS; phrase queries; relational database management systems; relational index; relational query engine; row-oriented engine; row-oriented relational query engine; specialized index; specialized indexes; storage formats; Algorithm design and analysis; Communities; Encoding; Engines; Indexes; Servers; Standards
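The ZigZag merge join credited above for conjunctive speedups lets every posting list seek (via skips or binary search) to a shared candidate docid instead of scanning linearly; a mismatch raises the candidate, so long runs of non-matching ids are jumped over. A sketch over sorted integer docid lists:

```python
from bisect import bisect_left

def zigzag_intersect(lists):
    """Conjunctive intersection of sorted docid lists with skip-based
    seeks: each list advances to the current candidate; on overshoot
    the candidate is raised to the overshooting docid."""
    if not lists or any(not lst for lst in lists):
        return []
    pos = [0] * len(lists)
    cand = lists[0][0]
    result = []
    while True:
        for k, lst in enumerate(lists):
            i = bisect_left(lst, cand, pos[k])
            if i == len(lst):          # one list exhausted: done
                return result
            pos[k] = i
            if lst[i] != cand:         # overshoot: restart with larger candidate
                cand = lst[i]
                break
        else:                          # every list agreed on cand
            result.append(cand)
            cand += 1                  # move past the match
```

Relational engines can realize the same access pattern with an index nested-loop style plan, which is how the paper closes the gap to the specialized IR engine for warm queries.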
【Paper Link】 【Pages】:364-375
【Authors】: Mohammad Sadoghi ; Hans-Arno Jacobsen
【Abstract】: The efficient processing of large collections of patterns expressed as Boolean expressions over event streams plays a central role in major data intensive applications ranging from user-centric processing and personalization to real-time data analysis. On the one hand, emerging user-centric applications, including computational advertising and selective information dissemination, demand determining and presenting to an end-user the relevant content as it is published. On the other hand, applications in real-time data analysis, including push-based multi-query optimization, computational finance and intrusion detection, demand meeting stringent subsecond processing requirements and providing high-frequency event processing. We achieve these event processing requirements by exploiting the shift towards multi-core architectures and by proposing a novel adaptive parallel compressed event matching algorithm (A-PCM) and an online event stream re-ordering technique (OSR) that unleash an unprecedented degree of parallelism amenable to highly parallel event processing. In our comprehensive evaluation, we demonstrate the efficiency of our proposed techniques. We show that the adaptive parallel compressed event matching algorithm can sustain an event rate of up to 233,863 events/second while state-of-the-art sequential event matching algorithms sustain only 36 events/second when processing up to five million Boolean expressions.
【Keywords】: data analysis; finance; multiprocessing systems; parallel processing; query processing; security of data; A-PCM; Boolean expressions; adaptive parallel compressed event matching; computational advertising; computational finance; intrusion detection; multicore architectures; novel adaptive parallel compressed event matching algorithm; online event stream re-ordering technique; push-based multi-query optimization; real-time data analysis; selective information dissemination; user-centric processing; Data analysis; Encoding; Indexes; Instruction sets; Phase change materials; Real-time systems; Subscriptions
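Before compression and parallelization, Boolean-expression matching is commonly implemented with a counting algorithm over an inverted predicate index: an event increments a counter for every subscription containing a satisfied predicate, and a subscription fires when all of its predicates are counted. A minimal sequential version for equality predicates (illustrative only, not the paper's A-PCM):

```python
from collections import defaultdict

def build_index(subscriptions):
    """subscriptions: list of sets of (attribute, value) equality
    predicates. Returns predicate -> list of subscription ids."""
    index = defaultdict(list)
    for sid, preds in enumerate(subscriptions):
        for p in preds:
            index[p].append(sid)
    return index

def match_event(index, subscriptions, event):
    """Return ids of subscriptions whose every predicate is satisfied
    by the event's attribute-value pairs."""
    counts = defaultdict(int)
    for pair in event.items():
        for sid in index.get(pair, ()):
            counts[sid] += 1
    return [sid for sid, c in counts.items()
            if c == len(subscriptions[sid])]
```

The sequential counter updates are the bottleneck at millions of expressions, which is the workload A-PCM attacks with compression and multi-core parallelism.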
【Paper Link】 【Pages】:376-387
【Authors】: Xiaochen Zhu ; Shaoxu Song ; Jianmin Wang ; Philip S. Yu ; Jiaguang Sun
【Abstract】: A large amount of heterogeneous event data are increasingly generated, e.g., in online systems for Web services or operational systems in enterprises. Owing to the difference between event data and traditional relational data, the matching of heterogeneous events is highly non-trivial. While event names are often opaque (e.g., merely with obscure IDs), the existing structure-based matching techniques for relational data also fail to perform owing to the poor discriminative power of dependency relationships between events. We note that interesting patterns exist in the occurrence of events, which may serve as discriminative features in event matching. In this paper, we formalize the problem of matching events with patterns. A generic pattern based matching framework is proposed, which is compatible with the existing structure based techniques. To improve the matching efficiency, we devise several bounds of matching scores for pruning. Since the exploration of patterns is costly and incremental, our proposed techniques support matching in a pay-as-you-go style, i.e., they incrementally update the matching results as more patterns become available. Finally, extensive experiments on both real and synthetic data demonstrate the effectiveness of our pattern based matching compared with approaches adapted from existing techniques, and the efficiency improved by the bounding/pruning methods.
【Keywords】: distributed databases; relational databases; bounding/pruning methods; enterprise operational systems; heterogeneous event data; heterogeneous event matching; matching scores; online systems; pattern based matching; relational data; Business; Educational institutions; Frequency conversion; Information systems; Optimal matching; Pattern matching; Upper bound
【Paper Link】 【Pages】:388-399
【Authors】: Xiaolan Wang ; K. Selçuk Candan ; Maria Luisa Sapino
【Abstract】: Many applications generate and/or consume multi-variate temporal data, yet experts often lack the means to adequately and systematically search for and interpret multi-variate observations. In this paper, we first observe that multi-variate time series often carry localized multi-variate temporal features that are robust against noise. We then argue that these multi-variate temporal features can be extracted by simultaneously considering, at multiple scales, temporal characteristics of the time-series along with external knowledge, including variate relationships, known a priori. Relying on these observations, we develop algorithms to detect robust multi-variate temporal (RMT) features which can be indexed for efficient and accurate retrieval and can be used for supporting analysis tasks, such as classification. Experiments confirm that the proposed RMT algorithm is highly effective and efficient in identifying robust multi-scale temporal features of multi-variate time series.
【Keywords】: feature extraction; indexing; information retrieval; meta data; pattern classification; time series; RMT features; classification task; external knowledge; information retrieval; local multivariate temporal features; meta data; multivariate observations; multivariate time series; robust multivariate temporal features; variate relationships; Correlation; Data models; Feature extraction; Robustness; Smoothing methods; Tensile stress; Time series analysis
【Paper Link】 【Pages】:400-411
【Authors】: Di Jiang ; Kenneth Wai-Ting Leung ; Jan Vosecky ; Wilfred Ng
【Abstract】: Query suggestion is an important functionality provided by the search engine to facilitate information seeking of the users. Existing query suggestion methods usually focus on recommending queries that are the most relevant to the input query. However, such relevance-oriented strategy cannot effectively handle query uncertainty, a common scenario in which the input query can be interpreted with multiple different meanings. To alleviate this problem, the concepts of diversification and personalization have been individually introduced to query suggestion systems. These two concepts are often seen as incompatible alternatives, because diversification considers multiple aspects of the input query to maximize the probability that some query aspect is relevant to the user while personalization aims to adapt the suggestions to a specific aspect that aligns with the preference of a specific user. In this paper, we refute this antagonistic view and propose a new query suggestion paradigm, Personalized Query Suggestion With Diversity Awareness (PQS-DA), to effectively combine diversification and personalization into one unified framework. In PQS-DA, the suggested queries are effectively diversified to cover different potential facets of the input query while the ranking of the suggested queries is personalized to ensure that the top ones are those that align with a user's personal preference. We evaluate PQS-DA on a real-life search engine query log against several state-of-the-art methods with respect to a variety of metrics. The experimental results verify our hypothesis that diversification and personalization can be effectively integrated and that they enhance each other within the PQS-DA framework, which significantly outperforms several strong baselines across a series of metrics.
【Keywords】: query processing; search engines; PQS-DA paradigm; antagonistic view; diversification concept; diversity awareness; information seeking; input query; personalization concept; personalized query suggestion; query ranking; query recommendation; query uncertainty; relevance-oriented strategy; search engine query log; Context; Equations; Java; Search engines; Sun; Uncertainty; Web search
【Paper Link】 【Pages】:412-423
【Authors】: Senjuti Basu Roy ; Saravanan Thirumuruganathan ; Sihem Amer-Yahia ; Gautam Das ; Cong Yu
【Abstract】: We examine the problem of enabling the flexibility of updating one's preferences in group recommendation. In our setting, any group member can provide a vector of preferences that, in addition to past preferences and other group members' preferences, will be accounted for in computing group recommendation. This functionality is essential in many group recommendation applications, such as travel planning, online games, book clubs, or strategic voting, as it has been previously shown that user preferences may vary depending on mood, context, and company (i.e., other people in the group). Preferences are enforced in a feedback box that replaces the preferences provided by the users with a potentially different feedback vector that is better suited for maximizing individual satisfaction when computing the group recommendation. The feedback box interacts with a traditional recommendation box that implements a group consensus semantics in the form of Aggregated Voting or Least Misery, two popular aggregation functions for group recommendation. We develop efficient algorithms to compute robust group recommendations that are appropriate in situations where users have changing preferences. Our extensive empirical study on real world data-sets validates our findings.
【Keywords】: ergonomics; recommender systems; aggregation functions; book clubs; feedback box; feedback vector; flexible preferences; group consensus semantics; group recommendation computing; group recommendation functions; least misery; online games; strategic voting; travel planning; user preferences; Approximation algorithms; Complexity theory; Hamming distance; Poles and towers; Robustness; Semantics; Vectors
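The two consensus semantics named in the abstract can be sketched in a few lines; the item names and preference scores below are invented for illustration, not taken from the paper's datasets:

```python
# Sketch of the two group-consensus semantics (Aggregated Voting and
# Least Misery). Members' per-item scores are made-up toy data.

def aggregated_voting(ratings):
    """Group score = sum (equivalently, average) of member scores."""
    return sum(ratings)

def least_misery(ratings):
    """Group score = the least satisfied member's score."""
    return min(ratings)

# item -> one preference score per group member
prefs = {
    "paris":  [0.9, 0.9, 0.2],
    "rome":   [0.6, 0.7, 0.6],
    "berlin": [0.5, 0.4, 0.9],
}

rank_av = sorted(prefs, key=lambda i: aggregated_voting(prefs[i]), reverse=True)
rank_lm = sorted(prefs, key=lambda i: least_misery(prefs[i]), reverse=True)
```

Note how the two semantics disagree: Aggregated Voting favors the highest total score even if one member is miserable, while Least Misery favors the item whose worst-off member is happiest.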
【Paper Link】 【Pages】:424-435
【Authors】: Nicholas L. Farnan ; Adam J. Lee ; Panos K. Chrysanthis ; Ting Yu
【Abstract】: The declarative nature of SQL has traditionally been a major strength. Users simply state what information they are interested in, and the database management system determines the best plan for retrieving it. A consequence of this model is that should a user ever want to specify some aspect of how their queries are evaluated (e.g., a preference to read data from a specific replica, or a requirement for all joins to be performed by a single server), they are unable to. This can leave database administrators shoehorning evaluation preferences into database cost models. Further, for distributed database users, it can result in query evaluation plans that violate data handling best practices or the privacy of the user. To address such issues, we have developed a framework for declarative, user-specified constraints on the query optimization process and implemented it within PostgreSQL. Our Preference-Aware Query Optimizer (PAQO) upholds both strict requirements and partially ordered preferences that are issued alongside the queries that it processes. In this paper, we present the design of PAQO and thoroughly evaluate its performance.
【Keywords】: SQL; query processing; relational databases; PAQO optimization; PostgreSQL; Structured Query Languages; data handling best practices; database cost models; database management system; decentralized database systems; preference-aware query optimization; query evaluation; query optimization process; user privacy; user-specified constraints; Servers; Solvents
【Paper Link】 【Pages】:436-447
【Authors】: Jitendra Ajmera ; Sachindra Joshi ; Ashish Verma ; Amol Mittal
【Abstract】: In a customer support scenario, a lot of valuable information is recorded in the form of 'case logs'. Case logs are primarily written for future reference or manual inspection and are therefore written in a hasty manner and are very noisy. In this paper, we propose techniques that exploit these case logs to mine real customer concerns or problems and then map them to well-written knowledge articles for that enterprise. This mapping results in the generation of question-answer (QA) pairs. These QA pairs can be used for a variety of applications, such as dynamically updating the frequently-asked-questions (FAQs), updating the knowledge repository, etc. In this paper we show the utility of these discovered QA pairs as training data for a question-answering system. Our approach for mining the case logs is based on a composite model consisting of two generative models, viz. the hidden Markov model (HMM) and the latent Dirichlet allocation (LDA) model. The LDA model explains the long-range dependencies across words due to their semantic similarity, and the HMM models the sequential patterns present in these case logs. Such processing results in crisp 'problem statement' segments which are indicative of the real customer concerns. Our experiments show that this approach finds crisp problem statements in 56% of the cases and outperforms alternate segmentation methods such as HMM, LDA and conditional random fields (CRF). After finding these crisp problem statements, appropriate answers are looked up from an existing knowledge repository index, forming candidate QA pairs. We show that considering only the problem-statement segments for which answers can be found further improves the segmentation performance to 82%. Finally, we show that when these QA pairs are used as training data, the performance of a question-answering system can be improved significantly.
【Keywords】: data mining; hidden Markov models; question answering (information retrieval); FAQ; HMM; LDA model; QA pairs discovery; case logs mining; conditional random field; frequently-asked-questions; hidden Markov model; knowledge repository; latent Dirichlet allocation model; problem statement segments; question answer pairs; question-answering system; segmentation performance; semantic similarity; Context; Hidden Markov models; Noise measurement; Semantics; Syntactics; Training; Viterbi algorithm
【Paper Link】 【Pages】:448-459
【Authors】: Loïc Cerf ; Wagner Meira Jr.
【Abstract】: Many datasets are numerical tensors, i.e., they associate n-tuples with numerical values. Until recently, the discovery of relevant local patterns in such numerical and multidimensional data has received little attention despite the broad application perspectives offered by this general framework. Even in the simpler 2-dimensional case, almost every proposal so far is either incomplete (i.e., it does not list every pattern) or relies on binning and mines Boolean tensors. In both cases, some information is lost during the process. In uncertain tensors, n-tuples satisfy the studied predicate to a certain extent and no information is lost w.r.t. the original data. Given an uncertain tensor, the closed patterns are its maximal “sub-tensors” covering n-tuples that “mostly” satisfy the predicate. Defining “mostly” is the key problem: the patterns should be both relevant given the data and efficiently extractable. The proposed complete extractor reuses the enumeration principles of the state-of-the-art miner for closed n-sets but incrementally enforces the newly designed definition. In this way, the proposed algorithm runs orders of magnitude faster than its only competitor, and large datasets become tractable. The experimental section reports the discovery of dynamic patterns of influence in Twitter as well as usage patterns in a transportation network. Additional experiments on synthetic data quantitatively assess the quality of the chosen definition for the patterns.
【Keywords】: Boolean algebra; data mining; set theory; social networking (online); tensors; Boolean tensors; Twitter; closed n-sets; data mining; enumeration principles; high-quality patterns; n-tuples; numerical tensors; transportation network; Data mining; Itemsets; Noise; Proposals; Radiation detectors; Tensile stress; Writing
【Paper Link】 【Pages】:460-471
【Authors】: Sofiane Abbar ; Habibur Rahman ; Saravanan Thirumuruganathan ; Carlos Castillo ; Gautam Das
【Abstract】: We assume a database of items in which each item is described by a set of attributes, some of which could be multi-valued. We refer to each of the distinct attribute values as a feature. We also assume that we have information about the interactions (such as visits or likes) between a set of users and those items. In this paper, we would like to rank the features of an item using user-item interactions. For instance, if the items are movies, features could be actors, directors or genres, and a user-item interaction could be a user liking the movie. This information can be used to identify the most important actors for each movie. While users are drawn to an item due to a subset of its features, a user-item interaction only provides an expression of user preference over the entire item, and not its component features. We design algorithms to rank the features of an item depending on whether interaction information is available at aggregated or individual level granularity, and extend them to rank composite features (sets of features). Our algorithms are based on constrained least squares, network flow and non-trivial adaptations to non-negative matrix factorization. We evaluate our algorithms using both real-world and synthetic datasets.
【Keywords】: data mining; matrix decomposition; actors; composite feature ranking; constrained least squares; directors; genres; item database; item feature ranking; most important actor identification; network flow; nonnegative matrix factorization; nontrivial adaptations; online user-item interaction mining; synthetic datasets; user preference; user-item interactions; Aggregates; Algorithm design and analysis; Computational modeling; Databases; Matrix decomposition; Motion pictures; Vectors
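The core idea of recovering per-feature weights from item-level interactions can be sketched with ordinary least squares; the paper uses constrained least squares, network flow and NMF adaptations, whereas this toy example solves an unconstrained two-feature case by hand. The movie/actor data is invented:

```python
# Each row of A is a movie's feature-indicator vector over two
# hypothetical actors; y holds item-level interaction counts (likes).
# Solving (A^T A) w = A^T y recovers a weight per feature, which then
# induces a feature ranking.

A = [[1, 0], [0, 1], [1, 1]]   # columns: [actor_a, actor_b]
y = [10, 2, 12]                # likes per movie

# normal equations, solved directly for the 2x2 case
ata = [[sum(A[k][i] * A[k][j] for k in range(len(A))) for j in range(2)]
       for i in range(2)]
aty = [sum(A[k][i] * y[k] for k in range(len(A))) for i in range(2)]
det = ata[0][0] * ata[1][1] - ata[0][1] * ata[1][0]
w = [(ata[1][1] * aty[0] - ata[0][1] * aty[1]) / det,
     (ata[0][0] * aty[1] - ata[1][0] * aty[0]) / det]

features = ["actor_a", "actor_b"]
ranking = [f for _, f in sorted(zip(w, features), reverse=True)]
```

Here the fit is exact (every like count is explained by the feature weights), so `actor_a` is ranked as the more important feature.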
【Paper Link】 【Pages】:472-483
【Authors】: Niranjan Kamat ; Prasanth Jayachandran ; Karthik Tunga ; Arnab Nandi
【Abstract】: Interactive ad-hoc analytics over large datasets has become an increasingly popular use case. We detail the challenges encountered when building a distributed system that allows the interactive exploration of a data cube. We introduce DICE, a distributed system that uses a novel session-oriented model for data cube exploration, designed to provide the user with interactive sub-second latencies for specified accuracy levels. A novel framework is provided that combines three concepts: faceted exploration of data cubes, speculative execution of queries and query execution over subsets of data. We discuss design considerations, implementation details and optimizations of our system. Experiments demonstrate that DICE provides a sub-second interactive cube exploration experience at the billion-tuple scale that is at least 33% faster than current approaches.
【Keywords】: data analysis; query processing; DICE system; billion-tuple scale; distributed data cube exploration; distributed system; faceted data cubes exploration; interactive ad-hoc analytics; interactive data cube exploration; session-oriented model; speculative query execution; sub-second interactive cube exploration; Accuracy; Catalogs; Context; Data models; Distributed databases; Lattices
【Paper Link】 【Pages】:484-495
【Authors】: Gheorghi Guzun ; Guadalupe Canahuate ; David Chiu ; Jason Sawin
【Abstract】: Bitmap indices are widely used for large read-only repositories in data warehouses and scientific databases. Their binary representation allows for the use of bitwise operations and specialized run-length compression techniques. Due to a trade-off between compression and query efficiency, bitmap compression schemes are aligned using a fixed encoding length size (typically the word length) to avoid explicit decompression during query time. In general, smaller encoding lengths provide better compression, but require more decoding during query execution. However, when the difference in size is considerable, it is possible for smaller encodings to also provide better execution time. We posit that a tailored encoding length for each bit vector will provide better performance than a one-size-fits-all approach. We present a framework that optimizes compression and query efficiency by allowing bitmaps to be compressed using variable encoding lengths while still maintaining alignment to avoid explicit decompression. Efficient algorithms are introduced to process queries over bitmaps compressed using different encoding lengths. An input parameter controls the aggressiveness of the compression providing the user with the ability to tune the tradeoff between space and query time. Our empirical study shows this approach achieves significant improvements in terms of both query time and compression ratio for synthetic and real data sets. Compared to 32-bit WAH, VAL-WAH produces up to 1.8× smaller bitmaps and achieves query times that are 30% faster.
【Keywords】: data compression; data structures; query processing; VAL-WAH data set; binary representation; bit vector; bitmap compression schemes; bitmap indices; bitwise operations; compression ratio; data warehouses; fixed encoding length size; input parameter; one-size-fits-all approach; query efficiency; query time; read-only repositories; scientific databases; specialized run-length compression techniques; tunable compression framework; variable encoding lengths; word length; Algorithm design and analysis; Computer architecture; Decoding; Encoding; Indexes; Query processing; Vectors
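The word-aligned run-length scheme the abstract builds on can be sketched compactly. This is a simplified WAH-style codec over Python lists rather than machine words, with a fixed word length `w`; the paper's VAL-WAH contribution is precisely to choose `w` per bit vector, which this sketch does not do:

```python
# WAH-style compression sketch: the bit vector is cut into (w-1)-bit
# groups; runs of all-0 or all-1 groups collapse into "fill" words
# carrying a run length, all other groups are stored as "literal" words.

def wah_compress(bits, w=8):
    g = w - 1                                   # payload bits per word
    groups = [tuple(bits[i:i + g]) for i in range(0, len(bits), g)]
    out = []
    for grp in groups:
        uniform = len(grp) == g and len(set(grp)) == 1
        if uniform and out and out[-1][0] == "fill" and out[-1][1] == grp[0]:
            out[-1] = ("fill", grp[0], out[-1][2] + 1)   # extend the run
        elif uniform:
            out.append(("fill", grp[0], 1))
        else:
            out.append(("lit", grp))
    return out

def wah_decompress(words, w=8):
    g = w - 1
    bits = []
    for word in words:
        if word[0] == "fill":
            bits.extend([word[1]] * g * word[2])
        else:
            bits.extend(word[1])
    return bits

bits = [0] * 21 + [1, 0, 1, 1, 0, 1, 0] + [1] * 14
compressed = wah_compress(bits)     # 3 words for 42 bits
```

Because fills stay group-aligned, bitwise AND/OR over two compressed vectors can proceed word by word without full decompression, which is the property the paper's alignment constraint preserves under variable encoding lengths.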
【Paper Link】 【Pages】:496-507
【Authors】: Zhengkui Wang ; Qi Fan ; Huiju Wang ; Kian-Lee Tan ; Divyakant Agrawal ; Amr El Abbadi
【Abstract】: Attributed graphs are becoming important tools for modeling information networks, such as the Web and various social networks (e.g. Facebook, LinkedIn, Twitter). However, it is computationally challenging to manage and analyze attributed graphs to support effective decision making. In this paper, we propose Pagrol, a parallel graph OLAP (Online Analytical Processing) system over attributed graphs. In particular, Pagrol introduces a new conceptual Hyper Graph Cube model (an attributed-graph analogue of the data cube model for relational DBMS) to aggregate attributed graphs at different granularities and levels. The proposed model supports different queries as well as a new set of graph OLAP Roll-Up/Drill-Down operations. Furthermore, on the basis of Hyper Graph Cube, Pagrol provides an efficient MapReduce-based parallel graph cubing algorithm, MRGraph-Cubing, to compute the graph cube for an attributed graph. Pagrol employs numerous optimization techniques: (a) a self-contained join strategy to minimize I/O cost; (b) a scheme that groups cuboids into batches so as to minimize redundant computations; (c) a cost-based scheme to allocate the batches into bags (each with a small number of batches); and (d) an efficient scheme to process a bag using a single MapReduce job. Results of extensive experimental studies using both real Facebook and synthetic datasets on a 128-node cluster show that Pagrol is effective, efficient and scalable.
【Keywords】: data mining; graph theory; parallel algorithms; social networking (online); Facebook; MRGraph-cubing; MapReduce-based parallel graph cubing algorithm; Pagrol; conceptual hyper graph cube model; decision making; information networks; large-scale attributed graphs; numerous optimization techniques; online analytical processing; parallel graph OLAP system; self-contained join strategy; single MapReduce job; Aggregates; Cities and towns; Computational modeling; Decision making; Educational institutions; Lattices; Warehousing
【Paper Link】 【Pages】:508-519
【Authors】: Holger Pirk ; Stefan Manegold ; Martin L. Kersten
【Abstract】: The variety of memory devices in modern computer systems holds opportunities as well as challenges for data management systems. In particular, the exploitation of Graphics Processing Units (GPUs) and their fast memory has been studied quite intensively. However, current approaches treat GPUs as systems in their own right and fail to provide a generic strategy for efficient CPU/GPU cooperation. We propose such a strategy for relational query processing: calculating an approximate result based on lossily compressed, GPU-resident data and refining the result using residuals, i.e., the lost data, on the CPU. We developed the required algorithms, implemented the strategy in an existing DBMS and found up to 8 times performance improvement, even for datasets larger than the available GPU memory.
【Keywords】: graphics processing units; query processing; relational databases; storage management; CPU-GPU cooperation; GPU memory; approximate result; data management systems; graphics processing units; lossily compressed GPU-resident data; memory devices; relational data coprocessing; relational query processing
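The approximate-then-refine strategy can be illustrated with a simple aggregate. Here coarse quantization stands in for the lossy GPU-resident representation, and the residuals (the discarded low-order parts) correct the approximate answer; the data and the SUM query are invented:

```python
# Approximate-then-refine sketch: the lossy copy answers fast, the
# residuals recover exactness. STEP plays the role of the lossy
# compressor's precision.

STEP = 16                                   # quantization step

values = [103, 250, 7, 199, 42, 181]
approx = [v - v % STEP for v in values]     # lossy, "GPU-resident" copy
resid  = [v % STEP for v in values]         # residuals kept for the "CPU"

approx_sum = sum(approx)                    # fast approximate answer
exact_sum  = approx_sum + sum(resid)        # refined with residuals
```

The design point the abstract makes is that the approximate pass touches only the small compressed representation, while the refinement pass touches the residuals selectively, so the two devices split the work along their strengths.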
【Paper Link】 【Pages】:520-531
【Authors】: Luis Leopoldo Perez ; Christopher M. Jermaine
【Abstract】: The use of materialized views derived from the intermediate results of frequently executed queries is a popular strategy for improving performance in query workloads. Optimizers capable of matching such views with inbound queries can generate alternative execution plans that read the materialized contents directly instead of re-computing the corresponding subqueries, which tends to result in reduced query execution times. In this paper, we introduce an architecture called Hawc that extends a cost-based logical optimizer with the capability to use history information to identify query plans that, if executed, produce intermediate result sets that can be used to create materialized views with the potential to reduce the execution time of future queries. We present techniques for using knowledge of past queries to assist the query optimizer and match, generate and select useful materialized views. Experimental results indicate that these techniques provide substantial improvements in workload execution time.
【Keywords】: history; query processing; history aware query optimization; history information; inbound queries; logical optimizer; materialized contents; materialized intermediate views; query execution; query workloads; Costing; History; Optimization; Query processing; Vectors
【Paper Link】 【Pages】:532-543
【Authors】: Varun Simhadri ; Karthik Ramachandra ; Arun Chaitanya ; Ravindra Guravannavar ; S. Sudarshan
【Abstract】: Queries containing user-defined functions (UDFs) are widely used, since they allow queries to be written using a mix of imperative language constructs and SQL, thereby increasing the expressive power of SQL; further, they encourage modularity, and make queries easier to understand. However, not much attention has been paid to their optimization, except for simple UDFs without imperative constructs. Queries invoking UDFs with imperative constructs are executed using iterative invocation of the UDFs, leading to poor performance, especially if the UDF contains queries. Such poor execution has been a major deterrent to the wider usage of complex UDFs. In this paper we present a novel technique to decorrelate UDFs containing imperative constructs, allowing set-oriented execution of queries that invoke UDFs. Our technique allows imperative execution to be modeled using the Apply construct used earlier to model correlated subqueries, and enables transformation rules to be applied subsequently to decorrelate (or inline) UDF bodies. Subquery decorrelation was critical to the wide use of subqueries; our work brings the same benefits to queries that invoke complex UDFs. We have applied our techniques to UDFs running on two commercial database systems, and present results showing up to orders of magnitude improvement.
【Keywords】: database management systems; query processing; Apply construct; SQL; Structured Query Language; UDF; commercial database systems; imperative language constructs; query processing; set-oriented query execution; subquery decorrelation; user defined function invocations; Context; Database systems; Decorrelation; Gold; Optimization; Platinum; Standards
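The performance gap between iterative UDF invocation and set-oriented (decorrelated) execution can be mimicked outside SQL. In this invented Python analogy, the "UDF" issues one lookup query per row, whereas the decorrelated form evaluates the embedded lookup once and applies it set-at-a-time:

```python
# Iterative vs. set-oriented evaluation of a scalar "UDF" that contains
# a query. Tables and the discount rule are made up for illustration.

orders = [("o1", "gold"), ("o2", "std"), ("o3", "gold")]
discount_by_tier = {"gold": 0.1, "std": 0.0}

queries_issued = 0

def discount_udf(tier):
    """Scalar UDF: issues one lookup query per invocation."""
    global queries_issued
    queries_issued += 1
    return discount_by_tier[tier]

# iterative invocation: one embedded query per order row
iterative = [(oid, discount_udf(t)) for oid, t in orders]

# decorrelated form: the embedded query runs once, then a hash join
tier_disc = dict(discount_by_tier)          # evaluated a single time
set_oriented = [(oid, tier_disc[t]) for oid, t in orders]
```

Both produce the same rows, but the iterative form issued one query per row, which is the per-invocation overhead that the paper's Apply-based decorrelation eliminates.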
【Paper Link】 【Pages】:544-555
【Authors】: Jun Gao ; Jiashuai Zhou ; Chang Zhou ; Jeffrey Xu Yu
【Abstract】: With the rapid growth of graphs in different applications, it is inevitable to leverage existing distributed data processing frameworks in managing large graphs. Although these frameworks ease the developing cost, it is still cumbersome and error-prone for developers to implement complex graph analysis tasks in distributed environments. Additionally, developers have to learn the details of these frameworks quite well, which is a key to improve the performance of distributed jobs. This paper introduces a high level query language called GLog and proposes its evaluation method to overcome these limitations. Specifically, we first design a RG (Relational-Graph) data model to mix relational data and graph data, and extend Datalog to GLog on RG tables to support various graph analysis tasks. Second, we define operations on RG tables, and show translation templates to convert a GLog query into a sequence of MapReduce jobs. Third, we propose two strategies, namely rule merging and iteration rewriting, to optimize the translated jobs. The final experiments show that GLog can not only express various graph analysis tasks in a more succinct way, but also achieve a better performance for most of the graph analysis tasks than Pig, another high level dataflow system.
【Keywords】: DATALOG; data flow computing; graph theory; relational databases; DATALOG; GLog; MapReduce; Pig; complex graph analysis; distributed data processing; high level dataflow system; high level graph analysis system; high level query language; relational-graph data model; Algebra; Computer languages; Educational institutions; Engines; Indexes; Merging; Optimization
【Paper Link】 【Pages】:556-567
【Authors】: Jun Gao ; Chang Zhou ; Jiashuai Zhou ; Jeffrey Xu Yu
【Abstract】: Continuous pattern detection plays an important role in monitoring-related applications. The large size and dynamic update of graphs, along with the massive search space, pose huge challenges in developing an efficient continuous pattern detection system. In this paper, we leverage a distributed graph processing framework to approximately detect a given pattern over a large dynamic graph. We aim to improve the scalability and precision, and to reduce the response time and message cost of detection. We convert a given query pattern into a Single-Sink DAG (Directed Acyclic Graph) and propose an evaluation plan with message transitions on the DAG, abbreviated as the SSD plan, to detect the pattern in a large dynamic graph. The SSD plan guides the data graph exploration via messages, and the messages converge at data sink vertices, which then detect occurrences of the query pattern. We also conduct join operations over partial vertices during the graph exploration to improve the precision of pattern detection. In addition, we show that the SSD plan can support continuous queries over dynamic graphs with slight extensions. We further design various sink vertex selection strategies and neighborhood-based transition rule attachment to lower the evaluation cost. Experiments on billion-edge real-life graphs using Giraph, an open source implementation of Pregel, illustrate the efficiency and effectiveness of our method.
【Keywords】: directed graphs; distributed processing; query processing; Giraph; Pregel; SSD plan; billion-edge graph; continuous pattern detection system; data graph exploration; directed acyclic graph; distributed graph processing framework; large dynamic graph; massive search space; neighborhood based transition rule attachment; query pattern; single-sink DAG; sink vertex selection strategies; Biomedical monitoring; Drives; Image edge detection; Monitoring; Parallel processing; Pattern matching; Transforms
【Paper Link】 【Pages】:568-579
【Authors】: Lu Wang ; Yanghua Xiao ; Bin Shao ; Haixun Wang
【Abstract】: Billion-node graphs pose significant challenges at all levels from storage infrastructures to programming models. It is critical to develop a general purpose platform for graph processing. A distributed memory system is considered a feasible platform supporting online query processing as well as offline graph analytics. In this paper, we study the problem of partitioning a billion-node graph on such a platform, an important consideration because it has direct impact on load balancing and communication overhead. It is challenging not just because the graph is large, but because we can no longer assume that the data can be organized in arbitrary ways to maximize the performance of the partitioning algorithm. Instead, the algorithm must adopt the same data and programming model adopted by the system and other applications. In this paper, we propose a multi-level label propagation (MLP) method for graph partitioning. Experimental results show that our solution can partition billion-node graphs within several hours on a distributed memory system consisting of merely several machines, and the quality of the partitions produced by our approach is comparable to state-of-the-art approaches applied on toy-size graphs.
【Keywords】: distributed memory systems; graph theory; query processing; resource allocation; MLP method; billion-node graph partitioning; communication overhead; distributed memory system; general purpose platform; graph processing; load balancing; multilevel label propagation method; offline graph analytics; online query processing; toy-size graphs; Algorithm design and analysis; Communities; Educational institutions; Frequency modulation; Partitioning algorithms; Semantics; Social network services
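The primitive underlying the paper's multi-level scheme, plain label propagation, is short enough to sketch. Every vertex starts with its own label and repeatedly adopts the most common label among its neighbors, so densely connected regions converge to one partition; the toy graph below (two triangles joined by a single edge) is invented:

```python
# Label propagation sketch: synchronous rounds, most-frequent neighbor
# label wins, smallest label breaks ties. The paper's MLP method adds
# multi-level coarsening and size constraints on top of this primitive.

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {v: [] for v in range(6)}
for a, b in edges:
    adj[a].append(b)
    adj[b].append(a)

label = {v: v for v in adj}             # singleton labels to start
for _ in range(10):                     # bounded number of rounds
    new = {}
    for v in adj:
        counts = {}
        for u in adj[v]:
            counts[label[u]] = counts.get(label[u], 0) + 1
        new[v] = min(counts, key=lambda l: (-counts[l], l))
    if new == label:                    # converged
        break
    label = new
```

On this graph the two triangles end up with two distinct labels, i.e., the cut falls on the single bridging edge, which is exactly the load-balancing/communication behavior a partitioner wants.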
【Paper Link】 【Pages】:580-591
【Authors】: Viktor Leis ; Alfons Kemper ; Thomas Neumann
【Abstract】: So far, transactional memory, although a promising technique, has suffered from the absence of an efficient hardware implementation. The upcoming Haswell microarchitecture from Intel introduces hardware transactional memory (HTM) in mainstream CPUs. HTM allows for efficient concurrent, atomic operations, which is also highly desirable in the context of databases. On the other hand, HTM has several limitations that, in general, prevent a one-to-one mapping of database transactions to HTM transactions. In this work we devise several building blocks that can be used to exploit HTM in main-memory databases. We show that HTM enables nearly lock-free processing of database transactions by carefully controlling the data layout and the access patterns. The HTM component is used for detecting the (infrequent) conflicts, which allows for an optimistic, and thus very low-overhead, execution of concurrent transactions.
【Keywords】: storage management; transaction processing; HTM; Haswell microarchitecture; Intel; access patterns; atomic operations; concurrent transactions; data layout; database context; database transactions; hardware transactional memory; main-memory databases; one-to-one database mapping; Concurrent computing; Databases; Hardware; Multicore processing; Parallel processing; Protocols; Synchronization
【Paper Link】 【Pages】:592-603
【Authors】: Wolf Rödiger ; Tobias Mühlbauer ; Philipp Unterbrunner ; Angelika Reiser ; Alfons Kemper ; Thomas Neumann
【Abstract】: The growth in compute speed has outpaced the growth in network bandwidth over the last decades. This has led to an increasing performance gap between local and distributed processing. A parallel database cluster thus has to maximize the locality of query processing. A common technique to this end is to co-partition relations to avoid expensive data shuffling across the network. However, this is limited to one attribute per relation and is expensive to maintain in the face of updates. Other attributes often exhibit a fuzzy co-location due to correlations with the distribution key, but current approaches do not leverage this. In this paper, we introduce locality-sensitive data shuffling, which can dramatically reduce the amount of network communication for distributed operators such as join and aggregation. We present four novel techniques: (i) optimal partition assignment exploits locality to reduce the network phase duration; (ii) communication scheduling avoids bandwidth underutilization due to cross traffic; (iii) adaptive radix partitioning retains locality during data repartitioning and handles value skew gracefully; and (iv) selective broadcast reduces network communication in the presence of extreme value skew or large numbers of duplicates. We present comprehensive experimental results, which show that our techniques can improve performance by up to a factor of 5 for fuzzy co-location and a factor of 3 for inputs with value skew.
【Keywords】: data handling; parallel databases; adaptive radix partitioning technique; aggregation operator; communication scheduling technique; distributed processing; fuzzy colocation; join operator; locality-sensitive data shuffling; locality-sensitive operators; network bandwidth; network communication; optimal partition assignment technique; parallel main-memory database clusters; performance gap; query processing; relation copartitioning; selective broadcast technique; value skew; Algorithm design and analysis; Bandwidth; Correlation; Database systems; Distributed databases; Partitioning algorithms
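Technique (i), locality-exploiting partition assignment, can be sketched greedily: assign each join partition to the node that already stores most of its tuples, so only the minority crosses the network. The counts are invented, and the greedy per-partition choice ignores the load-balance constraint that the paper's assignment also has to respect:

```python
# counts[partition][node] = tuples of that partition already stored on
# that node. Assigning each partition to its "home" node minimizes the
# tuples shuffled for that partition (ignoring load balance).

counts = {
    "p0": {"n0": 90, "n1": 10},
    "p1": {"n0": 20, "n1": 80},
    "p2": {"n0": 55, "n1": 45},
}

assign = {p: max(c, key=c.get) for p, c in counts.items()}

# tuples that must cross the network under this assignment ...
shuffled = sum(v for p, c in counts.items()
               for n, v in c.items() if n != assign[p])
# ... versus a locality-oblivious plan that sends everything to n0
naive = sum(v for p, c in counts.items()
            for n, v in c.items() if n != "n0")
```

With these numbers the locality-aware assignment ships 75 tuples instead of 135, which is the kind of network-phase saving the abstract attributes to fuzzy co-location.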
【Paper Link】 【Pages】:604-615
【Authors】: Nirmesh Malviya ; Ariel Weisberg ; Samuel Madden ; Michael Stonebraker
【Abstract】: Fine-grained, record-oriented write-ahead logging, as exemplified by systems like ARIES, has been the gold standard for relational database recovery. In this paper, we show that in modern high-throughput transaction processing systems, this is no longer the optimal way to recover a database system. In particular, as transaction throughputs get higher, ARIES-style logging starts to represent a non-trivial fraction of the overall transaction execution time. We propose a lighter weight, coarse-grained command logging technique which only records the transactions that were executed on the database. It then does recovery by starting from a transactionally consistent checkpoint and replaying the commands in the log as if they were new transactions. By avoiding the overhead of fine-grained logging of before and after images (both CPU complexity as well as substantial associated I/O), command logging can yield significantly higher throughput at run-time. Recovery times for command logging are higher compared to an ARIES-style physiological logging approach, but with the advent of high-availability techniques that can mask the outage of a recovering node, recovery speeds have become secondary in importance to run-time performance for most applications. We evaluated our approach on an implementation of TPC-C in a main memory database system (VoltDB), and found that command logging can offer 1.5× higher throughput than a main-memory optimized implementation of ARIES-style physiological logging.
【Keywords】: relational databases; transaction processing; ARIES system; VoltDB system; coarse-grained command logging technique; fine-grained record-oriented write-ahead logging; high-throughput transaction processing systems; main memory OLTP recovery; main memory database system; online transaction processing; physiological logging approach; relational database recovery; transaction execution time; transactionally consistent checkpoint; Database systems
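Command logging can be shown in miniature: only the executed command and its parameters are appended to the log, and recovery replays the log against the last consistent checkpoint. The "database" and transaction below are invented; note the scheme relies on transactions being deterministic, which is why it fits a system like VoltDB:

```python
# Command-logging sketch: log (command, args) instead of before/after
# images; recover by replaying the log from a consistent checkpoint.

import copy

db = {"alice": 100, "bob": 50}
checkpoint = copy.deepcopy(db)          # transactionally consistent snapshot
command_log = []

def transfer(state, src, dst, amount):  # a deterministic transaction
    state[src] -= amount
    state[dst] += amount

def execute(cmd, *args):
    command_log.append((cmd.__name__, args))   # the whole log record
    cmd(db, *args)

execute(transfer, "alice", "bob", 30)
execute(transfer, "bob", "alice", 5)

# simulated crash + recovery: replay commands against the checkpoint
recovered = copy.deepcopy(checkpoint)
dispatch = {"transfer": transfer}
for name, args in command_log:
    dispatch[name](recovered, *args)
```

Each log record here is a few bytes regardless of how many records the transaction touched, which is the run-time saving over physiological logging; the cost, as the abstract notes, is that replay re-executes the transactions and therefore recovery is slower.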
【Paper Link】 【Pages】:616-627
【Authors】: Peter Lindstrom ; Deepak Rajan
【Abstract】: This paper proposes a general framework for generating cache-oblivious layouts for binary search trees. A cache-oblivious layout attempts to minimize cache misses on any hierarchical memory, independent of the number of memory levels and attributes at each level such as cache size, line size, and replacement policy. Recursively partitioning a tree into contiguous subtrees and prescribing an ordering amongst the subtrees, Hierarchical Layouts generalize many commonly used layouts for trees such as in-order, pre-order and breadth-first. They also generalize the various flavors of the van Emde Boas layout, which have previously been used as cache-oblivious layouts. Hierarchical Layouts thus unify previous attempts at deriving layouts for search trees. The paper then derives a new locality measure (the Weighted Edge Product) that mimics the probability of cache misses at multiple levels, and shows that layouts that reduce this measure perform better. We analyze the various degrees of freedom in the construction of Hierarchical Layouts, and investigate the relative effect of each of these decisions in the construction of cache-oblivious layouts. Optimizing the Weighted Edge Product for complete binary search trees, we introduce the MINWEP layout, and show that it outperforms previously used cache-oblivious layouts by almost 20%.
【Keywords】: cache storage; probability; tree searching; MINWEP layout; binary search trees; breadth-first; cache misses minimization; cache misses probability; cache-oblivious layouts; cache-oblivious search trees; contiguous subtrees; hierarchical memory; in-order; locality measure; optimal hierarchical layouts; pre-order; van Emde Boas layout; weighted edge product; weighted edge product optimization; Binary search trees; Layout; Size measurement; Vegetation; Weight measurement
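The recursive split that the van Emde Boas family of Hierarchical Layouts uses can be sketched for a complete binary tree in 1-based heap numbering; the function name and the choice to split at `height // 2` are ours, one instance of the degrees of freedom the paper analyzes:

```python
# van Emde Boas-style recursive layout: lay out the top half-height
# subtree first, then each bottom subtree, recursively, so that nodes
# on any root-to-leaf path tend to share cache lines.

def veb_order(root, height):
    if height == 1:
        return [root]
    top_h = height // 2
    bottom_h = height - top_h
    order = veb_order(root, top_h)              # top subtree first
    # leaves of the top subtree, left to right, in heap numbering
    top_leaves = [root * (1 << (top_h - 1)) + i
                  for i in range(1 << (top_h - 1))]
    for leaf in top_leaves:                     # then each bottom subtree
        order += veb_order(2 * leaf, bottom_h)
        order += veb_order(2 * leaf + 1, bottom_h)
    return order

layout = veb_order(1, 4)    # 15-node complete binary search tree
```

For height 4 this yields `[1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]`: the root triangle is contiguous, and each 3-node bottom triangle is contiguous, independent of the cache line size, which is the cache-oblivious property the abstract describes.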
【Paper Link】 【Pages】:628-639
【Authors】: Haibo Hu ; Jianliang Xu ; Xizhong Xu ; Kexin Pei ; Byron Choi ; Shuigeng Zhou
【Abstract】: Query processing that preserves both the query privacy at the client and the data privacy at the server is a new research problem. It has many practical applications, especially when the queries concern sensitive attributes of records. However, most existing studies, including those originating from data outsourcing, address data privacy and query privacy separately. Although secure multiparty computation (SMC) is a suitable computing paradigm for this problem, it incurs significant computation and communication overheads and thus cannot scale to large datasets. Fortunately, recent advances in cryptography bring us two relevant tools: conditional oblivious transfer and homomorphic encryption. In this paper, we integrate database indexing techniques with these tools in the context of private search on key-value stores. We first present an oblivious index traversal framework, in which the server cannot trace the index traversal path of a query during evaluation. The framework is generic and can support a wide range of query types with a suitable homomorphic encryption algorithm in place. Based on this framework, we devise secure protocols for classic key search queries on B+-tree and R-tree indexes. Our approach is verified by both a security analysis and a performance study.
【Keywords】: cryptography; data privacy; database indexing; query processing; tree data structures; B+-tree indexes; R-tree indexes; SMC; conditional oblivious transfer; cryptography; data outsourcing; data privacy; database indexing technique; hierarchical indexes; homomorphic encryption algorithm; index traversal path; key-value store; oblivious index traversal framework; private search; query privacy; query processing; secure multiparty computation; Data privacy; Encryption; Indexes; Protocols; Servers
【Paper Link】 【Pages】:640-651
【Authors】: Xun Yi ; Russell Paulet ; Elisa Bertino ; Vijay Varadharajan
【Abstract】: In mobile communication, spatial queries pose a serious threat to user location privacy because the location of a query may reveal sensitive information about the mobile user. In this paper, we study k nearest neighbor (kNN) queries, in which the mobile user queries the location-based service (LBS) provider about the k nearest points of interest (POIs) on the basis of his current location. We propose a solution for the mobile user to preserve his location privacy in kNN queries. The proposed solution is built on the Paillier public-key cryptosystem and can provide both location privacy and data privacy. In particular, our solution allows the mobile user to retrieve one type of POI, for example, the k nearest car parks, without revealing to the LBS provider what type of points is retrieved. For a cloaking region with n×n cells and m types of points, the total communication complexity for the mobile user to retrieve a type of k nearest POIs is O(n + m), while the computation complexities of the mobile user and the LBS provider are O(n + m) and O(n²m), respectively. Compared with existing solutions for kNN queries with location privacy, our solution is more efficient. Experiments have shown that our solution is practical for kNN queries.
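The Paillier cryptosystem the protocol builds on is additively homomorphic: multiplying ciphertexts yields an encryption of the sum of the plaintexts, which is what lets a server combine encrypted values without seeing them. A toy sketch with deliberately tiny primes (illustrative only; not the paper's protocol, and nowhere near secure parameter sizes):

```python
import math, random

def paillier_keygen(p, q):
    """Toy Paillier key generation from two (tiny!) distinct primes.
    Requires Python 3.9+ for math.lcm and pow(x, -1, m)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)            # valid because we fix g = n + 1
    return (n,), (lam, mu, n)

def encrypt(pub, m):
    (n,) = pub
    n2 = n * n
    r = random.randrange(1, n)      # random mask, coprime with n
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def decrypt(priv, c):
    lam, mu, n = priv
    n2 = n * n
    return (pow(c, lam, n2) - 1) // n * mu % n
```

With this, decrypt(priv, E(a) * E(b) mod n²) equals a + b, the property the paper exploits to keep the query private.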
【Keywords】: communication complexity; data privacy; mobility management (mobile radio); pattern recognition; public key cryptography; query processing; LBS querying; Paillier public-key cryptosystem; cloaking region; computation complexities; data privacy; k nearest POIs retrieval; k nearest car parks; k nearest points of interest; kNN queries; location privacy preservation; location-based service provider querying; mobile communication; mobile user; practical k nearest neighbor queries; spatial queries; total communication complexity; user location privacy; Data privacy; Databases; Games; Middleware; Mobile communication; Privacy; Protocols
【Paper Link】 【Pages】:652-663
【Authors】: Wentian Lu ; Gerome Miklau ; Vani Gupta
【Abstract】: Evaluating the performance of database systems is crucial when database vendors or researchers are developing new technologies. But such evaluation tasks rely heavily on actual data and query workloads that are often unavailable to researchers due to privacy restrictions. To overcome this barrier, we propose a framework for the release of a synthetic database which accurately models selected performance properties of the original database. We improve on prior work on synthetic database generation by providing a formal, rigorous guarantee of privacy. Accuracy is achieved by generating synthetic data using a carefully selected set of statistical properties of the original data which balance privacy loss with relevance to the given query workload. An important contribution of our framework is an extension of standard differential privacy to multiple tables.
【Keywords】: data privacy; database management systems; statistical analysis; trusted computing; balance privacy loss; database researchers; database vendors; differential privacy; privacy guarantee; privacy restrictions; private synthetic database generation; query workloads; statistical properties; synthetic data generation; untrusted system evaluation; Aggregates; Data privacy; Databases; Noise; Privacy; Sensitivity; Standards
【Paper Link】 【Pages】:664-675
【Authors】: Yousef Elmehdwi ; Bharath K. Samanthula ; Wei Jiang
【Abstract】: For the past decade, query processing on relational data has been studied extensively, and many theoretical and practical solutions have been proposed under various scenarios. With the recent popularity of cloud computing, users now have the opportunity to outsource their data as well as their data management tasks to the cloud. However, due to various privacy issues, sensitive data (e.g., medical records) need to be encrypted before being outsourced to the cloud. In addition, query processing tasks should be handled by the cloud; otherwise, there would be little point in outsourcing the data in the first place. Processing queries over encrypted data without the cloud ever decrypting the data is a very challenging task. In this paper, we focus on solving the k-nearest neighbor (kNN) query problem over an encrypted database outsourced to a cloud: a user issues an encrypted query record to the cloud, and the cloud returns the k closest records to the user. We first present a basic scheme and demonstrate that such a naive solution is not secure. To provide better security, we propose a secure kNN protocol that protects the confidentiality of the data, the user's input query, and the data access patterns. We also empirically analyze the efficiency of our protocols through various experiments. The results indicate that our secure protocol is very efficient on the user end; this lightweight scheme allows a user to perform kNN queries from any mobile device.
【Keywords】: cloud computing; cryptography; data privacy; query processing; relational databases; cloud computing; data access patterns; data confidentiality; data management tasks; encrypted data; kNN protocol; kNN query problem; mobile device; outsourced environments; privacy issues; query processing; relational data; secure k-nearest neighbor query; sensitive data; user input query; Distributed databases; Encryption; Protocols; Query processing
【Paper Link】 【Pages】:676-687
【Authors】: Daniel Gómez Ferro ; Flavio Junqueira ; Ivan Kelly ; Benjamin Reed ; Maysam Yabandeh
【Abstract】: In this paper, we introduce Omid, a tool for lock-free transactional support in large data stores such as HBase. Omid uses a centralized scheme and implements snapshot isolation, a property that guarantees that all read operations of a transaction are performed on a consistent snapshot of the data. In a lock-based approach, the unreleased, distributed locks held by a failed or slow client block others. By using a centralized scheme, Omid is able to implement a lock-free commit algorithm that does not suffer from this problem. Moreover, Omid lightly replicates a read-only copy of the transaction metadata to the clients, where a large part of the queries on metadata can be serviced locally. Thanks to this technique, Omid requires no modification of either the data store's source code or the tables' schema, and the overhead on data servers is also negligible. The experimental results show that our implementation on a simple dual-core machine can service up to a thousand client machines. While the added latency is limited to only 10 ms, Omid scales up to 124K write transactions per second. Since this capacity is several times larger than the maximum reported traffic in similar systems, we do not expect the centralized scheme of Omid to become a bottleneck even for current large data stores.
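The centralized commit decision for snapshot isolation can be sketched as a status oracle that tracks the last commit timestamp per row and aborts any transaction whose write set was committed to after its snapshot was taken. This is a heavy simplification under illustrative names (`StatusOracle` is not Omid's actual code), but it shows why no distributed locks are needed:

```python
class StatusOracle:
    """Centralized, lock-free commit check: a transaction may commit only
    if no row in its write set was committed by another transaction
    after this transaction's start (snapshot) timestamp."""
    def __init__(self):
        self.ts = 0                  # logical clock
        self.last_commit = {}        # row -> commit timestamp

    def begin(self):
        self.ts += 1
        return self.ts               # start timestamp defines the snapshot

    def try_commit(self, start_ts, write_set):
        # write-write conflict: some row committed after our snapshot
        if any(self.last_commit.get(r, 0) > start_ts for r in write_set):
            return None              # abort; no locks were ever held
        self.ts += 1
        for r in write_set:
            self.last_commit[r] = self.ts
        return self.ts               # commit timestamp
```

A slow or crashed client simply never calls try_commit; since it holds no locks, it cannot block anyone else.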
【Keywords】: distributed databases; meta data; storage management; transaction processing; HBase; Omid; centralized scheme; client machine; data servers; data stores; distributed data store; dual-core machine; lock-free commit algorithm; lock-free transactional support; read-only copy; snapshot isolation; table schema; transaction metadata; Computer crashes; Delays; Distributed databases; Partitioning algorithms; Servers; Silicon
【Paper Link】 【Pages】:688-699
【Authors】: Danica Porobic ; Erietta Liarou ; Pinar Tözün ; Anastasia Ailamaki
【Abstract】: Nowadays, high-performance transaction processing applications increasingly run on multisocket multicore servers. Such architectures exhibit non-uniform memory access latency as well as non-uniform thread communication costs. Unfortunately, traditional shared-everything database management systems are designed for uniform inter-core communication speeds, which causes unpredictable access latencies in the critical path. While a lack of data locality may be a minor nuisance on systems with fewer than 4 processors, it becomes a serious scalability limitation on larger systems due to accesses to centralized data structures. In this paper, we propose ATraPos, a storage manager design that is aware of the non-uniform access latencies of multisocket systems. ATraPos achieves good data locality by carefully partitioning the data as well as internal data structures (e.g., state information) across the available processors and by assigning threads to specific partitions. Furthermore, ATraPos dynamically adapts to the workload characteristics: when the workload changes, ATraPos detects the change and automatically revises the data partitioning and thread placement to fit the current access patterns and hardware topology. We prototype ATraPos on top of the open-source storage manager Shore-MT and present a detailed experimental analysis with both synthetic and standard (TPC-C and TATP) benchmarks. We show that ATraPos achieves performance improvements ranging from 1.4x to 6.7x for a wide collection of transactional workloads. In addition, we show that the adaptive monitoring and partitioning scheme of ATraPos adds negligible cost while allowing the system to adapt dynamically and gracefully when the workload changes.
【Keywords】: data structures; multi-threading; multiprocessing systems; storage management; transaction processing; ATraPos; Shore-MT; TATP benchmarks; TPC-C benchmarks; access patterns; adaptive monitoring; adaptive transaction processing; data locality; data partitioning; hardware Islands; hardware topology; high-performance transaction processing applications; intercore communication speeds; internal data structures; multisocket multicore servers; multisocket systems; nonuniform access latencies; nonuniform memory access latency; nonuniform thread communication costs; open-source storage manager; partitioning scheme; processors; shared-everything database management systems; storage manager design; thread placement; transactional workloads; workload characteristics; Data structures; Hardware; Multicore processing; Program processors; Servers; Sockets; Throughput
【Paper Link】 【Pages】:700-711
【Authors】: Hyuck Han ; Seongjae Park ; Hyungsoo Jung ; Alan Fekete ; Uwe Röhm ; Heon Y. Yeom
【Abstract】: Since the 1990s, Snapshot Isolation (SI) has been widely studied and successfully deployed in commercial and open-source database engines. Berenson et al. showed that data consistency can be violated under SI. Recently, a new class of Serializable SI (SSI) algorithms has been proposed to achieve serializable execution while still allowing concurrency between reads and updates.
【Keywords】: concurrency control; data integrity; database management systems; multiprocessing systems; Intel 4-way 32 core machine; MySQL-5.6.10; SSI; data consistency; multicore systems; open-source database engines; read-write conflict conditions; scalable serializable snapshot isolation; Concurrency control; Databases; Engines; Latches; Multicore processing; Scalability; Silicon
【Paper Link】 【Pages】:712-723
【Authors】: Bin Liu ; Jun'ichi Tatemura ; Oliver Po ; Wang-Pin Hsiung ; Hakan Hacigümüs
【Abstract】: Supporting an online transaction processing (OLTP) workload in a scalable and elastic fashion is a challenging task. Recently, a new breed of scalable systems has shown significant throughput gains by limiting consistency to small units of data called “entity-groups” (e.g., a user's account information stored together with all her emails in an online email service). Transactions that access data from only one entity-group are guaranteed full ACID properties, but those that access multiple entity-groups are not. Defining entity-groups has a direct impact on workload consistency and performance, and doing so for data with a complex schema is very challenging. The process is prone to extremes: groups that are too fine-grained cause an excessive number of expensive distributed transactions, while those that are too coarse lead to excessive serialization and performance degradation. It is also difficult to balance conflicting requirements from different transactions. In commercially available entity-group systems, creating entity-groups is usually a manual process, which severely limits the usability of those systems. This paper is the first systematic effort to automate the entity-group design process. Our goal is to build a user-friendly design tool for automatically creating entity-groups based on a given workload and to help users trade consistency for performance in a principled manner. For advanced users, we allow them to provide feedback on the entity-group design and iteratively improve the final output. We demonstrate the effectiveness of our approach with widely used benchmarks. We also present the user experience of a prototype we built.
【Keywords】: data mining; iterative methods; transaction processing; OLTP workloads; automatic entity-grouping; expensive distributed transactions; iterative method; multiple entity-groups; online transaction processing workload; user-friendly design tool; Benchmark testing; Distributed databases; Electronic mail; Measurement; Prototypes; Throughput
【Paper Link】 【Pages】:724-735
【Authors】: Wangda Zhang ; Reynold Cheng ; Ben Kao
【Abstract】: The discounted hitting time (DHT), which is a random-walk similarity measure for graph node pairs, is useful in various applications, including link prediction, collaborative recommendation, and reputation ranking. We examine a novel query, called the multi-way join (or n-way join), on DHT scores. Given a graph and n sets of nodes, the n-way join retrieves a set of n-tuples with the k highest scores, according to some aggregation function of DHT values. This query enables analysis and prediction of complex relationships among n sets of nodes. Since an n-way join is expensive to compute, we develop the Partial Join algorithm (PJ). This solution decomposes an n-way join into a number of top-m 2-way joins and combines their results to construct the answer to the n-way join. Since PJ may necessitate the computation of top-(m+1) 2-way joins, we study an incremental solution, which allows the top-(m+1) 2-way join to be derived quickly from the top-m 2-way join results computed earlier. We further examine fast processing and pruning algorithms for 2-way joins. An extensive evaluation on three real datasets shows that PJ accurately evaluates n-way joins and is four orders of magnitude faster than basic solutions.
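For intuition, one common formulation of the discounted hitting time sets h(target) = 0 and h(u) = 1 + α · (average of h over u's neighbors), which can be solved by fixed-point iteration; the paper's exact definition and aggregation functions may differ, so treat this as a sketch:

```python
def discounted_hitting_time(adj, target, alpha=0.8, iters=200):
    """Fixed-point iteration for one common DHT formulation on an
    unweighted graph given as an adjacency dict {node: [neighbors]}:
    h(target) = 0, h(u) = 1 + alpha * mean(h[v] for neighbors v)."""
    h = {u: 0.0 for u in adj}
    for _ in range(iters):
        for u in adj:
            if u != target:
                h[u] = 1 + alpha * sum(h[v] for v in adj[u]) / len(adj[u])
    return h
```

The discount factor α < 1 keeps the values bounded even when the walk can wander indefinitely, which is what makes DHT usable as a similarity score.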
【Keywords】: data handling; graph theory; query processing; DHT scores; aggregation function; collaborative recommendation; discounted hitting time; graph node pairs; link prediction; multiway join query; multiway joins evaluation; n-way join query; partial join algorithm; pruning algorithms; random-walk similarity measure; reputation ranking; top-(m+ 1) 2-way joins; Aggregates; Artificial intelligence; DH-HEMTs; Social network services; Spatial databases; Time measurement
【Paper Link】 【Pages】:736-747
【Authors】: Rong-Hua Li ; Jeffrey Xu Yu ; Xin Huang ; Hong Cheng
【Abstract】: We introduce and formulate two types of random-walk domination problems in graphs motivated by a number of applications in practice (e.g., the item-placement problem in online social networks, the ads-placement problem in advertisement networks, and the resource-placement problem in P2P networks). Specifically, given a graph G, the goal of the first type of random-walk domination problem is to target k nodes such that the total hitting time of an L-length random walk starting from the remaining nodes to the targeted nodes is minimized. The second type of random-walk domination problem is to find k nodes to maximize the expected number of nodes that hit any one targeted node through an L-length random walk. We prove that these problems are two special instances of submodular set function maximization with a cardinality constraint. To solve them effectively, we propose a dynamic-programming (DP) based greedy algorithm with a near-optimal performance guarantee. The DP-based greedy algorithm, however, is not very efficient due to the expensive marginal-gain evaluation. To further speed up the algorithm, we propose an approximate greedy algorithm with linear time complexity w.r.t. the graph size, also with a near-optimal performance guarantee. The approximate greedy algorithm is based on carefully designed random walk sampling and sample-materialization techniques. Extensive experiments demonstrate the effectiveness, efficiency and scalability of the proposed algorithms.
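The greedy template for submodular maximization under a cardinality constraint, which the paper's DP-based algorithm instantiates with a random-walk objective, can be sketched on the simplest monotone submodular function, set coverage (this is the generic template with its classic (1 - 1/e) guarantee, not the paper's specific objective):

```python
def greedy_max_coverage(candidate_sets, k):
    """Plain greedy for maximizing a monotone submodular function
    (here, coverage of a universe) under a cardinality constraint k.
    Repeatedly pick the candidate with the largest marginal gain."""
    chosen, covered = [], set()
    for _ in range(k):
        best = max(candidate_sets,
                   key=lambda s: len(candidate_sets[s] - covered))
        chosen.append(best)
        covered |= candidate_sets[best]
    return chosen, covered
```

The paper's efficiency challenge is exactly the `max` step: evaluating the marginal gain of every candidate is expensive when the gain involves random-walk hitting times, which motivates the sampling-based approximation.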
【Keywords】: computational complexity; dynamic programming; graph theory; greedy algorithms; DP-based greedy algorithm; L-length random walk; cardinality constraint problem; dynamic-programming based greedy algorithm; large graphs; linear time complexity; marginal gain evaluation; random walk sampling; random-walk domination; sample-materialization techniques; submodular set function maximization; Algorithm design and analysis; Approximation algorithms; Approximation methods; Greedy algorithms; Search problems; Social network services; Time complexity
【Paper Link】 【Pages】:748-759
【Authors】: Albert Yu ; Pankaj K. Agarwal ; Jun Yang
【Abstract】: Given a set of objects O, each with d numeric attributes, a top-k preference scores these objects using a linear combination of their attribute values, where the weight on each attribute reflects the interest in this attribute. Given a query preference q, a top-k query finds the k objects in O with highest scores with respect to q. Given a query object o and a set of preferences Q, a reverse top-k query finds all preferences q ∈ Q for which o becomes one of the top k objects with respect to q. Previous solutions to these problems are effective only in low dimensions. In this paper, we develop a solution for much higher dimensions (up to high tens), if many preferences exhibit sparsity, i.e., each specifies non-zero weights for only a handful (say, 5-7) of attributes (though the subsets of such attributes and their weights can vary greatly). Our idea is to select carefully a set of low-dimensional core subspaces to “cover” the sparse preferences in a workload. These subspaces allow us to index them more effectively than the full-dimensional space. Being multi-dimensional, each subspace covers many possible preferences; furthermore, multiple subspaces can jointly cover a preference, thereby expanding the coverage beyond each subspace's dimensionality. Experimental evaluation validates our solution's effectiveness and advantages over previous solutions.
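The two query definitions above can be pinned down with a naive baseline (this brute-force version is what the subspace indexes are designed to beat; all names here are illustrative):

```python
def topk(objects, w, k):
    """IDs of the k objects with the highest linear score w . attrs,
    for objects given as {id: attribute_tuple}."""
    score = lambda oid: sum(wi * ai for wi, ai in zip(w, objects[oid]))
    return set(sorted(objects, key=score, reverse=True)[:k])

def reverse_topk(objects, query_id, preferences, k):
    """Naive reverse top-k: every preference (by name) under which
    query_id ranks among the top k objects."""
    return [name for name, w in preferences.items()
            if query_id in topk(objects, w, k)]
```

The cost of this baseline is one full top-k evaluation per preference, which is exactly what becomes untenable in high dimensions with large preference sets.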
【Keywords】: query processing; attribute values; full-dimensional space; high dimensions; low-dimensional core subspaces; numeric attributes; query preference; reverse top-k query; sparse preferences; subspace dimensionality; top-k preference
【Paper Link】 【Pages】:760-771
【Authors】: Shiyu Yang ; Muhammad Aamir Cheema ; Xuemin Lin ; Ying Zhang
【Abstract】: Given a set of facilities and a set of users, a reverse k nearest neighbors (RkNN) query q returns every user for which the query facility is one of the k-closest facilities. Due to its importance, the RkNN query has received significant research attention in the past few years. Almost all of the existing techniques adopt a pruning-and-verification framework. Regions-based pruning and half-space pruning are the two most notable pruning strategies. The half-space based approach prunes a larger area and is generally believed to be superior. Influenced by this perception, almost all existing RkNN algorithms utilize and improve the half-space pruning strategy. We observe the weaknesses and strengths of both strategies and discover that regions-based pruning has certain strengths that have not been exploited in the past. Motivated by this, we present a new RkNN algorithm called SLICE that utilizes the strength of regions-based pruning and overcomes its limitations. Our extensive experimental study on synthetic and real data sets demonstrates that SLICE is significantly more efficient than the existing algorithms. We also provide a detailed theoretical analysis of various aspects of our algorithm, such as its I/O cost, the unpruned area, and the cost of its verification phase. The experimental study validates our theoretical analysis.
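The RkNN semantics above reduce to a simple membership test per user. A brute-force sketch of the verification step (this is the unpruned baseline that SLICE and the half-space methods avoid running for every user; Manhattan distance in 2-D is an arbitrary choice for the example):

```python
def rknn(facilities, users, q, k):
    """Naive RkNN by direct verification: user u is an answer iff the
    query facility q is among u's k closest facilities, i.e., fewer
    than k other facilities are strictly closer to u than q is."""
    dist = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])  # Manhattan, 2-D
    answers = []
    for u in users:
        closer = sum(1 for f in facilities
                     if f != q and dist(u, f) < dist(u, q))
        if closer < k:
            answers.append(u)
    return answers
```

Pruning strategies aim to discard, cheaply and in bulk, the users for which the `closer < k` test cannot possibly succeed.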
【Keywords】: learning (artificial intelligence); query processing; RkNN query; SLICE algorithm; half-space pruning; k-closest facilities; pruning-and-verification framework; query facility; regions-based pruning; reverse k nearest neighbors queries; Algorithm design and analysis; Australia; Computational complexity; Context; Educational institutions; Extraterrestrial measurements; Probabilistic logic
【Paper Link】 【Pages】:772-783
【Authors】: Peng Peng ; Raymond Chi-Wing Wong
【Abstract】: Returning tuples that users may be interested in is one of the most important goals of multi-criteria decision making. Top-k queries and skyline queries are two representative queries. A top-k query has the merit of returning a limited number of tuples to users but requires users to give their exact utility functions. A skyline query has the merit that users do not need to give their exact utility functions but offers no control over the number of tuples returned. In this paper, we study the k-regret query, a recently proposed query that integrates the merits of the two representative queries. We first identify some interesting geometric properties of the k-regret query. Based on these properties, we define a set of candidate points called happy points for the k-regret query, which has not been studied in the literature. This result is very fundamental and beneficial not only to all existing algorithms but also to all new algorithms to be developed for the k-regret query. Since the number of happy points is found to be very small, the efficiency of all existing algorithms can be improved significantly. Furthermore, based on other geometric properties, we propose two efficient algorithms, each of which performs more efficiently than the best-known fastest algorithm. Our experimental results show that our proposed algorithms run faster than the best-known method on both synthetic and real datasets. In particular, in our experiments on real datasets, the best-known method took more than 3 hours to answer a k-regret query, whereas one of our proposed methods took only a few minutes and the other finished within a second.
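The quantity a k-regret query controls is the maximum regret ratio: how much worse the best tuple in the returned subset can be, relative to the best tuple overall, over all linear utility functions. A sampling-based sketch of that quantity (the paper computes it via exact geometric properties, not sampling; all names here are illustrative):

```python
import random

def regret_ratio(all_pts, subset, w):
    """Regret of answering with `subset` instead of `all_pts` under the
    linear utility w: 1 - best-in-subset / best-overall."""
    score = lambda p: sum(wi * pi for wi, pi in zip(w, p))
    return 1 - max(map(score, subset)) / max(map(score, all_pts))

def max_regret(all_pts, subset, samples=1000, dim=2, seed=0):
    """Estimate the maximum regret ratio by sampling nonnegative
    utility vectors; a k-regret answer keeps this value small
    without asking the user for an exact utility function."""
    rng = random.Random(seed)
    return max(regret_ratio(all_pts, subset,
                            [rng.random() for _ in range(dim)])
               for _ in range(samples))
```

A subset with small maximum regret combines the two merits from the abstract: bounded output size, and no need for the user's exact utility function.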
【Keywords】: geometry; query processing; geometry approach; geometry properties; happy points; k-regret query; multicriteria decision making; representative queries; skyline query; top-k query; tuples; utility functions; Database systems; Decision making; Geometry; Material storage; Time complexity; Vectors; Query Processing; Skyline; k-Regret Query
【Paper Link】 【Pages】:784-795
【Authors】: David C. Anastasiu ; George Karypis
【Abstract】: The All-Pairs similarity search, or self-similarity join, problem finds all pairs of vectors in a high-dimensional sparse dataset with a similarity value higher than a given threshold. The problem has classically been solved using a dynamically built inverted index. The search time is reduced by early pruning of candidates using size- and value-based bounds on the similarity. In the context of cosine similarity and weighted vectors, leveraging the Cauchy-Schwarz inequality, we propose new ℓ2-norm bounds for reducing the inverted index size, the candidate pool size, and the number of full dot-product computations. We tighten previous candidate generation and verification bounds and introduce several new ones to further improve our algorithm's performance. Our new pruning strategies enable significant speedups over baseline approaches, often outperforming even approximate solutions. We perform an extensive evaluation of our algorithm, L2AP, and compare against state-of-the-art exact and approximate methods, AllPairs, MMJoin, and BayesLSH, across a variety of real-world datasets and similarity thresholds.
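The core idea behind ℓ2-norm bounds can be sketched in a few lines: for unit-normalized vectors, the cosine similarity equals the dot product, and the contribution of any unprocessed suffix of dimensions is bounded by the product of the suffix norms (Cauchy-Schwarz). This sketch illustrates the principle, not L2AP's exact bounds:

```python
import math

def cos_upper_bound(x, y, p):
    """Upper-bound the dot product of unit vectors x and y using the
    exact dot product over the first p dimensions plus a Cauchy-Schwarz
    bound on the remaining suffix dimensions."""
    prefix = sum(a * b for a, b in zip(x[:p], y[:p]))
    suffix = math.sqrt(sum(a * a for a in x[p:])) * \
             math.sqrt(sum(b * b for b in y[p:]))
    return prefix + suffix
```

If the bound already falls below the similarity threshold after processing only p dimensions, the pair can be pruned without ever computing the full dot product, which is where the reported speedups come from.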
【Keywords】: information filtering; AllPairs method; BayesLSH method; Cauchy-Schwarz inequality; L2AP algorithm; MMJoin method; all-pairs similarity search; cosine similarity context; dynamically built inverted index; fast cosine similarity search; filtering strategies; high dimensional sparse dataset; prefix l2-norm bounds; pruning strategies; search time; self-similarity join problem; similarity thresholds; similarity value; size-based bounds; value-based bounds; weighted vectors context; Approximation algorithms; Heuristic algorithms; Indexing; Search problems; Upper bound; Vectors
【Paper Link】 【Pages】:796-807
【Authors】: Sergej Fries ; Brigitte Boden ; Grzegorz Stepien ; Thomas Seidl
【Abstract】: Join processing on large-scale vector data is an important problem in many applications, as vectors are a common representation for various data types. In particular, several data analysis tasks, such as near-duplicate detection, density-based clustering, and data cleaning, are based on similarity self-joins, a special type of join. For huge data sets, MapReduce has proved to be a suitable, error-tolerant framework for parallel join algorithms. Recent approaches exploit the vector-space properties of low-dimensional vector data for efficient join computation. However, so far no parallel similarity self-join approach aimed at high-dimensional vector data has been proposed. In this work we propose the novel similarity self-join algorithm PHiDJ (Parallel High-Dimensional Join) for the MapReduce framework. PHiDJ is well suited for medium- to high-dimensional data and exploits multiple filter techniques to reduce communication and computational costs. We provide a solution for efficient join computation over skewed distributed data. Our experimental evaluation on medium- to high-dimensional data shows that our approach outperforms existing techniques.
【Keywords】: data analysis; parallel algorithms; vectors; MapReduce framework; PHiDJ; communication cost reduction; computational cost reduction; data analysis tasks; data cleaning; density-based clustering; error-tolerant framework; high-dimensional vector data; join computation; join processing; large-scale vector data; medium-dimensional data; multiple filter techniques; near duplicate detection; parallel high-dimensional join; parallel similarity self-join algorithms; skewed distributed data; vector-space properties; Distributed databases; Encoding; Extraterrestrial measurements; Partitioning algorithms; Vectors
【Paper Link】 【Pages】:808-819
【Authors】: Jin Huang ; Rui Zhang ; Rajkumar Buyya ; Jian Chen
【Abstract】: The Earth Mover's Distance (EMD) similarity join retrieves pairs of records with EMD below a given threshold. It has a number of important applications, such as near-duplicate image retrieval and pattern analysis in probabilistic datasets. However, the computational cost of EMD is super-cubic in the number of bins in the histograms used to represent the data objects. Consequently, the EMD similarity join operation is prohibitive for large datasets. This is the first paper that specifically addresses the EMD similarity join, and we propose to use MapReduce to approach this problem. The MapReduce algorithms designed for generic metric-distance similarity joins are inefficient for the EMD similarity join because they involve a large number of distance computations and have unbalanced workloads on reducers when dealing with skewed datasets. We propose a novel framework, named MELODY-JOIN, which transforms data into the space of EMD lower bounds and performs pruning and partitioning at low cost, because computing these EMD lower bounds has constant complexity. Furthermore, we address two key problems, limited pruning power and unbalanced workloads, by enhancing each phase of the MELODY-JOIN framework. We conduct extensive experiments on real datasets. The results show that MELODY-JOIN outperforms the state-of-the-art technique by an order of magnitude, scales up better on large datasets than the state-of-the-art technique, and scales out well on distributed machines.
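The asymmetry MELODY-JOIN exploits is that while general EMD is expensive, certain EMD lower bounds are cheap. One classic example: between two equal-mass 1-D histograms with unit ground distance, the EMD is exactly the sum of absolute differences of the cumulative sums, computable in linear time. A sketch of that special case (the paper's actual lower bounds differ, but the pruning logic is the same flavor):

```python
def emd_1d(h1, h2):
    """Exact EMD between two equal-mass 1-D histograms with unit ground
    distance between adjacent bins: accumulate the running difference of
    masses and sum its absolute value (mass carried across each bin gap)."""
    cum, total = 0.0, 0.0
    for a, b in zip(h1, h2):
        cum += a - b
        total += abs(cum)
    return total
```

Since the EMD between 1-D projections of two histograms never exceeds the EMD of the originals, any pair whose projected EMD already exceeds the join threshold can be pruned without running the expensive full computation.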
【Keywords】: data analysis; image retrieval; pattern classification; probability; EMD similarity join operation; MELODY-JOIN; MapReduce algorithm; computational cost; data object; distributed machine; duplicate image retrieval; earth mover distance similarity; generic metric distance similarity; pattern analysis; probabilistic dataset; pruning power; skewed dataset; Aggregates; Approximation error; Earth; Histograms; Transforms; Vectors
【Paper Link】 【Pages】:820-831
【Authors】: Manish Gupta ; Jing Gao ; Xifeng Yan ; Hasan Cam ; Jiawei Han
【Abstract】: In the real world, various systems can be modeled using heterogeneous networks, which consist of entities of different types. Many problems on such networks can be mapped to an underlying critical problem of discovering the top-K subgraphs of entities with rare and surprising associations. Answering such subgraph queries efficiently involves two main challenges: (1) computing all matching subgraphs that satisfy the query and (2) ranking such results based on the rarity and the interestingness of the associations among entities in the subgraphs. Previous work on the matching problem can be harnessed for a naïve ranking-after-matching solution. However, for large graphs, subgraph queries may have an enormous number of matches, so it is inefficient to compute all matches when only the top-K matches are desired. In this paper, we address the two challenges of matching and ranking in top-K subgraph discovery as follows. First, we introduce two index structures for the network: a topology index and a graph maximum metapath weight index, both computed offline. Second, we propose novel top-K mechanisms that exploit these indexes to answer interesting subgraph queries online efficiently. Experimental results on several synthetic datasets and on the DBLP and Wikipedia datasets containing thousands of entities show the efficiency and the effectiveness of the proposed approach in computing interesting subgraphs.
【Keywords】: data mining; information networks; query processing; DBLP datasets; Wikipedia datasets; graph maximum metapath weight index; heterogeneous networks; index structures; information networks; matching subgraphs; query satisfaction; ranking-after-matching solution; subgraph queries; top-K interesting subgraph discovery; topology index; Data transfer; Indexes; Network topology; Organizations; Query processing; Topology; Upper bound
【Paper Link】 【Pages】:832-843
【Authors】: Bo Zong ; Ramya Raghavendra ; Mudhakar Srivatsa ; Xifeng Yan ; Ambuj K. Singh ; Kang-Won Lee
【Abstract】: Fast service placement, i.e., finding a set of nodes with enough free computation, storage, and network capacity, is a routine task in daily cloud administration. In this work, we formulate it as a subgraph matching problem. Unlike traditional settings such as approximate and probabilistic graphs, subgraph matching on data-center networks has two unique properties. (1) Node/edge labels representing vacant CPU cycles and network bandwidth change rapidly, while the network topology varies little. (2) There is a partial order on node/edge labels: one needs to place services on nodes with enough free capacity. Existing graph indexing techniques have not considered very frequent label updates, and none of them supports a partial order on numeric labels. We therefore propose a new graph index framework, Gradin, that addresses both challenges. Gradin encodes subgraphs into multi-dimensional vectors and organizes them with indices such that it can efficiently search the matches of a query's subgraphs and combine them to form a full match. In particular, we analyze how the index parameters affect update and search performance with theoretical results. Moreover, a revised pruning algorithm is introduced to reduce unnecessary search during the combination of partial matches. Using both real and synthetic datasets, we demonstrate that Gradin outperforms the baseline approaches by up to 10 times.
【Keywords】: cloud computing; graph theory; pattern matching; query processing; search problems; CPU cycles; Gradin; cloud administration; cloud service placement; data-center networks; edge labels; frequent label updates; graph indexing techniques; index parameters; multidimensional vectors; network bandwidth; network connectivity; network topology; node labels; numeric labels; query subgraphs; revised pruning algorithm; search performance; storage; subgraph matching; update performance; Lead; Memory management
【Paper Link】 【Pages】:844-855
【Authors】: Wenqing Lin ; Xiaokui Xiao ; Gabriel Ghinita
【Abstract】: Mining frequent subgraphs from a large collection of graph objects is an important problem in several application domains such as bio-informatics, social networks, computer vision, etc. The main challenge in subgraph mining is efficiency, as (i) testing for graph isomorphisms is computationally intensive, and (ii) the cardinality of the graph collection to be mined may be very large. We propose a two-step filter-and-refinement approach that is suitable to massive parallelization within the scalable MapReduce computing model. We partition the collection of graphs among worker nodes, and each worker applies the filter step to determine a set of candidate subgraphs that are locally frequent in its partition. The union of all such graphs is the input to the refinement step, where each candidate is checked against all partitions and only the globally frequent graphs are retained. We devise a statistical threshold mechanism that allows us to predict which subgraphs have a high chance to become globally frequent, and thus reduce the computational overhead in the refinement step. We also propose effective strategies to avoid redundant computation in each round when searching for candidate graphs, as well as a lightweight graph compression mechanism to reduce the communication cost between machines. Extensive experimental evaluation results on several real-world large graph datasets show that the proposed approach clearly outperforms the existing state-of-the-art and provides a practical solution to the problem of frequent subgraph mining for massive collections of graphs.
【Keywords】: data compression; data mining; graph theory; parallel programming; statistical analysis; MapReduce; application domains; bioinformatics; computer vision; graph collection; graph isomorphisms; graph objects; large-scale frequent subgraph mining; lightweight graph compression mechanism; parallelization; real-world large graph datasets; refinement step; social networks; statistical threshold mechanism; two-step filter-and-refinement approach; Computational modeling; Data mining; Educational institutions; Lattices; Social network services; Sorting; Testing
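The filter-and-refine guarantee rests on a simple pigeonhole argument: a pattern that is globally frequent must be locally frequent in at least one partition, so the union of local candidates is a superset of the answer. A minimal single-machine sketch of the two steps, using item subsets as stand-ins for canonically labeled subgraphs (real subgraph mining additionally needs isomorphism testing and lattice pruning, which this sketch deliberately omits):

```python
from collections import Counter
from itertools import combinations

def locally_frequent(partition, theta):
    """Filter step: patterns (item subsets standing in for
    canonically labeled subgraphs) frequent within one partition."""
    counts = Counter()
    for record in partition:
        for size in range(1, len(record) + 1):
            for pattern in combinations(sorted(record), size):
                counts[pattern] += 1
    threshold = theta * len(partition)
    return {p for p, c in counts.items() if c >= threshold}

def filter_and_refine(partitions, theta):
    # Filter: union of locally frequent candidates from every partition.
    candidates = set()
    for part in partitions:
        candidates |= locally_frequent(part, theta)
    # Refine: count each candidate globally and keep the frequent ones.
    total = sum(len(p) for p in partitions)
    global_counts = Counter()
    for part in partitions:
        for record in part:
            items = set(record)
            for cand in candidates:
                if set(cand) <= items:
                    global_counts[cand] += 1
    return {c for c in candidates if global_counts[c] >= theta * total}
```

In the MapReduce setting, the filter step runs as one map phase per partition and the refine step as a second pass over all partitions, mirroring the two rounds described above.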
【Paper Link】 【Pages】:856-867
【Authors】: Wei Feng ; Jianyong Wang
【Abstract】: In Twitter, users can annotate tweets with hashtags to indicate the ongoing topics. Hashtags provide users a convenient way to categorize tweets. From the system's perspective, hashtags play an important role in tweet retrieval, event detection, topic tracking, advertising, etc. Annotating tweets with the right hashtags can lead to a better user experience. However, two problems remain unsolved during annotation: (1) Before the user decides to create a new hashtag, is there any way to help her/him find out whether some related hashtags have already been created and widely used? (2) Different users may have different preferences for categorizing tweets, yet little work has been done to study the personalization issue in hashtag recommendation. To address the above problems, we propose a statistical model for personalized hashtag recommendation in this paper. With millions of <tweet, hashtag> pairs being published every day, we are able to learn the complex mappings from tweets to hashtags with the wisdom of the crowd. Two questions are answered in the model: (1) Different from traditional item recommendation data, users and tweets in Twitter have rich auxiliary information like URLs, mentions, locations, social relations, etc. How can we incorporate these features for hashtag recommendation? (2) Different hashtags have different temporal characteristics: hashtags related to breaking events in the physical world show a strong rise-and-fall temporal pattern, while other hashtags remain stable in the system. How can we incorporate such hashtag-related features into the recommendation? With all the above factors considered, we show that our model successfully outperforms existing methods on real datasets crawled from Twitter.
【Keywords】: social networking (online); #; Twitter; advertising; complex mappings; event detection; explicit topics; hashtags; personalization issue; personalized hashtag recommendation; rise-and-fall temporal pattern; statistical model; temporal characteristics; topic tracking; tweet annotation; tweet categorization; tweet retrieval; tweets; Advertising; Data models; History; Joining processes; Training; Twitter; Web pages
【Paper Link】 【Pages】:868-879
【Authors】: Wei Kang ; Anthony K. H. Tung ; Feng Zhao ; Xinyu Li
【Abstract】: In recent years, much effort has been invested in analyzing social network data. However, it remains a great challenge to support interactive exploration of such huge amounts of data. In this paper, we propose Vesta, a system that enables visual exploration of social network data via tag clouds. Under Vesta, users can interactively explore and extract summaries of social network contents published in a certain spatial region during a certain period of time. These summaries are represented using a novel concept called hierarchical tag clouds, which allows users to zoom in/out to explore more specific/general tag summaries. In Vesta, the spatiotemporal data is split into partitions. A novel biclustering approach is applied to each partition to extract summaries, which are then used to construct a hierarchical latent Dirichlet allocation model to generate a topic hierarchy. At runtime, the topic hierarchies in the relevant partitions of the user-specified region are merged in a probabilistic manner to form tag hierarchies, which are used to construct interactive hierarchical tag clouds for visualization. The results of an extensive experimental study verify the efficiency and effectiveness of Vesta.
【Keywords】: cloud computing; interactive systems; social networking (online); Vesta; biclustering approach; hierarchical latent Dirichlet allocation model; interactive hierarchical tag clouds; social network data; spatiotemporal data; spatiotemporal social content summarization; tag clouds; Context; Educational institutions; Merging; Spatiotemporal phenomena; Tag clouds; Twitter
【Paper Link】 【Pages】:880-891
【Authors】: Guoliang Li ; Jun Hu ; Jianhua Feng ; Kian-Lee Tan
【Abstract】: The rapid development of social networks has resulted in a proliferation of user-generated content (UGC). The UGC data, when properly analyzed, can be beneficial to many applications. For example, identifying a user's locations from microblogs is very important for effective location-based advertisement and recommendation. In this paper, we study the problem of identifying a user's locations from microblogs. This problem is rather challenging because the location information in a microblog is incomplete and we cannot get an accurate location from a single microblog. To address this challenge, we propose a global location identification method, called Glitter. Glitter combines multiple microblogs of a user and utilizes them to identify the user's locations. Glitter not only improves the quality of identifying a user's location but also supplements the location of a microblog so as to obtain an accurate location of a microblog. To facilitate location identification, Glitter organizes points of interest (POIs) into a tree structure where leaf nodes are POIs and non-leaf nodes are segments of POIs, e.g., countries, states, cities, districts, and streets. Using the tree structure, Glitter first extracts candidate locations from each microblog of a user which correspond to some tree nodes. Then Glitter aggregates these candidate locations and identifies top-k locations of the user. Using the identified top-k user locations, Glitter refines the candidate locations and computes top-k locations of each microblog. To achieve high recall, we enable fuzzy matching between locations and microblogs. We propose an incremental algorithm to support dynamic updates of microblogs. Experimental results on real-world datasets show that our method achieves high quality and good performance, and scales very well.
【Keywords】: fuzzy set theory; pattern matching; social networking (online); trees (mathematics); POIs; UGC; candidate location extraction; fuzzy matching; glitter; global location identification method; incremental algorithm; location-based advertisement; location-based recommendation; microblogs; nonleaf nodes; points of interest; social networks; top-k user location identification; tree nodes; tree structure; user-generated content; Aggregates; Cities and towns; Educational institutions; Films; Heuristic algorithms; Indexes; Twitter
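A toy sketch of the aggregation idea, assuming candidate locations are already extracted as root-to-node paths in the POI tree (e.g., (country, city)); each microblog votes once for every candidate node and all of its ancestors, and the most supported nodes win. This is only a stand-in for Glitter's refinement and fuzzy-matching machinery:

```python
from collections import Counter

def top_k_locations(microblog_candidates, k):
    """Aggregate candidate location paths across a user's microblogs:
    a candidate supports itself and all its ancestors in the POI
    hierarchy, mimicking tree-based aggregation."""
    votes = Counter()
    for cands in microblog_candidates:
        supported = set()
        for path in cands:
            for depth in range(1, len(path) + 1):
                supported.add(tuple(path[:depth]))  # node and ancestors
        for node in supported:  # one vote per microblog per node
            votes[node] += 1
    return [loc for loc, _ in votes.most_common(k)]
```

General nodes (countries) naturally accumulate more votes than specific ones (streets); the actual system balances this by ranking within tree levels rather than globally.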
【Paper Link】 【Pages】:892-903
【Authors】: Rong-Hua Li ; Jeffrey Xu Yu ; Rui Mao ; Tan Jin
【Abstract】: In this paper, we introduce two types of query evaluation problems on uncertain graphs: expectation query evaluation and threshold query evaluation. Since these two problems are #P-complete, most previous solutions for these problems are based on naive Monte-Carlo (NMC) sampling. However, NMC typically leads to a large variance, which significantly reduces its effectiveness. To overcome this problem, we propose two classes of estimators, called class-I and class-II estimators, based on the idea of stratified sampling. More specifically, we first propose two classes of basic stratified sampling estimators, named BSS-I and BSS-II, which partition the entire population into 2r and r+1 strata by picking r edges respectively. Second, to reduce the variance, we find that both BSS-I and BSS-II can be recursively performed in each stratum. Therefore, we propose two classes of recursive stratified sampling estimators called RSS-I and RSS-II respectively. Third, for a particular kind of problem, we propose two cut-set based stratified sampling estimators, named BCSS and RCSS, to further improve the accuracy of the class-I and class-II estimators. For all the proposed estimators, we prove that they are unbiased and their variances are significantly smaller than that of NMC. Moreover, the time complexity of all the proposed estimators is the same as that of NMC under a mild assumption. In addition, we also apply the proposed estimators to influence function evaluation and the expected-reliable distance query problem, which are two instances of the query evaluation problems on uncertain graphs. Finally, we conduct extensive experiments to evaluate our estimators, and the results demonstrate the efficiency, accuracy, and scalability of the proposed estimators.
【Keywords】: Monte Carlo methods; computational complexity; graph theory; query processing; sampling methods; #P-complete; BCSS; BSS-I; BSS-II; NMC sampling; RCSS; class-I estimator; class-II estimator; cut-set based stratified sampling estimators; expectation query evaluation; expected-reliable distance query problem; function evaluation; naive Monte-Carlo sampling; recursive stratified sampling; threshold query evaluation; time complexity; uncertain graphs; Accuracy; Nickel; Partitioning algorithms; Query processing; Resource management; Time complexity; Vectors
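To see why stratification helps, compare naive Monte-Carlo with a basic stratified estimator for s-t reachability on a tiny uncertain graph. The sketch below conditions on a single picked edge (two strata), whereas the paper's BSS/RSS estimators pick r edges and recurse; it is illustrative only:

```python
import random

def reachable(edges_present, n, s, t):
    """DFS over the edges sampled as present."""
    adj = {i: [] for i in range(n)}
    for u, v in edges_present:
        adj[u].append(v)
        adj[v].append(u)
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        if u == t:
            return True
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return s == t

def nmc(edges, probs, n, s, t, samples, rng):
    """Naive Monte-Carlo: sample every edge independently."""
    hits = 0
    for _ in range(samples):
        present = [e for e, p in zip(edges, probs) if rng.random() < p]
        hits += reachable(present, n, s, t)
    return hits / samples

def stratified(edges, probs, n, s, t, samples, rng):
    """Basic stratified sampling on the first edge: condition on it
    being present/absent, sample the rest within each stratum, and
    weight each stratum by its probability."""
    e0, p0 = edges[0], probs[0]
    est = 0.0
    for state, weight in ((True, p0), (False, 1 - p0)):
        per = samples // 2
        hits = 0
        for _ in range(per):
            present = [e0] if state else []
            present += [e for e, p in zip(edges[1:], probs[1:])
                        if rng.random() < p]
            hits += reachable(present, n, s, t)
        est += weight * hits / per
    return est
```

Within the stratum where the picked edge is present, reachability on this tiny graph is deterministic, so that stratum contributes zero variance; this is exactly the effect that makes stratified estimators tighter than NMC at the same sample budget.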
【Paper Link】 【Pages】:904-915
【Authors】: Walaa Eldin Moustafa ; Angelika Kimmig ; Amol Deshpande ; Lise Getoor
【Abstract】: There is a growing need for methods that can represent and query uncertain graphs. These uncertain graphs are often the result of an information extraction and integration system that attempts to extract an entity graph or a knowledge graph from multiple unstructured sources [25], [7]. Such an integration typically leads to identity uncertainty, as different data sources may use different references to the same underlying real-world entities. Integration usually also introduces additional uncertainty on node attributes and edge existence. In this paper, we propose the notion of a probabilistic entity graph (PEG), a formal model that uniformly and systematically addresses these three types of uncertainty. A PEG is a probabilistic graph model that defines a distribution over possible graphs at the entity level. We introduce a general framework for constructing a PEG given uncertain data at the reference level and develop efficient algorithms to answer subgraph pattern matching queries in this setting. Our algorithms are based on two novel ideas: context-aware path indexing and reduction by join-candidates, which drastically reduce the query search space. A comprehensive experimental evaluation shows that our approach outperforms baseline implementations by orders of magnitude.
【Keywords】: database indexing; graph theory; pattern matching; probability; query processing; ubiquitous computing; PEG; context-aware path indexing; entity graph extraction; formal probabilistic graph model; identity linkage uncertainty; information extraction system; information integration system; knowledge graph extraction; probabilistic entity graph; query search space reduction; reduction by join-candidates; subgraph pattern matching queries; uncertain graphs; Equations; Markov random fields; Pattern matching; Probabilistic logic; Probability distribution; Random variables; Uncertainty
【Paper Link】 【Pages】:916-927
【Authors】: Bahar Qarabaqi ; Mirek Riedewald
【Abstract】: We propose techniques for exploratory search in large databases. The goal is to provide new functionality that aids users in homing in on the right query conditions to find what they are looking for. Query refinement proceeds interactively by repeatedly consulting the user to manage query conditions. This process is characterized by three key challenges: (1) dealing with incomplete and imprecise user input, (2) keeping user effort low, and (3) guaranteeing interactive system response time. We address the first two challenges with a probability-based framework that guides the user to the most important query conditions. To recover from input errors, we introduce the notion of sensitivity and propose efficient algorithms for identifying the most sensitive user input, i.e., those inputs that had the greatest influence on the query results. For the third challenge, we develop techniques that can deliver estimates of the required probabilities within a given hard real-time limit and are able to adapt automatically as the interactive query refinement proceeds.
【Keywords】: probability; query processing; exploratory search; imprecise queries; incomplete imprecise user input; interactive query refinement; interactive system response time; most sensitive user input identification; probability-based framework; query conditions management; sensitivity notion; user effort; user-driven query refinement; Birds; Databases; Image color analysis; Probabilistic logic; Probability distribution; Sensitivity; Uncertainty
【Paper Link】 【Pages】:928-939
【Authors】: Sara Cohen ; Nerya Or
【Abstract】: Determining similarity between trees is an important problem in a variety of areas. The subtree similarity-search problem is that of finding, given a tree Q and a large set of trees Γ = {T1, ..., Tn}, the subtrees of trees among Γ that are most similar to Q. Similarity is defined using some tree distance function. While subtree similarity-search has been studied in the past, solutions mostly focused on specific tree distance functions, and were usually applicable only to ordered trees. This paper presents an efficient new algorithm that solves the subtree similarity-search problem, and is compatible with a wide family of tree distance functions (for both ordered and unordered trees). Extensive experimentation confirms the efficiency and scalability of the algorithm, which displays consistently good runtime even for large queries and datasets.
【Keywords】: computational complexity; tree searching; NP-complete; subtree similarity-search problem; tree distance function; unordered trees; Context; Databases; Heuristic algorithms; Polynomials; Runtime; TV; Vegetation
【Paper Link】 【Pages】:940-951
【Authors】: Yong Zeng ; Zhifeng Bao ; Tok Wang Ling ; H. V. Jagadish ; Guoliang Li
【Abstract】: When users issue a query to a database, they have expectations about the results. If what they search for is unavailable in the database, the system will return an empty result or, worse, erroneous mismatch results. We call this problem the MisMatch Problem. In this paper, we solve the MisMatch problem in the context of XML keyword search. Our solution is based on two novel concepts that we introduce: Target Node Type and Distinguishability. Using these concepts, we develop a low-cost post-processing algorithm on the results of query evaluation to detect the MisMatch problem and generate helpful suggestions to users. Our approach has three noteworthy features: (1) for queries with the MisMatch problem, it generates the explanation, suggested queries and their sample results as the output to users, helping users judge whether the MisMatch problem is solved without reading all query results; (2) it is portable as it can work with any LCA-based matching semantics and is orthogonal to the choice of result retrieval method adopted; (3) it is lightweight in the way that it occupies a very small proportion of the whole query evaluation time. Extensive experiments on three real datasets verify the effectiveness, efficiency and scalability of our approach. A search engine called XClear has been built and is available at http://xclear.comp.nus.edu.sg.
【Keywords】: XML; query processing; search engines; LCA-based matching semantics; XClear; XML keyword search; database querying; distinguishability; erroneous mismatch results; low-cost post-processing algorithm; mismatch problem; mismatch trap; query evaluation time; result retrieval method; search engine; target node type; Context; Image color analysis; Keyword search; Manganese; Portable computers; Semantics; XML
【Paper Link】 【Pages】:952-963
【Authors】: Shizuya Hakuta ; Sebastian Maneth ; Keisuke Nakano ; Hideya Iwasaki
【Abstract】: Streaming of XML transformations is a challenging task and only a few existing systems support streaming. Research approaches generally define custom fragments of XQuery and XPath that are amenable to streaming, and then design custom algorithms for each fragment. These languages have several shortcomings. Here we take a more principled approach to the problem of streaming XQuery-based transformations. We start with an elegant transducer model for which many static analysis problems are well understood: the Macro Forest Transducer (MFT). We show that a large fragment of XQuery can be translated into MFTs - indeed, a fragment that can express important features missing from other XQuery stream engines such as GCX: our fragment supports XPath predicates and let-statements. We then use an existing streaming engine for MFTs and apply a well-founded set of optimizations from functional programming such as strictness analysis and deforestation. Our prototype achieves time and memory efficiency comparable to the fastest known engine for XQuery streaming, GCX. This is surprising because our engine relies on OCaml's built-in garbage collector and does not use any specialized buffer management, while GCX's efficiency is due to clever and explicit buffer management.
【Keywords】: XML; buffer storage; functional programming; query processing; MFT; OCaml; XML transformation streaming; XPath; XQuery stream engines; XQuery streaming; buffer management; deforestation; functional programming; garbage collector; macro forest transducer; static analysis problems; strictness analysis
【Paper Link】 【Pages】:964-975
【Authors】: Anish Das Sarma ; Aditya G. Parameswaran ; Hector Garcia-Molina ; Alon Y. Halevy
【Abstract】: We consider the problem of using humans to find a bounded number of items satisfying certain properties, from a data set. For instance, we may want humans to identify a select number of travel photos from a data set of photos to display on a travel website, or a candidate set of resumes that meet certain requirements from a large pool of applicants. Since data sets can be enormous, and since monetary cost and latency of data processing with humans can be large, optimizing the use of humans for finding items is an important challenge. We formally define the problem using the metrics of cost and time, and design optimal algorithms that span the skyline of cost and time, i.e., we provide designers the ability to control the cost vs. time trade-off. We study the deterministic as well as error-prone human answer settings, along with multiplicative and additive approximations. Lastly, we study how we may design algorithms with specific expected cost and time measures.
【Keywords】: Web sites; data handling; social sciences computing; additive approximations; cost control; crowd-powered find algorithms; data processing; data set; design algorithms; deterministic human answer settings; error-prone human answer settings; monetary cost; multiplicative approximations; optimal algorithms; time measures; time trade-off; travel Website; travel photos; Algorithm design and analysis; Approximation algorithms; Approximation methods; Databases; Machine learning algorithms; Measurement; Poles and towers
【Paper Link】 【Pages】:976-987
【Authors】: Ju Fan ; Meiyu Lu ; Beng Chin Ooi ; Wang-Chiew Tan ; Meihui Zhang
【Abstract】: The Web is teeming with rich structured information in the form of HTML tables, which provides us with the opportunity to build a knowledge repository by integrating these tables. An essential problem of web data integration is to discover semantic correspondences between web table columns, and schema matching is a popular means to determine the semantic correspondences. However, conventional schema matching techniques are not always effective for web table matching due to the incompleteness in web tables. In this paper, we propose a two-pronged approach for web table matching that effectively addresses the above difficulties. First, we propose a concept-based approach that maps each column of a web table to the best concept, in a well-developed knowledge base, that represents it. This approach overcomes the problem that sometimes values of two web table columns may be disjoint, even though the columns are related, due to incompleteness in the column values. Second, we develop a hybrid machine-crowdsourcing framework that leverages human intelligence to discern the concepts for “difficult” columns. Our overall framework assigns the most “beneficial” column-to-concept matching tasks to the crowd under a given budget and utilizes the crowdsourcing result to help our algorithm infer the best matches for the rest of the columns. We validate the effectiveness of our framework through an extensive experimental study over two real-world web table data sets. The results show that our two-pronged approach outperforms existing schema matching techniques at only a low cost for crowdsourcing.
【Keywords】: Internet; data integration; hypermedia markup languages; knowledge based systems; pattern matching; HTML tables; Web data integration; Web table columns; Web table data sets; Web table matching; column values; column-to-concept matching tasks; concept-based approach; human intelligence; hybrid machine-crowdsourcing system; knowledge base; knowledge repository; schema matching; semantic correspondences; structured information; two-pronged approach; Accuracy; Catalogs; Educational institutions; Entropy; Films; Motion pictures; Semantics
【Paper Link】 【Pages】:988-999
【Authors】: Sarath Kumar Kondreddi ; Peter Triantafillou ; Gerhard Weikum
【Abstract】: Automatic information extraction (IE) enables the construction of very large knowledge bases (KBs), with relational facts on millions of entities from text corpora and Web sources. However, such KBs contain errors and they are far from being complete. This motivates the need for exploiting human intelligence and knowledge using crowd-based human computing (HC) for assessing the validity of facts and for gathering additional knowledge. This paper presents a novel system architecture, called Higgins, which shows how to effectively integrate an IE engine and a HC engine. Higgins generates game questions where players choose or fill in missing relations for subject-relation-object triples. For generating multiple-choice answer candidates, we have constructed a large dictionary of entity names and relational phrases, and have developed specifically designed statistical language models for phrase relatedness. To this end, we combine semantic resources like WordNet, ConceptNet, and others with statistics derived from a large Web corpus. We demonstrate the effectiveness of Higgins for knowledge acquisition by crowdsourced gathering of relationships between characters in narrative descriptions of movies and books.
【Keywords】: computational linguistics; human factors; information retrieval; knowledge acquisition; semantic networks; very large databases; HC; Higgins system architecture; IE; KBs; automatic information extraction; crowd-based human computing; crowdsourced knowledge acquisition; entity names; multiple-choice answer candidate generation; phrase relatedness; relational phrases; semantic resources; statistical language models; subject-relation-object triples; very large knowledge bases; Context; Dictionaries; Encyclopedias; Engines; Games; Motion pictures; Semantics
【Paper Link】 【Pages】:1000-1011
【Authors】: Farhan Tauheed ; Thomas Heinis ; Felix Schürmann ; Henry Markram ; Anastasia Ailamaki
【Abstract】: Scientists in many disciplines use spatial mesh models to study physical phenomena. Simulating natural phenomena by changing meshes over time helps to better understand the phenomena. The higher the precision of the mesh models, the more insight the scientists gain; they thus continuously increase the detail of the meshes and build them as detailed as their instruments and the simulation hardware allow. In the process, the data volume also increases, considerably slowing down the execution of the spatial range queries needed to monitor the simulation. Indexing speeds up range query execution, but the overhead to maintain the indexes is considerable because almost the entire mesh changes unpredictably at every simulation step. Using a simple linear scan, on the other hand, requires accessing the entire mesh, and performance deteriorates as the size of the dataset grows. In this paper we propose OCTOPUS, a strategy for executing range queries on mesh datasets that change unpredictably during simulations. OCTOPUS uses the key insight that the mesh surface along with the mesh connectivity is sufficient to retrieve accurate query results efficiently. With this novel query execution strategy, OCTOPUS minimizes index maintenance cost and reduces query execution time considerably. Our experiments show that OCTOPUS achieves a speedup between 7.3× and 9.2× compared to the state of the art and that it scales better with increasing mesh dataset size and detail.
【Keywords】: data handling; query processing; OCTOPUS strategy; data volume; dynamic mesh datasets; index maintenance cost; mesh datasets; mesh models; query execution; simple linear scan; spatial range queries; Analytical models; Data models; Geometry; Indexes; Monitoring; Neurons; Probes
【Paper Link】 【Pages】:1012-1023
【Authors】: Pinghui Wang ; Wenbo He ; Xue Liu
【Abstract】: Recently, map services (e.g., Google Maps) and location-based online social networks (e.g., Foursquare) have attracted a lot of attention from users and businesses. With the increasing popularity of these location-based services, exploring and characterizing points of interest (PoIs) such as restaurants and hotels on maps provides valuable information for applications such as start-up marketing research. Due to the lack of full direct access to PoI databases, it is infeasible to exhaustively search and collect all PoIs within a large area using public APIs, which usually impose a limit on the maximum query rate. In this paper, we propose an effective and efficient method to sample PoIs on maps, and give unbiased estimators to calculate PoI statistics such as sum and average aggregates. Experimental results based on real datasets show that our method is efficient, and requires six times fewer queries than state-of-the-art methods to achieve the same accuracy.
【Keywords】: cartography; information services; sampling methods; PoI characterization; PoI database; PoI exploration; PoI statistics; application program interface; average aggregates; location-based online social networks; location-based services; map services; points-of-interests; public API; query rate; sampling method; start-up marketing research; sum aggregates; Accuracy; Aggregates; Cities and towns; Databases; Google; Probability; Sampling methods
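When PoIs are drawn with replacement and with known, unequal selection probabilities (as region-query-based sampling induces), a Hansen-Hurwitz style estimator gives an unbiased estimate of a population total such as a PoI count or a sum of ratings. A sketch under that assumption; the paper's actual sampling scheme and estimators differ in detail:

```python
import random

def hansen_hurwitz_total(draws):
    """Unbiased Hansen-Hurwitz estimator of a population total.
    `draws` is a list of (value, selection_probability) pairs obtained
    by sampling with replacement; the probabilities sum to 1 over the
    whole (unknown) population."""
    return sum(y / p for y, p in draws) / len(draws)

def sample_pois(population, weights, m, rng):
    """Draw m PoIs with replacement, with probability proportional to
    weight (e.g., how often a region query would surface the PoI)."""
    total_w = sum(weights)
    probs = [w / total_w for w in weights]
    idx = rng.choices(range(len(population)), weights=probs, k=m)
    return [(population[i], probs[i]) for i in idx]
```

Dividing each observed value by its selection probability exactly cancels the bias introduced by unequal inclusion rates, which is why the estimator remains unbiased even when popular PoIs are heavily over-sampled.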
【Paper Link】 【Pages】:1024-1035
【Authors】: Pimin Konstantin Kefaloukos ; Marcos Antonio Vaz Salles ; Martin Zachariasen
【Abstract】: Creating good maps is the challenge of map generalization. An important generalization method is selecting subsets of the data to be shown at different zoom-levels of a zoomable map, subject to a set of spatial constraints. Applying these constraints serves the dual purpose of increasing the information quality of the map and improving the performance of data transfer and rendering. Unfortunately, with current tools, users must explicitly specify which objects to show at each zoom level of their map, while keeping their application constraints implicit. This paper introduces a novel declarative approach to map generalization based on a language called CVL, the Cartographic Visualization Language. In contrast to current tools, users declare application constraints and object importance in CVL, while leaving the selection of objects implicit. In order to compute an explicit selection of objects, CVL scripts are translated into an algorithmic search task. We show how this translation allows for reuse of existing algorithms from the optimization literature, while at the same time supporting fully pluggable, user-defined constraints and object weight functions. In addition, we show how to evaluate CVL entirely inside a relational database. The latter allows users to seamlessly integrate storage of geospatial data with its transformation into map visualizations. In a set of experiments with a variety of real-world data sets, we find that CVL produces generalizations in reasonable time for off-line processing; furthermore, the quality of the generalizations is high with respect to the chosen objective function.
【Keywords】: cartography; data visualisation; optimisation; relational databases; rendering (computer graphics); visual databases; CVL scripts; algorithmic search task; cartographic visualization language; data transfer performance; declarative cartography; explicit object selection; geospatial data storage; geospatial dataset in-database map generalization; map information quality; map visualizations; object weight functions; offline processing; relational database; rendering; zoomable map; Databases; Sociology; Statistics; Visualization
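The selection task the CVL abstract describes, choosing which objects to show per zoom level subject to spatial constraints and object importance, can be illustrated with a minimal sketch. This is not the CVL algorithm; it is a hypothetical greedy heuristic for one common constraint (at most K objects per grid cell), with all names invented for illustration:

```python
def select_for_zoom(objects, cell_of, capacity):
    """Greedy zoom-level selection sketch: take objects in descending
    importance, but admit at most `capacity` objects per grid cell.
    objects: list of (object_id, weight); cell_of: object_id -> cell key."""
    chosen, load = [], {}
    for oid, weight in sorted(objects, key=lambda o: -o[1]):
        cell = cell_of(oid)
        if load.get(cell, 0) < capacity:
            chosen.append(oid)
            load[cell] = load.get(cell, 0) + 1
    return chosen
```

In CVL the constraint and the weight function are user-defined and pluggable; here both are hard-wired to keep the sketch short.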
【Paper Link】 【Pages】:1036-1047
【Authors】: Ziawasch Abedjan ; Jorge-Arnulfo Quiané-Ruiz ; Felix Naumann
【Abstract】: The discovery of all unique (and non-unique) column combinations in an unknown dataset is at the core of any data profiling effort. Unique column combinations resemble candidate keys of a relational dataset. Several research approaches have focused on their efficient discovery in a given, static dataset. However, none of these approaches are suitable for applications on dynamic datasets, such as transactional databases, social networks, and scientific applications. In these cases, data profiling techniques should be able to efficiently discover new uniques and non-uniques (and validate old ones) after tuple inserts or deletes, without re-profiling the entire dataset. We present Swan, the first approach to efficiently discover unique and non-unique constraints on dynamic datasets that is independent of the initial dataset size. In particular, Swan makes use of intelligently chosen indices to minimize access to old data. We perform an exhaustive analysis of Swan and compare it with two state-of-the-art techniques for unique discovery: Gordian and Ducc. The results show that Swan significantly outperforms both, as well as their incremental adaptations. For inserts, Swan is more than 63x faster than Gordian and up to 50x faster than Ducc. For deletes, Swan is more than 15x faster than Gordian and up to 1 order of magnitude faster than Ducc. In fact, Swan even improves on the static case by dividing the dataset into a static part and a set of inserts.
【Keywords】: data mining; Ducc technique; Gordian technique; Swan technique; data profiling techniques; dataset discovery; dynamic dataset; relational dataset; scientific applications; social networks; static dataset; transactional databases; unique column combinations detection; Data models; Data structures; Heuristic algorithms; Indexing; Proteins; Query processing
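The core notion in the Swan abstract, validating a unique column combination incrementally after inserts instead of re-profiling, can be sketched as follows. This is not Swan's index structure, just a toy illustration (all names hypothetical) of why an index over the already-seen key values makes insert validation independent of re-scanning old data:

```python
class IncrementalUniqueChecker:
    """Toy sketch: track whether a column combination stays unique as
    tuples are inserted, using a set index over previously seen keys."""
    def __init__(self, rows, cols):
        self.cols = cols          # column combination under test, e.g. ("a", "b")
        self.keys = set()         # index over key values seen so far
        self.unique = True
        for r in rows:
            self.insert(r)
    def insert(self, row):
        key = tuple(row[c] for c in self.cols)
        if key in self.keys:      # duplicate key: uniqueness is violated
            self.unique = False
        self.keys.add(key)
        return self.unique
```

Each insert costs one set lookup rather than a scan over the whole dataset, which is the intuition behind making discovery independent of the initial dataset size.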
【Paper Link】 【Pages】:1048-1059
【Authors】: Yutian Sun ; Jianwen Su ; Budan Wu ; Jian Yang
【Abstract】: An important omission in current development practice for business process (or workflow) management systems is modeling of data & access for a business process, including the relationship between the process data and the persistent data in the underlying enterprise database(s). This paper develops and studies a new approach to modeling data for business processes: representing data used by a process as a hierarchically structured business entity with (i) keys, local keys, and update constraints, and (ii) a set of data mapping rules defining the exact correspondence between entity data values and values in the enterprise database. This paper makes the following technical contributions: (1) A data mapping language is formulated based on path expressions, and shown to coincide with a subclass of the schema mapping language Clio. (2) Two new notions are formulated: Updatability allows each update on a business entity (or database) to be translated to updates on the database (or, respectively, the business entity), a fundamental requirement for process implementation. Isolation reflects that updates by one process execution do not alter data used by another running process. The property provides an important clue in process design. (3) Decision algorithms for updatability and isolation are presented, and they can be easily adapted for data mappings expressed in the subclass of Clio.
【Keywords】: business data processing; data handling; data structures; Clio schema mapping language; business process management systems; data mapping language; data mapping rules; data representation; data-and-access modeling; decision algorithms; enterprise database; entity data values; hierarchically structured business entity; isolation notion; path expressions; persistent data; process data; updatability notion; workflow management systems; Business; Computational modeling; Context; Data models; Databases; Maintenance engineering; Process design
【Paper Link】 【Pages】:1060-1071
【Authors】: Petar Jovanovic ; Alkis Simitsis ; Kevin Wilkinson
【Abstract】: A complex analytic flow in a modern enterprise may perform multiple, logically independent, tasks where each task uses a different processing engine. We term these multi-engine flows hybrid flows. Using multiple processing engines has advantages such as rapid deployment, better performance, lower cost, and so on. However, as the number and variety of these engines grows, developing and maintaining hybrid flows is a significant challenge because they are specified at a physical level and so are hard to design and may break as the infrastructure evolves. We address this problem by enabling flow design at a logical level and automatic translation to physical flows. There are three main challenges. First, we describe how flows can be represented at a logical level, abstracting away details of any underlying processing engine. Second, we show how a physical flow, expressed in a programming language or some design GUI, can be imported and converted to a logical flow. In particular, we show how a hybrid flow comprising subflows in different languages can be imported and composed as a single, logical flow for subsequent manipulation. Third, we describe how a logical flow is translated into one or more physical flows for execution by the processing engines. The paper concludes with experimental results and example transformations that demonstrate the correctness and utility of our system.
【Keywords】: business data processing; directed graphs; design GUI; engine independence; enterprise; flow design; graphical user interface; hybrid flow; logical analytic flows; multiengine flow; physical flow; processing engine; programming language; Computer languages; Databases; Dictionaries; Encoding; Engines; Generators; Semantics
【Paper Link】 【Pages】:1072-1083
【Authors】: Yusheng Xie ; Diana Palsetia ; Goce Trajcevski ; Ankit Agrawal ; Alok N. Choudhary
【Abstract】: We address the problem of large scale probabilistic association rule mining and consider the trade-offs between the accuracy of the mining results and the quest for scalability on modest hardware infrastructure. We demonstrate how extensions and adaptations of research findings can be integrated in an industrial application, and we present the commercially deployed SILVERBACK framework, developed at Voxsup Inc. SILVERBACK tackles the storage efficiency problem by proposing a probabilistic columnar infrastructure and using Bloom filters and reservoir sampling techniques. In addition, an Apriori-based probabilistic pruning technique is introduced for mining frequent item-sets. The proposed target-driven technique yields a significant reduction on the size of the frequent item-set candidates. We present extensive experimental evaluations which demonstrate the benefits of a context-aware incorporation of infrastructure limitations into corresponding research techniques. The experiments indicate that, when compared to the traditional Hadoop-based approach for improving scalability by adding more hosts, SILVERBACK - which has been commercially deployed and developed at Voxsup Inc. since May 2011 - has much better run-time performance with negligible accuracy sacrifices.
【Keywords】: data mining; data structures; probability; sampling methods; storage management; Apriori; Bloom filters; Hadoop-based approach; SILVERBACK framework; Voxsup Inc; columnar probabilistic databases; context-aware infrastructure limitation incorporation; frequent item-set mining; large scale probabilistic association rule mining; modest hardware infrastructure; probabilistic pruning technique; reservoir sampling techniques; scalable association mining; storage efficiency problem; temporal data; Accuracy; Databases
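The SILVERBACK abstract names two standard probabilistic building blocks, Bloom filters and reservoir sampling. As a minimal illustration of those primitives (not the SILVERBACK implementation; parameters and names are invented):

```python
import hashlib
import random

class BloomFilter:
    """Minimal Bloom filter: may report false positives, never false negatives."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)
    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m
    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1
    def might_contain(self, item):
        return all(self.bits[p] for p in self._positions(item))

def reservoir_sample(stream, k, rng=random.Random(0)):
    """Algorithm R: uniform sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item   # replace with decreasing probability k/(i+1)
    return sample
```

Both structures trade exactness for bounded memory, which matches the paper's stated goal of scaling on modest hardware.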
【Paper Link】 【Pages】:1084-1095
【Authors】: Ramakrishna Varadarajan ; Vivek Bharathan ; Ariel Cary ; Jaimin Dave ; Sreenath Bodagala
【Abstract】: In this paper, we present Vertica's customizable physical design tool, called the DBDesigner (DBD), that produces designs optimized for various scenarios and applications. For a given workload and space budget, DBD automatically recommends a physical design that optimizes query performance, storage footprint, fault tolerance and recovery to meet different customer requirements. Vertica is a distributed, massively parallel columnar database that physically organizes data into projections. Projections are attribute subsets from one or more tables with tuples sorted by one or more attributes, that are replicated or segmented (distributed) on cluster nodes. The key challenges involved in projection design are picking appropriate column sets, sort orders, cluster data distributions and column encodings. To achieve the desired trade-off between query performance and storage footprint, DBD operates under three different design policies: (a) load-optimized, (b) query-optimized or (c) balanced. These policies indirectly control the number of projections proposed and queries optimized to achieve the desired balance. To cater to query workloads that evolve over time, DBD also operates in a comprehensive and incremental design mode. In addition, DBD lets users override specific features of projection design based on their intimate knowledge about the data and query workloads. We present the complete physical design algorithm, describing in detail how projection candidates are efficiently explored and evaluated using optimizer's cost and benefit model. Our experimental results show that DBD produces good physical designs that satisfy a variety of customer use cases.
【Keywords】: parallel databases; query processing; DBD; DBDesigner tool; Vertica analytic database; balanced design policy; customer requirements; customer use case; customizable physical design tool; distributed massively parallel columnar database; fault recovery; fault tolerance; incremental design mode; load-optimized design policy; projection candidates; projection design; query performance; query workloads; query-optimized design policy; storage footprint; Algorithm design and analysis; Cities and towns; Distributed databases; Encoding; Fault tolerance; Fault tolerant systems
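The DBD abstract describes evaluating projection candidates under a space budget using the optimizer's cost and benefit model. A drastically simplified, hypothetical sketch of that kind of budgeted candidate selection (not Vertica's algorithm; the greedy benefit-per-size heuristic and all field names are assumptions):

```python
def choose_projections(candidates, space_budget):
    """Greedy physical-design sketch: take candidate projections in order of
    benefit-per-unit-size, as long as they fit the remaining space budget.
    candidates: list of {"name": str, "size": num, "benefit": num}."""
    chosen, used = [], 0
    pool = sorted(candidates, key=lambda c: c["benefit"] / c["size"], reverse=True)
    for cand in pool:
        if used + cand["size"] <= space_budget:
            chosen.append(cand["name"])
            used += cand["size"]
    return chosen
```

A real designer would re-cost the workload after each pick, since projections interact; the sketch only conveys the budgeted trade-off.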
【Paper Link】 【Pages】:1096-1107
【Authors】: Yanhua Li ; Moritz Steiner ; Jie Bao ; Limin Wang ; Ting Zhu
【Abstract】: Location based social networks (LBSNs) are becoming increasingly popular with the fast deployment of broadband mobile networks and the growing prevalence of versatile mobile devices. This success has attracted great interest in studying and measuring the characteristics of LBSNs, such as Facebook Places, Yelp, and Google+ Local. However, it is often prohibitive, and sometimes too costly, to obtain a detailed and complete snapshot of a LBSN due to its usually massive scale. In this work, taking Foursquare as an example, we focus on sampling and estimating restricted geographic regions in LBSNs, such as a city or a country. By exploiting the application programming interfaces (APIs) provided by Foursquare for geographic search, we first introduce how to obtain the “ground truth”, namely, a complete set of all venues (i.e., places) in a specified region. Then, we propose random region sampling algorithms that allow us to draw representative samples of venues, and design unbiased estimators of regional characteristics of venues. We validate the efficiency of our sampling algorithms on Foursquare using complete datasets obtained from 12 regions, such as Switzerland, New York City and Los Angeles. Our results are applicable to perform sampling and estimation in all GeoDatabases, such as Facebook Places, Yelp, and Google+ Local, which have similar venue search APIs as Foursquare. These location service providers can also benefit from our results to enable efficient online statistic estimation.
【Keywords】: application program interfaces; data handling; geographic information systems; mobile computing; social networking (online); API; Facebook; Google; LBSN; Yelp; application programming interfaces; broadband mobile networks; dynamic range calibration; geographic regions; geosocial data; location based social networks; mobile devices; random region sampling algorithms; region estimation; region sampling; regional characteristics; Algorithm design and analysis; Cities and towns; Clustering algorithms; Educational institutions; Estimation; Facebook
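The estimation idea in the abstract above, drawing representative samples of a region and building an unbiased estimator of a regional total, can be sketched with a toy Hansen-Hurwitz-style estimator. This is an illustration of the general principle, not the paper's sampling algorithm; the grid-cell model and names are assumptions:

```python
import random

def estimate_total_venues(counts_by_cell, n_samples, rng=random.Random(42)):
    """Sample grid cells uniformly with replacement and scale the mean
    observed venue count by the number of cells. Uniform sampling makes
    the scaled mean an unbiased estimator of the regional total."""
    cells = list(counts_by_cell)
    sampled = [counts_by_cell[rng.choice(cells)] for _ in range(n_samples)]
    return len(cells) * sum(sampled) / n_samples
```

In practice the per-cell counts would come from venue-search API calls rather than a dictionary, and the variance of the estimate shrinks as `n_samples` grows.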
【Paper Link】 【Pages】:1108-1119
【Authors】: Nga Tran ; Andrew Lamb ; Lakshmikant Shrinivas ; Sreenath Bodagala ; Jaimin Dave
【Abstract】: The Vertica SQL Query Optimizer was written from the ground up for the Vertica Analytic Database. Its design and the tradeoffs we encountered during its implementation argue that the full power of novel database systems can only be realized with a carefully crafted custom Query Optimizer written specifically for the system in which it operates.
【Keywords】: SQL; query processing; novel database systems; specialized query optimizers; vertica SQL query optimizer; vertica analytic database; Data models; Dictionaries; Encoding; Engines; Indexes; Optimization
【Paper Link】 【Pages】:1120-1131
【Authors】: Herald Kllapi ; Boulos Harb ; Cong Yu
【Abstract】: An increasing number of Web applications such as friends recommendation depend on the ability to join objects at scale. The traditional approach taken is nearest neighbor join (also called similarity join), whose goal is to find, based on a given join function, the closest set of objects or all the objects within a distance threshold to each object in the input. The scalability of techniques utilizing this approach often depends on the characteristics of the objects and the join function. However, many real-world join functions are intricately engineered and constantly evolving, which makes the design of white-box methods that rely on understanding the join function impractical. Finding a technique that can join an extremely large number of objects with complex join functions has always been a tough challenge. In this paper, we propose a practical alternative approach called near neighbor join that, although it does not find the closest neighbors, finds close neighbors, and can do so at extremely large scale when the join functions are complex. In particular, we design and implement a super-scalable system we name SAJ that is capable of best-effort joining of billions of objects for complex functions. Extensive experimental analysis over real-world large datasets shows that SAJ is scalable and generates good results.
【Keywords】: Internet; SAJ; Web applications; distance threshold; join function; near neighbor join; object characteristics; scalable approximate join system; similarity join; super-scalable system; Algorithm design and analysis; Clustering algorithms; Databases; Google; Partitioning algorithms; Pipelines; Scalability
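The near-neighbor-join idea above, treating the join function as a black box and accepting close rather than closest neighbors, can be sketched with a simple pivot-based partitioning. This is not SAJ's algorithm, only a toy best-effort scheme (pivots, radius, and names are assumptions) that shows why partitioning trades completeness for scalability:

```python
def near_neighbor_join(objects, pivots, dist, radius):
    """Best-effort near-neighbor join sketch: bucket each object with its
    nearest pivot, then compare pairs only within a bucket. Pairs that
    straddle bucket boundaries are missed by design, which is the
    'close, not closest' trade-off."""
    buckets = {}
    for o in objects:
        p = min(range(len(pivots)), key=lambda i: dist(o, pivots[i]))
        buckets.setdefault(p, []).append(o)
    pairs = []
    for bucket in buckets.values():
        for i in range(len(bucket)):
            for j in range(i + 1, len(bucket)):
                if dist(bucket[i], bucket[j]) <= radius:
                    pairs.append((bucket[i], bucket[j]))
    return pairs
```

Because `dist` is only ever called, never inspected, the scheme works for arbitrarily complex, evolving join functions, which is the black-box property the paper emphasizes.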
【Paper Link】 【Pages】:1132-1143
【Authors】: Zhen Hua Liu ; Ying Lu ; Hui Joe Chang
【Abstract】: There has been more than a decade of effort to support the storage, querying, and updating of XML documents in RDBMSs. An XML-enabled RDBMS supports the SQL/XML standard, which defines XMLType as a SQL data type and allows XQuery/XPath to be embedded in XMLQuery(), XMLExists() and XMLTABLE() in SQL. In an XML-enabled RDBMS, both relational data and XML documents can be managed in one system and queried using the SQL/XML language. However, the use case of managing document-centric XML is not well addressed due to the lack of full text query constructs in XQuery. Recently, XQuery Full Text (XQFT) became a W3C recommendation. In this paper, we show how XQFT can be supported efficiently in SQL/XML for full text search of XML documents managed by an XML-enabled RDBMS, such as Oracle XMLDB. We present the architecture of a new XML Full Text Index, along with XQuery compile time and run time enhancements that efficiently support XQFT in SQL/XML. We present our design rationale on how to exploit Information Retrieval (IR) techniques for XQFT support in an RDBMS. The new XML Full Text Index can index common XML physical storage forms, such as text XML, binary XML, and relational decomposition of XML. Although our work is built within Oracle XMLDB, the principles and techniques presented in this paper are valuable to any RDBMS that needs to support XQFT effectively and efficiently over persisted XML documents.
【Keywords】: SQL; XML; document handling; query processing; relational databases; IR techniques; Oracle XMLDB; SQL data type; SQL-XML enabled RDBMS; SQL-XML language; Structured Query Language; XML documents; XMLExists; XMLQuery; XMLTABLE; XPath; XQuery full text support; binary XML; compile time; document centric XML; document query; document storage; document update; extensible markup language; full text query constructs; information retrieval; relational database management systems; relational decomposition; run time enhancements; text XML; Abstracts; Engines; Indexing; Layout; Standards; XML
【Paper Link】 【Pages】:1144-1155
【Authors】: Han Su ; Kai Zheng ; Jiamin Huang ; Hoyoung Jeung ; Lei Chen ; Xiaofang Zhou
【Abstract】: As travel is taking a more significant part in our life, route recommendation service becomes a big business and attracts many major players in the IT industry. Given a pair of user-specified origin and destination, a route recommendation service aims to provide users with the routes of best travelling experience according to criteria such as travelling distance, travelling time, traffic condition, etc. However, previous research shows that even the routes recommended by the big-thumb service providers can deviate significantly from the routes travelled by experienced drivers. It means travellers' preferences on route selection are influenced by many latent and dynamic factors that are hard to model exactly with pre-defined formulas. In this work we approach this challenging problem from a very different perspective: leveraging crowds' knowledge to improve the recommendation quality. In this light, CrowdPlanner, a novel crowd-based route recommendation system, has been developed, which requests human workers to evaluate candidate routes recommended by different sources and methods, and determines the best route based on their feedback. In this paper, we particularly focus on two important issues that affect system performance significantly: (1) how to efficiently generate tasks which are simple to answer but possess sufficient information to derive user-preferred routes; and (2) how to quickly identify a set of appropriate domain experts to answer the questions promptly and accurately. Specifically, the task generation component in our system generates a series of informative and concise questions with optimized ordering for a given candidate route set so that workers feel comfortable and find it easy to answer. In addition, the worker selection component utilizes a set of selection criteria and an efficient algorithm to find the most eligible workers to answer the questions with high accuracy.
A prototype system has been deployed to many voluntary mobile clients, and extensive tests on real-scenario queries have shown the superiority of CrowdPlanner in comparison with the results given by map services and popular route mining algorithms.
【Keywords】: recommender systems; traffic information systems; CrowdPlanner; crowd-based route recommendation system; domain experts identify; map services; recommendation quality; route mining algorithms; route recommendation service; selection criteria; task generation component; task generation efficiency; travelling; user-preferred routes; worker selection component; Accuracy; Mobile communication; Optimization; Roads; Trajectory; Vehicles; Web services
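The final aggregation step the CrowdPlanner abstract describes, determining the best route from worker feedback, can be illustrated with a plain majority vote. The paper's actual aggregation is more sophisticated (it weighs question ordering and worker eligibility); this is only a minimal stand-in with hypothetical names:

```python
from collections import Counter

def best_route(worker_votes):
    """Pick the candidate route with the most worker votes.
    Ties are broken by route id so the result is deterministic."""
    tally = Counter(worker_votes)
    return min(tally, key=lambda r: (-tally[r], r))
```

Each element of `worker_votes` is the route id one worker preferred; richer schemes would weight votes by the worker-selection criteria the abstract mentions.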
【Paper Link】 【Pages】:1156-1161
【Authors】: Youngchul Cha ; Junghoo Cho ; Jian Yuan ; Tak W. Yan
【Abstract】: Categorical (topic) similarity between a web page and an advertisement (ad) text has long been used for contextual advertising. In this paper, we explore the use of the categorical similarity score, referred to as Category Match Score (CMS), in the context of search advertising. In particular, we explore the effect of CMS on various ad-effectiveness prediction tasks, including user-judgment prediction, ad click-through-rate prediction (CTR), and revenue-per-impression prediction. Our extensive experiments on two editorial datasets and one live traffic dataset demonstrate that CMS is one of the strongest features in the judgment prediction task and that CMS-based filtering is very effective in improving revenue per impression as well as CTR. We believe that our analyses can be extremely effective in helping web service providers serve more relevant and profitable ads to users.
【Keywords】: Web services; advertising data processing; CMS-based filtering; CTR; Web service providers; ad click-through-rate prediction; categorical similarity score; category match score; judgment prediction task; revenue-per-impression prediction; search advertising; user-judgment prediction; Accuracy; Advertising; Channel hot electron injection; Correlation; Measurement; Web pages; Web services
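The Category Match Score in the abstract above is a categorical (topic) similarity between a page and an ad. The paper does not give its exact formula here, so the following is a hedged sketch using cosine similarity over category-weight vectors, a common choice for this kind of score; the function and dictionary shapes are assumptions:

```python
import math

def category_match_score(page_cats, ad_cats):
    """Cosine similarity between two category-weight vectors, given as
    {category: weight} dicts. Returns a value in [0, 1] for
    non-negative weights; 0.0 if either vector is empty."""
    dot = sum(page_cats.get(c, 0.0) * w for c, w in ad_cats.items())
    na = math.sqrt(sum(w * w for w in page_cats.values()))
    nb = math.sqrt(sum(w * w for w in ad_cats.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Used as a filter, a threshold on this score drops ads whose topic profile diverges from the page, which is the CMS-based filtering effect the experiments measure.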
【Paper Link】 【Pages】:1162-1167
【Authors】: Prashanth Menon ; Tilmann Rabl ; Mohammad Sadoghi ; Hans-Arno Jacobsen
【Abstract】: With the ever growing size and complexity of enterprise systems there is a pressing need for more detailed application performance management. Due to the high data rates, traditional database technology cannot sustain the required performance. Alternatives are the more lightweight and, thus, more performant key-value stores. However, these systems tend to sacrifice read performance in order to obtain the desired write throughput by avoiding random disk access in favor of fast sequential accesses. With the advent of SSDs, built upon the philosophy of no moving parts, the boundary between sequential vs. random access is now becoming blurred. This provides a unique opportunity to extend the storage memory hierarchy using SSDs in key-value stores. In this paper, we extensively evaluate the benefits of using SSDs in commercialized key-value stores. In particular, we investigate the performance of hybrid SSD-HDD systems and demonstrate the benefits of our SSD caching and our novel dynamic schema model.
【Keywords】: cache storage; disc drives; hard discs; CaSSanDra; SSD boosted key-value store; SSD caching; dynamic schema model; enterprise systems; hybrid SSD-HDD systems; random access; sequential access; storage memory hierarchy; Ash; Data mining; Data models; Indexes; Measurement; Monitoring; Throughput
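The hybrid SSD-HDD idea in the abstract, extending the storage hierarchy with an SSD tier in front of slower storage, can be sketched as a two-tier key-value store. This is not the paper's system; the SSD tier is simulated by a plain dict and the eviction policy (LRU) is an assumption:

```python
from collections import OrderedDict

class TieredStore:
    """Two-tier key-value store sketch: a small in-memory LRU tier in
    front of a larger (here simulated) SSD tier. Writes go to both
    tiers; reads fall through to the SSD tier on a miss."""
    def __init__(self, mem_capacity):
        self.mem = OrderedDict()   # fast tier, LRU order
        self.ssd = {}              # simulated SSD tier
        self.cap = mem_capacity
    def put(self, key, value):
        self.ssd[key] = value
        self._cache(key, value)
    def get(self, key):
        if key in self.mem:
            self.mem.move_to_end(key)    # refresh LRU position
            return self.mem[key]
        value = self.ssd[key]            # simulated SSD read on a miss
        self._cache(key, value)
        return value
    def _cache(self, key, value):
        self.mem[key] = value
        self.mem.move_to_end(key)
        if len(self.mem) > self.cap:
            self.mem.popitem(last=False)  # evict least recently used
```

The point of the SSD tier in the paper is that random reads on a miss are cheap enough to restore read performance that write-optimized key-value stores give up; here that cost difference is only implied by the tier split.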
【Paper Link】 【Pages】:1168-1173
【Authors】: Ilaria Bordino ; Nicolas Kourtellis ; Nikolay Laptev ; Youssef Billawala
【Abstract】: Web traffic represents a powerful mirror for various real-world phenomena. For example, it was shown that web search volumes have a positive correlation with stock trading volumes and with the sentiment of investors. Our hypothesis is that user browsing behavior on a domain-specific portal is a better predictor of user intent than web searches.
【Keywords】: portals; stock markets; Web search volumes; Web traffic; Yahoo Finance user browsing behavior; domain-specific portal; investor sentiment; positive correlation; stock trade volume prediction; Companies; Correlation; Finance; Industries; Time series analysis; Web search
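The abstract above rests on correlating a web-activity time series with a market time series. As a minimal, self-contained illustration of the statistic involved (plain Pearson correlation; the paper's analysis is richer, and the data here is invented):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

With `xs` as daily browsing volumes for a ticker's portal page and `ys` as its daily trading volumes, a value near 1 would support the hypothesis that browsing behavior tracks trading activity.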
【Paper Link】 【Pages】:1174-1177
【Authors】: Leihao Xia ; Caleb Chen Cao ; Lei Chen ; Zhao Chen
【Abstract】: Knapsack problems range over a large sphere of real world challenges [?]. For example, every year a professor has to decide her new “squad” of students/staff from possibly hundreds of candidates, while having a restricted budget of funding in consideration. Moreover, in many cases, she has to resort to her colleagues and senior students to make comparisons among the candidates. The difficulties of such tasks are mainly three-fold: 1) the knowledge about the candidates is distributed among a crowd; 2) the underlying factors are human-intrinsic and hard to formalize; 3) the number of candidates exceeds human capacity for a one-shot decision. Other examples in this category include gear set preparation for a venture trip, syllabus design for a popular course and inventory design for a goods shelf, where these difficulties are commonly observed. Consequently, a person may struggle to work out a final decision, which may even be inaccurate. Driven by this demand, in this demo, we present C-DMr - a Crowd-powered Decision Maker that incorporates the wisdom of informed crowds to solve such real world Knapsack Problems. The core module of this web-based system is a set of algorithms along with a novel interactive interface. The interface incrementally presents comparison jobs and motivates the crowd to participate with a rewarding mechanism, and the set of algorithms solves the Knapsack Problem given only pairwise preferences among candidates. We demonstrate the novelty and usefulness of C-DMr by forming the aforementioned “squad” for a recruiting professor. Specifically, four functionalities are shown: 1) a Candidates Entrance that collects the information about all candidates; 2) a Jury Trial that facilitates informed crowds to contribute preferences; 3) a Knapsack Analyzer that measures the on-going “squad”; and 4) a Consultant that recommends a final set of candidates to the professor.
【Keywords】: Internet; decision support systems; educational administrative data processing; further education; knapsack problems; C-DMr; Web-based system; candidates entrance function; colleagues; consultant function; crowd-powered decision maker; gear set preparation; interactive interface; inventory design; jury trial function; knapsack analyzer function; knapsack problems; one-shot decision; pairwise preferences; recruiting professor; rewarding mechanism; senior students; syllabus design; venture trip; Algorithm design and analysis; Computers; Databases; Educational institutions; Gears; Materials; Registers
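The C-DMr pipeline, turning pairwise crowd preferences into a budgeted selection, can be sketched in two steps: score candidates from pairwise judgments (here a simple win count, not the paper's algorithm), then solve a 0/1 knapsack over the scores. All names and the scoring rule are illustrative assumptions:

```python
def win_scores(comparisons):
    """Turn pairwise 'winner beats loser' judgments into per-candidate
    scores by counting wins (a crude stand-in for preference aggregation)."""
    scores = {}
    for winner, loser in comparisons:
        scores[winner] = scores.get(winner, 0) + 1
        scores.setdefault(loser, 0)
    return scores

def knapsack(candidates, budget):
    """0/1 knapsack DP over states keyed by spent budget.
    candidates: {name: (cost, score)}; returns the best-scoring set."""
    best = {0: (0, frozenset())}          # spent -> (total score, chosen set)
    for name, (cost, score) in candidates.items():
        for spent, (val, chosen) in list(best.items()):
            s2 = spent + cost
            if s2 <= budget and best.get(s2, (-1, None))[0] < val + score:
                best[s2] = (val + score, chosen | {name})
    val, chosen = max(best.values(), key=lambda t: t[0])
    return set(chosen)
```

Snapshotting `best.items()` before each candidate's pass ensures every candidate is used at most once, which is the 0/1 constraint.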
【Paper Link】 【Pages】:1178-1181
【Authors】: Han Su ; Kai Zheng ; Jiamin Huang ; Tianyu Liu ; Haozhou Wang ; Xiaofang Zhou
【Abstract】: Route recommendation service has become a big business in industry since traveling is now an important part of our daily life. We can travel to unknown places by simply typing in our destination and then following the recommendation service's guidance; a pleasant trip requires the service to provide a good route. However, previous research shows that even the routes recommended by the big-thumb service providers can deviate significantly from the routes travelled by experienced drivers, since many latent factors affect drivers' preferences and it is hard for a single route recommendation algorithm to model all of them. In this demo we present the CrowdPlanner system, which leverages crowds' knowledge to improve recommendation quality. It requests human workers to evaluate candidate routes recommended by different sources and methods, and determines the best route based on the feedback of these workers. In this demo, we first introduce the core component of our system for smart question generation, and then show several real route recommendation cases and the feedback of users.
【Keywords】: geographic information systems; recommender systems; social sciences computing; traffic engineering computing; CrowdPlanner; big-thumb service providers; crowd knowledge; crowd-based route recommendation system; driver preferences; recommendation quality; recommendation service guidance; route recommendation algorithm; route recommendation service; smart question generation; travelling; user feedback; Companies; Computers; Rail transportation; Smart phones; Time factors; Trajectory; Web services
【Paper Link】 【Pages】:1182-1185
【Authors】: Yongxin Tong ; Caleb Chen Cao ; Chen Jason Zhang ; Yatao Li ; Lei Chen
【Abstract】: Multi-version data is often among the most closely watched information on the Web, since this type of data is usually updated frequently. Even though there exist some Web information integration systems that try to maintain the latest update version, the maintained multi-version data usually includes inaccurate and invalid information due to data integration or update delay errors. In this demo, we present CrowdCleaner, a smart data cleaning system for cleaning multi-version data on the Web, which utilizes crowdsourcing-based approaches for detecting and repairing errors that usually cannot be solved by traditional data integration and cleaning techniques. In particular, CrowdCleaner blends active and passive crowdsourcing methods together for rectifying errors for multi-version data. We demonstrate the following four facilities provided by CrowdCleaner: (1) an error-monitor to find out which items (e.g., submission date, price of real estate, etc.) are wrong versions according to the reports from the crowds, which belongs to a passive crowdsourcing strategy; (2) a task-manager to allocate the tasks to human workers intelligently; (3) a smart-decision-maker to identify which answer from the crowds is correct with active crowdsourcing methods; and (4) a whom-to-ask-finder to discover which users (or human workers) should be the most credible according to their answer records.
【Keywords】: Internet; data integration; information retrieval; CrowdCleaner; Web information integration systems; crowdsourcing methods; crowdsourcing-based approach; data integration error; error-monitor facility; multiversion data; passive crowdsourcing strategy; smart data cleaning system; smart-decision-maker facility; task-manager facility; update delay error; whom-to-ask-finder facility; Cleaning; Data integration; Delays; Entropy; Maintenance engineering; Monitoring; Uncertainty
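Two of CrowdCleaner's facilities, the smart-decision-maker and the whom-to-ask-finder, combine naturally as a credibility-weighted vote: answers from workers with better answer records count for more. The paper's actual method is not given here, so this is only a hedged sketch with invented names and weights:

```python
def weighted_vote(answers, credibility):
    """Pick the answer with the highest total credibility-weighted support.
    answers: list of (worker_id, answer); credibility: {worker_id: weight},
    defaulting to 1.0 for unknown workers. Ties break toward the
    alphabetically smallest answer for determinism."""
    totals = {}
    for worker, answer in answers:
        totals[answer] = totals.get(answer, 0.0) + credibility.get(worker, 1.0)
    return max(sorted(totals), key=lambda a: totals[a])
```

A full system would also update each worker's credibility from how often their answers agree with the accepted one, closing the loop the whom-to-ask-finder describes.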
【Paper Link】 【Pages】:1186-1189
【Authors】: Siyu Lei ; Xuan S. Yang ; Luyi Mo ; Silviu Maniu ; Reynold Cheng
【Abstract】: In social tagging systems, such as Delicious and Flickr, users are allowed to annotate resources (e.g., Web URLs and images) with textual descriptions called tags. Tags have proven to be invaluable building blocks in algorithms for searching, mining and recommending resources. In practice, however, not all resources receive the same attention from users, and as a result, most tags are added to the few highly-popular resources, while most of the resources receive few tags. Crucially, this incomplete tagging on resources can severely affect the effectiveness of all tagging applications. We present iTag, an incentive-based tagging system, which aims at improving tagging quality of resources by incentivizing taggers under budget constraints. Our system is built upon traditional crowdsourcing systems such as Amazon Mechanical Turk (MTurk). In our demonstration, we will show how our system allows users to use simple but powerful strategies to significantly improve the tagging quality of resources.
【Keywords】: information retrieval systems; Amazon Mechanical Turk; Delicious; Flickr; MTurk; crowdsourcing systems; iTag system; incentive-based tagging system; resource annotation; resource mining; resource quality; resource recommendation; resource searching; social tagging systems; textual descriptions; Collaboration; Facebook; Monitoring; Noise measurement; Real-time systems; Resource management; Tagging
【Paper Link】 【Pages】:1190-1193
【Authors】: Jiamin Lu ; Ralf Hartmut Güting
【Abstract】: Parallel Secondo scales up the capability of processing extensible data models in Secondo. It combines Hadoop with a set of Secondo databases, providing almost all existing Secondo data types and operators. Therefore it is possible for the user to convert large-scale sequential queries to parallel queries without learning the Map/Reduce programming details. This paper demonstrates such a procedure. It imports the data from the project OpenStreetMap into Secondo databases to build up the urban traffic network and then processes network-based queries like map-matching and symbolic trajectory pattern matching. All involved queries were stated as sequential expressions and were time-consuming in single-computer Secondo. However, they can achieve an impressive performance in Parallel Secondo after being converted to the corresponding parallel queries, even on a small cluster consisting of six low-end computers.
【Keywords】: parallel databases; parallel programming; query processing; Hadoop; MapReduce programming details; OpenStreetMap project; Secondo data types; Secondo databases; Secondo operators; data models; map-matching; moving objects processing; network-based queries; parallel Secondo system; parallel queries; sequential queries; single-computer Secondo; symbolic trajectory pattern matching; urban traffic network; Computers; Data models; Decision support systems; Distributed databases; Roads; Trajectory
【Paper Link】 【Pages】:1194-1197
【Authors】: Mikkel Boysen ; Christian de Haas ; Hua Lu ; Xike Xie ; Aiste Pilvinyte
【Abstract】: Indoor navigation is widely useful in practice. An indoor navigation system requires an appropriate indoor model that represents the indoor space entities as well as their topology, and supports indoor distance computation. Indoor distance computation differs significantly from its outdoor counterpart, due to the characteristics of indoor spaces. Meanwhile, information about an indoor space is described in digital building information (DBI) formats like the Industry Foundation Classes (IFC). In such formats, the geometric representation of entities is the focus, while the topology is only implicit or even incomplete. Thus, constructing indoor navigation systems faces two major technical challenges: 1) indoor distance computation; 2) indoor space model creation from raw DBI. This paper presents a prototype system that addresses these two challenges. The system provides functions to process raw DBI files, employs PostgreSQL with PostGIS to manage the information carefully extracted from those files, and offers an indoor navigation service that supports indoor distance computation. The system also includes an app that accesses the indoor navigation service from mobile terminals.
【Keywords】: SQL; building management systems; geographic information systems; indoor environment; topology; DBI files; IFC; PostGIS; PostgreSQL; digital building information; geometric representation; indoor distance computation; indoor navigation service; indoor navigation systems; indoor space characteristics; indoor space entities; indoor space model creation; industry foundation classes; mobile terminals; topology; Buildings; Databases; Mobile communication; Navigation; Servers; Solid modeling; Topology
【Paper Link】 【Pages】:1198-1201
【Authors】: Ziawasch Abedjan ; Toni Grütze ; Anja Jentzsch ; Felix Naumann
【Abstract】: Before reaping the benefits of open data to add value to an organization's internal data, such new, external datasets must be analyzed and understood already at the basic level of data types, constraints, value patterns, etc. Such data profiling, already difficult for large relational data sources, is even more challenging for RDF datasets, the preferred data model for linked open data. We present ProLod++, a novel tool for various profiling and mining tasks to understand and ultimately improve open RDF data. ProLod++ comprises various traditional data profiling tasks, adapted to the RDF data model. In addition, it features many profiling results specific to open data, such as schema discovery for user-generated attributes, association rule discovery to uncover synonymous predicates, and uniqueness discovery along ontology hierarchies. ProLod++ is highly efficient, allowing interactive profiling for users interested in exploring the properties and structure of yet unknown datasets.
【Keywords】: data analysis; data mining; data models; ProLOD++; RDF data mining; RDF data model; RDF data profiling; association rule discovery; interactive profiling; ontology hierarchies; open RDF data; schema discovery; synonymous predicates; uniqueness discovery; user-generated attributes; Association rules; Data models; Data visualization; Ontologies; Pattern analysis; Resource description framework
【Paper Link】 【Pages】:1202-1205
【Authors】: Omar Alonso ; Kartikay Khandelwal
【Abstract】: Modern social networks such as Twitter provide a platform for people to express their opinions on a variety of topics ranging from personal to global. While the factual part of this information and the opinions of various experts are archived by sources such as Wikipedia and reputable news articles, the opinion of the general public is drowned out in a sea of noise and “un-interesting” information. In this demo we present Kondenzer, an offline system for condensing, archiving and visualizing social data. Specifically, we create digests of social data using a combination of filtering, duplicate removal and efficient clustering. This gives a condensed set of high-quality data, which is used to generate facets and create a collection that can be visualized using the PivotViewer control.
【Keywords】: data visualisation; information retrieval systems; social networking (online); Kondenzer; PivotViewer control; archived social media exploration; archived social media visualization; duplicate removal; efficient clustering; high quality data; social data archiving; social data condensing; social data visualization; social networks; Companies; Data visualization; Feature extraction; Media; Noise; Twitter
【Paper Link】 【Pages】:1206-1209
【Authors】: Wei Kang ; Anthony K. H. Tung ; Wei Chen ; Xinyu Li ; Qiyue Song ; Chao Zhang ; Feng Zhao ; Xiajuan Zhou
【Abstract】: The popularity of social media services has transformed the way information is acquired in modern society. Meanwhile, massive amounts of information are generated every single day. To extract useful knowledge, much effort has been invested in analyzing social media contents, e.g., (emerging) topic discovery. Even with these findings, however, users may still find it hard to obtain the knowledge most relevant to their preferences. In this paper, we present a novel system which brings proper context to continuously incoming social media contents, such that mass information can be indexed, organized and analyzed around Wikipedia entities. Four data analytics tools are employed in the system. Three of them aim to enrich each Wikipedia entity by analyzing the relevant contents, while the fourth builds an information network among the most relevant Wikipedia entities. With our system, users can easily pinpoint valuable information and knowledge they are interested in, as well as navigate to other closely related entities through the information network for further exploration.
【Keywords】: Internet; data visualisation; indexing; knowledge acquisition; social networking (online); Internet observatory; Trendspedia; Web analysis; Web visualization; Wikipedia entity; data analytics tools; indexing; information acquisition; information network; information organization; knowledge extraction; mass information; social media contents; social media services; topic discovery; Educational institutions; Electronic publishing; Encyclopedias; Internet; Visualization
【Paper Link】 【Pages】:1210-1213
【Authors】: Chen Xu ; Fan Xia ; Mohamed A. Sharaf ; Minqi Zhou ; Aoying Zhou
【Abstract】: NoSQL key-value data stores provide an attractive solution for big data management. With the help of data partitioning and replication, these data stores achieve higher levels of availability, scalability and reliability. Such design choices typically exhibit a tradeoff in which data freshness is sacrificed in favor of reduced access latency. At the replica level, this tradeoff is primarily shaped by the resource allocation strategies deployed for managing the processing of user queries and replica updates. In this demonstration, we showcase AQUAS: a quality-aware scheduler for Cassandra, which allows application developers to specify requirements on quality of service (QoS) and quality of data (QoD). AQUAS efficiently allocates the available replica resources to execute the incoming read/write tasks so as to minimize the penalties incurred by violating those requirements. We demonstrate AQUAS based on our implementation of a microblogging system.
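The QoS/QoD tradeoff that AQUAS manages at each replica can be illustrated with a toy penalty-driven scheduler. The function name, the linear penalty model, and the weights below are illustrative assumptions, not AQUAS's actual algorithm:

```python
def schedule_next(n_reads, n_updates, qos_weight=1.0, qod_weight=1.0):
    """Pick the next task on a replica: serving a read improves QoS
    (latency), while applying a pending replica update improves QoD
    (freshness). Choose whichever backlog currently carries the larger
    weighted penalty; the linear penalty model is an assumption made
    here purely for illustration."""
    if n_reads == 0:
        return "update" if n_updates else "idle"
    if n_updates == 0:
        return "read"
    return "read" if qos_weight * n_reads >= qod_weight * n_updates else "update"
```

Raising `qos_weight` biases the replica toward serving reads at the cost of growing staleness, which is exactly the knob a QoS/QoD-aware scheduler exposes.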
【Keywords】: Big Data; SQL; Web sites; relational databases; resource allocation; scheduling; AQUAS scheduler; Cassandra; NoSQL data stores; NoSQL key-value data; QoD; QoS; Structured Query Language; access latency; big data management; data availability; data freshness; data partitioning; data reliability; data replication; data scalability; microblogging system; quality of data; quality of service; quality-aware scheduler; read-write tasks; replica updates; resource allocation; resource allocation strategies; user queries; Delays; Educational institutions; Estimation; Generators; Prototypes; Quality of service; Scheduling
【Paper Link】 【Pages】:1214-1217
【Authors】: Yuzhe Tang ; Ling Liu ; Ting Wang ; Xin Hu ; Reiner Sailer ; Peter Pietzuch
【Abstract】: In the age of big data, key-value data updated by intensive write streams is increasingly common, e.g., in social event streams. To serve such data in a cost-effective manner, a popular new paradigm is to outsource it to the cloud and store it in a scalable key-value store while serving a large user base. Due to the limited trust in third-party cloud infrastructures, data owners have to sign the data stream so that the data users can verify the authenticity of query results from the cloud. In this paper, we address the problem of verifiable freshness for multi-version key-value data. We propose a memory-resident digest structure that utilizes limited memory effectively and offers efficient verification performance. The proposed structure is named IncBM-Tree because it INCrementally builds a Bloom filter-embedded Merkle tree. We demonstrate superior verification performance under the small memory footprints for signing that are typical of an outsourcing scenario, where data owners and users have limited resources.
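As a rough illustration of a Bloom filter-embedded Merkle tree, the sketch below builds the structure from scratch over a batch of key-value pairs (whereas the paper's contribution is incremental construction over a stream); the class, function names, and Bloom parameters are all hypothetical:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def bloom(key: str, m: int = 64, k: int = 3) -> int:
    """A tiny Bloom filter encoded as an m-bit integer (illustrative
    parameters): set k bit positions derived from hashes of the key."""
    bits = 0
    for i in range(k):
        d = hashlib.sha256(f"{i}:{key}".encode()).digest()
        bits |= 1 << (int.from_bytes(d[:4], "big") % m)
    return bits

class Node:
    def __init__(self, digest, bf, left=None, right=None):
        self.digest, self.bf, self.left, self.right = digest, bf, left, right

def build(entries):
    """Build a Merkle tree over (key, value) pairs; each internal node
    stores the hash of its children's digests and the bitwise OR of
    their Bloom filters, so the root authenticates the whole batch."""
    nodes = [Node(h(f"{k}={v}".encode()), bloom(k)) for k, v in entries]
    while len(nodes) > 1:
        nxt = []
        for i in range(0, len(nodes), 2):
            left = nodes[i]
            right = nodes[i + 1] if i + 1 < len(nodes) else left
            nxt.append(Node(h(left.digest + right.digest),
                            left.bf | right.bf, left, right))
        nodes = nxt
    return nodes[0]

def maybe_contains(root, key):
    """Bloom check at the root: False means the key is definitely
    absent; True means 'maybe', so only then would a verifier descend
    the tree for an authenticated lookup."""
    b = bloom(key)
    return (root.bf & b) == b
```

The embedded Bloom filters let a verifier prune entire subtrees for absent keys, while the Merkle digests make any tampering with a key or value change the root hash.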
【Keywords】: Big Data; cloud computing; data structures; outsourcing; Bloom filter-embedded Merkle tree; IncBM-Tree; big data; data stream; intensive write streams; memory-resident digest structure; multiversion key-value data; multiversion key-value stores outsourcing; third-party cloud infrastructures; verifiable data freshness; Authentication; Data storage systems; Government; Information management; Merging; Servers
【Paper Link】 【Pages】:1218-1221
【Authors】: Giansalvatore Mecca ; Paolo Papotti ; Donatello Santoro
【Abstract】: We call a data-transformation system any system that maps, translates and exchanges data across different representations. Nowadays, data architects are faced with a large variety of transformation tasks, and there is a huge number of different approaches and systems that were conceived to solve them. As a consequence, it is very important to be able to evaluate such alternative solutions, in order to pick the right ones for the problem at hand. To this end, we introduce IQ-Meter, the first comprehensive tool for the evaluation of data-transformation systems. IQ-Meter can be used to benchmark, test, and even learn the best usage of data-transformation tools. It builds on a number of novel algorithms to measure the quality of outputs and the human effort required by a given system, and ultimately measures “how much intelligence” the system brings to the solution of a data-translation task.
【Keywords】: data structures; IQ-Meter; XML; data architects; data mapping; data model; data-transformation system; data-transformation tools; data-translation task; evaluation tool; human effort; Benchmark testing; Complexity theory; Encoding; Graphical user interfaces; Heuristic algorithms; Time measurement; Velocity measurement
【Paper Link】 【Pages】:1222-1225
【Authors】: Xu Chu ; Ihab F. Ilyas ; Paolo Papotti ; Yin Ye
【Abstract】: Integrity constraints (ICs) are valuable tools for enforcing correct application semantics. However, manually designing ICs requires experts and time, hence the need for automatic discovery. Previous approaches to automatic IC discovery suffer from (1) limited expressiveness of the IC language and (2) time-consuming manual verification of the discovered ICs. We introduce RULEMINER, a system for discovering data quality rules that addresses the limitations of existing solutions.
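A core primitive behind data quality rule discovery is checking a candidate rule against the data. The sketch below verifies a candidate functional dependency, the simplest kind of IC; it is a minimal illustration, not RULEMINER's discovery algorithm:

```python
def fd_violations(rows, lhs, rhs):
    """Return pairs of row indices violating the candidate functional
    dependency lhs -> rhs (rows agreeing on lhs but not on rhs).
    A discovery system keeps candidates with few or no violations."""
    seen = {}   # lhs value -> (first row index, rhs value)
    bad = []
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if key in seen and seen[key][1] != val:
            bad.append((seen[key][0], i))
        else:
            seen.setdefault(key, (i, val))
    return bad
```

A dependency that holds exactly returns an empty list; near-valid candidates (few violations) are often the interesting data quality rules, since the violating rows are likely errors.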
【Keywords】: data mining; IC; RuleMiner; application semantics; data quality rules discovery; integrity constraints; Cleaning; Data mining; Inference algorithms; Itemsets; Java; Remuneration
【Paper Link】 【Pages】:1226-1229
【Authors】: Ruilin Liu ; Guan Wang ; Wendy Hui Wang ; Flip Korn
【Abstract】: The completeness of data is vital to data quality. In this demo, we present iCoDA, a system that supports interactive, exploratory data completeness analysis. iCoDA provides algorithms and tools to generate tableau patterns that concisely summarize the incomplete data under various configuration settings. During the demo, the audience can use iCoDA to interactively explore the tableau patterns generated from incomplete data, with the flexibility of filtering and navigating through different granularities of these patterns. iCoDA offers the audience various methods for visualizing tableau patterns. Overall, we will demonstrate that iCoDA provides sophisticated analysis of data completeness.
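To give a flavor of what a tableau pattern summarizes, the sketch below mines single-attribute patterns whose matching tuples are mostly incomplete; iCoDA's patterns are multi-attribute and far richer, so the function name and threshold here are purely illustrative:

```python
def tableau_patterns(rows, attrs, min_frac=0.8):
    """For each candidate pattern (attr = value), report it when at
    least min_frac of the matching tuples contain a missing value
    (None). Returns (attr, value, incomplete_count, match_count)."""
    out = []
    for a in attrs:
        for v in {r[a] for r in rows if r[a] is not None}:
            match = [r for r in rows if r[a] == v]
            incomplete = [r for r in match
                          if any(x is None for x in r.values())]
            if match and len(incomplete) / len(match) >= min_frac:
                out.append((a, v, len(incomplete), len(match)))
    return out
```

A pattern such as ("sensor", "s1", 2, 2) reads: all tuples from sensor s1 are incomplete, a concise summary an analyst can then drill into.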
【Keywords】: data acquisition; data analysis; data visualisation; data completeness analysis; data quality; data visualization methods; iCoDA; interactive completeness data analysis; tableau pattern generation; Data visualization; Detectors; Image color analysis; Loss measurement; Monitoring; Roads; Temperature sensors; Data completeness; exploratory pattern analysis; graph tableau discovery; pattern visualization
【Paper Link】 【Pages】:1230-1233
【Authors】: Christina Christodoulakis ; Christos Faloutsos ; Renée J. Miller
【Abstract】: If Lisa visits Dr. Brown, and there is no record of the drug he prescribed her, can we find it? Data sources, much to analysts' dismay, are too often plagued with incompleteness, making business analytics over the data difficult. Data entries with incomplete values are ignored, making some analytic queries fail to accurately describe how an organization is performing. We introduce a principled way of imputing missing values, allowing a user to choose the correct value after viewing the possible values and why they were inferred. We achieve this by turning our data into a graph network and performing link prediction on nodes of interest using the belief propagation algorithm.
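The imputation idea can be sketched with a drastically simplified propagation scheme: observed values spread over the graph, and the candidate values for a node of interest are ranked by accumulated weight. Real loopy belief propagation passes per-edge messages with factor potentials; this toy stand-in (with hypothetical names) only illustrates the intuition:

```python
from collections import Counter, defaultdict

def impute(edges, labels, node, iters=5):
    """Rank candidate values for `node` by iteratively propagating
    observed labels over an undirected graph. Observed nodes keep
    their value; unobserved nodes average their neighbors' beliefs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    beliefs = {n: Counter({v: 1.0}) for n, v in labels.items()}
    for _ in range(iters):
        new = {}
        for n in graph:
            if n in labels:            # observed: belief is fixed
                new[n] = beliefs[n]
                continue
            c = Counter()
            for nb in graph[n]:
                for val, w in beliefs.get(nb, Counter()).items():
                    c[val] += w
            total = sum(c.values()) or 1.0
            new[n] = Counter({v: w / total for v, w in c.items()})
        beliefs = new
    return beliefs[node].most_common()  # ranked (value, weight) list
```

In the abstract's example, Lisa's missing prescription is ranked by what propagates through her shared physician, and the ranked list (with weights) is what a user would inspect before choosing a value.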
【Keywords】: belief maintenance; data analysis; data visualisation; learning (artificial intelligence); VoidWiz system; analytic queries; belief propagation algorithm; business analytics; data entries; data sources; graph network; link prediction; network effects; value imputation; Algorithm design and analysis; Belief propagation; Clinical trials; Data visualization; Drugs; Educational institutions; Prediction algorithms
【Paper Link】 【Pages】:1234-1237
【Authors】: Antonio Miele ; Elisa Quintarelli ; Emanuele Rabosio ; Letizia Tanca
【Abstract】: This demo presents a framework for personalizing data access on the basis of the users' context and of the preferences they show while in that context. The system is composed of (i) a server application, which “tailors” a view over the available data on the basis of the user's contextual preferences, previously inferred from log data, and (ii) a client application running on the user's mobile device, which allows the user to query the data view and collects the activity log for later mining. At each change of context detected by the system, the corresponding tailored view is loaded onto the client device; accordingly, the most relevant data is available to the user even when the connection is unstable or lacking. The demo features a movie database, where users can browse data in different contexts and appreciate the personalization of the data views according to the inferred contextual preferences.
【Keywords】: data mining; query processing; ADaPT; activity log; automatic data personalization; client application; client device; data access personalization; data view personalization; data view query; log data; mining; movie database; server application; user contextual preferences; user mobile device; Context; Data mining; Databases; Graphical user interfaces; Mobile handsets; Motion pictures; Servers
【Paper Link】 【Pages】:1238-1241
【Authors】: Amr Magdy ; Ahmed M. Aly ; Mohamed F. Mokbel ; Sameh Elnikety ; Yuxiong He ; Suman Nath
【Abstract】: The Mars demonstration exploits the location information of microblogs to support a wide variety of important spatio-temporal queries on microblogs. Supported queries include range, nearest-neighbor, and aggregate queries. Mars works in a challenging environment where streams of microblogs arrive at high rates. Mars distinguishes itself with three novel contributions: (1) Efficient in-memory digestion/expiration techniques that can handle arrival rates of up to 64,000 microblogs/sec. This also includes highly accurate and efficient hopping-window-based aggregation of incoming microblog keywords. (2) Smart memory optimization and load shedding techniques that adjust in-memory contents based on the expected query load, trading significant storage savings for a slight and bounded accuracy loss. (3) Scalable real-time query processing, exploiting the Zipf distribution of microblog data for efficient top-k aggregate query processing. In addition, Mars employs a scalable real-time nearest-neighbor and range query processing module with various pruning techniques, so that it serves heavy query workloads in real time. Mars is demonstrated using a stream of real tweets obtained from the Twitter firehose with a production query workload obtained from Bing web search. We show that Mars serves incoming queries with an average latency of less than 4 msec and 99% answer accuracy, while saving up to 70% of storage overhead for different query loads.
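Hopping-window aggregation, which Mars applies to incoming keywords, can be sketched as follows; the class name and window sizes are illustrative assumptions, not Mars's implementation:

```python
from collections import Counter, deque

class HoppingWindowCounter:
    """Keyword counts over a hopping window: the stream is cut into
    fixed-duration sub-windows, and reported counts cover the last
    `windows` sub-windows. Expiring a whole sub-window at a time is
    what makes the scheme cheap for high arrival rates."""
    def __init__(self, windows=3):
        self.windows = windows
        self.hops = deque([Counter()])

    def add(self, keywords):
        """Count keywords into the current (newest) sub-window."""
        self.hops[-1].update(keywords)

    def hop(self):
        """Advance time by one sub-window; expire the oldest one."""
        self.hops.append(Counter())
        if len(self.hops) > self.windows:
            self.hops.popleft()

    def topk(self, k):
        """Top-k keywords over the sub-windows currently in the window."""
        total = Counter()
        for c in self.hops:
            total += c
        return total.most_common(k)
```

Old data thus drops out in whole sub-window increments instead of per-item deletions, a standard trade of slight answer granularity for much lower maintenance cost.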
【Keywords】: Internet; query processing; social networking (online); Bing Web search; Mars; Twitter firehose; Zipf distributed microblogs data; aggregate queries; heavy query workloads; hopping-window based aggregation; in-memory digestion-expiration techniques; load shedding techniques; microblogs keywords; microblogs location information; nearest-neighbor queries; production query workload; pruning techniques; range queries; real-time query processing; real-time spatio-temporal queries; smart memory optimization; top-k aggregate query processing; Aggregates; Indexes; Mars; Memory management; Query processing; Real-time systems; Twitter
【Paper Link】 【Pages】:1242-1245
【Authors】: Ahmed Eldawy ; Mohamed F. Mokbel
【Abstract】: With the huge amounts of spatial data collected every day, MapReduce frameworks, such as Hadoop, have become a common choice for scientists and industry practitioners to analyze big spatial data. For simplicity, users prefer to deal with Hadoop through high-level languages such as Pig Latin. Unfortunately, these languages are designed for primitive non-spatial data and have no support for spatial data types or functions. This demonstration presents Pigeon, a spatial extension that provides spatial functionality in Pig. Pigeon is implemented through user-defined functions (UDFs), making it easy to use and compatible with all recent versions of Pig. This also allows it to integrate smoothly with existing non-spatial functions and operations such as Filter, Join and Group By. Pigeon is compatible with the Open Geospatial Consortium (OGC) standard, which makes it easy to learn and use for users who are familiar with existing OGC-compliant tools such as PostGIS. This demonstration shows the audience how to work with Pigeon through some interesting applications running on large-scale real datasets extracted from OpenStreetMap.
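The kind of geometric primitive such a spatial UDF wraps can be illustrated with a plain ray-casting point-in-polygon test. This is a simplification (no holes, boundary cases not handled with OGC exactness) and not Pigeon's actual implementation, which delegates to a geometry library:

```python
def st_contains(polygon, point):
    """Ray-casting point-in-polygon test: cast a horizontal ray from
    the point and count edge crossings; an odd count means inside.
    `polygon` is a list of (x, y) vertices in order."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's height
            # x-coordinate where the edge crosses the ray
            xcross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < xcross:
                inside = not inside
    return inside
```

Exposed as a UDF, such a predicate slots directly into a Pig `FILTER` clause alongside ordinary non-spatial conditions.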
【Keywords】: Big Data; data analysis; high level languages; standards; visual databases; Hadoop; OGC standard; OGC-compliant tools; Open Geospatial Consortium standard; OpenStreetMap; Pig Latin; Pigeon; PostGIS; UDF; big spatial data analysis; filter; group by; high level languages; join; spatial MapReduce language; user defined functions; Cities and towns; Lakes; Rivers; Roads; Spatial databases; Standards; XML
【Paper Link】 【Pages】:1246-1249
【Authors】: Mohamed F. Mokbel ; Louai Alarabi ; Jie Bao ; Ahmed Eldawy ; Amr Magdy ; Mohamed Sarwat ; Ethan Waytas ; Steven Yackel
【Abstract】: This demo presents the Minnesota Traffic Generator (MNTG), an extensible web-based road network traffic generator. MNTG enables its users to generate traffic data on arbitrary road networks with different traffic generators. Unlike existing traffic generators that require a lot of time and effort to install, configure, and run, MNTG is a web service with a user-friendly interface where users can specify an arbitrary spatial region, select a traffic generator, and submit their traffic generation request. Once the traffic data is generated by MNTG, users can then download and/or visualize the generated data. MNTG can be extended to support: (1) various traffic generators: it ships with the two most common traffic generators, Brinkhoff and BerlinMOD, but other generators can be easily added; (2) various road network sources: it ships with U.S. Tiger files and OpenStreetMap, but other sources can also be added. A beta version of MNTG is launched at: http://mntg.cs.umn.edu.
【Keywords】: Web services; data visualisation; road traffic; user interfaces; BerlinMOD; Brinkhoff; MNTG; Minnesota Traffic Generator; OpenStreetMap; US Tiger files; Web service; Web-based road network traffic generator; data visualization; road network sources; traffic generation request; user-friendly interface; Data mining; Data visualization; Databases; Electronic mail; Generators; Roads; Standards
【Paper Link】 【Pages】:1250-1253
【Authors】: Nandish Jayaram ; Mahesh Gupta ; Arijit Khan ; Chengkai Li ; Xifeng Yan ; Ramez Elmasri
【Abstract】: We present GQBE, a system that offers a simple and intuitive mechanism to query large knowledge graphs. Answers to tasks such as “list university professors who have designed some programming languages and also won an award in Computer Science” are best found in knowledge graphs that record entities and their relationships. Real-world knowledge graphs are difficult to use due to their sheer size and complexity and the challenging task of writing complex structured graph queries. Toward better usability of query systems over knowledge graphs, GQBE allows users to query knowledge graphs by example entity tuples without writing complex queries. In this demo we present: 1) a detailed description of the various features and user-friendly GUI of GQBE, 2) a brief description of the system architecture, and 3) a demonstration scenario that we intend to show the audience.
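Query-by-example over an edge-labeled graph can be illustrated in miniature: given one example tuple, collect the edge labels connecting its entities and return other tuples connected by the same labels. GQBE's actual approach (discovering and ranking maximal query graphs) is far more sophisticated; this sketch, with hypothetical data, only conveys the interaction model:

```python
def answers_by_example(edges, example):
    """edges: iterable of (subject, label, object) triples.
    example: one (subject, object) pair the user gives as an example.
    Returns other pairs connected by every label that connects the
    example pair."""
    labels = {lab for (s, lab, t) in edges if (s, t) == example}
    candidates = {(s, t) for (s, lab, t) in edges
                  if lab in labels and (s, t) != example}
    return {p for p in candidates
            if labels <= {lab for (s, lab, t) in edges if (s, t) == p}}
```

The user never writes a structured graph query: naming one known answer, such as (gosling, java), is enough to retrieve analogous tuples.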
【Keywords】: graph theory; graphical user interfaces; query processing; GQBE; complex structured graph queries; example entity tuples; knowledge graph querying; query system usability; real-world knowledge graphs; system architecture; user-friendly GUI; Awards activities; Computers; Educational institutions; Lattices; Relational databases; Writing
【Paper Link】 【Pages】:1254-1257
【Authors】: Patrick Ernst ; Cynthia Meng ; Amy Siu ; Gerhard Weikum
【Abstract】: Knowledge bases (KBs) contribute to advances in semantic search, Web analytics, and smart recommendations. Their coverage of domain-specific knowledge is limited, though. This demo presents the KnowLife portal, a large KB for health and life sciences, automatically constructed from Web sources. Prior work on biomedical ontologies has focused on molecular biology: genes, proteins, and pathways. In contrast, KnowLife is a one-stop portal for a much wider range of relations about diseases, symptoms, causes, risk factors, drugs, side effects, and more. Moreover, while most prior work relies on manually curated sources as input, the KnowLife system taps into scientific literature as well as online communities. KnowLife uses advanced information extraction methods to populate the relations in the KB. This way, it learns patterns for relations, which are in turn used to semantically annotate newly seen documents, thus aiding users in “speed-reading”. We demonstrate the value of the KnowLife KB through various use cases, supporting both layman and professional users.
【Keywords】: knowledge based systems; medical computing; portals; KnowLife portal; KnowLife system; Web analytics; Web sources; biomedical ontologies; documents annotation; domain-specific knowledge; genes; health science; knowledge base; knowledge graph; life science; molecular biology; online communities; pathways; proteins; scientific literature; semantic search; smart recommendations; speed-reading; Diseases; Drugs; Information retrieval; Portals; Semantics; Unified modeling language
【Paper Link】 【Pages】:1258-1261
【Authors】: Michael N. Gubanov ; Michael Stonebraker ; Daniel Bruckner
【Abstract】: Large-scale text data research has recently started to regain momentum [1]-[10], because of the wealth of up-to-date information communicated in unstructured format. For example, new information in online media (e.g., Web blogs, Twitter, Facebook, news feeds) becomes instantly available and is refreshed regularly, has very broad coverage, and exhibits other valuable properties unusual for other data sources and formats. Therefore, many enterprises and individuals are interested in integrating and using unstructured text in addition to their structured data.
【Keywords】: data integration; data structures; sensor fusion; text analysis; DATA TAMER; data cleaning; data formats; data integration system; data transformations; entity consolidation module; expert-sourcing mechanism; human guidance; large-scale text data research; online media; schema integration facility; structured data fusion; structured data sources; text fusion; Blogs; Cleaning; Data integration; Distributed databases; Media; Motion pictures; Schedules
【Paper Link】 【Pages】:1262-1265
【Authors】: Erietta Liarou ; Stratos Idreos
【Abstract】: A fundamental need in the era of data deluge is data exploration through interactive tools, i.e., being able to quickly determine data and patterns of interest. dbTouch is a new research direction towards a next generation of data management systems that inherently support data exploration by allowing touch-based interaction. Data is represented in a visual format, while users can touch those shapes and interact/query with gestures. In a dbTouch system, the whole database kernel is geared towards quick responses to touch input; the user drives query processing (not just query construction) via touch gestures, dictating how fast or slow data flows through query plans and which data parts are processed at any time. dbTouch translates the gestures into interactive database operators, reacting continuously and in real time to the touch input and analytics tasks given by the user, such as sliding a finger over a column to scan it progressively, zooming in with two fingers over a column to progressively get sample data, or rotating a table to change the physical design from row-store to column-store. This demo presents the first dbTouch prototype over iOS for iPad.
【Keywords】: interactive systems; query processing; touch sensitive screens; action database kernels; data deluge; data management systems; database kernel; dbTouch system; interactive database operators; interactive tools; query construction; query plans; query processing; touch based data exploration; touch based interaction; touch gestures; visual format; Data visualization; Kernel; Prototypes; Query processing; Shape; Tablet computers
【Paper Link】 【Pages】:1266-1269
【Authors】: Bin Liu ; Wang-Pin Hsiung ; Jun'ichi Tatemura ; Hakan Hacigümüs
【Abstract】: Entity-group-based NewSQL systems achieve scalability and consistency at the same time by using a key-value store as the storage layer and limiting each transaction's boundary to a collection of data called an entity-group. Examples of such systems are Google's Megastore, NEC's Partiqle, and LinkedIn's Espresso. Application developers on such systems face tremendous challenges, both in designing entity-groups (and hence transaction boundaries) and in the physical layout of data in key-value stores. Entity-group designs directly impact the consistency semantics of the workload, and the physical layout impacts application throughput. Both problems are challenging for users to solve manually. In this demonstration, we show a system that solves both problems with a user-friendly GUI built on our principled formal methods. We demonstrate the system using a simple Auction benchmark and a more complex TPC-W benchmark.
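The physical-layout side of the problem can be illustrated with a common entity-group key encoding: prefixing every row key with its group id makes all of a group's rows sort contiguously, so a transaction touches a single key range. The encoding below is a generic sketch, not the layout produced by the demonstrated tool:

```python
def make_key(group_id, table, pk):
    """Encode a row key so that all rows of one entity-group are
    adjacent in a lexicographically ordered key-value store
    (hypothetical '/'-separated encoding for illustration)."""
    return f"{group_id}/{table}/{pk}"

def group_scan(store, group_id):
    """Fetch every row of one entity-group with a single prefix scan,
    which is the access pattern entity-group transactions rely on."""
    prefix = f"{group_id}/"
    return {k: v for k, v in sorted(store.items()) if k.startswith(prefix)}
```

Because a group occupies one contiguous range, serializing transactions per group needs no cross-range coordination, which is where the scalability/consistency balance of these systems comes from.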
【Keywords】: SQL; graphical user interfaces; Auction benchmark; Google Megastore; LinkedIn Espresso; NEC Partiqle; SAGE; SQL system; TPC-W benchmark; consistency semantics; entity-group design; key-value store; logical design tool; physical design tool; principled formal method; storage layer; user-friendly GUI; Benchmark testing; Graphical user interfaces; Indexes; Layout; Optimization; Scalability
【Paper Link】 【Pages】:1270-1273
【Authors】: Vasil Slavov ; Anas Katib ; Praveen R. Rao
【Abstract】: We present a novel tool called XGossip for Internet-scale cardinality estimation of XPath queries over distributed XML data. XGossip relies on the principle of gossip; it is scalable and decentralized, and can cope with network churn and failures. It employs a novel divide-and-conquer strategy for load balancing and reducing the overall network bandwidth consumption. It has a strong theoretical underpinning and provides provable guarantees on the accuracy of cardinality estimates, the number of messages exchanged, and the total bandwidth usage. In this demonstration, users will experience three engaging scenarios. In the first scenario, they can set up, configure, and deploy XGossip on Amazon Elastic Compute Cloud (EC2). In the second scenario, they can execute XGossip, pose XPath queries, and observe in real time the convergence speed of XGossip, the accuracy of cardinality estimates, the bandwidth usage, and the number of messages exchanged. In the third scenario, they can introduce network churn and failures during the execution of XGossip and observe how these impact its behavior.
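As a flavor of gossip-based aggregation, the sketch below runs the classic push-sum protocol: every node repeatedly halves its (sum, weight) pair and ships one half to a random peer, and each node's sum/weight ratio converges to the network average, from which a network-wide total can be estimated. This is a generic gossip primitive for illustration, not XGossip's exact protocol:

```python
import random

def push_sum(values, rounds=50, seed=7):
    """Simulate push-sum gossip over n fully connected nodes.
    Each node i starts with (s_i, w_i) = (values[i], 1). Per round it
    keeps half of its pair and sends the other half to a random peer;
    total mass is conserved, so s_i / w_i converges to the average.
    Returns each node's estimate of the network-wide total."""
    random.seed(seed)
    n = len(values)
    s = [float(v) for v in values]
    w = [1.0] * n
    for _ in range(rounds):
        inbox_s = [0.0] * n
        inbox_w = [0.0] * n
        for i in range(n):
            peer = random.randrange(n)
            inbox_s[i] += s[i] / 2      # half kept locally
            inbox_w[i] += w[i] / 2
            inbox_s[peer] += s[i] / 2   # half gossiped to a peer
            inbox_w[peer] += w[i] / 2
        s, w = inbox_s, inbox_w
    return [s[i] / w[i] * n for i in range(n)]
```

No node ever sees the full dataset, yet every node ends up with an accurate estimate of the global aggregate, which is the property a decentralized cardinality estimator builds on.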
【Keywords】: Internet; XML; divide and conquer methods; query processing; resource allocation; Amazon Elastic Compute Cloud; EC2; Internet-scale cardinality estimation; XGossip tool; XPath queries; bandwidth usage; distributed XML data; distributed semistructured data; divide-and-conquer strategy; extensible markup language; load balancing; message exchange; network bandwidth consumption; network churn; Accuracy; Bandwidth; Convergence; Engines; Estimation; Medical services; XML
【Paper Link】 【Pages】:1274-1277
【Authors】: Yu Cao ; Xiaoyan Guo ; Baoyao Zhou ; Stephen Todd
【Abstract】: This paper demonstrates HOPE, an efficient and effective database partitioning system designed for OLTP workloads. HOPE is built on top of a novel tuple-group-based database partitioning model, which minimizes the number of distributed transactions as well as the extent of partition and workload skew during workload execution. HOPE conducts the partitioning in an iterative manner in order to achieve better partitioning quality, save the user time on partitioning design, and broaden its applicability. HOPE is also highly interactive, as it provides rich opportunities for the user to further improve the partitioning quality by passing in expertise and indirect domain knowledge.
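The two objectives HOPE balances can be stated very compactly: the number of transactions whose tuples span partitions, and the size skew across partitions. The helpers below merely evaluate a given partitioning against these objectives; they are not HOPE's hypergraph-based partitioning algorithm:

```python
from collections import Counter

def distributed_txn_count(transactions, partition):
    """Count transactions touching tuples in more than one partition;
    each such transaction needs distributed coordination, which is the
    main cost a partitioner tries to minimize."""
    return sum(1 for txn in transactions
               if len({partition[t] for t in txn}) > 1)

def partition_skew(partition):
    """Ratio of the largest partition's size to the mean size;
    1.0 means perfectly balanced."""
    sizes = list(Counter(partition.values()).values())
    return max(sizes) / (sum(sizes) / len(sizes))
```

An iterative partitioner would propose tuple-group moves and keep those that lower the distributed-transaction count without pushing the skew past a budget.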
【Keywords】: transaction processing; HOPE system; OLTP workloads; database partitioning system; distributed transactions; domain knowledge; hypergraph based OLTP database partitioning engine; online transaction processing; partitioning design; partitioning quality; tuple-group based database partitioning model; workload execution; Computer architecture; Database systems; Distributed databases; Partitioning algorithms; Scalability; Throughput
【Paper Link】 【Pages】:1278-1281
【Authors】: Zhibo Peng ; Mitch Cherniack ; Olga Papaemmanouil
【Abstract】: Recent advances in the underlying architectures of database management systems (DBMS) have motivated the redesign of key DBMS components such as the query optimizer. Optimizers are inherently difficult to build and maintain, yet no software engineering tools exist to facilitate their development. In this paper, we introduce a [Devel]opment Environment for Query [Op]timizers (Devel-Op) designed to facilitate the rapid prototyping, profiling, and benchmarking of optimizers. Our current version of the tool permits declarative specification and generation of two key optimizer components (the logical plan enumerator and the physical plan generator), as well as debugging and visualization tools for profiling the generated components.
【Keywords】: data visualisation; database management systems; formal specification; program debugging; DBMS components; Devel-op; database management systems; debugging tools; declarative specification; logical plan enumerator; optimizer benchmarking; optimizer development environment; optimizer profiling; optimizer rapid prototyping; physical plan generator; query optimizer; visualization tools; Computer architecture; Debugging; Generators; Grammar; Optimization; Query processing; Visualization
【Paper Link】 【Pages】:1282-1285
【Authors】: Rohit Jain ; Sunil Prabhakar
【Abstract】: Data are often stored at untrusted database servers. The lack of trust arises naturally when the database server is owned by a third party, as in the case of cloud computing. It also arises if the server has been compromised or there is a malicious insider. Ensuring the trustworthiness of data retrieved from such untrusted databases is of utmost importance. Trustworthiness of data is defined by faithful execution of valid and authorized transactions on the initial data. Earlier work on this problem is limited to cases where data are either not updated, or are updated by a single trustworthy entity. However, for a truly dynamic database, multiple clients should be allowed to update data without having to route the updates through a central server. In this demonstration, we present a system that establishes the authenticity and integrity of data in a dynamic database where clients can run transactions directly on the database server. Our system provides provable authenticity and integrity of data with no requirement for the server to be trustworthy, and it also provides assured provenance of data. This demonstration is built using the solutions proposed in our previous work [5]. Our system is built on top of Oracle with no modifications to the database internals, showing that it can be easily adopted in existing databases without any internal changes. We also demonstrate how our system can provide authentic provenance.
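The abstract does not describe the paper's specific mechanism; a standard building block for this kind of provable integrity over untrusted storage is a Merkle hash tree, sketched minimally below (the helper names are invented for illustration):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Root hash over the leaf values. A client who keeps only the
    root can detect any tampering with the stored rows, because any
    modified leaf changes the root digest."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                # duplicate last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

rows = [b"row1:alice", b"row2:bob", b"row3:carol"]
root = merkle_root(rows)                                  # trusted digest
tampered = merkle_root([b"row1:alice", b"row2:eve", b"row3:carol"])
```

The root is deterministic for a given database state, so honest updates can be verified by recomputing it, while a compromised server cannot alter rows without the digest mismatch revealing it.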
【Keywords】: data integrity; database management systems; trusted computing; Oracle; cloud computing; data authenticity; data integrity; data provenance; data transactions; data trustworthiness; database internals; database servers; dynamic database; malicious insider; trustworthy entity; Cloud computing; Hardware; Indexes; Protocols; Servers
【Paper Link】 【Pages】:1286-1289
【Authors】: Olaf Hartig ; M. Tamer Özsu
【Abstract】: The publication of Linked Open Data on the Web has gained tremendous momentum over the last six years. As a consequence, we currently witness the emergence of a new research area that focuses on an online execution of Linked Data queries; i.e., declarative queries that range over Web data that is made available using the Linked Data publishing principles. These principles only require Web servers that respond to simple requests for data about given entities. Therefore, in contrast to approaches for querying a more traditional distributed database, Linked Data query processing approaches cannot assume that data sources provide query processing functionality. Additional challenges are the unbounded nature of the Web and the lack of a complete, up-to-date database catalog that lists all data sources. Our tutorial provides an overview of the new area of Linked Data query processing. We introduce the foundations of Linked Data queries, discuss the specific challenges that need to be addressed, and review techniques for executing such queries.
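To make the contrast with traditional distributed querying concrete, here is a minimal link-traversal sketch in the spirit the tutorial describes: the only operation assumed of a server is "dereference a URI, get its triples", and new sources are discovered by following links (the mocked in-memory `WEB` dictionary stands in for HTTP dereferencing and is purely illustrative):

```python
from collections import deque

# Mocked "Web": dereferencing an entity URI yields the triples its
# server publishes, per the Linked Data principles.
WEB = {
    "ex:alice": [("ex:alice", "knows", "ex:bob")],
    "ex:bob":   [("ex:bob", "knows", "ex:carol"),
                 ("ex:bob", "name", '"Bob"')],
    "ex:carol": [("ex:carol", "name", '"Carol"')],
}

def traverse(seed, predicate):
    """Collect triples matching a predicate by following links
    reachable from the seed URI -- no source catalog is assumed."""
    seen, results = set(), []
    queue = deque([seed])
    while queue:
        uri = queue.popleft()
        if uri in seen or uri not in WEB:
            continue
        seen.add(uri)
        for s, p, o in WEB[uri]:
            if p == predicate:
                results.append((s, p, o))
            if o.startswith("ex:"):   # discovered link: dereference it too
                queue.append(o)
    return results

names = traverse("ex:alice", "name")
```

Note how the query answer grows as traversal discovers new documents; this unbounded, catalog-free discovery is exactly what distinguishes Linked Data query processing from querying a fixed distributed database.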
【Keywords】: Internet; data handling; query processing; Linked Data publishing principles; Linked Data query processing; Linked Open Data; Web data; database catalog; declarative queries; distributed database; Indexes; Query processing; Resource description framework; Semantics; Tutorials; World Wide Web
【Paper Link】 【Pages】:1290-1293
【Authors】: Lukasz Golab ; Theodore Johnson
【Abstract】: Data stream warehousing is a data management technology designed to simultaneously handle big-data and fast-data. Conceptually, a data stream warehouse can be thought of as a data warehouse system that is updated in nearly-real time rather than during downtimes, or as a data stream management system that stores a very long history. In this tutorial, we 1) motivate the need for data stream warehouse systems using real-life examples drawn from our experiences in network and data center monitoring, 2) describe several possible system architectures for data stream warehousing, 3) discuss various issues in query languages, performance optimizations and data stream quality, and 4) conclude with a discussion of open problems.
【Keywords】: Big Data; data warehouses; query languages; big-data handling; data management technology; data stream quality; data stream warehousing; fast-data handling; performance optimizations; query languages; system architectures; Data mining; Data warehouses; Educational institutions; Monitoring; Optimization; Real-time systems; Warehousing
【Paper Link】 【Pages】:1294-1297
【Authors】: Barna Saha ; Divesh Srivastava
【Abstract】: In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor quality data is prevalent in large databases and on the Web. Since poor quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth 'V' of big data, is increasingly being recognized. In this tutorial, we highlight the substantial challenges that the first three V's, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the data “speak for itself” in order to discover its semantics. This tutorial presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading off accuracy vs. efficiency, and identifies a range of open problems for the community.
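A concrete instance of letting the data "speak for itself" is discovering functional dependencies directly from the rows instead of specifying rules a priori. The sketch below (a naive illustration, far simpler than the scalable discovery algorithms the tutorial surveys) finds all single-attribute dependencies A → B that hold exactly on a small table:

```python
from itertools import permutations

def discover_fds(rows, header):
    """Return single-attribute FDs A -> B that hold on the data:
    every value of A maps to exactly one value of B."""
    fds = []
    for a, b in permutations(range(len(header)), 2):
        mapping = {}
        holds = True
        for row in rows:
            # setdefault records the first B seen for this A value;
            # a later, different B value refutes the dependency
            if mapping.setdefault(row[a], row[b]) != row[b]:
                holds = False
                break
        if holds:
            fds.append((header[a], header[b]))
    return fds

header = ["zip", "city", "name"]
rows = [("10001", "NYC", "ann"),
        ("10001", "NYC", "bob"),
        ("60601", "CHI", "cat")]
fds = discover_fds(rows, header)
```

Rows violating a discovered (or nearly-holding) dependency are candidate quality errors; the volume and velocity challenges arise because real discovery must run over data far too large for this quadratic-in-attributes scan.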
【Keywords】: Big Data; Internet; decision making; quality management; Web databases; big data quality management; data analysis; data variety; data velocity; data volume; data-driven decision making; source diversity; Cleaning; Data handling; Data storage systems; Databases; Information management; Maintenance engineering; Quality management
【Paper Link】 【Pages】:1298-1301
【Authors】: Stratis D. Viglas
【Abstract】: Just-in-time compilation of SQL queries into native code has recently emerged as a viable technique for query processing and an alternative to the dominant interpretation-based approach. We present the salient results of research in this fresh area, addressing all aspects of the query processing stack: from traditional query compilation techniques, to compilation in managed environments, to state-of-the-art approaches on intermediate and native code emission. Throughout the discussion we refer and draw analogies to the general code generation techniques used in contemporary compiler technology. At the same time we describe the open research problems of the area.
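The interpretation-vs-compilation contrast can be shown in miniature: an interpreter walks a predicate tree for every row, while a compiler emits source code for the predicate once and then runs the generated code. This toy sketch (using Python `exec` as a stand-in for native code emission; no real engine's code generator is depicted) compiles a WHERE clause:

```python
def interpret_filter(rows, pred_tree):
    """Interpretation: walk the predicate tree for every row."""
    def eval_node(node, row):
        op, *args = node
        if op == "col":
            return row[args[0]]
        if op == "const":
            return args[0]
        left, right = (eval_node(a, row) for a in args)
        return {"gt": left > right, "and": left and right}[op]
    return [r for r in rows if eval_node(pred_tree, r)]

def compile_filter(pred_tree):
    """Compilation: emit Python source once, then run the compiled
    function with no per-row tree-walking overhead."""
    def emit(node):
        op, *args = node
        if op == "col":
            return f"row[{args[0]}]"
        if op == "const":
            return repr(args[0])
        sym = {"gt": ">", "and": "and"}[op]
        return f"({emit(args[0])} {sym} {emit(args[1])})"
    src = f"def pred(row):\n    return {emit(pred_tree)}"
    ns = {}
    exec(src, ns)          # stands in for emitting machine code
    return ns["pred"]

# WHERE a > 10 AND b > 2, over rows shaped (a, b)
tree = ("and", ("gt", ("col", 0), ("const", 10)),
               ("gt", ("col", 1), ("const", 2)))
rows = [(5, 9), (20, 3), (30, 1)]
pred = compile_filter(tree)
```

Both paths return the same answer; the compiled path pays the code-generation cost once, which is the trade-off the tutorial's techniques navigate at the level of full query plans and real machine code.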
【Keywords】: SQL; program compilers; query processing; SQL query processing; code generation techniques; compiler technology; just-in-time compilation; query processing stack; Computer languages; Educational institutions; Engines; Query processing; Runtime; Virtual machining
【Paper Link】 【Pages】:1302-1305
【Authors】: Reynold Cheng ; Tobias Emrich ; Hans-Peter Kriegel ; Nikos Mamoulis ; Matthias Renz ; Goce Trajcevski ; Andreas Züfle
【Abstract】: Location-related data has a tremendous impact in many applications of high societal relevance, and its growing volume from heterogeneous sources is a true example of Big Data [1]. An inherent property of any spatio-temporal dataset is uncertainty, due to various sources of imprecision. This tutorial provides a comprehensive overview of the different challenges involved in managing uncertain spatial and spatio-temporal data and presents state-of-the-art techniques for addressing them.
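A common way to model the uncertainty the tutorial addresses is to represent each object as a discrete set of possible locations with probabilities, and to answer probabilistic threshold queries over them. The sketch below is a generic illustration of that model, not a technique from the tutorial itself:

```python
def prob_in_range(samples, lo, hi):
    """Probability that an uncertain point lies inside [lo, hi]^2,
    given its discrete possible locations with probabilities."""
    return sum(p for (x, y), p in samples
               if lo <= x <= hi and lo <= y <= hi)

def prob_range_query(objects, lo, hi, tau):
    """Probabilistic threshold range query: return the ids of objects
    that lie in the range with probability at least tau."""
    return [oid for oid, samples in objects.items()
            if prob_in_range(samples, lo, hi) >= tau]

# Two taxis whose GPS fixes are imprecise: each has alternative
# possible positions with associated probabilities (summing to 1).
objects = {
    "cab1": [((1.0, 1.0), 0.6), ((9.0, 9.0), 0.4)],
    "cab2": [((2.0, 2.0), 0.9), ((8.0, 8.0), 0.1)],
}
hits = prob_range_query(objects, 0.0, 5.0, tau=0.8)
```

Only cab2 qualifies: its probability of lying in the query region is 0.9, while cab1's is 0.6, below the threshold.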
【Keywords】: Big Data; visual databases; Big Data; location-related data; spatial data; spatio-temporal data; uncertainty management; Data models; Probabilistic logic; Semantics; Spatial databases; Tutorials; Uncertainty
【Paper Link】 【Pages】:1306-1309
【Authors】: Rajeev Gupta ; Krithi Ramamritham
【Abstract】: Data delivered over the Internet is increasingly being used to provide dynamic and personalized user experiences. To achieve this, queries are executed over fast-changing data from distributed sources. Because these queries require data from multiple sources, they are executed at an intermediate proxy or data aggregator. Typically, users of these queries are not interested in all data updates; query results may be associated with an imprecision bound or threshold, which can be used to limit the number of refresh messages. These queries can be categorized by the type of result required: in an entity-based query, the user just wants the ids of the data items (or entities) satisfying a certain selection condition; in a value-based query, the user wants the value of some aggregation over distributed data items; and in a threshold query, the user wants to know whether a Boolean condition, expressed as a threshold over an aggregation of data items, is true. We methodically present techniques for executing all these categories of continuous aggregation queries over distributed data so that the number of message exchanges between data sources, aggregators, and users is minimized. The values of individual data items can be uncertain, with an associated probability. A data aggregator can execute a query either by obtaining all the required data or by sending appropriate sub-queries to the distributed data sources. To obtain the data, the aggregator can use either push- or pull-based mechanisms, each of which minimizes the number of message exchanges in a different way. We present various algorithms for these approaches.
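The imprecision-bound idea can be sketched concretely: if a SUM query tolerates total error B, each source can be allotted a share of B and push a refresh only when its value drifts beyond that share, suppressing most messages. The class names and the equal-share allocation below are illustrative assumptions, not the tutorial's specific algorithms:

```python
class Aggregator:
    """Maintains a cached SUM over sources; the cached answer is
    correct within the total bound = sum of per-source shares."""
    def __init__(self):
        self.values = {}
        self.messages = 0
    def register(self, source):
        self.values[source] = 0.0
    def refresh(self, source, value):
        self.values[source] = value
        self.messages += 1
    def total(self):
        return sum(self.values.values())

class Source:
    """Push-based source: sends a refresh message only when its value
    drifts more than its allotted share of the imprecision bound."""
    def __init__(self, aggregator, bound_share):
        self.agg = aggregator
        self.bound = bound_share
        self.last_pushed = 0.0
        aggregator.register(self)
    def update(self, value):
        if abs(value - self.last_pushed) > self.bound:
            self.agg.refresh(self, value)   # message to the aggregator
            self.last_pushed = value

agg = Aggregator()
s1, s2 = Source(agg, bound_share=5.0), Source(agg, bound_share=5.0)
s1.update(3.0)    # drift 3.0 <= 5.0: refresh suppressed
s2.update(12.0)   # drift 12.0 > 5.0: refresh pushed
s1.update(4.0)    # drift 4.0 <= 5.0: refresh suppressed
```

Only one of the three updates generates a message, yet the cached SUM (12.0) differs from the true SUM (16.0) by no more than the total bound of 10.0.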
【Keywords】: data handling; query processing; Boolean condition; continuous aggregation queries; data aggregation; data aggregator; data items; data updates; distributed data items; distributed query execution; entity based query; intermediate proxy; message exchange; pull based mechanisms; push based mechanisms; query results; refresh messages; selection condition; threshold query; user experiences; user interest; value based query; Aggregates; Distributed databases; Monitoring; Planning; Tutorials; Uncertainty; Vehicle dynamics
【Paper Link】 【Pages】:1310
【Authors】: Alfons Kemper ; Thomas Neumann
【Abstract】: The recent advances in processor technology - soon hundreds of cores and terabytes of DRAM in commodity servers - have spawned the academic as well as the industrial interest in main-memory database technology. In this panel, we will discuss the virtues of different architectural designs w.r.t. transaction processing as well as OLAP query processing.
【Keywords】: