34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018. IEEE Computer Society 【DBLP Link】
【Paper Link】 【Pages】:1-12
【Authors】: Sihem Amer-Yahia
【Abstract】: Data Science (DS) has been shifting from libraries and stacks to usage and impact. While "database thinking" is permeating all levels in a DS stack, the DS lifecycle can only be fully realized by looping in humans in a principled and safe fashion. This paper focuses on the role of humans and user data in DS. It starts with the impact of human factors on the design of sustainable and fair data generation and curation. It then reviews how data processing and mining are being revisited to derive insights from user data. That is followed by how human-data interaction shapes the way we think about evaluating DS applications. The paper ends with opportunities that arise when bringing together database thinking and DS for humans.
【Keywords】: Data Science; Human Factors
【Paper Link】 【Pages】:13-14
【Authors】: Philip A. Bernstein
【Abstract】: We present the vision of an actor-oriented database. Its goal is to integrate database abstractions into an actor-oriented programming language for interactive, stateful, scalable, distributed applications that use cloud storage.
【Keywords】: database systems; actor systems; object oriented database; Orleans; transactions; indexing; geo distribution; replication
【Paper Link】 【Pages】:15-23
【Authors】: Margo Seltzer ; Virendra J. Marathe ; Steve Byan
【Abstract】: Around 2010, we observed significant research activity around the development of non-volatile memory technologies. Shortly thereafter, other research communities began considering the implications of non-volatile memory on system design, from storage systems to data management solutions to entire systems. Finally, in July 2015, Intel and Micron Technology announced 3D XPoint. It's now 2018; Intel is shipping its technology in SSD packages, but we've not yet seen the widespread availability of byte-addressable non-volatile memory that resides on the memory bus. We can view non-volatile memory technology and its impact on systems through an historical lens revealing it as the convergence of several past research trends starting with the concept of single-level store, encompassing the 1980s excitement around bubble memory, building upon persistent object systems, and leveraging recent work in transactional memory. We present this historical context, recalling past ideas that seem particularly relevant and potentially applicable and highlighting aspects that are novel.
【Keywords】: non volatile memory; transcations; persistent objects; PCM
【Paper Link】 【Pages】:24-28
【Authors】: Michael Stonebraker
【Abstract】: In this paper, I present my top ten fears about the future of the DBMS field, with apologies to David Letterman. There are three ”big fears”, which I discuss first. Five additional fears are a result of the ”big three”. I then conclude with ”the big enchilada”, which is a pair of fears. In each case, I indicate what I think is the best way to deal with the current situation.
【Keywords】: Resumes; Natural language processing; Cultural differences; Industries; Whales; Wheels
【Paper Link】 【Pages】:29-40
【Authors】: Jianbin Qin ; Yaoshu Wang ; Chuan Xiao ; Wei Wang ; Xuemin Lin ; Yoshiharu Ishikawa
【Abstract】: A similarity search in Hamming space finds binary vectors whose Hamming distances are no more than a threshold from a query vector. It is a fundamental problem in many applications, including image retrieval, near-duplicate Web page detection, and machine learning. State-of-the-art approaches to answering such queries are mainly based on the pigeonhole principle to generate a set of candidates and then verify them. We observe that the constraint based on the pigeonhole principle is not always tight and hence may bring about unnecessary candidates. We also observe that the distribution in real data is often skew, but most existing solutions adopt a simple equiwidth partitioning and allocate the same threshold to all the partitions, and hence fail to exploit the data skewness to optimize the query processing. In this paper, we propose a new form of the pigeonhole principle which allows variable partition size and threshold. Based on the new principle, we first develop a tight constraint of candidates, and then devise cost-aware methods for dimension partitioning and threshold allocation to optimize query processing. Our evaluation on datasets with various data distributions shows the robustness of our solution and its superior query processing performance to the state-of-the-art methods.
【Keywords】: Hamming distance; pigeonhole principle; similarity search
【Paper Link】 【Pages】:41-52
【Authors】: Andrew Ilyas ; Joana M. F. da Trindade ; Raul Castro Fernandez ; Samuel Madden
【Abstract】: Many database columns contain string or numerical data that conforms to a pattern, such as phone numbers, dates, addresses, product identifiers, and employee ids. These patterns are useful in a number of data processing applications, including understanding what a specific field represents when field names are ambiguous, identifying outlier values, and finding similar fields across data sets.One way to express such patterns would be to learn regular expressions for each field in the database. Unfortunately, existing techniques on regular expression learning are slow, taking hundreds of seconds for columns of just a few thousand values. In contrast, we develop XSYSTEM, an efficient method to learn patterns over database columns in significantly less time.We show that these patterns can not only be built quickly, but are expressive enough to capture a number of key applications, including detecting outliers, measuring column similarity, and assigning semantic labels to columns (based on a library of regular expressions). We evaluate these applications with datasets that range from chemical databases (based on a collaboration with a pharmaceutical company), our university data warehouse, and open data from MassData.gov.
【Keywords】: structure extraction; pattern extract; data discovery
【Paper Link】 【Pages】:53-64
【Authors】: Giovanni Simonini ; George Papadakis ; Themis Palpanas ; Sonia Bergamaschi
【Abstract】: Entity Resolution (ER) is the task of finding entity profiles that correspond to the same real-world entity. Progressive ER aims to efficiently resolve large datasets when limited time and/or computational resources are available. In practice, its goal is to provide the best possible partial solution by approximating the optimal comparison order of the entity profiles. So far, Progressive ER has only been examined in the context of structured (relational) data sources, as the existing methods rely on schema knowledge to save unnecessary comparisons: they restrict their search space to similar entities with the help of schema-based blocking keys (i.e., signatures that represent the entity profiles). As a result, these solutions are not applicable in Big Data integration applications, which involve large and heterogeneous datasets, such as relational and RDF databases, JSON files, Web corpus etc. To cover this gap, we propose a family of schema-agnostic Progressive ER methods, which do not require schema information, thus applying to heterogeneous data sources of any schema variety. First, we introduce a naïve schema-agnostic method, showing that the straightforward solution exhibits a poor performance that does not scale well to large volumes of data. Then, we propose three different advanced methods. Through an extensive experimental evaluation over 7 real-world, established datasets, we show that all the advanced methods outperform to a significant extent both the naïve and the state-of-the-art schema-based ones. We also investigate the relative performance of the advanced methods, providing guidelines on the method selection.
【Keywords】: Entity Resolution; Pay as you go; progressive duplicate detection; deduplication; schema agnostic; record linkage; blocking; meta blocking; data cleaning
【Paper Link】 【Pages】:65-76
【Authors】: Fan Yang ; Hyun Ah Song ; Zongge Liu ; Christos Faloutsos ; Vladimir Zadorozhny ; Nicholas D. Sidiropoulos
【Abstract】: We address the challenge of reconstructing historical counts from aggregated, possibly overlapping historical reports. For example, given the monthly and weekly sums, how can we find the daily counts of people infected with flu? We propose an approach, called ARES (Automatic REStoration), that performs automatic data reconstruction in two phases: (1) first, it estimates the sequence of historical counts utilizing domain knowledge, such as smoothness and periodicity of historical events; (2) then, it uses the estimated sequence to learn notable patterns in the target sequence to refine the reconstructed time series. In order to derive such patterns, ARES uses an annihilating filter technique. The idea is to learn a linear shift-invariant operator whose response to the desired sequence is (approximately) zero-yielding a set of null-space equations that the desired signal should satisfy, without the need for the accompanying data. The reconstruction accuracy can be further improved by applying the second phase iteratively. We evaluate ARES on the real epidemiological data from the Tycho project and demonstrate that ARES recovers historical data from aggregated reports with high accuracy. In particular, it considerably outperforms top competitors, including least squares approximation and the more advanced H-FUSE method (42% and 34% improvement based on average RMSE, respectively).
【Keywords】: Information Fusion; Information Disaggregation; Annihilating Filter
【Paper Link】 【Pages】:77-88
【Authors】: Antonio Maccioni ; Riccardo Torlone
【Abstract】: The huge diversity of database technologies in use inside organizations pose today new challenges of data management and integration. Polystores provide a solution to this scenario based on a loosely coupled integration of data sources and the direct access, with the local language, to each storage engine for exploiting its distinctive features. However, given the absence of a global schema, it is hard to know if a query to one system can be satisfied with data stored elsewhere in the polystore. We address this issue by introducing query augmentation, a data manipulation operator for polystores based on the automatic enrichment of the answer to a local query with related data in the rest of the polystore. Augmentation can be used to implement two effective methods for data access in polystores: augmented search and augmented exploration. We show that they provide effective tools for information discovery in polystores that avoid middleware layers, abstract query languages, and shared data models. We also illustrate the design of QUEPA, a system that fully implements our approach in an efficient way. A comprehensive campaign of experiments done with QUEPA shows that our approach is feasible and, unlike other approaches, scales nicely as the polystore grows in the number of stores and size of databases.
【Keywords】: polystore; polyglot persistence; nosql; multi engine; query augmentation
【Paper Link】 【Pages】:89-100
【Authors】: Quoc Viet Hung Nguyen ; Kai Zheng ; Matthias Weidlich ; Bolong Zheng ; Hongzhi Yin ; Nguyen Thanh Tam ; Bela Stantic
【Abstract】: What-if analysis is a data-intensive exploration to inspect how changes in a set of input parameters of a model influence some outcomes. It is motivated by a user trying to understand the sensitivity of a model to a certain parameter in order to reach a set of goals that are defined over the outcomes. To avoid an exploration of all possible combinations of parameter values, efficient what-if analysis calls for a partitioning of parameter values into data ranges and a unified representation of the obtained outcomes per range. Traditional techniques to capture data ranges, such as histograms, are limited to one outcome dimension. Yet, in practice, what-if analysis often involves conflicting goals that are defined over different dimensions of the outcome. Working on each of those goals independently cannot capture the inherent trade-off between them. In this paper, we propose techniques to recommend data ranges for what-if analysis, which capture not only data regularities, but also the trade-off between conflicting goals. Specifically, we formulate a parametric data partitioning problem and propose a method to find an optimal solution for it. Targeting scalability to large datasets, we further provide a heuristic solution to this problem. By theoretical and empirical analyses, we establish performance guarantees in terms of runtime and result quality.
【Keywords】: what if analysis; data partitioning; Pareto analysis
【Paper Link】 【Pages】:101-112
【Authors】: Yuyu Luo ; Xuedi Qin ; Nan Tang ; Guoliang Li
【Abstract】: Data visualization is invaluable for explaining the significance of data to people who are visually oriented. The central task of automatic data visualization is, given a dataset, to visualize its compelling stories by transforming the data (e.g., selecting attributes, grouping and binning values) and deciding the right type of visualization (e.g., bar or line charts). We present DEEPEYE, a novel system for automatic data visualization that tackles three problems: (1) Visualization recognition: given a visualization, is it "good or "bad"? (2) Visualization ranking: given two visualizations, which one is "better"? And (3) Visualization selection: given a dataset, how to find top-k visualizations? DEEPEYE addresses (1) by training a binary classifier to decide whether a particular visualization is good or bad. It solves (2) from two perspectives: (i) Machine learning: it uses a supervised learning-to-rank model to rank visualizations; and (ii) Expert rules: it relies on experts' knowledge to specify partial orders as rules. Moreover, a "boring" dataset may become interesting after data transformations (e.g., binning and grouping), which forms a large search space. We also discuss optimizations to efficiently compute top-k visualizations, for approaching (3). Extensive experiments verify the effectiveness of DEEPEYE".
【Keywords】: data visualization; automatic
【Paper Link】 【Pages】:113-124
【Authors】: Mangesh Bendre ; Vipul Venkataraman ; Xinyan Zhou ; Kevin Chang ; Aditya G. Parameswaran
【Abstract】: Spreadsheet software is the tool of choice for interactive ad-hoc data management, with adoption by billions of users. However, spreadsheets are not scalable, unlike database systems. On the other hand, database systems, while highly scalable, do not support interactivity as a first-class primitive. We are developing DataSpread, to holistically integrate spreadsheets as a front-end interface with databases as a back-end datastore, providing scalability to spreadsheets, and interactivity to databases, an integration we term presentational data management (PDM). In this paper, we make the first step towards this vision for relational databases: developing a storage engine for PDM, studying how to flexibly represent spreadsheet data within a relational database and how to support and maintain access by position. We first conduct an extensive survey of spreadsheet use to motivate our functional requirements for a storage engine for PDM. We develop a natural set of mechanisms for flexibly representing spreadsheet data and demonstrate that identifying the optimal representation is NP-Hard; however, we develop an efficient approach to identify the optimal representation from an important and intuitive subclass of representations. We extend our mechanisms with positional access mechanisms that don't suffer from cascading update issues, leading to constant time access and modification performance. We evaluate these representations on a workload of typical spreadsheets and spreadsheet operations, providing up to 50% reduction in storage, and up to 50% reduction in formula evaluation time.
【Keywords】: DATASPREAD; database; spreadsheet; PDM; storage
【Paper Link】 【Pages】:125-136
【Authors】: Ariel Jarovsky ; Tova Milo ; Slava Novgorodov ; Wang-Chiew Tan
【Abstract】: Writing rules to capture precisely fraudulent transactions is a challenging task where domain experts spend significant effort and time. A key observation is that much of this difficulty originates from the fact that such experts typically work as "lone rangers" or in isolated groups, or work on detecting frauds in one context in isolation from frauds that occur in another context. However, in practice there is a lot of commonality in what different experts are trying to achieve. In this paper, we present the GOLDRUSH system, which facilitates knowledge sharing via effective adaptation of fraud detection rules from one context to another. GOLDRUSH abstracts the possible semantic interpretations of each of the conditions in the rules at the source context and adapts them to the target context. Efficient algorithms are used to identify the most effective rule adaptations w.r.t a given cost-benefit metric. Our extensive set of experiments, based on real-world financial datasets, demonstrate the efficiency and effectiveness of our solution, both in terms of the accuracy of the fraud detection and the actual money saved.
【Keywords】: Rule Sharing; Rule Adaptation; Experts in the loop; Fraud Detection
【Paper Link】 【Pages】:137-148
【Authors】: Hancheng Ge ; Kai Zhang ; Majid Alfifi ; Xia Hu ; James Caverlee
【Abstract】: How can we efficiently recover missing values for very large-scale real-world datasets that are multi-dimensional even when the auxiliary information is regularized at certain mode? Tensor completion is a useful tool to recover a low-rank tensor that best approximates partially observed data and further predicts the unobserved data by this low-rank tensor, which has been successfully used for many applications such as location-based recommender systems, link prediction, targeted advertising, social media search, and event detection. Due to the curse of dimensionality, existing algorithms for tensor completion that integrate auxiliary information do not scale for tensors with billions of elements. In this paper, we propose DisTenC, a new distributed large-scale tensor completion algorithm that can be distributed on Spark. Our key insights are to (i) efficiently handle trace-based regularization terms; (ii) update factor matrices with caching; and (iii) optimize the update of the new tensor via residuals. In this way, we can tackle the high computational costs of traditional approaches and minimize intermediate data, leading to order-of-magnitude improvements in tensor completion. Experimental results demonstrate that DisTenC is capable of handling up to 10~1000X larger tensors than existing methods with much faster convergence rate, shows better linearity on machine scalability, and achieves up to an average improvement of 23.5% in accuracy in applications.
【Keywords】: Tensor Completion; Spark; Distributed Algorithm; Auxiliary Information
【Paper Link】 【Pages】:149-160
【Authors】: Zainab Zolaktaf ; Reza Babanezhad ; Rachel Pottinger
【Abstract】: Standard collaborative filtering approaches for top-N recommendation are biased toward popular items. As a result, they recommend items that users are likely aware of and under-represent long-tail items. This is inadequate, both for consumers who prefer novel items and because concentrating on popular items poorly covers the item space, whereas high item space coverage increases providers' revenue. We present an approach that relies on historical rating data to learn user long-tail novelty preferences. We integrate these preferences into a generic re-ranking framework that customizes balance between accuracy and coverage. We empirically validate that our proposed framework increases the novelty of recommendations. Furthermore, by promoting long-tail items to the right group of users, we significantly increase the system's coverage while scalably maintaining accuracy. Our framework also enables personalization of existing non-personalized algorithms, making them competitive with existing personalized algorithms in key performance metrics, including accuracy and coverage.
【Keywords】: Top N Recommendation; Accuracy; Novelty; Coverage
【Paper Link】 【Pages】:161-172
【Authors】: Wei Shen ; Yinan Liu ; Jianyong Wang
【Abstract】: A knowledge base contains a set of concepts, entities, attributes, and relations. Knowledge bases are increasingly critical to a wide variety of applications in both industry and academia. Yet despite all that, knowledge bases are greatly incomplete. As the world evolves, new entities are generated. Enriching existing knowledge bases with new entities and new location attribute values for them becomes more and more important. Twitter is one of the most popular micro-blogging platforms. Named entities are mentioned frequently in the huge collection of tweets which contain abundant geographical location knowledge. Given a named entity and a set of tweets where the entity appears, we are interested in predicting the entity city-level location using the knowledge embedded in tweets. This task is helpful for many applications such as knowledge base enrichment, tweet location prediction, and entity search. In this paper we propose NELPT, the first unsupervised framework for Named Entity city-level Location Prediction by leveraging the geographical location knowledge from Twitter. This framework leverages a Linear Neural Network model as the predictive model combining two categories of information: (1) local count information; (2) global distributional information. A learning algorithm based on the expectation-maximization (EM) method is proposed to automatically learn the parameters of the Linear Neural Network predictive model without requiring any training data. The experimental results on a real world Twitter data set show that our framework significantly outperforms the baselines in terms of accuracy, and scales very well.
【Keywords】: Entity location prediction; Twitter; Tweet; Knowledge base population
【Paper Link】 【Pages】:173-184
【Authors】: Zihuan Xu ; Siyuan Han ; Lei Chen
【Abstract】: Recently, Blockchain becomes a hot research topic due to the success of Blockchain in many applications, such as cryptocurrency, smart contract, digital assets, distributed cloud storage and so on. The power of Blockchain is that it can achieve the consensus of an ordered set of transactions among nodes which do not trust each other, even with the existence of malicious nodes. However, compared to traditional databases, the current Blockchain technology still cannot handle a massive number of transactions, which is caused by many factors, such as the consensus protocol, structure of the blocks and storage challenge. Among them, the high storage requirement is a key factor that prevents the wide usage of Blockchain on various devices such as mobile phones or low-end PCs. In this paper, to address the storage challenge, we introduce a novel concept called Consensus Unit (CU), which organizes different nodes into one unit and lets them to store at least one copy of Blockchain data in the system together. Based on this idea, we further define the Blocks Assignment Optimization (BAO) problem which determines the optimal assignment of blocks such that the storage space is fully used and the query cost is minimized. We prove that the BAO problem is NP-hard. Thus, we propose three efficient heuristic algorithms to solve the static assignment problem. Furthermore, we present solutions to address the dynamic scenarios when new blocks arrive and nodes join or depart from the CU. To verify the effectiveness of CU, we have conducted extensive experiments on synthetic data and BLOCKBENCH [1]. The results have confirmed the superiority of CU in saving the storage and maintaining the system throughput.
【Keywords】: Blockchain
【Paper Link】 【Pages】:185-196
【Authors】: Viktor Leis ; Michael Haubenschild ; Alfons Kemper ; Thomas Neumann
【Abstract】: Disk-based database systems use buffer managers in order to transparently manage data sets larger than main memory. This traditional approach is effective at minimizing the number of I/O operations, but is also the major source of overhead in comparison with in-memory systems. To avoid this overhead, in-memory database systems therefore abandon buffer management altogether, which makes handling data sets larger than main memory very difficult. In this work, we revisit this fundamental dichotomy and design a novel storage manager that is optimized for modern hardware. Our evaluation, which is based on TPC-C and micro benchmarks, shows that our approach has little overhead in comparison with a pure in-memory system when all data resides in main memory. At the same time, like a traditional buffer manager, it is fully transparent and can manage very large data sets effectively. Furthermore, due to low-overhead synchronization, our implementation is also highly scalable on multi-core CPUs.
【Keywords】: storage engine; in memory; SSD
【Paper Link】 【Pages】:197-208
【Authors】: André Kohn ; Viktor Leis ; Thomas Neumann
【Abstract】: Compiling queries to machine code is a very efficient way for executing queries. One often overlooked problem with compilation is the time it takes to generate machine code. Even with fast compilation frameworks like LLVM, generating machine code for complex queries often takes hundreds of milliseconds. Such durations can be a major disadvantage for workloads that execute many complex, but quick queries. To solve this problem, we propose an adaptive execution framework, which dynamically switches from interpretation to compilation. We also propose a fast bytecode interpreter for LLVM, which can execute queries without costly translation to machine code and dramatically reduces the query latency. Adaptive execution is fine-grained, and can execute code paths of the same query using different execution modes. Our evaluation shows that this approach achieves optimal performance in a wide variety of settings-low latency for small data sets and maximum throughput for large data sizes.
【Keywords】: query execution; compilation; LLVM; interpretation
【Paper Link】 【Pages】:209-220
【Authors】: Martin Boissier ; Rainer Schlosser ; Matthias Uflacker
【Abstract】: Recent developments in database research introduced HTAP systems that are capable of handling both transactional and analytical workloads. These systems achieve their performance by storing the full data set in main memory. An open research question is how far one can reduce the main memory footprint without losing the performance superiority of main memory-resident databases. In this paper, we present a hybrid main memory-optimized database for mixed workloads that evicts cold data to less expensive storage tiers. It adapts the storage layout to mitigate the negative performance impact of secondary storage. A key challenge is to determine which data to place on which storage tier. We introduce a novel workload-driven model that determines Pareto-optimal allocations while also considering reallocation costs. We evaluate our concept for a production enterprise system as well as reproducible data sets.
【Keywords】: data tiering; htap; column selection; hybrid; layout
【Paper Link】 【Pages】:221-232
【Authors】: Yu Yang ; Lingyang Chu ; Yanyan Zhang ; Zhefeng Wang ; Jian Pei ; Enhong Chen
【Abstract】: Dense subgraph discovery is a key primitive in many graph mining applications, such as detecting communities in social networks and mining gene correlation from biological data. Most studies on dense subgraph mining only deal with one graph. However, in many applications, we have more than one graph describing relations among a same group of entities. In this paper, given two graphs sharing the same set of vertices, we investigate the problem of detecting subgraphs that contrast the most with respect to density. We call such subgraphs Density Contrast Subgraphs, or DCS in short. Two widely used graph density measures, average degree and graph affinity, are considered. For both density measures, mining DCS is equivalent to mining the densest subgraph from a "difference" graph, which may have both positive and negative edge weights. Due to the existence of negative edge weights, existing dense subgraph detection algorithms cannot identify the subgraph we need. We prove the computational hardness of mining DCS under the two graph density measures and develop efficient algorithms to find DCS. We also conduct extensive experiments on several real-world datasets to evaluate our algorithms. The experimental results show that our algorithms are both effective and efficient.
【Keywords】: Dense Subgraph; Average Degree; Graph Affinity; Contrast Mining
【Paper Link】 【Pages】:233-244
【Authors】: Kai Wang ; Xin Cao ; Xuemin Lin ; Wenjie Zhang ; Lu Qin
【Abstract】: Driven by real-life applications in geo-social networks, in this paper, we investigate the problem of computing the radius-bounded k-cores (RB-k-cores) that aims to find cohesive subgraphs satisfying both social and spatial constraints on large geo-social networks. In particular, we use k-core to ensure the social cohesiveness and we use a radius-bounded circle to restrict the locations of users in a RB-k-core. We explore several algorithmic paradigms to compute RB-k-cores, including a triple vertex-based paradigm, a binary-vertex-based paradigm, and a paradigm utilizing the concept of rotating circles. The rotating circle-based paradigm is further enhanced with several pruning techniques to achieve better efficiency. The experimental studies conducted on both real and synthetic datasets demonstrate that our proposed rotating-circle-based algorithms can compute all RB-k-cores very efficiently. Moreover, it can also be used to compute the minimum-circle-bounded k-core and significantly outperforms the existing techniques for computing the minimum circle-bounded k-core.
【Keywords】: Community Search; Social networks; Ge social networks
【Paper Link】 【Pages】:245-256
【Authors】: Rong-Hua Li ; Qiangqiang Dai ; Lu Qin ; Guoren Wang ; Xiaokui Xiao ; Jeffrey Xu Yu ; Shaojie Qiao
【Abstract】: Mining cohesive subgraphs from a network is a fundamental problem in network analysis. Most existing cohesive subgraph models are mainly tailored to unsigned networks. In this paper, we study the problem of seeking cohesive subgraphs in a signed network, in which each edge can be positive or negative, denoting friendship or conflict respectively. We propose a novel model, called maximal (α, k)-clique, that represents a cohesive subgraph in signed networks. Specifically, a maximal (α, k)-clique is a clique in which every node has at most. negative neighbors and at least [αk] positive neighbors (α ≥ 1). We show that the problem of enumerating all maximal (α, k)-cliques in a signed network is NP-hard. To enumerate all maximal (α, k)-cliques efficiently, we first develop an elegant signed network reduction technique to significantly prune the signed network. Then, we present an efficient branch and bound enumeration algorithm with several carefully-designed pruning rules to enumerate all maximal (α, k)-cliques in the reduced signed network. The results of extensive experiments on five large real-life datasets demonstrate the efficiency, scalability, and effectiveness of our algorithms.
【Keywords】: Signed Network; Signed Clique; k-core; Community Search
【Paper Link】 【Pages】:257-268
【Authors】: Lu Chen ; Yunjun Gao ; Yuanliang Zhang ; Sibo Wang ; Baihua Zheng
【Abstract】: Massive amounts of images textually annotated by different users are provided by social image websites, e.g., Flickr. Social images are always associated with various information, such as visual features, tags, and users. In this paper, we utilize hypergraph instead of ordinary graph to model social images, since relations among various information are more sophisticated than pairwise. Based on the hypergraph, we propose HIRT, a scalable image retrieval and tagging system, which uses Personalized PageRank to measure vertex similarity, and employs top-k search to support image retrieval and tagging. To achieve good scalability and efficiency, we develop parallel and approximate top-k search algorithms with quality guarantees. Experiments on a large Flickr dataset confirm the effectiveness and efficiency of our proposed system HIRT compared with existing state-of-the-art hypergraph based image retrieval system. In addition, our parallel and approximate top-k search methods are verified to be more efficient than the state-of-the-art methods and meanwhile achieve higher result quality.
【Keywords】: Image Retrieval; Image Tagging; Hypergraph; System
【Paper Link】 【Pages】:269-280
【Authors】: Li Wang ; Ruichu Cai ; Tom Z. J. Fu ; Jiong He ; Zijie Lu ; Marianne Winslett ; Zhenjie Zhang
【Abstract】: Massive data streams from sensors in Internet of Things (IoT) and smart devices with Global Positioning System (GPS) are now flooding to database systems for further processing and analysis. The capability of real-time retrieval from both fresh and historical data turns out to be the key enabler to the real world applications in smart manufacturing and smart city utilizing these data streams. In this paper, we present a simple and effective distributed solution to achieve millions of tuple insertions per second and ad-hoc temporal range query processing in milliseconds. To this end, we propose a new data partitioning scheme that takes advantage of the workload characteristics and avoids expensive global data merging. Furthermore, to resolve the throughput bottleneck, we adopt a template-based index method to skip unnecessary index structure adjustments over the relatively stable distribution of incoming tuples. To parallelize data insertion and query processing, we propose an efficient dispatching mechanism and effective load balancing strategies to fully utilize computational resources in a workload-aware manner. On both synthetic and real workloads, our solution consistently outperforms state-of-the-art open-source systems by at least an order of magnitude.
【Keywords】: index; query processing; distributed system
【Paper Link】 【Pages】:281-292
【Authors】: Guohao Sun ; Guanfeng Liu ; Yan Wang ; Mehmet A. Orgun ; Xiaofang Zhou
【Abstract】: Graph Pattern based Node Matching (GPNM) is to find all the matches of the nodes in a data graph GD based on a given pattern graph GP. GPNM has become increasingly important in many applications, e.g., group finding and expert recommendation. In real scenarios, both GP and GD are updated frequently. However, the existing GPNM methods need to perform a new GPNM procedure from scratch to deliver the node matching results based on the updated GP and updated GD, which consumes much time. Therefore, there is a pressing need for a novel method to efficiently deliver the node matching results. In this paper, we propose a novel INCremental GPNM method called INC-GPNM, where we first build up indices to incrementally maintain the shortest path length range between different label types in GD, and then identify the affected parts of GD in GPNM including nodes and edges w.r.t. the updates of GP and GD. Moreover, based on the index structure and our novel search strategies, INC-GPNM can efficiently deliver node matching results taking the updates of GP and GD as input, and can greatly save the query processing time with improved time complexity. The extensive experiments on five real-world social graphs demonstrate that our method greatly outperforms the state-of-the-art GPNM method in efficiency.
【Keywords】: Graph; Incremental; Graph Pattern Matching
【Paper Link】 【Pages】:293-304
【Authors】: Yixiang Fang ; Reynold Cheng ; Gao Cong ; Nikos Mamoulis ; Yun Li
【Abstract】: In this paper, we study the spatial pattern matching (SPM) query. Given a set D of spatial objects (e.g., houses and shops), each with a textual description, we aim at finding all combinations of objects from D that match a user-defined spatial pattern P. A pattern P is a graph where vertices represent spatial objects, and edges denote distance relationships between them. The SPM query returns the instances that satisfy P. An example of P can be "a house within 10-minute walk from a school, which is at least 2km away from a hospital". The SPM query can benefit users such as house buyers, urban planners, and archaeologists. We prove that answering such queries is computationally intractable, and propose two efficient algorithms for their evaluation. Extensive experimental evaluation and cases studies on four real datasets show that our proposed solutions are highly effective and efficient.
【Keywords】: Spatial pattern; spatial keyword; pattern matching; pattern query
【Paper Link】 【Pages】:305-316
【Authors】: Ahmed R. Mahmood ; Ahmed M. Aly ; Walid G. Aref
【Abstract】: Many applications need to process massive streams of spatio-textual data in real-time against continuous spatio-textual queries. For example, in location-aware ad targeting publish/subscribe systems, it is required to disseminate millions of ads and promotions to millions of users based on the locations and textual profiles of users. In this paper, we study indexing of continuous spatio-textual queries. There exist several related spatio-textual indexes that typically integrate a spatial index with a textual index. However, these indexes usually have a high demand for main-memory and assume that the entire vocabulary of keywords is known in advance. Also, these indexes do not successfully capture the variations in the frequencies of keywords across different spatial regions and treat frequent and infrequent keywords in the same way. Moreover, existing indexes do not adapt to the changes in workload over space and time. For example, some keywords may be trending at certain times in certain locations and this may change as time passes. This affects the indexing and searching performance of existing indexes significantly. In this paper, we introduce FAST, a Frequency-Aware Spatio-Textual index for continuous spatio-textual queries. FAST is a main-memory index that requires up to one third of the memory needed by the state-of-the-art index. FAST does not assume prior knowledge of the entire vocabulary of indexed objects. FAST adaptively accounts for the difference in the frequencies of keywords within their corresponding spatial regions to automatically choose the best indexing approach that optimizes the insertion and search times. Extensive experimental evaluation using real and synthetic datasets demonstrates that FAST is up to 3x faster in search time and 5x faster in insertion time than the state-of-the-art indexes.
【Keywords】: spatio textual; indexing; publish/subscribe
【Paper Link】 【Pages】:317-328
【Authors】: Yuxiang Zeng ; Yongxin Tong ; Lei Chen ; Zimu Zhou
【Abstract】: Spatial crowdsourcing brings in a new approach for social media and location-based services (LBS) to collect location specific information via mobile users. For example, when a user checks in at a shop on Facebook, he will immediately receive and is asked to complete a set of tasks such as "what is the opening hour of the shop". It is non-trivial to complete a set of tasks timely and accurately via spatial crowdsourcing. Since workers in spatial crowdsourcing are often transient and limited in number, these social media platforms need to properly allocate workers within the set of tasks such that all tasks are completed (i) with high quality and (ii) with a minimal latency (estimated by the arriving index of the last recruited worker). Solutions to quality and latency control in traditional crowdsourcing are inapplicable in this problem because they either assume sufficient workers or ignore the spatiotemporal factors. In this work, we define the Latency-oriented Task Completion (LTC) problem, which trades off quality and latency (number of workers) of task completion in spatial crowdsourcing. We prove that the LTC problem is NP-hard. We first devise a minimum-cost-flow based algorithm with a constant approximation ratio for the LTC problem in the offline scenario, where all information is known a prior. Then we study the more practical online scenario of the LTC problem, where workers appear dynamically and the platform needs to arrange tasks for each worker immediately based on partial information. We design two greedy-based algorithms with competitive ratio guarantees to solve the LTC problem in the online scenario. Finally, we validate the effectiveness and efficiency of the proposed solutions through extensive evaluations on both synthetic and real-world datasets.
【Keywords】: spatial crowdsourcing; latency control
【Paper Link】 【Pages】:329-340
【Authors】: Zheng Liu ; Lei Chen ; Yongxin Tong
【Abstract】: Realtime traffic speed estimation is an important issue in urban computation. Existing approaches usually focus on exploiting the periodicity properties of the traffic speed and utilize crowdsourcing techniques to facilitate real-time estimation. The quality of such estimation is limited in real world: 1) the accuracy of existing estimation over-relies on the probed data; 2) the accidental traffic variance is ignored; 3) existing strategies incur exhaustive usage of human workers to get fine-grained estimation results. Thus, a more intelligent RTSE approach is desired. In this paper, we propose the framework of CrowdRTSE (Crowdsourcing-based Real-time Traffic Speed Estimation), which adopts a hybrid offline-online process to collaboratively exploit the historical and real-time data to produce high-quality RTSE. To accomplish such a framework, we devise effective algorithms to judiciously select the best group of human workers with a constant approximation ratio, and effectively propagate the crowdsourced data with high efficiency. Comprehensive evaluations have been conducted on both synthetic and real world datasets. The experimental results verify the effectiveness and efficiency of our proposed methods.
【Keywords】: crowdsourcing; traffic speed; urban computing
【Paper Link】 【Pages】:341-352
【Authors】: Chengliang Chai ; Ju Fan ; Guoliang Li
【Abstract】: Crowdsourced entity collection leverages human's ability to collect entities that are missing in a database, which has many real-world applications, such as knowledge base enrichment and enterprise data collection. There are several challenges. First, it is hard to evaluate the workers' quality because a worker's quality depends on not only the correctness of her provided entities but also the distinctness of these entities compared with the collected ones by other workers. Second, crowd workers are likely to provide popular entities and different workers will provide many duplicated entities, leading to a waste of money and low coverage. To address these challenges, we propose an incentive-based crowdsourced entity collection framework CrowdEC that encourages workers to provide more distinct items using an incentive strategy. CrowdEC has fundamental differences from existing crowdsourcing collection methods. One the one hand, CrowdEC proposes a worker model and evaluates a worker's quality based on cross validation and entity checking. CrowdEC devises a worker utility model that considers both worker's quality and entities' distinctness provided by workers. CrowdEC proposes a worker elimination method to block workers with a low utility, which solves the first challenge. On the other hand, CrowdEC proposes an incentive pricing technique that encourages each qualified (i.e., non-eliminated) worker to provide distinct entities rather than duplicates. CrowdEC provides two types of tasks and judiciously assigns workers with appropriate tasks to address the second challenge. We have conducted both real and simulated experiments, and the results show that CrowdEC outperforms existing state-of-the-art works on both cost and quality.
【Keywords】: Crowdsourcing; entity collection
【Paper Link】 【Pages】:353-364
【Authors】: Xianrui Meng ; Haohan Zhu ; George Kollios
【Abstract】: Concerns about privacy in outsourced cloud databases have grown recently and many efficient and scalable query processing methods over encrypted data have been proposed. However, there is very limited work on how to securely process top-k ranking queries over encrypted databases in the cloud. In this paper, we propose the first efficient and provably secure top-k query processing construction that achieves adaptive CQA security. We develop an encrypted data structure called EHL and describe several secure sub-protocols under our security model to answer top-k queries. Furthermore, we optimize our query algorithms for both space and time efficiency. Finally, we empirically evaluate our protocol using real world datasets and demonstrate that our construction is efficient and practical.
【Keywords】: database security; data privacy; privacy preserving query processing; top-k-query
【Paper Link】 【Pages】:365-376
【Authors】: Julien Pilourdault ; Sihem Amer-Yahia ; Senjuti Basu Roy ; Dongwon Lee
【Abstract】: Task assignment is a central component in crowdsourcing. Organizational studies have shown that worker motivation in completing tasks has a direct impact on the quality of individual contributions. In this work, we examine motivation-aware task assignment in the presence of a set of workers. We propose to model motivation as a balance between task relevance and task diversity and argue that an adaptive approach to task assignment can best capture the evolving nature of motivation. Worker motivation is observed and task assignment is revisited appropriately across iterations. We prove the problem to be NP-hard as well as MaxSNP-Hard and develop efficient approximation algorithms with provable guarantees. Our experiments with synthetic data examine the scalability of our algorithms, and our live real data experiments show that capturing motivation using relevance and diversity leads to high crowdwork quality.
【Keywords】: Crowdsourcing
【Paper Link】 【Pages】:377-388
【Authors】: Haipei Sun ; Boxiang Dong ; Bo Zhang ; Wendy Hui Wang ; Murat Kantarcioglu
【Abstract】: Crowdsourcing has raised several security concerns. One of the concerns is how to assign sensitive tasks in the crowdsourcing market, especially when there are colluding participants in crowdsourcing. In this paper, we consider adversarial colluding participants who intend to extract sensitive data by exchanging information. We design a 3-step sensitive task assignment method: (1) the collusion estimation step that quantifies the workers' pairwise collusion probability by estimating answer truth based on their responses; (2) the worker selection step that executes a heuristic sampling-based approach to select the fewest workers whose collusion probability satisfies the given security requirement; and (3) the task partitioning step that splits the sensitive information among the selected workers. We perform an extensive set of experiments on both real-world and synthetic datasets. The results demonstrate the accuracy and efficiency of our method.
【Keywords】: Crowdsourcing; Collusion detection; Task assignment
【Paper Link】 【Pages】:389-400
【Authors】: Souvik Bhattacherjee ; Amol Deshpande
【Abstract】: We address the problem of compactly storing a large number of versions (snapshots) of a collection of keyed documents or records in a distributed environment, while efficiently answering a variety of retrieval queries over those, including retrieving full or partial versions, and evolution histories for specific keys. We motivate the increasing need for such a system in a variety of application domains, carefully explore the design space for building such a system and the various storage-computation-retrieval trade-offs, and discuss how different storage layouts influence those trade-offs. We propose a novel system architecture that satisfies the key desiderata for such a system, and offers simple tuning knobs that allow adapting to a specific data and query workload. Our system is intended to act as a layer on top of a distributed key-value store that houses the raw data as well as any indexes. We design novel off-line storage layout algorithms for efficiently partitioning the data to minimize the storage costs while keeping the retrieval costs low. We also present an online algorithm to handle new versions being added to system. Using extensive experiments on large datasets, we demonstrate that our system operates at the scale required in most practical scenarios and often outperforms standard baselines, including a delta-based storage engine, by orders-of-magnitude.
【Keywords】: document versioning; record partitioning; document compression; online partitioning; key value store
【Paper Link】 【Pages】:401-412
【Authors】: Chenggang Wu ; Jose M. Faleiro ; Yihan Lin ; Joseph M. Hellerstein
【Abstract】: Modern cloud providers offer dense hardware with multiple cores and large memories, hosted in global platforms. This raises the challenge of implementing high-performance software systems that can effectively scale from a single core to multicore to the globe. Conventional wisdom says that software designed for one scale point needs to be rewritten when scaling up by 10-100X. In contrast, we explore how a system can be architected to scale across many orders of magnitude by design. We explore this challenge in the context of a new key-value store system called Anna: a partitioned, multi-mastered system that achieves high performance and elasticity via wait-free execution and coordination-free consistency. Our design rests on a simple architecture of coordination-free actors that perform state update via merge of lattice-based composite data structures. We demonstrate that a wide variety of consistency models can be elegantly implemented in this architecture with unprecedented consistency, smooth fine-grained elasticity, and performance that far exceeds the state of the art.
【Keywords】: key value store; distributed storage system; lattice; consistency; scalability; high performance; coordination free; replication; partitioning
【Paper Link】 【Pages】:413-424
【Authors】: Shuang Hao ; Nan Tang ; Guoliang Li ; Jianhua Feng
【Abstract】: Entity categorization - the process of grouping entities into categories for some specific purpose - is an important problem with a great many applications, such as Google Scholar and Amazon products. Unfortunately, in practice, many entities are mis-categorized. In this paper, we study the problem of discovering mis-categorized entities from a given group of entities. This problem is inherently hard: all entities within the same group have been "well" categorized by state-of-the-art solutions. Apparently, it is nontrivial to differentiate them. We propose a novel rule-based framework to solve this problem. It first uses positive rules to compute disjoint partitions of entities, where the partition with the largest size is taken as the correctly categorized partition, namely the pivot partition. It then uses negative rules to identify mis-categorized entities in other partitions that are dissimilar to the entities in the pivot partition. We describe optimizations on applying these rules, and discuss how to generate positive/negative rules. Extensive experimental results on two real-world datasets show the effectiveness of our solution.
【Keywords】: mis categorized entity; rule based framework; signature; rule generation
【Paper Link】 【Pages】:425-436
【Authors】: Darius Sidlauskas ; Sean Chester ; Eleni Tzirita Zacharatou ; Anastasia Ailamaki
【Abstract】: The majority of spatial processing techniques rely heavily on the idea of approximating each group of spatial objects by their minimum bounding box (MBB). As each MBB is compact to store (requiring only two multi-dimensional points) and intersection tests between MBBs are cheap to execute, these approximations are used predominantly to perform the (initial) filtering step of spatial data processing. However, fitting (groups of) spatial objects into a rough box often results in a very poor approximation of the underlying data. The resulting MBBs contain a lot of "dead space"-fragments of bounded area that contain no actual objects-that can significantly reduce the filtering efficacy. This paper introduces the general concept of a clipped bounding box (CBB) that addresses the principal disadvantage of MBBs, i.e., their poor approximation of spatial objects. Essentially, a CBB "clips away" dead space from the corners of an MBB by storing only a few auxiliary points. Turning to four popular R-tree implementations (a ubiquitous application of MBBs), we demonstrate how minor modifications to the query algorithm can exploit our CBB auxiliary points to avoid many unnecessary recursions into dead space. Extensive experiments show that clipped R-tree variants substantially reduce I/Os: e.g., by clipping the state-of-the-art revised R*-tree we can eliminate on average 19% of I/Os.
【Keywords】: Spatial indexing; R tree; skyline
【Paper Link】 【Pages】:437-448
【Authors】: Stefan Noll ; Jens Teubner ; Norman May ; Alexander Böhm
【Abstract】: Modern microprocessors include a sophisticated hierarchy of caches to hide the latency of memory access and thereby speed up data processing. However, multiple cores within a processor usually share the same last-level cache. This can hurt performance, especially in concurrent workloads whenever a query suffers from cache pollution caused by another query running on the same socket. In this work, we confirm that this particularly holds true for the different operators of an in-memory DBMS: The throughput of cache-sensitive operators degrades by more than 50%. To remedy this issue, we devise a cache allocation scheme from an empirical analysis of different operators and integrate a cache partitioning mechanism into the execution engine of a commercial DBMS. Finally, we demonstrate that our approach improves the overall system performance by up to 38%.
【Keywords】: in memory database system; query processing
【Paper Link】 【Pages】:449-460
【Authors】: Christopher R. Aberger ; Andrew Lamb ; Kunle Olukotun ; Christopher Ré
【Abstract】: Pipelines combining SQL-style business intelligence (BI) queries and linear algebra (LA) are becoming increasingly common in industry. As a result, there is a growing need to unify these workloads in a single framework. Unfortunately, existing solutions either sacrifice the inherent benefits of ex-clusively using a relational database (e.g. logical and physical independence) or incur orders of magnitude performance gaps compared to specialized engines (or both). In this work, we study applying a new type of query processing architecture to standard BI and LA benchmarks. To do this, we present a new in-memory query processing engine called LevelHeaded. LevelHeaded uses worst-case optimal joins as its core execution mechanism for both BI and LA queries. With LevelHeaded, we show how crucial optimizations for BI and LA queries can be captured in a worst-case optimal query architecture. Using these optimizations, LevelHeaded outperforms other relational database engines (LogicBlox, MonetDB, and HyPer) by orders of magnitude on standard LA benchmarks, while performing on average within 31% of the best-of-breed BI (HyPer) and LA (Intel MKL) solutions on their own benchmarks. Our results show that such a single query processing architecture can be efficient on both BI and LA queries.
【Keywords】: join processing; worst case optimal join; business intelligence querying; linear algebra querying
【Paper Link】 【Pages】:461-472
【Authors】: Tianzheng Wang ; Justin J. Levandoski ; Per-Åke Larson
【Abstract】: Large non-volatile memories (NVRAM) will change the durability and recovery mechanisms of main-memory database systems. Today, these systems make operations durable through logging and checkpointing to secondary storage, and recover by rebuilding the in-memory database (records and indexes) from on-disk state. A main-memory database stored in NVRAM, however, can potentially recover instantly after a power failure. Modern main-memory databases typically use lock-free index structures to enable a high degree of concurrency. Thus NVRAM-resident databases need indexes that are both lock-free, persistent, and able to recover (almost) instantly after a crash. In this paper, we show how to easily build such index structures. A key enabling component of our scheme is a multi-word compare-and-swap operation, PMwCAS, that is lock-free, persistent, and efficient. PMwCAS significantly reduces the complexity of building lock-free indexes, which we illustrate by implementing both doubly-linked skip lists and the Bw-tree lock-free B+-tree for NVRAM. Experimental results show that PMwCAS's runtime overhead is very low (~4-6% under realistic workloads). This overhead is sufficiently low that the same implementation can be used for both DRAM and NVRAM resident indexes.
【Keywords】: NVRAM; index; non volatile memory; lock free; multi word compare and swap; MWCAS; databases
【Paper Link】 【Pages】:473-484
【Authors】: Simone Dominico ; Eduardo Cunha de Almeida ; Jorge Augusto Meira ; Marco Antonio Zanata Alves
【Abstract】: During the parallel execution of queries in Non-Uniform Memory Access (NUMA) systems, the Operating System (OS) maps the threads (or processes) from modern database systems to the available cores among the NUMA nodes using the standard node-local policy. However, such non smart mapping may result in inefficient memory activity, because shared data may be accessed by scattered threads requiring large data movements or non-shared data may be allocated to threads sharing the same cache memory, increasing its conflicts. In this paper we present a data-distribution aware and elastic multi-core allocation mechanism to improve the OS mapping of database threads in NUMA systems. Our hypothesis is that we mitigate the data movement if we only hand out to the OS the local optimum number of cores in specific nodes. We propose a mechanism based on a rule-condition-action pipeline that uses hardware counters to promptly find out the local optimum number of cores. Our mechanism uses a priority queue to track the history of the memory address space used by database threads in order to decide about the allocation/release of cores and its distribution among the NUMA nodes to decrease remote memory access. We implemented and tested a prototype of our mechanism when executing two popular Volcano-style databases improving their NUMA-affinity. For MonetDB, we show maximum speedup of 1.53×, due to consistent reduction in the local/remote per-query data traffic ratio of up to 3.87× running 256 concurrent clients in the 1 GB TPC-H database also showing system energy savings of 26.05%. For the NUMA-aware SQL Server, we observed speedup of up to 1.27× and reduction on the data traffic ratio of 3.70×.
【Keywords】: Thread affinity; NUMA; OLAP; Abstract Model
【Paper Link】 【Pages】:485-496
【Authors】: Zhaojing Luo ; Shaofeng Cai ; Jinyang Gao ; Meihui Zhang ; Kee Yuan Ngiam ; Gang Chen ; Wang-Chien Lee
【Abstract】: Deep Learning and Machine Learning models have recently been shown to be effective in many real world applications. While these models achieve increasingly better predictive performance, their structures have also become much more complex. A common and difficult problem for complex models is overfitting. Regularization is used to penalize the complexity of the model in order to avoid overfitting. However, in most learning frameworks, regularization function is usually set as some hyper parameters, and therefore the best setting is difficult to find. In this paper, we propose an adaptive regularization method, as part of a large end-to-end healthcare data analytics software stack, which effectively addresses the above difficulty. First, we propose a general adaptive regularization method based on Gaussian Mixture (GM) to learn the best regularization function according to the observed parameters. Second, we develop an effective update algorithm which integrates Expectation Maximization (EM) with Stochastic Gradient Descent (SGD). Third, we design a lazy update algorithm to reduce the computational cost by 4x. The overall regularization framework is fast, adaptive and easy-to-use. We validate the effectiveness of our regularization method through an extensive experimental study over 13 standard benchmark datasets and three kinds of deep learning/machine learning models. The results illustrate that our proposed adaptive regularization method achieves significant improvement over state-of-the-art regularization methods.
【Keywords】: AI interaction with DB technology; Data Science; Data Mining and Knowledge Discovery; Complex Analytics; Adaptive Regularization Tool
【Paper Link】 【Pages】:497-508
【Authors】: Zicheng Fang ; Peng Wang ; Wei Wang
【Abstract】: Recently, time series classification with shapelets, due to their high discriminative ability and good interpretability, has attracted considerable interests within the research community. Previously, shapelet generating approaches extracted shapelets from training time series or learned shapelets with many parameters. Although they can achieve higher accuracy than other approaches, they still confront some challenges. First, searching or learning shapelets in the raw time series space incurs a huge computation cost. For example, it may cost several hours to deal with only hundreds of time series. Second, they must determine how many shapelets are needed beforehand, which is difficult without prior knowledge. To overcome these challenges, in this paper, we propose a novel algorithm to learn shapelets. We first discover shapelet candidates from the Piecewise Aggregate Approximation (PAA) word space, which is much more efficient than searching in the raw time series space. Moreover, the concept of coverage is proposed to measure the quality of candidates, based on which we design a method to compute the optimal number of shapelets. After that, we apply the logistic regression classifier to adjust the shapelets. Extensive experimentation on 15 datasets demonstrates that our algorithm is more accurate against 6 baselines and outperforms 2 orders of magnitude in terms of efficiency. Moreover, our algorithm has fewer redundant shape-like shapelets and is more convenient to interpret classification decisions.
【Keywords】: Shapelet; Efficiency; Interpretability
【Paper Link】 【Pages】:509-520
【Authors】: Khanh Luong ; Richi Nayak
【Abstract】: Non-negative Matrix Factorization (NMF) methods have been effectively used for clustering high dimensional data. Manifold learning is combined with the NMF framework to ensure the projected lower dimensional representations preserve the local geometric structure of data. In this paper, considering the context of multi-type relational data clustering, we develop a new formulation of manifold learning to be embedded in the factorization process such that the new low-dimensional space can maintain both local and global structures of original data. We also propose to include the interactions between clusters of different data types by enforcing a Normalize Cut-type constraint that leads to a comprehensive NMF-based framework. A theoretical analysis and extensive experiments are provided to validate the effectiveness of the proposed work.
【Keywords】: Non-negative matrix factorization; multi-type relational data; manifold learning; association relationship; k-nearest neighbour graph; p farthest neighbour graph
【Paper Link】 【Pages】:521-532
【Authors】: Rodica Neamtu ; Ramoza Ahsan ; Elke A. Rundensteiner ; Gábor N. Sárközy ; Eamonn J. Keogh ; Hoang Anh Dau ; Cuong Nguyen ; Charles Lovering
【Abstract】: Domain-specific distances preferred by analysts for exploring similarities among time series tend to be "point-to-point" distances. Unfortunately, this point-wise nature limits their ability to perform meaningful comparisons between sequences of different lengths and with temporal mis-alignments. Analysts instead need "elastic" alignment tools such as Dynamic Time Warping (DTW) to perform such flexible comparisons. However, the existing alignment tools are limited in that they do not incorporate diverse distances. To address this shortcoming, our work introduces the first conceptual framework called Generalized Dynamic Time Warping (GDTW) that supports now alignment (warping) of a large array of domain-specific distances in a uniform manner. While the classic DTW and its prior extensions focus on the Euclidean Distance, our GDTW is the first method that generalizes the ubiquitous DTW and "extends" its warping capabilities to a rich diversity of point-to-point distances. Based on our GDTW paradigm that preserves the efficiency of the dynamic programming paradigm of DTW, we design an abstraction that implemented by our GDTW Design Tool enables analysts to "warp" new distances with little programming effort. Through extensive evaluation studies on 85 real public domain benchmark datasets, we show that our newly warped distances offer higher classification accuracy than the previously available distances for the majority of these datasets. Further, our case study on heart arrhythmia data illustrates the utility of the new distances enabled by our GDTW warping methodology.
【Keywords】: data mining; time series similarity; dynamic time warping; time series classification; similarity search
【Paper Link】 【Pages】:533-544
【Authors】: Zijian Li ; Xun Jian ; Xiang Lian ; Lei Chen
【Abstract】: Graph similarity search is a common and fundamental operation in graph databases. One of the most popular graph similarity measures is the Graph Edit Distance (GED) mainly because of its broad applicability and high interpretability. Despite its prevalence, exact GED computation is proved to be NP-hard, which could result in unsatisfactory computational efficiency on large graphs. However, exactly accurate search results are usually unnecessary for real-world applications especially when the responsiveness is far more important than the accuracy. Thus, in this paper, we propose a novel probabilistic approach to efficiently estimate GED, which is further leveraged for the graph similarity search. Specifically, we first take branches as elementary structures in graphs, and introduce a novel graph similarity measure by comparing branches between graphs, i.e., Graph Branch Distance (GBD), which can be efficiently calculated in polynomial time. Then, we formulate the relationship between GED and GBD by considering branch variations as the result ascribed to graph edit operations, and model this process by probabilistic approaches. By applying our model, the GED between any two graphs can be efficiently estimated by their GBD, and these estimations are finally utilized in the graph similarity search. Extensive experiments show that our approach has better accuracy, efficiency and scalability than other comparable methods in the graph similarity search over real and synthetic data sets.
【Keywords】: Graph Edit Distance; Graph Similarity Search
【Paper Link】 【Pages】:545-556
【Authors】: Yue Wang ; Xiang Lian ; Lei Chen
【Abstract】: SimRank is a popular link-based similarity measurement among nodes in a graph. To compute the all-pairs SimRank matrix accurately, iterative methods are usually used. For static graphs, current iterative solutions are not efficient enough, both in time and space, due to unnecessary cost and storage by the nature of iterative updating. For dynamic graphs, all current incremental solutions for updating the Sim-Rank matrix are based on an approximated SimRank definition, and thus have no accuracy guarantee. In this paper, we propose a novel local push based algorithm for computing all-pairs SimRank. We show that our algorithms outperform the state-of-the-art static and dynamic all-pairs SimRank algorithms.
【Keywords】: SimRank; Dynamic algorithm; Graph mining; Similarity measurement
【Paper Link】 【Pages】:557-568
【Authors】: Fan Zhang ; Ying Zhang ; Lu Qin ; Wenjie Zhang ; Xuemin Lin
【Abstract】: User engagement and tie strength are fundamental and important components in social networks. The model of k-truss not only captures actively engaged users, but also ensures strong tie strength among these users. It motivates us to utilize the model of k-truss in preventing network unraveling, which simultaneously considers both of the basic components. In this paper, we propose and investigate the anchored k-truss problem to reinforce a network by anchoring critical users who can significantly stop the unraveling. We prove the problem is NP-hard for k ≥ 4. A fast edge deletion order based algorithm, named AKT, is proposed with efficient candidate exploration and pruning techniques based on the order. Comprehensive experiments on 10 real-life graphs demonstrate the effectiveness of our model and the efficiency of our methods.
【Keywords】: social network; k-truss; user engagement; tie strength
【Paper Link】 【Pages】:569-580
【Authors】: Huiping Liu ; Cheqing Jin ; Bin Yang ; Aoying Zhou
【Abstract】: Motivated by many practical applications in logistics and mobility-as-a-service, we study the top-k optimal sequenced routes (KOSR) querying on large, general graphs where the edge weights may not satisfy the triangle inequality, e.g., road network graphs with travel times as edge weights. The KOSR querying strives to find the top-k optimal routes (i.e., with the top-k minimal total costs) from a given source to a given destination, which must visit a number of vertices with specific vertex categories (e.g., gas stations, restaurants, and shopping malls) in a particular order (e.g., visiting gas stations before restaurants and then shopping malls). To efficiently find the top-k optimal sequenced routes, we propose two algorithms PruningKOSR and StarKOSR. In PruningKOSR, we define a dominance relationship between two partially-explored routes. The partially-explored routes that can be dominated by other partially-explored routes are postponed being extended, which leads to a smaller searching space and thus improves efficiency. In StarKOSR, we further improve the efficiency by extending routes in an A manner. With the help of a judiciously designed heuristic estimation that works for general graphs, the cost of partially explored routes to the destination can be estimated such that the qualified complete routes can be found early. In addition, we demonstrate the high extensibility of the proposed algorithms by incorporating Hop Labeling, an effective label indexing technique for shortest path queries, to further improve efficiency. Extensive experiments on multiple real-world graphs demonstrate that the proposed methods significantly outperform the baseline method. Furthermore, when k = 1, StarKOSR also outperforms the state-of-the-art method for the optimal sequenced route queries.
【Keywords】: route planning; optimal sequenced routes; hub labeling
【Paper Link】 【Pages】:581-592
【Authors】: Efrat Abramovitz ; Daniel Deutch ; Amir Gilad
【Abstract】: Inference of queries from their output examples has been extensively studied in multiple contexts as means to ease the formulation of queries. In this paper we propose a novel approach for the problem, based on provenance. The idea is to use provenance in two manners: first as an additional information that is associated with the given examples and explains their rationale; and then again as a way to show users a description of the difference between candidate queries, prompting their feedback. We have implemented the framework in the context of simple graph patterns and union thereof, and demonstrate its effectiveness in the context of multiple ontologies.
【Keywords】: Provenance; SPARQL
【Paper Link】 【Pages】:593-604
【Authors】: Wenfei Fan ; Xueli Liu ; Yingjie Cao
【Abstract】: This paper develops techniques for reasoning about graph functional dependencies (GFDs). We study the satisfiability problem, to decide whether a given set of GFDs has a model, and the implication problem, to decide whether a set of GFDs entails another GFD. While these fundamental problems are important in practice, they are coNP-complete and NP-complete, respectively. We establish a small model property for satisfiability, showing that if a set ? of GFDs is satisfiable, then it has a model of a size bounded by the size |Σ| of Σ; similarly we prove a small model property for implication. Based on the properties, we develop algorithms for checking the satisfiability and implication of GFDs. Moreover, we provide parallel algorithms that guarantee to reduce running time when more processors are used, despite the intractability of the problems. We experimentally verify the efficiency and scalability of the algorithms.
【Keywords】: graph functional dependencies; satisfiability; implication; parallel algorithms
【Paper Link】 【Pages】:605-616
【Authors】: Jiahao Wang ; Peng Cai ; Jinwei Guo ; Weining Qian ; Aoying Zhou
【Abstract】: This work addresses the need for efficient key-range validation for a composite OLTP and bulk processing workload characterized by modern enterprise applications. In-memory database system (IMDB), mostly adopting the optimistic concurrency control (OCC) mechanism, performs well if the contention of conventional OLTP workloads is low and each transaction only contains point read/write query with primary key. In this work we present the performance problem of IMDBs under mixed OLTP and bulk processing workloads. The reason is that existing OCC protocols take expensive cost to generate a serializable schedule if the OLTP workload contains bulk processing operations with key-range scan. To this end, we develop an efficient and scalable range optimistic concurrency control (ROCC) which uses logical ranges to track the potential conflicting transactions and to reduce the number of transactions to be validated. At the read phase, a transaction keeps a set of predicates to remember the version and precise scope of scanned ranges, which eliminates the cost of maintaining scanned records. Before entering the validation phase, if the transaction intends to update records in the logical range, it needs to register to the corresponding lock-free list implemented by a circular array. Finally, ROCC filters out unrelated transactions and validates the bulk operation at range level. Experimental results show that ROCC has good performance and scalability under heterogeneous workloads mixed with point access and bulk processing.
【Keywords】: Range concurrency control; OCC; Serializability
【Paper Link】 【Pages】:617-628
【Authors】: Xiucheng Li ; Kaiqi Zhao ; Gao Cong ; Christian S. Jensen ; Wei Wei
【Abstract】: Trajectory similarity computation is fundamental functionality with many applications such as animal migration pattern studies and vehicle trajectory mining to identify popular routes and similar drivers. While a trajectory is a continuous curve in some spatial domain, e.g., 2D Euclidean space, trajectories are often represented by point sequences. Existing approaches that compute similarity based on point matching suffer from the problem that they treat two different point sequences differently even when the sequences represent the same trajectory. This is particularly a problem when the point sequences are non-uniform, have low sampling rates, and have noisy points. We propose the first deep learning approach to learning representations of trajectories that is robust to low data quality, thus supporting accurate and efficient trajectory similarity computation and search. Experiments show that our method is capable of higher accuracy and is at least one order of magnitude faster than the state-of-the-art methods for k-nearest trajectory search.
【Keywords】: Trajectory similarity; representation learning; deep neural nets
【Paper Link】 【Pages】:629-640
【Authors】: Ying Xu ; Dongxiang Zhang ; Meihui Zhang ; Dongsheng Li ; Xiaoling Wang ; Heng Tao Shen
【Abstract】: Continuous proximity detection monitors the real-time positions of a large set of moving users and sends an alert as long as the distance of any matching pair is smaller than the threshold. Existing solutions construct either a static safe region with maximized area or a mobile safe region with constant speed and direction, which cannot not capture real motion patterns. In this paper, we propose a new type of safe region that relies on trajectory prediction techniques to significantly reduce the communication I/O. It takes into account the complex non-linear motion patterns and constructs a stripe to enclose the sequence of future locations as a predictive safe region. The stripe construction is guided by a holistic cost model with the objective of maximizing the expected time for the next communication. We conduct experiments on four real datasets with four types of prediction models and our method reduces the communication I/O by more than 30% in the default parameter settings.
【Keywords】: continuous proximity detection; trajectory prediction; safe region
【Paper Link】 【Pages】:641-652
【Authors】: Zhongle Xie ; Qingchao Cai ; Gang Chen ; Rui Mao ; Meihui Zhang
【Abstract】: Due to poor cache utilization and latching contention, the B-tree like structures, which have been heavily used in traditional databases, are not suitable for modern in-memory databases running over multi-core infrastructure. To address the problem, several in-memory indices, such as FAST, Masstree, BwTree, ART and PSL, have recently been proposed, and they show good performance in concurrent settings. Given the various design choices and implementation techniques being adopted by these indices, it is therefore important to understand how these techniques and properties actually affect the indexing performance. To this end, we conduct a comprehensive performance study to compare these indices from multiple perspectives, including query throughput, scalability, latency, memory consumption as well as cache/branch miss rate, using various query workloads with different characteristics. Our results indicate that there is no one-size-fits-all solution. For example, PSL achieves better query throughput for most settings, but occupies more memory space and can incur a large overhead in updating the index. Nevertheless, the huge performance gain renders the exploitation of modern hardware features indispensable for modern database indices.
【Keywords】: Index; NUMA; Skip List; B+-Tree
【Paper Link】 【Pages】:653-664
【Authors】: Jinfei Liu ; Juncheng Yang ; Li Xiong ; Jian Pei ; Jun Luo
【Abstract】: Skyline queries are important in many application domains. In this paper, we propose a novel structure Skyline Diagram, which given a set of points, partitions the plane into a set of regions, referred to as skyline polyominos. All query points in the same skyline polyomino have the same skyline query results. Similar to k^th-order Voronoi diagram commonly used to facilitate k nearest neighbor (kNN) queries, skyline diagram can be used to facilitate skyline queries and many other applications. However, it may be computationally expensive to build the skyline diagram. By exploiting some interesting properties of skyline, we present several efficient algorithms for building the diagram with respect to three kinds of skyline queries, quadrant, global, and dynamic skylines. Experimental results on both real and synthetic datasets show that our algorithms are efficient and scalable.
【Keywords】: Skyline; Diagram; Voronoi; Queries
【Paper Link】 【Pages】:665-676
【Authors】: Felix Martin Schuhknecht ; Jens Dittrich ; Laurent Linden
【Abstract】: In nature, many species became extinct as they could not adapt quickly enough to their environment. They were simply not fit enough to adapt to more and more challenging circumstances. Similar things happen when algorithms are too static to cope with particular challenges of their "environment", be it the workload, the machine, or the user requirements. In this regard, in this paper we explore the well-researched and fascinating family of adaptive indexing algorithms. Classical adaptive indexes solely adapt the indexedness of the data to the workload. However, we will learn that so far we have overlooked a second higher level of adaptivity, namely the one of the indexing algorithm itself. We will coin this second level of adaptivity meta-adaptivity. Based on a careful experimental analysis, we will develop an adaptive index, which realizes meta-adaptivity by (1) generalizing the way reorganization is performed, (2) reacting to the evolving indexedness and varying reorganization effort, and (3) defusing skewed distributions in the input data. As we will demonstrate, this allows us to emulate the characteristics of a large set of specialized adaptive indexing algorithms. In an extensive experimental study we will show that our meta-adaptive index is extremely fit in a variety of environments and outperforms a large amount of specialized adaptive indexes under various query access patterns and key distributions.
【Keywords】: Adaptive Indexing; Main memory Indexing; Database Cracking
【Paper Link】 【Pages】:677-688
【Authors】: Badrish Chandramouli ; Jonathan Goldstein ; Yinan Li
【Abstract】: There is a growing interest in processing real-time queries over out-of-order streams in this big data era. This paper presents a comprehensive solution to meet this requirement. Our solution is based on Impatience sort, an online sorting technique that is based on an old technique called Patience sort. Impatience sort is tailored for incrementally sorting streaming datasets that present themselves as almost sorted, usually due to network delays and machine failures. With several optimizations, our solution can adapt to both input streams and query logic. Further, we develop a new Impatience framework that leverages Impatience sort to reduce the latency and memory usage of query execution, and supports a range of user latency requirements, without compromising on query completeness and throughput, while leveraging existing efficient in-order streaming engines and operators. We evaluate our proposed solution in Trill, a high-performance streaming engine, and demonstrate that our techniques significantly improve sorting performance and reduce memory usage - in some cases, by over an order of magnitude.
【Keywords】: Sorting; streaming; out of order; disorder; performance
【Paper Link】 【Pages】:689-700
【Authors】: K. Venkatesh Emani ; S. Sudarshan
【Abstract】: Database applications are typically written using a mixture of imperative languages and declarative frameworks for data processing. Data processing logic gets distributed across the declarative and imperative parts of a program. Often, there is more than one way to implement the same program, whose efficiency may depend on a number of parameters. In this paper, we propose a framework that automatically generates all equivalent alternatives to a given program using a given set of program transformations, and chooses the least cost alternative. We use the concept of program regions as an algebraic abstraction of a program and extend the Volcano/Cascades framework for optimization of algebraic expressions, to optimize programs. We illustrate the use of our framework for optimizing database applications. We show through experimental results, that our framework has wide applicability in real-world applications and provides significant performance benefits.
【Keywords】: database application optimization; cost-based query optimization; program analysis
【Paper Link】 【Pages】:701-712
【Authors】: Rong Zhu ; Zhaonian Zou ; Jianzhong Li
【Abstract】: Mining dense subgraphs on multi-layer graphs is an interesting problem, which has witnessed lots of applications in practice. To overcome the limitations of the quasi-clique-based approach, we propose d-coherent core (d-CC), a new notion of dense subgraph on multi-layer graphs, which has several elegant properties. We formalize the diversified coherent core search (DCCS) problem, which finds k d-CCs that can cover the largest number of vertices. We propose a greedy algorithm with an approximation ratio of 1 - 1/e and two search algorithms with an approximation ratio of 1/4. The experiments verify that the search algorithms are faster than the greedy algorithm and produce comparably good results as the greedy algorithm in practice. As opposed to the quasi-clique-based approach, our DCCS algorithms can fast detect larger dense subgraphs that cover most of the quasi-clique-based results.
【Keywords】: multi layer graph; coherent core; diversity
【Paper Link】 【Pages】:713-724
【Authors】: Dongxiang Zhang ; Long Guo ; Xiangnan He ; Jie Shao ; Sai Wu ; Heng Tao Shen
【Abstract】: Entity resolution identifies all records in a database that refer to the same entity. The mainstream solutions rely on supervised learning or crowd assistance, both requiring labor overhead for data annotation. To avoid human intervention, we propose an unsupervised graph-theoretic fusion framework with two components, namely ITER and CliqueRank. Specifically, ITER constructs a weighted bipartite graph between terms and record-record pairs and iteratively propagates the node salience until convergence. Subsequently, CliqueRank constructs a record graph to estimate the likelihood of two records resident in the same clique. The derived likelihood from CliqueRank is fed back to ITER to rectify the edge weight until a joint optimum can be reached. Experimental evaluation was conducted among 14 competitors and results show that without any labeled data or crowd assistance, our unsupervised framework is comparable or even superior to state-of-the-art methods among three benchmark datasets.
【Keywords】: unsupervised entity resolution; random walk; bipartite graph
【Paper Link】 【Pages】:725-736
【Authors】: Yu Zhang ; Srikanta Tirthapura ; Graham Cormode
【Abstract】: A current challenge for data management systems is to support the construction and maintenance of machine learning models over data that is large, multi-dimensional, and evolving. While systems that could support these tasks are emerging, the need to scale to distributed, streaming data requires new models and algorithms. In this setting, as well as computational scalability and model accuracy, we also need to minimize the amount of communication between distributed processors, which is the chief component of latency. We study Bayesian Networks, the workhorse of graphical models, and present a communication-efficient method for continuously learning and maintaining a Bayesian network model over data that is arriving as a distributed stream partitioned across multiple processors. We show a strategy for maintaining model parameters that leads to an exponential reduction in communication when compared with baseline approaches to maintain the exact MLE (maximum likelihood estimation). Meanwhile, our strategy provides similar prediction errors for the target distribution and for classification tasks.
【Keywords】: Bayesian Networks; Distributed Stream Algorithm; Parameter Learning
【Paper Link】 【Pages】:737-748
【Authors】: Olga Poppe ; Allison Rozet ; Chuan Lei ; Elke A. Rundensteiner ; David Maier
【Abstract】: Streaming systems evaluate massive workloads of event sequence aggregation queries. State-of-the-art approaches suffer from long delays caused by not sharing intermediate results of similar queries and by constructing event sequences prior to their aggregation. To overcome these limitations, our Shared Online Event Sequence Aggregation (Sharon) approach shares intermediate aggregates among multiple queries while avoiding the expensive construction of event sequences. Our Sharon optimizer faces two challenges. One, a sharing decision is not always beneficial. Two, a sharing decision may exclude other sharing opportunities. To guide our Sharon optimizer, we compactly encode sharing candidates, their benefits, and conflicts among candidates into the Sharon graph. Based on the graph, we map our problem of finding an optimal sharing plan to the Maximum Weight Independent Set (MWIS) problem. We then use the guaranteed weight of a greedy algorithm for the MWIS problem to prune the search of our sharing plan finder without sacrificing its optimality. The Sharon optimizer is shown to produce sharing plans that achieve up to an 18-fold speed-up compared to state-of-the-art approaches.
【Keywords】: Complex event processing; Query optimization
【Paper Link】 【Pages】:749-760
【Authors】: Lisi Chen ; Shuo Shang ; Zhiwei Zhang ; Xin Cao ; Christian S. Jensen ; Panos Kalnis
【Abstract】: Massive amount of data that contain spatial, textual, and temporal information are being generated at a high scale. These spatio-temporal documents cover a wide range of topics in local area. Users are interested in receiving local popular terms from spatio-temporal documents published with a specified region. We consider the Top-k Spatial-Temporal Term (ST2) Subscription. Given an ST2 subscription, we continuously maintain up-to-date top-k most popular terms over a stream of spatio-temporal documents. The ST2 subscription takes into account both frequency and recency of a term generated from spatio-temporal document streams in evaluating its popularity. We propose an efficient solution to process a large number of ST2 subscriptions over a stream of spatio-temporal documents. The performance of processing ST2 subscriptions is studied in extensive experiments based on two real spatio-temporal datasets.
【Keywords】: publish; subscribe; spatial; temporal; stream
【Paper Link】 【Pages】:761-772
【Authors】: Bin Yao ; Zhongpu Chen ; Xiaofeng Gao ; Shuo Shang ; Shuai Ma ; Minyi Guo
【Abstract】: Aggregate nearest neighbor (ANN) query has been studied in both the Euclidean space and road networks. The flexible aggregate nearest neighbor (FANN) problem further generalizes ANN by introducing an extra flexibility. Given a set of data points P, a set of query points Q, and a user-defined flexibility parameter φ that ranges in (0, 1], an FANN query returns the best candidate from P, which minimizes the aggregate (usually max or sum) distance to any φ |Q| objects in Q. In this paper, we focus on the problem in road networks (denoted as FANN R ), and present a series of universal (i.e., suitable for both max and sum) algorithms to answer FANN R queries in road networks, including a Dijkstra-based algorithm enumerating P, a queue-based approach that processes data points from-near-to-far, and a framework that combines Incremental Euclidean Restriction (IER) and kNN. We also propose a specific exact solution to max-FANN R and a specific approximate solution to sum-FANN R which can return a near-optimal result with a guaranteed constant-factor approximation. These specific algorithms are easy to implement and can achieve excellent performance in some scenarios. Besides, we further extend the FANN R to k-FANN R , and successfully adapt most of the proposed algorithms to answer k-FANN R queries. We conduct a comprehensive experimental evaluation for the proposed algorithms on real road networks to demonstrate their superior efficiency and high quality.
【Keywords】: road networks; ANN; kNN; FANN
【Paper Link】 【Pages】:773-784
【Authors】: Yurong Cheng ; Lei Chen ; Ye Yuan ; Guoren Wang
【Abstract】: Real-life graph datasets extracted from Web are inevitably full of incompleteness, conflicts, and redundancies, so graph data cleaning shows its necessity. One of the main issues is to automatically repair the graph with some repairing rules. Although rules like data dependencies have been widely studied in relational data repairing, very few works exist to repair the graph data. In this paper, we introduce an automatic repairing semantic for graphs, called Graph-Repairing Rules (GRRs). This semantic can capture the incompleteness, conflicts, and redundancies in the graphs and indicate how to correct these errors. We study three fundamental problems associated with GRRs, implication, consistency and termination, which show whether a given set of GRRs make sense. Repairing the graph data using GRRs involves a problem of finding isomorphic subgraphs of the graph data for each GRR, which is NP-complete. To efficiently circumvent the complex calculation of subgraph isomorphism, we design a decomposition-and-join strategy to solve this problem. Extensive experiments on real datasets show that our GRR semantic and corresponding repairing algorithms can effectively and efficiently repair real-life graph data.
【Keywords】: Graph Repair; Rule
【Paper Link】 【Pages】:785-796
【Authors】: Wentao Li ; Miao Qiao ; Lu Qin ; Ying Zhang ; Lijun Chang ; Xuemin Lin
【Abstract】: This paper studies the efficiency issue on computing the exact eccentricity-distribution of a small-world network. Eccentricity-distribution reflects the importance of each node in a graph, which is beneficial for graph analysis. Moreover, it is key to computing two fundamental graph characters: diameter and radius. Existing eccentricity computation algorithms, however, are either inefficient in handling large-scale networks emerging nowadays in practice or approximate algorithms that are inappropriate to small-world networks. We propose an efficient approach for exact eccentricity computation. Our approach is based on a plethora of insights on the bottleneck of the existing algorithms - one-node eccentricity computation and the upper/lower bounds update. Extensive experiments demonstrate that our approach outperforms the state-of-the-art up to three orders of magnitude on real large small-world networks.
【Keywords】: Eccentricity; Small World Network
【Paper Link】 【Pages】:797-808
【Authors】: Rong-Hua Li ; Jiao Su ; Lu Qin ; Jeffrey Xu Yu ; Qiangqiang Dai
【Abstract】: Community search is a fundamental graph mining task. Unfortunately, most previous community search studies focus mainly on identifying communities in a network without temporal information. In this paper, we study the problem of finding persistent communities in a temporal network, in which every edge is associated with a timestamp. Our goal is to identify the communities that are persistent over time. To this end, we propose a novel persistent community model called (θ,τ) community. We prove that the problem of identifying the maximum (θ,τ) persistent k-core is NP-hard. To solve this problem, we propose a novel branch and bound algorithm with several carefully-designed pruning rules to find the maximum (θ,τ)-persistent. We conduct k-cores efficiently. We conduct extensive experiments in several real-world temporal networks. The results demonstrate the efficiency, scalability, and effectiveness of the proposed solutions.
【Keywords】: k-core; Temporal network; Community Search; Persistent Community
【Paper Link】 【Pages】:809-820
【Authors】: Matteo Lissandrini ; Davide Mottin ; Themis Palpanas ; Yannis Velegrakis
【Abstract】: In rich information spaces, it is often hard for users to formally specify the characteristics of the desired answers, either due to the complexity of the schema or of the query language, or even because they do not know exactly what they are looking for. Exemplar queries constitute a query paradigm that overcomes those problems, by allowing users to provide examples of the elements of interest in place of the query specification. In this paper, we propose a general approach where the user-provided example can comprise several partial specification fragments, where each fragment describes only one part of the desired result. We provide a formal definition of the problem, which generalizes existing formulations for both the relational and the graph model. We then describe exact algorithms for its solution for the case of information graphs, as well as top-k algorithms. Experiments on large real datasets demonstrate the effectiveness and efficiency of the proposed approach.
【Keywords】: Exemplar queries; Knowledge Graph; Graph Search
【Paper Link】 【Pages】:821-832
【Authors】: Ning Wang ; Xiaokui Xiao ; Yin Yang ; Ta Duy Hoang ; Hyejin Shin ; Junbum Shin ; Ge Yu
【Abstract】: A mobile operating system often needs to collect frequent new terms from users in order to build and maintain a comprehensive dictionary. Collecting keyboard usage data, however, raises privacy concerns. Local differential privacy (LDP) has been established as a strong privacy standard for collecting sensitive information from users. Currently, the best known solution for LDP-compliant frequent term discovery transforms the problem into collecting n-grams under LDP, and subsequently reconstructs terms from the collected n-grams by modelling the latter into a graph, and identifying cliques on this graph. Because the transformed problem (i.e., collecting n-grams) is very different from the original one (discovering frequent terms), the end result has poor utility. Further, this method is also rather expensive due to clique computation on a large graph. In this paper we tackle the problem head on: our proposal, PrivTrie, directly collects frequent terms from users by iteratively constructing a trie under LDP. While the methodology of building a trie is an obvious choice, obtaining an accurate trie under LDP is highly challenging. PrivTrie achieves this with a novel adaptive approach that conserves privacy budget by building internal nodes of the trie with the lowest level of accuracy necessary. Experiments using real datasets confirm that PrivTrie achieves high accuracy on common privacy levels, and consistently outperforms all previous methods.
【Keywords】: local differential privacy; frequent term; trie
【Paper Link】 【Pages】:833-844
【Authors】: Hien To ; Cyrus Shahabi ; Li Xiong
【Abstract】: With spatial crowdsourcing (SC), requesters outsource their spatiotemporal tasks (tasks associated with location and time) to a set of workers, who will perform the tasks by physically traveling to the tasks' locations. However, current solutions require the locations of the workers and/or the tasks to be disclosed to untrusted parties (SC server) for effective assignments of tasks to workers. In this paper we propose a framework for assigning tasks to workers in an online manner without compromising the location privacy of workers and tasks. We perturb the locations of both tasks and workers based on geo-indistinguishability and then devise techniques to quantify the probability of reachability between a task and a worker, given their perturbed locations. We investigate both analytical and empirical models for quantifying the worker-task pair reachability and propose task assignment strategies that strike a balance among various metrics such as the number of completed tasks, worker travel distance and system overhead. Extensive experiments on real-world datasets show that our proposed techniques result in minimal disclosure of task locations and no disclosure of worker locations without significantly sacrificing the total number of assigned tasks.
【Keywords】: location privacy; spatial crowdsourcing; geo indistinguishability
【Paper Link】 【Pages】:845-856
【Authors】: Graham Cormode ; Tejas Kulkarni ; Divesh Srivastava
【Abstract】: Concern about how to aggregate sensitive user data without compromising individual privacy is a major barrier to greater availability of data. Differential privacy has emerged as an accepted model to release sensitive information while giving a statistical guarantee for privacy. Many different algorithms are possible to address different target functions. We focus on the core problem of count queries, and seek to design mechanisms to release data associated with a group of n individuals. Prior work has focused on designing mechanisms by raw optimization of a loss function, without regard to the consequences on the results. This can leads to mechanisms with undesirable properties, such as never reporting some outputs (gaps), and overreporting others (spikes). We tame these pathological behaviors by introducing a set of desirable properties that mechanisms can obey. Any combination of these can be satisfied by solving a linear program (LP) which minimizes a cost function, with constraints enforcing the properties. We focus on a particular cost function, and provide explicit constructions that are optimal for certain combinations of properties, and show a closed form for their cost. In the end, there are only a handful of distinct optimal mechanisms to choose between: one is the well-known (truncated) geometric mechanism; the second a novel mechanism that we introduce here, and the remainder are found as the solution to particular LPs. These all avoid the bad behaviors we identify. We demonstrate in a set of experiments on real and synthetic data which is preferable in practice, for different combinations of data distributions, constraints, and privacy parameters.
【Keywords】: differential privacy; mechanism design; privacy
【Paper Link】 【Pages】:857-868
【Authors】: Cetin Sahin ; Tristan Allard ; Reza Akbarinia ; Amr El Abbadi ; Esther Pacitti
【Abstract】: Performing non-aggregate range queries on cloud stored data, while achieving both privacy and efficiency is a challenging problem. This paper proposes constructing a differentially private index to an outsourced encrypted dataset. Efficiency is enabled by using a cleartext index structure to perform range queries. Security relies on both differential privacy (of the index) and semantic security (of the encrypted dataset). Our solution, PINED-RQ develops algorithms for building and updating the differentially private index. Compared to state-of-the-art secure index based range query processing approaches, PINED-RQ executes queries in the order of at least one magnitude faster. The security of PINED-RQ is proved and its efficiency is assessed by an extensive experimental validation.
【Keywords】: Differential Privacy; Range Query; Privacy preserving; Querying encrypted data; Outsourcing to Cloud; Differentially Private Index
【Paper Link】 【Pages】:869-880
【Authors】: Weiguo Zheng ; Qichen Wang ; Jeffrey Xu Yu ; Hong Cheng ; Lei Zou
【Abstract】: Most existing algorithms computing the maximum independent set (MIS) or independent set (IS) are designed for handling static graphs, which may not be practicable as many networks are dynamically evolving over time. In this paper, we study the MIS/IS problem in evolving graphs by considering graph update operations: vertex/edge addition and vertex/edge deletion. Instead of computing the MIS/IS of the updated graph from scratch, we propose a baseline algorithm that finds the MIS/IS at time t i+1 based on the MIS/IS at time ti. Due to the hardness of computing an exact MIS, we develop an efficient constant-time algorithm LSTwo to return a high-quality (large-size) independent set. Then we design a lazy search algorithm which produces higher-quality independent sets. To improve the time efficiency further, we devise the conditional besieging and k-petal based methods to reduce the search space. Extensive experimental studies over large-scale graphs confirm the effectiveness and efficiency of our proposed algorithms.
【Keywords】: independent set; evolving graph; maximum independent set; greedy algorithm
【Paper Link】 【Pages】:881-892
【Authors】: Chuanwen Li ; Yu Gu ; Jianzhong Qi ; Jiayuan He ; Qingxu Deng ; Ge Yu
【Abstract】: The k nearest neighbor (kNN) query in road networks is a traditional query type in spatial databases. This query has found new applications in the fast-growing location-based services, e.g., finding the k nearest Uber cars of a user for ridesharing. KNN queries in these applications are non-trivial to process due to the frequent location updates of data objects (e.g., movements of the cars). This calls for novel spatial indexes with high efficiency in not only query processing but also update handling. To address this need, we propose an index structure that uses a "lazy update" strategy to reduce the costs of update handling without sacrificing query efficiency or answer accuracy. We cache the location updates of data objects and only update the corresponding entries in the index when they are queried. We further propose a kNN query algorithm based on this index. This algorithm takes advantage of the strengths of both the CPU and the GPU. It first identifies the queried region and updates the index over this region using the GPU. Then, it uses the GPU to query the index and produce a candidate result set, which is later refined by the CPU to obtain the final query answer. We conduct experiments on real data and compare the proposed algorithm with state-of-the-art kNN algorithms. The experimental results show that the proposed algorithm outperforms the baseline algorithms by orders of magnitude in query time.
【Keywords】: spatial data; data management; gpu; k-nearest neighbor
【Paper Link】 【Pages】:893-904
【Authors】: Jingbo Zhou ; Qi Guo ; H. V. Jagadish ; Lubos Krcál ; Siyuan Liu ; Wenhao Luan ; Anthony K. H. Tung ; Yueji Yang ; Yuxin Zheng
【Abstract】: We propose a novel generic inverted index framework on the GPU (called GENIE), aiming to reduce the programming complexity of the GPU for parallel similarity search of different data types. Not every data type and similarity measure are supported by GENIE, but many popular ones are. We present the system design of GENIE, and demonstrate similarity search with GENIE on several data types along with a theoretical analysis of search results. A new concept of locality sensitive hashing (LSH) named tau-ANN search, and a novel data structure c-PQ on the GPU are also proposed for achieving this purpose. Extensive experiments on different real-life datasets demonstrate the efficiency and effectiveness of our framework. The implemented system has been released as open source: https://github.com/SeSaMe-NUS/genie.
【Keywords】: similarity search; GPU; locality sensitive hashing; inverted index
【Paper Link】 【Pages】:905-916
【Authors】: Qi Guo ; Hosagrahar Visvesvaraya Jagadish ; Anthony Kum Hoe Tung ; Yuxin Zheng
【Abstract】: Given a d-dimensional point query q, finding data items similar to q is a crucial task in many information retrieval and data mining applications. The typical approach is to find K items in a data set most similar to q, known as K nearest neighbors. Often, it is valuable to avoid too many answers that are too similar, and the importance of diversity has been considered in recent research. There are many different ways to characterize diversity, most of which depend on a notion of distance between points. In this paper, we propose a novel view of diversity based on spatial angles. This approach captures relevant and diverse results surrounding q from distinct directions even in high dimensional space. We present several algorithms to compute the diverse neighbor set, and show that it has several desirable properties. Extensive experiments demonstrate the effectiveness and efficiency of our methods on both real and synthetic data sets.
【Keywords】: diverse nearest neighbor search; angular diversity
【Paper Link】 【Pages】:917-928
【Authors】: Kai Han ; Yuntian He ; Xiaokui Xiao ; Shaojie Tang ; Fei Gui ; Chaoting Xu ; Jun Luo
【Abstract】: Recently, the proliferation of event-based social services has made it possible for organizing personalized offline events through the users' information shared online. In this paper, we study the budget-constrained influential social event organization problem, where the goal is to select a group of influential users with required features to organize a social event under a budget B. We show that our problem is NP-hard and can be formulated as a submodular maximization problem with mixed packing and covering constraints. We then propose several polynomial time algorithms for our problem with provable approximation ratios, which adopt a novel "surrogate optimization approach and the method of reverse-reachable set sampling. Compared with some related work that can only handle special cases of our problem but with exponential time complexity, our algorithms are much more efficient, and their superiorities on both the running time and the influence spread are demonstrated through extensive experiments using real social networks."
【Keywords】: social network; event; influence
【Paper Link】 【Pages】:929-940
【Authors】: Hongzhi Yin ; Lei Zou ; Quoc Viet Hung Nguyen ; Zi Huang ; Xiaofang Zhou
【Abstract】: With the prevalent trend of combining online and offline interactions among users in event-based social networks (EBSNs), event recommendation has become an essential means to help people discover new interesting events to attend. However, existing literatures on event recommendations ignore the social attribute of events: people prefer to attend events with their friends or family rather than alone. Therefore, we propose a new recommendation paradigm: joint event-partner recommendation that focuses on recommending event-partner pairs to users. In this paper, we focus on the new problem of joint event-partner recommendation in EBSNs, which is extremely challenging due to the intrinsic cold-start property of events, the complex decision-making process for choosing event-partner pairs and the huge prediction space of event-partner combinations. We propose a generic graph-based embedding model (GEM) to collectively embed all the observed relations among users, events, locations, time and text content in a shared low-dimension space, which is able to leverage the correlation between events and their associated content and contextual information to address the cold-start issue effectively. To accelerate the convergence of GEM and improve its modeling accuracy, an adaptive noise sampler is developed to generate adversarial negative samples in the model optimization. Besides, to speed up the online recommendation, we propose a novel space transformation method to project each event-partner pair to one point in a new space and then develop effective space pruning and efficient online recommendation techniques. We conduct comprehensive experiments on our created real benchmark datasets, and the experimental results demonstrate the superiority of our proposals in terms of recommendation effectiveness, efficiency and scalability
【Keywords】: Event Partner Recommendation; Information Network Embedding; Adaptive Negative Sampling; Cold Start; Event Recommendation; Partner Recommendation; Efficient Recommendation
【Paper Link】 【Pages】:941-952
【Authors】: Shanshan Feng ; Gao Cong ; Arijit Khan ; Xiucheng Li ; Yong Liu ; Yeow Meng Chee
【Abstract】: As a fundamental problem in social influence propagation analysis, learning influence parameters has been extensively investigated. Most of the existing methods are proposed to estimate the propagation probability for each edge in social networks. However, they cannot effectively learn propagation parameters of all edges due to data sparsity, especially for the edges without sufficient observed propagation. Different from the conventional methods, we introduce a novel social influence embedding problem, which is to learn parameters for nodes rather than edges. Nodes are represented as vectors in a low-dimensional space, and thus social influence information can be reflected by these vectors. We develop a new model Inf2vec, which combines both the local influence neighborhood and global user similarity to learn the representations. We conduct extensive experiments on two real-world datasets, and the results indicate that Inf2vec significantly outperforms state-of-the-art baseline algorithms.
【Keywords】: Social influence analysis; latent representation; Embedding model
【Paper Link】 【Pages】:953-964
【Authors】: Shuai Ma ; Chen Gong ; Renjun Hu ; Dongsheng Luo ; Chunming Hu ; Jinpeng Huai
【Abstract】: Ranking query independent scholarly articles is a practical and difficult task, due to the heterogeneous, evolving and dynamic nature of entities involved in scholarly articles. To do this, we first propose a scholarly article ranking model by assembling the importance of involved entities (i.e., articles, venues and authors) such that the importance is a combination of prestige and popularity to capture the evolving nature of entities. To compute the prestige of articles and venues, we propose a novel Time-Weighted PageRank that extends traditional PageRank with a time decaying factor. We then develop a batch algorithm for scholarly article ranking, in which we propose a block-wise method for Time-Weighted PageRank in terms of an analysis of the citation characteristics of scholarly articles. We further develop an incremental algorithm for dynamic scholarly article ranking, which partitions graphs into affected and unaffected areas, and employs different updating strategies for nodes in different areas. Using real-life data, we finally conduct an extensive experimental study, and show that our approach is both effective and efficient for ranking scholarly articles.
【Keywords】: scholarly article ranking; query independent; time weighted PageRank; block-wise algorithm; dynamic algorithm
【Paper Link】 【Pages】:965-976
【Authors】: Jingwen Zhao ; Yunjun Gao ; Gang Chen ; Rui Chen
【Abstract】: A top-k geo-social keyword (TkGSK) query retrieves the objects by considering spatial, social, and textual constraints. After a user issues an initial TkGSK query and gets back its result, however, the user may find that some expected objects are missing and hence may wonder why. In this paper, we explore the why-not top-k geo-social keyword (WNGSK) query in road networks, due to its importance in decision making. We propose techniques which not only adapt user's query keywords but also recommend new social links, so that expected, but missing objects, are included in the result. We first present a nontrivial basic algorithm having some optimizations. To improve efficiency, we develop a new index so-called PIM-tree, and based on it, we present WNGSK query algorithm with several pruning strategies. Extensive experimental results offer insight into the effectiveness and efficiency of our proposed index and algorithms.
【Keywords】: Why not Question; Top k-Geo Social Keyword Query; Road Network; Database Usability
【Paper Link】 【Pages】:977-988
【Authors】: Haoyuan Xing ; Sofoklis Floratos ; Spyros Blanas ; Suren Byna ; Prabhat ; Kesheng Wu ; Paul Brown
【Abstract】: Scientists are increasingly turning to datacenter-scale computers to analyze massive arrays. Despite decades of database research that extols the virtues of declarative query processing, scientists still write, debug and parallelize imperative HPC programs even for the most mundane queries. This impedance mismatch is due to the cumbersome and costly data format conversions that are needed to use scientific data management tools, such as SciDB, in an HPC setting. Our goal is to make declarative array manipulations from SciDB interoperable with imperative, file-centric analyses from HDF5-based programs. This paper describes ArrayBridge, a bi-directional array view mechanism for the HDF5 file format, that allows scientists to use SciDB, TensorFlow and HDF5-based analysis code in the same file-centric pipeline without converting between file formats. In addition to fast querying over HDF5 array objects, ArrayBridge produces arrays in the HDF5 file format as easily as it can read from it. ArrayBridge also supports time travel queries from imperative codes through the unmodified HDF5 API, and automatically deduplicates between versions for space efficiency. Our performance evaluation in a large scientific computing facility shows that ArrayBridge exhibits statistically indistinguishable performance and I/O scalability to the native SciDB storage engine and is 3× faster than TileDB.
【Keywords】: array processing; in situ processing; SciDB; HDF5
【Paper Link】 【Pages】:989-1000
【Authors】: Raul Castro Fernandez ; Essam Mansour ; Abdulhakim Ali Qahtan ; Ahmed K. Elmagarmid ; Ihab F. Ilyas ; Samuel Madden ; Mourad Ouzzani ; Michael Stonebraker ; Nan Tang
【Abstract】: Employees that spend more time finding relevant data than analyzing it suffer from a data discovery problem. The large volume of data in enterprises, and sometimes the lack of knowledge of the schemas aggravates this problem. Similar to how we navigate the Web, we propose to identify semantic links that assist analysts in their discovery tasks. These links relate tables to each other, to facilitate navigating the schemas. They also relate data to external data sources, such as ontologies and dictionaries, to help explain the schema meaning. We materialize the links in an enterprise knowledge graph, where they become available to analysts. The main challenge is how to find pairs of objects that are semantically related. We propose SEMPROP, a DAG of different components that find links based on syntactic and semantic similarities. SEMPROP is commanded by a semantic matcher which leverages word embeddings to find objects that are semantically related. We introduce coherent group, a technique to combine word embeddings that works better than other state of the art combination alternatives. We implement SEMPROP as part of Aurum, a data discovery system we are building, and conduct user studies, real deployments and a quantitative evaluation to understand the benefits of links for data discovery tasks, as well as the benefits of SEMPROP and coherent groups to find those links.
【Keywords】: data discovery; word embeddings
【Paper Link】 【Pages】:1001-1012
【Authors】: Raul Castro Fernandez ; Ziawasch Abedjan ; Famien Koko ; Gina Yuan ; Samuel Madden ; Michael Stonebraker
【Abstract】: Organizations face a data discovery problem when their analysts spend more time looking for relevant data than analyzing it. This problem has become commonplace in modern organizations as: i) data is stored across multiple storage systems, from databases to data lakes, to the cloud; ii) data scientists do not operate within the limits of well-defined schemas or a small number of data sources-instead, to answer complex questions they must access data spread across thousands of data sources. To address this problem, we capture relationships between datasets in an enterprise knowledge graph (EKG), which helps users to navigate among disparate sources. The contribution of this paper is AURUM, a system to build, maintain and query the EKG. To build the EKG, we introduce a Two-step process which scales to large datasets and requires only one-pass over the data, avoiding overloading the source systems. To maintain the EKG without re-reading all data every time, we introduce a resource-efficient sampling signature (RESS) method which works by only using a small sample of the data. Finally, to query the EKG, we introduce a collection of composable primitives, thus allowing users to define many different types of discovery queries. We describe our experience using AURUM in three corporate scenarios and do a performance evaluation of each component.
【Keywords】: data discovery; enterprise knowledge graph
【Paper Link】 【Pages】:1013-1024
【Authors】: Fisnik Kastrati ; Guido Moerkotte
【Abstract】: We present an algorithm that produces optimal plans to evaluate arbitrary Boolean expressions possibly containing conjunctions and disjunctions. The complexity of our algorithm is O ( n 3 n ), where n is the number of simple predicates in the Boolean expression. This complexity is far lower than that of Reinwald and Soland's algorithm ( O (2 2(n) )). This lower complexity allows us to optimize Boolean expressions with up to 16 predicates in a reasonable time. Further, opposed to many existing approaches, our algorithm fulfills all requirements necessary in the context of main memory database systems. We then use this algorithm to (1) determine the optimization potential inherent in Boolean expressions and (2) evaluate the plan quality of two heuristics proposed in the literature.
【Keywords】: Query optimization; Algorithms; Boolean Expressions
【Paper Link】 【Pages】:1025-1036
【Authors】: Yoon-Min Nam ; Min-Soo Kim ; Donghyoung Han
【Abstract】: As the amount of data to process increases, a scalable and efficient horizontal database partitioning method becomes more important for OLAP query processing in parallel database platforms. Existing partitioning methods have a few major drawbacks such as a large amount of data redundancy and not supporting join processing without shuffle in many cases despite their large data redundancy. We elucidate the drawbacks arise from their tree-based partitioning schemes and propose a novel graph-based database partitioning method called GPT that improves query performance with lower data redundancy. Through extensive experiments using three benchmarks, we show that GPT significantly outperforms the state-of-the-art method in terms of both storage overhead and query performance.
【Keywords】: OLAP; graph based; multi column; database partitioning; hub table
【Paper Link】 【Pages】:1037-1048
【Authors】: Bowen Zhang ; Yanyan Shen ; Yanmin Zhu ; Jiadi Yu
【Abstract】: The increasing amount of trajectory data facilitates a wide spectrum of practical applications. In many such applications, large numbers of trajectory range and similarity queries are issued continuously, which calls for high-throughput trajectory query processing. Traditional in-memory databases lack considerations of the unique features of trajectories, thus suffering from inferior performance. Existing trajectory query processing systems are typically designed for only one type of trajectory queries, i.e., either range or similarity query, but not for both. Inspired by the massive parallelism on GPUs, in this paper, we develop a GPU-accelerated framework, named GAT, to support both types of trajectory queries (i.e., both range and similarity queries) with high throughput. For similarity queries, we adopt the Edit Distance on Real sequence (EDR) as the similarity measure which is accurate and robust to noise in real-world trajectories. GAT employs a GPU-friendly index called GTIDX to effectively filter invalid trajectories for both range and similarity queries, and exploits the GPU to perform parallel verifications. To accelerate the verification process on the GPU, we apply the Morton-based encoding method to reorganize trajectory points and facilitate coalesced data accesses for individual point data in global memory, which reduces the global memory bandwidth requirement significantly. We also propose a technique of grouping size-varying cells into balanced blocks with similar numbers of trajectory points, to achieve load balancing among the Streaming Multiprocessors (SMs) of the GPU. We conduct extensive experiments to evaluate the performance of GAT using two real-life trajectory datasets. The results show that GAT is scalable and achieves high throughput with acceptable indexing cost.
【Keywords】: trajectory; query
【Paper Link】 【Pages】:1049-1060
【Authors】: Ingo Müller ; Andrea Arteaga ; Torsten Hoefler ; Gustavo Alonso
【Abstract】: Industry-grade database systems are expected to produce the same result if the same query is repeatedly run on the same input. However, the numerous sources of non-determinism in modern systems make reproducible results difficult to achieve. This is particularly true if floating-point numbers are involved, where the order of the operations affects the final result. As part of a larger effort to extend database engines with data representations more suitable for machine learning and scientific applications, in this paper we explore the problem of making relational GroupBy over floating-point formats bit-reproducible, i.e., ensuring any execution of the operator produces the same result up to every single bit. To that aim, we first propose a numeric data type that can be used as drop-in replacement for other number formats and is-unlike standard floating-point formats-associative. We use this data type to make state-of-the-art GroupBy operators reproducible, but this approach incurs a slowdown between 4x and 12x compared to the same operator using conventional database number formats. We thus explore how to modify existing GroupBy algorithms to make them bit-reproducible and efficient. By using vectorized summation on batches and carefully balancing batch size, cache footprint, and preprocessing costs, we are able to reduce the slowdown due to reproducibility to a factor between 1.9x and 2.4x of aggregation in isolation and to a mere 2.7% of end-to-end query performance even on aggregation-intensive queries in MonetDB. We thereby provide a solid basis for supporting more reproducible operations directly in relational engines.
【Keywords】: aggregation; floating point; reproducibility; group by; performance; determinism
【Paper Link】 【Pages】:1061-1072
【Authors】: Lu Chen ; Qilu Zhong ; Xiaokui Xiao ; Yunjun Gao ; Pengfei Jin ; Christian S. Jensen
【Abstract】: Ridesharing refers to a transportation scenario where travellers with similar itineraries and time schedules share a vehicle for a trip and split the travel cost, which may include fuel, tolls, and parking fees. Ridesharing is popular among travellers because it can reduce their travel costs, and it also holds the potential to reduce travel time, congestion, air pollution, and overall fuel consumption. However, existing ridesharing systems often offer each traveller only one choice that aims to minimize system-wide vehicle travel distance or time. We propose a solution that offers more options. Specifically, we do this by considering both pick-up time and price, so that travellers are able to choose the vehicle that matches their preferences best. In order to identify quickly vehicles that satisfy incoming ridesharing requests, we propose two efficient matching algorithms that follow the single-side and dual-side search paradigms, respectively. To further accelerate the matching, indexes on the road network and vehicles are developed, based on which several pruning heuristics are designed. Extensive experiments on a large Shanghai taxi dataset offer insights into the performance of our proposed techniques and compare with a baseline that extends the state-of-the art method.
【Keywords】: Ridesharing; Time aware; Price aware; Query Processing; Spatial Database
【Paper Link】 【Pages】:1073-1084
【Authors】: Chenjuan Guo ; Bin Yang ; Jilin Hu ; Christian S. Jensen
【Abstract】: Motivated by the increasing availability of vehicle trajectory data, we propose learn-to-route, a comprehensive trajectory-based routing solution. Specifically, we first construct a graph-like structure from trajectories as the routing infrastructure. Second, we enable trajectory-based routing given an arbitrary (source, destination) pair. In the first step, given a road network and a collection of trajectories, we propose a trajectory-based clustering method that identifies regions in a road network. If a pair of regions are connected by trajectories, we maintain the paths used by these trajectories and learn a routing preference for travel between the regions. As trajectories are skewed and sparse, %and although the introduction of regions serves to consolidate the sparse data, many region pairs are not connected by trajectories. We thus transfer routing preferences from region pairs with sufficient trajectories to such region pairs and then use the transferred preferences to identify paths between the regions. In the second step, we exploit the above graph-like structure to achieve a comprehensive trajectory-based routing solution. Empirical studies with two substantial trajectory data sets offer insight into the proposed solution, indicating that it is practical. A comparison with a leading routing service offers evidence that the paper's proposal is able to enhance routing quality.
【Keywords】: Trajectories; Routing; Routing preferences
【Paper Link】 【Pages】:1085-1096
【Authors】: Wei Chen ; Hongzhi Yin ; Weiqing Wang ; Lei Zhao ; Xiaofang Zhou
【Abstract】: Sources of complementary information are connected when we link the user accounts belonging to the same user across different domains or devices. The expanded information promotes the development of a wide range of applications, such as cross-domain prediction, cross-domain recommendation, and advertisement. Due to the great significance of user account linkage, there are increasing research works on this study. With the widespread popularization of GPS-enabled mobile devices, linking user accounts with location data has become an important and promising research topic. Being different from most existing studies in this domain that only focus on the effectiveness, we propose novel approaches to improve both effectiveness and efficiency of user account linkage. In this paper, a kernel density estimation (KDE) based method has been proposed to improve the accuracy by alleviating the data sparsity problem in measuring users' similarities. To improve the efficiency, we develop a grid-based structure to organize location data to prune the search space. The extensive experiments conducted on two real-world datasets demonstrate the superiority of the proposed approach in terms of both effectiveness and efficiency compared with the state-of-art methods.
【Keywords】: cross domain; user linkage; social network
【Paper Link】 【Pages】:1097-1108
【Authors】: Satoshi Koide ; Yukihiro Tadokoro ; Chuan Xiao ; Yoshiharu Ishikawa
【Abstract】: In this paper, we present a compressed data structure for moving object trajectories in a road network, which are represented as sequences of road edges. Unlike existing compression methods for trajectories in a network, our method supports pattern matching and decompression from an arbitrary position while retaining high compressibility with theoretical guarantees. Specifically, our method is based on FM-index, a fast and compact data structure for pattern matching. To further enhance the compression performance, we incorporate the sparsity of road networks. In particular, we present the novel concepts of relative movement labeling and PseudoRank , each contributing to significant reduction in data size and query processing time. Our theoretical analysis and experimental studies reveal the advantages of our proposed method as compared to existing trajectory compression methods and FM-index variants.
【Keywords】: trajectory compression; road network; Burrows Wheeler transform
【Paper Link】 【Pages】:1109-1119
【Authors】: Linnan Jiang ; Lei Chen ; Zhao Chen
【Abstract】: Recently, knowledge base systems such as Freebase, YAGO, etc. have been designed and widely applied while most of the knowledge bases are far from being of a high quality. According to the recent researches, the low quality is mainly caused by the loss and low accuracy of the RDF triples, which are the main components of knowledge base systems. In this paper, we propose approaches to enhance the RDF triples in knowledge bases, which is significant for providing good information retrieval service. Specifically, we utilize data facts stored in database systems to obtain possible updates for knowledge bases. Furthermore, inspired by the popular and successful applications of crowdsourcing platforms, we explore the use of crowdsourcing to verify the updates. We propose KD graph to model the possible updates and design a comprehensive framework for knowledge base enhancement problem. Since crowdsourcing employs human power and requires expenditure, we propose an optimal and dynamic method to select candidates for crowdsourcing within a limited budget so that the benefit of enhancing the knowledge base can be maximized. To reduce the time cost, we adopt split techniques and design Simple Split(SS) and Dynamic Split(DS) algorithms. We verify the effectiveness of our solutions by conducting crowdsourcing simulation experiments and experiments on a crowdsourcing platform namely gMission.
【Keywords】: Knowledge Base Enhancement; Crowdsourcing
【Paper Link】 【Pages】:1120-1131
【Authors】: Sejoon Oh ; Namyong Park ; Lee Sael ; U. Kang
【Abstract】: Given sparse multi-dimensional data (e.g., (user, movie, time; rating) for movie recommendations), how can we discover latent concepts/relations and predict missing values? Tucker factorization has been widely used to solve such problems with multi-dimensional data, which are modeled as tensors. However, most Tucker factorization algorithms regard and estimate missing entries as zeros, which triggers a highly inaccurate decomposition. Moreover, few methods focusing on an accuracy exhibit limited scalability since they require huge memory and heavy computational costs while updating factor matrices. In this paper, we propose P-Tucker, a scalable Tucker factorization method for sparse tensors. P-Tucker performs alternating least squares with a row-wise update rule in a fully parallel way, which significantly reduces memory requirements for updating factor matrices. Furthermore, we offer two variants of P-Tucker: a caching algorithm P-Tucker-Cache and an approximation algorithm P-Tucker-Approx, both of which accelerate the update process. Experimental results show that P-Tucker exhibits 1.7-14.1x speed-up and 1.4-4.8x less error compared to the state-of-the-art. In addition, P-Tucker scales near linearly with the number of observable entries in a tensor and number of threads. Thanks to P-Tucker, we successfully discover hidden concepts and relations in a large-scale real-world tensor, while existing methods cannot reveal latent features due to their limited scalability or low accuracy.
【Keywords】: Tensor Analysis; Tucker Factorization; Parallel Computing; Data Mining
【Paper Link】 【Pages】:1132-1143
【Authors】: Minji Yoon ; Jinhong Jung ; U. Kang
【Abstract】: Given a large graph, how can we determine similarity between nodes in a fast and accurate way? Random walk with restart (RWR) is a popular measure for this purpose and has been exploited in numerous data mining applications including ranking, anomaly detection, link prediction, and community detection. However, previous methods for computing exact RWR require prohibitive storage sizes and computational costs, and alternative methods which avoid such costs by computing approximate RWR have limited accuracy. In this paper, we propose TPA, a fast, scalable, and highly accurate method for computing approximate RWR on large graphs. TPA exploits two important properties in RWR: 1) nodes close to a seed node are likely to be revisited in following steps due to block-wise structure of many real-world graphs, and 2) RWR scores of nodes which reside far from the seed node are proportional to their PageRank scores. Based on these two properties, TPA divides approximate RWR problem into two subproblems called neighbor approximation and stranger approximation. In the neighbor approximation, TPA estimates RWR scores of nodes close to the seed based on scores of few early steps from the seed. In the stranger approximation, TPA estimates RWR scores for nodes far from the seed using their PageRank. The stranger and neighbor approximations are conducted in the preprocessing phase and the online phase, respectively. Through extensive experiments, we show that TPA requires up to 3.5× less time with up to 40× less memory space than other state-of-the-art methods for the preprocessing phase. In the online phase, TPA computes approximate RWR up to 30× faster than existing methods while maintaining high accuracy.
【Keywords】: Random Walk with Restart
【Paper Link】 【Pages】:1144-1155
【Authors】: Xinsheng Li ; Kasim Selçuk Candan ; Maria Luisa Sapino
【Abstract】: Data-and model-driven computer simulations are increasingly critical in many application domains. These simulations may track 10s or 100s of parameters, affected by complex inter-dependent dynamic processes. Moreover, decision makers usually need to run large simulation ensembles, containing 1000s of simulations. In this paper, we rely on a tensor-based framework to represent and analyze patterns in large simulation ensemble data sets to obtain a high-level understanding of the dynamic processes implied by a given ensemble of simulations.We, further, note that the inherent sparsity of the simulation ensembles (relative to the space of potential simulations one can run) constitutes a significant problem in discovering these underlying patterns. To address this challenge, we propose a partition-stitch sampling scheme, which divides the parameter space into subspaces to collect several lower modal ensembles, and complement this with a novel Multi-Task Tensor Decomposition (M2TD), technique which helps effectively and efficiently stitch these subensembles back. Experiments showed that, for a given budget of simulations, the proposed structured sampling scheme leads to significantly better overall accuracy relative to traditional sampling approaches, even when the user does not have a perfect information to help guide the structured partitioning process.
【Keywords】: Tensor; Tensor Decomposition; Simulation
【Paper Link】 【Pages】:1156-1167
【Authors】: Zhaoqiang Chen ; Qun Chen ; Fengfeng Fan ; Yanyan Wang ; Zhuo Wang ; Youcef Nafa ; Zhanhuai Li ; Hailong Liu ; Wei Pan
【Abstract】: Even though many machine algorithms have been proposed for entity resolution, it remains very challenging to find a solution with quality guarantees. In this paper, we propose a novel HUman and Machine cOoperation (HUMO) framework for entity resolution (ER), which divides an ER workload between the machine and the human. HUMO enables a mechanism for quality control that can flexibly enforce both precision and recall levels. We introduce the optimization problem of HUMO, minimizing human cost given a quality requirement, and then present three optimization approaches: a conservative baseline one purely based on the monotonicity assumption of precision, a more aggressive one based on sampling and a hybrid one that can take advantage of the strengths of both previous approaches. Finally, we demonstrate by extensive experiments on real and synthetic datasets that HUMO can achieve high-quality results with reasonable return on investment (ROI) in terms of human cost, and it performs considerably better than the state-of-the-art alternatives in quality control.
【Keywords】: Entity Resolution; Human Machine Cooperation; Quality Guarantees
【Paper Link】 【Pages】:1168-1179
【Authors】: Stefano Ortona ; Venkata Vamsikrishna Meduri ; Paolo Papotti
【Abstract】: We present RUDIK, a system for the discovery of declarative rules over knowledge-bases (KBs). RUDIK discovers rules that express positive relationships between entities, such as "if two persons have the same parent, they are siblings", and negative rules, i.e., patterns that identify contradictions in the data, such as "if two persons are married, one cannot be the child of the other". While the former class infers new facts in the KB, the latter class is crucial for other tasks, such as detecting erroneous triples in data cleaning, or the creation of negative examples to bootstrap learning algorithms. The system is designed to: (i) enlarge the expressive power of the rule language to obtain complex rules and wide coverage of the facts in the KB, (ii) discover approximate rules (soft constraints) to be robust to errors and incompleteness in the KB, (iii) use disk-based algorithms, effectively enabling rule mining in commodity machines. In contrast with traditional ranking of all rules based on a measure of support, we propose an approach to identify the subset of useful rules to be exposed to the user. We model the mining process as an incremental graph exploration problem and prove that our search strategy has guarantees on the optimality of the results. We have conducted extensive experiments using real-world KBs to show that RUDIK outperforms previous proposals in terms of efficiency and that it discovers more effective rules for the application at hand.
【Keywords】: knowledge base; declarative rules; data quality; data cleaning; rule mining
【Paper Link】 【Pages】:1180-1191
【Authors】: Katerina Papaioannou ; Martin Theobald ; Michael H. Böhlen
【Abstract】: In temporal-probabilistic (TP) databases, the combination of the temporal and the probabilistic dimension adds significant overhead to the computation of set operations. Although set queries are guaranteed to yield linearly sized output relations, all of the existing solutions exhibit a quadratic runtime complexity. They suffer from redundant interval comparisons and additional joins for the formation of lineage expressions. In this paper, we formally define TP set operations and study their properties. For their efficient computation, we introduce the lineage-aware temporal window, a mechanism that binds intervals with lineage expressions. We suggest the lineage-aware window advancer (LAWA) for producing lineage-aware temporal windows, which enable direct filtering of irrelevant intervals and finalization of output lineage expressions. This way, we compute TP set operations in linearithmic time. A series of experiments over both synthetic and real-world datasets show that (a) our approach has predictable performance, which depends only on the size of the input relations and not on the number of time intervals per fact or the overlap of the time intervals, and that (b) it outperforms state-of-the-art approaches.
【Keywords】: temporal; probabilistic; lineage
【Paper Link】 【Pages】:1192-1203
【Authors】: You Peng ; Ying Zhang ; Wenjie Zhang ; Xuemin Lin ; Lu Qin
【Abstract】: As uncertainty is inherent in a wide spectrum of graph applications such as social network and brain network, it is highly demanded to re-visit classical graph problems in the context of uncertain graphs. Driven by real-applications, in this paper, we study the problem of k-core computation on uncertain graphs and propose a new model, namely (k,θ)-core, which consists of nodes with probability at least θ to be k-core member in the uncertain graph. We show the computation of (k,θ)-core is NP-hard, and hence resort to sampling based methods. Effective and efficient pruning techniques are proposed to significantly reduce the candidate size. To further reduce the cost of k-core computation on multiple sampled graphs, we design a k-core membership check algorithm following a novel expansion-based search paradigm. Extensive experiments on real-life graphs demonstrate the effectiveness and efficiency of our proposed techniques.
【Keywords】: Uncertain Graph; K-core computation
【Paper Link】 【Pages】:1204-1207
【Authors】: Maryam Fanaeepour ; Benjamin I. P. Rubinstein
【Abstract】: Hay et al. (2016) recently observed that existing histogram release mechanisms under differential privacy do not provide satisfactory privacy protection. Existing work either tunes on sensitive data to optimise parameters without consideration of privacy; or selection is performed arbitrarily and independent of data, degrading utility. We address this open problem by deriving a principled tuning mechanism E2EPRIV that privately optimises data-dependent error bounds. Theoretical analysis establishes privacy and utility, while extensive experimentation demonstrates that E2EPRIV can practically achieve true end-to-end privacy.
【Keywords】: differential privacy; end-to-end privacy; location privacy; spatial histograms
【Paper Link】 【Pages】:1208-1211
【Authors】: Kai Puolamäki ; Emilia Oikarinen ; Bo Kang ; Jefrey Lijffijt ; Tijl De Bie
【Abstract】: The exploration of high-dimensional real-valued data is one of the fundamental exploratory data analysis (EDA) tasks. Existing methods use predefined criteria for the representation of data. There is a lack of methods eliciting the user's knowledge from the data and showing patterns the user does not know yet. We provide a theoretical model where the user can input the patterns she has learned as knowledge. The background knowledge is used to find a MaxEnt distribution of the data, and the user is shown maximally informative projections in which the MaxEnt distribution and the data differ the most. We provide an interactive open source EDA system, study its performance, and present use cases on real data.
【Keywords】: exploratory data analysis; dimensionality reduction; information theory; subjective interestingness
【Paper Link】 【Pages】:1212-1215
【Authors】: Kun Xie ; Yuxiang Chen ; Xin Wang ; Gaogang Xie ; Jigang Wen ; Dafang Zhang
【Abstract】: Tensor completion can be applied to fill in the missing data, which is import for many data applications where the data are incomplete. To infer the missing data, existing tensor-completion algorithms generally assume that the tensor data have global low-rank structure and apply a single model to fit the overall observed data through the global optimization. However, there are different correlation levels among application data, thus the ranks of some sub-tensors can be even lower relative to that of the large tensor. Fitting a single model to all data will compromise the performance of data recovery. To increase the accuracy in missing data recovery, we propose to apply local tensor completion (Local-TC) to recover data from sub-tensors, with each containing data of higher correlations. Although promising, as the tensor data are only organized logically, it is difficult to determine the relationship among data. We propose to exploit locality-sensitive hash (LSH) to quickly find the data correlation and reorganize tensor data, based on which data entries with high correlations are put into the same sub-tensor. The experiment results demonstrate that Local-TC is very effective in increasing the recovery accuracy.
【Keywords】: Tensor Completion; Locality Sensitive Hashing
【Paper Link】 【Pages】:1216-1219
【Authors】: Da Yan ; Hongzhi Chen ; James Cheng ; Zhenkun Cai ; Bin Shao
【Abstract】: De novo genome assembly is the process of stitching short DNA sequences to generate longer DNA sequences, without using any reference sequence for alignment. It enables high-throughput genome sequencing and thus accelerates the discovery of new genomes. In this paper, we present a toolkit, called PPA-assembler, for de novo genome assembly in a distributed setting. PPA-assembler adopts the popular de Bruijn graph based approach, and the operations run on Google's Pregel framework with strong performance guarantees. PPA-assembler demonstrates superior performance compared with existing assemblers, and is open-sourced at https://github.com/yaobaiwei/PPAAssembler. The full version of this paper can be found at https://arxiv.org/abs/1801.04453.
【Keywords】: genome assembly; Pregel; de Bruijn graph; DNA sequence
【Paper Link】 【Pages】:1220-1223
【Authors】: Cheng-Hao Deng ; Wan-Lei Zhao
【Abstract】: In the big data era, k-means clustering has been widely adopted as a basic processing tool in various contexts. However, its computational cost could be prohibitively high when the data size and the cluster number are large. The processing bottleneck of k-means lies in the operation of seeking the closest centroid in each iteration. In this paper, a novel solution towards the scalability issue of k-means is presented. In the proposal, k-means is supported by an approximate k-nearest neighbors graph. In the k-means iteration, each data sample is only compared to clusters that its nearest neighbors reside. Since the number of nearest neighbors we consider is much less than k, the processing cost in this step becomes minor and irrelevant to k. The processing bottleneck is therefore broken. The most interesting thing is that k-nearest neighbor graph is constructed by calling the fast k-means itself. Compared with existing fast k-means variants, the proposed algorithm achieves hundreds to thousands times speed-up while maintaining high clustering quality. As it is tested on 10 million 512-dimensional data, it takes only 5.2 hours to produce 1 million clusters. In contrast, it would take 3 years for traditional k-means to fulfill the same scale of clustering.
【Keywords】: k-means; k-NN Graph; fast
【Paper Link】 【Pages】:1224-1227
【Authors】: Hui Guan ; Yufei Ding ; Xipeng Shen ; Hamid Krim
【Abstract】: K-means configuration is a time-consuming process due to the iterative nature of k-means. This paper proposes reuse-centric k-means configuration to accelerate k-means configuration. It is based on the observation that the explorations of different configurations share lots of common or similar computations. Effectively reusing the computations from prior trials of different configurations could largely shorten the configuration time. The paper presents a set of novel techniques to materialize the idea, including reuse-based filtering, center reuse, and a two-phase design to capitalize on the reuse opportunities on three levels: validation, k, and feature sets. Experiments show that our approach can accelerate some common configuration tuning methods by 5-9X.
【Keywords】: kmeans; algorithm configuration; computation reuse
【Paper Link】 【Pages】:1228-1231
【Authors】: Bo Zhang ; Boxiang Dong ; Wendy Hui Wang
【Abstract】: We design AssureMR, a system that supports efficient verification of SQL Selection-GroupBy-Aggregation (SGA) query evaluation on an untrusted MapReduce system. AssureMR does not rely on a centralized trusted party to construct the authentication data structure (ADS). Instead, AssureMR allows the untrusted mappers/reducers to construct ADS. AssureMR provides the following verification functionality: (1) correctness verification of ADS; (2) correctness verification of intermediate query results by individual mapper; and (3) correctness verification of final query results by reducers. Our experimental results demonstrate the efficiency and effectiveness of AssureMR.
【Keywords】: SQL; Verification; MapReduce
【Paper Link】 【Pages】:1232-1235
【Authors】: Md. Shiblee Sadik ; Le Gruenwald ; Eleazar Leal
【Abstract】: Data streams are sequences of data points that have the properties of transiency, infiniteness, concept drift, uncertainty, multi-dimensionality, cross-correlation among different streams, asynchronous arrival, and heterogeneity. In this paper we propose a new outlier detection technique for multiple multi-dimensional data streams, called Wadjet, that addresses all the issues of outlier detection in multiple data streams. Wadjet exploits the temporal correlations to identify outliers in each individual data stream, and after this, it exploits the cross-correlations between data streams to identify points that do not conform with these cross-correlations. Experiments comparing Wadjet against existing techniques on real and synthetic datasets show that Wadjet achieves 18.8× higher precision, and competitive execution time and recall.
【Keywords】: Outlier detection; data streams; heterogeneous data streams; uncertain data streams
【Paper Link】 【Pages】:1236-1239
【Authors】: Pan Xu ; Cuong Nguyen ; Srikanta Tirthapura
【Abstract】: Space filling curves (SFCs) are widely used in the design of indexes for spatial and temporal data. Clustering is a key metric for an SFC, that measures how well the curve preserves locality in mapping from higher dimensions to a single dimension. We present the onion curve, an SFC whose clustering performance is provably close to the optimal for cube and near-cube shaped query sets. We show that in contrast, the clustering performance of the widely used Hilbert curve can be far from optimal, even for cube-shaped queries. Since clustering performance is critical to the efficiency of multi-dimensional indexes based on the SFC, the onion curve can deliver improved performance for data structures for multi-dimensional data.
【Keywords】: Query Processing; Indexing; and Optimization
【Paper Link】 【Pages】:1240-1243
【Authors】: Tobias Christiani ; Rasmus Pagh ; Johan Sivertsen
【Abstract】: Set similarity join is a fundamental and well-studied database operator. It is usually studied in the exact setting where the goal is to compute all pairs of sets that exceed a given similarity threshold (measured e.g. as Jaccard similarity). But set similarity join is often used in settings where 100% recall may not be important - indeed, where the exact set similarity join is itself only an approximation of the desired result set. We present a new randomized algorithm for set similarity join that can achieve any desired recall up to 100%, and show theoretically and empirically that it significantly improves on existing methods. The present state-of-the-art exact methods are based on prefix-filtering, the performance of which depends on the data set having many rare tokens. Our method is robust against the absence of such structure in the data. At 90% recall our algorithm is often more than an order of magnitude faster than state-of-the-art exact methods, depending on how well a data set lends itself to prefix filtering. Our experiments on benchmark data sets also show that the method is several times faster than comparable approximate methods. Our algorithm makes use of recent theoretical advances in high-dimensional sketching and indexing that we believe to be of wider relevance to the data engineering community.
【Keywords】: set; similarity; join
【Paper Link】 【Pages】:1244-1247
【Authors】: Wenjian Xu ; Eric Lo ; Pengfei Zhang
【Abstract】: Scan is a crucial operation in main-memory column-stores. It scans a column and returns a result bit vector indicating which records satisfy a filter predicate. ByteSlice is an in-memory data layout that chops data into multiple bytes and exploits early-stop capability by high-order bytes comparisons. As column widths are usually not multiples of byte, the last-byte of ByteSlice is padded with 0's, wasting memory bandwidth and computation power. To fully leverage the resources, we propose to weave a secondary index into the vacant bits (i.e., bits originally padded with 0's), forming our new layout coined DIFusion (Data Index Fusion). DIFusion enables skip-scan, a new fast scan that inherits the early-stopping capability from ByteSlice and at the same time possesses the data-skipping ability of index with zero space overhead. Empirical results show that skip-scan on DIFusion outperforms scan on ByteSlice.
【Keywords】: in memory scan; index; SIMD
【Paper Link】 【Pages】:1248-1251
【Authors】: Libin Zheng ; Lei Chen
【Abstract】: The system throughput and workers' travel distance are two important factors in spatial crowdsourcing and improving one of them usually means sacrificing the other. However, existing works either fail to consider the trade-off between these two factors or resolve their conflicts by simply targeting tasks within a bounding circle for each worker. In this paper, we compromise between the throughput and the distance by formulating these two factors as score terms in the objective function. Apart from that, we study the multi-campaign scenario in our problem, which is not uncommon in practical applications while not yet discussed in existing works. The worker diversity of the campaigns is formulated as another score term in the objective function. The problem of multi-campaign oriented spatial crowdsourcing is to maximize the aforementioned score function. We prove the problem is NP-hard and provide several approximation solutions. Extensive experiments have been conducted to validate the devised solutions.
【Keywords】: spatial; crowdsourcing; crowdsensing
【Paper Link】 【Pages】:1252-1255
【Authors】: Zhuhe Fang ; Chuliang Weng ; Li Wang ; Aoying Zhou
【Abstract】: To fully use the advanced resources of a main memory database cluster, we take independent parallelism into account to parallelize multiple pipelines of one query. However, scheduling resources to multiple pipelines is an intractable problem. Traditional static approaches to this problem may lead to a serious waste of resources and suboptimal execution order of pipelines, because it is hard to predict the actual data distribution and fluctuating workloads at compile time. In response, we propose a dynamic scheduling algorithm, List with Filling and Preemption (LFPS), based on two techniques. (1) Adaptive filling improves resource utilization by issuing more extra pipelines to adaptively fill idle resource "holes" during execution. (2) Cost-based preemption strictly guarantees scheduling the pipelines on a critical path first at run time. We implement LFPS in our prototype database system. Under the workloads of TPC-H, experiments show our work improves the finish time of parallelizable pipelines from one query up to 2.3X than a static approach and 1.7X than a serialized execution.
【Keywords】: independent parallelism; multiple pipelines; adaptive filling; preemption
【Paper Link】 【Pages】:1256-1259
【Authors】: Lizhen Wang ; Xuguang Bao ; Longbing Cao
【Abstract】: Spatial co-location pattern mining is an important task in spatial data mining. However, traditional mining frameworks often produce too many prevalent patterns of which only a small proportion may be truly interesting to end users. To satisfy user preferences, this work proposes an interactive probabilistic post-mining method to discover user-preferred co-location patterns from the early-round of mined results by iteratively involving user's feedback and probabilistically refining preferred patterns. We first introduce a framework of interactively post-mining preferred co-location patterns, which enables a user to effectively discover the co-location patterns tailored to his/her specific preference. A probabilistic model is further introduced to measure the user feedback-based subjective preferences on resultant co-location patterns. This measure is used to not only select sample co-location patterns in the iterative user feedback process but also rank the results. The experimental results on real and synthetic data sets demonstrate the effectiveness of our approach.
【Keywords】: Spatial data mining; spatial co location patterns; interactive feedback; probability model
【Paper Link】 【Pages】:1260-1263
【Authors】: Ze-ke Wang ; Kai Zhang ; Haihang Zhou ; Xue Liu ; Bingsheng He
【Abstract】: The optimization of conjunctive predicates is still critical to the overall performance of analytic data processing tasks, especially on a denormalized table, where queries with time-consuming joins on the original normalized tables are converted into simple scans. Existing work relies on the query optimizer to do the selectivity estimation and then produce the optimal evaluation order of predicates. In this paper, we argue for an order-oblivious approach, based on memory-efficient storage layouts. Accordingly, we propose Hebe, a simplified execution scheme which is attractive to the query optimizer, as it does not need to go through a sampling process to determine an optimal evaluation order of predicates. Compared with the state-of-theart implementation with the optimal evaluation order, Hebe can also achieve up to 153% performance improvement.
【Keywords】: Conjunctive predicates; Order obliviousness; Modern Hardware
【Paper Link】 【Pages】:1264-1267
【Authors】: Christine Tex ; Martin Schäler ; Klemens Böhm
【Abstract】: When mining data, organizations rely on service providers to carry out the analyses. However, data owners often are only willing to transfer their data when it is encrypted. So encryption must preserve the mining results. Since many mining algorithms are distance-based, we propose the notion of distance-preserving encryption (DPE). Designing a DPE-scheme is challenging, as it depends both on the data and the distance measure in use. We propose a procedure to engineer DPE-schemes, dubbed KIT-DPE. In a case study, we instantiate KIT-DPE for SQL query logs. We design DPE-schemes for all SQL query-distance measures from the literature. For all these measures, we prove that one can use a combination of existing property-preserving encryption schemes with known security characteristics to guarantee the same mining result.
【Keywords】: Distance Based Data Mining; SQL Logs; Distance Preserving Encryption
【Paper Link】 【Pages】:1268-1271
【Authors】: Yanhao Wang ; Yuchen Li ; Kian-Lee Tan
【Abstract】: Representative subset selection (RSS) is an important tool for users to draw insights from massive datasets. A common approach is to model RSS as the submodular maximization problem because the utility of extracted representatives often satisfies the "diminishing returns" property. To capture the data recency issue and support different types of constraints in real-world problems, we formulate RSS as maximizing a submodular function subject to a d-knapsack constraint (SMDK) over sliding windows. Then, we propose a novel KnapWindow framework for SMDK. Theoretically, KnapWindow is 1-ε/1+d - approximate for SMDK and achieves sublinear complexity. Finally, we evaluate the efficiency and effectiveness of KnapWindow on real-world datasets. The results show that it achieves up to 120x speedups over the batch baseline with at least 94% utility assurance.
【Keywords】: data summarization; submodular maximization; data stream; sliding window; approximation algorithm
【Paper Link】 【Pages】:1272-1275
【Authors】: Nikos Bikakis ; Vana Kalogeraki ; Dimitrios Gunopulos
【Abstract】: A major challenge for social event organizers (e.g., event planning and marketing companies, venues) is attracting the maximum number of participants, since it has great impact on the success of the event, and, consequently, the expected gains (e.g., revenue, artist/brand publicity). In this paper, we introduce the Social Event Scheduling (SES) problem, which schedules a set of social events considering user preferences and behavior, events' spatiotemporal conflicts, and competing events, in order to maximize the overall number of attendees. We show that SES is strongly NP-hard, even in highly restricted instances. To cope with the hardness of the SES problem we design a greedy approximation algorithm. Finally, we evaluate our method experimentally using a dataset from the Meetup event-based social network.
【Keywords】: Social Event Planning; Attendance Maximization; Social Event Arrangement; Event Organizers; Event Participants; Social Event Mining; Assignment Problems; Event based Social Networks
【Paper Link】 【Pages】:1276-1279
【Authors】: Li Su ; Yongluan Zhou
【Abstract】: Correlated failures that usually involve a number of nodes failing simultaneously have significant effect on systems' availability, especially for streaming applications that require real-time analysis. Most state-of-the-art distributed stream processing engines focus on recovering individual operator failure. By analyzing the existing recovery techniques, we identify the challenges and propose a fault-tolerance framework that can tolerate both individual and correlated failures with minimum overhead during the system's normal execution. Our progressive and query-centric recovery paradigm carefully schedules the recovery of failed operators based on the current availability of resources, such that the outputs of queries can be recovered as early as possible. We also formulate the new problem of recovery scheduling under correlated failures and design algorithms to optimize the recovery latency with a performance guarantee.
【Keywords】: Distributed Stream Processing; Fault Tolerance; Correlated Failure
【Paper Link】 【Pages】:1280-1283
【Authors】: Mohammadreza Najafi ; Mohammad Sadoghi ; Hans-Arno Jacobsen
【Abstract】: Efficient real-time analytics are an integral part of a growing number of data management applications such as computational targeted advertising, algorithmic trading, and Internet of Things. In this paper, we primarily focus on accelerating stream joins, arguably one of the most commonly used and resource-intensive operators in stream processing. We propose a scalable circular pipeline design (Circular-MJ) in hardware to orchestrate multi-way join while minimizing data flow disruption. In this circular design, each new tuple (given its origin stream) starts its processing from a specific join core and passes through all respective join cores in a pipeline sequence to produce final results. We further present a novel two-stage pipeline stream join (Stashed-MJ) that uses a best-effort buffering technique (stash) to maintain intermediate results. In a case that an overwrite is detected in the stash, our design automatically resorts to recomputing intermediate results. Our experimental results demonstrate a linear throughput scaling with respect to the number of execution units in hardware.
【Keywords】: Stream Joins; FPGA; hardware acceleration; heterogeneous hardware; query optimization; parallel algorithms
【Paper Link】 【Pages】:1284-1287
【Authors】: Liang Li ; Guoren Wang ; Gang Wu ; Ye Yuan
【Abstract】: In-memory databases (IMDBs) are gaining increasing popularity in big data applications, where clients commit updates intensively. Consistent snapshot is a key step in backup and recovery of IMDBs, thus an important factor for system performance of IMDBs. Formally, the in-memory consistent snapshot problem refers to taking an in-memory consistent time-in-point snapshot with the constraints that 1) clients can read the latest data items, and 2) any data item in the snapshot should not be overwritten. Various snapshot algorithms have been proposed in the academia to trade off throughput and latency, yet industrial IMDBs such as Redis still stick to the simple fork algorithm. As an understanding of this phenomenon, we conduct comprehensive performance evaluations on mainstream snapshot algorithms. Surprisingly, we observe that the simple fork algorithm indeed outperforms the state-of-the-arts in update-intensive workload scenarios. On this basis, we identify the drawbacks of existing research and propose two lightweight improvements. Extensive evaluations on synthetic data and Redis show that our lightweight improvements yield better performance than fork, the current industrial standard, and the representative snapshot algorithms from the academia. Finally, we have opensourced the implementation of all the above snapshot algorithms to facilitate practitioners to benchmark the performance of each algorithm and select proper methods for different application scenarios.
【Keywords】: In Memory databases; Consistent Snapshot
【Paper Link】 【Pages】:1288-1291
【Authors】: Tieying Zhang ; Anthony Tomasic ; Yangjun Sheng ; Andrew Pavlo
【Abstract】: Current architectures for main-memory online transaction processing (OLTP) database management systems (DBMS) typically use random scheduling to assign transactions to threads. This approach achieves uniform load across threads but it ignores the likelihood of conflicts between transactions. If the DBMS could estimate the potential for transaction conflict and then intelligently schedule transactions to avoid conflicts, then the system could improve its performance. Such estimation of transaction conflict, however, is non-trivial for several reasons. First, conflicts occur under complex conditions that are far removed in time from the scheduling decision. Second, transactions must be represented in a compact and efficient manner to allow for fast conflict detection. Third, given some evidence of potential conflict, the DBMS must schedule transactions in such a way that minimizes this conflict. In this paper, we systematically explore the design decisions for solving these problems. We then empirically measure the performance impact of different representations on a standard OLTP benchmark.
【Keywords】: online transaction processing; scheduling
【Paper Link】 【Pages】:1292-1295
【Authors】: Kaiyu Feng ; Tao Guo ; Gao Cong ; Sourav S. Bhowmick ; Shuai Ma
【Abstract】: With the proliferation of location-based services, the generation of massive geo-tagged data opens up new opportunities to address real-world problems. In this paper, we present a novel continuous bursty region detection (SURGE) problem that aims to continuously detect a bursty region of a given size in a specified geographical area from a stream of spatial objects. The SURGE problem is useful in addressing several real-world challenges such as disease outbreak detection. We propose an exact solution to address the problem, and show the efficiency and effectiveness by conducting experiments on real-world datasets.
【Keywords】: stream; burst detection; spatial data management
【Paper Link】 【Pages】:1296-1299
【Authors】: Vasilis Efthymiou ; George Papadakis ; Kostas Stefanidis ; Vassilis Christophides
【Abstract】: Entity Resolution (ER) aims to identify different descriptions in various Knowledge Bases (KBs) that refer to the same entity. ER is challenged by the Variety, Volume and Veracity of descriptions published in the Web of Data. To address them, we propose the MinoanER framework that fulfills full automation and support of highly heterogeneous entities. MinoanER leverages a token-based similarity of entities to define a new metric that derives the similarity of neighboring entities from the most important relations, indicated only by statistics. For high efficiency, similarities are computed from a set of schema-agnostic blocks and processed in a non-iterative way that involves four threshold-free heuristics. We demonstrate that the effectiveness of MinoanER is comparable to existing ER tools over real KBs exhibiting low heterogeneity in terms of entity types and content. Yet, MinoanER outperforms state-of-the-art ER tools when matching highly heterogeneous KBs.
【Keywords】: Entity Resolution; Web Data
【Paper Link】 【Pages】:1300-1303
【Authors】: Jonas Traub ; Philipp Marian Grulich ; Alejandro Rodriguez Cuellar ; Sebastian Breß ; Asterios Katsifodimos ; Tilmann Rabl ; Volker Markl
【Abstract】: Computing aggregates over windows is at the core of virtually every stream processing job. Typical stream processing applications involve overlapping windows and, therefore, cause redundant computations. Several techniques prevent this redundancy by sharing partial aggregates among windows. However, these techniques do not support out-of-order processing and session windows. Out-of-order processing is a key requirement to deal with delayed tuples in case of source failures such as temporary sensor outages. Session windows are widely used to separate different periods of user activity from each other. In this paper, we present Scotty, a high throughput operator for window discretization and aggregation. Scotty splits streams into non-overlapping slices and computes partial aggregates per slice. These partial aggregates are shared among all concurrent queries with arbitrary combinations of tumbling, sliding, and session windows. Scotty introduces the first slicing technique which (1) enables stream slicing for session windows in addition to tumbling and sliding windows and (2) processes out-of-order tuples efficiently. Our technique is generally applicable to a broad group of dataflow systems which use a unified batch and stream processing model. Our experiments show that we achieve a throughput an order of magnitude higher than alternative state-of-the-art solutions.
【Keywords】: Window; Aggregation; Session; Session Windows; Scotty; out of order; Stream; Stream Processing; Slicing; Stream Slicing; Aggregate sharing
【Paper Link】 【Pages】:1304-1307
【Authors】: Chao Yan ; Bo Li ; Yevgeniy Vorobeychik ; Aron Laszka ; Daniel Fabbri ; Bradley Malin
【Abstract】: A wide variety of mechanisms, such as alert triggers and auditing routines, have been developed to notify administrators about types of suspicious activities in the daily use of large databases of personal and sensitive information. However, such mechanisms are limited in that: 1) the volume of such alerts is often substantially greater than the auditing capabilities of budget-constrained organizations and 2) strategic attackers may disguise their actions or carefully choose which records they touch, thus evading auditing routines. To address these problems, we introduce a novel approach to database auditing that explicitly accounts for adversarial behavior by 1) prioritizing the order in which types of alerts are investigated and 2) providing an upper bound on how much budget to allocate for auditing each alert type. We model the interaction between a database auditor and potential attackers as a Stackelberg game in which the auditor chooses an auditing policy and attackers choose which records in a database to target. We further introduce an efficient approach that combines linear programming, column generation, and heuristic search to derive an auditing policy, in the form of a mixed strategy. We assess the performance of the policy selection method using a publicly available credit card application dataset, the results of which indicate that our method produces high-quality database audit policies, significantly outperforming baselines that are not based in a game theoretic framing.
【Keywords】: Alert Prioritization; Game Theory; Database Auditing; Anomaly Detection
【Paper Link】 【Pages】:1308-1311
【Authors】: Yazeed Alabdulkarim ; Marwan Almaymoni ; Shahram Ghandeharizadeh
【Abstract】: Polygraph is a tool to quantify system behavior that violates serial execution of transactions. It is a plug-n-play framework that operates externally to the system at the conceptual granularity of entities and their relationships.We demonstrate Polygraph's ability to plug-in to existing benchmarks including TPC-C, SEATS, TATP, YCSB, and BG. In addition, we show Polygraph processes transaction log records in realtime and characterize its scalability characteristics.
【Keywords】: Consistency; Update anomalies; Transaction processing; Serial schedules
【Paper Link】 【Pages】:1312-1315
【Authors】: Chunyao Song ; Tingjian Ge
【Abstract】: Nowadays, a graph serves as a fundamental data structure for many applications. As graph edges stream in, users are often only interested in the recent data. In data exploration, how to store and process such massive amounts of graph stream data becomes a significant problem. As vertex and edge attributes are often referred to as labels, we propose a labeled graph sketch that stores real-time graph structural information in sublinear space and supports queries of diverse types. This sketch also supports sliding window queries. We conduct experiments on three real-world datasets, comparing with a state-of-the-art method to show the superiority of our sketch.
【Keywords】: graph stream; graph sketch; big graph
【Paper Link】 【Pages】:1316-1319
【Authors】: Caihua Shan ; Nikos Mamoulis ; Guoliang Li ; Reynold Cheng ; Zhipeng Huang ; Yudian Zheng
【Abstract】: We study the effective use of crowdsourcing in filling missing values in a given relation (e.g., a table containing different attributes of celebrity stars, such as nationality and age). A task given to a worker typically consists of questions about the missing attribute values (e.g., what is the age of Jet Li?). Existing work often treats related attributes independently, leading to suboptimal performance. We present T-Crowd: a crowdsourcing system that considers attribute relationships. T-Crowd integrates each worker's answers on different attributes to effectively learn his/her trustworthiness and the true data values. Our solution seamlessly supports categorical and continuous attributes. Our experiments on real datasets show that T-Crowd outperforms state-of-the-art methods, improving the quality of truth inference.
【Keywords】: Crowdsourcing; Tabular Data
【Paper Link】 【Pages】:1320-1323
【Authors】: Christopher Bartley ; Wei Liu ; Mark Reynolds
【Abstract】: In many machine learning applications there exists prior knowledge that the response variable should be non-decreasing in one or more of the features. For example, the chance of a tumour being malignant should not decrease with increasing diameter (all else being equal). While a number of classification algorithms make use of monotone knowledge, many are limited to full monotonicity (in all features). Taking inspiration from instance based classifiers, we present a framework for monotone additive rule ensembles that is the first to cater for partial monotonicity (in some features). We demonstrate it by developing a partially monotone instance based classifier based on L1 cones. Experiments show that the algorithm produces reasonable results on real data sets while ensuring perfect partial monotonicity.
【Keywords】: monotonicity; classification; instance-based; domain knowledge
【Paper Link】 【Pages】:1324-1327
【Authors】: Yuanyuan Zhu ; Qian Zhang ; Lu Qin ; Lijun Chang ; Jeffrey Xu Yu
【Abstract】: Keyword search problem has been widely studied to retrieve related substructures from graphs for a keyword set. However, existing well-studied approaches aim at finding compact trees/subgraphs containing the keywords, and ignore a critical measure, density, to reflect how strongly and stablely the keyword nodes are connected in the substructure. In this paper, we study the problem of finding a cohesive subgraph containing the query keywords based on the k-truss model, and formulate it as minimal dense truss search problem, i.e., finding minimal subgraph with maximum trussness covering the keywords. We first propose an efficient algorithm to find the dense truss with the maximum trussness containing keywords based on a novel hybrid KT-Index (Keyword-Truss Index). Then, we develop a novel refinement approach to extract the minimal dense truss based on the anti-monotonicity property of k-truss. Experimental studies on real datasets show the outperformance of our method.
【Keywords】: keyword search; cohesive subgraph; large graph
【Paper Link】 【Pages】:1328-1331
【Authors】: Xun Jian ; Xiang Lian ; Lei Chen
【Abstract】: Modern networks are of huge sizes as well as high dynamics, which challenges the efficiency of community detection algorithms. In this paper, we study the problem of overlapping community detection on distributed and dynamic graphs. Given a distributed, undirected and unweighted graph, the goal is to detect overlapping communities incrementally as the graph is dynamically changing. We propose an efficient algorithm, called randomized Speaker-Listener Label Propagation Algorithm (rSLPA), based on the Speaker-Listener Label Propagation Algorithm (SLPA) by relaxing the probability distribution of label propagation. Besides detecting high-quality communities, rSLPA can incrementally update the detected communities after a batch of edge insertion and deletion operations. To the best of our knowledge, rSLPA is the first algorithm that can incrementally capture the same communities as those obtained by applying the detection algorithm from the scratch on the updated graph. Extensive experiments are conducted on both synthetic and real-world datasets, and the results show that our algorithm can achieve high accuracy and efficiency at the same time.
【Keywords】: community detection; MapReduce; dynamic graphs
【Paper Link】 【Pages】:1332-1335
【Authors】: Jiabao Sun ; Jiajie Xu ; Rui Zhou ; Kai Zheng ; Chengfei Liu
【Abstract】: Discovering expert drivers is highly important for a broad range of location based services, but this issue is largely untouched in previous trajectory mining and search studies. In this paper, we study the problem of trajectory data driven expert driver discovery. It aims to find out top-k expert drivers about a region of interest, based on the understanding of their historical trajectories. To this end, we first investigate the construction of reference system, which collectively describes exemplar routes among important junctions, so that the driving behaviors embedded in each trajectory can be evaluated. To discover expert drivers accurately, a novel tf-idf concept based measure is proposed afterwards, such that the rationality of their trajectories are not only precisely evaluated by the match to reference system, but also properly aggregated for modelling expert drivers. Extensive experimental evaluation using real trajectory datasets demonstrates the effectiveness and efficiency of our proposed solutions.
【Keywords】: spatial database; trajectory mining
【Paper Link】 【Pages】:1336-1339
【Authors】: Dongqing Xiao ; Mohamed Y. Eltabakh ; Xiangnan Kong
【Abstract】: Many graphs in social and business applications are not deterministic, but are uncertain in nature. Related research requires open access to these uncertain graphs. While sharing these datasets often risks exposing sensitive user data to the public. However, current graph anonymization works only target on deterministic graphs and overlook the uncertain scenario. Our work seeks a solution to release uncertain graphs with high utility without compromising user privacy. We show that simply combining the representative extraction strategy and conventional graph anonymization method will result in the addition of noise that significantly disrupts uncertain graph structure. Instead, we introduce an uncertainty-aware method, Chameleon, that provides identical privacy guarantees with much less noise. With the possible world semantics, it enables a fine-grained control over the injected noise. Finally, we apply our method to real uncertain graphs and show that it produces anonymized uncertain graphs that closely match the originals in graph structure statistics.
【Keywords】: Uncertain graph; Privacy protection; k-anonymity
【Paper Link】 【Pages】:1340-1343
【Authors】: Jianxin Li ; Taotao Cai ; Ajmal Mian ; Rong-Hua Li ; Timos Sellis ; Jeffrey Xu Yu
【Abstract】: The problem of influence maximization has recently received significant attention. However, most studies focused on user influence via cyber interactions while ignoring their physical interactions which are important to gauge influence propagation. Additionally, targeted campaigns or advertisements have not received sufficient attention. To do this, we first devise a novel holistic influence diffusion model and then formulate a new holistic influence maximization query problem and develop three algorithms. Finally, we conduct extensive experiments to evaluate the effectiveness and efficiency of the proposed solutions.
【Keywords】: Influence Maximization; Targeted Advertisements; Spatial Social Networks
【Paper Link】 【Pages】:1344-1347
【Authors】: Pericles de Oliveira ; Altigran Soares da Silva ; Edleno Silva de Moura ; Rosiane Rodrigues
【Abstract】: Several systems for processing keyword queries over relational databases rely on the generation and evaluation of Candidate Networks (CNs), i.e., networks of joined relations that when processed as SQL queries, provide a relevant answer to the input keyword query. Although the evaluation of CNs has been extensively addressed in the literature, the problem of generating CNs has received much less attention. We propose a novel approach for generating CNs, wherein the possible matches for the query in the database are efficiently enumerated at first. These query matches are then used to guide the CN generation process, avoiding the exhaustive search procedure used by the current state-of-art approaches. We experimentally show that our approach allows the generation of a compact set of CNs that results in superior quality answers and demands less resources in terms of processing time and memory.
【Keywords】: Keyword Queries; Candidate Networks; Schema Graphs
【Paper Link】 【Pages】:1348-1351
【Authors】: Antonia Saravanou ; Ioannis Katakis ; George Valkanas ; Dimitrios Gunopulos
【Abstract】: We define and solve the problem of event detection and delineation as a task of identifying events and decomposing them to their major sub-events, with a description and a timeline. We propose DeLi, an algorithm that focuses on providing such an understanding of events and sub-events. DeLi, to the best of our knowledge, is the first method that addresses the problem in a generic stream of text, and in an online fashion. Extensive evaluation on social streaming data demonstrates that, by combining the structure of a social network with content attributes, our method outperforms the state-of-the-art techniques.
【Keywords】: Sub Event Detection; Event Detection; Anomaly Detection; Social Network Analysis; Graph Mining; Text Mining; Data Mining
【Paper Link】 【Pages】:1352-1355
【Authors】: Jefrey Lijffijt ; Bo Kang ; Wouter Duivesteijn ; Kai Puolamäki ; Emilia Oikarinen ; Tijl De Bie
【Abstract】: Deriving insights from high-dimensional data is one of the core problems in data mining. The difficulty mainly stems from the large number of variable combinations to potentially consider. Hence, an obvious question is whether we can automate the search for interesting patterns. Here, we consider the setting where a user wants to learn as efficiently as possible about real-valued attributes. We introduce a method to find subgroups in the data that are maximally informative (in the Information Theoretic sense) with respect to one or more real-valued target attributes. The succinct subgroup descriptions are in terms of arbitrarily-typed description attributes. The approach is based on the Subjective Interestingness framework FORSIED to use prior knowledge when mining most informative patterns.
【Keywords】: Exploratory Data Mining; Pattern Mining; Subgroup Discovery; Exceptional Model Mining; Subjective Interestingness
【Paper Link】 【Pages】:1356-1359
【Authors】: Michele Linardi ; Themis Palpanas
【Abstract】: Data series similarity search is an important operation and at the core of several analysis tasks and applications related to data series collections. Despite the fact that data series indexes enable fast similarity search, all existing indexes can only answer queries of a single length (fixed at index construction time), which is a severe limitation. In this work, we propose ULISSE, the first data series index structure designed for answering similarity search queries of variable length. Our contribution is two-fold. First, we introduce a novel representation technique, which effectively and succinctly summarizes multiple sequences of different length. Based on the proposed index, we describe efficient algorithms for approximate and exact similarity search, combining disk based index visits and in-memory sequential scans. We experimentally evaluate our approach using several synthetic and real datasets. The results show that ULISSE is several times (and up to orders of magnitude) more efficient in terms of both space and time cost, when compared to competing approaches.
【Keywords】: Data Series; Time Series; Similarity Search; Indexing; Variable length
【Paper Link】 【Pages】:1360-1363
【Authors】: Andreas Kipf ; Harald Lang ; Varun Pandey ; Raul Alexandru Persa ; Peter A. Boncz ; Thomas Neumann ; Alfons Kemper
【Abstract】: Geospatial joins are a core building block of connected mobility applications. An especially challenging problem are joins between streaming points and static polygons. Since points are not known beforehand, they cannot be indexed. Nevertheless, points need to be mapped to polygons with low latencies to enable real-time feedback. We present an approximate geospatial join that guarantees a user-defined precision. Our technique uses a quadtree-based hierarchical grid to approximate polygons and stores these approximations in a specialized radix tree. Our approach can perform up to several orders of magnitude faster than existing techniques while providing sufficiently precise results for many applications.
【Keywords】: approximate query processing; geospatial joins; main memory database systems
【Paper Link】 【Pages】:1364-1367
【Authors】: Swarup Chandra ; Ahsanul Haque ; Hemeng Tao ; Jie Liu ; Latifur Khan ; Charu C. Aggarwal
【Abstract】: In traditional machine learning, it is assumed that training data conforming to the stationary distribution of test data is readily available. Yet, such an assumption is not valid in practice due to a high cost of obtaining the truth value of data instances. This is particularly true when computing over non-stationary data streams. Recent studies in the multistream setting aim to address this issue by leveraging a stream of data with biased labeled instances (called the source stream) to train a suitable model for prediction over unlabeled instances (called the target stream). They use sampling bias correction techniques as a preprocessing step for estimating source instance weights for training a bias-corrected classifier useful to predict label of target data instances. In this regard, a recent framework proposes to utilize a Gaussian kernel model to estimate source instance weights. Unfortunately, it suffers from large computational time complexity and consequently deteriorates the rate at which streaming data is processed. In this paper, we address this issue by proposing a divide-and-conquer method suitable for simultaneously evaluating source instance weights and detecting changes in distribution over time. Our empirical results demonstrate a significant gain in execution time, compared to the previous approach, while achieving similar or better classification accuracy on real-world datasets.
【Keywords】: data stream classification; ensemble methods; sampling bias; concept drift
【Paper Link】 【Pages】:1368-1371
【Authors】: Zhixuan Zhou ; Henry Hoffmann
【Abstract】: Recent programming frameworks enable small computing systems to achieve high performance on large-scale graph analytics by supporting out-of-core graph analytics (i.e., processing graphs that exceed main memory capacity). These frameworks make out-of-core programming easy by automating the tedious process of scheduling data transfer between memory and disk. This paper presents two innovations that improve the performance of software frameworks for out-of-core graph analytics. The first is degree-ordered storage, a new storage format that dramatically lowers book-keeping overhead when graphs are larger than memory. The second innovation replaces existing static messages with novel ordered dynamic messages which update their destination immediately, reducing both the memory required for intermediate storage and IO pressure. We implement these innovations in a framework called GraphZ-which we release as open source-and we compare its performance to two state-of-the-art out-of-core graph frameworks. For graphs that exceed memory size, GraphZ's harmonic mean performance improvements are 1.8-8.3× over existing solutions.
【Keywords】: Graph Computation; degree ordered storage
【Paper Link】 【Pages】:1372-1375
【Authors】: Shuang Zhou ; Pingpeng Yuan ; Ling Liu ; Hai Jin
【Abstract】: Reachability query asks whether a vertex can reach another vertex on large directed graphs. It is one of the most fundamental graph operators and has attracted many researchers to study it. Although there are many approaches solving this problem, it still remains a challenging problem when it comes to leverage the three main costs: the index construction time, the index size, and the query time on large and dense graphs. In this paper, we propose a High Dimension Graph Labeling approach to answer reachability queries. First, we recursively partition a graph into disjoint non-shared graphs, among which there are no common vertices, and cross edges. Second, we build a four dimensional label - one dimension of layer, one dimension of sub-graph and two dimensions of interval for each vertex. With the layer label and the sub-graph label, we can determine the positions (non-shared graphs) of any two vertices. Two dimensional interval label is used to assist to answer the reachability queries for vertex pair in those non-shared graphs. Finally, we design algorithms to answer researchability queries efficiently. In order to speed up query answering, we also build two directional labels: up and down labels to filter vertices quickly. The extensive experiments on 28 large/small and dense/sparse graphs show that building the high dimensional index is quickly and the index size is also competitive compared with most of the state of the art approaches. The results also show that our approach is more scalable and efficient than the state-of-the-art approaches in answering reachability queries.
【Keywords】: Reachability query; Partition
【Paper Link】 【Pages】:1376-1379
【Authors】: Qian Lin ; Gang Chen ; Meihui Zhang
【Abstract】: Efficient online transaction processing is key to many database applications, and existing concurrency control protocols perform remarkably well under specific workloads or access patterns that they have been designed for. However, they often do not scale well when the workload is dynamic. To tackle the challenge of dynamic workloads, we propose an Adaptive and Speculative Optimistic Concurrency Control (ASOCC) protocol for effective transaction processing. Based on real-time monitoring of data access frequency, ASOCC adaptively embeds 2PL into the OCC scheme to facilitate superior contention resolution with reduced transaction aborts. Further, ASOCC dynamically inspects the correlation of data accesses and exploits such information to perform speculative transaction restart to save CPU cycles wasted on the processing of transactions that are destined to abort.
【Keywords】: OLTP; distributed database; concurrency control; OCC; 2PL; speculation
【Paper Link】 【Pages】:1380-1383
【Authors】: Shahram Ghandeharizadeh ; Sandy Irani ; Jenny Lam
【Abstract】: Advances in storage technology have introduced Non-Volatile Memory, NVM, as a new storage medium. NVM, along with DRAM and Disk present a system designer with a wide array of options in designing caching middleware. Moreover, design decisions to replicate a data item in more than one level of a caching memory hierarchy may enhance the overall system performance with a faster recovery time in the event of a memory failure. Given a fixed budget, the key configuration questions are: Which storage media should constitute the memory hierarchy? What is the storage capacity of each hierarchy? Should data be replicated or partitioned across the different levels of the hierarchy? We study a model of these cache configuration questions and present results from a simple algorithm to evaluate design tradeoffs in the context of a memory hierarchy for a Key-Value Store, e.g., memcached. The results show selective replication is appropriate with certain failure rates and workload characteristics. With a slim failure rate and frequent data updates, tiering of data across the different storage media that constitute the cache is superior to replication.
【Keywords】: Non Volatile Memory; Caching; Data placement
【Paper Link】 【Pages】:1384-1387
【Authors】: Lalitha Viswanathan ; Alekh Jindal ; Konstantinos Karanasos
【Abstract】: Modern big data systems run on cloud environments where resources are shared among several users and applications. As a result, declarative user queries need to be optimized and executed over resources that constantly change and are provisioned on demand for each job. This requires us to rethink traditional query optimization designed for systems that run on dedicated resources. In this paper, we show evidence that the choice of query plans depends heavily on the resources that the plan will be executed on. The current practice of determining query plans without accounting for resources could lead to significant performance loss in popular big data systems, such as Hive and SparkSQL. Therefore, we make a case for Query and Resource Optimization (or QROP), i.e., choosing both the query plan and the resource configuration at the same time, and present a research agenda towards this direction.
【Keywords】: query optimization; resource configuration; resource optimization; big data analytics
【Paper Link】 【Pages】:1388-1391
【Authors】: Chen Zhang ; Sen Zhang ; Chen Lei ; Peiguang Lin
【Abstract】: Web search analysis plays a critical role in improving the performance of cutting-edge search engines. Most of the existing models, such as the click graph and its variants, focus on utilizing the wisdom of the crowd. However, how to design a model supporting both the collective wisdom as well as the unique characteristic of individuals is rarely studied. In this paper, our goal is to solve the new problem of user-specific web search analysis. We go beyond click graph and propose two probabilistic topic models, Topic Independence Model(TIM) and Topic Dependence Model (TDM). TIM adopts an assumption that the generation of query terms and URLs are topically independent; TDM captures the coupling between search queries and URLs. We also capture the temporal burstiness of topics by utilizing the continuous Beta distribution. Through a large-scale analysis of a real-life search query log, we observe that each user's web search trail enjoys multiple kinds of user-based unique characteristics. On a massive search query log, the new models achieve a better held-out likelihood than standard LDA, DCMLDA and TOT, and they can also effectively reveal the latent evolutions of topics on the corpus level and user-based level.
【Keywords】: web search; topic model; burstiness
【Paper Link】 【Pages】:1392-1403
【Authors】: Carl Yang ; Chao Zhang ; Xuewen Chen ; Jieping Ye ; Jiawei Han
【Abstract】: Online taxicab platforms like DiDi and Uber have impacted hundreds of millions of users on their choices of traveling, but how do users feel about the ride-sharing services, and how to improve their experience? While current ride-sharing services have collected massive travel data, it remains challenging to develop data-driven techniques for modeling and predicting user ride experience. In this work, we aim to accurately predict passenger satisfaction over their rides and understand the key factors that lead to good/bad experiences. Based on in-depth analysis of large-scale travel data from a popular taxicab platform in China, we develop PHINE (Pattern-aware Heterogeneous Information Network Embedding) for data-driven user experience modeling. Our PHINE framework is novel in that it is composed of spatial-temporal node binding and grouping for addressing the inherent data variation, and pattern preservation based joint training for modeling the interactions among drivers, passengers, locations, and time. Extensive experiments on 12 real-world travel datasets demonstrate the effectiveness of PHINE over strong baseline methods. We have deployed PHINE in the DiDi Big Data Center, delivering high-quality predictions for passenger satisfaction on a daily basis.
【Keywords】: Passenger experience; representation learning; joint training
【Paper Link】 【Pages】:1404-1413
【Authors】: Bhargav Kanagal ; Sandeep Tata
【Abstract】: Recommender systems are a key technology for many online services including e-commerce, movies, music, and news. Online retailers use product recommender systems to help users discover items that they may like. However, building a large-scale product recommender system is a challenging task. The problems of sparsity and cold-start are much more pronounced in this domain. Large online retailers have used good recommendations to drive user engagement and improve revenue, but the complexity involved is a roadblock to widespread adoption by smaller retailers. In this paper, we tackle the problem of generating product recommendations for tens of thousands of online retailers. Sigmund is an industrial-scale system for providing recommendations as a service. Sigmund was deployed to production in early 2014 and has been serving retailers every day. We describe the design choices that we made in order to train accurate matrix factorization models at minimal cost. We also share the lessons we learned from this experience - both from a machine learning perspective and a systems perspective. We hope that these lessons are useful for building future machine-learning services.
【Keywords】: Recommender Systems; Matrix Factorization; e commerce
【Paper Link】 【Pages】:1414-1422
【Authors】: Panagiotis Korvesis ; Stephane Besseau ; Michalis Vazirgiannis
【Abstract】: In this paper we present an approach to tackle the problem of event prediction for the purpose of performing predictive maintenance in aviation. Given a collection of recorded events that correspond to equipment failures, our method predicts the next occurrence of one or more events of interest (target events or critical failures). Our objective is to develop an alerting system that would notify aviation engineers well in advance for upcoming aircraft failures, providing enough time to prepare the corresponding maintenance actions. We formulate a regression problem in order to approximate the risk of occurrence of a target event, given the past occurrences of other events. In order to achieve the best results we employed a multiple instance learning scheme (multiple instance regression) along with extensive data preprocessing. We applied our method on data coming from a fleet of aircraft and our predictions involve failures of components onboard, specifically components that are related to the landing gear. The event logs correspond to post flight reports retrieved from multiple aircraft during several years of operation. To the best of our knowledge, this paper is the first attempt on aircraft failure prediction using post flight report data and finally, our findings show high potential impact on the aviation industry.
【Keywords】: predictive maintenance; failure prediction; aviation
【Paper Link】 【Pages】:1423-1434
【Authors】: Mahesh P. Singh ; Puneet Agarwal ; Ashish Chaudhary ; Gautam Shroff ; Prerna Khurana ; Mayur Patidar ; Vivek Bisht ; Rachit Bansal ; Prateek Sachan ; Rohit Kumar
【Abstract】: In this paper we present the design, architecture and implementation of KNADIA, a conversational dialogue system for intra-enterprise use, providing knowledge-assisted question answering and transactional assistance to employees of a large organization. KNADIA has been deployed in production in TCS, a large organization with over 380,000 employees distributed globally; the system is currently supporting a few thousand active users making hundreds of queries per day. We identify, define and distinguish two distinct classes of use-cases: virtual assistance and knowledge synthesis, which we have found to cover a variety of enterprise needs. KNADIA supports both types of conversational agents, with multiple instances of each, while presenting a common digital persona. KNADIA disambiguates which agent should respond to each query. Further, since individual components use deep learning algorithms giving probabilistic outputs, the confidence with which different components answer is often required before answering or acting. The user's dialogue context also needs to be maintained judiciously. Due to these and many other challenges, the overall architecture of KNADIA is non-trivial. We present instances of HR-assistance and technical knowledge synthesis that are in production use in TCS along with accuracy figures and key performance metrics. Finally, we suggest that many elements of our architecture are also generally applicable to other complex deep learning systems.
【Keywords】: Conversational Dialogue System; Deep Learning; natural language processing; virtual assistance; knowledge synthesis; conversational agents; chatbot; digital persona; knowledge graph; Intent Identification; Conversational Systems; AI chatbots
【Paper Link】 【Pages】:1435-1440
【Authors】: Haiqin Weng ; Zhao Li ; Shouling Ji ; Chen Chu ; Haifeng Lu ; Tianyu Du ; Qinming He
【Abstract】: Nowadays, e-commerce has become prevalent world-wide. With the big success of e-commerce, many malicious promotion services also rise: with the goal of increasing sales, malicious merchants attempt to promote their target items by illegally optimizing the search results using fake visits, purchases, etc. In this paper, we study the fraud detection problem on large-scale e-commerce platforms. First, we develop an efficient and scalable AnTi-Fraud system (ATF) to detect e-commerce frauds for large-scale e-commerce platforms, and implement it in parallel on a large-scale computing platform, called Open Data Processing Service (ODPS). Then, we evaluate ATF using two real large-scale e-commerce datasets (with tens of millions users and items). The results demonstrate that both the precision and the recall of ATF can achieve 0.97+, which suggests that ATF is very effective. More importantly, we deploy ATF on the Taobao platform of Alibaba, which is one of the world's largest e-commerce platforms. The evaluation results show that ATF can also achieve an accuracy of 98.16% on Taobao, which again suggests that ATF is very effective and deployable in practice. Our study in this paper is expected to shed light on defending against online frauds for practical e-commerce platforms.
【Keywords】: e-commerce fraud; fraud detection
【Paper Link】 【Pages】:1441-1452
【Authors】: Shasank Chavan ; Albert Hopeman ; Sangho Lee ; Dennis Lui ; Ajit Mylavarapu ; Ekrem Soylemez
【Abstract】: OLAP and real-time analytic workloads in data management systems are dominated by joins, aggregations, scan and filtering costs. In-Memory columnar databases have successfully optimized scans by many orders of magnitude using compressed data formats and SIMD vectorization techniques, but have largely made little impact to the rest of the query execution plan. The Oracle Database In-Memory (DBIM) Option introduced new SQL execution operators that accelerate a wide range of analytic queries, delivering orders of magnitude performance improvement by optimizing aggregation over joins for star and similar schemas. Group-by expressions are pushed down into the scans of dimension tables, creating a unique key per distinct group called a Dense Grouping Key (DGK). A structure called a Key Vector is allocated that maps join keys to DGKs, which is used to filter non-matching rows during the fact table scan. Passing rows are then aggregated directly on compressed codes into DGK-indexed result buffers using SIMD and other novel aggregation techniques. Our innovative solution replaces traditional join and group-by processing (bloom filters, hash table build and probe, serial aggregation) with blazing fast inlined scan operators. And with DBIM's unique dual-format architecture, DML activity (inserts, updates, deletes) do not dampen the phenomenal gains we see with our solution. Using a set of aggregation-heavy queries against the Star Schema Benchmark (SSB) schema, we show that our technique can drastically reduce query elapsed time by more than 10x, making real-time analytics truly achievable.
【Keywords】: In Memory; Aggregation; Group By; Groupby; Analytics; OLAP; OLTP; Columnar; Performance; Oracle; SIMD; Inmemory; Compression; Optimization; Dictionary Encoding; Late Materialization; Star Schema Benchmark; SSB; Star Queries; Data Warehouse; Vectorization
【Paper Link】 【Pages】:1453-1464
【Authors】: Pedro Pedreira ; Yinghai Lu ; Sergey Pershin ; Amit Dutta ; Chris Croswhite
【Abstract】: Although OLTP and OLAP database systems have fundamentally disparate architectures, most research work on concurrency control is geared towards transactional systems and simply adopted by OLAP DBMSs. In this paper we describe a new concurrency control protocol specifically designed for analytical DBMSs that can provide Snapshot Isolation for distributed in-memory OLAP database systems, called Append-Only Snapshot Isolation (AOSI). Unlike previous work, which are either based on multiversion concurrency control (MVCC) or Two Phase Locking (2PL), AOSI is completely lock-free and always maintains a single version of each data item. In addition, it removes the need for per-record timestamps of traditional MVCC implementations and thus considerably reduces the memory overhead incurred by concurrency control. In order to support these characteristics, the protocol sacrifices flexibility and removes support for a few operations, particularly record updates and single record deletions; however, we argue that even though these operations are essential in a pure transactional system, they are not strictly required in most analytic pipelines and OLAP systems. We also present an experimental evaluation of AOSI's current implementation within the Cubrick in-memory OLAP DBMS at Facebook, and show that lock-free single-version Snapshot Isolation can be achieved with low memory overhead and minor impact in query latency.
【Keywords】: concurrency control; OLAP; in memory; snapshot isolation
【Paper Link】 【Pages】:1465-1476
【Authors】: Weiqing Yang ; MingJie Tang ; Yongyang Yu ; Yanbo Liang ; Bikas Saha
【Abstract】: We introduce a simple data model to process non-relational data for relational operations, and SHC (Apache Spark - Apache HBase Connector), an implementation of this model in the cluster computing framework, Spark. SHC leverages optimization techniques of relational data processing over the distributed and column-oriented key-value store (i.e., HBase). Compared to existing systems, SHC makes two major contributions. At first, SHC offers a much tighter integration between optimizations of relational data processing and non-relational data store, through a plug-in implementation that integrates with Spark SQL, a distributed in-memory computing engine for relational data. The design makes the system maintenance relatively easy, and enables users to perform complex data analytics on top of key-value store. Second, SHC leverages the Spark SQL Catalyst engine for high performance query optimizations and processing, e.g., data partitions pruning, columns pruning, predicates pushdown and data locality. SHC has been deployed and used in multiple production environments with hundreds of nodes, and provides OLAP query processing on petabytes of data efficiently.
【Keywords】: Query processing and optimization; distributed computing; in memory computing
【Paper Link】 【Pages】:1477-1488
【Authors】: SungJu Cho ; Roman Averbukh ; Yanwei Zhang ; Andrew Carter ; Jane Alam Jan
【Abstract】: In order to efficiently serve a high volume of queries, LinkedIn's distributed graph service caches computationally-expensive materialized views. Despite this, even basic view maintenance has been a scalability bottleneck. In this paper, we introduce Partial Update, which efficiently applies incremental view maintenance in the absence of deletions in a materialized view. Without expensive transaction handling, Partial Update guarantees eventual consistency across source of truth databases, materialized views and base relations stores. By design, its delta change computation logic only produces lightweight queries, reducing any need to augment the underlying data storage systems. This allows Partial Update to seamlessly integrate on top of traditional eventually consistent systems. In LinkedIn's production systems, Partial Update achieves up to a 50% reduction in both maintenance time and network bandwidth utilization. The improved view maintenance scalability allows us to improve the quality of cached data by a factor of 20. We consider that Partial Update is general enough to be applicable to the majority of select-project-join view maintenance strategies in eventually consistent distributed data stores.
【Keywords】: Graph Database; Eventual Consistency; Materialized View Maintenance; Incremental View Maintenance; LinkedIn
【Paper Link】 【Pages】:1489-1494
【Authors】: Himanshu Gupta ; Sandeep Hans ; Kushagra Aggarwal ; Sameep Mehta ; Bapi Chatterjee ; Praveen Jayachandran
【Abstract】: In this paper, we discuss the problem of efficiently handling temporal queries on Hyperledger Fabric, a popular implementation of Blockchain technology. The temporal nature of the data inserted by the Hyperledger Fabric transactions can be leveraged to support various use-cases. This requires that the temporal queries be processed efficiently on this data. Currently this presents significant challenges as this data is organized on file-system, is exposed to users via a limited API and does not support any temporal indexes. We present two models for overcoming these limitations and improving the performance of temporal queries on Fabric. The first model creates a copy of each event inserted by a Fabric transaction and stores temporally close events together on Fabric. The second model keeps the event count intact but tags some metadata to each event being inserted on Fabric s.t. temporally close events share the same metadata. We discuss these two models in detail and show that these two models significantly outperform the naive ways of handling temporal queries on Fabric. We also discuss the performance trade-offs for these two models across various dimensions - data storage, query performance, data ingestion time etc.
【Keywords】: Hyperledger Fabric; Blockchain; Temporal Queries; Temporal Indexes
【Paper Link】 【Pages】:1495-1506
【Authors】: Zeyi Wen ; Xingyang Liu ; Hongjian Cao ; Bingsheng He
【Abstract】: Audio streaming services have become increasingly popular due to the wide use of smart phones. More and more people are enjoying live audio broadcasting while they are doing various kinds of activities. Meanwhile, the data volume of live audio streams is also ever increasing. Searching and indexing these audio streams is still an important and open problem, with the following challenges: (i) queries on the large number of audio streams need to be answered in real-time; (ii) a live audio stream is inserted into the index continuously to enable live audio streams to appear in query results, and the number of insertions is large which often becomes a performance issue. In this application paper, we propose a multi-modal and unified log structured merge-tree (a.k.a. LSM-tree which consists of multiple inverted indices) based index to support intensive insertions and real-time search on live audio stream applications. Our index natively supports two major types of indexing techniques in audio search: text based indexing and sound based indexing. To address the challenges of live audio indexing and search, we propose an index (called RTSI) which avoids traversing multiple inverted indices to compute the score of an audio stream. In RTSI, we propose various techniques to address the technical challenges. First, for each term we use three inverted lists which contain the sorted score of popularity, freshness and relevance, respectively, such that we can compute the top-k query results efficiently. Second, we devise an upper bound for the unchecked audio streams, such that the query answering process can be terminated earlier. Third, we create mirrors for the indices that need to be merged, such that queries can be answered in real-time when the indices are merging. We conduct extensive experiments on audio streams obtained from Ximalaya. The experimental results show that RTSI can answer a large number of queries in a real-time manner while concurrently handling massive insertions.
【Keywords】: Indexing; Live Audio Stream; Efficiency; Speech Recognition
【Paper Link】 【Pages】:1507-1518
【Authors】: Jeyhun Karimov ; Tilmann Rabl ; Asterios Katsifodimos ; Roman Samarev ; Henri Heiskanen ; Volker Markl
【Abstract】: The need for scalable and efficient stream analysis has led to the development of many open-source streaming data processing systems (SDPSs) with highly diverging capabilities and performance characteristics. While first initiatives try to compare the systems for simple workloads, there is a clear gap of detailed analyses of the systems' performance characteristics. In this paper, we propose a framework for benchmarking distributed stream processing engines. We use our suite to evaluate the performance of three widely used SDPSs in detail, namely Apache Storm, Apache Spark, and Apache Flink. Our evaluation focuses in particular on measuring the throughput and latency of windowed operations, which are the basic type of operations in stream analytics. For this benchmark, we design workloads based on real-life, industrial use-cases inspired by the online gaming industry. The contribution of our work is threefold. First, we give a definition of latency and throughput for stateful operators. Second, we carefully separate the system under test and driver, in order to correctly represent the open world model of typical stream processing deployments and can, therefore, measure system performance under realistic conditions. Third, we build the first benchmarking framework to define and test the sustainable performance of streaming systems. Our detailed evaluation highlights the individual characteristics and use-cases of each system.
【Keywords】: Stream data processing; Stream benchmark; Apache Flink; Apache Spark; Apache Storm
【Paper Link】 【Pages】:1519-1530
【Authors】: Meikel Poess ; Raghunath Nambiar ; Karthik Kulkarni ; Chinmayi Narasimhadevara ; Tilmann Rabl ; Hans-Arno Jacobsen
【Abstract】: By 2020 it is estimated that 20 billion devices will be connected to the Internet. While the initial hype around this Internet of Things (IoT) stems from consumer use cases, the number of devices and data from enterprise use cases is significant in terms of market share. With companies being challenged to choose the right digital infrastructure from different providers, there is an pressing need to objectively measure the hardware, operating system, data storage, and data management systems that can ingest, persist, and process the massive amounts of data arriving from sensors (edge devices). The Transaction Processing Performance Council (TPC) recently released the first industry standard benchmark for measuring the performance of gateway systems, TPCx-IoT. In this paper, we provide a detailed description of TPCx-IoT, mention design decisions behind key elements of this benchmark, and experimentally analyze how TPCx-IoT measures the performance of IoT gateway systems.
【Keywords】: performance measurement; benchmark; performance methodologies; Internet of Things; TPC
【Paper Link】 【Pages】:1531-1542
【Authors】: Spiros Antonatos ; Stefano Braghin ; Naoise Holohan ; Yiannis Gkoufas ; Pol Mac Aonghusa
【Abstract】: Person-specific data offer enormous opportunities for deriving insights that can radically improve different facets of our everyday lives, ranging from the provisioning of personalized medicine and healthcare, to the offering of smart transportation and smart energy. At the same time, the use of person-specific data to support these applications can come at a high cost to individuals' privacy, unless proper de-identification technology is in place to provide rigorous privacy guarantees. In this paper we introduce PRIMA, an end-to-end solution allowing decision makers to map out and execute their data privacy strategy through a comprehensive workflow. Our toolkit offers an intuitive risk-utility exploration framework for end users to navigate through the enormous number of possible combinations of anonymization settings and provide meaningful reports that help them understand the impact of each strategy in terms of utility and risk. Unlike traditional approaches, that rely on limited scale tools and manual analyses, our toolkit is the first scalable, production-grade system that can execute all of its components (such as vulnerability analysis, anonymization, risk and information loss measurements) on arbitrarily large datasets. Furthermore, it offers a flexible library for developers to integrate and extend its functionality to embed de-identification components into their applications.
【Keywords】: privacy; scalability; k-anonymity; privacy risk
【Paper Link】 【Pages】:1543-1548
【Authors】: Daniel Delling ; Dennis Schieferdecker ; Christian Sommer
【Abstract】: We study how to compute routes that avoid traffic in road networks. Imperfections in real-time traffic feeds may yield routes with undesirable detours through parking lots or residential areas. The main challenge we address in this work is that of defining and computing paths that incorporate a volatile secondary cost function. We define the problem, study its complexity, and present algorithms that compute routes without undesirable detours. Experiments on continental-sized road networks demonstrate the feasibility of our approach.
【Keywords】: shortest paths; route planning; road traffic; road networks
【Paper Link】 【Pages】:1549-1552
【Authors】: Aniruddh Gandhi ; Sven Hartmann ; Henning Koehler ; Sebastian Link
【Abstract】: Cardinality constraints and functional dependencies can enforce complex business rules within database systems. As the interaction of these constraints is intricate on SQL data, data engineers and domain experts face the challenge of deciding which of the constraints are meaningful for the underlying application domain. We present a tool that computes data samples that perfectly summarize which of the constraints are currently perceived meaningful. It is demonstrated how the tool facilitates the interaction and understanding of data engineers and domain experts to help them separate meaningful from meaningless cardinality constraints and functional dependencies.
【Keywords】: Armstrong database; Cardinality constraint; Functional dependency; SQL
【Paper Link】 【Pages】:1553-1556
【Authors】: Jose Picado ; Sudhanshu Pathak ; Arash Termehchy ; Alan Fern
【Abstract】: Relational learning algorithms learn the Datalog definition of novel relations in terms of existing relations in the database. In order to effectively use these algorithms, users must constraint the space of candidate definitions by specifying a language bias. Unfortunately, specifying the language bias takes a great deal of time and effort, as it is done via trial and error and is guided by the expert's intuitions. We demonstrate AutoMode, a system that leverages information in the schema and content of the database to automatically induce the language bias used by popular relational learning algorithms.
【Keywords】: relational learning; language bias; parameter tuning
【Paper Link】 【Pages】:1557-1560
【Authors】: Sihem Amer-Yahia ; Behrooz Omidvar-Tehrani ; João Comba ; Viviane Moreira ; Fabian Colque Zegarra
【Abstract】: We demonstrate VEXUS, an interactive visualization framework for exploring user data to fulfill tasks such as finding a set of experts, forming discussion groups and analyzing collective behaviors. User data is characterized by a combination of demographics like age and occupation, and actions such as rating a movie, writing a paper or following a medical treatment. The ubiquity of user data requires tools that help explorers, be they specialists or novice users, acquire new insights. VEXUS lets explorers interact with user data via visual primitives and builds an exploration profile to recommend the next exploration steps. VEXUS combines state-of-the-art visualization techniques with appropriate indexing of user data to provide fast and relevant exploration.
【Keywords】: user group; data exploration; visualization
【Paper Link】 【Pages】:1561-1564
【Authors】: Alexandar Mihaylov ; Parke Godfrey ; Lukasz Golab ; Mehdi Kargar ; Divesh Srivastava ; Jaroslaw Szlichta
【Abstract】: We present the design of and a demonstration plan for FASTOD, a tool for efficiently discovering (approximate) order dependencies (ODs) from data. FASTOD converts ODs to a novel canonical form which makes it orders of magnitude faster than existing techniques. Using real datasets, we demonstrate how the discovered ODs can help users understand data semantics, identify potential data quality problems, and interactively clean the data. We also demonstrate the effectiveness of our system.
【Keywords】: data quality; data profiling; order dependency
【Paper Link】 【Pages】:1565-1568
【Authors】: Xiao Qin ; Tabassum Kakar ; Susmitha Wunnava ; Brian McCarthy ; Andrew Schade ; Huy Quoc Tran ; Brian Zylich ; Elke A. Rundensteiner ; Lane Harrison ; Sanjay K. Sahoo ; Suranjan De
【Abstract】: Adverse drug reactions (ADRs) caused by drug-drug interactions (DDI) are a major cause of morbidity and mortality worldwide. There is a growing need for computing-supported methods that facilitate the automated signaling of DDI related ADRs (DIARs) that otherwise would remain undiscovered in millions of ADR reports. In this demonstration, we showcase our MeDIAR technology - an end-to-end DIAR signal generation, exploration and validation solution for pharmaceutical regulatory agencies to detect true DIAR signals from a drug surveillance database. MeDIAR's innovations include an efficient rule-driven learning algorithm for deriving DIAR signals from ADR reports, an innovative scoring methodology based on the proposed contextual association cluster model to rank the generated signals by their importance. Further, these ranked signals are augmented with meta information such as their significance level and their severity, along with links to their supporting ADR reports. Lastly, MeDIAR features an interactive visual analytics interface to support drug safety evaluators in reviewing and discovering unknown severe DIARs.
【Keywords】: Pharmacovigilance; Association Rule Mining; Database; Visualization
【Paper Link】 【Pages】:1569-1572
【Authors】: Ju Fan ; Jiarong Qiu ; Yuchen Li ; Qingfei Meng ; Dongxiang Zhang ; Guoliang Li ; Kian-Lee Tan ; Xiaoyong Du
【Abstract】: The wide adoption of social networks has brought a new demand on influence analysis. This paper presents OCTOPUS that offers social network users and analysts valuable insights through topic-aware social influence analysis services. OCTOPUS has the following novel features. First, OCTOPUS provides a user-friendly interface that allows users to employ simple and easy-to-use keywords to perform influence analysis. Second, OCTOPUS provides three powerful keyword-based topic-aware influence analysis tools: keyword-based influential user discovery, personalized influential keywords suggestion, and interactive influential paths exploration. These tools can not only discover influential users, but also provide insights on how the users influence the network. Third, OCTOPUS enables online influence analysis, which provides end-users with instant results. We have implemented and deployed OCTOPUS, and demonstrate its usability and efficiency on two social networks.
【Keywords】: Influence Analysis; Social Networks; Topic Aware
【Paper Link】 【Pages】:1573-1576
【Authors】: Ji Lucas ; Yasser Idris ; Bertty Contreras-Rojas ; Jorge-Arnulfo Quiané-Ruiz ; Sanjay Chawla
【Abstract】: Many of today's applications need several data processing platforms for complex analytics. Thus, recent systems have taken steps towards supporting cross-platform data analytics. Yet, current cross-platform systems lack of ease-of-use, which is crucial for their adoption. This demo presents RheemStudio, a visual IDE on top of Rheem. It allows users to easily specify their cross-platform data analytic tasks. In this demo, we will demonstrate five main features of RheemStudio: drag-and-drop, declarative, interactive, and customized specification of data analytic tasks as well as easy monitoring of tasks. With this in mind, we will consider two real use cases, one from the machine learning world and the second one based on data discovery. During all the demo, the audience will be able to take part and create their own data analytic tasks too.
【Keywords】: Cross platform data processing; query processing; data analytics
【Paper Link】 【Pages】:1577-1580
【Authors】: Yixiang Fang ; Reynold Cheng ; Jikun Wang ; Lukito Budiman ; Gao Cong ; Nikos Mamoulis
【Abstract】: Spatial objects associated with keywords are prevalent in applications such as Google Maps and Twitter. Recently, the topic of spatial keyword queries has received plenty of attention. Spatial Group Keyword (SGK) search is a popular class of queries; their goal is to find a set of objects which are close to each other and are associated to a set of input keywords. In this paper, we propose SpaceKey, a system for retrieving and visualizing spatial objects returned by SGK queries. In addition to existing SGK query types, SpaceKey supports a novel query, called SPM query. An SPM query is defined by a spatial pattern, a graph whose vertices contain keywords and its edges are associated with distance constraints. The results are sets of objects that match the pattern. SpaceKey allows users to perform comparison analysis between different SGK query types. We plan to make SpaceKey an open-source web-based platform, and design API functions for software developers to plug other SGK query algorithms into our system.
【Keywords】: spatial; pattern; spatial pattern; spatial keyword; spatial query
【Paper Link】 【Pages】:1581-1584
【Authors】: Panagiotis Tampakis ; Nikos Pelekis ; Natalia V. Andrienko ; Gennady L. Andrienko ; Georg Fuchs ; Yannis Theodoridis
【Abstract】: In this paper, we present an efficient in-DBMS framework for progressive time-aware sub-trajectory cluster analysis. In particular, we address two variants of the problem: (a) spatiotemporal sub-trajectory clustering and (b) index-based time-aware clustering at querying environment. Our approach for (a) relies on a two-phase process: a voting-and-segmentation phase followed by a sampling-and-clustering phase. Regarding (b), we organize data into partitions that correspond to groups of sub-trajectories, which are incrementally maintained in a hierarchical structure. Both approaches have been implemented in Hermes@PostgreSQL, a real Moving Object Database engine built on top of PostgreSQL, enabling users to perform progressive cluster analysis via simple SQL. The framework is also extended with a Visual Analytics (VA) tool to facilitate real world analysis.
【Keywords】: moving object databases; trajectory clustering; trajectory visualization
【Paper Link】 【Pages】:1585-1588
【Authors】: Yuli Jiang ; Xin Huang ; Hong Cheng ; Jeffrey Xu Yu
【Abstract】: Given a query vertex in a graph, the task of community search is to find all meaningful communities containing the query vertex in an online manner. In this demonstration, we propose a novel query processing system for searching and visualizing communities in graphs, called VizCS. It exhibits three key innovative features. First, VizCS adopts several community models and supports community search on dynamic graphs where nodes/edges undergo frequently insertions/deletions. Second, VizCS offers a user-friendly visual interface to formulate queries and a real-time response query processing engine. Last but not least, VizCS generates a community exploration wall by offering interactive community visualization, which facilitates users to in-depth understanding of the data. Furthermore, VizCS becomes a community search platform that can visualize and compare different community results by various state-of-the-art algorithms and user-uploaded approaches.
【Keywords】: community search; k-truss; online search
【Paper Link】 【Pages】:1589-1592
【Authors】: Romila Pradhan ; Siarhei Bykau ; Sunil Prabhakar
【Abstract】: Data fusion addresses the problem of consolidating data from disparate information providers into a single unified interface. The different data sources often provide conflicting information for the same data item. Recently, several automated data fusion models have been proposed to resolve conflicts and identify correct data. Although quite effective, these data fusion models do not achieve a close-to-perfect accuracy. We present the demonstration of a system that leverages users as first-class citizens to confirm data conflicts and rapidly improve the effectiveness of fusion. This demonstration is built on solutions proposed in our previous work [1]. To utilize the user judiciously, our system presents claims in an order that is the most beneficial to effectiveness of fusion across data items. We describe ranking algorithms that are built on concepts from information theory and decision theory, and do not need access to ground truth. We describe the user input framework and demonstrate how conflict resolution can be expedited with minimal feedback from the user. We show that: (a) the framework can be easily adopted to existing data fusion models without any internal changes to the models, and (b) the framework can integrate both perfect and imperfect feedback from users.
【Keywords】: data integration; data fusion; truth discovery; user feedback; active learning
【Paper Link】 【Pages】:1593-1596
【Authors】: Essam Mansour ; Dong Deng ; Raul Castro Fernandez ; Abdulhakim Ali Qahtan ; Wenbo Tao ; Ziawasch Abedjan ; Ahmed K. Elmagarmid ; Ihab F. Ilyas ; Samuel Madden ; Mourad Ouzzani ; Michael Stonebraker ; Nan Tang
【Abstract】: In order for an enterprise to gain insight into its internal business and the changing outside environment, it is essential to provide the relevant data for in-depth analysis. Enterprise data is usually scattered across departments and geographic regions and is often inconsistent. Data scientists spend the majority of their time finding, preparing, integrating, and cleaning relevant data sets. Data Civilizer is an end-to-end data preparation system. In this paper, we present the complete system, focusing on our new workflow engine, a superior system for entity matching and consolidation, and new cleaning tools. Our workflow engine allows data scientists to author, execute and retrofit data preparation pipelines of different data discovery and cleaning services. Our end-to-end demo scenario is based on data from the MIT data warehouse and e-commerce data sets.
【Keywords】: Data Cleaning; Data Integration; Data Discovery
【Paper Link】 【Pages】:1597-1600
【Authors】: Shuang Hao ; Yi Xu ; Nan Tang ; Guoliang Li ; Jianhua Feng
【Abstract】: Entity categorization - the process of grouping entities into categories for some specific purpose - is an important problem with a great many applications, such as Google Scholar and Amazon products. Unfortunately, many real-world categories contain mis-categorized entities, such as publications in one's Google Scholar page that are published by the others. We have proposed a general framework for a new research problem - discovering mis-categorized entities. In this demonstration, we have developed a Google Chrome extension, namely GSCleaner, as one important application of our studied problem. The attendees will have the opportunity to experience the following features: (1) mis-categorized entity discovery - The attendee can check mis-categorized entities on anyone's Google Scholar page; and (2) Cleaning onsite - Any attendee can login and clean his Google Scholar page using GSCleaner.We describe our novel rule-based framework to discover mis-categorized entities. We also propose effective optimization techniques to apply the rules. Some empirical results show the effectiveness of GSCleaner on discovering mis-categorized entities.
【Keywords】: mis categorized entity; Google Scholar cleaner; rule based framework; signature
【Paper Link】 【Pages】:1601-1604
【Authors】: Sijie Ruan ; Ruiyuan Li ; Jie Bao ; Tianfu He ; Yu Zheng
【Abstract】: Trajectory data preprocessing is to convert raw GPS logs into organized trajectories, which is a common, necessary but tedious task in many urban applications. This paper proposes CloudTP, a cloud-based flexible trajectory data preprocessing framework, to provide an efficient online service, easing the burdens of urban application builders. The proposed system is designed and implemented based on the cloud storage and parallel computing framework (i.e. Spark). Its features consist of 1) noise filtering, 2) trajectory segmentation, 3) map matching, and 4) index building. CloudTP is useful for both normal users and advanced users. By simply uploading trajectory datasets and setting corresponding parameters, normal users can get organized trajectories, statistics and visualizations on the cloud, while advanced users can also customize their own algorithms in any preprocessing module. Finally, usage scenarios are demonstrated to show the capability and flexibility of CloudTP.
【Keywords】: trajectory processing; cloud computing; map matching; spatio temporal index
【Paper Link】 【Pages】:1605-1608
【Authors】: Uta Störl ; Daniel Müller ; Alexander Tekleab ; Stephane Tolale ; Julian Stenzel ; Meike Klettke ; Stefanie Scherzinger
【Abstract】: Building applications for processing data lakes is a software engineering challenge. We present Darwin, a middleware for applications that operate on variational data. This concerns data with heterogeneous structure, usually stored within a schema-flexible NoSQL database. Darwin assists application developers in essential data and schema curation tasks: Upon request, Darwin extracts a schema description, discovers the history of schema versions, and proposes mappings between these versions. Users of Darwin may interactively choose which mappings are most realistic. Darwin is further capable of rewriting queries at runtime, to ensure that queries also comply with legacy data. Alternatively, Darwin can migrate legacy data to reduce the structural heterogeneity. Using Darwin, developers may thus evolve their data in sync with their code. In our hands-on demo, we curate synthetic as well as real-life datasets.
【Keywords】: NoSQL databases; schema evolution; schema management; data migration; query rewriting; variational data
【Paper Link】 【Pages】:1609-1612
【Authors】: Abdulhakim Ali Qahtan ; Ahmed K. Elmagarmid ; Mourad Ouzzani ; Nan Tang
【Abstract】: It is well established that missing values, if not dealt with properly, may lead to poor data analytics models, misleading conclusions, and limitation in the generalization of findings. A key challenge in detecting these missing values is when they manifest themselves in a form that is otherwise valid, making it hard to distinguish them from other legitimate values. We propose to demonstrate FAHES, a system for detecting different types of disguised missing values (DMVs) which often occur in real world data. FAHES consists of several components, namely a profiler to generate rules for detecting repeated patterns, an outlier detection module, and a module to detect values that are used repeatedly in random records. Using several real world datasets, we will demonstrate how FAHES can easily catch DMVs.
【Keywords】: disguised missing values
【Paper Link】 【Pages】:1613-1616
【Authors】: Kun Qian ; Nikita Bhutani ; Yunyao Li ; H. V. Jagadish ; Mauricio A. Hernández
【Abstract】: Many data analysis and data integration applications need to account for multiple representations of entities. The variations in entity mentions arise in complex ways that are hard to capture using a textual similarity function. More sophisticated functions require the knowledge of underlying structure in the representation of entities. People traditionally identify these structures manually and write programs to manipulate them: such work is tedious and cumbersome. We have built LUSTRE, an active learning based system that can learn the structured representations of entities interactively from a few labels. In the background, it automatically generates programs to map entity mentions to their representations and to standardize them to a unique representation. Furthermore, LUSTRE provides a user-friendly interface to allow user declaratively specify normalization and variant generation functions for downstream applications.
【Keywords】: entity standardization; active learning
【Paper Link】 【Pages】:1617-1620
【Authors】: Sona Hasani ; Ning Yan ; Chengkai Li
【Abstract】: This paper introduces TableView, a tool that helps generate preview tables of massive heterogeneous entity graphs. Given the large number of datasets from many different sources, it is challenging to select entity graphs for a particular need. TableView produces preview tables to offer a compact presentation of important entity types and relationships in an entity graph. Users can quickly browse the preview in a limited space instead of spending precious time or resources to fetch and explore the complete dataset that turns out to be inappropriate for their needs. In this paper we present: 1) a detailed description of TableView's graphical user interface and its various features, 2) a brief description of TableView's preview scoring measures and algorithms, and 3) the demonstration scenarios we intend to present to the audience.
【Keywords】: entity graph; knowledge graph; usability; summarization
【Paper Link】 【Pages】:1621-1624
【Authors】: Pooja Agrawal ; Bikash Chandra ; K. Venkatesh Emani ; Neha Garg ; S. Sudarshan
【Abstract】: Unit test cases have become an essential tool to test application code. Several applications make use of SQL queries in order to retrieve or update information from a database. Database queries for these applications are written natively in SQL using JDBC or using ORM frameworks like Hibernate. Unit testing these applications is typically done by loading a fixed dataset and running unit tests. However with fixed datasets, errors in queries may be missed. In this demonstration, we present a system that takes as input a database application program, and generates datasets and unit tests using the datasets to test the correctness of function with queries in the application. Our techniques are based on static program analysis and mutation testing. We consider database applications written in Java using JDBC or Hibernate APIs. The front-end of our system is a plugin to the IntelliJ IDEA IDE. We believe that such a system would be of great value to application developers and testers.
【Keywords】: Database application testing; Program analysis; Test data generation
【Paper Link】 【Pages】:1625-1628
【Authors】: Jianzhong Qi ; Vivek Kumar ; Rui Zhang ; Egemen Tanin ; Goce Trajcevski ; Peter Scheuermann
【Abstract】: We study the problem of continuous maintenance of range sum heat maps over dynamically updating data objects. The range sum (RS) here refers to the sum of the weights of the data objects enclosed by a given range (rectangle) R. Range sum problems are useful in spatio-temporal data analytics and decision making processes. Recent studies on range sum problems focus on computing the MaxRS query, which finds a location to place a rectangle R such that its RS is maximized. In real applications, knowing only the location with the maximum RS may be insufficient, because decision making is a multi-factor process where maximizing the RS may just be one of the factors. It is also important to gain an overview of the RS distribution at different locations, so that decisions can be made based on global knowledge. We therefore propose to compute a range-sum heat map that visualizes the RS value for every location in a data space. Considering that data objects may be inserted into or removed from the data space dynamically, we further study the continuous maintenance of range-sum heat maps over dynamically updating data objects. We adapt algorithms to compute range-sum heat maps and to perform heat map updates. We build a demo system to showcase the usefulness of range sum heat maps and the effectiveness of the adapted algorithms.
【Keywords】: Range Sum; Heat Map; Continuous Heat Map Maintenance
【Paper Link】 【Pages】:1629-1632
【Authors】: Xiang Yu ; Guoliang Li ; Yudian Zheng ; Yan Huang ; Songfan Zhang ; Fei Chen
【Abstract】: Crowdsourcing is widely accepted as a means for resolving tasks that are hard for computers, e.g., entity resolution. Unfortunately, Crowdsourcing may yield relatively low-quality results if there is no proper quality control. Although previous studies attempt to eliminate workers by estimating workers' qualities via qualification tests or hidden tests, the qualities estimated may not be accurate, because workers may have diverse qualities across tasks. Thus, the quality of the results could be further improved by wisely assigning tasks to the workers who are specialized in the tasks and online task assignment is an effective way to achieve this goal. However, existing crowdsourcing platforms either do not support online task assignment (e.g., CrowdFlower) or are not user-friendly because they require to write complicated codes (e.g., Amazon MTurk). To address these limitations, we develop an online task assignment system, which can on-the-fly assign workers with appropriate tasks. We have deployed our system on top of MTurk. We demonstrate the following scenarios using our system. Firstly, requesters can easily utilize our system to enable online task assignment in order to improve answer quality. Moreover, requesters do not need to write any code. Secondly, our system can infer the quality of workers, and requesters can design and test their own task assignment algorithms using our proposed information. Thirdly, our system can monitor tasks and workers in real time, and the requesters can eliminate bad workers or terminate the crowdsourcing process to reduce the unnecessary cost.
【Keywords】: Crowdsourcing; Online; task assignment
【Paper Link】 【Pages】:1633-1636
【Authors】: Rikuya Suzuki ; Tetsuo Sakaguchi ; Masaki Matsubara ; Hiroyuki Kitagawa ; Atsuyuki Morishima
【Abstract】: We demonstrate CrowdSheet, a spreadsheet interface for implementing complex crowdsourcing. Despite its appeal, adoption of the spreadsheet paradigm is associated with two nontrivial problems: (1) how to design the interface, which must be a natural extension of existing spreadsheets, while guaranteeing a reasonable expressive power, and (2) how to incorporate techniques for improving data quality without sacrificing the spreadsheet's easy-to-use feature. In this demo, we show three things. First, CrowdSheet allows non IT experts to easily implement crowdsourcing applications with complex workflows. Second, CrowdSheet adopts only new two spreadsheet functions to implement a fairly wide range of real-world applications. Third, its modular architecture gives CrowdSheet a declarative feature that lets users choose alternative plans for improving data quality, while keeping the CrowdSheet description simple.
【Keywords】: rapid development; declarative crowdsourcing; end user computing
【Paper Link】 【Pages】:1637-1640
【Authors】: Nikos Zacheilas ; Stathis Maroulis ; Thanasis Priovolos ; Vana Kalogeraki ; Dimitrios Gunopulos
【Abstract】: In this demonstration we present Dione a novel framework for automatic profiling and tuning big data applications. Our system allows a non-expert user to submit Spark or Flink applications to his/her cluster and Dione automatically determines the impact of different configuration parameters on the application's execution time and monetary cost. Dione is the first framework that exploits similarities in the execution plans of different applications to narrow down the amount of profiling runs that are required for building prediction models that capture the impact of the configuration parameters on the metrics of interest. Dione exploits these prediction models to tune the configuration parameters in a way that minimizes the application's execution time or the user's budget. Finally, Dione's Web-UI visualizes the impact of the configuration parameters on the execution time and the monetary cost, and enables the user to submit the application with the recommended parameters' values.
【Keywords】: distributed systems; profiling; Big Data
【Paper Link】 【Pages】:1641-1644
【Authors】: Athanasios Naskos ; Anastasios Gounaris ; Ioannis Konstantinou
【Abstract】: We present the Elton tool, a publicly available cloud resource elasticity management system tailored to NoSQL databases. Elton is integrated in the Ganetimgr web platform, and offers an easy to use web interface, through which monitoring and horizontal scaling of NoSQL databases can be performed and what-if analysis queries are enabled. Elton uses Markov Decision Processes (MDPs) as the underlying modeling framework, and encapsulates state-of-the-art horizontal scaling policies that offer different trade-offs between performance and monetary deployment cost. Its main novelty is that it employs probabilistic model checking to allow for both efficient elasticity decisions and analysis of scaling actions and serves as a case study about the benefits of model checking in online decision making and analysis.
【Keywords】: elasticity; cloud; NoSQL; model checking; horizontal scaling
【Paper Link】 【Pages】:1645-1648
【Authors】: Qing Liu ; Zijin Feng ; Xike Xie ; Jianliang Xu ; Xin Lin ; Christian S. Jensen
【Abstract】: Owing to the widespread use of location-aware devices and the increased popularity of micro-blogging applications, we are witnessing a rapid proliferation of geo-textual data. In this demonstration, we present iZone, an efficient system for determining influence zones over geo-textual data. Specifically, iZone allows users to browse geo-textual objects, evaluate the influence zones of specified geo-textual objects, and obtain explanations of the evaluation results. The iZone system adopts a browser-server model. The server side integrates two types of spatial keyword search, namely top-k spatial keyword query and reverse top-k keyword-based location query, to support the functionality of the system. A variety of spatial indexes are employed to enhance the efficiency of the system. The browser side provides a map-based GUI interface, which enables convenient and user-friendly interaction with the system. Using a real hotel dataset from Hong Kong, iZone offers hands-on experience with influence zone evaluation in real-life applications.
【Keywords】: Influence zone; Geo textual data; Demonstration
【Paper Link】 【Pages】:1649-1652
【Authors】: Siyuan Han ; Zihuan Xu ; Lei Chen
【Abstract】: With the success of Bitcoin, the technique behind it, Blockchain, is catching massive attention recently. Blockchain is a collection of several techniques like cryptology, P2P and distributed consensus protocol. The main idea of Blockchain is that nodes in the network keep the same distributed ledger. Because of this immutable ledger, a trusted bridge is built among parties without fully trust. Blockchain can be used in variety of areas, especially in financial fields, like supply chain management, cross-border payment and global bank settlement. Meanwhile, we can observe that mobile network is growing rapidly and nibbling the PC market. However, the current public Blockchain applications like Bitcoin or Ethereum require the nodes to store the whole ledger which exceeds the capacity of the mobile devices. Thus, we need to develop a blockchain platform to support mobile devices. In this demo, we introduce Jupiter, a mobile-based Blockchain platform which provides a novel concept called consensus unit (CU) to alleviate the storage problem of mobile. We present the system architecture and demonstrate several CU scenarios via Jupiter.
【Keywords】: Blockchain
【Paper Link】 【Pages】:1653-1656
【Authors】: Yeshwanth Durairaj Gunasekaran ; Abolfazl Asudeh ; Sona Hasani ; Nan Zhang ; Ali Jaoua ; Gautam Das
【Abstract】: The ranked retrieval model has rapidly become the de-facto way for search query processing in web databases. Despite the extensive efforts on designing better ranking mechanisms, in practice, many such databases fail to address the diverse and sometimes contradicting preferences of users. In this paper, we present QR2, a third-party service that uses nothing but the public search interface of a web database and enables the on-the-fly processing of queries with any user-specified ranking functions, no matter if the ranking function is supported by the database or not.
【Keywords】: Query Re ranking; Top K; Web Databases; Third party Service
【Paper Link】 【Pages】:1657-1660
【Authors】: Haoqiong Bian ; Youxian Tao ; Guodong Jin ; Yueguo Chen ; Xiongpai Qin ; Xiaoyong Du
【Abstract】: Popular column stores such as ORC and Parquet have been widely used in many Hadoop-oriented data analysis systems. With the effective column skipping and data compression functionalities provided by column stores, wide tables with hundreds or even thousands of columns are applied by many big data analysis applications to avoid the expensive distributed joins. We found that the performance of such systems can be further improved by optimizing the physical data layout to fit certain workloads and system settings. However, it is nontrivial to perform such optimization manually. In this demo, we present a data layout optimization tool called Rainbow, which leverages workload-driven layout optimization algorithms to adjust data layouts adaptively without intervening the previous data blocks that have been stored. We also provide a Web UI for users to interact with the layout optimization process. Furthermore, Rainbow is open sourced with an accompanying benchmark for performance evaluation of wide tables.
【Keywords】: HDFS; wide table; data layout; workload driven; adaptive database; data placement
【Paper Link】 【Pages】:1661
【Authors】: David B. Lomet
【Abstract】: A caching data store, e.g., a traditional dbms, moves data between main memory and secondary storage as dictated by access patterns. Such a system provides good cost/performance by hosting hot data in expensive DRAM, while moving cold data to cheaper SSD storage. Data caching system success is demonstrated by the health of the traditional database systems market, and its growth with new vendors offering data caching systems. However, achieving both high performance and low cost using any database system, including data caching systems, has been a challenge. We describe how this problem has been attacked, including in our Deuteronomy project.
【Keywords】: database system; performance; cost; caching; log structured; latch free
【Paper Link】 【Pages】:1662
【Authors】: Uta Störl ; Alexander Tekleab ; Meike Klettke ; Stefanie Scherzinger
【Abstract】: Schema-flexible NoSQL data stores lend themselves nicely for storing versioned data, a product of schema evolution. In this lightning talk, we apply pending schema changes to records that have been persisted several schema versions back. We present first experiments with MongoDB and Cassandra, where we explore the trade-off between applying chains of pending changes stepwise (one after the other), and as composite operations. Contrary to intuition, composite migration is not necessarily faster. The culprit is the computational overhead for deriving the compositions. However, caching composition formulae achieves a speed up: For Cassandra, we can cut the runtime by nearly 80%. Surprisingly, the relative speedup seems to be system-dependent. Our take away message is that in applying pending schema changes in NoSQL data stores, we need to base our design decisions on experimental evidence rather than on intuition alone.
【Keywords】: NoSQL databases; schema evolution; data migration; composite migration
【Paper Link】 【Pages】:1663
【Authors】: Wolfgang Lehner ; Dirk Habich ; Till Kolditz
【Abstract】: The key objective of database systems is to reliably manage data, whereby high query throughput and low query latency are core requirements. To satisfy these requirements, database systems constantly adapt to novel hardware features. Although it has been intensively studied and commonly accepted that hardware error rates in terms of bit flips increase dramatically with the decrease of the underlying chip structures, most database system research activities neglected this fact, leaving error (bit flip) detection as well as correction to the underlying hardware. Especially for main memory, silent data corruption (SDC) as a result of transient bit flips leading to faulty data is mainly detected and corrected at the DRAM and memory-controller layer. However, since future hardware becomes less reliable and error detection as well as correction by hardware becomes more expensive, this free ride will come to an end in the near future. To further provide a reliable data management, an emerging research direction is employing specific and tailored protection techniques at the database system level. Following that, we are currently developing and implementing an adopted system design for state-of-the-art in-memory column stores. In our lightning talk, we will summarize our current state and outline future work.
【Keywords】: Reliability; Error Detection; Database Systems; Query Processing
【Paper Link】 【Pages】:1664-1665
【Authors】: Bruce R. Childers ; Panos K. Chrysanthis
【Abstract】: Data Management (DM), like many areas of computer science (CS), relies on empirical evaluation that uses software, data sets and benchmarks to evaluate new ideas and compare with past innovation. Despite the importance of these artifacts and associated information about experimental evaluations, few researchers make these available in a findable, accessible, interoperable and reusable (FAIR) manner, in this way hindering the scientific process by limiting open collaboration, credibility of published outcomes, and research progress. Fortunately, this problem is recognized and many CS communities, including the DM one, are advocating and providing incentives for software and analysis papers to follow FAIR principles and be treated equally to traditional publications. Some ACM/IEEE conferences adopted Artifact Evaluation (AE) to reward authors for doing a great job in conducting experiments with FAIR software and data. After half a decade since AE's inception, the question is whether the emerging emphasis on artifacts, is having a real impact in CS research.
【Keywords】: repeatability; artifact evaluation
【Paper Link】 【Pages】:1666-1667
【Authors】: Peter Triantafillou
【Abstract】: This paper outlines my vision for the next generation data analytics processing systems, revolving around the concept of data-less data analytics.
【Keywords】: data systems; analytics; big data
【Paper Link】 【Pages】:1668-1669
【Authors】: Ben McCamish ; Arash Termehchy ; Behrouz Touri
【Abstract】: Most users do not know the structure and content of databases or concepts such as schema or formal query languages sufficiently well to express their information needs precisely in form of queries. They may convey their intents via usable but inherently ambiguous query interfaces, such as keyword or natural language queries, which are open to numerous interpretations. Thus, it is very challenging for a database management system (DBMS) to understand and satisfy the intents behind these queries. A useful source of information to decode the intent behind a query is the user feedback on the returned results, which may come in various ways, such as click-through information. To use this information, DBMS has to establish a trade-off between showing strongly reinforced answers, i.e., exploitation, and exploring never-before-shown or rarely shown answers to solicit user feedback on them, i.e. exploration. A DBMS that returns only positively reinforced results may always return a small subset of all relevant answers to a query. On the other hand, if a DBMS shows too many unseen answers to the user, it may discourage and disengage users as many such answers may not be relevant to the query. Moreover, users may learn more about the structure and content of the database and how to express intents as they submit queries and observe the returned results. Thus, they may modify the way they express intents over the course of their interactions with the DBMS.
【Keywords】: Game theory; information retrieval; reinforcement learning; database interaction
【Paper Link】 【Pages】:1670-1671
【Authors】: Eyal Rozenberg
【Abstract】: We argue for a richer view of the space of lightweight compression schemes for columnar DBMSes: We demonstrate how even simple simple schemes used in DBMSes decompose into constituent schemes through a columnar perspective on their decompression. With our concrete examples, we touch briefly on what follows from these and other decompositions: Composition of alternative compression schemes as well as other practical and analytical implications.
【Keywords】: dbms; compression; lightweight compression; compression scheme decomposition; compression scheme; rle; for; delta; pfor; patching; run length encoding; frame of reference; patched frame of reference; analytic dbms; column store; columar; columnar compression; function decomposition; column space; metric spaces; decompression; columnar processing; modeling; step functions; low degree polynomials; polynomial models; piecewise polynomials; l_0 norm; l_infinity norm; maximum norm; gather; scatter; prefix sum; run position encoding; rpe
【Paper Link】 【Pages】:1672
【Authors】: Ying Zhang ; Richard Koopmanschap ; Martin L. Kersten
【Abstract】: This talk first shows how an in-database machine learning system has been realised by a seamless integration of MonetDB (an open-source analytical columnar DBMS) and TensorFlow (an open-source machine learning library). Then we show with an example application of entity linking using neural embeddings the potential of this integration.
【Keywords】: monetdb; tensorflow; in database machine learning; in database analytics; in database entity linking
【Paper Link】 【Pages】:1673-1674
【Authors】: Mohamed Sarwat ; Venkata Vamsikrishna Meduri
【Abstract】: There has been an increasing interest into blurring the line between human-interaction and database systems. Several research papers tackled the Human-Database Interaction (HDI) challenge, yet none of them provides a holistic HDI approach. In this talk, we briefly describe dbTinder, a database engine that bridges the gap between the conversation approach humans use to interact and the Query → Answer approach used in classic database systems. To achieve that, dbTinder turns HDI into an intent discovery process where the system pro-actively converses with the user to guide her towards potentially interesting tuples in the database.
【Keywords】: Human database interaction; User centric databases; Data exploration
【Paper Link】 【Pages】:1675-1676
【Authors】: Vahid Ghadakchi ; Arash Termehchy
【Abstract】: N/A
【Keywords】: database; imprecise queries; effective query answering
【Paper Link】 【Pages】:1677-1678
【Authors】: Kostas Zoumpatianos ; Themis Palpanas
【Abstract】: Massive data sequence collections exist in virtually every scientific and social domain, and have to be analyzed to extract useful knowledge. However, no existing data management solution (such as relational databases, column stores, array databases, and time series management systems) can offer native support for sequences and the corresponding operators necessary for complex analytics. We argue for the need to study the theory and foundations for sequence management of big data sequences, and to build corresponding systems that will enable scalable management and analysis of very large sequence collections. To this effect, we need to develop novel techniques to efficiently support a wide range of sequence queries and mining operations, while leveraging modern hardware. The overall goal is to allow analysts across domains to tap in the goldmine of the massive and ever-growing sequence collections they (already) have.
【Keywords】: Time Series; Data Series; Data Series Management Systems; Sequence Management Systems
【Paper Link】 【Pages】:1679-1680
【Authors】: Raghunath Nambiar
【Abstract】: Over the past three decades, the Transaction Processing Performance Council (TPC) has developed many standards for performance benchmarking. These standards have been a significant driving force behind the development of faster, less expensive, and more energy efficient systems. Historically, we have seen benchmark standards for transaction processing, decision support systems and virtualization. In the recent years the TPC has developed benchmark standards for emerging areas such as big data analytics (BDA), the Internet of Things (IoT), database virtualization and Hyper-Convergence Infrastructure (HCI). This short paper discusses the TPC's plans for creating benchmark standards for Artificial Intelligence (AI) systems.
【Keywords】: artificial intelligence; machine learning; performance benchmarks
【Paper Link】 【Pages】:1681
【Authors】: Graham Cormode ; Chris Hickey
【Abstract】: Much of computer science involves problems where it is considered to be easier to check that an answer is correct than to find a correct answer (the complexity class NP). In this talk, we outline results that apply this notion to checking outsourced computations for data analytics.
【Keywords】: Interactive proofs
【Paper Link】 【Pages】:1682-1683
【Authors】: Thomas Neumann
【Abstract】: NULL values offer an explicit way to model missing, unknown or invalid values in a database. While having such an explicitly value is convenient, the resulting semantic is unfortunately often surprising. Furthermore, there seem to be cases where the desired semantic is unclear, and different systems produce different results. This paper highlights some of the problematic cases and argues for a literal interpretation of the SQL UNKNOWN.
【Keywords】: query optimization; NULL
【Paper Link】 【Pages】:1684-1688
【Authors】: Tim Gubner
【Abstract】: Modern hardware tends to become increasingly heterogeneous which leads to major challenges for existing systems: (a) Given that performance on modern hardware should be maximized, hardware features should be fully exploited. Together with the increasing heterogeneity this leads to more complex systems. (b) In general it is non-trivial for a system to determine the most efficient way to execute a program on a specific piece of hardware. Based on (a) and (b) we believe it is necessary to extend code generation in data processing systems. First, to mitigate the increasing complexity we propose the usage of domain-specific languages (DSL) that abstract specific details away. Our DSL is based on data-parallel operations and control-flow statements which allows to easily exploit SIMD (on multiple architectures: CPU, GPU etc.). We briefly sketch our idea of such a DSL. Second, we propose a virtual machine executing this DSL. We plan to exploit different implementation flavors (adaptivity) and dynamically compile & optimize hot paths in the program (JIT-compilation). We sketch ideas about which paths to compile, how to achieve adaptive execution of different JIT-compiled paths and elaborate a bit more on our ideas on workload-specific optimizations. Finally, we present our plan for future research and ideas for major publications.
【Keywords】: heterogeneous hardware; domain specific language; SIMD; data parallelism; virtual machine; just in time compilation; adaptivity; dark silicon
【Paper Link】 【Pages】:1689-1693
【Authors】: Alvis Logins
【Abstract】: A node selection query returns a set of network nodes that optimize some objective function. The problem of enabling fast and accurate node selection is of high importance in fields such as logistics, service planning, and advertising. Due to the computational complexity, it is often impossible to provide an optimal solution to particular node selection queries. We study new approximation methods for node selection in million-node networks, such as social and road networks. We extend existing models to a more realistic scenarios by introducing time and uncertainty into the problem domain, and we apply the proposed solutions to real-world datasets.
【Keywords】: Optimization; Graph; Node Selection
【Paper Link】 【Pages】:1694-1698
【Authors】: Tunaggina Subrina Khan
【Abstract】: Zoning Improvement Plan (ZIP) Code polygon maps, obtained from different data sources, do not match, thus creating uncertainty in spatial analysis. In this dissertation, we want to combine multiple source of ZIP-code polygon data into a coherent system that maps a given location to a ZIP-code. For this purpose, we want to harness the wisdom of the crowd by combining various polygon map data sets with various sources of volunteered geographical information. The system that we want to build will employ traditional classification methods to map a given spatial coordinate to a distribution of ZIP-codes using the (not publicly available) United States Postal Service (USPS) map as an authoritative ground truth. In our first studies, we train a Naïve Bayes classifier using multiple (publicly available) ZIP Code polygon maps. In addition, we enrich our classification by using a lazy K-Nearest Neighbor classifier to predict the ZIP Codes for a given location using Twitter. Feeding the result of this classifier to the Naïve Bayes classification our experimental evaluation shows an improvement in classification accuracy, compared to using only map data.
【Keywords】: ZIP Code; Classification; Wisdom of Crowd; Spatial analysis
【Paper Link】 【Pages】:1699-1703
【Authors】: Bo Zhao
【Abstract】: Complex event processing (CEP) systems evaluate queries over event streams for low-latency detection of user-specified event patterns. They need to process streams of growing volume and velocity, while the heterogeneity of event sources yields unpredictable input rates. Against this background, models and algorithms for the optimisation of CEP systems have been proposed in the literature. However, when input rates grow by orders of magnitude during short peak times, exhaustive real-time processing of event streams becomes infeasible. CEP systems shall therefore resort to best-effort query evaluation, which maximises the accuracy of pattern detection while staying within a predefined latency bound. For traditional data stream processing, this is achieved by load shedding that drops some input data without processing it, guided by the estimated importance of particular data entities for the processing accuracy. In this work, we argue that such input-based load shedding is not suited for CEP queries in all situations. Unlike for relational stream processing, where the impact of shedding is assessed based on the operator selectivity, the importance of an event for a CEP query is highly dynamic and largely depends on the state of query processing. Depending on the presence of particular partial matches, the impact of dropping a single event can vary drastically. Hence, this PhD project is devoted to state-based load shedding that, instead of dropping input events, discards partial matches to realise best-effort processing under constrained resources. In this paper, we describe the addressed problem in detail, sketch our envisioned solution for state-based load shedding, and present preliminary experimental results that indicate the general feasibility of our approach.
【Keywords】: complex event processing; load shedding; sequential pattern matching; stream processing
【Paper Link】 【Pages】:1704-1708
【Authors】: James Wagner
【Abstract】: The pervasive use of databases for the storage of critical and sensitive information in many organizations has led to an increase in the rate at which databases are exploited in computer crimes. While there are several techniques and tools available for database forensics, they mostly assume apriori database preparation, such as relying on tamper-detection software to already be in place or use of detailed logging. Alternatively, investigators need forensic tools and techniques that work on poorly-configured databases and make no assumptions about the extent of damage in a database. In this paper, we present our database forensics methods, which are capable of examining database content from a database image without using any log or system metadata. We describe how these methods can be used to detect security breaches in untrusted environments where the security threat arose from a privileged user (or someone who has obtained such privileges).
【Keywords】: database forensics; digital forensics; page carving; tampering detection
【Paper Link】 【Pages】:1709-1713
【Authors】: Kathryn Dahlgren
【Abstract】: NoSQL databases offer powerful abstractions for querying non-relational data. However, NoSQL products generally pursue superior flexibility, customizability, scalability, and performance goals while neglecting support for generally useful data management tools. In particular, products typically ship without integrated support for management features rendered conventional by the long history of RDBMSs, such as sophisticated query processing systems, join operations, aggregate functions, and integrity constraints. The design decision forces users of NoSQL technologies to find alternative methods for providing missing tools by engaging either directly or indirectly in a suboptimal k-implementation cycle as developers re-invent new instances of the same data management tools across NoSQL products. This paper articulates the problem associated with the lax regard for data management support currently defining the class of NoSQL databases and introduces the Piper package index and management system as an exploratory solution.
【Keywords】: NoSQL; Databases; Data Management
【Paper Link】 【Pages】:1714-1718
【Authors】: Roman Zoun
【Abstract】: Metaproteomics is a field of biology to analyze a microbial community. However, the growth of available data and of the scientific community itself is not reflected in the currently used software tools in this field. The standard tools are file-based and use proprietary methods or implementations. Thus, calculation performance does not increase with the same speed that new data is generated. Furthermore, a reasonable data exchange mechanism is missing which makes collaboration across departments cumbersome. As a solution, we identified the internet of things workflow as promising, because it counters the above-mentioned challenges. To this end, we are going to implement a streaming-based cloud platform using a fast data architecture and integrate the algorithms into a new cloud-based system. In this paper, we elaborate arising challenges and how we envision solving them.
【Keywords】: Fast Data; NoSQL; Bioinformatics; Metaproteomics; Proteomics; Internet of Things; Cloud Computing
【Paper Link】 【Pages】:1719-1722
【Authors】: Hanan Samet
【Abstract】: Techniques are reviewed for representing multi-dimensional spatial data geometrically and textually based on sorting it. These ideas are also used for metric data where only a distance function indicating the degree of similarity between all object pairs in the dataset are available.
【Keywords】: Spatial data; metric data; spatiotextual data; sorting
【Paper Link】 【Pages】:1723-1726
【Authors】: Zoi Kaoudi ; Jorge-Arnulfo Quiané-Ruiz
【Abstract】: There is a zoo of data processing platforms which help users and organizations to extract value out of their data. Although each of these platforms excels in specific aspects, users typically end up running their data analytics on suboptimal platforms. This is not only because choosing the right platform among the myriad of big data platforms is a daunting task, but also due to the fact that today's data analytics are moving beyond the limits of a single platform. Thus, there is an urgent need for cross-platform data processing, i.e., using more than one data processing platform to perform a data analytics task. Despite the need, achieving this is still a dreadful process where developers have to get intimate with many systems and write ad hoc scripts for integrating them. This tutorial is motivated by this need. We will discuss the importance of supporting cross-platform data processing in a systematic way as well as the current efforts to achieve that. In particular, we will introduce a classification of the different cases where an application needs or benefits from cross-platform data processing and the challenges of each case. Along with this classification, we will also present the efforts known up to date to support cross-platform data processing. We will conclude this tutorial with a discussion of several important open problems.
【Keywords】: cross platform; data processing; polystores; query processing
【Paper Link】 【Pages】:1727-1730
【Authors】: Avigdor Gal ; Arik Senderovich ; Matthias Weidlich
【Abstract】: Temporal analysis for online monitoring and improvement of complex systems such as hospitals, public transportation networks, or supply chains has been in the focus of several areas in operations management. These include queueing theory for bottleneck analysis, mathematical scheduling for resource assignments to customers, and inventory management for ordering products under uncertain demand. In recent years, with the increasing availability of data sensed by Internet-of-Things (IoT) infrastructures, these online temporal analyses drift towards automated and data-driven solutions. In this tutorial, we cover existing approaches to answer online temporal queries based on sensed data. We discuss two complementary angles, namely operations management and machine learning. The operational approach is driven by models, while machine learning methods are grounded in feature encoding. Both techniques require methods for translating low-level data readings coming from sensors into high-level activities with their temporal relations. Further, some of the techniques consider only dependencies of the sensed entities on their own individual histories, while others take into account dependencies between entities that share system resources. We outline the state-of-the-art in temporal querying, with demonstrations of interesting phenomena and main results using a real-world case study in the healthcare domain. Finally, we chart the territory of online data analytics for complex systems in a broader context and provide future research directions.
【Keywords】: Temporal Analysis; Queueing Theory; Process Mining; Predictive Monitoring
【Paper Link】 【Pages】:1731-1734
【Authors】: Cetin Sahin ; Amr El Abbadi
【Abstract】: Although outsourcing data to cloud storage has become popular, the increasing concerns about data security and privacy in the cloud limits broader cloud adoption. Ensuring data security and privacy, therefore, is crucial for better and broader adoption of the cloud. This tutorial provides a comprehensive analysis of the state-of-the-art in the context of data security and privacy for outsourced data. We aim to cover common security and privacy threats for outsourced data, and relevant novel schemes and techniques with their design choices regarding security, privacy, functionality, and performance. Our explicit focus is on recent schemes from both the database and the cryptography and security communities that enable query processing over encrypted data and access oblivious cloud storage systems.
【Keywords】: Cloud Storage; Data Security and Privacy; Range Query; SSE; ORAM; Querying Encrypted Data; Access Privacy
【Paper Link】 【Pages】:1735-1738
【Authors】: Laure Berti-Équille ; Angela Bonifati ; Tova Milo
【Abstract】: With the emergence of machine learning (ML) techniques in database research, ML has already proved a tremendous potential to dramatically impact the foundations, algorithms, and models of several data management tasks, such as error detection, data cleaning, data integration, and query inference. Part of the data preparation, standardization, and cleaning processes, such as data matching and deduplication for instance, could be automated by making a ML model "learn" and predict the matches routinely. Data integration can also benefit from ML as the data to be integrated can be sampled and used to design the data integration algorithms. After the initial manual work to setup the labels, ML models can start learning from the new incoming data that are being submitted for standardization, integration, and cleaning. The more data supplied to the model, the better the ML algorithm can perform and deliver accurate results. Therefore, ML is more scalable compared to traditional and time-consuming approaches. Nevertheless, many ML algorithms require an out-of-the-box tuning and their parameters and scope are often not adapted to the problem at hand. To make an example, in cleaning and integration processes, the window sizes of values used for the ML models cannot be arbitrarily chosen and require an adaptation of the learning parameters. This tutorial will survey the recent trend of applying machine learning solutions to improve data management tasks and establish new paradigms to sharpen data error detection, cleaning, and integration at the data instance level, as well as at schema, system, and user levels.
【Keywords】: machine learning; data cleaning; error detection; data repairing; query inference; data management; clustering; classification
【Paper Link】 【Pages】:1739-1740
【Authors】: C. Mohan
【Abstract】: In this tutorial, I survey some of the private blockchain systems with respect to their architectures in general and their approaches to some specific technical areas. Specifically, I focus on how the functionality of traditional and modern data stores are being utilized or not utilized in the different blockchain systems.
【Keywords】: Bitcoin; smart contracts; private blockchains; databases; Hyperledger Fabric; IBM; Enterprise Ethereum; R3 Corda; Intel; Sawtooth; cryptocurrencies; consensus; tutorial
【Paper Link】 【Pages】:1741-1742
【Authors】: Shuo Shang ; Lisi Chen ; Christian S. Jensen ; Ji-Rong Wen ; Panos Kalnis
【Abstract】: We propose and investigate a novel query type named trajectory search by regions of interest (TSR query). Given an argument set of trajectories, a TSR query takes a set of regions of interest as a parameter and returns the trajectory in the argument set with the highest spatial-density correlation to the query regions. This type of query is useful in applications such as trip planning and recommendation. To process the TSR query, a set of new metrics are defined to model spatial-density correlations. An efficient trajectory search algorithm is developed that exploits upper and lower bounds to prune the search space and that adopts a query-source selection strategy, as well as integrates a heuristic search strategy based on priority ranking to schedule multiple query sources. The performance of TSR query processing is studied in extensive experiments based on real and synthetic spatial data.
【Keywords】: Trajectory search by regions; spatial density correlation; spatial networks; spatial databases
【Paper Link】 【Pages】:1743-1744
【Authors】: Melissa Ailem ; François Role ; Mohamed Nadif
【Abstract】: We present a novel generative mixture model for co-clustering text data. This model, the Sparse Poisson Latent Block Model (SPLBM), is based on the Poisson distribution, which arises naturally for contingency tables, such as document-term matrices. The advantages of SPLBM are two-fold. First, it is a rigorous statistical model which is also very parsimonious. Second, it has been designed from the ground up to deal with data sparsity problems. Extensive experiments on various real-world datasets of different size and structure provide strong evidence for the effectiveness of the proposed approach.
【Keywords】: Co clustering; Clustering; Poisson distribution; Mixture Model; Latent Block Model; Text Mining; Data Sparsity
【Paper Link】 【Pages】:1745-1746
【Authors】: Jiali Mao ; Tao Wang ; Cheqing Jin ; Aoying Zhou
【Abstract】: The existing detection techniques are not tailored to identify the outlier which is close to its neighbors according to some features, but behaves significantly distinct from its neighbors in terms of the other features. In this paper, we propose a feature grouping-based mechanism, and then present two algorithms to detect outliers (TF-outlier and MO-outlier) upon trajectory streams. The effectiveness and efficiency of our proposal are validated by the experiments on real trajectory data.
【Keywords】: trajectory stream; outlier detecton; feature grouping based
【Paper Link】 【Pages】:1747-1748
【Authors】: Daichi Amagata ; Takahiro Hara
【Abstract】: We study the novel problem of continuous mining of top-k closed co-occurrence patterns across multiple streams. We employ sliding window setting in this problem, and each pattern is ranked based on count, which is the number of streams that have generated the pattern. Since objects are consecutively generated and deleted, the count of a given pattern is dynamic, which may change the rank of the pattern. This renders a challenge to monitoring the top-k answer in real-time. We propose an index-based algorithm that addresses the challenge and provides the exact answer. Specifically, we propose the CP-Graph, a hybrid index of graph and inverted file structures. The CP-Graph can efficiently compute the count of a given pattern and update the answer while pruning unnecessary patterns. Our experimental study on real datasets demonstrates the efficiency and scalability of our solution.
【Keywords】: Co occurrence pattern mining; multiple streams
【Paper Link】 【Pages】:1749-1750
【Authors】: Quoc Viet Hung Nguyen ; Huynh Huu Viet ; Thanh Tam Nguyen ; Matthias Weidlich ; Hongzhi Yin ; Xiaofang Zhou
【Abstract】: Crowdsourcing has been widely established as a means to enable human computation at large-scale, in particular for tasks that require manual labelling of large sets of data items. Answers obtained from heterogeneous crowd workers are aggregated to obtain a robust result. In this paper, we consider partial-agreement tasks that are common in many applications such as image tagging and document annotation, where items are assigned sets of labels. Going beyond the state-of-the-art, we propose a novel Bayesian nonparametric model to aggregate the partial-agreement answers in a generic way. This model enables us to compute the consensus of partially-sound and partially-complete worker answers, while taking into account mutual relations in labels and different answer sets. An evaluation of our method using real-world datasets reveals that it consistently outperforms the state-of-the-art in terms of precision, recall, and scalability.
【Keywords】: Crowdsourcing; Nonparametric Models; Bayesian Models; Answer Aggregation
【Paper Link】 【Pages】:1751-1752
【Authors】: Natalia Arzamasova ; Martin Schäler ; Klemens Böhm
【Abstract】: Today, many scientific data sets are open to the public. For their owners, it is important to know what the information needs of the users are. In this paper, we study the problem of extracting and analyzing patterns from the query log of a database. We focus on design errors (antipatterns). Antipatterns do not only have a negative effect on query performance, they also might introduce bias on any subsequent analysis of the SQL log. We propose a framework to discover patterns and antipatterns in SQL query logs and to clean antipatterns. To study the usefulness of our approach and to reveal insights on antipatterns in logs of real-world systems, we examine the SQL log of the SkyServer project, with more than 40 million queries. Among the top 15 patterns, we found 6 antipatterns. Altogether, our results give way to the conclusion that antipatterns might falsify refactoring and any other downstream analyses.
【Keywords】: SQL Log analysis; Antipatterns; Data cleaning
【Paper Link】 【Pages】:1753-1754
【Authors】: Yingxia Shao ; Kai Lei ; Lei Chen ; Zi Huang ; Bin Cui ; Zhongyi Liu ; Yunhai Tong ; Jin Xu
【Abstract】: In this paper, we study the problem of extracting a homogeneous graph from a heterogeneous graph. The key challenges of the extraction problem are how to efficiently enumerate paths matched by the provided line pattern and aggregate values for each pair of vertices from the matched paths. To address above two challenges, we propose a parallel graph extraction framework (PGE), where we use vertex-centric model to enumerate paths and compute aggregate functions in parallel. The framework compiles the line pattern into a path concatenation plan and generates the final weighted edges in a divide-and-conquer manner. The new solution outperforms the state-of-the-art ones through the comprehensive experiments.
【Keywords】: Heterogenegous Graph; Graph Extraction; Path Concatenation; Parallelisim
【Paper Link】 【Pages】:1755-1756
【Authors】: Xin Lin ; Yun Peng ; Jianliang Xu ; Byron Choi
【Abstract】: In this paper, we consider probabilistic reachability queries on uncertain graphs. To make the results more informative, we adopt a crowdsourcing-based approach to clean the uncertain edges. One important problem is how to efficiently select a limited set of edges for cleaning that maximizes the quality improvement. We prove that the edge selection problem is #P-hard. In light of the hardness of the problem, we propose a series of edge selection algorithms, followed by a number of optimization techniques and pruning heuristics for minimizing the computation time. Our experimental results demonstrate that our proposed techniques outperform a random selection by up to 27 times in terms of the result quality improvement and the brute-force solution by up to 60 times in terms of the elapsed time.
【Keywords】: Crowdsourcing; Uncertain Graph; Reachability query
【Paper Link】 【Pages】:1757-1758
【Authors】: Xin Lin ; Jianliang Xu ; Haibo Hu ; Zhe Fan
【Abstract】: In this paper, we propose a novel pairwise crowd-sourcing model to reduce the uncertainty of top-k ranking using a crowd of domain experts. Given a crowdsourcing task of limited budget, we propose efficient algorithms to select the best object pairs for crowdsourcing that will bring in the highest quality improvement. Extensive experiments show that our proposed solutions outperform a random selection method by up to 30 times in terms of quality improvement of probabilistic top- k ranking queries. In terms of efficiency, our proposed solutions can reduce the elapsed time of a brute-force algorithm from several days to one minute.
【Keywords】: Crowdsourcing; Top k; Uncertain query
【Paper Link】 【Pages】:1759-1760
【Authors】: Abdulhakim Ali Qahtan ; Suojin Wang ; Xiangliang Zhang
【Abstract】: Recent developments in sensors, global positioning system devices and smart phones have increased the availability of spatiotemporal data streams. Developing models for mining such streams is challenged by the huge amount of data that cannot be stored in the memory, the high arrival speed and the dynamic changes in the data distribution. Density estimation is an important technique in stream mining for a wide variety of applications. In this paper, we present a method called KDE-Track to estimate the density of spatiotemporal data streams. KDE-Track can efficiently estimate the density function with linear time complexity using interpolation on a kernel model, which is incrementally updated upon the arrival of new samples from the stream.
【Keywords】: Data Stream; Dynamic Density; Density Visualization; Adaptive Resampling
【Paper Link】 【Pages】:1761-1762
【Authors】: Huiping Liu ; Cheqing Jin ; Bin Yang ; Aoying Zhou
【Abstract】: The classical K Shortest Paths (KSP) problem, which identifies the k shortest paths in a directed graph, plays an important role in many application domains, such as providing alternative paths for vehicle routing services. However, the returned k shortest paths may be highly similar, i.e., sharing significant amounts of edges, thus adversely affecting service qualities. In this paper, we formalize the K Shortest Paths with Diversity (KSPD) problem that identifies top-k shortest paths such that the paths are dissimilar with each other and the total length of the paths is minimized. We first prove that the KSPD problem is NP-hard and then propose a generic greedy framework to solve the KSPD problem in the sense that (1) it supports a wide variety of path similarity metrics which are widely adopted in the literature and (2) it is also able to efficiently solve the traditional KSP problem if no path similarity metric is specified. The core of the framework includes the use of two judiciously designed lower bounds, where one is dependent on and the other one is independent on the chosen path similarity metric, which effectively reduces the search space and significantly improves efficiency. Empirical studies on 5 real-world and synthetic graphs and 5 different path similarity metrics offer insight into the design properties of the proposed general framework and offer evidence that the proposed lower bounds are effective.
【Keywords】: route planning; path diversity
【Paper Link】 【Pages】:1763-1764
【Authors】: Jialong Han ; Aixin Sun ; Gao Cong ; Wayne Xin Zhao ; Zongcheng Ji ; Minh C. Phan
【Abstract】: Many domain-specific websites host a profile page for each entity (e.g., locations on Foursquare, movies on IMDb, and products on Amazon), and users can post comments on it. When commenting on an entity, users often mention other entities for reference or comparison. Compared with web pages and tweets, disambiguating the mentioned entities in user comments has not received much attention. This paper investigates linking fine-grained locations in Foursquare comments. We demonstrate that the focal location, i.e., the location that a comment is posted on, provides rich contexts for linking. To exploit such information, we represent the Foursquare data in a graph, which includes locations, comments, and their relations. A probabilistic model named FocalLink is proposed to estimate the probability that a user mentions a location when commenting on a focal location, by following different kinds of relations. Experimental results show that FocalLink is consistently superior to different baselines.
【Keywords】: LSBN; Entity Linking; Knowledge Graph
【Paper Link】 【Pages】:1765-1766
【Authors】: Xiang Cheng ; Shuguang Zhu ; Sen Su ; Gang Chen
【Abstract】: Community Question Answering (CQA) has increasingly become an important service for people asking questions and providing answers online, which enables people to help each other by sharing knowledge. Recently, with accumulation of users and contents, much concern has arisen over the efficiency and answer quality of CQA services. To address this problem, question routing has been proposed which aims at routing new questions to suitable answerers, who have both high possibility and high ability to answer the questions. In this paper, we formulate question routing as a multi-objective ranking problem, and present a multi-objective learning-to-rank approach for question routing (MLQR), which can simultaneously optimize the answering possibility and answer quality of routed users. In MLQR, realizing that questions are relatively short and usually attached with tags, we first propose a tagword topic model (TTM) to derive topical representations of questions. Based on TTM, we then develop features for each question-user pair, which are captured at both platform level and thread level. In particular, the platform-level features summarize the information of a user from his/her history posts in the CQA platform, while the thread-level features model the pairwise competitions of a user with others in his/her answered threads. Finally, we extend a state-of-the-art learning-to-rank algorithm for training a multi-objective ranking model. Extensive experimental results on real-world datasets show that our MLQR can outperform state-of-the-art methods in terms of both answering possibility and answer quality.
【Keywords】: Community question answering services; question routing; multiple objective optimization
【Paper Link】 【Pages】:1767-1768
【Authors】: Ehab Abdelhamid ; Mustafa Canim ; Mohammad Sadoghi ; Bishwaranjan Bhattacharjee ; Yuan-Chi Chang ; Panos Kalnis
【Abstract】: Frequent subgraph mining is a core graph operation used in many domains. Most existing techniques target static graphs. However, modern applications utilize large evolving graphs. Mining these graphs using existing techniques is infeasible because of the high computational cost. We propose IncGM+, a fast incremental approach for frequent subgraph mining on large evolving graphs. We adapt the notion of "fringe" to the graph context, that is, the set of subgraphs on the border between frequent and infrequent subgraphs. IncGM+ maintains fringe subgraphs and exploits them to prune the search space. To boost efficiency, IncGM+ stores a number of selected embeddings to avoid redundant expensive subgraph isomorphism operations. Moreover, the proposed system supports batch updates. Our results confirm that IncGM+ outperforms existing methods, scales to larger graphs and consumes less memory.
【Keywords】: Frequent Subgraph Mining; Incremental Index; Evolving graph
【Paper Link】 【Pages】:1769-1770
【Authors】: Changping Wang ; Chaokun Wang ; Gaoyang Guo ; Xiaojun Ye ; Philip S. Yu
【Abstract】: The skyline of a data point set consists of the best points in the set, and is very important for multi-criteria decision making. One recent and important variant of the traditional skyline is group-based skyline, which aims to find the best groups of points in a given set. This paper brings forward an efficient approach, called minimum dominance search (MDS), to solve the g-skyline problem, a latest group-based skyline problem. MDS consists of two steps: In the first step, a novel g-skyline support structure, i.e., minimum dominance graph (MDG), is constructed to store all the points which may occur in g-skyline groups. In the second step, two searching algorithms are proposed to find g-skyline groups based on the MDG through two searching algorithms, and a skyline-combination based optimization strategy is employed to improve these two algorithms. The support for dynamic group sizes, i.e., a practical extension of the origin g-skyline problem, is provided through slightly modifying MDS. Comprehensive experiments are conducted on both synthetic and real-world data sets, and the results show that our algorithms are orders of magnitude faster than the state-of-the-art.
【Keywords】: Skyline; Group Skyline; G Skyline; Minimum Dominance Search; Minimum Dominance Graph
【Paper Link】 【Pages】:1771-1772
【Authors】: Xixian Han ; Jianzhong Li ; Hong Gao
【Abstract】: Top-k dominating query is an important operation to return a set of interesting points from a potentially huge data space. For any tuple, its domination score is defined as the number of tuples dominated by the tuple. Top-k dominating query returns the k tuples with the highest domination scores. This paper proposes a novel table-scan-based TDTS algorithm to compute the top-k dominating results on massive data efficiently. TDTS presorts table T to generate PT, whose tuples are arranged in the order of round-robin retrieval on the sorted lists. TDTS performs sequential scan on PT to obtain query results. It is proved that TDTS has the characteristic of early termination. This paper devises efficient pruning operation to reduce the number of candidate tuples and the number of assistant tuples significantly. The experimental results show that, TDTS has a markedly superior performance compared with the existing algorithms.
【Keywords】: Massive data; TDTS algorithm; Table scan; Early termination; Pruning operation
【Paper Link】 【Pages】:1773-1774
【Authors】: Pinghui Wang ; Junzhou Zhao ; Xiangliang Zhang ; Zhenguo Li ; Jiefeng Cheng ; John C. S. Lui ; Don Towsley ; Xiaohong Guan
【Abstract】: Despite recent efforts in counting 3-node and 4-node graphlets, little attention has been paid to characterizing 5-node graphlets. In this paper, we develop a computationally efficient sampling method to estimate 5-node graphlet counts. We not only provide a fast sampling method and unbiased estimators of graphlet counts, but also derive simple yet exact formulas for the variances of the estimators which are of great value in practice-the variances can be used to bound the estimates' errors and determine the smallest necessary sampling budget for a desired accuracy. We conduct experiments on a variety of real-world datasets, and the results show that our method is several orders of magnitude faster than the state-of-the-art methods with the same accuracy.
【Keywords】: graphlets; sampling; subgraphs
【Paper Link】 【Pages】:1775-1776
【Authors】: Jianxin Li ; Timos Sellis ; J. Shane Culpepper ; Zhenying He ; Chengfei Liu ; Junhu Wang
【Abstract】: The problem of influence maximization has attracted a lot of attention as it provides a way to improve marketing, branding, and product adoption. However, existing studies rarely consider the physical locations of the social users, although location is an important factor in targeted marketing. In this paper, we investigate the problem of influence spanning maximization in location-aware social networks. Our target is to identify the maximum spanning geographical regions in a query region, which is very different from the existing methods that focus on the quantity of the activated users in the query region. Since the problem is NP-hard, we develop one greedy algorithm with a 1-1/e approximation ratio and further improve its efficiency by developing an upper bound based approach. Then, we propose the OIR index by combining ordered influential node lists and an R*-tree and design the index based solution. The efficiency and effectiveness of our proposed solutions and index have been verified using three real datasets.
【Keywords】: spatial social network; influence maximization
【Paper Link】 【Pages】:1777-1778
【Authors】: Lizhen Wang ; Xuguang Bao ; Lihua Zhou
【Abstract】: Spatial co-location pattern mining is an interesting and important task in spatial data mining which discovers the subsets of spatial features frequently observed together in nearby geographic space. However, the traditional framework of mining prevalent co-location patterns produces numerous redundant co-location patterns, which makes it hard for users to understand or apply. In this paper we study the problem of reducing redundancy in a collection of prevalent co-location patterns. We first introduce the concept of semantic distance between two co-location patterns, and then define redundant co-locations by introducing the concept of δ-covered, where δ (0≤δ≤1) is a coverage measure. We develop two algorithms RRclosed and RRnull to perform the redundancy reduction for prevalent co-location patterns. Our performance studies on the synthetic and real-world data sets demonstrate that our method effectively reduces the size of the original collection of closed co-location patterns by about 50%. Furthermore, the RRnull method runs much faster than the related closed co-location pattern mining algorithm.
【Keywords】: Spatial co location pattern; redundancy; semantic distance; d covered
【Paper Link】 【Pages】:1779-1780
【Authors】: Jiaying Wang ; Xiaochun Yang ; Bin Wang ; Chengfei Liu
【Abstract】: String similarity join, as an essential operation in applications including data integration and data cleaning, has attracted significant attention in the research community. Previous studies focus on global similarity join. In this paper, we study local similarity join with edit distance constraints, which finds string pairs from two string collections that have similar substrings. We study two kinds of local similarity join problems: checking local similar pairs and locating local similar pairs. We first consider the case where if two strings are locally similar to each other, they must share a common gram of a certain length. We show how to do efficient local similarity verification based on a matching gram pair. We propose two pruning techniques and an incremental method to further improve the efficiency of finding matching gram pairs. Then we devise a method to locate the longest similar substring pair for two local similar strings. We conducted a comprehensive experimental study to evaluate the efficiency of these techniques.
【Keywords】: Local Similarity Join; Edit Distance; Similar Substrings; Filtering.
【Paper Link】 【Pages】:1781-1782
【Authors】: Gang Chen ; Jingwen Zhao ; Yunjun Gao ; Lei Chen ; Rui Chen
【Abstract】: This paper explores the Time-Aware Boolean Spatial Keyword Query (TABSKQ) that finds the geo-tagged objects satisfying user's spatial, textual, and temporal constraints. Towards this, we propose an efficient index structure so-called the TA-tree and its corresponding algorithms, which can efficiently prune the search space using both spatio-temporal and textual information. Extensive experiments with real datasets offer insight into the performance of our proposed index and algorithms.
【Keywords】: Boolean spatial keyword query; index structure; query processing; algorithm
【Paper Link】 【Pages】:1783-1784
【Authors】: Bolong Zheng ; Han Su ; Wen Hua ; Kai Zheng ; Xiaofang Zhou ; Guohui Li
【Abstract】: With the advances in geo-positioning technologies and location-based services, it is nowadays quite common for road networks to have textual contents on the vertices. Previous work on identifying an optimal route that covers a sequence of query keywords has been studied in recent years. However, in many practical scenarios, an optimal route might not always be desirable. Therefore, in this paper, we investigate the problem of clue-based route search (CRS), which allows a user to provide clues on keywords and spatial relationships. First, we propose a greedy algorithm and a dynamic programming algorithm as baselines. To improve efficiency, we develop a branch-and-bound algorithm that prunes unnecessary vertices in query processing. In order to quickly locate candidate, we propose an AB-tree that stores both the distance and keyword information in tree structure. To further reduce the index size, we construct a PB-tree by utilizing the virtue of 2-hop label index to pinpoint the candidate. Extensive experiments are conducted and verify the superiority of our algorithms and index structures.
【Keywords】: spatial keyword queries; clue; point of interest; travel route serach; query processing
【Paper Link】 【Pages】:1785-1786
【Authors】: Sheng Wang ; Zhifeng Bao ; J. Shane Culpepper ; Timos Sellis ; Gao Cong
【Abstract】: We study a new kind of query - a Reverse k Nearest Neighbor Search over Trajectories (RkNNT), which can be used for route planning and capacity estimation in the transportation field. Given a set of existing routes DR, a set of passenger transitions DT, and a query route Q, an RkNNT query returns all transitions that take Q as one of its k nearest travel routes. We develop an index to handle dynamic trajectory updates, so that the most up-to-date transition data is available for answering an RkNNT query using a filter-refine processing framework. Further, an application of using RkNNT to plan the optimal route in bus networks, namely MaxRkNNT, is proposed and studied. Experiments on real datasets demonstrate the efficiency and scalability of our approaches. In the future, the RkNNT can be extended to applied to the traffic prediction.
【Keywords】: Trajectory database; route planning; transit network; capacity prediction
【Paper Link】 【Pages】:1787-1788
【Authors】: Gang Chen ; Keyu Yang ; Lu Chen ; Yunjun Gao ; Baihua Zheng ; Chun Chen
【Abstract】: Given two object sets Q and O, a metric similarity join finds similar object pairs according to a certain criterion. This operator has a wide range of applications in data cleaning, data mining, etc. In this paper, we employ a popular distributed framework, namely, MapReduce, to support scalable metric similarity joins. To ensure load balancing, we present two sampling based partition methods, i.e., clustering based partition method and KD-tree based partition method. To avoid unnecessary object pair evaluation, we propose a framework that maps the two involved object sets in order, where plane sweeping and pivot based filtering techniques are utilized for pruning. Extensive experiments confirm that our solution outperforms significantly existing state-of-the-art competitors.
【Keywords】: Similarity Joins; Metric Space; MapReduce; Algorithm
【Paper Link】 【Pages】:1789-1790
【Authors】: Valeria Fionda ; Giuseppe Pirrò
【Abstract】: Community deception is the problem of hiding a community from community detection algorithms. This is an important task whenever a group (e.g., activists, police enforcements) want to observe and cooperate in a social network while avoiding to be detected. We formalize the community deception problem and propose an efficient algorithm, based on the concept of community safeness, which allows to achieve deception by carefully identifying and rewiring a certain number of the community members' connections. Deception can be practically achieved in social networks like Facebook or Twitter by friending (following) or unfriending (unfollowing) network members. We validated our approach and compared it with related research on a variety of (large) real networks with encouraging results.
【Keywords】: Community Hiding; Community Detection
【Paper Link】 【Pages】:1791-1792
【Authors】: Zhiling Luo ; Ling Liu ; Jianwei Yin ; Ying Li ; Zhaohui Wu
【Abstract】: NgramCNN is a deep convolutional neural network developed for classification of graphs based on common substructure patterns and their latent relationships in the collection of graphs. Our NgramCNN deep learning framework consists of three novel components: (1) The concept of n-gram graph block to transform each raw graph object into a sequence of n-gram blocks connected through overlapping regions. (2) The diagonal convolution layer to extract local patterns and connectivity features hidden in the n-gram blocks by performing n-gram normalization before conducting deep learning through the network of convolution layers. (3) The extraction of deeper global patterns based on the local patterns and the ways that they respond to overlapping regions by building a n-gram deep convolutional neural network. Extensive evaluation of NgramCNN using five real graph repositories from bioinformatics and social networks domains show the effectiveness of NgramCNN over the existing state of art methods with high accuracy and comparable performance.
【Keywords】: Graph Classification; Graph Mining; Ngram; Convolutional Neural Network
【Paper Link】 【Pages】:1793-1794
【Authors】: Gong Cheng ; Fei Shao ; Yuzhong Qu
【Abstract】: Searching for associations between entities is needed in many domains. It has been facilitated by the emergence of graph-structured semantic data on the Web, which offers structured semantic associations more explicit than those hiding in unstructured text for computers to discover. The increasing volume of semantic data requires ranking techniques to identify the more important semantic associations for users. Considering a lack of comprehensive empirical evaluation of existing techniques, we carry out an extensive evaluation of eight techniques including two novel ones we propose. The practical effectiveness of these techniques is assessed based on 1,200 ground-truth rankings created by 30 human experts for real-life semantic associations and on the explanations given by the experts.
【Keywords】: semantic association; ranking; semantic data; entity homogeneity; relation heterogeneity
【Paper Link】 【Pages】:1795-1796
【Authors】: Na Ta ; Guoliang Li ; Tianyu Zhao ; Jianhua Feng ; Hanchao Ma ; Zhiguo Gong
【Abstract】: Ride-sharing (RS) has great values in saving energy and alleviating traffic pressure. In this paper, we propose a new ride-sharing model, where each driver requires that the shared route percentage (SRP, the ratio of the shared route's distance to the driver's total traveled distance) exceeds her expected rate (e.g., 0.8) when sharing with a rider. We consider two variants of this problem. The first considers multiple drivers and multiple riders, and aims to compute a set of driver-rider pairs to maximize the overall SRP. We model this problem as the maximum weighted bigraph matching problem. We propose an effective exact algorithm, and an efficient approximate solution with error-bound guarantee. The second considers multiple drivers and a single rider and aims to find the top-k drivers for the rider with the largest SRP. We devise pruning techniques and propose a best-first algorithm to progressively selects drivers with high probability to be in the top-k results.
【Keywords】: ride sharing; shared route percentage; bigraph matching; join based sharing; search based sharing
【Paper Link】 【Pages】:1797-1798
【Authors】: Cheng Xu ; Qian Chen ; Haibo Hu ; Jianliang Xu ; Xiaojun Hei
【Abstract】: With recent advances in data-as-a-service (DaaS) and cloud computing, aggregate query services over set-valued data are becoming widely available for business intelligence that drives decision making. However, as the service provider is often a third-party delegate of the data owner, the integrity of the query results cannot be guaranteed and is thus imperative to be authenticated. Unfortunately, existing query authentication techniques either do not work for set-valued data or they lack data confidentiality. In this paper, we propose authenticated aggregate queries over set-valued data that not only ensure the integrity of query results but also preserve the confidentiality of source data.
【Keywords】: Query Authentication; Aggregate Quereis; Set Valued Data; Merkel Hash Tree
【Paper Link】 【Pages】:1799-1800
【Authors】: Monidipa Das ; Soumya K. Ghosh ; Pramesh Gupta ; Vemuri M. Chowdary ; Ravoori Nagaraja ; Vinay Kumar Dadhwal
【Abstract】: A proper assessment of reservoir water dynamics is of utmost importance for the development of any region. However, the reservoir water dynamics is not merely a periodic event. Rather, it is a result of complex interplay among various water balancing components, especially the meteorological factors. So, the key objectives of this research work are to model the influence of spatial variability of meteorological variables on the hydrological processes in a reservoir, and to utilize this knowledge of spatial variability to aid in prediction of reservoir dynamics. In this regard, we propose FORWARD, a forecasting model based on spatial Bayesian network (SpaBN) which has inherent capability of efficiently modeling the spatial impact of meteorological and topographical factors distributed over the watershed. The forecasting efficiency of FORWARD has been compared with a set of linear and non-linear prediction techniques with respect to a case study on forecasting daily live capacity of the Mayurakshi reservoir in India. The experimental results show the superiority of FORWARD over the others.
【Keywords】: Spatial Bayesian network; Spatial importance; Spatio temporal system; Probabilistic reasoning; Forecasting
【Paper Link】 【Pages】:1801-1802
【Authors】: Christian Frey ; Andreas Züfle ; Tobias Emrich ; Matthias Renz
【Abstract】: In this paper, we address the problem of optimizing information propagation in uncertain networks given a constrained budget of edges. We show that this problem requires to solve two NP-hard subproblems: the computation of expected information flow, and the optimal choice of edges. To compute the expected information flow to a source vertex, we propose the F-tree as a specialized data structure, that identifies independent components of the graph for which the information flow can either be computed analytically and efficiently, or for which traditional Monte-Carlo sampling can be applied independently of the remaining network.
【Keywords】: network analysis; optimization; information flow; reliability; uncertainty; communication networks
【Paper Link】 【Pages】:1803-1804
【Authors】: Leong Hou U ; Junjie Zhang ; Kyriakos Mouratidis ; Ye Li
【Abstract】: The efficient processing of document streams plays an important role in many information filtering systems. Emerging applications, such as news update filtering and social network notifications, demand presenting end-users with the most relevant content to their preferences. In this work, user preferences are indicated by a set of keywords. A central server monitors the document stream and continuously reports to each user the top-k documents that are most relevant to her keywords. The objective is to support large numbers of users and high stream rates, while refreshing the top-k results almost instantaneously. Our solution abandons the traditional frequency-ordered indexing approach, and follows an identifier-ordering paradigm that suits better the nature of the problem. When complemented with a locally adaptive technique, our method offers (i) optimality w.r.t. the number of considered queries per stream event, and (ii) an order of magnitude shorter response time than the state-of-the-art.
【Keywords】: Top k query; Continuous query; Document stream
【Paper Link】 【Pages】:1805-1806
【Authors】: Dimitra Papadimitriou ; Georgia Koutrika ; Yannis Velegrakis ; John Mylopoulos
【Abstract】: We study the problem of finding related forum posts to a post at hand. We developed a multi-segment matching technique that considers posts as a set of segments each one written with a different goal in its author mind and computes the relatedness between two posts based on the similarity of their respective segments that are intended for the same goal. The questions are how our method identifies such segments, how it figures out for what each segment is intended and how it exploits this information to rank the posts. We experimentally illustrate the effectiveness and efficiency of our segmentation method and overall approach of finding related forum posts.
【Keywords】: clustering; segment; document; retrieval; top k
【Paper Link】 【Pages】:1807-1808
【Authors】: Ziyang Liu ; Chong Wang ; Yi Chen
【Abstract】: Archiving graph data over history is demanded in many applications, such as social network and bibliographies. Typically, people are interested in querying temporal graphs. Existing keyword search approaches for graph-structured data are insufficient for querying temporal graphs. This paper initiates the study of supporting keyword-based queries on temporal graphs. We propose a search syntax that is an extension of keyword search, which allows casual users to easily search temporal graphs with temporal predicates and ranking functions. To generate results efficiently, we propose a best path iterator, which finds the "best" paths between two data nodes in each snapshot regarding to three ranking factors. We develop algorithms that efficiently generate top-k query results. Extensive experiments verified the efficiency and effectiveness of our approach.
【Keywords】: algorithm; graph theory; keyword search; Information retrieval
【Paper Link】 【Pages】:1809-1810
【Authors】: Chih-Yu Wang ; Yan Chen ; K. J. Ray Liu
【Abstract】: Deal selection on Groupon is a typical social learning and decision making process, where the quality of a deal is usually unknown to the customers. The customers must acquire this knowledge through social learning from other social medias such as reviews on Yelp. Additionally, the quality of a deal depends on both the state of the vendor and decisions of other customers on Groupon. How social learning and network externality affect the decisions of customers in deal selection on Groupon is our main interest. We develop a data-driven game-theoretic framework to understand the rational deal selection behaviors cross social medias. The sufficient condition of the Nash equilibrium is identified. A value-iteration algorithm is proposed to find the optimal deal selection strategy. We conduct a year-long experiment to trace the competitions among deals on Groupon and the corresponding Yelp ratings. We utilize the dataset to analyze the deal selection game with realistic settings. Finally, the performance of the proposed social learning framework is evaluated with real data. The results suggest that customers do make decisions in a rational way instead of following naive strategies, and there is still room to improve their decisions with assistance from the proposed framework.
【Keywords】: game theory; Chinese restaurant game; deal; social media; bayesian learning
【Paper Link】 【Pages】:1811-1812
【Authors】: Wengen Li ; Jiannong Cao ; Jihong Guan ; Man Lung Yiu ; Shuigeng Zhou
【Abstract】: The widespread location-aware applications produce a vast amount of spatio-textual data that contains both spatial and textual attributes. To make use of this enriched information for users to describe their preferences for travel routes, we propose a Bounded-Cost Informative Route (BCIR) query to retrieve the routes that are the most textually relevant to the user-specified query keywords subject to a travel cost constraint. BCIR query is particularly helpful for tourists and city explorers to plan their travel routes. We will show that BCIR query is an NP-hard problem. To answer BCIR query efficiently, we propose an exact solution with effective pruning techniques and an approximate solution with performance guarantee. Extensive experiments over real data sets demonstrate that the proposed solutions achieve the expected performance.
【Keywords】: road network; route query; query keywords; informative routes
【Paper Link】 【Pages】:1813-1814
【Authors】: Damir Vandic ; Steven S. Aanen ; Flavius Frasincar ; Uzay Kaymak
【Abstract】: Currently many webshops rely on a fixed list of product facets to help users find the products of interest. Such a solution suffers from two problems: (1) it is difficult to devise a fixed list of facets that would satisfy all user interests, and (2) the top facets could become obsolete when all product results have these facets. To address these two problems we propose a novel algorithm for dynamic ordering of the product facets based on the query results. This algorithm relies on measures such as specificity and dispersion for qualitative and quantitative facets, respectively, to rank the properties associated with these facets so that users are able to find the products of interest with a minimum number of drill-down steps. Using a large-scale simulation study and a user-based evaluation, we show that our algorithm outperforms the expert-based fixed facets approach, a greedy baseline, and a state-of-the-art entropy-based solution. This paper is an extended abstract of our previous work [1].
【Keywords】: facet ordering; product search; user interfaces
【Paper Link】 【Pages】:1815-1816
【Authors】: Sen Hu ; Lei Zou ; Jeffrey Xu Yu ; Haixun Wang ; Dongyan Zhao
【Abstract】: RDF question/answering (Q/A) allows users to ask questions in natural languages over a knowledge base represented by RDF. To answer a natural language question, the existing works focus on question understanding to deal with the disambiguation of phrases linking, which ignore the query composition and execution. In this paper, we propose a systematic framework to answer natural language questions over RDF repository (RDF Q/A) from a graph data-driven perspective. We propose the (super) semantic query graph to model the query intention in the natural language question in a structural way, based on which, RDF Q/A is reduced to subgraph matching problem. More importantly, we resolve the ambiguity both of phrases and structures at the time when matches of query are found. To build the super semantic query graph, we propose a node-first framework which has high robustness and can tackle with complex questions. Extensive experiments confirm that our method not only improves the precision but also speeds up query performance greatly.
【Keywords】: RDF; Graph database; Question Answering
【Paper Link】 【Pages】:1817-1818
【Authors】: Alberto Costa ; Emanuele Di Buccio ; Massimo Melucci ; Giacomo Nannicini
【Abstract】: Information Retrieval (IR) is the complex of activities that represent information as data and rank the data representing information relevant to the user's information needs by a retrieval function. Such a function involves parameters. They can in principle be set irrespective of the specific set of documents and queries, but can in practice maximize retrieval effectiveness. However, algorithms to select retrieval function parameters must be efficient due to the large search space. We can remark that: (i) all the tested methods are similarly effective, but the plots of the maximum value of NDCG@20 at a given evaluation show that our algorithm is more efficient; (ii) performance metrics and datasets studied in this paper seem to yield objective functions with few, if any, local optima with large basin of attraction; (iii) our algorithm is considerably more efficient, quickly finding parameterizations of the retrieval function yielding high performance - much faster than line search.
【Keywords】: black box optimization; rpfopt; information retrieval
【Paper Link】 【Pages】:1819-1820
【Authors】: Rui Zhu ; Bin Wang ; Xiaochun Yang ; Baihua Zheng ; Guoren Wang
【Abstract】: Continuous top-k query over streaming data is a fundamental problem in database. In this paper, we focus on sliding window scenario, where a continuous top-k query returns the top-k objects within each query window on the data stream. Existing algorithms support this type of queries via incrementally maintaining a subset of objects in the window and try to retrieve the answer from this subset as much as possible whenever the window slides. However, since all the existing algorithms are sensitive to query parameters and data distribution, they all suffer from expensive incremental maintenance cost. In this paper, we propose a self-adaptive partition framework to support continuous top-k query. It partitions the window into subwindows and only maintains a small number of candidates with highest scores in each sub-window. Based on this framework, we have developed several partition algorithms to cater for different object distributions and query parameters. It is the first algorithm that achieves logarithmic complexity w.r.t. k for incremental maintaining the candidate set even in the worst case.
【Keywords】: Microsoft Windows; Partitioning algorithms; Maintenance engineering; Heuristic algorithms; Complexity theory; Windows; Silicon
【Paper Link】 【Pages】:1821-1822
【Authors】: Muhammad U. Arshad ; Ashish Kundu ; Elisa Bertino ; Arif Ghafoor ; Chinmay Kundu
【Abstract】: Graphs are used for representing and understanding objects and their relationships for numerous applications such as social networks, semantic webs, and biological networks. Integrity assurance of data and query results for graph databases is an essential security requirement. In this paper, we propose two efficient integrity verification schemes - HMACs for graphs (gHMAC) for two-party data sharing, and redactable HMACs for graphs (rgHMAC) for third-party data sharing, such as a cloud-based graph database service. The proposed schemes have linear complexity in terms of the number of vertices and edges in the graphs, which is shown to be optimal. Our experimental results corroborate that the proposed HMAC-based schemes for graphs are highly efficient as compared to the digital signature-based schemes - computation of HMAC tags is about 10 times faster than the computation of digital signatures.
【Keywords】: Graph data; integrity; HMAC; third party data publishing; cloud databases; efficiency; scalability
【Paper Link】 【Pages】:1823-1824
【Authors】: Mingyi Zhang ; Patrick Martin ; Wendy Powley ; Jianjun Chen
【Abstract】: Workload management is the discipline of effectively monitoring, managing and controlling work flow across computing systems. In particular, workload management in database management systems (DBMSs) is the process or act of monitoring and controlling work (i.e., requests) executing on a database system in order to make efficient use of system resources in addition to achieving any performance objectives assigned to that work. In the past decade, workload management studies and practice have made considerable progress in both academia and industry. New techniques have been proposed by researchers, and new features of workload management facilities have been implemented in most commercial database products. In this paper, we provide a systematic study of workload management in today's DBMSs by developing a taxonomy of workload management techniques. We apply the taxonomy to evaluate and classify existing workload management techniques implemented in the commercial databases and available in the recent research literature. We also introduce the underlying principles of today's workload management technology for DBMSs, discuss open problems and outline some research opportunities in this research area.
【Keywords】: Taxonomy; Workload Management; Database Management Systems
【Paper Link】 【Pages】:1825-1826
【Authors】: Zhiting Lin ; Liang Dong
【Abstract】: Establishing trustworthy relationships among the objects greatly improves the effectiveness of node interaction in the social Internet of Things (IoT). It helps nodes overcome perceptions of uncertainty and risk. However, there are limitations in the existing trust models. In this paper, a comprehensive model of trust is proposed that is tailored to the social IoT. The model includes ingredients such as trustor, trustee, goal, trustworthiness evaluation, decision, action, result, and context. Building on this trust model, we clarify the concepts of trust in the social IoT in five aspects such as: (1) mutuality of trustor and trustee; (2) inferential transfer of trust; (3) transitivity of trust; (4) trustworthiness update; and (5) trustworthiness affected by dynamic environment. With network connectivities that are from real-world social networks, simulations are conducted to evaluate the performance of the social IoT operated with the proposed trust model. An experimental IoT network is used to further validate the proposed trust model.
【Keywords】: Social Internet of Things; task delegation; trust model; trustworthiness evaluation; trust inference; trust transfer; trustworthiness update
【Paper Link】 【Pages】:1827-1828
【Authors】: Yuefeng Li ; Libiao Zhang ; Yue Xu ; Yiyu Yao ; Raymond Y. K. Lau ; Yutong Wu
【Abstract】: Text classification techniques are playing a crucial role in identifying relevant texts from a large data set, e.g., various online crimes such as Cyberbullying, terrorist recruiting, propaganda or attack planning. Until now, supervised deep learning has brought about breakthroughs in processing multimedia data; however, there was no good practical way to harvest this opportunity for text classification because acquiring and maintaining a massive amount of training examples are too expensive for a large number of categories (e.g., Yahoo! taxonomy contains nearly 300,000 categories and the Library of Congress Subject Headings (LCSH) contains 394,070 subjects). Therefore, the question of how to effectively learn from sparse or small set of training examples is crucial for the true success of text classification. Semi-supervised approaches have been proposed for this challenge, which usually use a pair or several existing classifiers to extend a small training set. However, extracted pseudo training samples are uncertain because they are determined by a machine rather than people. Also, the massive volume and high variability of text data are creating a number of challenging issues such as the scalability and complicated relations between words. There are two fundamental issues with regards to the performance of existing classifiers: overlook and overload. Overlook means that some objects relevant to a class have been omitted, whereas overload means that some objects assigned to a class are actually not relevant to that class. The two issues are even more serious in the following two cases: (1) large uncertain boundary - the decision boundary between two classes includes many mixed examples (e.g., relevant and nonrelevant documents together), and (2) unbalanced classes - one class (e.g., information about terrorist attacks) is much smaller than another class (e.g., normal descriptions). We propose a three-way decision model [1] for dealing with the uncertain boundary for improving text classification performance based on rough set techniques and centroid solution. It aims to understand the uncertain boundary through partitioning the training samples into three regions (the positive, boundary and negative regions) by two main boundary vectors created from the labeled positive and negative training subsets, respectively, and further resolve the objects in the boundary region by two derived boundary vectors produced according to the structure of the boundary region. Four decision rules are proposed from the training process and applied to the incoming documents for more precise classification. The experimental results on the standard data sets RCV1 and Reuters-21578 show that the usage of boundary vectors is very effective and efficient for dealing with uncertainties of the decision boundary, and the proposed model has significantly improved the performance of binary text classification in terms of F1 measure and AUC area compared with six other popular baseline models.
【Keywords】: Uncertain decision boundary; Text classification; Three way decision; Rough set; Decision rule