33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19-22, 2017. IEEE Computer Society 【DBLP Link】
【Paper Link】 【Pages】:3
【Authors】: Volker Markl
【Abstract】: The global database research community has greatly impacted the functionality and performance of data storage and processing systems along the dimensions that define “big data”, i.e., volume, velocity, variety, and veracity. Locally, over the past five years, we have also been working on varying fronts. Among our contributions are: (1) establishing a vision for a database-inspired big data analytics system, which unifies the best of database and distributed systems technologies, and augments it with concepts drawn from compilers (e.g., iterations) and data stream processing, as well as (2) forming a community of researchers and institutions to create the Stratosphere platform to realize our vision. One major result from these activities was Apache Flink, an open-source big data analytics platform, and its thriving global community of developers and production users. Although much progress has been made, when looking at the overall big data stack, a major challenge for the database research community still remains. That is, how to maintain the ease-of-use despite the increasing heterogeneity and complexity of data analytics, involving specialized engines for various aspects of an end-to-end data analytics pipeline, including, among others, graph-based, linear algebra-based, and relational-based algorithms, and the underlying, increasingly heterogeneous hardware and computing infrastructure. At TU Berlin, DFKI, and the Berlin Big Data Center (BBDC), we aim to advance research in this field via the Mosaics project. Our goal is to remedy some of the heterogeneity challenges that hamper developer productivity and limit the use of data science technologies to just the privileged few, who are coveted experts.
【Keywords】: Big Data; Terrestrial atmosphere; Distributed databases; Data analysis; Conferences; Data engineering
【Paper Link】 【Pages】:4
【Authors】: Laura M. Haas
【Abstract】: Doing data science - extracting insight by analyzing data - is not easy. Data science is used to answer interesting questions that typically involve multiple diverse data sources, many different types of analysis, and often, large and messy data volumes. To answer one of these questions, several types of expertise may be needed to understand the context and domain being served, to import and transform individual data sets, to implement effective machine learning and/or statistical methods, to design and program applications and interfaces to extract and share data and insights, and to manage the data and systems used for analysis and storage. In the IBM Research Accelerated Discovery Lab, we are studying how data scientists work, and using what we learn to help them gain insights faster. In this talk, we will look at what we have learned to date, through user studies and experience with tens of analytics projects, and the environment that we’ve built as a result. In particular, I will describe how we capture information to enable contextual search, provenance queries, and other functionality to afford teams faster progress in data-intensive investigations. I will also touch on our efforts to leverage data and people to explain what happens during an investigation, with an ultimate goal of moving from descriptive to prescriptive analytics in order to accelerate data science and the analytic process. I will illustrate these various efforts using an ambitious current project on applying metagenomics to food safety, and will conclude with a discussion of where more work is needed and our future directions.
【Keywords】: Acceleration; Data science; Data mining; Computer science; Technological innovation; Conferences; Data engineering
【Paper Link】 【Pages】:7-8
【Authors】: Wentao Wu ; Hongsong Li ; Haixun Wang ; Kenny Q. Zhu
【Abstract】: Knowledge acquisition is an iterative process. Most prior work used syntactic bootstrapping approaches, while semantic bootstrapping was proposed recently. Unlike syntactic bootstrapping, semantic bootstrapping bootstraps directly on knowledge rather than on syntactic patterns, that is, it uses existing knowledge to understand the text and acquire more knowledge. It has been shown that semantic bootstrapping can achieve superb precision while retaining good recall on extracting the isA relation. Nonetheless, the working mechanism of semantic bootstrapping remains elusive. In this extended abstract, we present a theoretical analysis as well as an experimental study to provide deeper insights into semantic bootstrapping.
【Keywords】: Semantics; Syntactics; Algorithm design and analysis; Pattern matching; Conferences; Data engineering; Facebook
【Paper Link】 【Pages】:9-10
【Authors】: Gang Hu ; Jie Shao ; Fenglin Liu ; Yuan Wang ; Heng Tao Shen
【Abstract】: With the advance of various location-acquisition technologies, a myriad of GPS trajectories can be collected every day. However, the raw coordinate data captured by sensors often cannot reflect real positions due to many physical constraints and some rules of law. How to accurately match GPS trajectories to roads on a digital map is an important issue. Many existing methods still cannot meet stringent performance requirements, especially for low/unstable sampling rate and noisy/lost data. Since, in practice, other measurements such as speed and moving direction are collected together with the spatial locations, we can make use of not only the location coordinates but all of the data collected. In this paper, we propose a novel model using the related meta-information to describe a moving object, and present an algorithm called IF-Matching for map-matching. It can handle many ambiguous cases which cannot be correctly matched by existing methods. We run our algorithm with taxi trajectory data on a city-wide road network. Compared with two state-of-the-art algorithms, ST-Matching and the winner of GIS Cup 2012, our approach achieves more accurate results.
【Keywords】: Roads; Trajectory; Global Positioning System; History; Sensors; Public transportation; Measurement errors
【Paper Link】 【Pages】:11-12
【Authors】: Qi Zhang ; Yang Wang ; Jin Qian ; Binbin Deng ; Xuanjing Huang
【Abstract】: Hashing methods have proven to be useful for a variety of tasks and have attracted extensive attention in recent years. Various hashing approaches have been proposed to capture similarities between textual, visual, and cross-media information. However, most existing works use bag-of-words methods to represent textual information. Since words with different forms may have similar meaning, semantic-level text similarities cannot be well captured by these methods. To address these challenges, in this paper, we propose a novel method called semantic cross-media hashing (SCMH), which uses continuous word representations to capture the textual similarity at the semantic level and uses a deep belief network (DBN) to construct the correlation between different modalities. To demonstrate the effectiveness of the proposed method, we evaluate it on three commonly used cross-media data sets. Experimental results show that the proposed method achieves significantly better performance than state-of-the-art approaches. Moreover, the efficiency of the proposed method is comparable to or better than that of some other hashing methods.
【Keywords】: Semantics; Visualization; Correlation; Kernel; Conferences; Aggregates; Manifolds
【Paper Link】 【Pages】:13-14
【Authors】: Yung-Chun Chang ; Chien Chin Chen ; Wen-Lian Hsu
【Abstract】: In this paper, we investigate the interactions between topic persons to help readers construct the background knowledge of a topic. We propose a rich interactive tree structure to represent syntactic, contextual, and semantic information of text, and this structure is incorporated into a tree-based convolution kernel to identify segments that convey person interactions and further construct person interaction networks. Empirical evaluations demonstrate that the proposed method is effective in detecting and extracting the interactions between topic persons in the text, and outperforms other extraction approaches used for comparison. Furthermore, readers will be able to easily navigate through the topic persons of interest within the interaction networks, and further construct the background knowledge of the topic to facilitate comprehension.
【Keywords】: Syntactics; Kernel; Semantics; Voting; Convolution; Feature extraction; Context
【Paper Link】 【Pages】:15-16
【Authors】: Ying Zhang ; Yu-Ling Hsueh ; Wang-Chien Lee ; Yi-Hao Jhang
【Abstract】: Owing to the wide availability of the global positioning system (GPS) and digital mapping of roads, road network navigation services have become a basic application on many mobile devices. Path planning, a fundamental function of road network navigation services, finds a route between the specified start location and destination. The efficiency of this path planning function is critical for mobile users on roads due to various dynamic scenarios, such as a sudden change in driving direction, unexpected traffic conditions, lost or unstable GPS signals, and so on. In these scenarios, the path planning service needs to be delivered in a timely fashion. In this paper, we propose a system, namely, Path Planning by Caching (PPC), to answer a new path planning query in real time by efficiently caching and reusing historical queried-paths. Unlike conventional cache-based path planning systems, where a queried-path in cache is used only when it matches perfectly with the new query, PPC leverages the partially matched queries to answer part(s) of the new query. Comprehensive experimentation on a real road network database shows that our system outperforms state-of-the-art path planning techniques, reducing the computation latency by 32% on average.
【Keywords】: Path planning; Roads; Global Positioning System; Servers; Mobile communication; Computer science
【Paper Link】 【Pages】:17-18
【Authors】: Jianxin Li ; Chengfei Liu ; Jeffrey Xu Yu ; Yi Chen ; Timos Sellis ; J. Shane Culpepper
【Abstract】: Social networks have become a vital mechanism to disseminate information to friends and colleagues. But the dynamic nature of information and user connectivity within these networks has raised many new and challenging research problems. One of them is query-related topic search in social networks. In this work, we investigate the important problem of personalized influential topic search. Two challenging questions need to be answered: how to extract a social summarization of the social network so as to measure the topics' influence at a similar granularity scale, and how to apply the social summarization to the problem of personalized influential topic search? Evaluation on real-world datasets shows that our proposed algorithms are efficient and effective.
【Keywords】: Social network services; Electronic mail; Search problems; Aggregates; Measurement; Indexes; Conferences
【Paper Link】 【Pages】:19-20
【Authors】: Jun Chen ; Chaokun Wang ; Jianmin Wang ; Philip S. Yu
【Abstract】: We attempt to address a case of the Recommendation for Repeat Consumption (RRC) problem by proposing a Time-Sensitive Personalized Pairwise Ranking (TS-PPR) model based on the behavioral features extracted from user implicit feedback in the consumption history. TS-PPR factorizes the temporal user-item interactions via learning the mappings from the behavioral features in observable space to the preference features in latent space, and combines users' static and dynamic preferences in recommendation. An empirical study on real-world data sets shows encouraging results.
【Keywords】: Feature extraction; Training; Electronic mail; Recommender systems
【Paper Link】 【Pages】:21-22
【Authors】: Meng Wang ; Hui Li ; Jiangtao Cui ; Ke Deng ; Sourav S. Bhowmick ; Zhenhua Dong
【Abstract】: The location selection (LS) problem aims to mine the optimal location to place a new facility from a set of candidates such that the benefit or influence on a given set of objects is maximized. State-of-the-art LS techniques assume each object is static and can only be influenced by a single facility. However, in reality, objects (e.g., people, vehicles) are mobile and are influenced by multiple facilities. Consequently, classical LS solutions fail to select locations accurately. In this work, we introduce a generalized LS problem called PRIME-LS which takes mobility and probability factors into consideration to address the aforementioned limitations. To solve the problem, we propose an algorithm called PINOCCHIO, which leverages two pruning rules based on a novel distance measure, and further extend it by incorporating two optimization strategies. Experimental study over two real-world datasets demonstrates the superiority of our framework over state-of-the-art LS techniques.
【Keywords】: Probabilistic logic; Computer science; Mobile communication; Optimization; Conferences; Data engineering; Information technology
【Paper Link】 【Pages】:23-24
【Authors】: Zeyuan Shang ; Yaxiao Liu ; Guoliang Li ; Jianhua Feng
【Abstract】: Similarity join is a fundamental operation in data cleaning and integration. Existing similarity-join methods utilize the string similarity to quantify the relevance but neglect the knowledge behind the data, which plays an important role in understanding the data. Thanks to public knowledge bases, e.g., Freebase and Yago, we have an opportunity to use the knowledge to improve similarity join. To address this problem, we study knowledge-aware similarity join, which, given a knowledge hierarchy and two collections of objects (e.g., documents), finds all knowledge-aware similar object pairs. To the best of our knowledge, this is the first study on knowledge-aware similarity join. There are two main challenges. The first is how to quantify the knowledge-aware similarity. The second is how to efficiently identify the similar pairs. To address these challenges, we first propose a new similarity metric to quantify the knowledge-aware similarity using the knowledge hierarchy. We then devise a filter-and-verification framework to efficiently identify the similar pairs. We propose effective signature-based filtering techniques to prune large numbers of dissimilar pairs and develop efficient verification algorithms to verify the candidates that are not pruned in the filter step. Experimental results on real-world datasets show that our method significantly outperforms baseline algorithms in terms of both efficiency and effectiveness.
【Keywords】: Measurement; Upper bound; Cleaning; Conferences; Data engineering; Computer science; Knowledge based systems
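To make the filter-and-verification pattern mentioned above concrete, here is a minimal sketch of the generic framework on plain Jaccard similarity with standard prefix filtering. It is illustrative only: the paper's signatures are knowledge-hierarchy-aware, and all names and the toy documents below are invented for the example.

```python
from math import ceil

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def prefix(tokens, threshold, order):
    # Prefix-filter signature: two sets with Jaccard >= threshold must
    # share at least one token among each set's first (rarest) tokens.
    sorted_toks = sorted(tokens, key=lambda t: order[t])
    keep = len(sorted_toks) - ceil(threshold * len(sorted_toks)) + 1
    return sorted_toks[:keep]

def similarity_join(records, threshold):
    # Global token order: rarer tokens first, so prefixes are selective.
    freq = {}
    for r in records:
        for t in set(r):
            freq[t] = freq.get(t, 0) + 1
    order = {t: (f, t) for t, f in freq.items()}

    # Filter step: inverted index on prefix tokens generates candidate pairs.
    index, candidates = {}, set()
    for i, r in enumerate(records):
        for t in prefix(r, threshold, order):
            for j in index.get(t, []):
                candidates.add((j, i))
            index.setdefault(t, []).append(i)

    # Verification step: compute the exact similarity for survivors only.
    results = []
    for i, j in sorted(candidates):
        sim = jaccard(records[i], records[j])
        if sim >= threshold:
            results.append((i, j, sim))
    return results

docs = [["data", "cleaning", "integration"],
        ["data", "integration", "systems"],
        ["graph", "mining"]]
print(similarity_join(docs, threshold=0.5))   # -> [(0, 1, 0.5)]
```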
【Paper Link】 【Pages】:25-26
【Authors】: Feng Tian ; Tian Lan ; Qinghua Zheng ; Kuo-Ming Chao ; Nick Godwin ; Nazaraf Shah ; Fan Zhang
【Abstract】: There is evidence that an increasing number of enterprises plot together to evade tax in an unperceived way. At the same time, taxation-related data is a classic kind of big data. These issues challenge the effectiveness of traditional data mining-based tax evasion detection methods. To address this problem, we first investigate classic tax evasion cases, and employ a graph-based method to characterize their property that describes two suspicious relationship trails with the same antecedent node behind an Interest-Affiliated Transaction (IAT). Next, we propose a Colored Network-Based Model (CNBM) for characterizing economic behaviors, social relationships and the IATs between taxpayers, and generating a Taxpayer Interest Interacted Network (TPIIN). To accomplish the tax evasion detection task by discovering suspicious groups in a TPIIN, we introduce methods for building a pattern tree and matching component patterns, and present a graph-theoretic analysis of their completeness. Then, we describe an experiment based on real data and a simulated network. The experimental results show that our proposed method greatly improves the efficiency of tax evasion detection, as well as provides a clear explanation of the tax evasion behaviors of taxpayer groups. This is an extended abstract of F. Tian, et al., "Mining suspicious tax evasion groups in big data," Transactions on Knowledge and Data Engineering, vol. 28, no. 10, pp. 2651-2664, Oct. 2016.
【Keywords】: Finance; Companies; Data mining; Color; Big Data; Inspection; Economics
【Paper Link】 【Pages】:27-28
【Authors】: Long Guo ; Dongxiang Zhang ; Gao Cong ; Wei Wu ; Kian-Lee Tan
【Abstract】: We study a novel problem of influence maximization in trajectory databases that is very useful in precise location-aware advertising. It finds k best trajectories to be attached with a given advertisement and maximizes the expected influence among a large group of audience. We show that the problem is NP-hard and propose both exact and approximate solutions to find the best set of trajectories. We also extend our problem to support the scenario when there are a group of advertisements. We validate our approach via extensive experiments with real datasets.
【Keywords】: Trajectory; Databases; Upper bound; Social network services; Q measurement; Advertising; Estimation
【Paper Link】 【Pages】:29-30
【Authors】: Chenyun Yu ; Sarana Nutanong ; Hangyu Li ; Cong Wang ; Xingliang Yuan
【Abstract】: Locality sensitive hashing (LSH) is an efficient method for solving the problem of approximate similarity search in high-dimensional spaces. Through LSH, a high-dimensional similarity join can be processed in the same way as hash join, making the cost of joining two large datasets linear. By judiciously analyzing the properties of multiple LSH algorithms, we propose a generic method to accelerate the process of joining two large datasets using LSH. The crux of our method lies in the way we identify a set of representative points to reduce the number of LSH lookups. Theoretical analyses show that our proposed method can greatly reduce the number of lookup operations and retain the same result accuracy compared to executing LSH lookups for every query point. Furthermore, we demonstrate the generality of our method by showing that the same principle can be applied to LSH algorithms for three different metrics: the Euclidean distance (QALSH), Jaccard similarity measure (MinHash), and Hamming distance (sequence hashing). Results from experimental studies using real datasets confirm our error analyses and show significant improvements of our method over the state-of-the-art LSH method: to achieve over 0.95 recall, we only need to perform LSH lookups for at most 15% of the query points.
【Keywords】: Acceleration; Algorithm design and analysis; Search problems; Approximation algorithms; Hamming distance; Error analysis; Geographic information systems
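As background for the MinHash case mentioned above, here is a self-contained sketch of how MinHash signatures estimate Jaccard similarity. This is a generic illustration of the LSH family, not the paper's representative-point lookup scheme; the salted-hash construction and the example sets are assumptions made for the demo.

```python
import hashlib
import random

def minhash_signature(tokens, num_hashes=128, seed=42):
    # One MinHash value per hash function: the minimum hash over all tokens.
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    signature = []
    for salt in salts:
        signature.append(min(
            int.from_bytes(hashlib.blake2b(f"{salt}:{t}".encode(),
                                           digest_size=8).digest(), "big")
            for t in tokens))
    return signature

def estimated_jaccard(sig_a, sig_b):
    # Fraction of positions where the two signatures agree.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = {"locality", "sensitive", "hashing", "join"}
b = {"locality", "sensitive", "hashing", "search"}
true_jaccard = len(a & b) / len(a | b)   # 3 / 5 = 0.6
print(true_jaccard,
      estimated_jaccard(minhash_signature(a), minhash_signature(b)))
```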
【Paper Link】 【Pages】:31-32
【Authors】: Yu Gu ; Guanli Liu ; Jianzhong Qi ; Hongfei Xu ; Ge Yu ; Rui Zhang
【Abstract】: We study result diversification in continuous spatial query processing and formulate a new type of queries, the moving k diversified nearest neighbor query (MkDNN). Given a moving query object, an MkDNN query maintains continuously the k diversified nearest neighbors of the query object. Here, how diversified the nearest neighbors are is defined based on the distances between the nearest neighbors. We propose an algorithm to maintain incrementally the k diversified nearest neighbors to reduce the costs of continuous query processing. We further propose two approximate algorithms to obtain even higher query efficiency with precision bounds. We verify the effectiveness and efficiency of the proposed algorithms empirically. The results confirm the superiority of the proposed algorithms.
【Keywords】: Approximation algorithms; Nearest neighbor searches; Prefetching; Query processing; Silicon; Conferences; Data engineering
【Paper Link】 【Pages】:33-34
【Authors】: Binbin Gu ; Zhixu Li ; Xiangliang Zhang ; An Liu ; Guanfeng Liu ; Kai Zheng ; Lei Zhao ; Xiaofang Zhou
【Abstract】: Schema Matching (SM) and Record Matching (RM) are two necessary steps in integrating multiple relational tables of different schemas, where SM unifies the schemas and RM detects records referring to the same real-world entity. The two processes have been thoroughly studied separately, but little attention has been paid to the interaction between SM and RM. In this work we find that, even when alternating them in a simple manner, SM and RM can benefit from each other to reach a better integration performance (i.e., in terms of precision and recall). Therefore, combining SM and RM is a promising solution for improving data integration.
【Keywords】: Electronic mail; Semantics; Mobile communication; Data integration; Read only memory; Cameras; Conferences
【Paper Link】 【Pages】:35-36
【Authors】: Dixin Luo ; Hongteng Xu ; Yi Zhen ; Bistra Dilkina ; Hongyuan Zha ; Xiaokang Yang ; Wenjun Zhang
【Abstract】: We explore the learning task of mixtures of Markov chains (MMCs) from aggregate data. Our work demonstrates that although this challenging task is generally intractable because of the identifiability problem, it can be solved approximately by imposing structural constraints on its transition matrices. Specifically, the proposed structural constraints include specifying active state sets corresponding to the chains and adding a series of pairwise sparse regularizers on transition matrices. Based on these two structural constraints, we propose a constrained least-squares method to learn mixtures of Markov chains. We develop a novel iterative algorithm that decomposes the overall problem into a set of convex subproblems and solves each subproblem efficiently. Experimental results on synthetic data show that our learning method converges well and is robust to noise in the data. Moreover, the comparison with state-of-the-art competitors on real-world data further validates the superiority of our method.
【Keywords】: Markov processes; Data aggregation; Aggregates; Data models; Biological system modeling; Analytical models; Estimation
【Paper Link】 【Pages】:37-38
【Authors】: Hongteng Xu ; Weichang Wu ; Shamim Nemati ; Hongyuan Zha
【Abstract】: We focus on an important problem of predicting the so-called "patient flow" from longitudinal electronic health records (EHRs), which has not been explored via existing machine learning techniques. We develop a point process-based framework for modeling patient flow through various care units (CUs) and jointly predicting patients’ destination CUs and duration days. We propose a novel discriminative learning algorithm aiming at improving the prediction of transition events in the case of sparse data. By parameterizing the proposed model as mutually-correcting processes, we formulate the estimation problem via generalized linear models and solve it based on the alternating direction method of multipliers (ADMM). We achieve simultaneous feature selection and learning by adding a group-lasso regularizer to the ADMM algorithm. Additionally, we synthesize auxiliary training data for the classes with extremely few samples, and improve the robustness of our learning method to the problem of data imbalance.
【Keywords】: Medical diagnostic imaging; Convex functions; Predictive models; Prediction algorithms; Training data; Robustness
【Paper Link】 【Pages】:39-40
【Authors】: Guoliang Li ; Jiannan Wang ; Yudian Zheng ; Michael J. Franklin
【Abstract】: Many important data management and analytics tasks cannot be completely addressed by automated processes. These tasks, such as entity resolution, sentiment analysis, and image recognition can be enhanced through the use of human cognitive ability. Crowdsourcing is an effective way to harness the capabilities of people (i.e., the crowd) to apply human computation for such tasks. Thus, crowdsourced data management has become an area of increasing interest in research and industry. We identify three important problems in crowdsourced data management. (1) Quality Control: Workers may return noisy or incorrect results, so effective techniques are required to achieve high quality; (2) Cost Control: The crowd is not free, and cost control aims to reduce the monetary cost; (3) Latency Control: The human workers can be slow, particularly compared to automated computing time scales, so latency-control techniques are required. There has been significant work addressing these three factors for designing crowdsourced tasks, developing crowdsourced data manipulation operators, and optimizing plans consisting of multiple operators. We survey and synthesize a wide spectrum of existing studies on crowdsourced data management.
【Keywords】: Crowdsourcing; Quality control; Noise measurement; Conferences; Image resolution; Image recognition; Time factors
【Paper Link】 【Pages】:41-42
【Authors】: Cheng Chen ; Lan Zheng ; Venkatesh Srinivasan ; Alex Thomo ; Kui Wu ; Anthony Sukow
【Abstract】: Weighted bipartite b-matching (WBM) is one of the fundamental and widely studied problems in combinatorial optimization. An implicit assumption of WBM is that any two nodes on the same side do not conflict with each other, even if they share similar features in practice. For example, a recommender system running WBM can recommend several books of the same subject to a reader, as long as the subject is his/her favorite and the availability constraints of the books are not violated. This, however, does not generate desired results in some real-world scenarios. For book recommendation, a reader may not want all recommended books from the same subject but instead may prefer books of diverse subjects so that more interesting topics can be discovered. However, considering conflicts would inevitably lead to new challenges for WBM when generating the matching result. This conflict challenge has not been fully studied and thus is the focus of this work. In this article, we introduce a new generalization of WBM, Conflict-Aware Weighted Bipartite b-Matching (CA-WBM), that can address the conflict challenges mentioned above.
【Keywords】: Approximation algorithms; Greedy algorithms; Scalability; Computer science; Electronic mail; Manganese; Conferences
【Paper Link】 【Pages】:43-44
【Authors】: Xinpeng Zhang ; Yasuhito Asano ; Masatoshi Yoshikawa
【Abstract】: We address a new multi-user routing problem: mutually beneficial confluent routing (MCR). In the MCR, every user has his/her own source and destination; confluences of user routes occur so that users can mutually benefit from travelling together on the confluences. The idea of gaining benefit from travelling together is valuable in various practical applications, such as ride sharing, delivery routing, and pedestrian navigation. We formulate the MCR as a new combinatorial optimization problem on road networks to find optimum routes. As the main contributions of this paper [see X. Zhang, et al., IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 10, pp. 2681–2696, Oct 2016], we propose exact and efficient algorithms for the MCR for the setting of two or three users. The setting is reasonable for various practical applications. We have the following two difficulties in solving the MCR: (D1) enumerating patterns of how the users meet and part in the MCR; and (D2) searching optimal meeting and parting nodes from numerous nodes. To overcome the difficulty (D1), we ascertain "confluence patterns" of the optimal solutions. Regarding (D2), we propose several efficient node pruning techniques. Experiments on large-scale road networks reveal that our algorithms are sufficiently efficient.
【Keywords】: Routing; Upper bound; Meters; Roads; Partitioning algorithms; Informatics; Optimization
【Paper Link】 【Pages】:45-46
【Authors】: Bo Tang ; Man Lung Yiu ; Kien A. Hua
【Abstract】: In high-dimensional kNN search, both exact and approximate kNN solutions incur considerable time in the candidate refinement phase. In this paper, we investigate a caching solution to reduce the candidate refinement time. Our caching method HC-O is faster than EXACT caching by at least an order of magnitude, on an approximate index (C2LSH). Our work is also applicable to exact indexes (e.g., iDistance, VPtree and VA-file).
【Keywords】: Histograms; Multimedia communication; Nearest neighbor searches; Streaming media; Indexing; Flickr
【Paper Link】 【Pages】:47-48
【Authors】: Sreevani ; C. A. Murthy
【Abstract】: A new strategy for dimensionality reduction is studied, with the aim of providing a reduced set containing both original features and combinations of features. For this purpose, a framework called Minimum Projection error and Minimum Redundancy (MPeMR) is proposed to generate orthogonal compound features (both original and combined) by minimizing both projection error and redundancy among them. An iterative approximation method under the proposed framework is also presented. Using different measures for projection error, redundancy and representative feature selection, three supervised and three unsupervised methods are considered. Experimental results are compared with both state-of-the-art feature selection and extraction methods, in both supervised and unsupervised cases, and the proposed CFG methods outperform them. In each stage of the proposed approximation method, pairs of features are considered, i.e., NPe is calculated based on two features and the removal of redundancy is also based on two features. Instead of (2, 2), one could consider (2, 3), (3, 2), or any other such combination of features. Further research is focused in this direction.
【Keywords】: Feature extraction; Compounds; Redundancy; Principal component analysis; Frequency modulation; Data engineering
【Paper Link】 【Pages】:49-50
【Authors】: Shuang Hao ; Nan Tang ; Guoliang Li ; Jian He ; Na Ta ; Jianhua Feng
【Abstract】: Integrity constraint (IC) based data repairing is typically an iterative process consisting of two parts: detecting and grouping errors that violate given ICs, and modifying values inside each group such that the modified database satisfies those ICs. However, most existing automatic solutions treat the process of detecting and grouping errors straightforwardly (e.g., violations of functional dependencies using string equality), while putting more attention on heuristics of modifying values within each group. In this paper, we propose a revised semantics of violations and data consistency w.r.t. a set of ICs. The revised semantics relies on string similarities, in contrast to traditional methods that use syntactic error detection based on string equality. Along with the revised semantics, we also propose a new cost model to quantify the cost of data repairing by considering distances between strings. We show that the revised semantics provides a significant change for better detecting and grouping errors, which in turn improves both precision and recall of the following data repairing step. We prove that finding minimum-cost repairs in the new model is NP-hard, even for a single FD. We devise efficient algorithms to find approximate repairs.
【Keywords】: Maintenance engineering; Databases; Semantics; Data models; Approximation algorithms; Integrated circuits; Cleaning
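To illustrate the similarity-based (rather than equality-based) view of violations described above, the toy sketch below groups apparent violations of a functional dependency zip → city by string similarity, using Python's stdlib difflib as a stand-in for the paper's distance functions; the rows and the threshold are invented for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Functional dependency: zip -> city.  Under string equality both differing
# values are plain violations; under a similarity-based semantics the near
# duplicate is grouped with the dominant value as a likely typo.
rows = [
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicagoo"},     # near duplicate: likely a typo
    {"zip": "60601", "city": "Springfield"},  # dissimilar: a genuine conflict
]

groups = defaultdict(list)
for row in rows:
    groups[row["zip"]].append(row["city"])

for zip_code, cities in groups.items():
    base = cities[0]
    typos = [c for c in cities[1:] if similar(base, c)]
    conflicts = [c for c in cities[1:] if not similar(base, c)]
    print(zip_code, "likely typos:", typos, "| genuine conflicts:", conflicts)
```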
【Paper Link】 【Pages】:51-52
【Authors】: Mahsa Salehi ; Christopher Leckie ; James C. Bezdek ; Tharshan Vaithianathan ; Xuyun Zhang
【Abstract】: Outlier detection is an important task in data mining. With the growing need to analyze high speed data streams, the task of outlier detection becomes even more challenging as traditional outlier detection techniques can no longer assume that all the data can be stored for processing. While the well-known Local Outlier Factor (LOF) algorithm has an incremental version (called iLOF), it assumes unbounded memory to keep all previous data points. In this paper, we propose a memory-efficient incremental local outlier (MiLOF) detection algorithm for data streams, and a more flexible version (MiLOF_F), both of which achieve accuracy close to iLOF within a fixed memory bound. In addition, MiLOF_F is robust to changes in the number of data points, underlying clusters, and dimensions in the data stream.
【Keywords】: Memory management; Clustering algorithms; Electronic mail; Time complexity; Prototypes; Australia; Detection algorithms
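For reference, the Local Outlier Factor that iLOF/MiLOF maintain incrementally can be computed from scratch as below. This batch sketch only makes the scored quantity concrete; it says nothing about the memory-bounded summarization in MiLOF, and the tiny 2-D dataset is invented for illustration.

```python
from math import dist

def lof_scores(points, k=2):
    # Batch Local Outlier Factor: scores well above 1 indicate outliers.
    n = len(points)
    knn, kdist = [], []
    for i in range(n):
        d = sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)
        knn.append([j for _, j in d[:k]])   # indices of the k nearest neighbors
        kdist.append(d[k - 1][0])           # k-distance of point i

    def reach_dist(i, j):
        # Reachability distance of point i from its neighbor j.
        return max(kdist[j], dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in knn[i]) for i in range(n)]
    # LOF: average ratio of the neighbors' densities to the point's own density.
    return [sum(lrd[j] for j in knn[i]) / (k * lrd[i]) for i in range(n)]

data = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]   # (8, 8) lies far from the cluster
for point, score in zip(data, lof_scores(data)):
    print(point, round(score, 2))
```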
【Paper Link】 【Pages】:53-54
【Authors】: Yunjun Gao ; Qing Liu ; Gang Chen ; Linlin Zhou ; Baihua Zheng
【Abstract】: This paper explores the causality and responsibility problem (CRP) for the non-answers to probabilistic reverse skyline queries (PRSQ). Towards this, we propose an efficient algorithm called CP to compute the causality and responsibility for the non-answers to PRSQ. CP first finds candidate causes, and then, it performs verification to obtain actual causes with their responsibilities, during which several strategies are used to boost efficiency. Extensive experiments using both real and synthetic data sets demonstrate the effectiveness and efficiency of the presented algorithms.
【Keywords】: Probabilistic logic; Algorithm design and analysis; Databases; Tools; Filtering algorithms; Computational modeling; Conferences
【Paper Link】 【Pages】:55-56
【Authors】: Xiaoyang Wang ; Ying Zhang ; Wenjie Zhang ; Xuemin Lin ; Chen Chen
【Abstract】: Given a positive integer k, a social network G and a certain propagation model M, influence maximization aims to find a set of k nodes that has the largest influence spread. The state-of-the-art method IMM is based on the reverse influence sampling (RIS) framework. By using the martingale technique, it greatly outperforms the previous methods in efficiency. However, IMM still has limitations in scalability due to the high overhead of deciding a tight sample size. In this paper, instead of spending the effort on deciding a tight sample size, we present a novel bottom-k sketch-based RIS framework, namely BKRIS, which brings the order of samples into the RIS framework. By applying the sketch technique, we can derive early termination conditions to significantly accelerate the seed set selection procedure. Moreover, we provide several optimization techniques to reduce the cost of generating and processing samples. Finally, we conduct experiments over 10 real social networks to demonstrate the efficiency and effectiveness of the proposed method. Further details are reported in [1].
【Keywords】: Integrated circuit modeling; Social network services; Scalability; Acceleration; Optimization; Mathematical model
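As background for the sketch technique mentioned above: a bottom-k sketch keeps only the k smallest hash values of a set, from which the set's cardinality can be estimated. The snippet below is a minimal, generic illustration of that primitive (not the BKRIS algorithm); the hash construction and the example items are assumptions for the demo.

```python
import hashlib

def uniform_hash(item):
    # Hash an item to a pseudo-uniform float in (0, 1].
    digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
    return (int.from_bytes(digest, "big") + 1) / 2 ** 64

def bottom_k(items, k):
    # Keep only the k smallest hash values of the (distinct) items.
    return sorted(uniform_hash(x) for x in set(items))[:k]

def estimate_cardinality(sketch, k):
    # If the k-th smallest of n uniform draws is v, then n is roughly (k - 1) / v.
    if len(sketch) < k:
        return len(sketch)          # small set: the sketch stores it exactly
    return (k - 1) / sketch[-1]

items = [f"user{i}" for i in range(10000)]
sketch = bottom_k(items, k=256)
print("estimated distinct items:", round(estimate_cardinality(sketch, 256)))
```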
【Paper Link】 【Pages】:57-58
【Authors】: Linhong Zhu ; Dong Guo ; Junming Yin ; Greg Ver Steeg ; Aram Galstyan
【Abstract】: We propose to model dependence within a network view using the temporal latent space model, which uses time-dependent low-dimensional geometric projections to represent the high-dimensional dependence structure in time-varying networks. Once we obtain the low-dimensional temporal latent space representation for graphs from time 1 to t, we can accurately predict future links in time t + 1 (i.e., Gt+1). We present a global optimization algorithm to effectively infer the temporal latent space using block coordinate gradient descent (BCGD). We further introduce two new variants of BCGD: a local BCGD algorithm and an incremental BCGD algorithm, to scale the inference algorithm to massive networks.
【Keywords】: Inference algorithms; Prediction algorithms; Facebook; Heuristic algorithms; Conferences; Feathers
【Paper Link】 【Pages】:59-60
【Authors】: Shuo Shang ; Lisi Chen ; Zhewei Wei ; Christian S. Jensen ; Ji-Rong Wen ; Panos Kalnis
【Abstract】: We propose and investigate a novel query, the Collective Travel Planning (CTP) query, that finds the lowest-cost route connecting multiple query sources and a destination via at most k meeting points. This type of query is useful in organizing large events, and it can bring significant benefits to society and the environment: it can help optimize the allocation of transportation resources, reduce resource consumption, and enable smarter and greener transportation; and it can help reduce greenhouse-gas emissions and traffic congestion.
【Keywords】: Approximation algorithms; Planning; Time complexity; Spatial databases; Computer science; Measurement; Transportation
【Paper Link】 【Pages】:61-62
【Authors】: Shuai Ma ; Kaiyu Feng ; Jianxin Li ; Haixun Wang ; Gao Cong ; Jinpeng Huai
【Abstract】: This study investigates a light-weight data reduction technique for speeding-up shortest path and distance queries on large graphs. We propose a notion of routing proxies (or simply proxies), each of which represents a small subgraph, referred to as deterministic routing areas (DRAs). We show that routing proxies hold good properties for speeding-up shortest path and distance queries, and there exists a linear-time algorithm to compute routing proxies and their corresponding DRAs. Finally, we discuss the experimental results, and verify that our solution is a general technique for reducing graph sizes and speeding-up shortest path and distance queries, using real-life large graphs.
【Keywords】: Routing; Gold; Roads; Partitioning algorithms; Conferences; Data engineering; Technological innovation
【Paper Link】 【Pages】:63-64
【Authors】: Guillaume Bagan ; Angela Bonifati ; Radu Ciucanu ; George H. L. Fletcher ; Aurélien Lemay ; Nicky Advokaat
【Abstract】: Massive graph data sets are pervasive in contemporary application domains. Hence, graph database systems are becoming increasingly important. In the experimental study of these systems, it is vital that the research community has shared solutions for the generation of database instances and query workloads having predictable and controllable properties. We present the design and engineering principles of gMark, a domain- and query language-independent graph instance and query workload generator. A core contribution of gMark is its ability to target and control the diversity of properties of both the generated instances and the generated workloads coupled to these instances. Further novelties include support for regular path queries, a fundamental graph query paradigm, and schema-driven selectivity estimation of queries, a key feature in controlling workload chokepoints. We illustrate the flexibility and practical usability of gMark by showcasing the framework’s capabilities in generating high quality graphs and workloads, and its ability to encode user-defined schemas across a variety of application domains.
【Keywords】: Benchmark testing; Query processing; Generators; Database languages; Estimation
【Paper Link】 【Pages】:65-66
【Authors】: Yoones A. Sekhavat ; Jeffrey Parsons
【Abstract】: Data exchange is the process of generating an instance of a target schema from an instance of a source schema such that source data is reflected in the target. The prevailing approach for data exchange is based on schema mappings, which are high-level expressions that describe relationships between database schemas [1]. However, schema-mapping based data exchange techniques suffer from two problems: (1) entity fragmentation, in which information about a single entity is spread across several tuples in the target schema, and (2) ambiguity in generalization, in which incorrect mappings result from using different methods to represent entity type generalization in source and target schemas. In this paper, we propose the Scalable Entity Preserving Data Exchange (SEDEX) method, which combines schema-level and data-level information to address these problems. We also provide extensive evaluation to demonstrate the benefits and scalability of the approach.
【Keywords】: Scalability; Electronic mail; Size measurement; Conferences; Data engineering; Multimedia communication; Art
【Paper Link】 【Pages】:67-68
【Authors】: Yanfeng Zhang ; Shimin Chen ; Ge Yu
【Abstract】: Density Peaks (DP) is a recently proposed clustering algorithm that has distinctive advantages over existing clustering algorithms. It has already been used in a wide range of applications. However, DP requires computing the distance between every pair of input points, therefore incurring quadratic computation overhead, which is prohibitive for large data sets. In this paper, we propose an efficient distributed algorithm LSH-DDP, which is an approximate algorithm that exploits Locality Sensitive Hashing. We present formal analysis of LSH-DDP, and show that the approximation quality and the runtime can be controlled by tuning the parameters of LSH-DDP. Experimental results on both a local cluster and EC2 show that LSH-DDP achieves a factor of 1.7-70x speedup over the naïve distributed DP implementation and 2x speedup over the state-of-the-art EDDPC approach, while returning comparable cluster results.
【Keywords】: Clustering algorithms; Approximation algorithms; Layout; Runtime; Partitioning algorithms; Distributed databases; Computer architecture
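For context, the two per-point quantities that Density Peaks computes, the local density rho and the distance delta to the nearest denser point, are exactly what makes the naive algorithm quadratic. A brute-force reference computation is sketched below (with an invented toy dataset, a simple index-based tie-break, and no relation to the LSH-DDP approximation itself).

```python
from math import dist

def density_peaks(points, cutoff):
    # Brute-force rho/delta computation of Density Peaks (O(n^2)).
    n = len(points)
    # rho: number of points within the cutoff distance.
    rho = [sum(1 for j in range(n)
               if j != i and dist(points[i], points[j]) < cutoff)
           for i in range(n)]
    # delta: distance to the nearest point of higher density
    # (ties broken by index; the densest point gets its maximum distance).
    delta = []
    for i in range(n):
        denser = [dist(points[i], points[j]) for j in range(n)
                  if rho[j] > rho[i] or (rho[j] == rho[i] and j < i)]
        delta.append(min(denser) if denser else
                     max(dist(points[i], points[j]) for j in range(n) if j != i))
    return rho, delta

pts = [(0, 0), (0.1, 0.1), (0.2, 0), (5, 5), (5.1, 5.0), (9, 0)]
rho, delta = density_peaks(pts, cutoff=0.5)
for p, r, d in zip(pts, rho, delta):
    print(p, "rho =", r, "delta =", round(d, 2))  # centers: high rho and high delta
```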
【Paper Link】 【Pages】:69-70
【Authors】: Junbeom Hur ; Dongyoung Koo ; Young-joo Shin ; Kyungtae Kang
【Abstract】: In cloud services, deduplication technology is commonly used to reduce the space and bandwidth requirements of services by eliminating redundant data and storing only a single copy. Deduplication is most effective when multiple users outsource the same data to the cloud storage, but it raises issues relating to security and ownership. Proof-of-ownership schemes allow any owner of the same data to prove to the cloud storage server that he owns the data in a robust way. However, if encrypted data is outsourced into the cloud storage and the ownership changes dynamically, deduplication would be hampered. Thus, we propose a secure deduplication scheme that supports dynamic ownership management based on randomized convergent encryption [3] in this study.
【Keywords】: Cloud computing; Encryption; Servers; Data privacy; Computer science
【Paper Link】 【Pages】:71-72
【Authors】: Libin Zheng ; Lei Chen
【Abstract】: With the rapid development of mobile networks and the widespread usage of mobile devices, spatial crowdsourcing [1], which outsources location-related tasks to moving workers, has drawn increasing attention. Spatial crowdsourcing differs from traditional crowdsourcing in that tasks released by a requester are location-related, and workers should physically travel to a specific spot to perform the task. Task assignment is an important issue in spatial crowdsourcing. Most existing works adopt the SAT (Server Assigned Tasks) assignment mode [1], where the SC-server assigns tasks to workers at regular timestamps. However, existing works [1], [2], [3] on this mode are based on the assumption that no rejection would happen after the assignment and that workers are guaranteed to perform their assigned tasks. In practice, a rejection would happen when the worker is not interested in the assigned tasks, which may result from long travel distances, critical time requirements, unsatisfactory payment, workers' lack of skills, etc. In Uber, a typical spatial crowdsourcing application, workers (drivers) are allowed to reject the ride requests assigned by the server. Uber requires its drivers to accept at least 90 percent of the trip requests in order to avoid low system throughput and poor experience for riders [4]. Taking workers' rejections into consideration, how to match workers with the right tasks in spatial crowdsourcing so as to maximize their acceptance is of great importance. Without consideration of workers' acceptance, the system's throughput cannot be guaranteed.
【Keywords】: Crowdsourcing; Approximation methods; Servers; Time complexity; Throughput; Greedy algorithms; Conferences
【Paper Link】 【Pages】:75-78
【Authors】: Hongzhi Yin ; Liang Chen ; Weiqing Wang ; Xingzhong Du ; Nguyen Quoc Viet Hung ; Xiaofang Zhou
【Abstract】: With the rapid prevalence of smart mobile devices and the dramatic proliferation of mobile applications (Apps), App recommendation becomes an emergent task that will benefit different stakeholders of mobile App ecosystems. Unlike traditional items, Apps have privileges to access a user's sensitive resources (e.g., contacts, messages and locations) which may lead to security risk or privacy leak. Thus, users' choice of Apps is influenced not only by their personal interests but also by their privacy preferences. Moreover, user privacy preferences vary with App categories. In this paper, we propose a mobile sparse additive generative model (Mobi-SAGE) to recommend Apps by considering both user interests and category-aware user privacy preferences. We collected a real-world dataset from the 360 App store, the biggest Android App platform in China, and conducted extensive experiments on it. The experimental results show that our Mobi-SAGE consistently and significantly outperforms the state-of-the-art approaches, which implies the importance of exploiting category-aware user privacy preferences.
【Keywords】: Privacy; Mobile communication; Visualization; Data privacy; Additives; Smart phones
【Paper Link】 【Pages】:79-82
【Authors】: Zhao Kang ; Chong Peng ; Qiang Cheng
【Abstract】: Construction of a reliable similarity matrix is fundamental for graph-based clustering methods. However, most of the current work is built upon some simple manifold structure, whereas limited work has been conducted on nonlinear data sets where data reside in a union of manifolds rather than a union of subspaces. Therefore, we construct a similarity graph to capture both global and local manifold structures of the input data set. The global structure is exploited based on the self-expressive property of data in an implicit feature space using kernel methods. Since the similarity graph computation is independent of the subsequent clustering, the final results may be far from optimal. To overcome this limitation, we simultaneously learn similarity graph and clustering structure in a principled way. Experimental studies demonstrate that our proposed algorithms deliver consistently superior results to other state-of-the-art algorithms.
【Keywords】: Manifolds; Kernel; Clustering algorithms; Periodic structures; Data models; Adaptation models; Noise measurement
【Paper Link】 【Pages】:83-86
【Authors】: Xiang Ao ; Ping Luo ; Jin Wang ; Fuzhen Zhuang ; Qing He
【Abstract】: Episode Rule Mining is a popular framework for discovering sequential rules from event sequential data. However, traditional episode rule mining methods only tell that the consequent event is likely to happen within a given time interval after the occurrence of the antecedent events. As a result, they cannot satisfy the requirement of many time-sensitive applications, such as program security trading, due to the lack of a fine-grained response time. In this study, we come up with the concept of fixed-gap episode to address this problem. A fixed-gap episode consists of an ordered set of events where the elapsed time between any two consecutive events is a constant. Based on this concept, we formulate the problem of mining precise-positioning episode rules in which the occurrence time of each event in the consequent is clearly specified. In addition, we develop a trie-based data structure to mine such precise-positioning episode rules, with several pruning strategies incorporated for improving performance as well as reducing memory consumption. Experimental results on real datasets show the superiority of our proposed algorithms.
【Keywords】: Data mining; Finite element analysis; Security; Time factors; Data structures; Registers; Partitioning algorithms
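To make the notion of a fixed-gap episode concrete, the toy sketch below scans an event sequence for occurrences of an episode whose inter-event gaps are fixed constants. It is illustrative only; the paper's trie-based miner and its pruning strategies are not reproduced, and the event stream and rule are invented.

```python
def occurrences(sequence, episode):
    """Start times at which a fixed-gap episode occurs.

    sequence: list of (timestamp, event) pairs
    episode:  [(0, "A"), (2, "B"), (5, "C")] means B exactly 2 time units
              after A, and C exactly 5 time units after A
    """
    events_at = set(sequence)
    starts = []
    for t0, e0 in sequence:
        if e0 != episode[0][1]:
            continue
        if all((t0 + gap, ev) in events_at for gap, ev in episode[1:]):
            starts.append(t0)
    return starts

# Event stream: (timestamp, event type) pairs.
stream = [(1, "A"), (3, "B"), (4, "A"), (6, "C"), (6, "B"), (9, "C")]
rule = [(0, "A"), (2, "B"), (5, "C")]    # antecedent A and B@+2, consequent C@+5
print(occurrences(stream, rule))         # -> [1, 4]
```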
【Paper Link】 【Pages】:87-90
【Authors】: Shubhadip Mitra ; Priya Saraf ; Richa Sharma ; Arnab Bhattacharya ; Sayan Ranu ; Harsh Bhandari
【Abstract】: Optimal location queries identify the best locations to set up new facilities for providing service to their users. For several businesses such as fuel stations, cellphone base-stations, etc., placement queries require taking into account the mobility patterns (or trajectories) of the users. In this work, we formulate the TOPS (Trajectory-Aware Optimal Placement of Services) query that locates the best k sites on a road network for the prevailing user trajectories. The problem is NP-hard. The greedy approach, which is the state-of-the-art technique for this problem, is neither scalable nor practical for real urban-scale scenarios, primarily due to its high memory footprint beyond the capabilities of commodity machines. To overcome these challenges, we develop an indexing framework called NETCLUS that derives its power through a unique combination of FM sketches with network clustering. Empirical studies show that NETCLUS requires less than 100 s to answer the TOPS query on real datasets comprising more than 250,000 sites and 120,000 trajectories.
【Keywords】: Trajectory; Roads; Frequency modulation; Indexes; Fuels; Global Positioning System; Memory management
【Paper Link】 【Pages】:91-94
【Authors】: Gang Hu ; Jie Shao ; Dongxiang Zhang ; Yang Yang ; Heng Tao Shen
【Abstract】: Locality sensitive hashing (LSH) and its variants are widely used for approximate kNN (k nearest neighbor) search in high-dimensional space. The success of these techniques largely depends on the ability of preserving kNN information. Unfortunately, LSH only provides a high probability that nearby points in the original space are projected into nearby regions in a new space. This potentially produces many false positives and false negatives resulting from unrelated points. Many extensions of LSH aim to alleviate the above issue by improving the distance preserving ability. In this paper, rather than improving the LSH function itself, we propose a novel idea to enhance the performance by transforming the original data to a new space before applying LSH. A preserving-ignoring transformation (PIT) function satisfying some rigorous conditions can be used to convert original points to an interim space with strict distance preserving-ignoring capacity. Based on this property, a linear order is utilized to build an efficient index structure in the interim space. Finally, LSH can be applied to the candidate set retrieved by our index structure to obtain the final results. Experiments are conducted and the proposed approach performs better than state-of-the-art methods SK-LSH, DSH and NSH in terms of both accuracy and efficiency.
【Keywords】: Compounds; Indexes; Visualization; Buildings; Transforms; Measurement
【Paper Link】 【Pages】:95-98
【Authors】: Lijun Chang ; Chen Zhang ; Xuemin Lin ; Lu Qin
【Abstract】: This paper studies the problem of top-k structural diversity search, which is to compute the k users with the highest structural diversity, measured by the number of connected components in the neighborhood of a user. As the existing algorithms are not scalable for processing large graphs due to their limits, in this paper we propose a scalable algorithm Div-TriE to improve the efficiency. Div-TriE has two key features compared with the existing algorithms. Firstly, we show that as a key building block, we only need to enumerate each triangle at most once in Div-TriE, in contrast to up to three times in the existing techniques. Secondly, we develop efficient techniques so that the computation against each enumerated triangle is (amortized) constant, in contrast to the non-constant costs of the existing techniques. Extensive experimental results on real graphs show that Div-TriE outperforms the existing techniques by one order of magnitude.
【Keywords】: Upper bound; Time complexity; Search problems; Algorithm design and analysis; Diversity reception; Data structures; Australia
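The structural diversity being ranked here is the number of connected components in the subgraph induced by a user's neighborhood. A direct BFS-based computation of that measure (not the Div-TriE algorithm, and with an invented toy graph) looks like this:

```python
from collections import defaultdict, deque

def structural_diversity(adj, u):
    # Number of connected components in the subgraph induced by u's neighbors.
    neighbors = adj[u]
    seen, components = set(), 0
    for start in neighbors:
        if start in seen:
            continue
        components += 1
        queue = deque([start])            # BFS restricted to the neighborhood
        seen.add(start)
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w in neighbors and w not in seen:
                    seen.add(w)
                    queue.append(w)
    return components

edges = [("u", "a"), ("u", "b"), ("u", "c"), ("u", "d"), ("a", "b"), ("c", "d")]
adj = defaultdict(set)
for x, y in edges:
    adj[x].add(y)
    adj[y].add(x)
print(structural_diversity(adj, "u"))   # neighbors {a,b,c,d} form {a,b} and {c,d}: 2
```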
【Paper Link】 【Pages】:99-102
【Authors】: Anuradha Awasthi ; Arnab Bhattacharya ; Sanchit Gupta ; Ujjwal Kumar Singh
【Abstract】: Skyline queries enable multi-criteria optimization by filtering objects that are worse in all the attributes of interest than another object. To handle the large answer set of skyline queries in high-dimensional datasets, the concept of k-dominance was proposed where an object is said to dominate another object if it is better in at least k attributes. However, many practical applications, such as flights having multiple stopovers, require that the preferences are applied on a joined relation. In this paper, we extend the k-dominant skyline queries to work on joined relations. We name such queries KSJQ (k-dominant skyline join queries). We show how pre-processing the base relations helps in making such queries efficient. We also extend the query to handle cases where the skyline preference is on aggregated values in the joined relation (such as total cost of the multiple legs of the flight). In addition, we devise efficient algorithms to choose the value of k based on the desired cardinality of the skyline set. Experiments demonstrate the efficiency and scalability of our algorithms.
【Keywords】: Urban areas; Aggregates; Optimization; Legged locomotion; Conferences; Data engineering; Computer science
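As a reminder of the underlying notion from the abstract above, one common formalization of k-dominance (assuming smaller attribute values are better) and the resulting k-dominant skyline can be sketched as follows; the flight tuples are invented for illustration and this is not the KSJQ join algorithm itself.

```python
def k_dominates(p, q, k):
    # p k-dominates q if p is no worse than q in at least k attributes
    # and strictly better in at least one of them (smaller = better here).
    no_worse = [i for i in range(len(p)) if p[i] <= q[i]]
    return len(no_worse) >= k and any(p[i] < q[i] for i in no_worse)

def k_dominant_skyline(points, k):
    # Points not k-dominated by any other point.
    return [p for i, p in enumerate(points)
            if not any(k_dominates(q, p, k)
                       for j, q in enumerate(points) if j != i)]

# Hypothetical joined flight tuples: (price, duration, stopovers, delay).
flights = [(300, 5, 1, 10), (320, 4, 0, 15), (280, 6, 2, 5), (400, 7, 2, 20)]
print(k_dominant_skyline(flights, k=3))   # (400, 7, 2, 20) is 3-dominated and dropped
```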
【Paper Link】 【Pages】:103-106
【Authors】: Tong Yang ; Lingtong Liu ; Yibo Yan ; Muhammad Shahzad ; Yulong Shen ; Xiaoming Li ; Bin Cui ; Gaogang Xie
【Abstract】: A sketch is a probabilistic data structure that is used to record frequencies of items in a multi-set. Sketches have been applied in a variety of fields, such as data stream processing, natural language processing, distributed data sets etc. In this paper, we propose a new sketch, called Slim-Fat (SF) sketch, which has a much smaller memory footprint for query while supporting updates. The key idea behind our proposed SF-sketch is to maintain two separate sketches: a small sketch called Slim-subsketch and a large sketch called Fat-subsketch. The Slim-subsketch enables fast and accurate querying. The Fat-subsketch is used to assist the insertion and deletion from Slim-subsketch. We implemented and evaluated SF-sketch along with several prior sketches and compared them side by side. Our experimental results show that SF-sketch significantly outperforms the most commonly used CM-sketch in terms of accuracy. The full version is provided at arXiv.org [12].
【Keywords】: Radiation detectors; Random access memory; Data structures; Fats; Probabilistic logic; Distributed databases
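For context, the CM-sketch that SF-sketch is compared against answers frequency queries with several hash rows of counters. Below is a compact reference implementation of plain Count-Min (not of the Slim/Fat subsketch pair); the hash construction, parameters, and sample words are assumptions for the demo.

```python
import hashlib

class CountMinSketch:
    """Plain Count-Min sketch: depth rows of width counters; frequency queries
    return an overestimate of the true count."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for row in range(self.depth):
            digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.rows[row][col] += count

    def query(self, item):
        # The minimum over all rows bounds the true count from above.
        return min(self.rows[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for word in ["flink", "spark", "flink", "hive", "flink"]:
    cms.add(word)
print(cms.query("flink"), cms.query("spark"), cms.query("storm"))   # 3 1 0
```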
【Paper Link】 【Pages】:107-110
【Authors】: Lei Chen ; Yafei Li ; Jianliang Xu ; Christian S. Jensen
【Abstract】: With the continued proliferation of location-based services, a growing number of web-accessible data objects are geotagged and have text descriptions. An important query over such web objects is the direction-aware spatial keyword query that aims to retrieve the top-k objects that best match query parameters in terms of spatial distance and textual similarity in a given query direction. In some cases, it can be difficult for users to specify appropriate query parameters. After getting a query result, users may find some desired objects are unexpectedly missing and may therefore question the entire result. Enabling why-not questions in this setting may aid users to retrieve better results, thus improving the overall utility of the query functionality. This paper studies the direction-aware why-not spatial keyword top-k query problem. We propose efficient query refinement techniques to revive missing objects by minimally modifying users' direction-aware queries. Experimental studies demonstrate the efficiency and effectiveness of the proposed techniques.
【Keywords】: Search problems; Legged locomotion; Computer science; Spatial databases; Computational modeling; Q measurement; Computer aided software engineering
【Paper Link】 【Pages】:111-114
【Authors】: Yilin Shen ; Yanping Chen ; Eamonn J. Keogh ; Hongxia Jin
【Abstract】: Similarity search is arguably the most important primitive in time series data mining. Recent research has made significant progress on fast algorithms for time series similarity search under Dynamic Time Warping (DTW) and Uniform Scaling (US) distance measures. However, the current state-of-the-art algorithms cannot support greater amounts of rescaling in many practical applications. In this paper, we introduce a novel lower bound, LBnew, to allow efficient search even in domains that exhibit more than a factor-of-two variability in scale. The effectiveness of our idea is validated on various large-scale real datasets from commercially important domains.
【Keywords】: Lower Bounds; Time Series Analytics; Dynamic Time Warping; Uniform Scaling
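Envelope-based lower bounds such as LB_Keogh are the standard primitive that bounds of this kind build on. A minimal version under a fixed warping band is sketched below, purely as background; LBnew itself, which tolerates more than factor-of-two rescaling, is defined in the paper, and the series here are invented.

```python
def envelope(series, r):
    # Upper and lower envelopes of a series under a warping band of radius r.
    n = len(series)
    upper = [max(series[max(0, i - r): i + r + 1]) for i in range(n)]
    lower = [min(series[max(0, i - r): i + r + 1]) for i in range(n)]
    return upper, lower

def lb_keogh(query, candidate, r=2):
    # Cheap lower bound on DTW(query, candidate): sum of squared violations of
    # the candidate's envelope.  If it exceeds the best-so-far distance, the
    # expensive DTW computation for this candidate can be skipped.
    upper, lower = envelope(candidate, r)
    total = 0.0
    for q, u, l in zip(query, upper, lower):
        if q > u:
            total += (q - u) ** 2
        elif q < l:
            total += (q - l) ** 2
    return total

q = [1.0, 2.0, 3.0, 2.0, 1.0]
c = [1.1, 1.9, 3.2, 2.1, 0.9]
print(lb_keogh(q, c))   # small value: this candidate cannot be pruned yet
```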
【Paper Link】 【Pages】:115-118
【Authors】: Shengli Sun ; Yimo Wang ; Weilong Liao ; Wei Wang
【Abstract】: Maximal Clique Enumeration (MCE) is a long standing problem in the database community. Though it is extensively studied, almost all research focuses on calculating maximal cliques as a one-time effort. MCE on dynamic graphs has rarely been discussed so far; the only work on this topic maintains maximal cliques as the graph evolves. The key to this problem is to find the maximal cliques that contain the vertices incident to the inserted edge when an edge insertion happens. Up to O(W^2) candidates are generated in the prior method based on the Cartesian product, and the overall complexity is O(W^2 n^2), where n and W denote the number of vertices and maximal cliques of the graph, respectively. Besides, maximality verification of candidates is conducted frequently by global search. Changes of maximal cliques induced by graph updates exhibit locality. We propose a novel local construction strategy to generate candidates based on a linear scan; the number of candidates is reduced to O(W), and the overall complexity is then reduced to O(W n^2). Furthermore, we present heuristics to reduce the cost incurred by maximality verification. Theoretical analysis and experiments on real graphs indicate that our proposals are effective and efficient.
【Keywords】: Maximal Clique Enumeration; Dynamic Graphs; Local Construction; Maximality Verification
【Paper Link】 【Pages】:119-122
【Authors】: Zhongle Xie ; Qingchao Cai ; H. V. Jagadish ; Beng Chin Ooi ; Weng-Fai Wong
【Abstract】: Due to the coarse granularity of data accesses and the heavy use of latches, indices in the B-tree family are not efficient for in-memory databases, especially in the context of today's multi-core architecture. In this paper, we study the parallelizability of skip lists for the parallel and concurrent environment, and present PSL, a Parallel in-memory Skip List that lends itself naturally to the multi-core environment, particularly with non-uniform memory access. For each query, PSL traverses the index in a breadth-first-search (BFS) manner to find the list node with the matching key, and exploits SIMD processing to speed up this process. Furthermore, PSL distributes incoming queries among multiple execution threads disjointly and uniformly to eliminate the use of latches and achieve high parallelizability. The experimental results show that PSL is comparable to a read-only index, FAST, in terms of read performance, and outperforms ART and Masstree respectively by up to 30% and 5x for a variety of workloads.
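For reference, the sketch below shows a plain sequential skip list with probabilistic levels, the baseline structure that PSL builds upon; PSL's latch-free query distribution, BFS-style traversal, and SIMD acceleration are not reproduced here.

    # A minimal sketch of a standard (sequential) skip list: insert and lookup.
    import random

    class Node:
        def __init__(self, key, height):
            self.key = key
            self.next = [None] * height        # forward pointers, one per level

    class SkipList:
        MAX_LEVEL = 16

        def __init__(self):
            self.head = Node(None, self.MAX_LEVEL)

        def _random_height(self):
            h = 1
            while h < self.MAX_LEVEL and random.random() < 0.5:
                h += 1
            return h

        def insert(self, key):
            update = [self.head] * self.MAX_LEVEL
            node = self.head
            for level in range(self.MAX_LEVEL - 1, -1, -1):
                while node.next[level] and node.next[level].key < key:
                    node = node.next[level]
                update[level] = node            # last node before `key` at this level
            new = Node(key, self._random_height())
            for level in range(len(new.next)):
                new.next[level] = update[level].next[level]
                update[level].next[level] = new

        def search(self, key):
            node = self.head
            for level in range(self.MAX_LEVEL - 1, -1, -1):
                while node.next[level] and node.next[level].key < key:
                    node = node.next[level]
            node = node.next[0]
            return node is not None and node.key == key

    sl = SkipList()
    for k in [3, 1, 4, 5]:
        sl.insert(k)
    print(sl.search(4), sl.search(2))   # True False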
【Keywords】: Indexes; Latches; Instruction sets; Query processing; Routing; Subspace constraints
【Paper Link】 【Pages】:123-126
【Authors】: Long Guo ; Dongxiang Zhang ; Huayu Wu ; Bin Cui ; Kian-Lee Tan
【Abstract】: User-generated trajectories (UGT), such as GPS footprints from wearable devices or travel records from bus companies, capture rich information of human mobility and urban dynamics in the offline world. In this paper, our objective is to enrich these raw footprints and discover the users' personal interests by utilizing the semantic information contained in the spatial- and temporal-aware user-generated contents (STUGC) published in the online world. We design a novel probabilistic framework named CO2 to connect the offline world with the online world in order to discover the users' interests directly from their raw footprints in UGT. In particular, we first propose a latent probabilistic generative model named STLDA to infer the intention attached to each trip, and then aggregate the extracted trip intentions to discover the users' personal interests. To tackle the inherent sparsity and noisiness problems of the tags in STUGC, STLDA considers the inner correlation between tags (i.e., semantic, spatial and temporal correlation) on the topic level. To evaluate the effectiveness of CO2, we utilize a dataset containing three months of data with 5.3 billion bus records and a Twitter dataset with 1.5 million tweets published in 6 months in Singapore as a case study. Experimental results on these two real-world datasets show that CO2 is effective in discovering user interests and improves the precision of the state-of-the-art method by 280%. In addition, we also conduct a questionnaire survey in Singapore to evaluate the effectiveness of CO2. The results further validate the superiority of CO2.
【Keywords】: Semantics; Probabilistic logic; Trajectory; Correlation; Bars; Noise measurement; Bridges
【Paper Link】 【Pages】:127-130
【Authors】: Yunfan Chen ; Lei Chen ; Chen Jason Zhang
【Abstract】: Data fusion has played an important role in data mining because high quality data is required in a lot of applications. As on-line data may be out-of-date and errors in the data may propagate with copying and referring between sources, it is hard to achieve satisfying results by merely applying existing data fusion methods to fuse Web data. In this paper, we make use of the crowd to achieve a high quality data fusion result. We design a framework that selects a set of tasks to ask the crowd in order to improve the confidence of data. Since data are correlated and the crowd may provide incorrect answers, how to select a proper set of tasks to ask the crowd is a very challenging problem. In this paper, we design an approximation solution to address this challenge since we prove that the problem is NP-hard. To further improve the efficiency, we design a pruning strategy and a preprocessing method, which effectively improve the performance of the proposed approximation solution. We verify the solutions with extensive experiments on a real crowdsourcing platform.
【Keywords】: Data integration; Approximation algorithms; Crowdsourcing; Data models; Clustering algorithms; Entropy; Optimized production technology
【Paper Link】 【Pages】:131-134
【Authors】: David B. Blumenthal ; Johann Gamper
【Abstract】: The problem of deriving lower and upper bounds for the edit distance between labelled undirected graphs has recently received increasing attention. However, only one algorithm has been proposed that allegedly computes not only an upper but also a lower bound for non-uniform metric edit costs and incorporates information about both node and edge labels. In this paper, we show that this algorithm is incorrect in the sense that, in general, it does not compute a lower bound. We present BRANCH, a corrected version of the algorithm that runs in O(n^5) time. We also develop a speed-up, BRANCHFAST, that runs in O(n^4) time and computes a lower bound which is only slightly less accurate than the one computed by BRANCH. An experimental evaluation shows that BRANCH and BRANCHFAST yield excellent runtime/accuracy trade-offs, as they outperform all existing competitors in terms of runtime or in terms of accuracy.
【Keywords】: Upper bound; Measurement; Bipartite graph; Runtime; Optimization; Complexity theory; Electronic mail
【Paper Link】 【Pages】:135-138
【Authors】: Dimitrios Karapiperis ; Aris Gkoulalas-Divanis ; Vassilios S. Verykios
【Abstract】: In this work, we propose Bit Vectors (BV), an accurate, distance-preserving encoding scheme for representing numerical data values in privacy-preserving tasks. Although many methods have been proposed in the literature for encoding strings, the problem of encoding numerical values has not been effectively addressed yet. In Privacy-Preserving Record Linkage (PPRL), a number of data custodians encode their records and submit them to a trusted third party that is responsible for identifying those records that refer to the same real-world entity. BV is supported by a strong theoretical foundation for embedding numerical values into an anonymization space in a way that preserves the initial distances. Key components of this embedding process are (a) the employed hash functions, which, by utilizing random intervals, allow for approximate matching, and (b) the threshold required by the distance computations, which we prove can be specified in a way that guarantees accurate results.
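To illustrate the general idea of distance-preserving bit-vector encodings, the toy sketch below hashes a numeric value with random intervals so that the Hamming distance between encodings tends to grow with the numeric distance. The interval width, bit count, and printed comparisons are illustrative assumptions only, not the paper's exact BV construction or its threshold guarantees.

    # A toy illustration: one bit per random interval; the bit fires when the
    # value falls inside the interval. Nearby values share most intervals, so
    # their Hamming distance is small (in expectation).
    import random

    def make_interval_hashes(num_bits, domain_min, domain_max, width):
        hashes = []
        for _ in range(num_bits):
            start = random.uniform(domain_min - width, domain_max)
            hashes.append((start, start + width))
        return hashes

    def encode(value, hashes):
        return [1 if lo <= value < hi else 0 for (lo, hi) in hashes]

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    hashes = make_interval_hashes(num_bits=64, domain_min=0, domain_max=100, width=10)
    print(hamming(encode(30, hashes), encode(31, hashes)))   # small: values are close
    print(hamming(encode(30, hashes), encode(80, hashes)))   # larger: values are far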
【Keywords】: Hamming distance; Encoding; Couplings; Euclidean distance; Numerical models; Conferences
【Paper Link】 【Pages】:139-142
【Authors】: Ibrahim Abdelaziz ; Essam Mansour ; Mourad Ouzzani ; Ashraf Aboulnaga ; Panos Kalnis
【Abstract】: Applications in life sciences, decentralized social networks, Internet of Things, and statistical linked dataspaces integrate data from multiple decentralized RDF graphs via SPARQL queries. Several approaches have been proposed to optimize query processing over a small number of heterogeneous data sources by utilizing schema information. In the case of schema similarity and interlinks among sources, these approaches cause unnecessary data retrieval and communication, leading to poor scalability and response time. This paper addresses these limitations and presents Lusail, a system for scalable and efficient SPARQL query processing over decentralized graphs. Lusail achieves scalability and low query response time through various optimizations at compile and run times. At compile time, we use a novel locality-aware query decomposition technique that maximizes the number of query triple patterns sent together to a source based on the actual location of the instances satisfying these triple patterns. At run time, we use selectivity-awareness and parallel query execution to reduce network latency and to increase parallelism by delaying the execution of subqueries expected to return large results. We evaluate Lusail using real and synthetic benchmarks, with data sizes up to billions of triples on an in-house cluster and a public cloud. We show that Lusail outperforms state-of-the-art systems by orders of magnitude in terms of scalability and response time.
【Keywords】: Resource description framework; Pattern matching; Parallel processing; Query processing; Scalability; Time factors; Life sciences
【Paper Link】 【Pages】:143-146
【Authors】: Klaus Broelemann ; Thomas Gottron ; Gjergji Kasneci
【Abstract】: We address the problem of latent truth discovery, LTD for short, where the goal is to discover the underlying true values of entity attributes in the presence of noisy, conflicting or incomplete information. Despite a multitude of algorithms addressing the LTD problem, only little is known about their overall performance with respect to effectiveness, efficiency and robustness. The LTD model proposed in this paper is based on Restricted Boltzmann Machines, thus coined LTD-RBM. In extensive experiments on various heterogeneous and publicly available datasets, LTD-RBM is superior to state-of-the-art LTD techniques in terms of an overall consideration of effectiveness, efficiency and robustness.
【Keywords】: Data models; Robustness; Inference algorithms; Training; Probabilistic logic; Maximum likelihood estimation
【Paper Link】 【Pages】:147-150
【Authors】: Young-Seok Kim ; Taewoo Kim ; Michael J. Carey ; Chen Li
【Abstract】: The proliferation of GPS-enabled mobile devices has generated geo-tagged data at an unprecedented rate over the past decade. Data-processing systems that aim to ingest, store, index, and analyze Big Data must deal with such geo-tagged data efficiently. In this paper, among representative, disk-resident spatial indexing methods that have been adopted by major SQL and NoSQL systems, we implement five variants of these methods in the form of Log-Structured Merge-tree-based (LSM) spatial indexes in order to evaluate their pros and cons for dynamic geo-tagged Big Data. We have implemented the alternatives, including LSM-based B-tree, R-tree, and inverted index variants, in Apache AsterixDB, an open source Big Data management system. This implementation enabled comparison in terms of real end-to-end performance, including logging and locking overheads, in a full-function, query-based system setting. Our evaluation includes both static and dynamic workloads, ranging from a "load once, query many" case to a case where continuous concurrent incremental inserts are mixed with concurrent queries. Based on the results, we discuss the pros and cons of the five index variants.
【Keywords】: Indexing; Big Data; Spatial indexes; Spatial databases; Servers; Two dimensional displays
【Paper Link】 【Pages】:151-154
【Authors】: Angen Zheng ; Alexandros Labrinidis ; Christos Faloutsos
【Abstract】: Large graph datasets have caused renewed interest in graph partitioning. However, existing well-studied graph partitioners often assume that vertices of the graph are always active during the computation, which may lead to time-varying skewness for traversal-style graph workloads, like Breadth First Search, since they only explore part of the graph in each superstep. Additionally, existing solutions do not consider what vertices each partition will have; as a result, high-degree vertices may be concentrated into a few partitions, causing imbalance. Towards this, we introduce the idea of skew-resistant graph partitioning, where the objective is to create an initial partitioning that will "hold well" over time without suffering from skewness. Skew-resistant graph partitioning tries to mitigate skewness by taking the characteristics of both the target workload and the graph structure into consideration.
【Keywords】: Partitioning algorithms; Runtime; Multicore processing; Data communication; Network interfaces; Social network services; Computational modeling
【Paper Link】 【Pages】:155-158
【Authors】: Ahsanul Haque ; Swarup Chandra ; Latifur Khan ; Kevin W. Hamlen ; Charu C. Aggarwal
【Abstract】: Traditional data stream classification techniques assume that the stream of data is generated from a single non-stationary process. On the contrary, a recently introduced problem setting, referred to as Multistream Classification, involves two independent non-stationary data generating processes. One of them is the source stream that continuously generates labeled data instances. The other one is the target stream that generates unlabeled test data instances from the same domain. The distribution represented by the source stream data is biased compared to that of the target stream. Moreover, these streams may have asynchronous concept drifts between them. The multistream classification problem is to predict the class labels of target stream instances, while utilizing labeled data available from the source stream. In this paper, we propose an efficient solution for multistream classification by fusing drift detection into online data shift adaptation. Experimental results on benchmark data sets indicate significantly improved performance over the only existing approach for multistream classification.
【Keywords】: Multistream Classification; Data Shift adaptation; Direct Density Ratio Estimation
【Paper Link】 【Pages】:159-162
【Authors】: Victor Amelkin ; Petko Bogdanov ; Ambuj K. Singh
【Abstract】: Modeling and predicting people's opinions plays an important role in today's life. For viral marketing and political strategy design, it is particularly important to be able to analyze competing opinions, such as pro-Democrat vs. pro-Republican. While observing the evolution of polar opinions in a social network over time, can we tell when the network "behaved" abnormally? Furthermore, can we predict how the opinions of individual users will change in the future? To answer such questions, it is insufficient to study individual user behavior, since opinions spread beyond users' ego-networks. Instead, we need to consider the opinion dynamics of all users simultaneously. In this work, we introduce the Social Network Distance (SND)—a distance measure that quantifies the likelihood of evolution of one snapshot of a social network into another snapshot under a chosen opinion dynamics model. SND has the rich semantics of a transportation problem, yet is computable in pseudo-linear time, making it applicable to large-scale social network analysis. We demonstrate the effectiveness of SND in experiments with Twitter data.
【Keywords】: Transportation; Histograms; Computational modeling; Twitter; Earth; Measurement
【Paper Link】 【Pages】:163-166
【Authors】: Jingchao Ni ; Hongliang Fei ; Wei Fan ; Xiang Zhang
【Abstract】: Automating medical diagnosis is an important data mining problem, which is to infer likely disease(s) for some observed symptoms. Algorithms to the problem are very beneficial as a supplement to a real diagnosis. Existing diagnosis methods typically perform the inference on a sparse bipartite graph with two sets of nodes representing diseases and symptoms, respectively. By using this graph, existing methods basically assume no direct dependency exists between diseases (or symptoms), which may not be true in reality. To address this limitation, in this paper, we introduce two domain networks encoding similarities between diseases and those between symptoms to avoid information loss as well as to alleviate the sparsity problem of the bipartite graph. Based on the domain networks and the bipartite graph bridging them, we develop a novel algorithm, CCCR, to perform diagnosis by ranking symptom-disease clusters. Comparing with existing approaches, CCCR is more accurate, and more interpretable since its results deliver rich information about how the inferred diseases are categorized. Experimental results on real-life datasets demonstrate the effectiveness of the proposed method.
【Keywords】: Diseases; Clustering algorithms; Bipartite graph; Linear programming; Medical diagnosis; Stochastic processes; Medical diagnostic imaging
【Paper Link】 【Pages】:167-170
【Authors】: Qiang Huang ; Jianlin Feng ; Qiong Fang
【Abstract】: The c-Approximate Furthest Neighbor (c-AFN) search is a fundamental problem in many applications. However, existing hashing schemes are designed for internal memory. The old techniques for external memory, such as the furthest-point Voronoi diagram and tree-based methods, are only suitable for the low-dimensional case. In this paper, we introduce a novel concept of a Reverse Locality-Sensitive Hashing (RLSH) family which is directly designed for c-AFN search. Accordingly, we propose two novel hashing schemes, RQALSH and RQALSH*, for high-dimensional c-AFN search over external memory. Experimental results validate the efficiency and effectiveness of RQALSH and RQALSH*.
【Keywords】: Search problems; Euclidean distance; Computer science; Sun; Probability density function; Buildings; Distortion
【Paper Link】 【Pages】:171-174
【Authors】: Nhat X. T. Le ; Vagelis Hristidis ; Neal Young
【Abstract】: In this Web 2.0 era, there is an ever-increasing number of product or service reviews, which must be summarized to help consumers effortlessly make informed decisions. Previous work on review summarization has simplified the problem by assuming that features (e.g., "display") are independent of each other and that the opinion for each feature in a review is Boolean: positive or negative. However, in reality features may be interrelated – e.g., "display" and "display color" – and the sentiment takes values in a continuous range – e.g., somewhat vs. very positive. We present a novel review summarization framework that advances the state-of-the-art by leveraging a domain hierarchy of concepts to handle the semantic overlap among the features, and by accounting for different sentiment levels. We show that the problem is NP-hard and present bounded approximate algorithms to compute the most representative set of sentences, based on a principled opinion coverage framework. We experimentally evaluate the quality of the summaries using both an intuitive coverage measure and a user study.
【Keywords】: Review summarization; sentiment analysis
【Paper Link】 【Pages】:175-178
【Authors】: Fengchao Peng ; Qiong Luo ; Lionel M. Ni
【Abstract】: Active learning has been widely used to select the most informative data for labeling in classification tasks, except for time series classification. The main challenge of active learning in time series classification is to evaluate the informativeness of a time series instance. Specifically, many informativeness metrics have been proposed for traditional active learning; however, none of them is particularly effective on time series data. In this paper, we design an informativeness metric that considers the characteristics of time series data in defining our instance uncertainty and utility. We prove that our informativeness metric is a submodular set function, and further develop an effective and efficient algorithm to select the most informative time series instances for training. In our experiments, we validate our method on a variety of datasets in the UCR Time Series Data Archive. The results show that our method achieves a higher classification accuracy than existing methods, using only 50% of the training instances.
【Keywords】: Time series analysis; Training; Measurement; Uncertainty; Labeling; Buildings; Entropy
【Paper Link】 【Pages】:179-182
【Authors】: Ran Yu ; Ujwal Gadiraju ; Besnik Fetahu ; Stefan Dietze
【Abstract】: Embedded markup based on Microdata, RDFa, and Microformats has become prevalent on the Web and constitutes an unprecedented source of data. However, RDF statements extracted from markup are fundamentally different from traditional RDF graphs: entity descriptions are flat, facts are highly redundant, and despite very frequent co-references, explicit links are missing. Therefore, carrying out typical entity-centric tasks such as retrieval and summarisation cannot be tackled sufficiently with state-of-the-art methods and requires preliminary data fusion. Given the scale and dynamics of Web markup, the applicability of general data fusion approaches is limited. We present a novel query-centric data fusion approach which overcomes such issues through a combination of entity retrieval and fusion techniques geared towards the specific challenges associated with embedded markup. To ensure precise and diverse entity descriptions, we follow a supervised learning approach and train a classifier for data fusion of a pool of candidate facts relevant to a given query and obtained through a preliminary entity retrieval step. We perform a thorough evaluation on a subset of the Web Data Commons dataset and show significant improvement over existing baselines. In addition, an investigation into the coverage and complementarity of facts from the constructed entity descriptions compared to DBpedia shows potential for aiding tasks such as knowledge base population.
【Keywords】: Data integration; Feature extraction; Resource description framework; Data mining; Support vector machines; Knowledge based systems; HTML
【Paper Link】 【Pages】:183-186
【Authors】: Zhaonian Zou ; Faming Li ; Jianzhong Li ; Yingshu Li
【Abstract】: This paper studies a novel approach to processing massive uncertain graph data. In this approach, we propose a new framework to simultaneously process a query on a set of randomly sampled possible worlds of an uncertain graph. Based on this framework, we develop a series of algorithms to analyze massive uncertain graphs, including breadth-first search, shortest distance queries, triangle counting, and core decomposition. We implement this approach based on GraphLab, one of the state-of-the-art graph processing frameworks. By sharing fine-grained internal processing steps on common substructures of sampled possible worlds, the new approach achieves tens to hundreds of times speedup in execution time on a cluster of 20 servers.
【Keywords】: Query processing; Reactive power; Data structures; Sampling methods; Uncertainty; Servers; Probabilistic logic
【Paper Link】 【Pages】:187-190
【Authors】: Charu C. Aggarwal ; Yao Li ; Philip S. Yu ; Yuchen Zhao
【Abstract】: The problem of node classification has been widely studied in a variety of network-based scenarios. In this paper, we will study the more challenging scenario in which some of the edges in a content-based network are labeled, and it is desirable to use this information in order to determine the labels of other arbitrary edges. Furthermore, each edge is associated with text content, which may correspond to either communication or relationship information between the different nodes. Such a problem often arises in the context of many social or communication networks in which edges are associated with communication between different nodes, and the text is associated with the content of the communication. This situation can also arise in many online social networks such as chat messengers or email networks, where the edges in the network may also correspond to the actual content of the chats or emails. The problem of edge classification is much more challenging from a scalability point of view, because the number of edges is typically significantly larger than the number of nodes in the network. In this paper, we will design a holistic classification approach which can combine content and structure for effective edge classification.
【Keywords】: Electronic mail; Indexes; Context; Social network services; Classification algorithms; Clustering algorithms; Algorithm design and analysis
【Paper Link】 【Pages】:191-194
【Authors】: Konstantinos Lolos ; Ioannis Konstantinou ; Verena Kantere ; Nectarios Koziris
【Abstract】: Modern large-scale computing deployments consist of complex applications running over machine clusters. An important issue in this setting is the offering of elasticity, i.e., the dynamic allocation of resources to applications to meet fluctuating workload demands. Threshold-based approaches are typically employed, yet they are difficult to configure and optimize. Approaches based on reinforcement learning have been proposed, but they require a large number of states in order to model complex application behavior. Methods that adaptively partition the state space have been proposed, but their partitioning criteria and strategies are sub-optimal. In this work we present MDP DT, a novel full-model-based reinforcement learning algorithm for elastic resource management that employs adaptive state space partitioning. We propose two novel statistical criteria and three strategies and we experimentally prove that they correctly decide both where and when to partition, outperforming existing approaches. We experimentally evaluate MDP DT in a real large-scale cluster under variable, previously unseen workloads and we show that it takes more informed decisions compared to static and model-free approaches, while requiring a minimal amount of training data.
【Keywords】: Partitioning algorithms; Heuristic algorithms; Resource management; Adaptation models; Clustering algorithms; Markov processes; Learning (artificial intelligence)
【Paper Link】 【Pages】:195-198
【Authors】: Neha Sengupta ; Amitabha Bagchi ; Srikanta Bedathur ; Maya Ramanath
【Abstract】: In this paper, we address the problem of sampling from a set and reconstructing a set stored as a Bloom filter. To the best of our knowledge our work is the first to address this question. We introduce a novel hierarchical data structure called BloomSampleTree that helps us design efficient algorithms to extract an almost uniform sample from the set stored in a Bloom filter and also allows us to reconstruct the set efficiently. In the case where the hash functions used in the Bloom filter implementation are partially invertible, in the sense that it is easy to calculate the set of elements that map to a particular hash value, we propose a second, more space-efficient method called HashInvert for the reconstruction. We study the properties of these two methods both analytically as well as experimentally. We provide bounds on run times for both methods and sample quality for the BloomSampleTree based algorithm, and show through an extensive experimental evaluation that our methods are efficient and effective.
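As background, the sketch below shows the underlying Bloom filter operations (insert and membership test), using salted SHA-256 hashes as an assumed hash family; the BloomSampleTree hierarchy and the HashInvert reconstruction proposed in the paper are not reproduced here.

    # A minimal Bloom filter sketch: k salted hashes set/check k bit positions.
    import hashlib

    class BloomFilter:
        def __init__(self, num_bits, num_hashes):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = [0] * num_bits

        def _positions(self, item):
            for salt in range(self.num_hashes):
                digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = 1

        def might_contain(self, item):
            # No false negatives; false positives occur with small probability.
            return all(self.bits[pos] for pos in self._positions(item))

    bf = BloomFilter(num_bits=1024, num_hashes=4)
    bf.add("alice")
    print(bf.might_contain("alice"))   # True
    print(bf.might_contain("bob"))     # False with high probability

Sampling or reconstructing the stored set from only the bit array is non-trivial precisely because the mapping from items to bits is lossy, which is the gap the paper's BloomSampleTree and HashInvert methods address.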
【Keywords】: Dictionaries; Databases; Force; Binary search trees; Approximation algorithms; Conferences
【Paper Link】 【Pages】:199-202
【Authors】: Chunyao Song ; Xuanming Liu ; Tingjian Ge
【Abstract】: Many big data applications today require querying highly dynamic and large-scale data streams for top-k frequent items in the most recent window of any specified size at any time. This is a challenging problem. We show that our novel solution is not only accurate, but is also one to two orders of magnitude faster than previous approaches. Moreover, its memory footprint grows only logarithmically with the window size, rather than linearly as in previous work. Our comprehensive experiments over real-world datasets show that our solution is very effective and scalable. In addition, we devise a concise and efficient solution to a related problem of tracking the frequency of selected items, improving upon previous work by twenty to thirty times in model conciseness while providing the same accuracy and efficiency.
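For orientation, the sketch below shows the classical Space-Saving summary for approximate top-k frequent items over an unbounded stream. It is a hedged baseline illustration only; it does not capture the paper's sliding-window structure or its logarithmic memory growth with window size.

    # A minimal Space-Saving sketch: keep at most k counters; when a new item
    # arrives and all counters are taken, evict the smallest counter and let
    # the new item inherit its (over)count.
    def space_saving(stream, k):
        counters = {}
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < k:
                counters[item] = 1
            else:
                victim = min(counters, key=counters.get)
                counters[item] = counters.pop(victim) + 1
        return counters

    print(space_saving(["a", "b", "a", "c", "a", "b", "d", "a"], k=3))
    # "a" dominates the counters; counts may overestimate by at most the evicted minimum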
【Keywords】: Upper bound; Twitter; Real-time systems; Data structures; Time-frequency analysis; Windows; Electronic mail
【Paper Link】 【Pages】:203-206
【Authors】: Linfei Zhou ; Claudia Plant ; Christian Böhm
【Abstract】: As an actively investigated topic in machine learning, Multiple-Instance Learning (MIL) has many proposed solutions, including both supervised and unsupervised methods. Most of these solutions are restricted to the original assumption that comes with the notion of MIL: the label of a multiple-instance object is directly determined by the labels of its instances. However, this assumption faces adverse circumstances when there is no clear relation between the overall label and the labels of instances. Most previous approaches avoid this problem in practice by taking each multiple-instance object as a whole instead of starting with learning in instance spaces, but they either lose information or are time consuming. In this paper, we introduce two joint Gaussian based measures for MIL, Joint Gaussian Similarity (JGS) and Joint Gaussian Distance (JGD), which require no prior knowledge of relations between the labels of multiple-instance objects and their instances. JGS is a similarity measure, while JGD is a metric whose properties are necessary for many techniques such as clustering and embedding. JGS and JGD take all the information into account, and many traditional machine learning methods can be introduced to MIL. Extensive experimental evaluations on various real-world data demonstrate the effectiveness of both measures, and better performance than state-of-the-art MIL algorithms on benchmark tasks.
【Keywords】: Density measurement; Clustering algorithms; Benchmark testing; Extraterrestrial measurements; Meteorology; Computer science
【Paper Link】 【Pages】:207-210
【Authors】: Mohamed Sarwat ; Yuhan Sun
【Abstract】: Thanks to the widespread use of mobile and wearable devices, popular social networks, e.g., Facebook, prompt users to add spatial attributes to social entities, e.g., check-ins, traveling posts, and geotagged photos, leading to what is known as the GeoSocial Graph. In such a graph, users may issue a Reachability Query with Spatial Range Predicate (abbr. RangeReach). RangeReach finds whether an input vertex can reach any spatial vertex that lies within an input spatial range. The paper proposes GEOREACH, an approach that adds spatial data awareness to a graph database management system. GEOREACH allows efficient execution of RangeReach queries, yet without compromising much on overall system scalability. Experiments based on a system implementation inside Neo4j show that GEOREACH exhibits up to two orders of magnitude better query performance and up to four times less storage than the state-of-the-art spatial and reachability indexing approaches.
【Keywords】: Indexing; Facebook; Scalability; Data structures; Query processing
【Paper Link】 【Pages】:211-214
【Authors】: Moloud Shahbazi ; Matthew T. Wiley ; Vagelis Hristidis
【Abstract】: Item (e.g., product) reviews are one of the most popular types of user-generated content in Web 2.0. Reviews have been effectively used in collaborative filtering to recommend products to users based on similar users, and also to compute a product's star rating. However, little work has studied how reviews can be used to perform query-specific ranking of items. In this paper, we present efficient top-k algorithms to rank items, by weighing each review's rating by its relevance to the user query. We propose a non-random access algorithm and perform a comprehensive evaluation of our method on multiple datasets. We show that our solution significantly outperforms the baseline approach in terms of query response time.
【Keywords】: Footwear; Electronic mail; Indexes; Web 2.0; Collaboration; Medical services; Feature extraction
【Paper Link】 【Pages】:215-218
【Authors】: Yang Zhang ; Yusu Wang ; Srinivasan Parthasarathy
【Abstract】: In this article, we propose a novel visualization method to explore graphs with numerical attributes associated with nodes – referred to as scalar graphs. The proposed visualization strategy seeks to simultaneously uncover the relationship between attribute values and graph topology, and relies on transforming the network to generate a terrain map. A key objective here is to ensure that the terrain map reveals the overall distribution of components of interest (e.g., dense subgraphs, k-cores) and the relationships among them while being sensitive to the attribute values over the graph.
【Keywords】: Visualization; Data visualization; Vegetation; Layout; Two dimensional displays; Tools; Computer science
【Paper Link】 【Pages】:219-222
【Authors】: Jiawei Zhang ; Philip S. Yu ; Yuanhua Lv
【Abstract】: Employees in companies can be divided into different social communities, and those who frequently socialize with each other are treated as close friends and will be grouped in the same community. In the enterprise context, a large amount of information about the employees is available in both (1) offline company internal sources and (2) online enterprise social networks (ESNs). What's more, each of the information sources can also contain multiple categories of employees' socialization activity information at the same time. In this paper, we propose to detect the social communities of the employees in companies based on these different information sources simultaneously, and the problem is formally called the "Enterprise Community Detection" (ECD) problem. To address the problem, a novel community detection framework named "HeterogeneoUs MultisOurce ClusteRing" (HUMOR) is introduced in this paper. Based on the various enterprise social intimacy measures introduced in this paper, HUMOR detects a set of micro community structures of the employees based on these different categories of information available in the online and offline sources respectively. (A full version of this paper is available in [5]).
【Keywords】: Community Detection; Enterprise Social Network; Data Mining
【Paper Link】 【Pages】:223-226
【Authors】: Chaoyue Niu ; Zhenzhe Zheng ; Fan Wu ; Xiaofeng Gao ; Guihai Chen
【Abstract】: As a significant business paradigm, many online information platforms have emerged to satisfy society's needs for person-specific data, where a service provider collects raw data from data contributors, and then offers value-added data services to data consumers. However, in the data trading layer, the data consumers face a pressing problem, i.e., how to verify whether the service provider has truthfully collected and processed data? Furthermore, the data contributors are usually unwilling to reveal their sensitive personal data and real identities to the data consumers. In this paper, we propose TPDM, which efficiently integrates Truthfulness and Privacy preservation in Data Markets. TPDM is structured internally in an Encrypt-then-Sign fashion, using somewhat homomorphic encryption and identity-based signatures. It simultaneously facilitates batch verification, data processing, and outcome verification, while maintaining identity preservation and data confidentiality. We also instantiate TPDM with a profile-matching service, and extensively evaluate its performance on the Yahoo! Music ratings dataset. Our evaluation results show that TPDM achieves several desirable properties, while incurring low computation and communication overheads when supporting a large-scale data market.
【Keywords】: Data collection; Data privacy; Encryption; Twitter
【Paper Link】 【Pages】:227-230
【Authors】: Kasper Grud Skat Madsen ; Yongluan Zhou ; Jianneng Cao
【Abstract】: Load balancing, operator instance collocations and horizontal scaling are critical issues in Parallel Stream Processing Engines to achieve low data processing latency, optimized cluster utilization and minimized communication cost respectively. In previous work, these issues are typically tackled separately and independently. We argue that these problems are tightly coupled in the sense that they all need to determine the allocations of workloads and migrate computational states at runtime. Optimizing them independently would result in suboptimal solutions. Therefore, in this paper, we investigate how these three issues can be modeled as one integrated optimization problem. In particular, we first consider jobs, where workload allocations have little effect on the communication cost, and model the problem of load balance as a Mixed-Integer Linear Program. Afterwards, we present an extended solution called ALBIC, which supports general jobs. We implement the proposed techniques on top of Apache Storm, an open-source Parallel Stream Processing Engine. The extensive experimental results over both synthetic and real datasets show that our techniques clearly outperform existing approaches.
【Keywords】: Load management; Resource management; Load modeling; Engines; Runtime; Computational modeling; Storms
【Paper Link】 【Pages】:231-234
【Authors】: Xinyu Lei ; Alex X. Liu ; Rui Li
【Abstract】: The fast-increasing number of location-dependent applications on mobile devices is generating a plethora of geospatial data. Outsourcing geospatial data storage to a powerful cloud is an economical approach. However, safeguarding data users' location privacy against the untrusted cloud and providing efficient location-aware query processing over encrypted data are in conflict with each other. As a step to reconcile this conflict, we study secure k nearest neighbor (SkNN) query processing over encrypted geospatial data in cloud computing. We design 2D SkNN (2DSkNN), a scheme that achieves both strong provable security and high efficiency. Our approach employs locality-sensitive hashing (LSH) in a dimensionality-increasing manner. This is a counter-intuitive use of LSH, since the traditional usage of LSH is to reduce the data dimensionality and solve the so-called "curse of dimensionality" problem. We show that increasing the data dimensionality via LSH is indeed helpful in tackling the 2DSkNN problem. By LSH-based neighbor region encoding and two-tier prefix-free encoding, we turn the proximity test into a sequential keyword query with a stop condition, which can be well addressed by any existing symmetric searchable encryption (SSE) scheme. We show that 2DSkNN achieves adaptive indistinguishability under chosen-keyword attack (IND2-CKA) in the random oracle model. A prototype implementation and experiments on both real-world and synthetic datasets confirm the high practicality of 2DSkNN.
【Keywords】: Cloud computing; Cryptography; Two dimensional displays; Indexes; Data privacy; Geospatial analysis
【Paper Link】 【Pages】:237-242
【Authors】: Anjan Kumar Amirishetty ; Yunrui Li ; Tolga Yurek ; Mahesh Girkar ; Wilson Chan ; Graham Ivey ; Vsevolod Panteleenko ; Ken Wong
【Abstract】: Oracle's Real Application Cluster (RAC) allows multiple database instances to run on different server nodes in a cluster against a shared set of data files. A critical aspect of an Oracle RAC system is that of instance recovery. When a node suffers from a hardware failure, or a database instance suffers from a software failure, instance recovery is performed by a surviving instance to ensure that the database remains in a consistent state. High-availability comes from the surviving database instances, each running on a surviving node, that are still able to provide database services. During instance recovery, the set of database resources that are in need of recovery must be identified, locked and then repaired. Until such time as the identification and locking of these resources has been done, Oracle needs to block any requests by database clients to all database resources. The whole database appears to be frozen during this time, a period that is called application brown-out. In the interests of availability it is therefore important that instance recovery endeavors to keep this period of identification and locking as short as possible. In doing so, not only is the brown-out period reduced, but also the overall time to make available those resources that need repair, is reduced. This paper describes the use of a Buddy Instance, a mechanism that significantly reduces the brown-out time and therefore also, the duration of instance recovery. Each database instance has a buddy database instance whose purpose is to construct in-memory metadata that describes the resources needing recovery, on a continuous basis at run-time. In the event of node or instance failure, the buddy instance for the failed instance uses the in-memory metadata in performing instance recovery. The buddy instance mechanism was introduced in the 12.2.0.1 release of Oracle Database to minimize and potentially eliminate the period of identification of resources needing recovery. This mechanism was enhanced in the 12.2.0.2 release of Oracle Database by the construction of a Bloom filter that temporarily serves the purpose of locking the identified resources, thereby enabling the database to be made available sooner than was previously the case.
【Keywords】: Database; Real Application Cluster; Recovery; Availability
【Paper Link】 【Pages】:243-254
【Authors】: Dong Wang ; Wei Cao ; Jian Li ; Jieping Ye
【Abstract】: The online car-hailing service has gained great popularity all over the world. As more passengers and more drivers use the service, it becomes increasingly more important for car-hailing service providers to effectively schedule the drivers to minimize the waiting time of passengers and maximize the driver utilization, thus to improve the overall user experience. In this paper, we study the problem of predicting the real-time car-hailing supply-demand, which is one of the most important components of an effective scheduling system. Our objective is to predict the gap between the car-hailing supply and demand in a certain area in the next few minutes. Based on the prediction, we can balance the supply-demands by scheduling the drivers in advance. We present an end-to-end framework called Deep Supply-Demand (DeepSD) using a novel deep neural network structure. Our approach can automatically discover complicated supply-demand patterns from the car-hailing service data while requiring only a minimal amount of hand-crafted features. Moreover, our framework is highly flexible and extendable. Based on our framework, it is very easy to utilize multiple data sources (e.g., car-hailing orders, weather and traffic data) to achieve a high accuracy. We conduct extensive experimental evaluations, which show that our framework provides more accurate prediction results than the existing methods.
【Keywords】: Vehicles; Meteorology; Neural networks; Data models; Vocabulary; Schedules
【Paper Link】 【Pages】:255-258
【Authors】: Sung Jin Kim ; Mohammed Al-Kateb ; Paul Sinclair ; Alain Crolotte ; Chengyang Zhang ; Linda Rose
【Abstract】: The Unified Data Architecture (UDA) of Teradata is an inclusive multisystem data analytics solution. A key challenge for query optimization under the UDA is to find optimal plans for queries that access data on heterogeneous remote data stores. The challenge comes primarily from the lack of statistics for data stored on remote systems. In this paper, we present techniques implemented in Teradata Database for dynamically collecting statistics on data fetched from a remote system and feeding these statistics back to the query optimizer during query execution. We demonstrate the performance impact of dynamic statistics collection and feedback with experiments conducted on a system that consists of Teradata Database and a remote Hadoop server.
【Keywords】: Reservoirs; Frequency estimation; Databases; Estimation; Memory management; Parallel processing
【Paper Link】 【Pages】:259-270
【Authors】: Joan Serrà ; Ilias Leontiadis ; Alexandros Karatzoglou ; Konstantina Papagiannaki
【Abstract】: To manage and maintain large-scale cellular networks, operators need to know which sectors underperform at any given time. For this purpose, they use the so-called hot spot score, which is the result of a combination of multiple network measurements and reflects the instantaneous overall performance of individual sectors. While operators have a good understanding of the current performance of a network and its overall trend, forecasting the performance of each sector over time is a challenging task, as it is affected by both regular and non-regular events, triggered by human behavior and hardware failures. In this paper, we study the spatio-temporal patterns of the hot spot score and uncover its regularities. Based on our observations, we then explore the possibility to use recent measurements' history to predict future hot spots. To this end, we consider tree-based machine learning models, and study their performance as a function of time, amount of past data, and prediction horizon. Our results indicate that, compared to the best baseline, tree-based models can deliver up to 14% better forecasts for regular hot spots and 153% better forecasts for non-regular hot spots. The latter brings strong evidence that, for moderate horizons, forecasts can be made even for sectors exhibiting isolated, non-regular behavior. Overall, our work provides insight into the dynamics of cellular sectors and their predictability. It also paves the way for more proactive network operations with greater forecasting horizons.
【Keywords】: Forecasting; Cellular networks; Predictive models; Measurement; Tensile stress; Monitoring; Hardware
【Paper Link】 【Pages】:271-280
【Authors】: Lijun Tang ; Eric Yi Liu
【Abstract】: User-managed-events is a popular feature on social networks. Take Facebook Events as an example: over 135 million events were created in 2015 and over 550 million people use events each month. In this work, we consider the heavy sparseness in both user and event feedback history caused by short lifespans (transiency) of events and user participation patterns in a production event system. We propose to solve the resulting cold-start problems by introducing a joint representation model to project users and events into the same latent space. Our model based on parallel Convolutional Neural Networks captures semantic meaning in event text and also utilizes heterogeneous user knowledge available in the social network. By feeding the model output as user and event representation into a combiner prediction model, we show that our representation model improves the prediction accuracy over existing techniques (+6% AUC lift). Our method provides a generic way to match heterogeneous information from different domains and applies to a wide range of applications in social networks.
【Keywords】: Representation Learning; User modeling; Neural Network; Recommendation
【Paper Link】 【Pages】:281-284
【Authors】: Jie Jiang ; Jiawei Jiang ; Bin Cui ; Ce Zhang
【Abstract】: Gradient boosting tree (GBT), a widely used machine learning algorithm, achieves state-of-the-art performance in academia, industry, and data analytics competitions. Although existing scalable systems which implement GBT, such as XGBoost and MLlib, perform well for datasets with medium-dimensional features, they can suffer performance degradation for many industrial applications where the training datasets contain high-dimensional features. The performance degradation derives from their inefficient mechanisms for model aggregation—either map-reduce or all-reduce. To address this high-dimensional problem, we propose a scalable execution plan using the parameter server architecture to facilitate the model aggregation. Further, we introduce a sparse-pull method and an efficient index structure to increase the processing speed. We implement a GBT system, namely TencentBoost, in the production cluster of Tencent Inc. The empirical results show that our system is 2-20× faster than existing platforms.
【Keywords】: Histograms; Servers; Boosting; Training; Buildings; Indexes; Computer architecture
【Paper Link】 【Pages】:285-296
【Authors】: Yongseok Son ; Jaeyoon Choi ; Jekyeom Jeon ; Cheolgi Min ; Sunggon Kim ; Heon Young Yeom ; Hyuck Han
【Abstract】: Backup and recovery is an important feature of database systems since it protects data against unexpected hardware and software failures. Database systems can provide data safety and reliability by creating a backup and restoring the backup from a failure. Database administrators can use backup/recovery tools that are provided with database systems or backup/recovery methods with operating systems. However, the existing tools perform time-consuming jobs and the existing methods may negatively affect run-time performance during normal operation even though high-performance SSDs are used. In this paper, we present an SSD-assisted backup/recovery scheme for database systems. In our scheme, we extend the out-of-place update characteristics of flash-based SSDs for backup/recovery operations. To this end, we exploit the resources (e.g., flash translation layer and DRAM cache with supercapacitors) inside SSDs, and we call our SSD with new backup/recovery features BR-SSD. We design and implement the backup/recovery functionality in the Samsung enterprise-class SSD (i.e., SM843Tn) for more realistic systems. Furthermore, we conduct a case study of BR-SSDs in replicated database systems and modify MySQL with replication to integrate BR-SSDs. The experimental result demonstrates that our scheme provides fast recovery while it does not negatively affect the run-time performance during normal operation.
【Keywords】: Database systems; Tools; Random access memory; Servers; Cows; Operating systems
【Paper Link】 【Pages】:297-308
【Authors】: Ilias Leontiadis ; Joan Serrà ; Alessandro Finamore ; Giorgos Dimopoulos ; Konstantina Papagiannaki
【Abstract】: Mobile network operators collect a humongous amount of network measurements. Among those, sector Key Performance Indicators (KPIs) are used to monitor the radio access, i.e., the "last mile" of mobile networks. Thresholding mechanisms and synthetic combinations of KPIs are used to assess the network health, and rank sectors to identify the underperforming ones. It follows that the available monitoring methodologies heavily rely on the fine-grained tuning of thresholds and weights, currently established through domain knowledge of both vendors and operators. In this paper, we study how to bridge sector KPIs to reflect Quality of Experience (QoE) ground-truth measurements, namely throughput, latency and video streaming stall events. We leverage one month of data collected in the operational network of a mobile network operator serving more than 10 million subscribers. We extensively investigate to what extent the adopted methodologies efficiently capture QoE. Moreover, we challenge the current state of the art by presenting data-driven approaches based on Particle Swarm Optimization (PSO) metaheuristics and random forest regression algorithms, to better assess sector performance. Results show that the proposed methodologies outperform the state-of-the-art solution, improving the correlation with respect to the baseline by a factor of 3 and improving visibility into underperforming sectors. Our work opens new areas for research in monitoring solutions for enriching the quality and accuracy of the network performance indicators collected at the network edge.
【Keywords】: Monitoring; Mobile communication; Mobile computing; Measurement; Poles and towers; Bridges
【Paper Link】 【Pages】:309-320
【Authors】: Yosef Moatti ; Eran Rom ; Raúl Gracia Tinedo ; Dalit Naor ; Doron Chen ; Josep Sampé ; Marc Sánchez Artigas ; Pedro García López ; Filip Gluszak ; Eric Deschdt ; Francesco Pace ; Daniele Venzano ; Pietro Michiardi
【Abstract】: Extracting value from data stored in object stores, such as OpenStack Swift and Amazon S3, can be problematic in common scenarios where analytics frameworks and object stores run in physically disaggregated clusters. One of the main problems is that analytics frameworks must ingest large amounts of data from the object store prior to the actual computation; this incurs a significant resource and performance overhead. To overcome this problem, we present Scoop. Scoop enables analytics frameworks to benefit from the computational resources of object stores to optimize the execution of analytics jobs. Scoop achieves this by enabling the addition of ETL-type actions to the data upload path and by offloading querying functions to the object store through a rich and extensible active object storage layer. As a proof-of-concept, Scoop enables Apache Spark SQL selections and projections to be executed close to the data in OpenStack Swift for accelerating analytics workloads of a smart energy grid company (GridPocket). Our experiments in a 63-machine cluster with real IoT data and SQL queries from GridPocket show that Scoop exhibits query execution times up to 30x faster than the traditional "ingest-then-compute" approach.
【Keywords】: Sparks; Companies; Data analysis; Big Data; Computer architecture; Libraries; Energy measurement
【Paper Link】 【Pages】:325-336
【Authors】: Xiongcai Luo ; Jun Gao ; Chang Zhou ; Jeffrey Xu Yu
【Abstract】: SimRank is an effective structural similarity measurement between two vertices in a graph, which can be used in many applications like recommender systems. Although progress has been made, existing methods still face challenges to handle large graphs. Besides huge index construction and maintenance cost, the existing methods require considerable search space and time overheads in the online SimRank query. In this paper, we design a Monte Carlo-based method, UniWalk, to enable fast top-k SimRank computation over large undirected graphs without indexing. UniWalk directly locates the top-k similar vertices for any single source vertex u via O(R) sampling paths originating from u only, which avoids the selection of a candidate vertex set C and the subsequent O(|C|R) bidirectional sampling paths starting from u and each candidate, respectively, as in existing methods. We also design a space-efficient method to reduce intermediate results, and a path-sharing strategy to optimize path sampling for multiple source vertices. Furthermore, we extend UniWalk to existing distributed graph processing frameworks to improve its scalability. We conduct extensive experiments to illustrate that UniWalk has high scalability, and outperforms the state-of-the-art methods by orders of magnitude, and such an improvement is achieved without any indexing overheads.
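For background, SimRank admits a Monte Carlo interpretation: s(u, v) is the expected value of c^t, where t is the first step at which two random walks started from u and v meet. The sketch below estimates this by sampling coupled walks on an undirected graph; it is a hedged illustration of that classical view, not UniWalk's single-source path sharing or top-k pruning.

    # A minimal Monte Carlo SimRank estimator on an undirected graph given as
    # {vertex: [neighbours]}; walks are truncated at max_len steps.
    import random

    def simrank_mc(adj, u, v, c=0.6, walks=10000, max_len=10):
        if u == v:
            return 1.0
        total = 0.0
        for _ in range(walks):
            a, b = u, v
            for step in range(1, max_len + 1):
                if not adj[a] or not adj[b]:
                    break
                a = random.choice(adj[a])
                b = random.choice(adj[b])
                if a == b:                  # first meeting at this step
                    total += c ** step
                    break
        return total / walks

    g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    print(simrank_mc(g, 0, 1))   # relatively high: 0 and 1 share neighbour 2
    print(simrank_mc(g, 0, 3))   # lower: 0 and 3 meet only via longer walks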
【Keywords】: Monte Carlo methods; Computers; Computational modeling; Scalability; Optimization; Indexing; Mathematical model
【Paper Link】 【Pages】:337-348
【Authors】: Yikai Zhang ; Jeffrey Xu Yu ; Ying Zhang ; Lu Qin
【Abstract】: Graphs have been widely used in many applications such as social networks, collaboration networks, and biological networks. One important graph analytics task is to explore cohesive subgraphs in a large graph. Among the several cohesive subgraphs studied, k-core is one that can be computed in linear time for a static graph. Since graphs are evolving in real applications, in this paper, we study core maintenance, which aims to reduce the computational cost of computing k-cores when the graph is dynamically updated from time to time. We identify drawbacks of the existing efficient algorithm, which requires a large search space to find the vertices that need to be updated and incurs high overhead to maintain its index when the graph is updated. We propose a new order-based approach that maintains an order, called the k-order, among vertices while the graph is updated. Our new algorithm can significantly outperform the state-of-the-art algorithm by up to 3 orders of magnitude on the 11 large real graphs tested. We report our findings in this paper.
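As context, the sketch below computes core numbers on a static graph with the standard peeling algorithm (repeatedly remove a minimum-degree vertex); a bucket-based implementation achieves the linear time mentioned in the abstract. The paper's order-based incremental maintenance under updates is not reproduced here.

    # A minimal sketch of static core decomposition by peeling. Each vertex's
    # core number is its degree at the moment it is removed; neighbours'
    # degrees are never decremented below that value.
    def core_decomposition(adj):
        """Return {vertex: core number} for a graph {vertex: set(neighbours)}."""
        degree = {v: len(adj[v]) for v in adj}
        core = {}
        remaining = set(adj)
        while remaining:
            v = min(remaining, key=lambda x: degree[x])   # peel a minimum-degree vertex
            core[v] = degree[v]
            remaining.remove(v)
            for w in adj[v]:
                if w in remaining and degree[w] > degree[v]:
                    degree[w] -= 1
        return core

    g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    print(core_decomposition(g))   # {3: 1, 0: 2, 1: 2, 2: 2} (peeling order may vary)

This naive version runs in O(n^2) because of the repeated minimum search; sorting vertices into degree buckets gives the linear-time variant.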
【Keywords】: Maintenance engineering; Heuristic algorithms; Bars; Indexes; Approximation algorithms; Patents; Social network services
【Paper Link】 【Pages】:349-360
【Authors】: Son T. Mai ; Martin Storgaard Dieu ; Ira Assent ; Jon Jacobsen ; Jesper Kristensen ; Mathias Birk
【Abstract】: The structural graph clustering algorithm SCAN is a fundamental technique for managing and analyzing graph data. However, its high runtime remains a computational bottleneck, which limits its applicability. In this paper, we propose a novel interactive approach for tackling this problem on multicore CPUs. Our algorithm, called anySCAN, iteratively processes vertices in blocks. The acquired results are merged into an underlying cluster structure consisting of so-called supernodes for building clusters. During its runtime, anySCAN can be suspended to examine intermediate results and resumed to find better results at arbitrary time points, making it an anytime algorithm capable of dealing with very large graphs in an interactive way and under arbitrary time constraints. Moreover, its block processing scheme allows the design of a scalable parallel algorithm on shared memory architectures such as multicore CPUs for further speeding up the algorithm at each iteration. Consequently, anySCAN is uniquely both an interactive and a parallel algorithm. Experiments are conducted on very large real graph datasets to demonstrate the performance of anySCAN. It acquires very good approximate results early, leading to orders-of-magnitude speedups compared to SCAN and its variants. Using 16 threads, the acquired speedup factors are up to 13.5x over its sequential version.
【Keywords】: Structural graph clustering; SCAN; anytime clustering; parallel algorithm; multicore CPUs
【Paper Link】 【Pages】:361-372
【Authors】: Shuai Ma ; Renjun Hu ; Luoshu Wang ; Xuelian Lin ; Jinpeng Huai
【Abstract】: Dense subgraph discovery has proven useful in various applications of temporal networks. We focus on a special class of temporal networks whose nodes and edges are kept fixed, but whose edge weights constantly and regularly vary with timestamps. However, finding dense subgraphs in temporal networks is nontrivial, and the state-of-the-art solution uses a filter-and-verification framework, which is not scalable on large temporal networks. In this study, we propose a data-driven approach to finding dense subgraphs in large temporal networks with T timestamps. (1) We first develop a data-driven approach employing hidden statistics to identify k time intervals, instead of T(T + 1)/2 ones (k is typically much smaller than T), which strikes a balance between quality and efficiency. (2) After proving that the problem admits no constant-factor approximation algorithms, we design better heuristic algorithms to attack the problem, by connecting the discovery of dense subgraphs with a variant of the Prize Collecting Steiner Tree problem. (3) Finally, we conduct an extensive experimental study to demonstrate the effectiveness and efficiency of our approach, using real-life and synthetic data.
【Keywords】: Aggregates; Roads; Approximation algorithms; Algorithm design and analysis; Heuristic algorithms; Steiner trees; Data analysis
【Paper Link】 【Pages】:375-386
【Authors】: Xike Xie ; Xin Lin ; Jianliang Xu ; Christian S. Jensen
【Abstract】: The proliferation of geo-textual data gives prominence to spatial keyword search. The basic top-k spatial keyword query returns k geo-textual objects that rank the highest according to their textual relevance to the query keywords and their spatial proximity to the query location. We define, study, and provide means of computing the reverse top-k keyword-based location query. This new type of query takes a set of keywords, a query object q, and a number k as arguments, and it returns a spatial region such that any top-k spatial keyword query with the query keywords and a location in this region would contain object q in its result. This query targets applications in market analysis, geographical planning, and location optimization, and it may support applications related to safe zones and influence zones that are used widely in location-based services. We show that computing an exact query result requires evaluating and merging a set of weighted Voronoi cells, which is expensive. We therefore devise effective algorithms that approximate result regions with quality guarantees. We develop novel pruning techniques on top of an index, and we offer a series of optimization techniques that aim to further accelerate query processing. Empirical studies suggest that the proposed query processing is efficient and scalable.
【Keywords】: Computer science; Optimization; Algorithm design and analysis; Query processing; Indexing; Keyword search; Planning
【Paper Link】 【Pages】:387-398
【Authors】: Jingwen Zhao ; Yunjun Gao ; Gang Chen ; Christian S. Jensen ; Rui Chen ; Deng Cai
【Abstract】: Identifying prospective customers is an important aspect of marketing research. In this paper, we provide support for a new type of query, the Reverse Top-k Geo-Social Keyword (RkGSK) query. This query takes into account spatial, textual, and social information, and finds prospective customers for geotagged objects. As an example, a restaurant manager might apply the query to find prospective customers. To address this, we propose a hybrid index, the GIM-tree, which indexes locations, keywords, and social information of geo-tagged users and objects, and then, using the GIM-tree, we present efficient RkGSK query processing algorithms that exploit several pruning strategies. The effectiveness of RkGSK retrieval is characterized via a case study, and extensive experiments using real datasets offer insight into the efficiency of the proposed index and algorithms.
【Keywords】: Indexes; Roads; Query processing; Social network services; Computer science; Search problems; Mobile communication
【Paper Link】 【Pages】:399-410
【Authors】: Yangjun Chen ; Yujia Wu
【Abstract】: In this paper, we discuss an efficient and effective index mechanism for string matching with k mismatches, which finds all the substrings of a target string s that differ from a pattern string r in at most k positions. The main idea is to transform s into a BWT-array as an index, denoted as BWT(s), and search r against it. During the process, precomputed mismatch information of r is utilized to speed up the navigation of BWT(s). In this way, the time complexity can be reduced to O(kn' + n + m log m), where m = |r|, n = |s|, and n' is the number of leaf nodes of a tree structure, called a mismatching tree, produced during a search of BWT(s). Extensive experiments have been conducted, which show that our method for this problem is promising.
【Keywords】: String matching; DNA sequences; tries; BWT-transformation
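Since the abstract above builds its index on the Burrows-Wheeler transform of the target string, the following minimal sketch shows the transform itself (naive construction with a sentinel); it does not reproduce the paper's mismatch-tree search over BWT(s), and the sentinel choice is an assumption.

```python
def bwt(s, sentinel="$"):
    """Burrows-Wheeler transform of s (naive O(n^2 log n) construction).

    Sorts all rotations of s + sentinel and concatenates their last
    characters.  The sentinel is assumed absent from s.  The paper
    navigates this array using precomputed mismatch information of the
    pattern, which is beyond this sketch.
    """
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

# Example: bwt("banana") == "annb$aa"
```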
【Paper Link】 【Pages】:411-422
【Authors】: Justin Wood ; Patrick Tan ; Wei Wang ; Corey W. Arnold
【Abstract】: Topic modeling has increasingly attracted interest from researchers. Common methods of topic modeling usually produce a collection of unlabeled topics where each topic is depicted by a distribution of words. Associating semantic meaning with these word distributions is not always straightforward. Traditionally, this task is left to human interpretation. Manually labeling the topics is unfortunately not always easy, as topics generated by unsupervised learning methods do not necessarily align well with our prior knowledge in the subject domains. Currently, two approaches to solve this issue exist. The first is a post-processing procedure that assigns each topic a label from the prior knowledge base that is semantically closest to the word distribution of the topic. The second is a supervised topic modeling approach that restricts the topics to a predefined set whose word distributions are provided beforehand. Neither approach is ideal, as the former may produce labels that do not accurately describe the word distributions, and the latter lacks the ability to detect unknown topics that are crucial to enrich our knowledge base. Our goal in this paper is to introduce a semi-supervised Latent Dirichlet Allocation (LDA) model, Source-LDA, which incorporates prior knowledge to guide the topic modeling process and improve both the quality of the resulting topics and of the topic labeling. We accomplish this by integrating existing labeled knowledge sources representing known potential topics into a probabilistic topic model. These knowledge sources are translated into a distribution and used to set the hyperparameters of the Dirichlet-generated distribution over words. This approach ensures that the topic inference process is consistent with existing knowledge and, simultaneously, allows for the discovery of new topics. The results show improved topic generation and increased accuracy in topic labeling when compared to those obtained using various labeling approaches based on LDA.
【Keywords】: Mathematical model; Labeling; Sports equipment; Encyclopedias; Electronic publishing; Internet
【Paper Link】 【Pages】:425-436
【Authors】: Michele Coscia ; Frank M. H. Neffke
【Abstract】: Networks are powerful instruments for studying complex phenomena, but they become hard to analyze when the data contain noise. Network backbones provide a tool to extract the latent structure from noisy networks by pruning non-salient edges. We describe a new approach to extract such backbones. We assume that edge weights are drawn from a binomial distribution, and estimate the error variance in edge weights using a Bayesian framework. Our approach uses a more realistic null model for the edge weight creation process than prior work. In particular, it simultaneously considers the propensity of nodes to send and receive connections, whereas previous approaches only considered nodes as emitters of edges. We test our model on real-world networks of different types (flows, stocks, co-occurrences, directed, undirected) and show that our Noise-Corrected approach returns backbones that outperform other approaches on a number of criteria. Our approach is scalable and able to deal with networks with millions of edges.
【Keywords】: Algorithm design and analysis; Noise measurement; Tools; Prediction algorithms; Classification algorithms; Joining processes; Electronic mail
【Paper Link】 【Pages】:437-448
【Authors】: Guoyao Feng ; Lukasz Golab ; Divesh Srivastava
【Abstract】: We present SIRUM: a system for Scalable Informative RUle Mining from multi-dimensional data. Informative rules have recently been studied in several contexts, including data summarization, data cube exploration and data quality. The objective is to produce a small set of rules (patterns) over the values of the dimension attributes that provide the most information about the distribution of a numeric measure attribute. Within SIRUM, we propose several optimizations for tall, wide and distributed datasets. We implemented SIRUM in Spark and observed significant performance and scalability improvements on real datasets due to our optimizations. As a result, SIRUM is able to generate informative rules on much wider and taller datasets than using distributed implementations of the previous state of the art.
【Keywords】: Data mining; Optimization; Delays; Distributed databases; Sparks; Airports; Context
【Paper Link】 【Pages】:449-460
【Authors】: Yu Zhang ; Kanat Tangwongsan ; Srikanta Tirthapura
【Abstract】: We present methods for k-means clustering on a stream with a focus on providing fast responses to clustering queries. Compared to the current state-of-the-art, our methods provide substantial improvement in the query time for cluster centers while retaining the desirable properties of provably small approximation error and low space usage. Our algorithms rely on a novel idea of "coreset caching" that systematically reuses coresets (summaries of data) computed for recent queries in answering the current clustering query. We present both theoretical analysis and detailed experiments demonstrating their correctness and efficiency.
【Keywords】: Clustering algorithms; Approximation algorithms; Algorithm design and analysis; Runtime; Data structures; Clustering methods; Merging
【Paper Link】 【Pages】:461-469
【Authors】: Md Farhadur Rahman ; Weimo Liu ; Saad Bin Suhaim ; Saravanan Thirumuruganathan ; Nan Zhang ; Gautam Das
【Abstract】: Location Based Services (LBS) have become extremely popular over the past decade, being used on a daily basis by millions of users. Instances of real-world LBS range from mapping services (e.g., Google Maps) to lifestyle recommendations (e.g., Yelp) to real-estate search (e.g., Redfin). In general, an LBS provides a public (often web-based) search interface over its backend database (of tuples with 2D geolocations), taking as input a 2D query point and returning k tuples in the database that are closest to the query point, where k is usually a small constant such as 20 or 50. Such a public interface is often called a k-Nearest-Neighbor, i.e., kNN, interface.
【Keywords】: Clustering algorithms; Two dimensional displays; Google; Algorithm design and analysis; Spatial databases; Data mining
【Paper Link】 【Pages】:473-484
【Authors】: Xing Niu ; Raghav Kapoor ; Boris Glavic ; Dieter Gawlick ; Zhen Hua Liu ; Venkatesh Radhakrishnan
【Abstract】: Data provenance is essential for debugging query results, auditing data in cloud environments, and explaining outputs of Big Data analytics. A well-established technique is to represent provenance as annotations on data and to instrument queries to propagate these annotations to produce results annotated with provenance. However, even sophisticated optimizers are often incapable of producing efficient execution plans for instrumented queries, because of their inherent complexity and unusual structure. Thus, while instrumentation enables provenance support for databases without requiring any modification to the DBMS, the performance of this approach is far from optimal. In this work, we develop provenance-specific optimizations to address this problem. Specifically, we introduce algebraic equivalences targeted at instrumented queries and discuss alternative, equivalent ways of instrumenting a query for provenance capture. Furthermore, we present an extensible heuristic and cost-based optimization (CBO) framework that governs the application of these optimizations and implement this framework in our GProM provenance system. Our CBO is agnostic to the plan space shape, uses a DBMS for cost estimation, and enables retrofitting of optimization choices into existing code by adding a few LOC. Our experiments confirm that these optimizations are highly effective, often improving performance by several orders of magnitude for diverse provenance tasks.
【Keywords】: Instruments; Optimization; Pipelines; Algebra; Databases; Encoding; Semantics
【Paper Link】 【Pages】:485-496
【Authors】: Seokki Lee ; Sven Köhler ; Bertram Ludäscher ; Boris Glavic
【Abstract】: Explaining why an answer is in the result of a query or why it is missing from the result is important for many applications including auditing, debugging data and queries, and answering hypothetical questions about data. Both types of questions, i.e., why and why-not provenance, have been studied extensively. In this work, we present the first practical approach for answering such questions for queries with negation (first-order queries). Our approach is based on a rewriting of Datalog rules (called firing rules) that captures successful rule derivations within the context of a Datalog query. We extend this rewriting to support negation and to capture failed derivations that explain missing answers. Given a (why or why-not) provenance question, we compute an explanation, i.e., the part of the provenance that is relevant to answer the question. We introduce optimizations that prune parts of a provenance graph early on if we can determine that they will not be part of the explanation for a given question. We present an implementation that runs on top of a relational database using SQL to compute explanations. Our experiments demonstrate that our approach scales to large instances and significantly outperforms an earlier approach which instantiates the full provenance to compute explanations.
【Keywords】: Urban areas; Relational databases; Firing; Conferences; Data engineering; Debugging
【Paper Link】 【Pages】:497-508
【Authors】: Marios Meimaris ; George Papastefanatos ; Nikos Mamoulis ; Ioannis Anagnostopoulos
【Abstract】: SPARQL query execution in state-of-the-art RDF engines depends on, and is often limited by, the underlying storage and indexing schemes. Typically, these systems exhaustively store permutations of the standard three-column triples table. However, even though RDF can give birth to datasets with loosely defined schemas, it is common for an emerging structure to appear in the data. In this paper, we introduce a novel indexing scheme for RDF data that takes advantage of the inherent structure of triples. To this end, we define the Extended Characteristic Set (ECS), a schema abstraction that classifies triples based on the properties of their subjects and objects, and we discuss methods and algorithms for the identification and extraction of ECSs. We show how these can be used to assist query processing, and we implement axonDB, an RDF storage and querying engine based on ECS indexing. We perform an experimental evaluation on real-world and synthetic datasets and observe that axonDB outperforms the competition by a few orders of magnitude.
【Keywords】: Resource description framework; Indexing; Query processing; Engines; Triples (Data structure); Data mining
【Paper Link】 【Pages】:509-520
【Authors】: Jianye Yang ; Wenjie Zhang ; Shiyu Yang ; Ying Zhang ; Xuemin Lin
【Abstract】: In this paper, we study the problem of set containment join. Given two collections R and S of records, the set containment join R ⋈ S retrieves all record pairs (r, s) ∈ R × S such that r ⊆ s. This problem has been extensively studied in the literature and has many important applications in commercial and scientific fields. Recent research focuses on in-memory set containment join algorithms, and several techniques have been developed following intersection-oriented or union-oriented computing paradigms. Nevertheless, we observe that the two computing paradigms have their limits due to the nature of the intersection and union operators. Particularly, the intersection-oriented method relies on the intersection of the relevant inverted lists built on the elements of S. A nice property of the intersection-oriented method is that the join computation is verification free. However, the number of records explored during the join process may be large because there are multiple replicas for each record in S. On the other hand, the union-oriented method generates a signature for each record in R, and the candidate pairs are obtained by the union of the inverted lists of the relevant signatures. The candidate size of the union-oriented method is usually small because each record contributes only one replica in the index. Unfortunately, the union-oriented method needs to verify the candidate pairs, which may be expensive especially when the join result size is large. As a matter of fact, the state-of-the-art union-oriented solution is not competitive compared to the intersection-oriented ones. In this paper, we propose a new union-oriented method, namely TT-Join, which not only enhances the advantage of the previous union-oriented methods but also integrates the goodness of intersection-oriented methods by imposing a variant of prefix tree structure. We conduct extensive experiments on 20 real-life datasets, comparing our method with 7 existing methods. The experimental results demonstrate that TT-Join significantly outperforms the existing algorithms on most of the datasets, and can achieve up to two orders of magnitude speedup.
【Keywords】: Indexes; Algorithm design and analysis; Australia; Companies; Conferences; Data engineering; Computer science
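To make the two paradigms contrasted in the abstract concrete, the sketch below implements the intersection-oriented baseline: inverted lists are built on the elements of S and intersected for each record of R, so no verification step is needed. It is not TT-Join itself; the list-of-sets input format and function name are assumptions.

```python
from collections import defaultdict

def containment_join(R, S):
    """Intersection-oriented set containment join: pairs (i, j) with R[i] ⊆ S[j].

    Builds inverted lists on the elements of S and intersects the lists of
    each record in R.  This is the verification-free paradigm the abstract
    contrasts with the union-oriented one.  R and S are lists of sets.
    """
    inverted = defaultdict(set)
    for j, s in enumerate(S):
        for e in s:
            inverted[e].add(j)
    result = []
    for i, r in enumerate(R):
        if not r:                       # the empty set is contained in every record
            result.extend((i, j) for j in range(len(S)))
            continue
        lists = [inverted[e] for e in r]
        for j in set.intersection(*lists):
            result.append((i, j))
    return result
```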
【Paper Link】 【Pages】:523-534
【Authors】: Shangyu Luo ; Zekai J. Gao ; Michael N. Gubanov ; Luis Leopoldo Perez ; Christopher M. Jermaine
【Abstract】: As data analytics has become an important application for modern data management systems, a new category of data management system has appeared recently: the scalable linear algebra system. In this paper, we argue that a parallel or distributed database system is actually an excellent platform upon which to build such functionality. Most relational systems already have support for cost-based optimization, which is vital to scaling linear algebra computations, and it is well known how to make relational systems scale. We show that by making just a few changes to a parallel/distributed relational database system, such a system can be a competitive platform for scalable linear algebra. Taken together, our results should at least raise the possibility that brand new systems designed from the ground up to support scalable linear algebra are not absolutely necessary, and that such systems could instead be built on top of existing relational technology. Our results also suggest that if scalable linear algebra is to be added to a modern dataflow platform such as Spark, it should be added on top of the system's more structured (relational) data abstractions, rather than being constructed directly on top of the system's raw dataflow operators.
【Keywords】: Linear algebra; Relational databases; Database systems; Arrays; Buildings
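The abstract argues that relational engines are a natural host for scalable linear algebra. As an illustration of that general viewpoint (not the authors' system), the snippet below expresses sparse matrix multiplication as a join plus group-by over (row, col, val) triples using SQLite; the schema and the tiny matrices are hypothetical examples.

```python
import sqlite3

# Matrix multiply expressed relationally over sparse (i, j, v) triples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A(i INT, j INT, v REAL);
    CREATE TABLE B(i INT, j INT, v REAL);
""")
conn.executemany("INSERT INTO A VALUES (?,?,?)", [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0)])
conn.executemany("INSERT INTO B VALUES (?,?,?)", [(0, 0, 4.0), (1, 0, 5.0)])

# C = A * B: join on the inner dimension, aggregate over it.
product = conn.execute("""
    SELECT a.i, b.j, SUM(a.v * b.v) AS v
    FROM A a JOIN B b ON a.j = b.i
    GROUP BY a.i, b.j
""").fetchall()
print(product)   # [(0, 0, 14.0), (1, 0, 12.0)]
```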
【Paper Link】 【Pages】:535-546
【Authors】: Evan R. Sparks ; Shivaram Venkataraman ; Tomer Kaftan ; Michael J. Franklin ; Benjamin Recht
【Abstract】: Modern advanced analytics applications make use of machine learning techniques and contain multiple steps of domain-specific and general-purpose processing with high resource requirements. We present KeystoneML, a system that captures and optimizes the end-to-end large-scale machine learning applications for high-throughput training in a distributed environment with a high-level API. This approach offers increased ease of use and higher performance over existing systems for large scale learning. We demonstrate the effectiveness of KeystoneML in achieving high quality statistical accuracy and scalable training using real world datasets in several domains.
【Keywords】: Pipelines; Optimization; Training; Distributed databases; Data models; Computer science; Libraries
【Paper Link】 【Pages】:547-558
【Authors】: Buwen Wu ; Yongluan Zhou ; Hai Jin ; Amol Deshpande
【Abstract】: Existing parallel SPARQL query optimizers assume hash-based data partitioning and adopt plan enumeration algorithms with unnecessarily high complexity. Therefore, they cannot easily accommodate other partitioning methods and only consider an unnecessarily limited plan space. To address these problems, we first define a generic RDF data partitioning model to capture the common structure of various state-of-the-art RDF data partitioning methods. Then we propose a query plan enumeration algorithm that not only has optimal efficiency but also accommodates different data partitioning methods. Furthermore, based on a solid analysis of the complexity of the plan enumeration algorithm, we propose two new heuristic methods that consider a much larger plan space than the existing methods while still confining the search space of the algorithm. An autonomous approach is proposed to choose between the two methods by considering the structure and the size of a complex SPARQL query. We conduct extensive experiments using synthetic datasets and a real-world dataset, which show the superiority of our algorithms compared to existing ones.
【Keywords】: Resource description framework; Partitioning algorithms; Heuristic algorithms; Complexity theory; Algorithm design and analysis; Data models; Query processing
【Paper Link】 【Pages】:559-570
【Authors】: Christos Anagnostopoulos ; Peter Triantafillou
【Abstract】: Recent trends aim to incorporate advanced data analytics capabilities within DBMSs. Linear regression queries are fundamental to exploratory analytics and predictive modeling. However, computing their exact answers leaves a lot to be desired in terms of efficiency and scalability. We contribute a novel predictive analytics model and associated regression query processing algorithms, which are efficient, scalable and accurate. We focus on predicting the answers to two key query types that reveal dependencies between the values of different attributes: (i) mean-value queries and (ii) multivariate linear regression queries, both within specific data subspaces defined based on the values of other attributes. Our algorithms achieve many orders of magnitude improvement in query processing efficiency and near-perfect approximations of the underlying relationships among data attributes.
【Keywords】: Data models; Linear regression; Predictive models; Analytical models; Mathematical model; Linear approximation; Scalability
【Paper Link】 【Pages】:571-582
【Authors】: Hui Miao ; Ang Li ; Larry S. Davis ; Amol Deshpande
【Abstract】: Deep learning has improved state-of-the-art results in many important fields, and has been the subject of much research in recent years, leading to the development of several systems for facilitating deep learning. Current systems, however, mainly focus on model building and training phases, while the issues of data management, model sharing, and lifecycle management are largely ignored. The deep learning modeling lifecycle generates a rich set of data artifacts, e.g., learned parameters and training logs, and it comprises several frequently conducted tasks, e.g., to understand the model behaviors and to try out new models. Dealing with such artifacts and tasks is cumbersome and largely left to the users. This paper describes our vision and implementation of a data and lifecycle management system for deep learning. First, we generalize model exploration and model enumeration queries from commonly conducted tasks by deep learning modelers, and propose a high-level domain specific language (DSL), inspired by SQL, to raise the abstraction level and thereby accelerate the modeling process. Second, to manage the variety of data artifacts, especially the large amount of checkpointed float parameters, we design a novel model versioning system (dlv) and a read-optimized parameter archival storage system (PAS) that minimizes storage footprint and accelerates query workloads with minimal loss of accuracy. PAS archives versioned models using deltas in a multi-resolution fashion by separately storing the less significant bits, and features a novel progressive query (inference) evaluation algorithm. Third, we develop efficient algorithms for archiving versioned models using deltas under co-retrieval constraints. We conduct extensive experiments over several real datasets from the computer vision domain to show the efficiency of the proposed techniques.
【Keywords】: Data models; Machine learning; Training; Computational modeling; Predictive models; Open area test sites; Neural networks
【Paper Link】 【Pages】:585-596
【Authors】: Farhana Murtaza Choudhury ; Zhifeng Bao ; J. Shane Culpepper ; Timos Sellis
【Abstract】: In this paper, we propose and study the problem of top-m rank aggregation of spatial objects in streaming queries, where, given a set of objects O and a stream of spatial queries (kNN or range), the goal is to report the m objects with the highest aggregate rank. The rank of an object with respect to an individual query is computed based on its distance from the query location, and the aggregate rank is computed from all of the individual rank orderings. In order to solve this problem, we show how to upper and lower bound the rank of an object for any unseen query. Then we propose an approximate solution to continuously monitor the top-m objects efficiently, for which we design an Inverted Rank File (IRF) index to guarantee the error bound of the solution. In particular, we propose the notion of safe ranking to determine whether the current result is still valid when new queries arrive, and the notion of validation objects to limit the number of objects to update in the top-m results. We also propose an exact solution for applications where an approximate solution is not sufficient. Last, we conduct extensive experiments to verify the efficiency and effectiveness of our solutions. This is a fundamental problem that draws inspiration from three different domains: rank aggregation, continuous queries, and spatial databases, and the solution can be used to monitor the importance/popularity of spatial objects, which in turn can provide new analytical tools for spatial data.
【Keywords】: Spatial databases; Aggregates; Monitoring; Computational modeling; Australia
【Paper Link】 【Pages】:597-608
【Authors】: Sheng Wang ; Zhifeng Bao ; J. Shane Culpepper ; Timos Sellis ; Mark Sanderson ; Xiaolin Qin
【Abstract】: We study a new type of spatial-textual trajectory search: the Exemplar Trajectory Query (ETQ), which specifies one or more places to visit, and descriptions of activities at each place. Our goal is to efficiently find the top-k trajectories by computing spatial and textual similarity at each point. The computational cost for pointwise matching is significantly higher than previous approaches. Therefore, we introduce an incremental pruning baseline and explore how to adaptively tune our approach, introducing a gap-based optimization and a novel two-level threshold algorithm to improve efficiency. Our proposed methods support order-sensitive ETQ with a minor extension. Experiments on two datasets verify the efficiency and scalability of our proposed solution.
【Keywords】: Trajectory; Semantics; Upper bound; Search problems; Lifting equipment; Measurement; Keyword search
【Paper Link】 【Pages】:609-620
【Authors】: Bilong Shen ; Ying Zhao ; Guoliang Li ; Weimin Zheng ; Yue Qin ; Bo Yuan ; Yongming Rao
【Abstract】: Intelligent transportation systems, e.g., Uber, have become an important tool for urban transportation. An important problem is k nearest neighbor (kNN) search on moving objects with road-network constraints, which, given moving objects on the road networks and a query, finds the k nearest objects to the query location. Existing studies focus on either kNN search on static objects or continuous kNN search with Euclidean-distance constraints. The former cannot support dynamic updates of moving objects while the latter cannot support road networks. Since the objects are dynamically moving on the road networks, there are two main challenges. The first is how to index the moving objects on road networks and the second is how to find the k nearest moving objects. To address these challenges, in this paper we propose a new index, V-Tree, which has two salient features. Firstly, it is a balanced search tree and can support efficient kNN search. Secondly, it can support dynamic updates of moving objects. To build a V-Tree, we iteratively partition the road network into sub-networks and build a tree structure on top of the sub-networks. Then we associate the moving objects with their nearest vertices in the V-Tree. When the location of an object is updated, we only need to update the tree nodes on the path from the corresponding leaf node to the root. We design a novel kNN search algorithm using V-Tree by pruning large numbers of irrelevant vertices in the road network. Experimental results on real datasets show that our method significantly outperforms baseline approaches by 2-3 orders of magnitude.
【Keywords】: Roads; Search problems; Indexes; Tools; Algorithm design and analysis; Complexity theory
【Paper Link】 【Pages】:621-632
【Authors】: Guoyang Chen ; Yufei Ding ; Xipeng Shen
【Abstract】: Finding the k nearest neighbors of a query point or a set of query points (KNN) is a fundamental problem in many application domains. It is expensive to do. Prior efforts in improving its speed have followed two directions with conflicting considerations: One tries to minimize the redundant distance computations but often introduces irregularities into computations, the other tries to exploit the regularity in computations to best exert the power of GPU-like massively parallel processors, which often introduces even extra distance computations. This work gives a detailed study on how to effectively combine the strengths of both approaches. It manages to reconcile the polar opposite effects of the two directions through elastic algorithmic designs, adaptive runtime configurations, and a set of careful implementation-level optimizations. The efforts finally lead to a new KNN on GPU named Sweet KNN, the first high-performance triangular-inequality-based KNN on GPU that manages to reach a sweet point between redundancy minimization and regularity preservation for various datasets. Experiments on a set of datasets show that Sweet KNN outperforms existing GPU implementations on KNN by up to 120X (11X on average).
【Keywords】: Graphics processing units; Instruction sets; Redundancy; Optimization; Kernel; Upper bound; Algorithm design and analysis
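The abstract's key ingredient is triangle-inequality-based pruning of distance computations. The sketch below shows that idea in its simplest single-landmark, CPU-only form: a candidate is skipped when the reverse triangle inequality already rules it out. Sweet KNN's GPU scheduling and adaptivity are not reflected here; the function and parameter names are illustrative.

```python
import heapq
import math

def knn_triangle(points, query, k, landmark):
    """k-NN with a single-landmark triangle-inequality prune (sketch only).

    If |d(q, L) - d(x, L)| already exceeds the current k-th best distance,
    then d(q, x) cannot be smaller, so the exact distance is never computed.
    `points`, `query`, and `landmark` are coordinate tuples.
    """
    dq = math.dist(query, landmark)
    dl = [math.dist(p, landmark) for p in points]   # precomputed per point
    heap = []                                       # max-heap via negated distances
    for idx, p in enumerate(points):
        if len(heap) == k and abs(dq - dl[idx]) >= -heap[0][0]:
            continue                                # pruned: lower bound too large
        d = math.dist(query, p)
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, idx))
    return sorted((-nd, i) for nd, i in heap)       # (distance, index) pairs
```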
【Paper Link】 【Pages】:633-644
【Authors】: Jinfei Liu ; Juncheng Yang ; Li Xiong ; Jian Pei
【Abstract】: Outsourcing data and computation to a cloud server provides a cost-effective way to support large-scale data storage and query processing. However, due to security and privacy concerns, sensitive data (e.g., medical records) need to be protected from the cloud server and other unauthorized users. One approach is to outsource encrypted data to the cloud server and have the cloud server perform query processing on the encrypted data only. It remains a challenging task to support various queries over encrypted data in a secure and efficient way such that the cloud server does not gain any knowledge about the data, query, and query result. In this paper, we study the problem of secure skyline queries over encrypted data. The skyline query is particularly important for multi-criteria decision making but also presents significant challenges due to its complex computations. We propose a fully secure skyline query protocol on data encrypted using semantically secure encryption. As a key subroutine, we present a new secure dominance protocol, which can also be used as a building block for other queries. Finally, we provide both serial and parallelized implementations and empirically study the protocols in terms of efficiency and scalability under different parameter settings, verifying the feasibility of our proposed solutions.
【Keywords】: Cryptography; Servers; Protocols; Cloud computing; Query processing; Data privacy
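To pin down the query semantics that the protocol in the abstract protects, here is the plaintext skyline computed by a block-nested-loop dominance test (smaller is better in every dimension). The paper's contribution is evaluating this dominance test securely over encrypted data, which this sketch does not attempt.

```python
def skyline(points):
    """Plaintext block-nested-loop skyline over tuples (minimization in all dimensions).

    A point survives iff no other point dominates it, i.e. is <= in every
    dimension and strictly < in at least one.  Shown only to fix the
    semantics of the (unencrypted) skyline query.
    """
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]
```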
【Paper Link】 【Pages】:647-658
【Authors】: David Broneske ; Veit Köppen ; Gunter Saake ; Martin Schäler
【Abstract】: Evaluating selection predicates is a data-intensive task that reduces intermediate results, which are the input for further operations. With analytical queries getting more and more complex, the number of evaluated selection predicates per query and table rises, too. This leads to numerous multi-column selection predicates. Recent approaches to increase the performance of main-memory databases for selection-predicate evaluation aim at optimally exploiting the speed of the CPU by using accelerated scans. However, scanning each column one by one leaves tuning opportunities open that arise if all predicates are considered together. To this end, we introduce Elf, an index structure that is able to exploit the relation between several selection predicates. Elf features cache sensitivity, an optimized storage layout, fixed search paths, and slight data compression. In our evaluation, we compare its query performance to state-of-the-art approaches and a sequential scan using SIMD capabilities. Our results indicate a clear superiority of our approach for multi-column selection predicate queries with a low combined selectivity. For TPC-H queries with multi-column selection predicates, we achieve a speed-up between a factor of five and two orders of magnitude, mainly depending on the selectivity of the predicates.
【Keywords】: Indexes; Layout; Redundancy; Acceleration; Windows; Optimization
【Paper Link】 【Pages】:659-670
【Authors】: Shuhao Zhang ; Bingsheng He ; Daniel Dahlmeier ; Amelie Chi Zhou ; Thomas Heinze
【Abstract】: Driven by the rapidly increasing demand for handling real-time data streams, many data stream processing (DSP) systems have been proposed. Regardless of their different architectures, these DSP systems mostly aim at scaling out using a cluster of commodity machines and are built around a number of key design aspects: a) pipelined processing with message passing, b) on-demand data parallelism, and c) JVM-based implementation. However, there is a lack of studies of these key design aspects on modern scale-up architectures, where more CPU cores are being put on the same die, and the on-chip cache hierarchies are getting larger, deeper, and more complex. Multiple sockets bring non-uniform memory access (NUMA) effects. In this paper, we revisit the aforementioned design aspects on a modern scale-up server. Specifically, we use a series of applications as micro benchmarks to conduct detailed profiling studies on Apache Storm and Flink. From the profiling results, we observe two major performance issues: a) the massively parallel execution model causes serious front-end stalls, which are a major performance bottleneck on a single CPU socket, b) the lack of a NUMA-aware mechanism severely limits the scalability of DSP systems on multi-socket architectures. Addressing these issues should allow DSP systems to exploit modern scale-up architectures, which also benefits scale-out environments. We present our initial efforts on resolving the above-mentioned performance issues, which have shown up to 3.2x and 3.1x improvements in the performance of Storm and Flink, respectively.
【Keywords】: Digital signal processing; Sockets; Storms; Message passing; Parallel processing; Pipelines; Multicore processing
【Paper Link】 【Pages】:671-682
【Authors】: Kai Zhang ; Jiayu Hu ; Bingsheng He ; Bei Hua
【Abstract】: As an emerging hardware, the coupled CPU-GPU architecture integrates a CPU and a GPU into a single chip, where the two processors share the same memory space. This special property opens up new opportunities for building in-memory key-value store systems, as it eliminates the data transfer costs on the PCI-e bus, and enables fine-grained cooperation between the CPU and the GPU. In this paper, we propose DIDO, an in-memory key-value store system with dynamic pipeline executions on the coupled CPU-GPU architecture, to address the limitations and drawbacks of state-of-the-art system designs. DIDO is capable of adapting to different workloads through dynamically adjusting the pipeline with fine-grained task assignment to the CPU and the GPU at runtime. By exploiting the hardware features of coupled CPU-GPU architectures, DIDO achieves this goal with a set of techniques, including dynamic pipeline partitioning, flexible index operation assignment, and work stealing. We develop a cost model guided adaption mechanism to determine the optimal pipeline configuration. Our experiments have shown the effectiveness of DIDO in significantly enhancing the system throughput for diverse workloads.
【Keywords】: Pipelines; Graphics processing units; Indexes; Data structures; Architecture
【Paper Link】 【Pages】:683-694
【Authors】: Risi Thonangi ; Jun Yang
【Abstract】: Log-structured merge (LSM) is an increasingly prevalent approach to indexing, especially for modern write-heavy workloads. LSM organizes data in levels with geometrically increasing sizes. Records enter the top level; whenever a level fills up, it is merged down into the next level. Hence, the index is updated only through merges and records are never updated in place. While originally conceived to avoid slow random accesses on hard drives, LSM also turns out to be especially suited to solid-state drives, or any block-based storage with expensive writes. We study how to further reduce writes in LSM. Traditionally, LSM always merges an overflowing level fully into the next. We investigate in depth how partial merges save writes and prove bounds on their effectiveness. We propose new algorithms that make provably good decisions on whether to perform a partial merge, and if so, which part of a level to merge. We also show how to further reduce writes by reusing data blocks during merges. Overall, our approach offers better worst-case guarantees and better practical performance than existing LSM variants.
【Keywords】: Drives; Writing; Optimization; Indexing; Merging; Standards
【Paper Link】 【Pages】:697-708
【Authors】: Rui Li ; Alex X. Liu
【Abstract】: This paper concerns the fundamental problem of processing conjunctive queries that contain both keyword conditions and range conditions on public clouds in a privacy-preserving manner. No prior Searchable Symmetric Encryption (SSE) based privacy-preserving conjunctive query processing scheme satisfies the three requirements of adaptive security, efficient query processing, and scalable index size. In this paper, we propose the first privacy-preserving conjunctive query processing scheme that satisfies the above requirements. To achieve adaptive security, we propose an Indistinguishable Bloom Filter (IBF) data structure for indexing. To achieve efficient query processing and structure indistinguishability, we propose a highly balanced binary tree data structure called the Indistinguishable Binary Tree (IBtree). To optimize searching efficiency, we propose a traversal width minimization algorithm and a traversal depth minimization algorithm. To achieve a scalable and compact index size, we propose an IBtree space compression algorithm to remove redundant information in IBFs. We formally prove that our scheme is adaptively secure using a random oracle model. The key contribution of this paper is on achieving conjunctive query processing with both a strong privacy guarantee and practical efficiency in terms of both speed and space. We implemented our scheme in C++, and evaluated and compared its performance with KRB [24] for keyword queries and PBtree [32] for range queries on two real-world data sets. Experimental results show that our scheme is fast (with query times in milliseconds) and scalable.
【Keywords】: Indexes; Adaptation models; Cloud computing; Query processing; Data models; Cryptography
【Paper Link】 【Pages】:709-720
【Authors】: Pietro Colombo ; Elena Ferrari
【Abstract】: NoSQL datastores allow the efficient management of high volumes of heterogeneous and unstructured data, meeting the requirements of a variety of today's ICT applications. However, most of these systems poorly support data security, and recent surveys show that their simplistic support for data protection is considered a reason not to use them. In recent years, Attribute Based Access Control (ABAC) has been gaining popularity for its ability to provide highly flexible and customized forms of data protection at different granularity levels. In the current work, with the aim of raising users' confidence in the protection of data managed by NoSQL systems, we define a general approach to enforce ABAC within NoSQL systems. Our approach relies on SQL++ [20], a unifying query language for NoSQL platforms. In particular, we develop a novel SQL++ query rewriting mechanism able to enforce heterogeneous types of ABAC policies specified up to cell level. Experimental results show an overhead that is not negligible for policies covering a high percentage of the fields of the protected documents, but that is far more contained when field-level policies are more sparsely specified.
【Keywords】: Access control; Data models; Structured Query Language; Data protection; Electronic mail; Databases
【Paper Link】 【Pages】:721-732
【Authors】: Boxiang Dong ; Wendy Hui Wang
【Abstract】: The cloud paradigm enables users to outsource their data to computationally powerful third-party service providers for data management. Many data management tasks rely on the data dependencies in the outsourced data. This raises an important issue of how the data owner can protect the sensitive information in the outsourced data while preserving the data dependencies. In this paper, we consider functional dependency (FD), an important type of data dependency. Although simple deterministic encryption schemes can preserve FDs, they may be vulnerable to the frequency analysis attack. We design a frequency-hiding, FD-preserving probabilistic encryption scheme, named F2, that enables the service provider to discover the FDs from the encrypted dataset. We consider two attacks, namely the frequency analysis (FA) attack and the FD-preserving chosen plaintext attack (FCPA), and show that the F2 encryption scheme can defend against both attacks with formal provable guarantees. Our empirical study demonstrates the efficiency and effectiveness of F2, as well as its security against both FA and FCPA attacks.
【Keywords】: Encryption; Probabilistic logic; Outsourcing; Servers; Cleaning
【Paper Link】 【Pages】:733-744
【Authors】: Kerim Yasin Oktay ; Murat Kantarcioglu ; Sharad Mehrotra
【Abstract】: This paper explores secure data processing in hybrid clouds wherein local computing capability is exploited alongside public cloud services to deliver an efficient and secure data management solution. Hybrid clouds offer numerous advantages including the ability to selectively outsource data and computations based on sensitivity/confidentiality. Data processing in hybrid clouds must address two interrelated challenges: (i) data distribution: how is data distributed across public and private machines, and (ii) distributed query processing: how are queries executed efficiently without leaking sensitive data to untrusted public machines. This paper addresses these challenges and incorporates the respective solutions into an add-on tool for a Hadoop, Spark, and Hive based cloud computing infrastructure. Our results show performance advantages in using our strategy as compared to other secure alternatives, even when the percentage of sensitive data is as high as 50%.
【Keywords】: Cloud computing; Encryption; Electronic mail; Data models; Sensitivity
【Paper Link】 【Pages】:747-758
【Authors】: Georgios Damaskinos ; Rachid Guerraoui ; Rhicheek Patra
【Abstract】: Similarity computations are crucial in various web activities like advertisements, search or trust-distrust predictions. These similarities often vary with time as product perception and popularity constantly change with users' evolving inclination. The huge volume of user-generated data typically results in heavyweight computations for even a single similarity update. We present I-SIM, a novel similarity metric that enables lightweight similarity computations in an incremental and temporal manner. Incrementality enables updates with low latency whereas temporality captures users' evolving inclination. The main idea behind I-SIM is to disintegrate the similarity metric into mutually independent time-aware factors which can be updated incrementally. We illustrate the efficacy of I-SIM through a novel recommender (SWIFT) as well as through a trust-distrust predictor in Online Social Networks (I-TRUST). We experimentally show that I-SIM enables fast and accurate predictions in an energy-efficient manner.
【Keywords】: Measurement; Time complexity; Mathematical model; Social network services; Collaboration; Energy consumption
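The central idea of the preceding abstract is decomposing a similarity metric into factors that can be updated incrementally. As a much-simplified illustration of incremental updates (without I-SIM's time-aware factors), the class below maintains a cosine similarity from running sums so each new co-rating costs O(1); the interface and class name are hypothetical.

```python
class IncrementalCosine:
    """Incremental cosine similarity between two users' co-rating streams.

    Keeps the dot product and the squared norms as running sums, so each
    newly observed co-rated item updates the similarity in constant time.
    This only illustrates incrementality; I-SIM additionally factors in
    temporal decay, which is not modeled here.
    """

    def __init__(self):
        self.dot = 0.0
        self.norm_a = 0.0
        self.norm_b = 0.0

    def update(self, rating_a, rating_b):
        # rating_a, rating_b: the two users' ratings on a new co-rated item
        self.dot += rating_a * rating_b
        self.norm_a += rating_a * rating_a
        self.norm_b += rating_b * rating_b

    def value(self):
        denom = (self.norm_a * self.norm_b) ** 0.5
        return self.dot / denom if denom else 0.0
```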
【Paper Link】 【Pages】:759-770
【Authors】: Yong Zhang ; Xiuxing Li ; Jin Wang ; Ying Zhang ; Chunxiao Xing ; Xiaojie Yuan
【Abstract】: Similarity search is an essential operation in many applications. Given a collection of set records and a query, exact set similarity search aims at finding all the records in the collection that are similar to the query. Existing methods adopt a filter-and-verify framework that makes use of inverted indexes. However, as the complexity of verification is rather low for set-based similarity metrics, they always fail to make a good tradeoff between filter power and filter cost. In this paper, we propose an efficient framework for exact set similarity search based on a tree index structure. We define a hash-based ordering to effectively import data into the index structure and then make optimizations to reduce the filter cost. To further improve the filter power, we propose a dynamic algorithm to partition the dataset into several parts and a multiple-index framework. Experimental results on real-world datasets show that our method significantly outperforms the state-of-the-art algorithms.
【Keywords】: Indexes; Heuristic algorithms; Search problems; Partitioning algorithms; Measurement; Data collection; Complexity theory
【Paper Link】 【Pages】:771-782
【Authors】: Pratik Vinay Gupte ; Balaraman Ravindran ; Srinivasan Parthasarathy
【Abstract】: In social network analysis, the fundamental idea behind the notion of roles is to discover actors who have similar structural signatures. Actors performing the same role have similar behavioural and functional characteristics. A few examples of structural roles are bridge nodes, clique members, and star centers. Role discovery involves partitioning the nodes in a network based on their structural characteristics. The notion of roles is complementary to the notion of community detection, which involves partitioning the network into cohesive subgroups. In this paper we propose a novel algorithm, RIDεRs (Role Identification and Discovery using ε-equitable Refinements): a graph partitioning approach for extracting soft roles in networks. RIDεRs discovers structural roles based on the global graph characteristics of a network. Since evaluating the quality of discovered roles is nontrivial due to the lack of ground-truth role datasets, we present a novel framework for evaluating and comparing various role discovery approaches. We also demonstrate the effectiveness of RIDεRs on diverse graph mining tasks: role identification/discovery and finding the top-k nodes that are most similar to a given node. Further, an empirical scalability analysis of our proposed algorithm on random power-law graphs shows that our approach is highly scalable.
【Keywords】: Erbium; Algorithm design and analysis; Partitioning algorithms; Feature extraction; Robustness; Electronic mail; Data mining
【Paper Link】 【Pages】:783-794
【Authors】: Yongjiang Liang ; Peixiang Zhao
【Abstract】: We consider in this paper the similarity search problem that retrieves relevant graphs from a graph database under the well-known graph edit distance (GED) constraint. Formally, given a graph database G = {g1, g2, …, gn} and a query graph q, we aim to search for the graphs gi ∈ G such that the graph edit distance between gi and q, GED(gi, q), is within a user-specified GED threshold. In spite of its theoretical significance and wide applicability, the GED-based similarity search problem is challenging in large graph databases due in particular to the large amount of GED computation incurred, which has proven to be NP-hard. In this paper, we propose a parameterized, partition-based GED lower bound that can be instantiated into a series of tight lower bounds towards synergistically pruning false-positive graphs from G before costly GED computation is performed. We design an efficient, selectivity-aware algorithm to partition the graphs of G into highly selective subgraphs. These are further incorporated into a cost-effective, multi-layered indexing structure, ML-Index (Multi-Layered Index), for GED lower bound cross-checking and false-positive graph filtering with theoretical performance guarantees. Experimental studies on real and synthetic graph databases validate the efficiency and effectiveness of ML-Index, which achieves up to an order of magnitude speedup over the state-of-the-art method for similarity search in graph databases.
【Keywords】: Search problems; Indexing; Partitioning algorithms; Algorithm design and analysis; Pattern recognition
【Paper Link】 【Pages】:797-808
【Authors】: Xuan Zhou ; Xin Zhou ; Zhengtai Yu ; Kian-Lee Tan
【Abstract】: Snapshot Isolation (SI) is a widely adopted concurrency control mechanism in database systems, which utilizes timestamps to resolve conflicts between transactions. However, centralized allocation of timestamps is a potential bottleneck for parallel transaction management. This bottleneck is becoming increasingly visible with the rapidly growing degree of parallelism of today's computing platforms. This paper introduces Posterior Snapshot Isolation (PostSI), an SI mechanism that allows transactions to determine their timestamps autonomously, without relying on centralized coordination. As such, PostSI can scale well, rendering it suitable for various multi-core and MPP platforms. Extensive experiments are conducted to demonstrate its advantage over existing approaches.
【Keywords】: Silicon; Clocks; Schedules; Database systems; Distributed databases; Scalability
【Paper Link】 【Pages】:809-820
【Authors】: Ning Wang ; Xiaokui Xiao ; Yin Yang ; Zhenjie Zhang ; Yu Gu ; Ge Yu
【Abstract】: Differential privacy, which has been applied in Google Chrome and Apple iOS, provides strong privacy assurance to users while retaining the capability to discover statistical patterns from sensitive data. We focus on top-k frequent itemset mining on sensitive data, with the goal of obtaining high result utility while satisfying differential privacy. There are two basic methodologies for designing a high-utility solution: one uses generic differential privacy mechanisms as building blocks and minimizes result error through algorithm design. Most existing work follows this approach. The other methodology is to devise a new building block customized for frequent itemset mining. This is much more challenging: to our knowledge, only one recent work, NoisyCut, attempts to do so; unfortunately, NoisyCut has been found to violate differential privacy. This paper proposes a novel solution, PrivSuper, which contains both a new algorithm and a new differential privacy mechanism. Unlike most existing methods that follow the Apriori framework, which starts from single items and iteratively forms larger itemsets, PrivSuper directly searches for maximal frequent itemsets, and subsequently adds their sub-itemsets to the results without additional privacy budget consumption. During the search, PrivSuper applies a customized mechanism to extend the current itemset with one more item, which we call the sequence exponential mechanism (SEM). Notably, SEM does not consume any privacy budget at all if it turns out that the current itemset cannot be extended. Extensive experiments using several real datasets demonstrate that PrivSuper achieves significantly higher result utility compared to previous solutions.
【Keywords】: Itemsets; Privacy; Data privacy; Noise measurement; Sensitivity
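PrivSuper's SEM is described as a customized variant of the exponential mechanism. For reference, the sketch below shows the generic exponential mechanism it builds on: a candidate is selected with probability proportional to exp(epsilon * score / (2 * sensitivity)). The scoring function and SEM's zero-budget behavior are specific to the paper and not reproduced here; the function name is illustrative.

```python
import math
import random

def exponential_mechanism(candidates, score, epsilon, sensitivity):
    """Generic exponential mechanism for differentially private selection.

    Picks one candidate with probability proportional to
    exp(epsilon * score(c) / (2 * sensitivity)).  The max-shift below only
    stabilizes the exponentials and does not change the distribution.
    """
    scores = [score(c) for c in candidates]
    m = max(scores)
    weights = [math.exp(epsilon * (s - m) / (2 * sensitivity)) for s in scores]
    return random.choices(candidates, weights=weights, k=1)[0]
```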
【Paper Link】 【Pages】:821-832
【Authors】: Yang Cao ; Masatoshi Yoshikawa ; Yonghui Xiao ; Li Xiong
【Abstract】: Differential Privacy (DP) has received increasing attention as a rigorous privacy framework. Many existing studies employ traditional DP mechanisms (e.g., the Laplace mechanism) as primitives, which assume that the data are independent, or that adversaries do not have knowledge of the data correlations. However, continuously generated data in the real world tend to be temporally correlated, and such correlations can be acquired by adversaries. In this paper, we investigate the potential privacy loss of a traditional DP mechanism under temporal correlations in the context of continuous data release. First, we model the temporal correlations using a Markov model and analyze the privacy leakage of a DP mechanism when adversaries have knowledge of such temporal correlations. Our analysis reveals that the privacy loss of a DP mechanism may accumulate and increase over time. We call this temporal privacy leakage. Second, to measure such privacy loss, we design an efficient algorithm for calculating it in polynomial time. Although the temporal privacy leakage may increase over time, we also show that its supremum may exist in some cases. Third, to bound the privacy loss, we propose mechanisms that convert any existing DP mechanism into one that protects against temporal privacy leakage. Experiments with synthetic data confirm that our approach is efficient and effective.
【Keywords】: Privacy; Correlation; Data privacy; Databases; Aggregates; Probabilistic logic; Algorithm design and analysis
【Paper Link】 【Pages】:833-844
【Authors】: Haida Zhang ; Zengfeng Huang ; Zhewei Wei ; Wenjie Zhang ; Xuemin Lin
【Abstract】: In many modern applications, input data are represented as matrices and often arrive continuously. The ability to summarize and approximate data matrices in streaming fashion has become a common requirement in many emerging environments. In these applications, input data are usually generated at multiple distributed sites, and simply centralizing all data is often infeasible. Therefore, novel algorithmic techniques are required. Furthermore, in most of these applications, queries must be answered solely based on the recently observed data points (e.g., data collected over the last hour/day/month), which makes the problem even more challenging. In this paper, we propose to study the problem of tracking matrix approximations over distributed sliding windows. In this problem, there are m distributed sites, each observing a stream of d-dimensional data points. The goal is to continuously track a small matrix B as an approximation to A_W, the matrix consisting of the data points in the union of the streams that arrived during the last W time units. The quality of the approximation is measured by the covariance error ||A_W^T A_W - B^T B|| / ||A_W||_F^2 [1], and the primary goal is to minimize communication while providing a provable error guarantee. We propose novel communication-efficient algorithms for this problem. Our sampling-based algorithms continuously track a weighted sample of rows according to their squared norms, which generalizes and simplifies the sampling techniques in [2]. We also propose deterministic tracking algorithms that require only one-way communication and provide better error guarantees. All algorithms have provable guarantees, and extensive experimental studies on real and synthetic datasets validate our theoretical claims and demonstrate the efficiency of these algorithms.
【Keywords】: Covariance matrices; Distributed databases; Monitoring; Windows; Principal component analysis; Protocols
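The sampling-based algorithms in the abstract track rows with probability proportional to their squared norms. The sketch below shows that basic primitive on a static matrix with NumPy, rescaling sampled rows so that B^T B is an unbiased estimate of A^T A; the sliding-window and distributed aspects of the paper are omitted, and the function name is illustrative.

```python
import numpy as np

def norm_sample_sketch(A, ell, seed=0):
    """Squared-norm row sampling as a matrix sketch (illustrative only).

    Samples ell rows of A with probability proportional to their squared
    norms and rescales them so that B^T B is an unbiased estimate of A^T A.
    """
    rng = np.random.default_rng(seed)
    norms = np.einsum("ij,ij->i", A, A)              # squared row norms
    probs = norms / norms.sum()
    idx = rng.choice(A.shape[0], size=ell, replace=True, p=probs)
    B = A[idx] / np.sqrt(ell * probs[idx])[:, None]  # rescale sampled rows
    return B

# Covariance error of the sketch, as defined in the abstract:
# np.linalg.norm(A.T @ A - B.T @ B, 2) / np.linalg.norm(A, "fro") ** 2
```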
【Paper Link】 【Pages】:847-858
【Authors】: Chonggang Song ; Wynne Hsu ; Mong-Li Lee
【Abstract】: The diffusion of rumors is a major concern for web users. Limiting the spread of rumors on social networks has become an important task. One approach is to identify nodes to start a truth campaign such that when users are aware of the truth, they would not believe or propagate the rumor. However, existing works do not take into account the delays of information diffusion or the time point beyond which propagation of misinformation is no longer critical. In this paper, we consider a more realistic situation where information is propagated with delays and the goal is to reduce the number of rumor-infected users before a deadline. We call this the Temporal Influence Blocking (TIB) problem. We propose a two-phase solution called TIB-Solver to select k nodes to start a truth campaign such that the number of users reached by a rumor is minimized. Experiments show that the proposed TIB-Solver outperforms the state-of-the-art algorithms in terms of both effectiveness and efficiency.
【Keywords】: Delay effects; Social network services; Delays; Conferences; Data engineering; Computer science; Drives
【Paper Link】 【Pages】:859-870
【Authors】: Yurong Cheng ; Ye Yuan ; Lei Chen ; Christophe G. Giraud-Carrier ; Guoren Wang
【Abstract】: In recent years, online Event Based Social Network (EBSN) platforms have become increasingly popular. One typical task of EBSN platforms is to help users make suitable and personalized plans for participating in different interesting social events. Existing techniques either ignore the minimum-participant requirement constraint for each event, which is crucially needed for some events to be held successfully, or assume that events would not change once announced. In this paper, we address the above inadequacies of existing EBSN techniques. We formally define the Global Event Planning with Constraints (GEPC) problem, and its incremental variant. We prove that both are NP-hard, and provide approximate solutions. Finally, we verify the effectiveness and efficiency of our proposed algorithms through extensive experiments over real and synthetic datasets.
【Keywords】: Planning; Upper bound; Computer science; Games; Seminars; Social network services; Approximation algorithms
【Paper Link】 【Pages】:871-882
【Authors】: Jianxin Li ; Xinjue Wang ; Ke Deng ; Xiaochun Yang ; Timos Sellis ; Jeffrey Xu Yu
【Abstract】: Detecting social communities in large social networks provides an effective way to analyze the social media users' behaviors and activities. It has drawn extensive attention from both academia and industry. One essential aspect of communities in social networks is outer influence, which is the capability to spread internal information of communities to external users. Detecting the communities of high outer influence is of particular interest in a wide range of applications, e.g., Ads trending analytics, social opinion mining and news propagation pattern discovery. However, the existing detection techniques largely ignore the outer influence of the communities. To fill the gap, this work investigates the Most Influential Community Search problem to disclose the communities with the highest outer influences. We first propose a new community model, the maximal kr-Clique community, which has desirable properties, i.e., society, cohesiveness, connectivity, and maximum. Then, we design a novel tree-based index structure, denoted as C-Tree, to maintain the offline computed r-cliques. To efficiently search the most influential communities, we also develop four advanced index-based algorithms which improve the search performance of the non-indexed solution by about 200 times. The efficiency and effectiveness of our solution have been extensively verified using six real datasets and a small case study.
【Keywords】: Search problems; Electronic mail; Australia; Twitter; Indexes; Companies
【Paper Link】 【Pages】:883-894
【Authors】: Yishi Lin ; Wei Chen ; John C. S. Lui
【Abstract】: The majority of influence maximization (IM) studies focus on targeting influential seeders to trigger substantial information spread in social networks. In this paper, we consider a new and complementary problem of how to further increase the influence spread of given seeders. Our study is motivated by the observation that direct incentives could "boost" users so that they are more likely to be influenced by friends. We study the k-boosting problem which aims to find k users to boost so that the final "boosted" influence spread is maximized. The k-boosting problem is different from the IM problem because boosted users behave differently from seeders: boosted users are initially uninfluenced and we only increase their probability to be influenced. Our work also complements the IM studies because we focus on triggering larger influence spread on the basis of given seeders. Both the NP-hardness of the problem and the non-submodularity of the objective function pose challenges to the k-boosting problem. To tackle the problem, we devise two efficient algorithms with data-dependent approximation ratios. We conduct extensive experiments using real social networks demonstrating the efficiency and effectiveness of our proposed algorithms. We show that boosting solutions returned by our algorithms achieve boosts of influence that are up to several times higher than those achieved by boosting solutions returned by intuitive baselines, which have no guarantee of solution quality. We also explore the "budget allocation" problem in our experiments. Compared with targeting seeders with all the budget, larger influence spread is achieved when we allocate the budget to both seeders and boosted users. This also shows that our study complements the IM studies.
【Keywords】: Boosting; Integrated circuit modeling; Companies; Social network services; Approximation algorithms; Advertising
【Paper Link】 【Pages】:897-908
【Authors】: Joeri Rammelaere ; Floris Geerts ; Bart Goethals
【Abstract】: Methods for cleaning dirty data typically rely on additional information about the data, such as user-specified constraints that specify when a database is dirty. These constraints often involve domain restrictions and illegal value combinations. Traditionally, a database is considered clean if all constraints are satisfied. However, many real-world scenarios only have a dirty database available. In such a context, we adopt a dynamic notion of data quality, in which the data is clean if an error discovery algorithm does not find any errors. We introduce forbidden itemsets which capture unlikely value co-occurrences in dirty data, and we derive properties of the lift measure to provide an efficient algorithm for mining low lift forbidden itemsets. We further introduce a repair method which guarantees that the repaired database does not contain any low lift forbidden itemsets. The algorithm uses nearest neighbor imputation to suggest possible repairs. Optional user interaction can easily be integrated into the proposed cleaning method. Evaluation on real-world data shows that errors are typically discovered with high precision, while the suggested repairs are of good quality and do not introduce new forbidden itemsets, as desired.
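A pair-only toy version of the lift idea (co-occurrences whose observed frequency falls far below what the marginal frequencies predict get flagged) might look as follows; the attribute-value encoding and the threshold tau are illustrative, and this is not the paper's full forbidden-itemset miner or its nearest-neighbor repair step.

    from collections import Counter
    from itertools import combinations

    def low_lift_pairs(rows, tau=0.1):
        # rows: list of tuples of (attribute, value) items, one tuple per record.
        # lift({a, b}) = P(a, b) / (P(a) * P(b)); a small lift marks an unlikely co-occurrence.
        n = len(rows)
        item_cnt = Counter(item for row in rows for item in set(row))
        pair_cnt = Counter(p for row in rows for p in combinations(sorted(set(row)), 2))
        flagged = []
        for (a, b), c in pair_cnt.items():
            lift = (c / n) / ((item_cnt[a] / n) * (item_cnt[b] / n))
            if lift < tau:
                flagged.append(((a, b), lift))
        return flagged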
【Keywords】: Itemsets; Maintenance engineering; Cleaning; Data mining; Heuristic algorithms; Reliability
【Paper Link】 【Pages】:909-920
【Authors】: Yasser Altowim ; Sharad Mehrotra
【Abstract】: Entity resolution (ER) is the process of identifying which entities in a dataset represent the same real-world object. This paper proposes a progressive approach to ER using MapReduce. In contrast to traditional ER, progressive ER aims to resolve the dataset such that the rate at which the data quality improves is maximized. Such a progressive approach is useful for many emerging analytical applications that require low latency response and/or in situations where the underlying resources are constrained or costly to use. Experiments with real-world datasets demonstrate the ability of our approach to generate high-quality results using limited amounts of resolution cost.
【Keywords】: Erbium; Cleaning; Schedules; Parallel processing; Complexity theory; Conferences; Data engineering
【Paper Link】 【Pages】:921-932
【Authors】: Angelika Kimmig ; Alex Memory ; Renée J. Miller ; Lise Getoor
【Abstract】: We propose a probabilistic approach to the problem of schema mapping. Our approach is declarative, scalable, and extensible. It builds upon recent results in both schema mapping and probabilistic reasoning and contributes novel techniques in both fields. We introduce the problem of mapping selection, that is, choosing the best mapping from a space of potential mappings, given both metadata constraints and a data example. As selection has to reason holistically about the inputs and the dependencies between the chosen mappings, we define a new schema mapping optimization problem which captures interactions between mappings. We then introduce Collective Mapping Discovery (CMD), our solution to this problem using state-of-the-art probabilistic reasoning techniques, which allows for inconsistencies and incompleteness. Using hundreds of realistic integration scenarios, we demonstrate that the accuracy of CMD is more than 33% above that of metadata-only approaches already for small data examples, and that CMD routinely finds perfect mappings even if a quarter of the data is inconsistent.
【Keywords】: Metadata; Probabilistic logic; Cognition; Optimization; Companies; Conferences; Data engineering
【Paper Link】 【Pages】:933-944
【Authors】: Shuang Hao ; Nan Tang ; Guoliang Li ; Jian Li
【Abstract】: We study the data cleaning problem of detecting and repairing wrong relational data, as well as marking correct data, using well curated knowledge bases (KBs). We propose detective rules (DRs), a new type of data cleaning rules that can make actionable decisions on relational data, by building connections between a relation and a KB. The main invention is that a DR simultaneously models two opposite semantics of a relation using types and relationships in a KB: the positive semantics that explains how attribute values are linked to each other in correct tuples, and the negative semantics that indicates how wrong attribute values are connected to other correct attribute values within the same tuples. Naturally, a DR can mark correct values in a tuple if it matches the positive semantics. Meanwhile, a DR can detect/repair an error if it matches the negative semantics. We study fundamental problems associated with DRs, e.g., rule generation and rule consistency. We present efficient algorithms to apply DRs to clean a relation, based on rule order selection and inverted indexes. Extensive experiments, using both real-world and synthetic datasets, verify the effectiveness and efficiency of applying DRs in practice.
【Keywords】: Cleaning; Semantics; Chemistry; Urban areas; Maintenance engineering; Knowledge based systems; Integrated circuits
【Paper Link】 【Pages】:947-958
【Authors】: Thach Le Nguyen ; Severin Gsponer ; Georgiana Ifrim
【Abstract】: Existing approaches to time series classification can be grouped into shape-based (numeric) and structure-based (symbolic). Shape-based techniques use the raw numeric time series with Euclidean or Dynamic Time Warping distance and a 1-Nearest Neighbor classifier. They are accurate, but computationally intensive. Structure-based methods discretize the raw data into symbolic representations, then extract features for classifiers. Recent symbolic methods have outperformed numeric ones regarding both accuracy and efficiency. Most approaches employ a bag-of-symbolic-words representation, but typically the word-length is fixed across all time series, an issue identified as a major weakness in the literature. Also, there are no prior attempts to use efficient sequence learning techniques to go beyond single words, to features based on variable-length sequences of words or symbols. We study an efficient linear classification approach, SEQL, originally designed for classification of symbolic sequences. SEQL learns discriminative subsequences from training data by exploiting the all-subsequence space using greedy gradient descent. We explore different discretization approaches, from none at all to increasing smoothing of the original data, and study the effect of these transformations on the accuracy of SEQL classifiers. We propose two adaptations of SEQL for time series data: SAX-VSEQL, which can deal with X-axis offsets by learning variable-length symbolic words, and SAX-VFSEQL, which can deal with X-axis and Y-axis offsets by learning fuzzy variable-length symbolic words. Our models are linear classifiers in rich feature spaces. Their predictions are based on the most discriminative subsequences learned during training, and can be investigated for interpreting the classification decision.
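The SAX-style discretization that both adaptations start from is easy to sketch: z-normalize the series, average it into a fixed number of segments (piecewise aggregate approximation), and map segment means to symbols via breakpoints of the standard normal. Word length and alphabet size below are illustrative; the SEQL subsequence learner itself is not shown.

    import numpy as np

    def sax_word(series, word_len=8, alphabet='abcd'):
        # Toy SAX for a 4-letter alphabet: the breakpoints are the quartiles of N(0, 1).
        x = (series - series.mean()) / (series.std() + 1e-8)
        means = np.array([seg.mean() for seg in np.array_split(x, word_len)])
        breakpoints = np.array([-0.67, 0.0, 0.67])
        return ''.join(alphabet[i] for i in np.searchsorted(breakpoints, means))

    print(sax_word(np.sin(np.linspace(0.0, 6.28, 128))))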
【Keywords】: Time series analysis; Training; Optimization; Benchmark testing; Measurement; Data analysis
【Paper Link】 【Pages】:959-970
【Authors】: Lei Cao ; Yizhou Yan ; Caitlin Kuhlman ; Qingyang Wang ; Elke A. Rundensteiner ; Mohamed Y. Eltabakh
【Abstract】: As datasets increase radically in size, highly scalable algorithms leveraging modern distributed infrastructures need to be developed for detecting outliers in massive datasets. In this work, we present the first distributed distance-based outlier detection approach using a MapReduce-based infrastructure, called DOD. DOD features a single-pass execution framework that minimizes communication overhead. Furthermore, DOD overturns two fundamental assumptions widely adopted in the distributed analytics literature, namely cardinality-based load balancing and one algorithm for all data. The multi-tactic strategy of DOD achieves a truly balanced workload by taking the data characteristics into account in data partitioning, and assigns the most appropriate algorithm to each partition based on our theoretical cost models established for distinct classes of detection algorithms. Thus, DOD effectively minimizes the end-to-end execution time. Our experimental study confirms the efficiency of DOD and its scalability to terabytes of data, beating the baseline solutions by a factor of 20x.
【Keywords】: Partitioning algorithms; US Department of Defense; Detection algorithms; Algorithm design and analysis; Distributed databases; Load management; Scalability
【Paper Link】 【Pages】:971-982
【Authors】: Jiawei Zhang ; Jianhui Chen ; Shi Zhi ; Yi Chang ; Philip S. Yu ; Jiawei Han
【Abstract】: Users' addiction to online social networks is discovered to be highly correlated with their social connections in the networks. Dense social connections can effectively help online social networks retain their active users and improve the social network services. Therefore, it is of great importance to make a good prediction of the social links among users. Meanwhile, to enjoy more social network services, users nowadays are usually involved in multiple online social networks simultaneously. Formally, the social networks which share a number of common users are defined as the "aligned networks". With the information transferred from multiple aligned social networks, we can gain a more comprehensive knowledge about the social preferences of users in the pre-specified target network, which will benefit the social link prediction task greatly. However, when transferring the knowledge from other aligned source networks to the target network, there usually exists a shift in information distribution between different networks, namely domain difference. In this paper, we study the social link prediction problem of the target network, which is aligned with multiple social networks concurrently. To accommodate the domain difference issue, we project the features extracted for links from different aligned networks into a shared lower-dimensional feature space. Moreover, users in social networks usually tend to form communities and would only connect to a small number of users. Thus, the target network structure has both the low-rank and sparse properties. We propose a novel optimization framework, SLAMPRED, to combine both of these aforementioned properties of the target network and the information of multiple aligned networks with nice domain adaptations. Since the objective function is a linear combination of convex and concave functions involving nondifferentiable regularizers, we propose a novel optimization method to iteratively solve it. Extensive experiments have been done on real-world aligned social networks, and the experimental results demonstrate the effectiveness of the proposed model.
【Keywords】: Social network services; Feature extraction; Linear programming; Predictive models; Sparse matrices; Estimation; Optimization
【Paper Link】 【Pages】:983-994
【Authors】: Xuyun Zhang ; Wan-Chun Dou ; Qiang He ; Rui Zhou ; Christopher Leckie ; Kotagiri Ramamohanarao ; Zoran A. Salcic
【Abstract】: Anomaly or outlier detection is a major challenge in big data analytics because anomaly patterns provide valuable insights for decision-making in a wide range of applications. Recently proposed anomaly detection methods based on the tree isolation mechanism are very fast due to their logarithmic time complexity, making them capable of handling big data sets efficiently. However, the underlying similarity or distance measures in these methods have not been well understood. Contrary to the claims that these methods never rely on any distance measure, we find that they have close relationships with certain distance measures. This implies that the current use of this fast isolation mechanism is only limited to these distance measures and fails to generalise to other commonly used measures. In this paper, we propose a generic framework named LSHiForest for fast tree isolation based ensemble anomaly analysis with the use of a Locality-Sensitive Hashing (LSH) forest. Being generic, the proposed framework can be instantiated with a diverse range of LSH families, and the fast isolation mechanism can be extended to any distance measures, data types and data spaces where an LSH family is defined. In particular, the instances of our framework with kernelised LSH families or learning based hashing schemes can detect complicated anomalies like local or surrounded anomalies. We also formally show that the existing tree isolation based detection methods are special cases of our framework with the corresponding distance measures. Extensive experiments on both synthetic and real-world benchmark data sets show that the framework can achieve both high time efficiency and anomaly detection quality.
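For concreteness, one classical LSH family with which such a generic framework could be instantiated is random-hyperplane hashing for angular (cosine) distance, sketched below; this is only an example of an LSH family, not the LSHiForest tree construction itself.

    import numpy as np

    class RandomHyperplaneLSH:
        # Each hash bit is the sign of a projection onto a random hyperplane;
        # vectors pointing in similar directions collide on many bits with high probability.
        def __init__(self, dim, n_bits, seed=0):
            self.planes = np.random.default_rng(seed).standard_normal((n_bits, dim))

        def hash(self, x):
            return tuple(bool(b) for b in (self.planes @ x) > 0)

    lsh = RandomHyperplaneLSH(dim=16, n_bits=8)
    print(lsh.hash(np.random.default_rng(1).standard_normal(16)))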
【Keywords】: Vegetation; Current measurement; Big Data; Algorithm design and analysis; Data mining; Benchmark testing; Feature extraction
【Paper Link】 【Pages】:997-1008
【Authors】: Peng Cheng ; Xiang Lian ; Lei Chen ; Cyrus Shahabi
【Abstract】: With the rapid advancement of mobile devices and crowdsourcing platforms, spatial crowdsourcing has attracted much attention from various research communities. A spatial crowdsourcing system periodically matches a number of location-based workers with nearby spatial tasks (e.g., taking photos or videos at some specific locations). Previous studies on spatial crowdsourcing focus on task assignment strategies that maximize an assignment score based solely on the available information about workers/tasks at the time of assignment. These strategies can only achieve local optimality by neglecting the workers/tasks that may join the system in a future time. In contrast, in this paper, we aim to improve the global assignment, by considering both present and future (via predictions) workers/tasks. In particular, we formalize a new optimization problem, namely maximum quality task assignment (MQA). The optimization objective of MQA is to maximize a global assignment quality score, under a traveling budget constraint. To tackle this problem, we design an effective grid-based prediction method to estimate the spatial distributions of workers/tasks in the future, and then utilize the predictions to assign workers to tasks at any given time instance. We prove that the MQA problem is NP-hard, and thus intractable. Therefore, we propose efficient heuristics to tackle the MQA problem, including MQA greedy and MQA divide-and-conquer approaches, which can efficiently assign workers to spatial tasks with high quality scores and low budget consumptions. Through extensive experiments, we demonstrate the efficiency and effectiveness of our approaches on both real and synthetic datasets.
【Keywords】: Crowdsourcing; Optimization; Mobile handsets; Performance evaluation; Conferences; Data engineering; Videos
【Paper Link】 【Pages】:1009-1020
【Authors】: Tianshu Song ; Yongxin Tong ; Libin Wang ; Jieying She ; Bin Yao ; Lei Chen ; Ke Xu
【Abstract】: The prevalence of mobile Internet techniques and Online-To-Offline (O2O) business models has led to the emergence of various spatial crowdsourcing (SC) platforms in our daily life. A core issue of SC is to assign real-time tasks to suitable crowd workers. Existing approaches usually focus on the matching of two types of objects, tasks and workers, or assume static offline scenarios, where the spatio-temporal information of all the tasks and workers is known in advance. Recently, some new emerging O2O applications incur new challenges: SC platforms need to assign three types of objects, tasks, workers and workplaces, and support dynamic real-time online scenarios, which the existing solutions cannot handle. In this paper, based on the aforementioned challenges, we formally define a novel dynamic online task assignment problem, called the trichromatic online matching in real-time spatial crowdsourcing (TOM) problem, which is proven to be NP-hard. Thus, we first devise an efficient greedy online algorithm. However, the greedy algorithm can be trapped into local optimal solutions easily. We then present a threshold-based randomized algorithm that not only guarantees a tighter competitive ratio but also includes an adaptive optimization technique, which can quickly learn the optimal threshold for the randomized algorithm. Finally, we verify the effectiveness and efficiency of the proposed methods through extensive experiments on real and synthetic datasets.
【Keywords】: Employment; Real-time systems; Crowdsourcing; Algorithm design and analysis; Heuristic algorithms; Optimization; Bipartite graph
【Paper Link】 【Pages】:1021-1032
【Authors】: Caleb Chen Cao ; Jiayang Tu ; Zheng Liu ; Lei Chen ; H. V. Jagadish
【Abstract】: As crowdsourcing has been extensively investigated and utilized to address problems in the real world, it is essential and important to think about performance optimization. Analogous to computer systems with CPUs, treating each worker as an HPU (Human Processing Unit [1]) and studying the performance optimization on top of HPUs are interesting perspectives to resolve crowdsourcing issues. However, as we characterize HPUs in detail for this purpose, we find that there are significant differences between CPUs and HPUs, leading to the need for completely new optimization algorithms. In this paper, we study the specific optimization problem of obtaining results the fastest for a crowdsourced job with a fixed total budget. In crowdsourcing, jobs are usually broken down into sets of small tasks, which are assigned to workers one at a time. We consider three scenarios of increasing complexity: Identical Round Homogeneous Tasks, Multiplex Round Homogeneous Tasks, and Multiple Round Heterogeneous Tasks. For each scenario, we analyze the stochastic behavior of the HPU clock rate as a function of the remuneration offered. After that, we develop an optimum Budget Allocation Strategy to minimize the latency of the job completion. We validate our results through extensive simulations and experiments on Amazon Mechanical Turk.
【Keywords】: Crowdsourcing; Resource management; Clocks; Optimization; Tuning; Sorting; Databases
【Paper Link】 【Pages】:1033-1044
【Authors】: Reinhard Heckel ; Michail Vlachos ; Thomas P. Parnell ; Celestine Dünner
【Abstract】: We consider the problem of generating interpretable recommendations by identifying overlapping co-clusters of clients and products, based only on positive or implicit feedback. Our approach is applicable to very large datasets because it exhibits almost linear complexity in the input examples and the number of co-clusters. We show, both on real industrial data and on publicly available datasets, that the recommendation accuracy of our algorithm is competitive with that of state-of-the-art matrix factorization techniques. In addition, our technique has the advantage of offering recommendations that are textually and visually interpretable. Finally, we examine how to implement our technique efficiently on Graphical Processing Units (GPUs).
【Keywords】: Clustering algorithms; Recommender systems; Collaboration; Algorithm design and analysis; Cognition; Bipartite graph; Detection algorithms
【Paper Link】 【Pages】:1047-1058
【Authors】: Yongyang Yu ; MingJie Tang ; Walid G. Aref ; Qutaibah M. Malluhi ; Mostafa M. Abbas ; Mourad Ouzzani
【Abstract】: The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This paper presents new efficient and scalable matrix processing and optimization techniques for in-memory distributed clusters. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics to optimize the cost of matrix computations in an in-memory distributed environment. The result of a matrix operation will often serve as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix processing and optimization techniques in Spark, a distributed in-memory computing platform. Experiments on both real and synthetic data demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement over state-of-the-art distributed matrix computation systems on a wide range of applications.
【Keywords】: Matrix computation; query optimization; distributed computing
【Paper Link】 【Pages】:1059-1070
【Authors】: Chuitian Rong ; Chunbin Lin ; Yasin N. Silva ; Jianguo Wang ; Wei Lu ; Xiaoyong Du
【Abstract】: Set similarity join is an essential operation in big data analytics, e.g., data integration and data cleaning, that finds similar pairs from two collections of sets. To cope with the increasing scale of the data, distributed algorithms are called for to support large-scale set similarity joins. Multiple techniques have been proposed to perform similarity joins using MapReduce in recent years. These techniques, however, usually produce huge amounts of duplicates in order to perform parallel processing successfully, as MapReduce is a shared-nothing framework. The large number of duplicates incurs both a large shuffle cost and unnecessary computation cost, which significantly decreases the performance. Moreover, these approaches do not provide a load balancing guarantee, which results in a skewness problem and negatively affects the scalability properties of these techniques. To address these problems, in this paper, we propose a duplicate-free framework, called FS-Join, to perform set similarity joins efficiently by utilizing an innovative vertical partitioning technique. FS-Join employs three powerful filtering methods to prune dissimilar string pairs without computing their similarity scores. To further improve the performance and scalability, FS-Join integrates horizontal partitioning. Experimental results on three real datasets show that FS-Join outperforms the state-of-the-art methods by one order of magnitude on average, which demonstrates the good scalability and performance qualities of the proposed technique.
【Keywords】: Scalability; Big Data; Load management; Partitioning algorithms; Data engineering; Data integration; Cleaning
【Paper Link】 【Pages】:1071-1082
【Authors】: Namyong Park ; Sejoon Oh ; U. Kang
【Abstract】: How can we analyze tensors that are composed of 0's and 1's? How can we efficiently analyze such Boolean tensors with millions or even billions of entries? Boolean tensors often represent relationship, membership, or occurrences of events such as subject-relation-object tuples in knowledge base data (e.g., 'Seoul'-'is the capital of'-'South Korea'). Boolean tensor factorization (BTF) is a useful tool for analyzing binary tensors to discover latent factors from them. Furthermore, BTF is known to produce more interpretable and sparser results than normal factorization methods. Although several BTF algorithms exist, they do not scale up for large-scale Boolean tensors. In this paper, we propose DBTF, a distributed algorithm for Boolean tensor factorization running on the Spark framework. By caching computation results, exploiting the characteristics of Boolean operations, and with careful partitioning, DBTF successfully tackles the high computational costs and minimizes the intermediate data. Experimental results show that DBTF decomposes up to 163–323× larger tensors than existing methods in 68–382× less time, and exhibits near-linear scalability in terms of tensor dimensionality, density, rank, and machines.
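The Boolean factorization target can be stated compactly: for a rank-R Boolean CP decomposition of a 3-way tensor, the reconstruction replaces sums with logical ORs, i.e., X[i,j,k] = OR over r of (A[i,r] AND B[j,r] AND C[k,r]). A small dense NumPy sketch of that reconstruction (not DBTF's distributed Spark algorithm) is:

    import numpy as np

    def boolean_cp_reconstruct(A, B, C):
        # Replace the sum in ordinary CP reconstruction with a logical OR over the rank index.
        return (np.einsum('ir,jr,kr->ijk', A, B, C) > 0).astype(np.int8)

    rng = np.random.default_rng(0)
    A, B, C = (rng.integers(0, 2, size=(5, 3)) for _ in range(3))
    print(boolean_cp_reconstruct(A, B, C).shape)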
【Keywords】: Tensile stress; Matrix decomposition; Scalability; Electronic mail; Distributed algorithms; Computational efficiency; Knowledge based systems
【Paper Link】 【Pages】:1083-1094
【Authors】: Claudio Martella ; Dionysios Logothetis ; Andreas Loukas ; Georgos Siganos
【Abstract】: In this paper, we present a graph partitioning algorithm to partition graphs with trillions of edges. To achieve such scale, our solution leverages the vertex-centric Pregel abstraction provided by Giraph, a system for large-scale graph analytics. We designed our algorithm to compute partitions with high locality and fair balance, and focused on the characteristics necessary to reach wide adoption by practitioners in production. Our solution can (i) scale to massive graphs and thousands of compute cores, (ii) efficiently adapt partitions to changes to graphs and compute environments, and (iii) seamlessly integrate in existing systems without additional infrastructure. We evaluate our solution on the Facebook and Instagram graphs, as well as on other large-scale, real-world graphs. We show that it is scalable and computes partitionings whose quality is comparable to, and sometimes better than, that of existing solutions. By integrating the computed partitionings in Giraph, we speed up various real-world applications by up to a factor of 5.6 compared to default hash-partitioning.
【Keywords】: Partitioning algorithms; Algorithm design and analysis; Facebook; Production; Ecosystems; Scalability; Convergence
【Paper Link】 【Pages】:1095-1106
【Authors】: Zhida Chen ; Gao Cong ; Zhenjie Zhang ; Tom Z. J. Fu ; Lisi Chen
【Abstract】: A huge amount of data with both spatial and textual information, e.g., geo-tagged tweets, is flooding the Internet. Such spatio-textual data streams contain valuable information for millions of users with various interests in different keywords and locations. Publish/subscribe systems enable efficient and effective information distribution by allowing users to register continuous queries with both spatial and textual constraints. However, the explosive growth of data scale and user base has posed challenges to the existing centralized publish/subscribe systems for spatio-textual data streams. In this paper, we propose our distributed publish/subscribe system, called PS2Stream, which digests a massive spatio-textual data stream and directs the stream to target users with registered interests. Compared with existing systems, PS2Stream achieves a better workload distribution in terms of both minimizing the total amount of workload and balancing the load of workers. To achieve this, we propose a new workload distribution algorithm considering both space and text properties of the data. Additionally, PS2Stream supports dynamic load adjustments to adapt to changes in the workload, which makes PS2Stream adaptive. Extensive empirical evaluation, on a commercial cloud computing platform with real data, validates the superiority of our system design and the advantages of our techniques on system performance improvement.
【Keywords】: Servers; Distributed databases; Indexing; Spatial databases; Corporate acquisitions; Internet
【Paper Link】 【Pages】:1111-1116
【Authors】: Lalitha Viswanathan ; Bikash Chandra ; Willis Lang ; Karthik Ramachandra ; Jignesh M. Patel ; Ajay Kalhan ; David J. DeWitt ; Alan Halverson
【Abstract】: Over-booking cloud resources is an effective way to increase the cost efficiency of a cluster, and is being studied within Microsoft for the Azure SQL Database service. A key challenge is to strike the right balance between the potentially conflicting goals of optimizing for resource allocation efficiency and positive user experience. Understanding when cloud database customers use their database instances and when they are idle can allow one to successfully balance these two metrics. In our work, we formulate and evaluate production-feasible methods to develop idleness profiles for customer databases. Using one of the largest data center telemetry datasets, namely Azure SQL Database telemetry across multiple data centers, we show that our schemes are effective in predicting future patterns of database usage. Our methods are practical and improve the efficiency of clusters while managing customer expectations.
【Keywords】: Databases; Telemetry; Resource management; Measurement; Processor scheduling; Production; Predictive models
【Paper Link】 【Pages】:1117-1128
【Authors】: Raja Subramaniam Thangaraj ; Koyel Mukherjee ; Gurulingesh Raravi ; Asmita Metrewar ; Narendra Annamaneni ; Koushik Chattopadhyay
【Abstract】: Ride sharing is a sustainable, environmentally friendly mode of commute that is gaining in popularity. Though there are well-established commercial service providers, there are not many platforms for facilitating peer-to-peer ride sharing, especially in a dynamic scenario, integrated with multi-modal trip planners. Such systems would need to be highly search-optimized for retrieval of multiple potential ride matches in real time, especially because multi-modal trip planners have a high look-to-book ratio. At the same time, the validity of the matches needs to be ensured, even in a dynamic setting, while addressing quality considerations and constraints such as the maximum detour incurred by rides, walking distance for commuters, and time windows of requests. We describe the Xhare-a-Ride (XAR) system, a platform for dynamic peer-to-peer ride sharing, that is scalable, efficient, and highly search-optimized for retrieving multiple potential matches for every ride request, while handling quality considerations. We propose a hierarchical discretization of the geographical region using grids, landmarks and clusters with theoretical guarantees, along with an efficient in-memory indexing of rides for maintaining spatio-temporal validity within a specified error tolerance. This helps eliminate shortest path computation in real time during search, thus making XAR search-optimized and hence suitable for integration with a multi-modal trip planner. We discuss modes of integrating XAR with such a trip planner for building an integrated system. Finally, we evaluate XAR thoroughly on ride share request data generated from the NY taxi trip data set on three fronts: (i) empirical performance against the theoretical guarantees as well as the trade-off of performance with system parameters, (ii) benchmarking XAR against a state-of-the-art ride share system, showing a significant improvement in search efficiency, and finally, (iii) the efficacy of combining ride sharing with public transportation.
【Keywords】: Public transportation; Real-time systems; Vehicle dynamics; Indexing; Legged locomotion; Peer-to-peer computing; Urban areas
【Paper Link】 【Pages】:1129-1139
【Authors】: Quanzhi Li ; Armineh Nourbakhsh ; Sameena Shah ; Xiaomo Liu
【Abstract】: In this paper, we present a new approach for detecting novel events from social media, especially Twitter, in real time. An event is usually defined by who, what, where and when, and an event tweet usually contains terms corresponding to these aspects. To exploit this information, we propose a method that incorporates simple semantics by splitting the tweet term space into groups of terms that have the meaning of the same type. These groups are called semantic categories (classes) and each reflects one or more event aspects. The semantic classes include named entity, mention, location, hashtag, verb, noun and embedded link. To group tweets talking about the same event into the same cluster, similarity is measured by calculating class-wise similarities and then aggregating them together. Users of a real-time event detection system are usually only interested in novel (new) events, which are happening now or just happened a short time ago. To fulfill this requirement, a temporal identification module is used to filter out event clusters that are about old stories. The clustering module also computes a novelty score for each event cluster, which reflects how novel the event is, compared to previous events. We evaluated our event detection method using multiple quality metrics and a large-scale event corpus containing millions of tweets. The experiment results show that the proposed online event detection method achieves state-of-the-art performance. Our experiment also shows that the temporal identification module can effectively detect old events.
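The class-wise similarity aggregation can be illustrated with a small sketch, assuming each tweet has already been split into term sets per semantic class; the class names and weights below are illustrative placeholders rather than the paper's tuned values.

    def class_wise_similarity(t1, t2, weights):
        # t1, t2: dicts mapping a semantic class to a set of terms;
        # the score is a weighted sum of per-class Jaccard similarities.
        score = 0.0
        for cls, w in weights.items():
            a, b = t1.get(cls, set()), t2.get(cls, set())
            if a or b:
                score += w * len(a & b) / len(a | b)
        return score

    t1 = {'entity': {'boston'}, 'hashtag': {'#marathon'}, 'verb': {'run'}}
    t2 = {'entity': {'boston'}, 'hashtag': {'#marathon'}, 'verb': {'finish'}}
    print(class_wise_similarity(t1, t2, {'entity': 0.4, 'hashtag': 0.4, 'verb': 0.2}))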
【Keywords】: event detection; event novelty; novel event; temporal identification; temporal information; semantic class; social media
【Paper Link】 【Pages】:1140-1149
【Authors】: Hanna Mazzawi ; Gal Dalaly ; David Rozenblatz ; Liat Ein-Dor ; Matan Ninio ; Ofer Lavi
【Abstract】: We present a novel approach for detecting malicious user activity in databases. Specifically, we propose a new machine learning algorithm for detecting attacks such as a stolen user account or illegal use by a user. Our algorithm relies on two main components that examine the consistency of a user's activity and compare it with activity patterns learned from past access. The first component tests for self-consistency, to determine whether the actions performed by a user are consistent with previous patterns. This engine is based on a probabilistic model that we developed to capture a user's normal behavior. The second component checks for global-consistency, to determine whether a user's actions are consistent with the past actions of similar users. We test our algorithm on access data from SQL databases. Experimental results show that we can keep false positive rates low while retaining the overall accuracy level. An outlier detection engine based on the presented methods is now included in the standard offering of IBM InfoSphere Guardium, with positive user feedback.
【Keywords】: Databases; Security; Data models; Algorithm design and analysis; Context; Companies; Servers
【Paper Link】 【Pages】:1150-1161
【Authors】: Abdeltawab M. Hendawi ; Jayant Gupta ; Youying Shi ; Hossam Fattah ; Mohamed H. Ali
【Abstract】: Connected moving objects with location sensors form the world of the Internet of Moving Things. This world includes people, animals, vehicles, drones, and vessels, to name a few. The conventional spatial libraries, including Microsoft SQL Server Spatial (SqlSpatial), are primarily developed to evaluate spatial operations on stationary things. When it comes to real-world applications for the Internet of Moving Things that require real-time tracking and processing, the limitations of these libraries float to the surface. Unfortunately, the SqlSpatial library has very limited operations in this domain. This paper presents the Reactive eXtension Spatial (RxSpatial) library, developed to provide real-time processing of spatio-temporal operations on moving objects connected through the Internet of Things. The superiority of RxSpatial over the basic SqlSpatial is demonstrated through extensive experimental evaluations on real and synthetic data sets.
【Keywords】: Libraries; Real-time systems; Servers; Geospatial analysis; Spatiotemporal phenomena; Internet; Global Positioning System
【Paper Link】 【Pages】:1165-1172
【Authors】: Maosong Fu ; Ashvin Agrawal ; Avrilia Floratou ; Bill Graham ; Andrew Jorgensen ; Mark Li ; Neng Lu ; Karthik Ramasamy ; Sriram Rao ; Cong Wang
【Abstract】: Twitter's data centers process billions of events per day the instant the data is generated. To achieve real-time performance, Twitter has developed Heron, a streaming engine that provides unparalleled performance at large scale. Heron has been recently open-sourced and thus is now accessible to various other organizations. In this paper, we discuss the challenges we faced when transforming Heron from a system tailored for Twitter's applications and software stack to a system that efficiently handles applications with diverse characteristics on top of various Big Data platforms. Overcoming these challenges required a careful design of the system using an extensible, modular architecture which provides flexibility to adapt to various environments and applications. Further, we describe the various optimizations that allow us to gain this flexibility without sacrificing performance. Finally, we experimentally show the benefits of Heron's modular architecture.
【Keywords】: Computer architecture; Storms; Sparks; Topology; Twitter; Yarn; Engines
【Paper Link】 【Pages】:1173-1182
【Authors】: An Qin ; Yuan Yuan ; Dai Tan ; Pengyu Sun ; Xiang Zhang ; Hao Cao ; Rubao Lee ; Xiaodong Zhang
【Abstract】: Fast data analytics at an increasingly large scale has become a critical task in any Internet service company. For example, in Baidu, the major search engine company in China, large volumes of Web and business data at the PB scale are timely and constantly acquired and analyzed for the purposes of evaluating product revenue, tracking product demand in the market, predicting user behavior, upgrading product rankings, diagnosing spam cases, and many others. Response time for queries of various data analytics not only affects user experiences, but also has a serious impact on the productivity of business operations. In this paper, to meet the challenge of fast data analytics, we present Feisu (meaning fast in Chinese), a data integration system over heterogeneous storage systems, which has been widely used in Baidu's critical and daily business analytics applications after our R&D efforts. Feisu is designed and implemented to work together with several heterogeneous storage systems, and to exploit the query similarity embedded in complex query workloads. Our experiments using real-world workloads show that Feisu can significantly improve query performance in Baidu. Feisu has been in production use in Baidu for two years, effectively managing dozens of petabytes of data for various applications.
【Keywords】: Servers; Business; Data analysis; Search engines; Distributed databases; Data models
【Paper Link】 【Pages】:1183-1194
【Authors】: Sijie Guo ; Robin Dhamankar ; Leigh Stewart
【Abstract】: DistributedLog is a high performance, strictly ordered, durably replicated log. It is multi-tenant, designed with a layered architecture that allows reads and writes to be scaled independently and supports OLTP, stream processing and batch workloads. It also supports a globally synchronous consistent replicated log spanning multiple geographically separated regions. This paper describes how DistributedLog is structured, its components and the rationale underlying various design decisions. We have been using DistributedLog in production for several years, supporting applications ranging from transactional database journaling, real-time data ingestion, and analytics to general publish-subscribe messaging.
【Keywords】: Distributed databases; Publish-subscribe; Real-time systems; Throughput; Metadata; Libraries; Twitter
【Paper Link】 【Pages】:1195-1205
【Authors】: Sam Lightstone ; Russ Ohanian ; Michael Haide ; James Cho ; Michael Springgay ; Torsten Steinbach
【Abstract】: In this paper we introduce dashDB Local, a new Big Data & SQL warehousing technology from IBM designed to provide dramatic simplification over traditional data warehousing deployments, while offering next generation query performance and the analytic richness of the Apache Spark ecosystem. Designed as a Software Defined Technology, dashDB Local automatically adapts to hardware platforms to provide simplified deployments. In experiments on multiple workloads, we have achieved deployment times for large clusters in
【Keywords】: Big Data; Data Warehousing; Columnar; SQL; Query; Apache Spark; Docker; Linux Container
【Paper Link】 【Pages】:1209-1212
【Authors】: Mohammed Al-Kateb ; Paul Sinclair ; Alain Crolotte ; Lu Ma ; Grace Au ; Sanjay Nair
【Abstract】: The UNION ALL set operator is useful for combining data from multiple sources. With the emergence of big data ecosystems in which data is typically stored on multiple systems, UNION ALL has become even more important. In this paper, we present optimization techniques implemented in Teradata Database for join queries with UNION ALL. Instead of spooling all branches of UNION ALL before performing join operations, we propose cost-based pushing of joins into branches. Join pushing not only addresses the prohibitive cost of spooling all branches, but also helps in exposing more efficient join methods (e.g., direct hash-based joins) which, otherwise, would not be considered by the query optimizer. The geography of relations being pushed to UNION ALL branches is also adjusted to avoid unnecessary redistributions and duplications of data. We conclude the paper with a performance study that demonstrates the impact of the proposed optimization techniques on query performance.
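The rewrite rests on the classical distributivity of join over bag union; writing $\uplus$ for UNION ALL and $\bowtie$ for the join,

    (S_1 \uplus S_2) \bowtie R \;\equiv\; (S_1 \bowtie R) \uplus (S_2 \bowtie R)

so a cost-based optimizer can decide, per query, whether to spool the union first or to push the join into each branch, where branch-local access paths (e.g., direct hash-based joins) become usable. The identity itself is standard relational algebra; the costing and geography adjustments described above are the Teradata-specific part.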
【Keywords】: Geography; Optimization; Parallel processing; Indexes; Big Data; Ecosystems
【Paper Link】 【Pages】:1213-1224
【Authors】: Shuhao Zhang ; Hoang Tam Vo ; Daniel Dahlmeier ; Bingsheng He
【Abstract】: The SAP Event Stream Processor (ESP) platform aims at delivering real-time stream processing and analytics in many time-critical areas such as Capital Markets, Internet of Things (IoT) and Data Center Intelligence. SAP ESP allows users to realize complex event processing (CEP) in the form of pattern queries. In this paper, we present MOTTO, a multi-query optimizer in SAP ESP, to improve the performance of many concurrent pattern queries. This is motivated by the observations that many real-world applications usually have concurrent pattern queries working on the same data streams, leading to tremendous sharing opportunities among queries. In MOTTO, we leverage three major sharing techniques, namely merge, decomposition and operator transformation sharing, to reduce redundant computation among pattern queries. In addition, MOTTO supports nested pattern queries as well as pattern queries with different window sizes. The experiments demonstrate the efficiency of MOTTO with real-world application scenarios and sensitivity studies.
【Keywords】: Optimization; Monitoring; Tools; Real-time systems; Time factors; Stock markets
【Paper Link】 【Pages】:1225-1236
【Authors】: Ahmad Ghazal ; Todor Ivanov ; Pekka Kostamaa ; Alain Crolotte ; Ryan Voong ; Mohammed Al-Kateb ; Waleed Ghazal ; Roberto V. Zicari
【Abstract】: Benchmarking Big Data solutions has been gaining a lot of attention from research and industry. BigBench is one of the most popular benchmarks in this area, which was adopted by the TPC as TPCx-BB. BigBench, however, has key shortcomings. The structured component of the data model is the same as the TPC-DS data model, which is a complex snowflake-like schema. This is contrary to the simple star schema Big Data models in real life. BigBench also treats the semi-structured web-logs more or less as a structured table. In real life, web-logs are modeled as key-value pairs with unknown schema. Specific keys are captured at query time - a process referred to as late binding. In addition, eleven (out of thirty) of the BigBench queries are TPC-DS queries. These queries are complex SQL applied on the structured part of the data model, which again is not typical of Big Data workloads. In this paper, we present BigBench V2 to address the aforementioned limitations of the original BigBench. BigBench V2 is completely independent of TPC-DS with a new data model and an overhauled workload. The new data model has a simple structured data model. Web-logs are modeled as key-value pairs with a substantial and variable number of keys. BigBench V2 mandates late binding by requiring query processing to be done directly on key-value web-logs rather than a pre-parsed form of them. A new scale factor-based data generator is implemented to produce structured tables, key-value semi-structured web-logs, and unstructured data. We implemented and executed BigBench V2 on Hive. Our proof of concept shows the feasibility of BigBench V2 and outlines different ways of implementing late binding.
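Late binding, in the sense used here, means that the keys of a web-log record are only extracted when a query runs, rather than being parsed into a fixed schema up front. A minimal Python sketch over JSON-encoded key-value log lines (the record format and key name are assumptions for illustration, not BigBench V2's data generator output) is:

    import json

    def late_bound_scan(log_lines, key):
        # The schema is unknown up front; the requested key is pulled out of each
        # raw key-value record at query time, skipping records that lack it.
        for line in log_lines:
            record = json.loads(line)
            if key in record:
                yield record[key]

    logs = ['{"user": "u1", "page": "home"}', '{"user": "u2", "referrer": "ad42"}']
    print(list(late_bound_scan(logs, 'page')))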
【Keywords】: Benchmark testing; Data models; Big Data; Generators; Electronic mail; Web pages; Sparks
【Paper Link】 【Pages】:1237-1248
【Authors】: Arun Iyengar
【Abstract】: Data stores are typically accessed using a client-server paradigm wherein the client runs as part of an application process which is trying to access the data store. This paper presents the design and implementation of enhanced data store clients having the capability of caching data for reducing the latency of data accesses, encryption for providing confidentiality before sending data to the server, and compression for reducing the size of data sent to the server. We support multiple approaches for caching data as well as multiple different types of caches. We also present a Universal Data Store Manager (UDSM) which allows an application to access multiple different data stores and provides a common interface to each data store. The UDSM provides both synchronous and asynchronous interfaces to each data store that it supports. An asynchronous interface allows an application program to access a data store and continue execution before receiving a response from the data store. The UDSM can also monitor the performance of different data stores. A workload generator allows users to easily determine and compare the performance of different data stores. The paper examines the key design issues in developing both the enhanced data store clients and the UDSM. It also looks at important issues in implementing client-side caching. The enhanced data store clients and UDSM are used to determine the performance of different data stores and to quantify the performance gains that can be achieved via caching.
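The caching capability can be sketched as a read-through wrapper around any key-value style client; this is a minimal illustration of client-side caching, assuming a store object exposing get/put methods, and is not the UDSM or the IBM client API itself.

    class CachingClient:
        # Read-through cache in front of a key-value data store client.
        def __init__(self, store, cache=None):
            self.store = store                   # assumed to expose get(key) / put(key, value)
            self.cache = cache if cache is not None else {}

        def get(self, key):
            if key in self.cache:
                return self.cache[key]           # cache hit: no round trip to the server
            value = self.store.get(key)
            self.cache[key] = value
            return value

        def put(self, key, value):
            self.store.put(key, value)
            self.cache[key] = value              # write-through keeps the cache consistent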
【Keywords】: Servers; Encryption; Monitoring; Java; Open source software
【Paper Link】 【Pages】:1249-1254
【Authors】: Alexander Ulanov ; Andrey Simanovsky ; Manish Marwah
【Abstract】: Present-day machine learning is computationally intensive and processes large amounts of data. It is implemented in a distributed fashion in order to address these scalability issues. The work is parallelized across a number of computing nodes. It is usually hard to estimate in advance how many nodes to use for a particular workload. We propose a simple framework for estimating the scalability of distributed machine learning algorithms. We measure the scalability by means of the speedup an algorithm achieves with more nodes. We propose time complexity models for gradient descent and graphical model inference. We validate the gradient descent model with experiments on deep learning training, and the graphical model inference model with experiments on loopy belief propagation. The proposed framework was used to study the scalability of machine learning algorithms in Apache Spark.
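The kind of time-complexity model used for such estimates can be illustrated with a toy per-iteration cost of the form compute/n + communication*n, from which the speedup over a single node follows directly; the constants below are placeholders, not the paper's calibrated models for gradient descent or loopy belief propagation.

    def iteration_time(n, compute_cost, comm_cost_per_node):
        # Computation is divided across n workers; communication grows with the worker count.
        return compute_cost / n + comm_cost_per_node * n

    def speedup(n, compute_cost=100.0, comm_cost_per_node=0.5):
        return iteration_time(1, compute_cost, comm_cost_per_node) / iteration_time(n, compute_cost, comm_cost_per_node)

    print([round(speedup(n), 2) for n in (1, 2, 4, 8, 16, 32)])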
【Keywords】: Computational modeling; Machine learning algorithms; Time complexity; Scalability; Graphical models; Machine learning; Data models
【Paper Link】 【Pages】:1259-1270
【Authors】: Victor J. Marin ; Tobin Pereira ; Srinivas Sridharan ; Carlos R. Rivero
【Abstract】: Currently, there is a "boom" in introductory programming courses to help students develop their computational thinking skills. Providing timely, personalized feedback that makes students reflect about what and why they did correctly or incorrectly is critical in such courses. However, the limited number of instructors and the great volume of submissions instructors need to assess, especially in Massive Open Online Courses (MOOCs), make this task a challenge. One solution is to hire graders or create peer discussions among students; however, feedback may be too general, incomplete or even incorrect. Automatic techniques focus on: a) functional testing, in which feedback usually does not sufficiently guide novices, b) software verification to find code bugs, which may confuse novices since these tools usually skip true errors or produce false errors, and c) comparing against reference solutions, in which a large number of reference solutions or pre-existing correct submissions are usually required. This paper presents a semantic-aware technique to provide personalized feedback that aims to mimic an instructor looking for code snippets in student submissions. These snippets are modeled as subgraph patterns with natural language feedback attached to them. Submissions are transformed into extended program dependence graphs combining control and data flows. We leverage subgraph matching techniques to compute the adequate personalized feedback. Also, constraints correlating patterns allow performing fine-grained assessments. We have evaluated our method on several introductory programming assignments and a large number of submissions. Our technique delivered personalized feedback in milliseconds using a small set of patterns, which makes it appealing in real-world settings.
【Keywords】: Tools; Programming; Software; Testing; Computer bugs; Knowledge based systems; Pattern matching
【Paper Link】 【Pages】:1271-1282
【Authors】: Deokwoo Jung ; Zhenjie Zhang ; Marianne Winslett
【Abstract】: Vibration sensors are becoming an essential part of the Internet of Things (IoT), fueled by quickly evolving technology that improves measurement accuracy and lowers hardware cost. Vibration sensors physically attach to core equipment in control and manufacturing systems, e.g., motors and tubes, providing key insight into the running status of these devices. Massive readings from vibration sensors, however, pose new technical challenges to the analytical system, due to the non-continuous sampling strategy used to save sensor energy, as well as the difficulty of data interpretation. To maximize the utility and minimize the operational overhead of vibration sensors, we propose a new analytical framework, especially designed for vibration analysis based on its unique characteristics. In particular, our data engine targets supporting Remaining Usefulness Lifetime (RUL) estimation, known as one of the most important problems in cyber-physical system maintenance, to optimize the replacement scheduling over the equipment under monitoring. Our empirical evaluations on real manufacturing sites show that scalable and accurate analysis over the vibration data makes it possible to prolong the average lifetime of the tubes by 1.2x and reduce the replacement cost by 20%.
【Keywords】: Vibrations; Micromechanical devices; Monitoring; Reliability; Servers; Vibration measurement; Data collection
【Paper Link】 【Pages】:1283-1294
【Authors】: Guojun Wu ; Yichen Ding ; Yanhua Li ; Jie Bao ; Yu Zheng ; Jun Luo
【Abstract】: Mining spatio-temporal reachable regions aims to find a set of road segments from massive trajectory data that are reachable from a user-specified location and within a given temporal period. Accurately extracting such spatio-temporal reachable areas is vital in many urban applications, e.g., (i) location-based recommendation, (ii) location-based advertising, and (iii) business coverage analysis. The traditional approach of answering such queries essentially performs a distance-based range query over the given road network, which has two main drawbacks: (i) it only works with physical travel distances, whereas users usually care more about dynamic traveling time, and (ii) it gives the same result regardless of the querying time, although the reachable area could vary significantly with different traffic conditions. Motivated by these observations, we propose a data-driven approach that formulates the problem as mining the actual reachable region based on a real historical trajectory dataset. The main challenge in our approach is system efficiency, as verifying the reachability over the massive trajectories involves a huge amount of disk I/O. In this paper, we develop two indexing structures: 1) a spatio-temporal index (ST-Index) and 2) a connection index (Con-Index) to reduce redundant trajectory data access operations. We also propose a novel query processing algorithm with: 1) maximum bounding region search, which directly extracts a small searching region from the index structure, and 2) trace back search, which refines the search results from the previous step to find the final query result. Moreover, our system can also efficiently answer spatio-temporal reachability queries with multiple query locations by skipping the overlapped area search. We evaluate our system extensively using large-scale real taxi trajectory data in Shenzhen, China, where the results demonstrate that the proposed algorithms can reduce the running time by 50%-90% over baseline algorithms.
【Keywords】: Roads; Trajectory; Query processing; Business; Indexing
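The query processing strategy sketched in the abstract above (a cheap index lookup that bounds the candidate region, followed by a refinement pass over the actual trajectories) can be illustrated with a toy example. The sketch below uses a plain time-bucket index and invented record and function names; it is only meant to convey the two-phase prune-then-refine idea, not the paper's ST-Index, Con-Index, or trace-back algorithm.

    from collections import defaultdict

    # Hypothetical trajectory records: (trajectory_id, segment_id, timestamp)
    trajectories = [
        ("t1", "s1", 100), ("t1", "s2", 160), ("t1", "s3", 230),
        ("t2", "s1", 110), ("t2", "s4", 300),
    ]

    # Build a coarse spatio-temporal index: time bucket -> segments observed.
    BUCKET = 60  # seconds per time bucket
    st_index = defaultdict(set)
    for tid, seg, ts in trajectories:
        st_index[ts // BUCKET].add(seg)

    def reachable_segments(start_seg, t_start, t_end):
        # Phase 1: candidate pruning via the index (segments seen in the window).
        candidates = set()
        for bucket in range(t_start // BUCKET, t_end // BUCKET + 1):
            candidates |= st_index[bucket]
        # Phase 2: refinement: keep segments actually reached from start_seg
        # within the window along some stored trajectory.
        reached = set()
        for tid in {t for t, _, _ in trajectories}:
            points = sorted((ts, seg) for t, seg, ts in trajectories if t == tid)
            started = False
            for ts, seg in points:
                if ts > t_end:
                    break
                if seg == start_seg and ts >= t_start:
                    started = True
                if started and seg in candidates:
                    reached.add(seg)
        return reached

    print(reachable_segments("s1", 100, 250))  # e.g. {'s1', 's2', 's3'}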
【Paper Link】 【Pages】:1295-1306
【Authors】: Abdeltawab M. Hendawi ; Aqeel Rustum ; Amr A. Ahmadain ; David Hazel ; Ankur Teredesai ; Dev Oliver ; Mohamed H. Ali ; John A. Stankovic
【Abstract】: In smart cities, commuters have the opportunity for smart routing that may enable selecting a route with fewer car accidents, one that is more scenic, or perhaps a straight and flat route. Such smart personalization requires a data management framework that goes beyond a static road network graph. This paper introduces PreGo, a novel system developed to provide real-time personalized routing. The routes recommended by PreGo are smart and personalized in the sense of being (1) adjustable to individual users' preferences, (2) dependent on the trip start time, and (3) sensitive to changes in road conditions. Extensive experimental evaluation using real and synthetic data demonstrates the efficiency of the PreGo system.
【Keywords】: Routing; Roads; Smart cities; Accidents; Automobiles; Real-time systems; Instruction sets
【Paper Link】 【Pages】:1309-1319
【Authors】: Jose Cordova-Garcia ; Xin Wang
【Abstract】: Phasor Measurement Units (PMUs) provide high-precision data at high sampling rates to support Smart Grid applications. Power line outage detection mechanisms can enhance grid reliability by assisting power operators in taking proper control actions. Despite the potential provided by PMUs, there have been very limited efforts to exploit the available data to detect outages more effectively. Conventional outage detection schemes are mostly designed on the basis of simplified power models, and the limited work on data-driven detection either assumes that all measurement samples are available or ignores the missing entries. Their performance suffers under complex grid conditions in the presence of missing data. In this paper, we design a detection mechanism that accounts for unreliable data in the form of missing data samples. Detection is performed by grouping nodes according to their data availability and their learned detection capabilities. To enable robust detection of power line outages, we propose learning outage characteristics for each individual node instead of for specific single-line outage scenarios. Our results show that the detected outages are highly consistent with the evaluated failures under different scenarios, with high accuracy and low false positive rates. Moreover, the detection application is resilient to unreliable data and can properly differentiate data problems from physical power line failures.
【Keywords】: Phasor measurement units; Data models; Monitoring; Smart grids; Robustness; Power measurement
【Paper Link】 【Pages】:1320-1331
【Authors】: Mohamed Sarwat ; Raha Moraffah ; Mohamed F. Mokbel ; James L. Avery
【Abstract】: Personalized recommendation has become popular in modern web services. For instance, Amazon recommends new items to shoppers, Netflix recommends shows to viewers, and Facebook recommends friends to its users. Despite the ubiquity of recommendation applications, classic database management systems still do not provide in-house support for recommending data stored in the database. In this paper, we present the anatomy of RecDB, an open-source PostgreSQL-based system that provides a unified approach to declarative data recommendation inside the database engine. RecDB realizes the personalized recommendation functionality as query operators inside the database kernel. This facilitates applying the recommendation functionality and typical database operations (e.g., selection, join, top-k) side by side. To further reduce application latency, RecDB pre-computes and caches the generated recommendations in the database. In the paper, we present extensive experiments that study the performance of personalized recommendation applications based on an actual implementation inside PostgreSQL 9.2, using real movie recommendation and location-aware recommendation scenarios. The results show that a recommendation-aware database engine, i.e., RecDB, outperforms the classic approach of implementing the recommendation logic on top of the database engine in various recommendation applications.
【Keywords】: Database; Recommendation; Analytics; Personalization; Machine Learning; Join; Indexing
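To make the idea of recommendation as a query operator concrete, the toy sketch below composes a hypothetical recommend() operator with ordinary selection and top-k operators in one pipeline, mirroring the contrast the abstract draws with running recommendation logic on top of the engine. The data, scores, and operator names are invented for illustration and are not RecDB's actual SQL extensions or internals.

    # Toy illustration of composing a "recommend" operator with ordinary
    # relational operators (selection, top-k) in one query pipeline.
    items = [
        {"id": 1, "genre": "comedy"},
        {"id": 2, "genre": "drama"},
        {"id": 3, "genre": "comedy"},
    ]
    # Pretend these predicted ratings were pre-computed and cached by the engine.
    cached_scores = {("u1", 1): 4.2, ("u1", 2): 3.1, ("u1", 3): 4.8}

    def recommend(user_id):
        # "Recommendation operator": emits (item, predicted_rating) tuples.
        for item in items:
            yield item, cached_scores.get((user_id, item["id"]), 0.0)

    def select(pred, stream):
        return (row for row in stream if pred(row))

    def topk(k, stream):
        return sorted(stream, key=lambda row: row[1], reverse=True)[:k]

    # Equivalent of: recommend items to u1, keep comedies, return the top 2.
    result = topk(2, select(lambda r: r[0]["genre"] == "comedy", recommend("u1")))
    print(result)  # [({'id': 3, ...}, 4.8), ({'id': 1, ...}, 4.2)]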
【Paper Link】 【Pages】:1332-1343
【Authors】: Constantinos Costa ; Georgios Chatzimilioudis ; Demetrios Zeinalipour-Yazti ; Mohamed F. Mokbel
【Abstract】: In the realm of smart cities, telecommunication companies (telcos) are expected to play a protagonistic role, as they can capture a variety of natural phenomena on an ongoing basis, e.g., traffic in a city or mobility patterns for emergency response and city planning. The key challenge for telcos in this era is to ingest huge amounts of network logs in the most compact manner and to perform big data exploration and analytics on the generated data within a tolerable elapsed time. This paper introduces SPATE, an innovative telco big data exploration framework whose objectives are two-fold: (i) minimizing the storage space needed to incrementally retain data over time, and (ii) minimizing the response time of spatio-temporal data exploration queries over recent data. The storage layer of our framework uses lossless data compression to ingest recent streams of telco big data in the most compact manner while retaining full resolution for data exploration tasks. The indexing layer of our system then takes care of the progressive loss of detail in information, coined decaying, as data ages. The exploration layer provides visual means to explore the generated spatio-temporal information space. We measure the efficiency of the proposed framework using a 5GB anonymized real telco network trace and a variety of telco-specific tasks, such as OLAP and OLTP querying, privacy-aware data sharing, multivariate statistics, clustering, and regression. We show that our framework achieves response times comparable to the state of the art using an order of magnitude less storage space.
【Keywords】: Big Data; Entropy; Time factors; Mobile communication; Indexing; Mobile computing
【Paper Link】 【Pages】:1344-1355
【Authors】: Jie-Teng Wang ; Wen-Yang Lin
【Abstract】: Nowadays, spontaneous reporting systems (SRS), such as the FDA Adverse Event Reporting System (FAERS), are widely established to collect adverse drug events for ADR detection and analysis. The SRS data are provided to researchers, and even opened to the public, to foster research on ADR detection and analysis. SRS data normally contain personal information and some sensitive values such as indications, so it is necessary to de-identify the data to prevent the disclosure of individual privacy before publication. Although many privacy-preserving models and anonymization methods exist in the literature, they are not suitable for protecting SRS data from disclosure due to some of its characteristic features. For this reason, we previously proposed a privacy model called MS(k, θ)-bounding and the associated MS-Anonymization algorithm. In the real world, SRS data grow dynamically and need to be published periodically, which thwarts our single-release method, i.e., MS(k, θ)-bounding. In this paper, we present some potential attacks on periodically published SRS data and propose a new privacy model called PPMS(k, θ*)-bounding along with the associated PPMS-Anonymization algorithm. Experimental results on selected FAERS datasets show that our new method can prevent privacy disclosure under these attacks in the periodical data publishing scenario, with a reasonable sacrifice of data utility and an acceptable deviation in the strength of ADR signals.
【Keywords】: adverse drug reaction; data anonymization; incremental data publishing; privacy preserving data publishing; spontaneous reporting system
【Paper Link】 【Pages】:1361-1362
【Authors】: Ryan Marcus ; Sofiya Semenova ; Olga Papaemmanouil
【Abstract】: Data management applications deployed on IaaS cloud environments must simultaneously strive to minimize cost and provide good performance. Balancing these two goals requires complex decision-making across a number of axes: resource provisioning, query placement, and query scheduling. While previous works have addressed each axis in isolation for specific types of performance goals, this demonstration showcases WiSeDB, a cloud workload management advisor service that uses machine learning techniques to address all dimensions of the problem for customizable performance goals. In our demonstration, attendees will see WiSeDB in action for a variety of workloads and performance goals.
【Keywords】: Cloud computing; Databases; Measurement; Context; Data models; Decision trees; Real-time systems
【Paper Link】 【Pages】:1363-1364
【Authors】: Zujian Weng ; Qi Guo ; Chunkai Wang ; Xiaofeng Meng ; Bingsheng He
【Abstract】: Storm is a popular real-time processing system. However, our earlier experiments show that a fixed Storm configuration can lead to either significant resource waste or limited processing throughput. In this demonstration, we present AdaStorm, a system that dynamically adjusts the Storm configuration according to current data stream properties. AdaStorm is designed to minimize resource usage while still ensuring the same or even better real-time response. We will demonstrate that AdaStorm achieves resource efficiency as well as data rate tolerance, compared to a Storm system with a fixed configuration. Video: https://youtu.be/YFPBFNdMbXM.
【Keywords】: Storms; Topology; Throughput; Real-time systems; User interfaces; Data models; Geology
【Paper Link】 【Pages】:1365-1366
【Authors】: Patrick Tan ; Yichao Zhou ; Xinxin Huang ; Giuseppe M. Mazzeo ; Chelsea Ju
【Abstract】: Omics phenotyping has become increasingly recognized in our path to Precision Medicine. A major computational challenge of our investigator community is to identify the necessary data analytical tools to process multi-omics data. To aid navigation of the analytical tools fragmented across the web, we have created a novel computational resource platform, AZTEC, that empowers users to simultaneously search a diverse array of digital resources including databases, standalone software, web services, publications, and large libraries composed of many interrelated functions. AZTEC fosters an environment of sustainable resource support and discovery, enabling researchers to overcome the challenges of information science. An online video of the demonstration can be viewed at https://www.youtube.com/watch?v=FIxvlof6Kbg.
【Keywords】: Tools; Bioinformatics; Information retrieval; Genomics; Portable document format; Data visualization; Semantics
【Paper Link】 【Pages】:1367-1368
【Authors】: Laurynas Siksnys ; Torben Bach Pedersen
【Abstract】: In this demo, we present SolveDB - the first purely SQL-based DBMS for optimization problem solving and solver integration. SolveDB provides (1) an SQL-based syntax for optimization problem specification, (2) an extensible infrastructure of solvers for different classes of problems (e.g., linear programming), and (3) query optimization techniques to achieve the best execution performance and/or result quality. We demonstrate how our PostgreSQL-based SolveDB implementation allows simplifying specifications of data-driven optimization problems and increasing overall solving performance for (1) traditional well-known optimization problems, (2) "how-to" problems based on TPC-H, and (3) a complex energy planning problem with interlinked energy forecasting and load scheduling.
【Keywords】: Optimization; Linear programming; Problem-solving; Query processing; Planning; Integrated circuits
【Paper Link】 【Pages】:1369-1370
【Authors】: Saad Bin Suhaim ; Nan Zhang ; Gautam Das ; Ali Jaoua
【Abstract】: We propose to demonstrate HDBExpDetector, a system that enables analysts to discover bursty events from a third-party web database using nothing but the database's existing search interface, monitoring and detecting sudden changes to aggregates that satisfy certain user-defined conditions. Video: https://youtu.be/7Pwd8o1h5CY.
【Keywords】: Aggregates; Databases; Portable computers; Estimation; Web servers; Monitoring
【Paper Link】 【Pages】:1371-1372
【Authors】: Qing Liu ; Yunjun Gao ; Linlin Zhou ; Gang Chen
【Abstract】: We develop IS2R, an efficient interactive system for refining reverse top-k queries, to eliminate unexpected query results including (i) the absence of expected objects, (ii) the presence of unexpected objects, and (iii) the empty query result. IS2R returns refinement suggestions with minimal costs based on penalty models. In this demonstration (available at https://youtu.be/GnSm4T9Uslk), we show different scenarios of how IS2R can be used to refine the original reverse top-k query, and verify its effectiveness and efficiency.
【Keywords】: Computers; Knowledge discovery; Refining; Automobiles; Interactive systems; Heating systems; Database systems
【Paper Link】 【Pages】:1373-1374
【Authors】: Pierre Bourhis ; Daniel Deutch ; Yuval Moskovitch
【Abstract】: We consider in this demonstration the analysis of complex data-intensive applications. We focus on three classes of analytical questions that are important for application owners and users alike: Why was a result obtained? What would be the result if the application logic or database is modified in a particular way? How can one interact with the application to achieve a particular goal? Answering these questions efficiently is a fundamental step towards optimizing the application and its use. Noting that provenance was a key component in answering similar questions in the context of database queries, we have developed POLYTICS, a system that employs novel provenance-based solutions for these analytic questions for data-centric applications. We propose to demonstrate POLYTICS using an online bicycle shop application as an example, letting participants play the role of both analysts and users.
【Keywords】: Databases; Bicycles; Context; Engines; Navigation; Generators; Computer bugs
【Paper Link】 【Pages】:1375-1376
【Authors】: Sergey Hardock ; Ilia Petrov ; Robert Gottstein ; Alejandro P. Buchmann
【Abstract】: In this paper we demonstrate a novel technique for applying the recently proposed approach of In-Place Appends (IPA) - overwrites on Flash without a prior erase operation. IPA can be applied selectively: only to DB-objects that have frequent and relatively small updates. To do so we couple IPA with the concept of NoFTL regions, allowing the DBA to place update-intensive DB-objects into special IPA-enabled regions. The decision about region configuration can be (semi-)automated by an advisor analyzing DB log files in the background.
【Keywords】: Benchmark testing; Throughput; Graphical user interfaces; Electronic mail; Hardware; Indexes
【Paper Link】 【Pages】:1377-1378
【Authors】: Cetin Sahin ; Aaron Magat ; Victor Zakhary ; Amr El Abbadi ; Huijia (Rachel) Lin ; Stefano Tessaro
【Abstract】: This demonstration introduces the database community to state-of-the-art cryptographic methods that ensure efficient oblivious access to cloud data. In particular, we explore oblivious storage systems which hide both the content of data and data access patterns from an untrusted cloud provider. The demo considers the popular and realistic setting where multiple users from a trusted group asynchronously access and edit potentially overlapping data sets through a trusted proxy. We present a detailed implementation of TaoStore (Sahin et al., S&P 2016), a new tree-based ORAM scheme that processes client requests concurrently and asynchronously in a non-blocking fashion, resulting in substantial gains in throughput, simplicity, and flexibility over previous systems. The demo is presented in the context of a pedagogical game, Guess the Access, which allows participants to play as an adversary trying to guess queries against TaoStore or ObliviStore (Stefanov and Shi, S&P 2013), a recent oblivious storage system which has been shown to leak access patterns. The proposed game will highlight the subtleties and intricacies that underlie the cryptographic methods used to design oblivious storage systems.
【Keywords】: Cloud computing; Games; Throughput; Servers; Databases; Cryptography
【Paper Link】 【Pages】:1379-1380
【Authors】: Tania K. Roblot ; Sebastian Link
【Abstract】: With an ever-growing number of applications producing uncertain data, the probabilistic data model has gained renewed attention. Probabilistic cardinality constraints stipulate a range on the marginal probability by which cardinality constraints hold in probabilistic databases. The core challenge is in identifying the bounds that are meaningful in a given application domain. Our tool helps data and domain experts jointly tackle that challenge. The demo showcases our prototype, which computes perfect data summarizations for any set of probabilistic cardinality constraints: Armstrong databases.
【Keywords】: Probabilistic logic; Databases; Tools; Data models; Prototypes; Unified modeling language; Upper bound
【Paper Link】 【Pages】:1383-1384
【Authors】: Amr Magdy ; Mohamed F. Mokbel
【Abstract】: Video: http://kite.cs.umn.edu/video.html.
【Keywords】: Indexes; Twitter; Data visualization; Big Data; Real-time systems; Database languages; Buildings
【Paper Link】 【Pages】:1385-1386
【Authors】: Abdulrahman Alsaudi ; Mehdi Sadri ; Yasser Altowim ; Sharad Mehrotra
【Abstract】: Twitter provides a strictly limited API that makes it difficult for a simple search using pre-defined textual patterns to provide satisfying coverage of a topic of interest. This paper discusses a tweet acquisition system that queries the Twitter API using a set of key phrases and then analyzes the retrieved tweets. To achieve better coverage of the searched topic, the system employs an adaptive query generation mechanism that iteratively enriches the set of relevant textual patterns based on previously collected tweets, using an explore-exploit strategy. The paper also demonstrates an application called Topic Follow-up on Twitter (TFT) that is built on top of the acquisition system and aims at linking tweets with online articles. It first extracts a set of key phrases from the submitted news article and then utilizes the acquisition and analysis components of the system. Using this application, we show how the adaptive search mechanism of the tweet acquisition system improves coverage of the topic of interest. Video: http://bit.ly/2kqkikB.
【Keywords】: Twitter; Thin film transistors; Adaptive systems; Databases; Joining processes; Uniform resource locators
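The explore-exploit strategy for enriching the query phrases can be sketched with a simple epsilon-greedy loop: phrases that have yielded relevant tweets so far are queried more often, while a small fraction of requests tries under-explored phrases. This is an illustrative stand-in with invented phrases and a simulated relevance judgment; the paper's actual acquisition mechanism may differ.

    import random

    random.seed(0)
    phrases = {"flu outbreak": [0, 0], "vaccine": [0, 0], "fever": [0, 0]}  # [relevant, tried]
    EPSILON = 0.2

    def pick_phrase():
        if random.random() < EPSILON:                      # explore
            return random.choice(list(phrases))
        # exploit: highest observed relevance rate so far
        return max(phrases, key=lambda p: phrases[p][0] / (phrases[p][1] or 1))

    def simulate_search(phrase):
        # Stand-in for calling the Twitter API and judging tweet relevance.
        base = {"flu outbreak": 0.7, "vaccine": 0.4, "fever": 0.2}[phrase]
        return random.random() < base

    for _ in range(100):
        p = pick_phrase()
        phrases[p][1] += 1
        if simulate_search(p):
            phrases[p][0] += 1

    print(phrases)  # "flu outbreak" should end up queried most often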
【Paper Link】 【Pages】:1387-1388
【Authors】: Mohammad Hossein Namaki ; Keyvan Sasani ; Yinghui Wu ; Tingjian Ge
【Abstract】: This demo presents BEAMS, a system that automatically discovers and monitors top-k complex events over graph streams. Unlike conventional event detection over streams of items, BEAMS is able to (1) characterize and detect complex events in dynamic networks as graph patterns, and (2) perform online event discovery with a class of bounded algorithms that compute changes to the top-k events in response to transactions in graph streams, incurring a minimized time cost determined by the changes and independent of the size of the graph streams. We demonstrate: a) how BEAMS identifies top-k complex events as graph patterns in graph streams and supports ad-hoc event queries online, b) how it copes with the sheer size of real-world graph streams with a bounded event detection algorithm, and c) how the GUI of BEAMS interacts with users to support ad-hoc event queries that detect, browse and inspect trending events. Video: https://youtu.be/lVUGM0Fa17Q.
【Keywords】: Event detection; Pattern matching; Graphical user interfaces; Heuristic algorithms; Detectors; Finance; Companies
【Paper Link】 【Pages】:1389-1390
【Authors】: Chunbin Lin ; Jianguo Wang ; Yannis Papakonstantinou
【Abstract】: There is an increasing demand to explore similar entities in big graphs. For example, in domains like biomedical science, identifying similar entities may contribute to developing new drugs or discovering new diseases. In this paper, we demonstrate a graph exploration system, called GQFast, which provides a graphical interface to help users efficiently explore similar entities. Methodologically, GQFast first builds efficient indices combining column database optimizations and compression techniques, and then explores similar entities using these indices. GQFast operates on the real-world PubMed dataset consisting of over 23 million biomedical entities and 1.3 billion relationships. Relying on this high performance, GQFast provides (i) type-ahead search to instantly visualize search results while a user is typing a query, and (ii) context-aware query completion to guide users typing queries.
【Keywords】: Indexes; Space exploration; Visualization; Arteriosclerosis; Data models; Generators
【Paper Link】 【Pages】:1391-1392
【Authors】: J. W. Zhang ; Anwesha Mal ; Y. C. Tay
【Abstract】: Enterprises and researchers often have datasets that can be represented as graphs (e.g. social networks). The owner of a large graph may want to scale it down to a similar but smaller version, e.g. for application development. On the other hand, the owner of a small graph may want to scale it up to a similar but larger version, e.g. to test system scalability. GSCALER is a recently developed tool for such scaling. This demonstration presents GSCALERCloud, a web-based service for access to GSCALER. A user can specify, via a browser, an example or empirical graph G, and a target size for the scaled version G̃. The demonstration has two stages. In Stage I, the visitor can experiment with small graphs and check the similarity between the input G and the scaled G̃. In Stage II, the visitor can test GSCALER with large graphs, and visually check for similarity using aggregate metrics and statistical distributions generated by GSCALERCloud's tools.
【Keywords】: Correlation; Silicon compounds; Generators; Visualization; Social network services; Aggregates; Measurement
【Paper Link】 【Pages】:1393-1394
【Authors】: Hui Miao ; Ang Li ; Larry S. Davis ; Amol Deshpande
【Abstract】: Deep learning has improved the state-of-the-art results in many domains, leading to the development of several systems for facilitating deep learning. Current systems, however, mainly focus on model building and training phases, while the issues of lifecycle management are largely ignored. Deep learning modeling lifecycle contains a rich set of artifacts and frequently conducted tasks, dealing with them is cumbersome and left to the users. To address these issues in a comprehensive manner, we demonstrate ModelHub, which includes a novel model versioning system (dlv), a domain-specific language for searching through model space (DQL), and a hosted service (ModelHub). Video: https://youtu.be/4JVehm5Ohg4.
【Keywords】: Machine learning; Predictive models; Training; Computational modeling; Metadata; Tuning
【Paper Link】 【Pages】:1395-1396
【Authors】: Theodore Georgiou ; Amr El Abbadi ; Xifeng Yan
【Abstract】: Towards the vision of building artificial intelligence systems that can assist with our everyday life, we introduce a proof of concept for a social media privacy "cyborg" which can locally and privately monitor a person's published content and offer advice or warnings when their privacy is at stake. The idea of a cyborg can be more general, as a separate local entity with its own computational resources, that can automatically perform several online tasks on our behalf. For this demonstration, we assume an attacker that can successfully infer user attributes, solely based on what the user has published (topic-based inference). We focus on Social Media privacy and specifically on the issue of exposing sensitive user-attributes, like location, or race, through published content. We built a privacy cyborg that can monitor a user's posted topics and automatically warn them in real time when a sensitive attribute is at risk of being exposed.
【Keywords】: Man-machine systems; Privacy; Twitter; Monitoring; Real-time systems; Probability distribution
【Paper Link】 【Pages】:1397-1398
【Authors】: HongYun Cai ; Vincent Wenchen Zheng ; Penghe Chen ; Fanwei Zhu ; Kevin Chen-Chuan Chang ; Zi Huang
【Abstract】: Community analysis is an important task in graph mining. Most existing community studies focus on community detection, which aims to find the community membership of each user based on user friendship links. However, membership alone, without a complete profile of what a community is and how it interacts with other communities, has limited applications. This motivates us to consider systematically profiling communities and thereby developing useful community-level applications. In this paper, we introduce the novel concept of community profiling, upon which we build the SocialLens system to enable searching and browsing communities by content and interaction. We deploy SocialLens on two social graphs: Twitter and DBLP. We demonstrate two useful applications of SocialLens, namely interactive community visualization and profile-aware community ranking.
【Keywords】: Twitter; Visualization; Systems architecture; Tag clouds; Conferences; Data engineering; Urban areas
【Paper Link】 【Pages】:1399-1400
【Authors】: Jan Neerbeky ; Ira Assentz ; Peter Dolog
【Abstract】: Leakage of sensitive information from unstructured text documents is a costly problem for both government and industrial institutions. Traditional approaches to data leak prevention are commonly based on the hypothesis that sensitive information is reflected in the presence of distinct sensitive words. However, for complex sensitive information, this hypothesis may not hold.
【Keywords】: Training; Load modeling; Sensitivity; Neural networks; Predictive models; Syntactics; Engines
【Paper Link】 【Pages】:1403-1404
【Authors】: Zuozhi Wang ; Flavio Bayer ; Seungjin Lee ; Kishore Narendran ; Xuxi Pan ; Qing Tang ; Jimmy Wang ; Chen Li
【Abstract】: We are developing TextDB, an open-source data management system that supports text-centric operations in a declarative and efficient way using an algebraic approach as in a relational DBMS. In this demonstration, we show scenarios where we can use TextDB to perform powerful information extraction easily and efficiently on text documents. Video: https://github.com/TextDB/textdb/wiki/Video.
【Keywords】: Drugs; Open source software; Engines; Graphical user interfaces; Data mining; Search engines
【Paper Link】 【Pages】:1405-1406
【Authors】: Zhaoqiang Chen ; Qun Chen ; Zhanhuai Li
【Abstract】: For entity resolution, it remains very challenging to find solutions with quality guarantees as measured by both precision and recall. In this demo, we propose a HUman-and-Machine cOoperative framework, denoted HUMO, for entity resolution. Compared with existing approaches, HUMO enables a flexible mechanism for quality control that can enforce both precision and recall levels. We also introduce the problem of minimizing human cost given a quality requirement and present corresponding optimization techniques. Finally, we demonstrate that HUMO achieves high-quality results with a reasonable return on investment (ROI) in terms of human cost on real datasets.
【Keywords】: Quality control; Optimization; Erbium; Crowdsourcing; Manuals; Labeling; Writing
【Paper Link】 【Pages】:1407-1408
【Authors】: Dimitris Stripelis ; José Luis Ambite ; Yao-Yi Chiang ; Sandrah P. Eckel ; Rima Habre
【Abstract】: According to the Centers for Disease Control, there are 6.8 million children living with asthma in the United States. Despite the importance of the disease, the available prognostic tools are not sufficient for biomedical researchers to thoroughly investigate the potential risks of the disease at scale. To overcome these challenges we present a big data integration and analysis infrastructure developed by our Data and Software Coordination and Integration Center (DSCIC) of the NIBIB-funded Pediatric Research using Integrated Sensor Monitoring Systems (PRISMS) program. Our goal is to help biomedical researchers efficiently predict and prevent asthma attacks. The PRISMS-DSCIC is responsible for collecting, integrating, storing, and analyzing real-time environmental, physiological and behavioral data obtained from heterogeneous sensor and traditional data sources. Our architecture is based on the Apache Kafka, Spark and Hadoop frameworks and the PostgreSQL DBMS. A main contribution of this work is extending the Spark framework with a mediation layer, based on logical schema mappings and query rewriting, to facilitate data analysis over a consistent harmonized schema. The system provides both batch and stream analytic capabilities over the massive data generated by wearable and fixed sensors. Demo Video: https://www.youtube.com/watch?v=6ntm4C29L-I.
【Keywords】: Sparks; Pediatrics; Biomedical monitoring; Diseases; Atmospheric measurements; Particle measurements; Pollution measurement
【Paper Link】 【Pages】:1409-1410
【Authors】: Ahmad Assadi ; Tova Milo ; Slava Novgorodov
【Abstract】: We demonstrate DANCE, a system that assists domain experts in the efficient resolution of integrity constraints violation. DANCE is demonstrated on the UEFA Champions League database, employing the ICDE'17 audience for effective data cleaning.
【Keywords】: Databases; Cleaning; Urban areas; Maintenance engineering; Games; User-generated content; Conferences
【Paper Link】 【Pages】:1411-1412
【Authors】: Behrooz Omidvar-Tehrani ; Arnab Nandi ; Nicholas Meyer ; Dalton Flanagan ; Seth Young
【Abstract】: The vast volume of real-time air traffic data, produced through new digital transmissions of aircraft movement throughout the US National Airspace System (NAS), is a rich resource for evaluating the performance of the system. To date, the potential for comprehensively analyzing this data has yet to be tapped, precisely due to a lack of tools that support fully interactive data visualization. In this paper, we propose DV8, an interactive data visualization framework that provides immediately visualized aviation-oriented insights, with a focus on evaluating deviations among flights by route, type, airport, and aircraft performance. Through scenarios validated by aviation experts, we illustrate different uses of DV8 in areas such as capacity planning, flight route prediction, and fuel consumption.
【Keywords】: Data visualization; Aircraft; Airports; Visualization; Measurement; Indexes; Loading
【Paper Link】 【Pages】:1413-1414
【Authors】: Jia Yu ; Raha Moraffah ; Mohamed Sarwat
【Abstract】: This paper demonstrates Hippo, a lightweight database indexing scheme that significantly reduces storage and maintenance overhead without compromising much on query execution performance. Hippo stores disk page ranges instead of tuple pointers for the indexed table, which reduces the storage space occupied by the index. It maintains simplified histograms that represent the data distribution and adopts a page grouping technique that groups contiguous pages into page ranges based on the similarity of their index key attribute distributions. When a query is issued, Hippo leverages the page ranges and histogram-based page summaries to recognize the pages whose tuples are guaranteed not to satisfy the query predicates, and then inspects the remaining pages. We demonstrate Hippo using a billion NYC taxi trip records. Video: http://www.youtube.com/watch?v=wWaOK2-9k9A.
【Keywords】: Public transportation; Histograms; Indexing; Maintenance engineering; Urban areas
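The pruning idea behind this kind of index (coarse per-page-range summaries that let the engine skip ranges which cannot contain matches, then scan only the remaining pages) can be sketched as follows. This is a simplified illustration with min/max summaries standing in for the histograms, not Hippo's actual on-disk layout or PostgreSQL integration.

    # Each "page" is a list of key values; contiguous pages are grouped into
    # ranges, and each range keeps a coarse summary of its key distribution.
    pages = [
        [3, 7, 9], [11, 12, 15], [16, 20, 22], [40, 41, 45],
    ]

    PAGES_PER_RANGE = 2
    ranges = []
    for start in range(0, len(pages), PAGES_PER_RANGE):
        chunk = pages[start:start + PAGES_PER_RANGE]
        flat = [v for page in chunk for v in page]
        ranges.append({"pages": range(start, start + len(chunk)),
                       "min": min(flat), "max": max(flat)})

    def query(lo, hi):
        hits, scanned = [], 0
        for r in ranges:
            if r["max"] < lo or r["min"] > hi:      # summary proves no match: skip
                continue
            for page_no in r["pages"]:              # otherwise inspect the pages
                scanned += 1
                hits += [v for v in pages[page_no] if lo <= v <= hi]
        return hits, scanned

    print(query(10, 14))  # ([11, 12], 2): only 2 of the 4 pages are inspected

Here query(10, 14) skips the second page range entirely, because its summary (min 16) already rules out any match.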
【Paper Link】 【Pages】:1415-1416
【Authors】: Aibek Musaev ; Calton Pu
【Abstract】: Modern world data come from an increasing number of sources, including data from physical sensors like weather satellites and seismographs as well as social networks and weblogs. While progress has been made in the filtering of individual social networks, there are significant advantages in the integration of big data from multiple sources. For physical events, the integration of physical sensors and social network data can improve filtering efficiency and quality of results beyond what is feasible in each individual data stream. Disasters are representative physical events with real world impact. As illustration and demonstration, we have built the LITMUS landslide information service that combines data from both physical sensors and social networks in real-time. LITMUS filters and combines reliable but indirect physical data with direct-report social media data on landslides to achieve high quality and wide coverage of landslide information.
【Keywords】: Terrain factors; Sensors; Feeds; Twitter; Information services; Real-time systems
【Paper Link】 【Pages】:1417-1418
【Authors】: Furong Li ; Mong-Li Lee ; Wynne Hsu
【Abstract】: In this demonstration, we showcase MAROON+, a system that builds historical profiles of real-world target entities by integrating their publicly available information from different websites. We face two challenges when building such a system. First, an entity may change its attribute values over time and it is thus hard to decide if distinct values actually describe the same entity but at different times. Second, the websites may provide inaccurate information or fail to update their contents in a timely manner.
【Keywords】: Computer architecture; Data acquisition; LinkedIn; Twitter; Crawlers; Databases
【Paper Link】 【Pages】:1419-1420
【Authors】: Constantinos Costa ; Georgios Chatzimilioudis ; Demetrios Zeinalipour-Yazti ; Mohamed F. Mokbel
【Abstract】: In this demonstration paper, we present SPATE, an innovative telco big data exploration framework whose objectives are two-fold: (i) minimizing the storage space needed to incrementally retain data over time, and (ii) minimizing the response time for spatiotemporal data exploration queries over stored data. Our framework deploys lossless data compression to ingest streams of telco big data in the most compact manner retaining full resolution for data exploration tasks. We augment our storage structures with decaying principles that lead to the progressive loss of detail as information gets older. Our framework also includes visual and declarative interfaces for a variety of telco-specific data exploration tasks. We demonstrate SPATE in two modes: (i) Visual Mode, where attendees will be able to interactively explore synthetic telco traces we will provide, and (ii) SQL Mode, where attendees can submit custom SQL queries based on a provided schema.
【Keywords】: Big Data; Visualization; Time factors; Indexing; Real-time systems; Data engineering
【Paper Link】 【Pages】:1421-1422
【Authors】: Xin Mou ; Hasan M. Jamil ; Xiaogang Ma
【Abstract】: The adoption and availability of diverse application design and support platforms are making generic scientific application orchestration increasingly difficult. In such an evolving environment, higher-level abstractions of design primitives are critically important, so that end users have a chance to craft their own applications without a complete technical grasp of the lower-level details. In this research, we introduce a novel scientific workflow design platform, called VisFlow, that supports high-level tools for data integration, process description and analytics based on a visual language for novice users, along with advanced options for computing-savvy programmers, in one single platform. We describe its salient features and advantages using a complex scientific application in natural resources and ecology. Video: https://youtu.be/ 2YSYVyOuuk.
【Keywords】: Visualization; Rocks; Earth; Data visualization; Databases; Computer science
【Paper Link】 【Pages】:1425
【Authors】: Magda Balazinska
【Abstract】: There are many potential benefits to ensuring that our research prototypes benefit real users. Users help to define requirements and identify important areas for innovation. Users help to test our systems and verify their utility. Working with real users, however, presents its challenges and requires an extra effort. In this talk, we will present some of these benefits, challenges, and lessons learned from working with real users from the sciences in the context of database systems work in data analytics.
【Paper Link】 【Pages】:1426-1430
【Authors】: Zahid Abul-Basher
【Abstract】: Graph databases have become increasingly important with the rise of social networks, the growth of the Semantic Web, and the characterization of biological networks. Regular path queries (RPQs) explore path patterns in graphs and have become a standard way to query graph databases; SPARQL 1.1 includes property paths and thus encompasses RPQs as a fragment. In many environments, such as visual query systems (VQSs), RPQs are generated visually and may share many commonalities that can be optimized globally. We introduce SWARMGUIDE, a framework for optimizing multiple regular path queries. The framework detects commonalities among the RPQs in order to find an execution plan that is globally optimized over the plan spaces of the constituent RPQs.
【Keywords】: Automata; Resource description framework; Optimization; Planar waveguides; Query processing; Image edge detection
【Paper Link】 【Pages】:1431-1435
【Authors】: Bella Martínez-Seis
【Abstract】: A community is a group of entities that not only has high density but also shares similar characteristics. Only recently have works merged the graph structure of networks with the attributes of nodes. Some of them modify edge weights or use linear combinations; the most promising are model-based algorithms, which handle the mixture of structure and attributes in a centralized way, although this integration increases the dimensionality of the problem. Moreover, not all attributes are relevant for this purpose; some works rank attributes in each iteration of the community detection process, while others do not consider multiple groups with the same attribute. Based on the principle that linked nodes are likely to adopt similar attributes, we assign a weight to each attribute to obtain the most RELevant Node Attributes (RELNA) based on the relations of the nodes. We integrate the selected attributes into overlapping community detection methods, such as the well-known CESNA (Communities from Edge Structure and Node Attributes) and our method QMUCA (Quality Measure to Upgrade Communities using Attributes and structure), a model-based approach that uses optimization in a regression analysis together with a new similarity measure. Experiments show that QMUCA with the RELNA ranking improves accuracy over other methods; this remains work in progress.
【Keywords】: Social network services; Optimization; Image edge detection; Regression analysis; Heuristic algorithms; Algorithm design and analysis; Measurement uncertainty
【Paper Link】 【Pages】:1439-1443
【Authors】: Sujoy Chatterjee ; Anirban Mukhopadhyay ; Malay Bhattacharyya
【Abstract】: The vast majority of existing crowd opinion aggregation models deal with independent opinions of the crowd workers, where the annotators (crowd workers) provide their opinions independently and these opinions are not visible to everyone. In real life, however, there are many applications in which an annotator can see others' opinions, so there is a high chance that an annotator's opinion is biased by the others. My proposed research is about finding a better consensus from multiple (dependent) crowd opinions obtained for a particular question. This work addresses the novel problem of dependent judgment analysis and proposes methods that derive the final judgment from a given set of ordered opinions. For this problem, a Markov chain based aggregation method is proposed that can handle the ordered opinions of the crowd workers and aggregate them to find a consensus. I have used a real-life dataset that contains independent and dependent crowd opinions on a set of questions. Furthermore, I am studying a recently proposed model that collects crowd opinions twice (in independent and dependent fashion) to understand their relation. In a recent study, I have proposed a biclustering approach to compute an aggregated judgment from multiple crowd opinions. In another current work, I have introduced a new problem dealing with constrained crowd opinions and demonstrated that proper aggregation of this type of opinion can be very helpful for smart city planning. Together, these studies address several novel directions of the judgment analysis problem.
【Keywords】: Judgment Analysis; Crowdsourcing; Clustering
【Paper Link】 【Pages】:1444-1448
【Authors】: Anahita Davoudi
【Abstract】: We analyze online social data to model social interactions of users in recommender systems, addressing: i) rating prediction, and ii) detecting spammers and abnormal user rating behaviors. We propose a social trust model that uses matrix factorization to estimate users' tastes by incorporating the user-item matrix. The effect of the tastes of a user's friends is modeled using centrality metrics and similarity measures between users. The proposed method is validated on the Epinions dataset. To identify abnormal users in social recommender systems, we propose a classification approach. We define attributes that estimate the likelihood that a user's profile is that of an attacker. Using the user-item rating matrix and the user-connection matrix, we determine whether ratings are abnormal and whether connections are random. We use k-means clustering to categorize users into authentic users and attackers. We use the Epinions dataset to test against profile injection attacks.
【Keywords】: Recommender systems; Social network services; Computational modeling; Measurement; Prediction algorithms; Predictive models; Mathematical model
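A minimal sketch of the kind of socially regularized matrix factorization described above: the usual user/item factor updates on observed ratings, plus a term that pulls a user's latent factors toward those of trusted friends. The data, friend weighting, and hyperparameters are invented; the abstract's exact objective (e.g. centrality-weighted friend influence) is not reproduced here.

    import numpy as np

    np.random.seed(0)
    R = np.array([[5, 0, 3],      # user-item ratings, 0 = unobserved
                  [4, 2, 0],
                  [0, 1, 4]], dtype=float)
    friends = {0: [1], 1: [0, 2], 2: [1]}   # user -> trusted friends
    K, lr, lam, beta, epochs = 2, 0.02, 0.05, 0.1, 200

    U = 0.1 * np.random.randn(R.shape[0], K)   # user factors
    V = 0.1 * np.random.randn(R.shape[1], K)   # item factors

    for _ in range(epochs):
        for u in range(R.shape[0]):
            for i in range(R.shape[1]):
                if R[u, i] == 0:
                    continue
                err = R[u, i] - U[u] @ V[i]
                U[u] += lr * (err * V[i] - lam * U[u])
                V[i] += lr * (err * U[u] - lam * V[i])
            # social term: move user factors toward the average of their friends
            if friends[u]:
                U[u] += lr * beta * (np.mean(U[friends[u]], axis=0) - U[u])

    print(np.round(U @ V.T, 2))   # reconstructed/predicted rating matrix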
【Paper Link】 【Pages】:1451-1454
【Authors】: Xin Huang ; Laks V. S. Lakshmanan ; Jianliang Xu
【Abstract】: Communities serve as basic structures for understanding the organization of many real-world networks, such as social, biological, collaboration, and communication networks. Recently, community search over large graphs has attracted significantly increasing attention, from simple and static graphs to evolving, attributed, location-based graphs. Different from the well-studied problem of community detection that finds all communities in an entire network, community search is to find the cohesive communities w.r.t. the query nodes. In this tutorial, we survey the state-of-the-art of community search on various kinds of networks across different application areas such as densely-connected community search, attributed community search, social circle discovery, and querying geosocial groups. We first highlight the challenges posed by the community search problems. We continue the presentation of their principles, methodologies, algorithms, and applications, and give a comprehensive comparison of the state-of-the-art techniques. This tutorial finally concludes by offering future directions for research in this important and growing area.
【Keywords】: Social network services; Tutorials; Search problems; Biology; Privacy; Context; Query processing
【Paper Link】 【Pages】:1455-1458
【Authors】: Chao Zhang ; Quan Yuan ; Jiawei Han
【Abstract】: The pervasiveness of GPS-equipped mobile devices has been nurturing an unprecedented amount of semantics-rich spatiotemporal data. The confluence of spatiotemporal and semantic information offers new opportunities for extracting valuable knowledge about people's behaviors, but meanwhile also introduces unique challenges that render conventional spatiotemporal data mining techniques inadequate. Consequently, mining semantics-rich spatiotemporal data has attracted significant research attention from the data mining community in the past few years. In this tutorial, we start by reviewing classic spatiotemporal data mining tasks and identifying the new opportunities introduced by semantics-rich spatiotemporal data. Subsequently, we provide a comprehensive introduction to existing techniques for mining semantics-rich spatiotemporal data, covering topics including spatiotemporal activity mining, spatiotemporal event discovery, and spatiotemporal mobility modeling. Finally, we discuss the limitations of existing research and identify several important future directions.
【Keywords】: Spatiotemporal phenomena; Data mining; Tutorials; Semantics; Trajectory; Computer science; Event detection
【Paper Link】 【Pages】:1459-1462
【Authors】: Kostas Stefanidis ; Vassilis Christophides ; Vasilis Efthymiou
【Abstract】: Entity resolution aims to identify descriptions of the same entity within or across knowledge bases. In this work, we provide a comprehensive and cohesive overview of the key research results in the area of entity resolution. We are interested in frameworks addressing the new challenges in entity resolution posed by the Web of data in which real world entities are described by interlinked data rather than documents. Since such descriptions are usually partial, overlapping and sometimes evolving, entity resolution emerges as a central problem both to increase dataset linking, but also to search the Web of data for entities and their relations. We focus on Web-scale blocking, iterative and progressive solutions for entity resolution. Specifically, to reduce the required number of comparisons, blocking is performed to place similar descriptions into blocks and executes comparisons to identify matches only between descriptions within the same block. To minimize the number of missed matches, an iterative entity resolution process can exploit any intermediate results of blocking and matching, discovering new candidate description pairs for resolution. Finally, we overview works on progressive entity resolution, which attempt to discover as many matches as possible given limited computing budget, by estimating the matching likelihood of yet unresolved descriptions, based on the matches found so far.
【Keywords】: Erbium; Iterative methods; Schedules; Knowledge based systems; Joining processes; Semantics
【Paper Link】 【Pages】:1463-1466
【Authors】: Faisal Nawab ; Divyakant Agrawal ; Amr El Abbadi
【Abstract】: Global-scale data management (GSDM) empowers systems by providing higher levels of fault-tolerance, read availability, and efficiency in utilizing cloud resources. This has led to the emergence of global-scale data management and event processing. However, the Wide-Area Network (WAN) latency separating datacenters is orders of magnitude larger than typical network latencies, and this requires a reevaluation of many of the traditional design trade-offs of data management systems. Therefore, data management problems must be revisited to account for the new design space. In this tutorial, we survey recent developments in GSDM focusing on identifying fundamental challenges and advancements in addition to open research opportunities.
【Keywords】: Tutorials; Wide area networks; Distributed databases; Silicon; Fault tolerance; Fault tolerant systems
【Paper Link】 【Pages】:1467-1470
【Authors】: Andreas Züfle ; Goce Trajcevski ; Dieter Pfoser ; Matthias Renz ; Matthew T. Rice ; Timothy Leslie ; Paul L. Delamater ; Tobias Emrich
【Abstract】: An inherent challenge arising in any dataset containing information about space and/or time is uncertainty due to various sources of imprecision. Integrating the impact of this uncertainty is paramount when estimating the reliability (confidence) of any query result over the underlying input data. To deal with uncertainty, solutions have been proposed independently in the geo-science and the data-science research communities. This interdisciplinary tutorial bridges the gap between the two communities by providing a comprehensive overview of the different challenges involved in dealing with uncertain geo-spatial data, by surveying solutions from both research communities, and by identifying similarities, synergies and open research problems.
【Keywords】: Uncertainty; Tutorials; Spatial databases; Data models; Interpolation; Reliability; Measurement uncertainty
【Paper Link】 【Pages】:1473-1474
【Authors】: Bill Howe ; Michael J. Franklin ; Laura M. Haas ; Tim Kraska ; Jeffrey D. Ullman
【Abstract】: In the first wave of data science education programs, data engineering topics (systems, scalable algorithms, data management, integration) tended to be de-emphasized in favor of machine learning and statistical modeling. The anecdotal evidence suggests this was a mistake: data scientists report spending most of their time grappling with data far upstream of modeling activities. A second wave of data science education is emerging, one with increased emphasis on practical issues in ethics, legal compliance, scientific reproducibility, data quality, and algorithmic bias. The data engineering community has a second chance to influence these programs beyond just providing a set of tools. In this panel, we'll discuss the role of data engineering in data science education programs, and how best to capitalize on emerging opportunities in this space.
【Keywords】: Data science; Computer science; Data engineering; Education; Data models; Databases; Machine learning algorithms
【Paper Link】 【Pages】:1475-1476
【Authors】: Oliver Kennedy ; D. Richard Hipp ; Stratos Idreos ; Amélie Marian ; Arnab Nandi ; Carmela Troncoso ; Eugene Wu
【Abstract】: Data is becoming increasingly personal. Individuals regularly interact with a wide variety of structured data, from SQLite databases on phones, to HR spreadsheets, to personal sensors, to open government data appearing in news articles. Although these workloads are important, many of the classical challenges associated with scale and Big Data do not apply. This panel brings together experts in a variety of fields to explore the new opportunities and challenges presented by "Small Data".
【Keywords】: Databases; Tools; Data systems; Data privacy; Engineering profession; Privacy; Companies
【Paper Link】 【Pages】:1481-1488
【Authors】: Maria Stratigi ; Haridimos Kondylakis ; Kostas Stefanidis
【Abstract】: During the last decade, the number of users who look for health-related information online has increased dramatically. On the other hand, health professionals have less and less time to recommend useful sources of such information to their patients. To this end, we aim at streamlining the process by which caregivers provide useful online information to patients, thereby improving the opportunities patients have to inform themselves online about diseases and possible treatments. Using our system, relevant and high-quality information is delivered to patients based on their profile, as represented in their personal healthcare record data, facilitating easy interaction and minimizing the necessary manual effort. Specifically, in this paper, we propose a model for group recommendations following the collaborative filtering approach. Since in collaborative filtering it is crucial to identify the correct set of similar users for the user in question, in addition to the traditional ratings, we pay particular attention to how health-related information can be exploited for computing similarities between users. Our special focus is on providing valuable suggestions to a caregiver who is responsible for a group of users. We interpret valuable suggestions as suggestions that are both highly relevant and fair to the users of the group. Along these lines, we propose an algorithm for identifying the top-z most valuable recommendations and present its implementation in MapReduce.
【Keywords】: Collaboration; Cancer; Computer architecture; Gold; Computational modeling
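One simple way to make "highly relevant and fair" concrete is to score each candidate item by a weighted combination of the group's average predicted relevance and the score of the least-satisfied member, then return the top-z items. The sketch below illustrates only that aggregation idea with invented data; it is not the paper's algorithm or its MapReduce implementation.

    group_ratings = {               # predicted relevance of each item per member
        "article_a": {"p1": 0.9, "p2": 0.2, "p3": 0.8},
        "article_b": {"p1": 0.6, "p2": 0.7, "p3": 0.6},
        "article_c": {"p1": 0.4, "p2": 0.5, "p3": 0.3},
    }
    ALPHA = 0.5   # weight between average relevance and fairness

    def group_score(member_scores):
        avg = sum(member_scores.values()) / len(member_scores)
        fair = min(member_scores.values())          # least-satisfied member
        return ALPHA * avg + (1 - ALPHA) * fair

    def top_z(z):
        scored = sorted(group_ratings,
                        key=lambda it: group_score(group_ratings[it]),
                        reverse=True)
        return scored[:z]

    print(top_z(2))   # 'article_b' ranks first: decent for everyone, unfair to no one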
【Paper Link】 【Pages】:1489-1496
【Authors】: Yuta Suzuki ; Makito Sato ; Hiroaki Shiokawa ; Masashi Yanagisawa ; Hiroyuki Kitagawa
【Abstract】: Given brain and myoelectric signals taken from a mouse, how can we classify its sleep stages accurately? Classifying sleep stages is a fundamental problem in recent diagnoses and clinical research. However, sleep staging suffers from a serious weakness: clinical experts must visually inspect the brain and myoelectric signals to improve sleep staging accuracy, because recent diagnoses and clinical research require classification accuracy of at least 95% to ensure the precision of subsequent analyses. In this paper, we present an automatic classification method, MASC, based on the following three approaches: (1) it extracts effective features that fully represent the properties of each sleep stage, (2) it classifies sleep stages using temporal patterns of sleep stage transitions, and (3) it re-classifies sleep stages only for results with low confidence. As a result, MASC achieves more than 95% accuracy for both noisy and noiseless mouse data.
【Keywords】: Sleep; Feature extraction; Mice; Electroencephalography; Electromyography; Algorithm design and analysis; Proposals
【Paper Link】 【Pages】:1497-1504
【Authors】: Yuanyang Zhang ; Richard Jiang ; Linda R. Petzold
【Abstract】: Data mining techniques have been proposed to predict mortality for ICU patients using their demographic data, measurements, and notes from doctors and nurses. Most of these techniques suffer from two main drawbacks. First, they model mortality prediction as a binary classification problem, ignoring the time of death as a continuous value. Second, they use topic models to analyze the notes while ignoring the relationship between measurements, notes, and mortality/discharge outcomes. In this paper we propose a novel model, the survival topic model (SVTM), which models patients' measurements, notes, and mortality/discharge jointly and predicts the probability of mortality/discharge as a function of time. The idea is that each patient has a latent distribution over disease conditions, which we call topics. These conditions generate the measurements and notes and determine the patient's mortality. We derive a mean-field variational inference algorithm for this model. We fitted the SVTM with two outcomes on trauma patient data from the Medical Information Mart for Intensive Care III (MIMIC III), obtained several important topics, and demonstrated the relationships between these topics.
【Keywords】: Mortality Prediction; Survival Analysis; Topic Model
【Paper Link】 【Pages】:1507-1509
【Authors】: Vinicius Oliverio ; Omero Bendicto Poli-Neto
【Abstract】: Chronic pelvic pain is a common clinical condition with negative consequences in many aspects of women's lives. The clinical presentation is heterogeneous, and the involvement of several body systems impairs identification of the exact etiology of the problem. At the same time, high-quality clinical treatment depends on the professional, and the learning process is slow. The goal of this paper is to compare different classification algorithms for diagnosing the probable cause of chronic pelvic pain in these women. A multi-label problem modeling technique and an attribute selection algorithm were used to prepare the problem for the comparison of the algorithms Naive Bayes, C4.5, Support Vector Machines, AdaBoost and KNN.
【Keywords】: Artificial Intelligence; Machine Learning; Chronic Pelvic Pain; Diagnosis; Multi-label
【Paper Link】 【Pages】:1510-1514
【Authors】: Nhathai Phan ; Soon Ae Chun ; Manasi Bhole ; James Geller
【Abstract】: Prescription drug abuse is one of the fastest growing public health problems in the USA. To address this epidemic, a near real-time monitoring strategy, instead of one resorting to retrospective health records, may improve detection of the prevalence and patterns of abuse of both illegal drugs and prescription medications. In this paper, our primary goal is to demonstrate the possibility of utilizing social media, e.g., Twitter, for automatic monitoring of illegal drug and prescription medication abuse. We use machine learning methods for automatic classification to identify tweets that are indicative of drug abuse. We collected tweets associated with well-known illegal and prescription drugs and manually annotated 300 tweets that are likely to be related to drug abuse. Our experiments compare a set of classification algorithms; the decision tree classifier J48 and the SVM outperform the others in determining whether tweets contain signals of drug abuse. The results of this automatic supervised classification study illustrate the utility of Twitter in examining patterns of abuse, and show the feasibility of building a drug abuse detection system that can process large volumes of data from social media sources in near real-time.
【Keywords】: illegal drug; prescription drug; drug abuse; social media; online social network
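An illustrative version of such a supervised pipeline is sketched below: TF-IDF features over tweets, then an SVM and a decision tree classifier, echoing the two best-performing families reported above. The tweets, labels, and parameters are placeholders, not the paper's annotated corpus or exact setup.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.tree import DecisionTreeClassifier

    # Tiny invented training set: 1 = indicative of drug abuse, 0 = not.
    tweets = [
        "took way too many of these pills to get high",
        "popping xanax all night feeling great",
        "picked up my prescription at the pharmacy today",
        "doctor adjusted my medication dose this morning",
    ]
    labels = [1, 1, 0, 0]

    vec = TfidfVectorizer(ngram_range=(1, 2))
    X = vec.fit_transform(tweets)

    svm = LinearSVC().fit(X, labels)
    tree = DecisionTreeClassifier(random_state=0).fit(X, labels)

    test = vec.transform(["need more pills to get high tonight"])
    print("SVM:", svm.predict(test), "Tree:", tree.predict(test))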
【Paper Link】 【Pages】:1515-1522
【Authors】: Jing (Melody) Yao
【Abstract】: Objective: This study aimed to examine the association between maternal smoking during pregnancy and ADHD in children.
【Keywords】: Pregnancy; Logistics; Pediatrics; Diseases; Medical diagnostic imaging; Artificial neural networks; Conferences
【Paper Link】 【Pages】:1525-1532
【Authors】: Vineet K. Raghu ; Xiaoyu Ge ; Panos K. Chrysanthis ; Panayiotis V. Benos
【Abstract】: The exponential growth of high-dimensional biological data has led to a rapid increase in demand for automated approaches to knowledge production. Existing methods rely on two general approaches to address this challenge: 1) the theory-driven approach, which utilizes prior accumulated knowledge, and 2) the data-driven approach, which utilizes only the data to deduce scientific knowledge. Each of these approaches alone suffers from bias toward past/present knowledge, as it fails to incorporate all of the current knowledge that is available to make new discoveries. In this paper, we show how an integrated method can effectively address the high dimensionality of big biological data, which is a major problem for purely data-driven analysis approaches. We realize our approach in a novel two-step analytical workflow that incorporates a new feature selection paradigm as the first step for handling high-throughput gene expression data and utilizes graphical causal modeling as the second step to automatically extract causal relationships. Our results on real-world clinical datasets from The Cancer Genome Atlas (TCGA) demonstrate that our method is capable of intelligently selecting genes for learning effective causal networks.
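【Illustration】: A compact two-step workflow of this general shape could be sketched as follows; the synthetic expression matrix, the univariate selector, and the use of the graphical lasso as a stand-in for the paper's graphical causal modeling step are all assumptions.
```python
# Step 1: univariate feature selection on high-dimensional "expression" data.
# Step 2: learn a sparse graphical model over the selected genes (stand-in for
# causal discovery; edges are nonzero partial correlations).
import numpy as np
from sklearn.covariance import GraphicalLassoCV
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1000))                 # placeholder expression matrix
y = (X[:, :3].sum(axis=1) > 0).astype(int)       # placeholder clinical outcome

X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)   # keep 20 genes
model = GraphicalLassoCV().fit(X_sel)
edges = np.abs(model.precision_) > 1e-3          # adjacency from the precision matrix
print(int(edges.sum()))
```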
【Keywords】: Gene expression; Databases; Diseases; Feature extraction; Genomics; Bioinformatics
【Paper Link】 【Pages】:1533-1540
【Authors】: Luca Bonomi ; Xiaoqian Jiang
【Abstract】: The study of patients in Intensive Care Units (ICUs) is a crucial task in critical care research with significant implications both for identifying clinical risk factors and for defining institutional guidelines. The mortality study of ICU patients is of particular interest because it provides useful indications to healthcare institutions for improving patient experience, internal policies, and procedures (e.g., allocation of resources). To this end, many research works have focused on the length of stay (LOS) of ICU patients as a feature for studying mortality. In this work, we propose a novel mortality study based on the notion of burstiness, in which the temporal information in patients' longitudinal data is taken into consideration. Burstiness is a popular measure in network analysis and time-series anomaly detection, where high values indicate the presence of rapidly occurring events in short time periods (i.e., bursts). Our intuition is that these bursts may relate to possible complications in the patient's medical condition and hence provide indications about mortality. Compared to the LOS, the burstiness parameter captures the temporality of medical events, providing information about the overall dynamics of the patient's condition. To the best of our knowledge, we are the first to apply the burstiness measure in the clinical research domain. Our preliminary results on a real dataset show that patients with high values of burstiness tend to have a higher mortality rate than patients with more regular medical events. Overall, our study shows promising results and provides useful insights for developing predictive models on temporal data and advancing modern critical care medicine.
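【Illustration】: The burstiness coefficient commonly used in network analysis is B = (sigma - mu) / (sigma + mu) over the inter-event times of a sequence; whether the paper uses exactly this definition is an assumption, but the computation itself is a few lines.
```python
# Burstiness B in [-1, 1]: about -1 for perfectly periodic events,
# near 0 for Poisson-like events, approaching 1 for highly bursty events.
import numpy as np

def burstiness(event_times):
    gaps = np.diff(np.sort(np.asarray(event_times, dtype=float)))  # inter-event times
    mu, sigma = gaps.mean(), gaps.std()
    return (sigma - mu) / (sigma + mu)

print(burstiness(range(0, 100, 5)))              # evenly spaced events -> -1.0
print(burstiness([0, 1, 2, 3, 50, 51, 52, 99]))  # clustered events -> positive
```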
【Keywords】: Medical diagnostic imaging; MIMICs; Hospitals; Diseases; Diabetes; Correlation; Biological system modeling
【Paper Link】 【Pages】:1541-1548
【Authors】: Diogo Ferreira Pacheco ; Diego Pinheiro ; Martin Cadeiras ; Ronaldo Menezes
【Abstract】: Approximately 22 people die every day in the USA due to a lack of organs for transplant. Research suggests that the most effective solution is to increase organ donor rates; current proposals range from expanding the donor eligibility criteria (donor pool) to performing mass media campaigns. However, little is known about the extent to which activity on social media is associated with aspects (e.g., awareness) of organ donation. Our hypothesis is that social media can be utilized as a sensor to characterize organ donation awareness and population engagement in donation for each organ. To this end, we collected Twitter messages (tweets) regarding organ donation and characterized awareness of each organ by aggregating tweets from users who mostly mentioned that organ. Similarly, we assessed the relative risk between the cumulative incidence of organ-related conversations inside and outside geographical regions to characterize those regions with respect to organ donation awareness. Our characterization suggests that organ-related conversations on social media are indeed associated with aspects of organ donation such as the co-occurrence of organ transplantations. We also found variations in the specific organs that are prominently discussed in each geographical region, and such variations seem to be associated with aspects of organ donation in that region, for instance, the unusually large amount of conversation about kidneys in Kansas. Our findings suggest that the proposed approach has the potential to characterize awareness of organ donation in real time.
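【Illustration】: The relative-risk comparison described above reduces to a simple ratio of incidences inside versus outside a region; the counts below are invented placeholders, not figures from the paper.
```python
# Relative risk of organ-related tweets inside a region vs. outside it.
def relative_risk(cases_in, total_in, cases_out, total_out):
    risk_in = cases_in / total_in      # incidence inside the region
    risk_out = cases_out / total_out   # incidence outside the region
    return risk_in / risk_out

# e.g., kidney-related tweets in one state vs. the rest of the USA (toy numbers)
print(relative_risk(cases_in=120, total_in=10_000,
                    cases_out=4_000, total_out=1_000_000))   # -> 3.0
```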
【Keywords】: Twitter; Context; Kidney; Sociology; Statistics
【Paper Link】 【Pages】:1553-1558
【Authors】: Thu-Le Pham ; Alessandra Mileo ; Muhammad Intizar Ali
【Abstract】: Stream reasoning is an emerging research area focused on providing continuous reasoning solutions for data streams. The high expressiveness of non-monotonic reasoning enables complex decision making by managing defaults, commonsense, preferences, recursion, and non-determinism, but it is computationally intensive. The exponential growth in the availability of streaming data on the Web has seriously hindered the applicability of state-of-the-art non-monotonic reasoners to streaming information in a scalable way. In this paper, we address the issue of scalability for non-monotonic stream reasoning based on Answer Set Programming (ASP) - an expressive reasoning approach based on disjunctive Datalog with negation under the stable model semantics - by analyzing input dependencies. We introduce an input dependency graph to represent the relationships between input events based on the structure of a given logical rule set. The input dependency graph allows us to dynamically configure the streaming window size in order to maximise the scalability of the non-monotonic reasoner. We conduct an experimental evaluation to demonstrate the effectiveness of our proposed approach in improving the scalability of disjunctive logic programming with ASP in dynamic environments.
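【Illustration】: One plausible reading of an input dependency graph is sketched below: body predicates of each rule are linked to the rule's head, and the longest derivation chain from stream predicates informs the window size. The rule encoding and the "longest chain times a fixed step" heuristic are assumptions made here, not the paper's construction.
```python
# Build a dependency graph from (head, body) rules and derive a window size
# from the longest derivation chain.
import networkx as nx

rules = [
    ("congested", ["highTraffic", "accident"]),   # head predicate, body predicates
    ("reroute",   ["congested", "roadClosed"]),
]

g = nx.DiGraph()
for head, body in rules:
    for b in body:
        g.add_edge(b, head)                       # body predicate -> head predicate

depth = nx.dag_longest_path_length(g)             # longest derivation chain (in edges)
window_size = depth * 10                           # assume 10 time units per step
print(depth, window_size)
```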
【Keywords】: Cognition; Automobiles; Scalability; Data analysis; Urban areas; Electronic mail
【Paper Link】 【Pages】:1559-1562
【Authors】: Marios Meimaris ; George Papastefanatos
【Abstract】: SPARQL query optimization relies on the design and execution of query plans that reorder triple patterns in the hope of minimizing the cardinality of intermediate results. In practice, this is not always effective: many existing systems succeed on certain types of query patterns and fail on others. This kind of trade-off is often a consequence of the algorithms behind query planning. In this paper, we introduce a novel join reordering approach that translates a query into a multidimensional vector space and performs distance-based optimization, taking into account the relative differences between the triple patterns. Preliminary experiments on synthetic data show that our algorithm consistently outperforms established methodologies, providing better plans for many different types of query patterns.
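【Illustration】: A very loose reading of distance-based reordering is sketched below: each triple pattern is mapped to a feature vector and patterns are greedily chained by Euclidean distance. The feature vectors (invented per-position cardinality estimates) and the greedy chaining are assumptions for illustration; the paper's actual vector-space construction is not reproduced here.
```python
# Greedy nearest-neighbor ordering of triple patterns in a toy vector space.
import numpy as np

patterns = {                                     # triple pattern -> assumed feature vector
    "?s :type :Person":      np.array([1e5, 1.0, 1.0]),
    "?s :worksAt ?o":        np.array([5e3, 1.0, 2e2]),
    "?o :locatedIn :Berlin": np.array([2e2, 1.0, 1.0]),
}

names = list(patterns)
order = [min(names, key=lambda n: np.linalg.norm(patterns[n]))]  # start from the "smallest"
while len(order) < len(names):
    last = patterns[order[-1]]
    remaining = [n for n in names if n not in order]
    order.append(min(remaining, key=lambda n: np.linalg.norm(patterns[n] - last)))
print(order)
```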
【Keywords】: Query processing; Optimization; Heuristic algorithms; Resource description framework; Planning; Estimation; Indexes
【Paper Link】 【Pages】:1563-1565
【Authors】: Sutanay Choudhury ; Khushbu Agarwal ; Sumit Purohit ; Baichuan Zhang ; Meg Pirrung ; William Smith ; Mathew Thomas
【Abstract】: The ability to construct domain-specific knowledge graphs (KGs) and perform question answering or hypothesis generation is a transformative capability. Despite their value, automated construction of knowledge graphs remains an expensive technical challenge that is beyond the reach of most enterprises and academic institutions. We propose an end-to-end framework for developing custom knowledge-graph-driven analytics for arbitrary application domains. The uniqueness of our system lies in A) its combination of curated KGs with knowledge extracted from unstructured text, B) support for advanced trending and explanatory questions on a dynamic KG, and C) the ability to answer queries where the answer is embedded across multiple data sources.
【Keywords】: Data mining; Knowledge engineering; Heuristic algorithms; Drones; Sparks; Knowledge based systems; Information retrieval
【Paper Link】 【Pages】:1569-1574
【Authors】: Siti Aminah ; Iis Afriyanti ; Adila Krisnadhi
【Abstract】: Academic evaluation is an important activity regularly conducted by every higher education institution to gauge the performance of its education services and to help the institution's administrators improve the quality of those services. Implementing academic evaluation almost always requires integrating relevant data, which are often scattered across separate systems. Such an integration effort is often tedious and time consuming because the data must be gathered from different systems, manually integrated, and then presented in a format that meets pre-determined academic evaluation criteria. One way to help reduce this effort is to make the data available in a format that enables linking between any parts of the data and allows various academic evaluation queries to be performed on them. Our work intends to realise this idea by using Semantic Web technologies, in particular ontologies and linked data. Specifically, our work is situated in a concrete use case based on the academic evaluation process periodically conducted at Universitas Indonesia (UI). In this paper, we focus on the first step of this effort, namely the development of an ontology for academic evaluation, with a particular emphasis on the evaluation of undergraduate degree programs at UI. There have been a number of prior works focusing on developing an ontology for a similar problem, yet none of them takes into account the fact that academic data evolve over time. Our ontology accounts for this aspect according to UI's academic evaluation criteria. We also demonstrate that the ontology can be used as a basis for answering SPARQL queries that capture those criteria, which indicates the suitability and usability of the ontology.
【Keywords】: ontology; data integration; semantic web; academic evaluation; accreditation
【Paper Link】 【Pages】:1575-1578
【Authors】: Michael N. Gubanov
【Abstract】: The big data era brought us petabytes of data, together with the challenges of storing and efficiently accessing large-scale datasets. However, it also surprised everyone with an enormous variety of data sources and types, and correspondingly many different data models. Data integration, a mature field addressing the problems of accessing and fusing data residing in more than one data source, has over the years produced feasible semi-automatic solutions, most of which efficiently handle a handful of data sources represented in one or two different data models. Most of these solutions do not easily scale up, since they usually require some sort of human assistance, which is infeasible at scale. One earlier step in this direction was the Unified Famous Object (UFO): having trained many such UFOs, the system would get more and more powerful, as it learned to recognize and access data objects across many sources without supervision. While UFO was definite progress towards scaling up data fusion, it was not a "silver bullet", because it was built mostly for relational data. Hence, there is a need for a new large-scale data integration system that can handle many types of data at scale. This paper sketches its architecture and envisions potential research challenges. First, a few key principles that such a system should follow are described, followed by a discussion of the architecture and research avenues.
【Keywords】: Data integration; Metadata; Information retrieval; Engines; Database languages; Data mining
【Paper Link】 【Pages】:1579-1581
【Authors】: Kostas Stefanidis ; Haridimos Kondylakis ; Georgia Troullinou
【Abstract】: As knowledge bases are constantly evolving, there is a clear need to monitor and analyze the changes that occur in them. Traditional approaches for studying the evolution of data focus on providing humans with deltas that contain large amounts of information. In this work, we envision a processing model that recommends evolution measures while taking into account particular challenges, such as relatedness, transparency, diversity, fairness, and anonymity. We aim to support humans with complementary measures that offer high-level overviews of the changes, to help them understand how data of interest evolve.
【Keywords】: Knowledge based systems; Semantics; Synchronization; Data models; Social network services; Sensors
【Paper Link】 【Pages】:1587
【Authors】: Andre Putnam
【Abstract】: The Catapult project has brought the power and performance of FPGA-based reconfigurable computing to Microsoft’s hyperscale datacenters, accelerating major production cloud applications such as Bing web search and Microsoft Azure, and enabling a new generation of machine learning and artificial intelligence applications. Catapult is now deployed in nearly every new server across the more than a million machines that make up the Microsoft hyperscale cloud.
【Keywords】: Cloud computing; Acceleration; Field programmable gate arrays; Servers; Computer architecture; Hardware; Production
【Paper Link】 【Pages】:1591-1598
【Authors】: Jianting Zhang ; Simin You ; Le Gruenwald
【Abstract】: Processing large-scale data is typically memory intensive. The current generation of Graphics Processing Units (GPUs) has much lower memory capacity than CPUs, which is often a limiting factor in processing large-scale data on GPUs. It is therefore desirable to reduce the memory footprint of spatially joining large-scale datasets through query optimization. In this study, we present a parallel selectivity estimation technique for optimizing spatial join processing on GPUs. By integrating a multi-dimensional cumulative histogram structure and the summed-area-table algorithm, our data-parallel selectivity estimation technique can be realized efficiently on GPUs. Experiments on spatially joining two sets of Minimum Bounding Boxes (MBBs) derived from real point and polygon data, each with about one million MBBs, show that selectivity estimation at four grid levels took less than 1/3 of a second on an Nvidia GTX Titan GPU. By using the best grid resolution, our technique saves 38.4% of the memory footprint for the spatial join.
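【Illustration】: The summed-area-table idea behind such an estimator can be shown on the CPU with a 2D histogram: once the cumulative table is built, the number of items falling in any query rectangle needs only four lookups. The grid size and centroid-based binning below are assumptions; the paper's GPU kernels and multi-level grids are not reproduced.
```python
# Build a 2D histogram of (placeholder) MBB centroids, turn it into a
# summed-area table, and answer a box-count query with four table lookups.
import numpy as np

rng = np.random.default_rng(0)
pts = rng.random((1_000_000, 2))                     # placeholder centroids in [0,1)^2
hist, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=256, range=[[0, 1], [0, 1]])
sat = hist.cumsum(axis=0).cumsum(axis=1)             # summed-area (cumulative) table

def box_count(sat, x0, y0, x1, y1):
    """Count of points whose cells fall in [x0, x1) x [y0, y1), in cell indices."""
    total = sat[x1 - 1, y1 - 1]
    left  = sat[x0 - 1, y1 - 1] if x0 > 0 else 0
    below = sat[x1 - 1, y0 - 1] if y0 > 0 else 0
    both  = sat[x0 - 1, y0 - 1] if x0 > 0 and y0 > 0 else 0
    return total - left - below + both

print(box_count(sat, 64, 64, 192, 192))   # estimate for the central quarter (~250,000)
```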
【Keywords】: Selectivity Estimation; Spatial Join; GPU; Parallel Design; Cumulative Histogram
【Paper Link】 【Pages】:1599-1606
【Authors】: Marcus Pinnecke ; David Broneske ; Gabriel Campero Durand ; Gunter Saake
【Abstract】: Employing special-purpose processors (e.g., GPUs) in database systems has been studied throughout the last decade. Research on heterogeneous database systems that use both general- and special-purpose processors has addressed either transaction or analytic processing, but not their combination. Support for hybrid transaction and analytic processing (HTAP) has been studied exclusively for CPU-only systems. In this paper we ask whether current systems are ready for HTAP workload management with cooperating general- and special-purpose processors. For this, we take the perspective of the backbone of database systems: the storage engine. We propose a unified terminology and a comprehensive taxonomy to compare state-of-the-art engines from both domains. We show similarities and differences, and determine a necessary set of features for engines supporting HTAP workloads on CPUs and GPUs. Answering our research question, our findings yield a resolute answer: not yet.
【Keywords】: Database systems; Engines; Layout; Graphics processing units; Optimization
【Paper Link】 【Pages】:1609
【Authors】: Bingsheng He
【Abstract】: Hardware development is a major driver for the development of data management systems. For example, due to the increased capacity and low cost of main memory, data management systems have shifted from disk-based to in-memory designs [1]. Recently, databases on emerging hardware have attracted much attention in both academia and industry. A significant amount of research has been devoted to the design and implementation of high-performance, hardware-conscious systems. We can summarize this work into three main lines of research, organized mainly according to processor architectures.
【Keywords】: Hardware; Computer architecture; Databases; Software; Nonvolatile memory; Random access memory; Concurrency control
【Paper Link】 【Pages】:1610-1611
【Authors】: Stratis D. Viglas
【Abstract】: We present the results of our work on integrating database and programming language runtimes through code generation and extensive just-in-time adaptation. Our techniques deliver significant performance improvements over non-integrated solutions. Our work takes important first steps towards a future where data processing applications will commonly run on machines that can store their datasets entirely in persistent memory, and will be written in a single programming language employing higher-level APIs and language-integrated query.
【Keywords】: Runtime; Computer languages; Heuristic algorithms; Data processing; Data models; Sorting; Database systems
【Paper Link】 【Pages】:1612
【Authors】: Xiaodong Zhang
【Abstract】: The research and development of today's data management systems, for both transaction processing and data analytics, is driven by an inevitable technology trend, namely in-memory computing, and by an increasingly high demand for online applications serving billions of concurrent users. This requires that a data management system be able to provide both high-throughput transactions and fast data analytics execution. GPUs can play an important role in data management systems for both high-throughput transactions and high-performance data analytics. However, several fundamental challenges must be addressed before GPUs can be integrated into data management production systems. First, the low-bandwidth PCIe connection between CPU host memory and GPU device memory is a bottleneck that prevents GPUs from achieving their full potential for high-performance data analytics. Second, the mismatch between limited GPU programming capabilities and complex SQL query structures makes it hard to utilize GPUs to execute certain practically important queries, including nested sub-queries and recursive queries. Finally, system support for resource management on GPUs, such as memory management and job scheduling, is very limited; it is unable to coordinate concurrent query executions, causing severe underutilization or overutilization of GPUs. In this presentation, I will first make a strong case for GPUs to serve as special-purpose devices that greatly accelerate the operations of in-memory key-value stores. Specifically, I present the design and implementation of Mega-KV, a GPU-based in-memory key-value store system that achieves high performance and high throughput. Mega-KV can process up to 160+ million key-value operations per second. I will discuss the challenges we face in accepting the GPU as a regular member of our conventional computing ecosystem, along with several proposed solutions.
【Keywords】: Data management; data analytics; Graphics Processing Unit (GPU); Key-value store; in-memory computing
【Paper Link】 【Pages】:1615
【Authors】: Roger Moussalli
【Abstract】: General purpose processors have traditionally been favored over application-specific architectures due to the flexibility they provide, their standardized and simpler programming model, and a significant reduction in development time. Fueled by steady advances in transistor scaling, general purpose CPUs satisfied the performance needs of most applications. While CPUs were becoming ubiquitous, advances in digital storage technologies and sensing devices (cameras, microphones, etc.) led to massive and sky-rocketing amounts of data being generated by devices of all scales. Extracting insights from this Big Data introduces significant opportunities for business intelligence, though data volumes and the complexity of query patterns have been increasing at a startling rate. With Moore's law ending, transistor shrinking coming to a halt, and CPU performance saturating, accelerator technologies are increasingly embraced to augment general purpose CPUs and to address performance concerns. Accelerators diverge from traditional CPU architectures in the way they utilize the available silicon resources. In particular, accelerators maximize the resources available for raw computing (ALUs, floating point units) and push the burden of correct program semantics and control to higher levels of the stack, including the compiler and programming models, while focusing on a selected subset of applications. Accelerators include Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), and General Purpose Graphics Processing Units (GPGPUs), each with its own programming model, advantages, and challenges. FPGAs enable the deployment of deep custom pipelines, whereas GPGPUs provide hundreds of small processors executing in a massively parallel fashion. Compared to CPUs, accelerators attain higher performance out of the available transistors for a wide range of applications. This talk covers tradeoffs of accelerators (FPGA and GPGPU) specific to a set of database applications, namely XML filtering, spatiotemporal analytics in the context of the Internet of Things, and relational database querying. Tradeoff metrics include programmability, performance, accuracy, and energy consumption. While accelerators achieve high speedups for "hot" code paths, the attach point of accelerators in a system significantly impacts end-to-end application performance. As such, system and deployment-level considerations must be made. To this end, I will go over IBM's efforts to facilitate the inclusion and increase the adoption of accelerators. These include (1) the Coherent Accelerator Processor Interface (CAPI), reducing software refactoring on the CPU side as well as CPU-accelerator latency, (2) the ConTutto research platform for acceleration innovation in the memory subsystem, providing very high bandwidth to accelerators, and (3) NVLink®-enabled IBM POWER® processors.
【Keywords】: accelerator; acceleration; hardware; FPGA; GPU; database; ConTutto; CAPI; NVLink
【Paper Link】 【Pages】:1616
【Authors】: Evangelia A. Sitaridi
【Abstract】: It is known that the use of GPUs yields significant speed-ups for database processing, and the increased availability of GPUs in the cloud opens new acceleration possibilities. To harness the power of GPUs, we adapt and redesign data analytics operators to better exploit their special memory and threading models. Due to increasing memory capacities and users' need for fast interaction with data, as in the case of visual analytics, we focus on in-memory data processing. We identify two key applications for hardware acceleration because of their impact on end-to-end system performance: a) substring matching and b) lossless data compression. Our techniques cover data preprocessing and algorithmic operator optimization, taking GPU control flow into consideration. We also discuss to what extent our techniques can be adapted for CPUs with SIMD extensions.
【Keywords】: Graphics processing units; Databases; Instruction sets; Hardware; Acceleration; Message systems