36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20-24, 2020. IEEE 【DBLP Link】
【Paper Link】 【Pages】:1-12
【Authors】: Yurong Cheng ; Boyang Li ; Xiangmin Zhou ; Ye Yuan ; Guoren Wang ; Lei Chen
【Abstract】: With the development of mobile communication techniques, spatial crowdsourcing has recently become popular. A typical topic in spatial crowdsourcing is task assignment, which assigns crowd workers to users' requests in real time and maximizes the total revenue. However, it is common that the available crowd workers on a platform are too far away to serve the requests, so some user requests may be rejected or served at high cost after a long wait. Fortunately, the neighbors of a platform usually have available resources for the same services. Collaboratively conducting task allocation among different platforms can greatly improve the quality of service, but this has not been investigated yet. In this paper, we propose Cross Online Matching (COM), which enables a platform to "borrow" unoccupied crowd workers from other platforms to complete user requests. We propose two algorithms for COM: deterministic cross online matching (DemCOM) and randomized cross online matching (RamCom). DemCOM greedily maximizes the obtained revenue, while RamCom considers the trade-off between the obtained revenue and the probability of a request being accepted by the borrowed workers. Extensive experimental results verify the effectiveness and efficiency of our algorithms.
【Keywords】: Crowdsourcing; Automobiles; Task analysis; Two dimensional displays; Real-time systems; Public transportation; Mobile communication
【Paper Link】 【Pages】:13-24
【Authors】: Yan Zhao ; Kai Zheng ; Yue Cui ; Han Su ; Feida Zhu ; Xiaofang Zhou
【Abstract】: With the rapid development of mobile networks and the widespread usage of mobile devices, spatial crowdsourcing, which refers to assigning location-based tasks to moving workers, has drawn increasing attention. One of the major issues in spatial crowdsourcing is task assignment, which allocates tasks to appropriate workers. However, existing works generally assume static offline scenarios, where the spatio-temporal information of all workers and tasks is determined and known a priori. Ignoring the dynamic spatio-temporal distributions of workers and tasks can often lead to poor assignment results. In this work we study a novel spatial crowdsourcing problem, namely Predictive Task Assignment (PTA), which aims to maximize the number of assigned tasks by taking into account both current and future workers/tasks that enter the system dynamically with locations unknown in advance. We propose a two-phase data-driven framework. The prediction phase combines different learning models to predict the locations and routes of future workers, and designs a graph embedding approach to estimate the distribution of future tasks. In the assignment phase, we propose both a greedy algorithm for large-scale applications and an optimal algorithm with graph-partition-based decomposition. Extensive experiments on two real datasets demonstrate the effectiveness of our framework.
【Keywords】: prediction; task assignment; spatial crowdsourcing
【Paper Link】 【Pages】:25-36
【Authors】: Chi Harold Liu ; Yinuo Zhao ; Zipeng Dai ; Ye Yuan ; Guoren Wang ; Dapeng Wu ; Kin K. Leung
【Abstract】: Spatial crowdsourcing (SC) utilizes the potential of a crowd to accomplish certain location-based tasks. Although worker scheduling has been well studied recently, most existing works only focus on the static deployment of workers but ignore their temporal movement continuity. In this paper, we explicitly consider the use of unmanned vehicular workers, e.g., drones and driverless cars, which are more controllable and can be deployed in remote or dangerous areas to carry out long-term and harsh tasks as a vehicular crowdsourcing (VC) campaign. We propose a novel deep reinforcement learning (DRL) approach for curiosity-driven energy-efficient worker scheduling, called "DRL-CEWS", to achieve an optimal trade-off between maximizing the collected amount of data and coverage fairness, and minimizing the overall energy consumption of workers. Specifically, we first utilize a chief-employee distributed computational architecture to stabilize and facilitate the training process. Then, we propose a spatial curiosity model with a sparse reward mechanism to help derive the optimal policy in a large crowdsensing space with unevenly distributed data. Extensive simulation results show that DRL-CEWS outperforms the state-of-the-art methods and baselines; we also visualize the benefits the curiosity model brings and show the impact of two hyperparameters.
【Keywords】: Vehicular crowdsourcing; deep reinforcement learning; worker scheduling; curiosity model
【Paper Link】 【Pages】:37-48
【Authors】: JiaCheng Huang ; Wei Hu ; Zhifeng Bao ; Yuzhong Qu
【Abstract】: Knowledge bases (KBs) store rich yet heterogeneous entities and facts. Entity resolution (ER) aims to identify entities in KBs which refer to the same real-world object. Recent studies have shown significant benefits of involving humans in the loop of ER. These methods often resolve entities with pairwise similarity measures over attribute values and resort to the crowd to label uncertain ones. However, existing methods still suffer from high labor costs and insufficient labeling to some extent. In this paper, we propose a novel approach called crowdsourced collective ER, which leverages the relationships between entities to infer matches jointly rather than independently. Specifically, it iteratively asks human workers to label picked entity pairs and propagates the labeling information to their neighbors within a distance. During this process, we address the problems of candidate entity pruning, probabilistic propagation, optimal question selection and error-tolerant truth inference. Our experiments on real-world datasets demonstrate that, compared with state-of-the-art methods, our approach achieves superior accuracy with much less labeling.
【Keywords】: Erbium; Labeling; Probabilistic logic; Inference algorithms; Task analysis; Object recognition; Forestry
【Paper Link】 【Pages】:49-60
【Authors】: Caihua Shan ; Nikos Mamoulis ; Reynold Cheng ; Guoliang Li ; Xiang Li ; Yuqiu Qian
【Abstract】: In this paper, we propose a Deep Reinforcement Learning (RL) framework for task arrangement, which is a critical problem for the success of crowdsourcing platforms. Previous works conduct the personalized recommendation of tasks to workers via supervised learning methods. However, the majority of them only consider the benefit of either workers or requesters independently. In addition, they do not consider the real dynamic environments (e.g., dynamic tasks, dynamic workers), so they may produce sub-optimal results. To address these issues, we utilize Deep Q-Network (DQN), an RL-based method combined with a neural network to estimate the expected long-term return of recommending a task. DQN inherently considers the immediate and the future rewards and can be updated quickly to deal with evolving data and dynamic changes. Furthermore, we design two DQNs that capture the benefit of both workers and requesters and maximize the profit of the platform. To learn value functions in DQN effectively, we also propose novel state representations, carefully design the computation of Q values, and predict transition probabilities and future states. Experiments on synthetic and real datasets demonstrate the superior performance of our framework.
【Keywords】: crowdsourcing platform; task arrangement; reinforcement learning; deep Q-Network
【Paper Link】 【Pages】:61-72
【Authors】: Yifeng Jin ; Lin Zhu ; Zijing Tan
【Abstract】: Bidirectional order dependencies state relationships of order between lists of attributes. They naturally model the order-by clauses in SQL queries, and have proved effective in query optimizations concerning sorting. Despite their importance, order dependencies in a dataset are typically unknown and are too costly, if not impossible, to design or discover manually. Techniques for automatic order dependency discovery have recently been studied. It is challenging for order dependency discovery to scale well, since it is by nature factorial in the number m of attributes and quadratic in the number n of tuples. In this paper, we adopt a strategy that decouples the impact of m from that of n, and that still finds all minimal valid bidirectional order dependencies. We present carefully designed data structures, a host of algorithms and optimizations, for efficient order dependency discovery. With extensive experimental studies on both real-life and synthetic datasets, we verify that our approach significantly outperforms state-of-the-art techniques, by orders of magnitude.
【Keywords】: Algorithms; Data profiling; Data dependency
【Paper Link】 【Pages】:73-84
【Authors】: Wentao Zhang ; Jiawei Jiang ; Yingxia Shao ; Bin Cui
【Abstract】: The ensemble of deep neural networks has been shown, both theoretically and empirically, to improve generalization accuracy on the unseen test set. However, the high training cost hinders its efficiency, since we need a sufficient number of base models and each one in the ensemble has to be trained separately. Many methods have been proposed to tackle this problem, and most of them exploit the fact that a pre-trained network can transfer its knowledge to the next base model and thereby accelerate the training process. However, these methods suffer from a severe problem: they transfer knowledge without selection and thus lead to low diversity. As the effect of ensemble learning is more pronounced if ensemble members are accurate and diverse, we propose a method named Efficient Diversity-Driven Ensemble (EDDE) to address both the diversity and the efficiency of an ensemble. To accelerate the training process, we propose a novel knowledge transfer method which can selectively transfer the previous generic knowledge. To enhance diversity, we first propose a new diversity measure, then use it to define a diversity-driven loss function for optimization. Finally, we adopt a Boosting-based framework to combine the above operations, which can further improve diversity. We evaluate EDDE on Computer Vision (CV) and Natural Language Processing (NLP) tasks. Compared with other well-known ensemble methods, EDDE achieves the highest ensemble accuracy with the lowest training cost, which means it is efficient for ensembles of neural networks.
【Keywords】: deep neural networks; ensemble learning; knowledge transfer; diversity; efficient
【Paper Link】 【Pages】:85-96
【Authors】: Thanh Trung Huynh ; Van Vinh Tong ; Thanh Tam Nguyen ; Hongzhi Yin ; Matthias Weidlich ; Nguyen Quoc Viet Hung
【Abstract】: Network alignment is the problem of pairing nodes between two graphs such that the paired nodes are structurally and semantically similar. A well-known application of network alignment is to identify which accounts in different social networks belong to the same person. Existing alignment techniques, however, lack scalability, cannot incorporate multi-dimensional information without training data, and are limited in the consistency constraints enforced by an alignment. In this paper, we propose a fully unsupervised network alignment framework based on a multi-order embedding model. The model learns the embeddings of each node using a graph convolutional neural representation, which we prove to satisfy consistency constraints. We further design a data augmentation method and a refinement mechanism to make the model adaptive to consistency violations and noise. Extensive experiments on real and synthetic datasets show that our model outperforms state-of-the-art alignment techniques. We also demonstrate the robustness of our model against adversarial conditions, such as structural noises, attribute noises, graph size imbalance, and hyper-parameter sensitivity.
【Keywords】: network alignment; network embedding; GCN
【Paper Link】 【Pages】:97-108
【Authors】: Wenlu Wang ; Yingtao Tian ; Haixun Wang ; Wei-Shinn Ku
【Abstract】: Relational database management systems (RDBMSs) are powerful because they are able to optimize and execute queries against relational databases. However, when it comes to NLIDB (natural language interface for databases), the entire system is often custom-made for a particular database. Overcoming the complexity and expressiveness of natural languages so that a single NLI can support a variety of databases is an unsolved problem. In this work, we show that it is possible to separate data specific components from latent semantic structures in expressing relational queries in a natural language. With the separation, transferring an NLI from one database to another becomes possible. We develop a neural network classifier to detect data specific components and an adversarial mechanism to locate them in a natural language question. We then introduce a general purpose transfer-learnable NLI that focuses on the latent semantic structure. We devise a deep sequence model that translates the latent semantic structure to an SQL query. Experiments show that our approach outperforms previous NLI methods on the WikiSQL [49] dataset, and the model we learned can be applied to other benchmark datasets without retraining.
【Keywords】: Databases; Natural languages; Structured Query Language; Semantics; Metadata; Sociology; Statistics
【Paper Link】 【Pages】:109-120
【Authors】: Olha Horlova ; Abdulrahman Kaitoua ; Stefano Ceri
【Abstract】: With the huge growth of genomic data, exposing multiple heterogeneous features of genomic regions for millions of individuals, we increasingly need to support domain-specific query languages and knowledge extraction operations, capable of aggregating and comparing trillions of regions arbitrarily positioned on the human genome. While row-based models for regions can be effectively used as a basis for cloud-based implementations, in previous work we have shown that the array-based model is effective in supporting the class of region-preserving operations, i.e. operations which do not create any new region but rather compose existing ones. In this paper, we remove the above constraint, and describe an array-based implementation which applies to unrestricted region operations, as required by the Genometric Query Language. Specifically, we define a wide spectrum of operations over datasets which are represented using arrays, and we show that the array-based implementation scales well on Spark, thanks in part to a data representation which also effectively supports machine learning. Our benchmark, which uses an independent, pre-existing collection of queries, shows that in many cases the novel array-based implementation significantly improves the performance of the row-based implementation.
【Keywords】: Genomics; Bioinformatics; Arrays; Metadata; Sparks; Engines
【Paper Link】 【Pages】:121-132
【Authors】: Lei Guo ; Hongzhi Yin ; Qinyong Wang ; Bin Cui ; Zi Huang ; Lizhen Cui
【Abstract】: Group Recommendation (GR) is the task of suggesting relevant items/events for a group of users in online systems, whose major challenge is to aggregate the preferences of group members to infer the decision of a group. Prior group recommendation methods applied predefined static strategies for preference aggregation. However, these static strategies are insufficient to model the complicated decision making process of a group, especially for occasional groups which are formed ad hoc. Compared to the conventional individual recommendation task, GR is rather dynamic, and each group member may contribute differently to the final group decision. Recent works argue that group members should have non-uniform weights in forming the decision of a group, and try to utilize a standard attention mechanism to aggregate the preferences of group members, but they do not model the interaction behavior among group members, and the decision making process is largely unexplored. In this work, we study GR in a more general scenario, namely Occasional Group Recommendation (OGR), and focus on solving the preference aggregation problem and the data sparsity issue of group-item interactions. Instead of exploring new heuristic or vanilla attention-based mechanisms, we propose a new social self-attention based aggregation strategy that directly models the interactions among group members, namely Group Self-Attention (GroupSA). In GroupSA, we treat the group decision making process as multiple voting processes, and develop a stacked social self-attention network to simulate how a group consensus is reached. To overcome the data sparsity issue, we resort to the relatively abundant user-item and user-user interaction data, and enhance the representation of users by two types of aggregation methods. In the training process, we further propose a joint training method to learn the user/item embeddings in the group-item recommendation task and the user-item recommendation task simultaneously.
Finally, we conduct extensive experiments on two real-world datasets. The experimental results demonstrate the superiority of our proposed GroupSA method compared to several state-of-the-art methods in terms of HR and NDCG.
【Keywords】: Task analysis; Decision making; Training; Aggregates; Optimization; Feeds; Recommender systems
【Paper Link】 【Pages】:133-144
【Authors】: Yu Zheng ; Chen Gao ; Xiangnan He ; Yong Li ; Depeng Jin
【Abstract】: In recent years, much research effort on recommendation has been devoted to mining user behaviors, i.e., collaborative filtering, along with the general information which describes users or items, such as textual attributes, categorical demographics, and product images. Price, an important marketing factor that determines whether a user makes the final purchase decision on an item, has surprisingly received relatively little scrutiny. In this work, we aim at developing an effective method to predict user purchase intention with the focus on the price factor in recommender systems. The main difficulties are twofold: 1) the preference and sensitivity of a user on item price are unknown, being only implicitly reflected in the items that the user has purchased, and 2) how the item price affects a user's intention depends largely on the product category; that is, the perception and affordability of a user on item price can vary significantly across categories. For the first difficulty, we propose to model the transitive relationship between user-to-item and item-to-price, taking inspiration from the recently developed Graph Convolution Networks (GCN). The key idea is to propagate the influence of price onto users with items as the bridge, so that the learned user representations become price-aware. For the second difficulty, we further integrate item categories into the propagation process and model the possible pairwise interactions for predicting user-item interactions. We conduct extensive experiments on two real-world datasets, demonstrating the effectiveness of our GCN-based method in learning the price-aware preference of users. Further analysis reveals that modeling price awareness is particularly useful for predicting user preference on items of unexplored categories.
【Keywords】: Price-aware; recommendation; collaborative filtering; price; user preference
【Paper Link】 【Pages】:145-156
【Authors】: Yuanyuan Jin ; Wei Zhang ; Xiangnan He ; Xinyu Wang ; Xiaoling Wang
【Abstract】: Herb recommendation plays a crucial role in the therapeutic process of Traditional Chinese Medicine (TCM), which aims to recommend a set of herbs to treat the symptoms of a patient. While several machine learning methods have been developed for herb recommendation, they are limited to modeling only the interactions between herbs and symptoms, ignoring the intermediate process of syndrome induction. When performing TCM diagnostics, an experienced doctor typically induces syndromes from the patient's symptoms and then suggests herbs based on the induced syndromes. As such, we believe the induction of syndromes, an overall description of the symptoms, is important for herb recommendation and should be properly handled. However, due to the ambiguity and complexity of syndrome induction, most prescriptions lack the explicit ground truth of syndromes. In this paper, we propose a new method that takes the implicit syndrome induction process into account for herb recommendation. Specifically, given a set of symptoms to treat, we aim to generate an overall syndrome representation by effectively fusing the embeddings of all the symptoms in the set, so as to mimic how a doctor induces the syndromes. For symptom embedding learning, we additionally construct a symptom-symptom graph from the input prescriptions to capture the relations (co-occurrence patterns) between symptoms; we then build graph convolution networks (GCNs) on both the symptom-symptom and symptom-herb graphs to learn symptom embeddings. Similarly, we construct a herb-herb graph and build GCNs on both the herb-herb and symptom-herb graphs to learn herb embeddings, which are finally interacted with the syndrome representation to predict the scores of herbs. The advantage of such a Multi-Graph GCN architecture is that more comprehensive representations can be obtained for symptoms and herbs.
We conduct extensive experiments on a public TCM dataset, demonstrating significant improvements over state-of-the-art herb recommendation methods. Further studies justify the effectiveness of our design of syndrome representation and multiple graphs.
【Keywords】: herb recommendation; symptom-herb graph; graph neural network; representation learning
【Paper Link】 【Pages】:157-168
【Authors】: Junshuai Song ; Zhao Li ; Zehong Hu ; Yucheng Wu ; Zhenpeng Li ; Jian Li ; Jun Gao
【Abstract】: Data-driven recommender systems that can help to predict users' preferences are deployed in many real online service platforms. Several studies show that they are vulnerable to data poisoning attacks, and attackers have the ability to mislead the system to perform as they desire. Considering the realistic scenario, where the recommender system is usually a black box for attackers and may deploy complex algorithms, how to learn effective attack strategies on such recommender systems is still an under-explored problem. In this paper, we propose an adaptive data poisoning framework, PoisonRec, which can automatically learn effective attack strategies on various recommender systems with very limited knowledge. PoisonRec leverages a reinforcement learning architecture, in which an attack agent actively injects fake data (user behaviors) into the recommender system, and then improves its attack strategies through reward signals that are available under the strict black-box setting. Specifically, we model the attack behavior trajectory as a Markov Decision Process (MDP) in reinforcement learning. We also design a Biased Complete Binary Tree (BCBT) to reformulate the action space for better attack performance. We adopt 8 widely-used representative recommendation algorithms as our testbeds, and conduct extensive experiments on 4 different real-world datasets. The results show that PoisonRec has the ability to achieve good attack performance on various recommender systems with limited knowledge.
【Keywords】: Data Poisoning Attack; Recommender System; Reinforcement Learning
【Paper Link】 【Pages】:169-180
【Authors】: Kazutoshi Umemoto ; Tova Milo ; Masaru Kitsuregawa
【Abstract】: How can recommender systems help people improve their skills? As a first step toward recommendation for the upskilling of users, this paper addresses the problems of modeling the improvement of user skills and the difficulty of items in action sequences where users select items at different times. We propose a progression model that uses latent variables to learn the monotonically non-decreasing progression of user skills. Once this model is trained with the given sequence data, we leverage it to find a statistical solution to the item difficulty estimation problem, where we assume that users usually select items within their skill capacity. Experiments on five datasets (four from real domains, and one generated synthetically) revealed that (1) our model successfully captured the progression of domain-dependent skills; (2) multi-faceted item features helped to learn better models that aligned well with the ground-truth skill and difficulty levels in the synthetic dataset; (3) the learned models were practically useful to predict items and ratings in action sequences; and (4) exploiting the dependency structure of our skill model for parallel computation made the training process more efficient.
【Keywords】: Hidden Markov models; Predictive models; Data models; Computational modeling; Motion pictures; Recommender systems; Diseases
【Paper Link】 【Pages】:181-192
【Authors】: Chen Zhang ; Fan Zhang ; Wenjie Zhang ; Boge Liu ; Ying Zhang ; Lu Qin ; Xuemin Lin
【Abstract】: In this paper, we propose and study a novel cohesive subgraph model, named (k,p)-core, which is a maximal subgraph where each vertex has at least k neighbours and at least a p fraction of its neighbours in the subgraph. The model is motivated by the finding that each user in a community should have at least a certain fraction p of neighbors inside the community to ensure user engagement, especially for users with large degrees. Meanwhile, the uniform degree constraint k, as applied in the k-core model, guarantees a minimum level of user engagement in a community, and is especially effective for users with small degrees. We propose an O(m) algorithm to compute a (k,p)-core with given k and p, and an O(dm) algorithm to decompose a graph by (k,p)-core, where m is the number of edges in the graph G and d is the degeneracy of G. A space-efficient index is designed for time-optimal (k,p)-core query processing. Novel techniques are proposed for the maintenance of the (k,p)-core index under dynamic graph updates. Extensive experiments on 8 real-life datasets demonstrate that our (k,p)-core model is effective and the algorithms are efficient.
【Keywords】: Indexes; Maintenance engineering; Social network services; Computational modeling; Heuristic algorithms; Fans; Query processing
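The (k,p)-core definition above lends itself to a simple peeling procedure. The following is an illustrative sketch, not the paper's O(m) algorithm or its index, assuming the graph is given as an adjacency dict mapping each vertex to its neighbour list:

```python
from collections import deque

def kp_core(adj, k, p):
    """Peel vertices violating either constraint until a fixpoint:
    keep v only if it retains >= k neighbours and >= p fraction of
    its original neighbours inside the surviving subgraph."""
    deg_G = {v: len(ns) for v, ns in adj.items()}   # degrees in the full graph
    deg_S = dict(deg_G)                             # degrees in the current subgraph
    alive = set(adj)
    bad = lambda v: deg_S[v] < k or deg_S[v] < p * deg_G[v]
    queue = deque(v for v in alive if bad(v))
    while queue:
        v = queue.popleft()
        if v not in alive:
            continue
        alive.discard(v)
        for u in adj[v]:                            # removing v weakens its neighbours
            if u in alive:
                deg_S[u] -= 1
                if bad(u):
                    queue.append(u)
    return alive
```

Each vertex's original degree is fixed once, so the p-fraction constraint is always checked against deg_G(v) while deg_S(v) shrinks as vertices are peeled; the surviving set is the maximal subgraph satisfying both constraints.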
【Paper Link】 【Pages】:193-204
【Authors】: Joana M. F. da Trindade ; Konstantinos Karanasos ; Carlo Curino ; Samuel Madden ; Julian Shun
【Abstract】: Graphs are a natural way to model real-world entities and relationships between them, ranging from social networks to data lineage graphs and biological datasets. Queries over these large graphs often involve expensive sub-graph traversals and complex analytical computations. These real-world graphs are often substantially more structured than a generic vertex-and-edge model would suggest, but this insight has remained mostly unexplored by existing graph engines for graph query optimization purposes. In this work, we leverage structural properties of graphs and queries to automatically derive materialized graph views that can dramatically speed up query evaluation. We present Kaskade, the first graph query optimization framework to exploit materialized graph views for query optimization purposes. Kaskade employs a novel constraint-based view enumeration technique that mines constraints from query workloads and graph schemas, and injects them during view enumeration to significantly reduce the search space of views to be considered. Moreover, it introduces a graph view size estimator to pick the most beneficial views to materialize given a query set and to select the best query evaluation plan given a set of materialized views. We evaluate its performance over real-world graphs, including the provenance graph that we maintain at Microsoft to enable auditing, service analytics, and advanced system optimizations. Our results show that Kaskade substantially reduces the effective graph size and yields significant performance speedups (up to 50X), in some cases making otherwise intractable queries possible.
【Keywords】: Query processing; Engines; Database languages; Computational modeling; Data models; Task analysis; Ciphers
【Paper Link】 【Pages】:205-216
【Authors】: Qi Zhang ; Rong-Hua Li ; Qixuan Yang ; Guoren Wang ; Lu Qin
【Abstract】: The structural diversity of an edge, which is measured by the number of connected components of the edge's ego-network, has recently been recognized as a key metric for analyzing social influence and information diffusion in social networks. Given this, an important problem in social network analysis is to identify the top-k edges that have the highest structural diversities. In this work, we perform the first systematic study of the top-k edge structural diversity search problem on large graphs. Specifically, we first develop a new online search framework with two basic upper-bounding rules to efficiently solve this problem. Then, we propose a new index structure using near-linear space to process top-k edge structural diversity search in near-optimal time. To create such an index structure, we devise an efficient algorithm based on an interesting connection between our problem and the 4-clique enumeration problem. In addition, we also propose efficient index maintenance techniques to handle dynamic graphs. The results of extensive experiments on five large real-life datasets demonstrate the efficiency, scalability, and effectiveness of our algorithms.
【Keywords】: Cultural differences; Indexes; Upper bound; Social network services; Search problems; Heuristic algorithms; Maintenance engineering
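As a concrete reading of the metric above: taking the ego-network of an edge (u,v) to be the subgraph induced by the common neighbours of u and v (an assumption for illustration; the paper's exact construction may differ), the structural diversity of an edge and a brute-force top-k baseline, not the paper's indexed near-optimal search, can be sketched as:

```python
def edge_structural_diversity(adj, u, v):
    """Count connected components in the subgraph induced by the
    common neighbours of u and v, via DFS restricted to that set."""
    common = set(adj[u]) & set(adj[v])
    seen, components = set(), 0
    for start in common:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            w = stack.pop()
            if w in seen:
                continue
            seen.add(w)
            stack.extend(x for x in adj[w] if x in common and x not in seen)
    return components

def top_k_edges(adj, k):
    """Brute-force baseline: score every edge and keep the k highest."""
    edges = {(u, v) for u in adj for v in adj[u] if u < v}
    scored = [(edge_structural_diversity(adj, u, v), (u, v)) for u, v in edges]
    return sorted(scored, reverse=True)[:k]
```

Scoring every edge this way costs far more than the paper's near-optimal indexed approach, which is exactly the gap the index structure and its 4-clique-based construction are designed to close.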
【Paper Link】 【Pages】:217-228
【Authors】: Zhuo Wang ; Chaokun Wang ; Weiping Wang ; Xiaoyan Gu ; Bo Li ; Dan Meng
【Abstract】: Given a network and a set of seeds related to each other, the problem of relation discovery from focusing seeds aims to discover the relations among all the seeds. Due to its wide applications, the task has been well studied in the literature. However, in applications where not all seeds necessarily relate to each other, methods for this task will discover many vertices unrelated to the seeds. To support such applications, a new problem called adaptive relation discovery from focusing seeds (A-RDFS) is proposed and studied in this article: given a network and a set of seeds that may not all be related to each other, discover additional vertices to reveal the relations among the seeds which are related to each other. To solve the A-RDFS problem, a relation-sensitive subgraph structure called the α-relation core is proposed to find vertices related to a subset of the seeds. Thereafter, a metric called relation quality is proposed to measure the quality of discovered relations. The metric is positively correlated with the α value of each discovered α-relation core. Hence, by maximizing the relation quality, a set of α-relation cores with large α values can be discovered, which reveals the relations among the seeds related to each other. Two algorithms are developed to optimize the relation quality. Then, using these algorithms as subroutines, the OPT-A-RDFS method is designed for the A-RDFS problem. Extensive experimental results demonstrate the performance of our methods.
【Keywords】: Relation Discovery; Focusing Seeds; Graph Mining
【Paper Link】 【Pages】:229-240
【Authors】: Peng Lin ; Qi Song ; Yinghui Wu ; Jiaxing Pi
【Abstract】: This paper studies a class of neighborhood constraints to characterize and repair erroneous entity information in multi-relational graph data. (1) We propose a class of constraints called star functional dependencies (StarFDs). Unlike conventional integrity constraints, a StarFD enforces value dependencies conditioned by entities and their relevant neighbors, which are identified by a star pattern that incorporates conjunctive regular path queries. StarFDs achieve a balance between expressiveness and complexity: the validation of StarFDs is tractable, and the satisfiability and implication of StarFDs are NP-complete and coNP-complete, respectively. (2) Given a set of StarFDs Σ and a graph G, the entity repair problem is to compute a minimum repair of G by enforcing Σ with the smallest amount of changes. Although this problem is NP-complete and hard to approximate, we show it is feasible to compute repairs in large graphs. Our approach (a) discriminately detects and resolves errors with optimal, approximable and cost-bounded solutions whenever possible, and (b) incurs a time cost determined by Σ and the size of inconsistencies, for all cases. Using real-world data, we show that StarFD-based techniques effectively identify and repair errors. We also show that our repairing algorithms benefit other tasks such as fact checking.
【Keywords】: data cleaning; knowledge graphs
【Paper Link】 【Pages】:241-252
【Authors】: Xun Yi ; Russell Paulet ; Elisa Bertino ; Fang-Yu Rao
【Abstract】: In this paper we consider the problem where a client wishes to subscribe to some product or service provided by a server while maintaining anonymity. At the same time, the server must be able to authenticate the client as a genuine user and be able to discontinue (or revoke) the client's access if the subscription fees are not paid. Current solutions for this problem are typically constructed using some combination of blind signature or zero-knowledge proof techniques, which do not directly support client revocation (that is, revoking a user before expiry of their secret value). In this paper, we present a solution for this problem based on the broadcast encryption scheme suggested by Boneh et al., by which the server can broadcast a secret to a group of legitimate clients. Our solution allows the registered client to log into the server anonymously and also supports client revocation by the server. Our solution can be used in many applications, such as location-based queries. We formally define a model for our anonymous subscription protocol and prove the security of our solution under this model. In addition, we present experimental results from an implementation of our protocol. These experimental results demonstrate that our protocol is practical.
【Keywords】: Servers; Protocols; Encryption; Public key; Registers
【Paper Link】 【Pages】:253-264
【Authors】: Ningning Cui ; Xiaochun Yang ; Bin Wang ; Jianxin Li ; Guoren Wang
【Abstract】: With the boom in cloud computing, data outsourcing in location-based services is proliferating and has attracted increasing interest from research communities and commercial applications. Nevertheless, since the cloud server is probably both untrusted and malicious, concerns about data security and result integrity have risen sharply. However, little existing work can assure both data security and result integrity in a unified way. In this paper, we study the problem of secure and verifiable k nearest neighbor query (SVkNN). To support SVkNN, we first propose a novel unified structure, called verifiable and secure index (VSI). Based on this, we devise a series of secure protocols to facilitate query processing and develop a compact verification strategy. Given an SVkNN query, our proposed solution not only answers the query efficiently but also guarantees: 1) the privacy of data, queries, results and access patterns is preserved; 2) the correctness and completeness of the results are authenticated without leaking confidentiality. Finally, we provide formal security and complexity analyses, and empirically evaluate the performance and feasibility of our proposed approaches.
【Keywords】: Privacy; Servers; Data privacy; Indexes; Cloud computing; Protocols; Security
【Paper Link】 【Pages】:265-276
【Authors】: Daren Fadolalkarim ; Elisa Bertino ; Asmaa Sallam
【Abstract】: Application programs are a possible source of attacks to databases as attackers might exploit vulnerabilities in a privileged database application. They can perform code injection or code-reuse attacks in order to steal sensitive data. However, as such attacks very often result in changes in the program's behavior, program monitoring techniques represent an effective defense to detect on-going attacks. One such technique is monitoring the library/system calls that the application program issues while running. In this paper, we propose AD-PROM, an Anomaly Detection system that aims at protecting relational database systems against malicious/compromised application PROgraMs aiming at stealing data. AD-PROM tracks calls executed by application programs on data extracted from a database. The system operates in two phases. The first phase statically and dynamically analyzes the behavior of the application in order to build profiles representing the application's normal behavior. AD-PROM analyzes the control and data flow of the application program (i.e., static analysis), and builds a hidden Markov model trained by the program traces (i.e., dynamic analysis). During the second phase, the program execution is monitored in order to detect anomalies that may represent data leakage attempts. We have implemented AD-PROM and carried out experiments to assess its performance. The results showed that our system is highly accurate in detecting changes in the application programs' behaviors and has very low false positive rates.
【Keywords】: Database; insider attacks; anomaly detection; application profile; data leakage
【Paper Link】 【Pages】:277-288
【Authors】: Basit Khurram ; Florian Kerschbaum
【Abstract】: The prevalence of various (and increasingly large) datasets presents the challenging problem of discovering common entities dispersed across disparate datasets. Solutions to the private record linkage problem (PRL) aim to enable such explorations of datasets in a secure manner. A two-party PRL protocol allows two parties to determine for which entities they each possess a record (either an exact matching record or a fuzzy matching record) in their respective datasets - without revealing to one another information about any entities for which they do not both possess records. Although several solutions have been proposed to solve the PRL problem, no current solution offers a fully cryptographic security guarantee while maintaining both high accuracy of output and subquadratic runtime efficiency. To this end, we propose the first known efficient PRL protocol that runs in subquadratic time, provides high accuracy, and guarantees cryptographic security in the semi-honest security model.
【Keywords】: Secure Multiparty Computation; Private Record Linkage; Entity Matching; Private Permutation; Deduplication
【Paper Link】 【Pages】:289-300
【Authors】: Mingjie Li ; Ying Zhang ; Yifang Sun ; Wei Wang ; Ivor W. Tsang ; Xuemin Lin
【Abstract】: Approximate nearest neighbour search (ANNS) in high dimensional space is a fundamental problem in many applications, such as multimedia databases, computer vision and information retrieval. Among many solutions, data-sensitive hashing-based methods are effective for this problem, yet few of them are designed for external storage scenarios and hence are not optimized for I/O efficiency during query processing. In this paper, we introduce a novel data-sensitive indexing and query processing framework for ANNS with an emphasis on optimizing I/O efficiency, especially sequential I/Os. The proposed index consists of several lists of point IDs, ordered by values that are obtained by learned hashing (i.e., mapping) functions on each corresponding data point. The functions are learned from the data and approximately preserve the order in the high-dimensional space. We consider two instantiations of the functions (linear and non-linear), both learned from the data with novel objective functions. We also develop an I/O efficient ANNS framework based on the index. Comprehensive experiments on six benchmark datasets show that our proposed methods with learned index structure perform much better than the state-of-the-art external memory-based ANNS methods in terms of I/O efficiency and accuracy.
【Keywords】: data privacy; file organisation; function approximation; indexing; input-output programs; learning (artificial intelligence); nearest neighbour methods; neural nets; query processing; search problems
【Paper Link】 【Pages】:301-312
【Authors】: Tim Gubner ; Viktor Leis ; Peter A. Boncz
【Abstract】: Modern query engines rely heavily on hash tables for query processing. Overall query performance and memory footprint is often determined by how hash tables and the tuples within them are represented. In this work, we propose three complementary techniques to improve this representation: Domain-Guided Prefix Suppression bit-packs keys and values tightly to reduce hash table record width. Optimistic Splitting decomposes values (and operations on them) into (operations on) frequently-accessed and infrequently-accessed value slices. By removing the infrequently-accessed value slices from the hash table record, it improves cache locality. The Unique Strings Self-aligned Region (USSR) accelerates handling frequently-occurring strings, which are very common in real-world data sets, by creating an on-the-fly dictionary of the most frequent strings. This allows executing many string operations with integer logic and reduces memory pressure. We integrated these techniques into Vectorwise. On the TPC-H benchmark, our approach reduces peak memory consumption by 2-4× and improves performance by up to 1.5×. On a real-world BI workload, we measured a 2× improvement in performance and in micro-benchmarks we observed speedups of up to 25×.
【Keywords】: Query Processing; Cache Efficiency; Strings
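As a rough illustration of the first technique above, here is a minimal sketch of domain-guided bit-packing: given the observed [min, max] domain of each column, a value needs only enough bits to cover its range, shrinking the hash-table record width. All names and the single-integer-word layout are illustrative, not the paper's Vectorwise implementation.

```python
# Hypothetical sketch of domain-guided prefix suppression: rebase each value
# to its domain minimum and pack it into the fewest bits that cover the range.

def bits_needed(lo: int, hi: int) -> int:
    """Number of bits required to encode any value in [lo, hi]."""
    return max(1, (hi - lo).bit_length())

def pack(values, domains):
    """Pack one record's column values into a single integer.

    `domains` is a list of (lo, hi) pairs, one per column; each value is
    rebased to its domain minimum and placed at a column-specific bit offset.
    """
    word, shift = 0, 0
    for v, (lo, hi) in zip(values, domains):
        word |= (v - lo) << shift
        shift += bits_needed(lo, hi)
    return word

def unpack(word, domains):
    """Recover the original column values from a packed record."""
    out, shift = [], 0
    for lo, hi in domains:
        w = bits_needed(lo, hi)
        out.append(((word >> shift) & ((1 << w) - 1)) + lo)
        shift += w
    return out

domains = [(1990, 2030), (0, 11)]   # e.g. a year column and a month column
packed = pack([2020, 3], domains)   # fits in 6 + 4 = 10 bits instead of 64
assert unpack(packed, domains) == [2020, 3]
```

The same rebasing idea extends to per-block domains, which is what makes the suppression "domain-guided" rather than a fixed-width encoding.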
【Paper Link】 【Pages】:313-324
【Authors】: Qiang Zhang ; Yongkun Li ; Patrick P. C. Lee ; Yinlong Xu ; Qiu Cui ; Liu Tang
【Abstract】: Persistent key-value (KV) stores are mainly designed based on the Log-Structured Merge-tree (LSM-tree), which suffer from large read and write amplifications, especially when KV stores grow in size. Existing design optimizations for LSM-tree-based KV stores often make certain trade-offs and fail to simultaneously improve both the read and write performance on large KV stores without sacrificing scan performance. We design UniKV, which unifies the key design ideas of hash indexing and the LSM-tree in a single system. Specifically, UniKV leverages data locality to differentiate the indexing management of KV pairs. It also develops multiple techniques to tackle the issues caused by unifying the indexing techniques, so as to simultaneously improve the performance in reads, writes, and scans. Experiments show that UniKV significantly outperforms several state-of-the-art KV stores (e.g., LevelDB, RocksDB, HyperLevelDB, and PebblesDB) in overall throughput under read-write mixed workloads.
【Keywords】: Indexing; Throughput; Compaction; Scalability; Merging; Sorting
【Paper Link】 【Pages】:325-336
【Authors】: TaiNing Wang ; Chee-Yong Chan
【Abstract】: Recent research on sampling-based join size estimation has focused on a promising new technique known as correlated sampling. While several variants of this technique have been proposed, there is a lack of a systematic study of this family of techniques. In this paper, we first introduce a framework to characterize its design space in terms of five parameters. Based on this framework, we propose a new correlated sampling based technique to address the limitations of existing techniques. Our new technique is based on using a discrete learning method for estimating the join size from samples. We experimentally compare the performance of multiple variants of our new technique and identify a hybrid variant that provides the best estimation quality. This hybrid variant not only outperforms the state-of-the-art correlated sampling technique, but it is also more robust to small samples and skewed data.
【Keywords】: query processing; database systems; sampling methods
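For readers unfamiliar with the family of techniques studied above, here is a minimal sketch of plain correlated sampling for join size estimation; the paper's discrete learning estimator and its five-parameter design space are not reproduced, and all function names are illustrative. The key idea is that both relations sample by the same hash of the join key, so joining tuples survive together.

```python
# A minimal sketch of correlated sampling: keep a tuple iff a shared hash of
# its join key falls below the sampling rate p, then scale the sample join
# size by 1/p (each joining key pair survives with probability p, not p^2,
# because both sides use the same hash).
import hashlib
from collections import Counter

def h(key) -> float:
    """Deterministic hash of a join key into [0, 1)."""
    d = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def correlated_sample(keys, p):
    return [k for k in keys if h(k) < p]

def estimate_join_size(r_keys, s_keys, p=0.1):
    """Unbiased estimate of |R join S| on the join key."""
    r = Counter(correlated_sample(r_keys, p))
    s = Counter(correlated_sample(s_keys, p))
    return sum(r[k] * s[k] for k in r if k in s) / p
```

With p = 1.0 the estimate is exact; smaller p trades accuracy for sample size, which is the regime where the paper's learned estimator improves on the basic scaling above.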
【Paper Link】 【Pages】:337-348
【Authors】: Botao Peng ; Panagiota Fatourou ; Themis Palpanas
【Abstract】: Data series similarity search is a core operation for several data series analysis applications across many different domains. However, the state-of-the-art techniques fail to deliver the time performance required for interactive exploration, or analysis of large data series collections. In this work, we propose MESSI, the first data series index designed for in-memory operation on modern hardware. Our index takes advantage of the modern hardware parallelization opportunities (i.e., SIMD instructions, multi-core and multi-socket architectures), in order to accelerate both index construction and similarity search processing times. Moreover, it benefits from a careful design in the setup and coordination of the parallel workers and data structures, so that it maximizes its performance for in-memory operations. Our experiments with synthetic and real datasets demonstrate that overall MESSI is up to 4× faster at index construction, and up to 11× faster at query answering than the state-of-the-art parallel approach. MESSI is the first to answer exact similarity search queries on 100GB datasets in ~50msec (30-75msec across diverse datasets), which enables real-time, interactive data exploration on very large data series collections.
【Keywords】: Data series; Indexing; Modern hardware
【Paper Link】 【Pages】:349-360
【Authors】: Xiucheng Li ; Gao Cong ; Yun Cheng
【Abstract】: In this paper, we study the problem of predicting the most likely traveling route on the road network between two given locations by considering real-time traffic. We present a deep probabilistic model, DeepST, which unifies three key explanatory factors for the route decision: the past traveled route, the impact of the destination, and real-time traffic. DeepST explains the generation of the next route by conditioning on the representations of the three explanatory factors. To enable effective sharing of statistical strength, we propose to learn representations of K-destination proxies with an adjoint generative model. To incorporate the impact of real-time traffic, we introduce a high-dimensional latent variable as its representation, whose posterior distribution can then be inferred from observations. An efficient inference method is developed within the Variational Auto-Encoders framework to scale DeepST to large-scale datasets. We conduct experiments on two real-world large-scale trajectory datasets to demonstrate the superiority of DeepST over the existing methods on two tasks: the most likely route prediction and route recovery from sparse trajectories. In particular, on one public large-scale trajectory dataset, DeepST surpasses the best competing method by almost 50% on the most likely route prediction task and by up to 15% on the route recovery task in terms of accuracy.
【Keywords】: Roads; Trajectory; Hidden Markov models; Real-time systems; Probabilistic logic; Data models; Task analysis
【Paper Link】 【Pages】:361-372
【Authors】: Shen Liang ; Yanchun Zhang ; Jiangang Ma
【Abstract】: Positive unlabeled time series classification (PUTSC) refers to classifying time series with a set PL of positive labeled examples and a set U of unlabeled ones. Model selection for PUTSC is a largely untouched topic. In this paper, we look into PUTSC model selection, which as far as we know is the first systematic study of this topic. Focusing on the widely adopted self-training one-nearest-neighbor (ST-1NN) paradigm, we propose a model selection framework based on active learning (AL). We present the novel concepts of self-training label propagation, pseudo label calibration principles and ultimately influence to fully exploit the mechanism of ST-1NN. Based on them, we develop an effective model performance evaluation strategy and three AL sampling strategies. Experiments on over 120 datasets and a case study in arrhythmia detection show that our methods can yield top performance in interactive environments, and can achieve near optimal results by querying very limited numbers of labels from the AL oracle.
【Keywords】: time series; positive unlabeled classification; model selection; active learning; self-training
【Paper Link】 【Pages】:373-384
【Authors】: Yuanduo He ; Xu Chu ; Yasha Wang
【Abstract】: Unsupervised time series mining has been attracting great interest from both academic and industrial communities. As the two most basic data mining tasks, the discoveries of frequent/rare subsequences have been extensively studied in the literature. Specifically, frequent/rare subsequences are defined as the ones with the smallest/largest 1-nearest neighbor distance, which are also known as motif/discord. However, discord fails to identify rare subsequences when they occur more than once in the time series, which is widely known as the twin freak problem. This problem is just the "tip of the iceberg" due to the 1-nearest neighbor distance based definitions. In this work, we for the first time provide a clear theoretical analysis of motif/discord as the 1-nearest neighbor based nonparametric density estimation of subsequences. Particularly, we focus on matrix profile, a recently proposed mining framework, which unifies the discovery of motif and discord under the same computing model. Thereafter, we point out three inherent issues: low-quality density estimation, gravity defiant behavior, and lack of a reusable model, which deteriorate the performance of matrix profile in both efficiency and subsequence quality. To overcome these issues, we propose Neighbor Profile to robustly model the subsequence density by bagging nearest neighbors for the discovery of frequent/rare subsequences. Specifically, we leverage multiple subsamples and average the density estimations from subsamples using adjusted nearest neighbor distances, which not only enhances the estimation robustness but also realizes a reusable model for efficient learning. We check the sanity of neighbor profile on synthetic data and further evaluate it on real-world datasets. The experimental results demonstrate that neighbor profile can correctly model subsequences of different densities and shows significantly superior performance over matrix profile on the real-world arrhythmia dataset. Also, it is shown that neighbor profile is efficient for massive datasets.
【Keywords】: Time series; motif; discord; nearest neighbor; unsupervised mining
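A brute-force sketch of the matrix profile definition the paper analyzes may help: each length-m subsequence is scored by the distance to its nearest non-overlapping neighbor, the maximum marks the discord and the minimum the motif. This uses plain Euclidean distance rather than the z-normalized distance and the O(n²m) loop rather than the optimized algorithms; all names are illustrative.

```python
# Brute-force matrix profile: for each length-m subsequence, the Euclidean
# distance to its nearest neighbor outside an exclusion zone (trivial matches
# with overlapping subsequences are skipped).
import math

def matrix_profile(ts, m):
    n = len(ts) - m + 1
    subs = [ts[i:i + m] for i in range(n)]
    profile = []
    for i in range(n):
        best = math.inf
        for j in range(n):
            if abs(i - j) < m:          # exclusion zone: skip overlaps
                continue
            best = min(best, math.dist(subs[i], subs[j]))
        profile.append(best)
    return profile

ts = [0, 0, 0, 0, 5, 9, 5, 0, 0, 0, 0, 0]
mp = matrix_profile(ts, 3)
discord_idx = max(range(len(mp)), key=mp.__getitem__)  # rarest subsequence
```

The twin freak problem criticized in the abstract is visible here: if the spike [5, 9, 5] appeared twice far apart, each copy's nearest-neighbor distance would drop to zero and neither would be flagged as a discord.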
【Paper Link】 【Pages】:385-396
【Authors】: Fabian Gieseke ; Sabina Rosca ; Troels Henriksen ; Jan Verbesselt ; Cosmin E. Oancea
【Abstract】: Large amounts of satellite data are now becoming available, which, in combination with appropriate change detection methods, offer the opportunity to derive accurate information on timing and location of disturbances such as deforestation events across the earth surface. Typical scenarios require the analysis of billions of image patches/pixels. While various change detection techniques have been proposed in the literature, the associated implementations usually do not scale well, which renders the corresponding analyses computationally very expensive or even impossible. In this work, we propose a novel massively-parallel implementation for a state-of-the-art change detection method and demonstrate its potential in the context of monitoring deforestation. The novel implementation can handle large scenarios in a few hours or days using cheap commodity hardware, compared to weeks or even years using the existing publicly available code, and enables researchers, for the first time, to conduct global-scale analyses covering large parts of our Earth using little computational resources. From a technical perspective, we provide a high-level parallel algorithm specification along with several performance-critical optimizations dedicated to efficiently map the specified parallelism to modern parallel devices. While a particular change detection method is addressed in this work, the algorithmic building blocks provided are also of immediate relevance to a wide variety of related approaches in remote sensing and other fields.
【Keywords】: Time series analysis; Monitoring; Satellites; Data models; Earth; Mathematical model; Remote sensing
【Paper Link】 【Pages】:397-408
【Authors】: Qiyan Li ; Yuanyuan Zhu ; Jeffrey Xu Yu
【Abstract】: Given a network with social and spatial information, cohesive group queries aim at finding a group of users who are strongly connected and closely co-located. Most existing studies are limited to finding groups with either the strongest social ties under a certain spatial constraint or the minimum spatial distance under certain social constraints. It is difficult for users to decide which constraints they need to choose and how to prioritize the constraints to meet their real requirements, since the social constraint and spatial constraint are different in nature. In this paper, we take a new approach that considers the constraints equally and study a skyline query. Specifically, given a road-social network consisting of a road network G_r and a location-based social network G_s, we aim to find a set of skyline cohesive groups, in which each group cannot be dominated by any other group in terms of social cohesiveness and spatial cohesiveness. We find a group of users using social cohesiveness based on the (k, c)-core (a k-core of size c) and spatial cohesiveness based on the travel cost from group members to a meeting point. This skyline problem is NP-hard, as we need to explore the combinations of c vertices to check whether they form a qualified (k, c)-core. In this paper, we first provide exact solutions by developing efficient pruning strategies to filter out a large number of combinations which cannot form a (k, c)-core, and then propose highly efficient greedy solutions based on a newly designed cd-tree that keeps the road-network distance and social structural information simultaneously. Experimental results show that our exact methods run faster than the brute-force methods by 2-4 orders of magnitude in general, and our cd-tree based greedy methods can significantly reduce the computation cost by 1-4 orders of magnitude while the extra travel cost is less than 5% compared to the exact methods on multiple real road-social networks.
【Keywords】: Roads; Social network services; Density measurement; Games; Atmospheric measurements; Atmospheric modeling; Junctions
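The social-cohesiveness primitive above builds on the classic k-core. A minimal peeling sketch follows; the (k, c)-core size restriction, road-network travel costs, and the skyline search itself are not reproduced, and names are illustrative.

```python
# k-core by iterative peeling: repeatedly delete vertices whose degree within
# the remaining graph is below k; what survives is the maximal subgraph with
# minimum degree >= k.

def k_core(adj, k):
    """Return the vertex set of the k-core. `adj` maps vertex -> neighbor set."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if sum(1 for u in adj[v] if u in alive) < k:
                alive.remove(v)          # degree too low: peel it off
                changed = True
    return alive

# A triangle with a pendant vertex: the 2-core is exactly the triangle.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
assert k_core(adj, 2) == {1, 2, 3}
```

The paper's harder step, checking whether some c-vertex subset forms a (k, c)-core, cannot be done by peeling alone, which is why the exact algorithms need pruning over vertex combinations.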
【Paper Link】 【Pages】:409-420
【Authors】: Taotao Cai ; Jianxin Li ; Nur Al Hasan Haldar ; Ajmal Mian ; John Yearwood ; Timos Sellis
【Abstract】: User engagement has recently received significant attention in understanding decay and expansion of communities in social networks. However, the problem of user engagement has not been fully explored in terms of users' specific interests and structural cohesiveness together. Therefore, we fill the gap by investigating the problem of community engagement from the perspective of attributed communities. Given a set of keywords W, a structural cohesiveness parameter k, and a budget parameter l, our objective is to find l users who can induce a maximal expanded community. Meanwhile, every community member must contain the given keywords in W and the community should meet the specified structural cohesiveness constraint k. We introduce this problem as best-Anchored Vertex set Exploration (AVE). To solve the AVE problem, we develop a Filter-Verify framework that maintains the intermediate results using a multiway tree, and probes the best anchored users in a best-first search manner. To improve efficiency, we further design a keyword-aware anchored and follower index, and also develop an index-based efficient algorithm. The proposed algorithm can greatly reduce the cost of computing anchored users and their followers. Additionally, we present two bound properties that can guarantee the correctness of our solution. Finally, we demonstrate the efficiency of our proposed algorithms and index. We measure the effectiveness of the attributed-community-based community engagement model by conducting extensive experiments on five real-world datasets.
【Keywords】: Social network services; Indexes; Companies; Australia; Filtering algorithms; Maintenance engineering; Approximation algorithms
【Paper Link】 【Pages】:421-432
【Authors】: Ruida Yang ; Xin Lin ; Jianliang Xu ; Yan Yang ; Liang He
【Abstract】: Knowledge graphs have been used in a wide range of applications to support search, recommendation, and question answering (Q&A). For example, in Q&A systems, given a new question, we may use a knowledge graph to automatically identify the most suitable answers based on similarity evaluation. However, such systems may suffer from two major limitations. First, the knowledge graph constructed based on source data may contain errors. Second, the knowledge graph may become out of date and cannot quickly adapt to new knowledge. To address these issues, in this paper, we propose an interactive framework that refines and optimizes knowledge graphs through user votes. We develop an efficient similarity evaluation notion, called extended inverse P-distance, based on which the graph optimization problem can be formulated as a signomial geometric programming problem. We then propose a basic single-vote solution and a more advanced multi-vote solution for graph optimization. We also propose a split-and-merge optimization strategy to scale up the multi-vote solution. Extensive experiments based on real-life and synthetic graphs demonstrate the effectiveness and efficiency of our proposed framework.
【Keywords】: Knowledge Graphs; Question Answering; Data Cleaning; Query Processing
【Paper Link】 【Pages】:433-444
【Authors】: Yongqi Zhang ; Quanming Yao ; Wenyuan Dai ; Lei Chen
【Abstract】: Scoring functions (SFs), which measure the plausibility of triplets in a knowledge graph (KG), have become the crux of KG embedding. Many SFs, which aim at capturing different kinds of relations in KGs, have been designed by humans in recent years. However, as relations can exhibit complex patterns that are hard to infer before training, none of them can consistently perform better than the others on existing benchmark data sets. In this paper, inspired by the recent success of automated machine learning (AutoML), we propose to automatically design SFs (AutoSF) for distinct KGs by AutoML techniques. However, it is non-trivial to explore domain-specific information here to make AutoSF efficient and effective. We first identify a unified representation over popularly used SFs, which helps to set up a search space for AutoSF. Then, we propose a greedy algorithm to search in such a space efficiently. The algorithm is further sped up by a filter and a predictor, which avoid repeatedly training SFs with the same expressive ability and help remove bad candidates during the search before model training. Finally, we perform extensive experiments on benchmark data sets. Results on link prediction and triplets classification show that the SFs searched by AutoSF are KG-dependent, new to the literature, and outperform the state-of-the-art SFs designed by humans.
【Keywords】: Machine learning; Search problems; Training; Optimization; Adaptation models; Artificial neural networks
【Paper Link】 【Pages】:445-456
【Authors】: Yuxiang Wang ; Arijit Khan ; Tianxing Wu ; Jiahui Jin ; Haijiang Yan
【Abstract】: Recently, graph query has been widely adopted for querying knowledge graphs. Given a query graph G_Q, a graph query finds subgraphs in a knowledge graph G that exactly or approximately match G_Q. We face two challenges in graph query: (1) the structural gap between G_Q and the predefined schema in G causes mismatches with the query graph; (2) users cannot view the answers until the graph query terminates, leading to a longer system response time (SRT). In this paper, we propose a semantic-guided and response-time-bounded graph query to return the top-k answers effectively and efficiently. We leverage a knowledge graph embedding model to build the semantic graph SG_Q, and we define the path semantic similarity (pss) over SG_Q as the metric to evaluate an answer's quality. Then, we propose an A* semantic search on SG_Q to find the top-k answers with the greatest pss via a heuristic pss estimation. Furthermore, we develop an approximate optimization of the A* semantic search to allow users to trade off effectiveness for SRT within a user-specific time bound. Extensive experiments over real datasets confirm the effectiveness and efficiency of our solution.
【Keywords】: Automobiles; Semantic search; Time factors; Vocabulary; Pattern matching; Estimation
【Paper Link】 【Pages】:457-468
【Authors】: Jiaxin Jiang ; Xin Huang ; Byron Choi ; Jianliang Xu ; Sourav S. Bhowmick ; Lyu Xu
【Abstract】: Due to the unstructuredness and the lack of schemas of graphs, such as knowledge graphs, social networks and RDF graphs, keyword search has been proposed for querying such graphs/networks. In many applications (e.g., social networks), users may prefer to hide parts or all of their data graphs (e.g., private friendships) from the public. This leads to a recent graph model, namely the public-private network model, in which each user has his/her own network. While there have been studies on public-private network analysis, keyword search on public-private networks has not yet been studied. For example, query answers on private networks and on a combination of private and public networks can be different. In this paper, we propose a new keyword search framework, called public-private keyword search (PPKWS). PPKWS consists of three major steps: partial evaluation, answer refinement, and answer completion. Since there are plenty of keyword search semantics, we select three representative ones and show that they can be implemented on the model with minor modifications. We propose indexes and optimizations for PPKWS. We have verified through experiments that, on average, the algorithms implemented on top of PPKWS run 113 times faster than the original algorithms directly running on the public network attached to the private network for retrieving answers that span across them.
【Keywords】: Keyword search; Semantics; Portals; Indexes; Collaboration; Social network services; Optimization
【Paper Link】 【Pages】:469-480
【Authors】: Shagufta Mehnaz ; Elisa Bertino
【Abstract】: Anomaly detection on data collected by devices, such as sensors and IoT objects, is inevitable for many critical systems, e.g., an anomaly in the data of a patient's health monitoring device may indicate a medical emergency. Because of the resource-constrained nature of these devices, data collected by such devices are usually off-loaded to the cloud/edge for storage and/or further analysis. However, to ensure data privacy it is critical that the data be transferred to and managed by the cloud/edge in an encrypted form, which necessitates efficient processing of such encrypted data for real-time anomaly detection. Motivated by the simultaneous demands for data privacy and real-time data processing, in this paper, we investigate the problem of a privacy-preserving real-time anomaly detection service on sensitive, time series, streaming data. We propose a privacy-preserving framework that enables efficient anomaly detection on encrypted data by leveraging a lightweight and aggregation-optimized encryption scheme to encrypt the data before off-loading the data to the edge. We demonstrate our solution for a widely used anomaly detection algorithm, the windowed Gaussian anomaly detector, and evaluate the performance of the solution in terms of the obtained model privacy, accuracy, latency, and communication cost.
【Keywords】: Anomaly detection; Edge Computing; Privacy
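A plaintext sketch of the windowed Gaussian detector mentioned above may help; the paper's contribution is running this computation over encrypted data, and that encryption layer is omitted here, with all names illustrative.

```python
# Windowed Gaussian anomaly detection: score each new point by how many
# standard deviations it sits from the mean of a sliding window of recent
# points. (The encrypted variant would maintain the same sums and sums of
# squares under an aggregation-friendly encryption scheme.)
from collections import deque
import math

class WindowedGaussian:
    def __init__(self, window=50, threshold=3.0):
        self.buf = deque(maxlen=window)
        self.threshold = threshold

    def score(self, x):
        """Return (z-score, is_anomaly) for x, then add x to the window."""
        if len(self.buf) < 2:
            self.buf.append(x)          # not enough history yet
            return 0.0, False
        mean = sum(self.buf) / len(self.buf)
        var = sum((v - mean) ** 2 for v in self.buf) / len(self.buf)
        z = abs(x - mean) / math.sqrt(var) if var > 0 else 0.0
        self.buf.append(x)
        return z, z > self.threshold

det = WindowedGaussian(window=10)
flags = [det.score(v)[1] for v in [10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 100]]
assert flags[-1]                        # the spike at 100 is flagged
```

Because the score depends only on window sums and sums of squares, it is a natural fit for additively homomorphic aggregation, which is the property the paper's framework exploits.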
【Paper Link】 【Pages】:481-492
【Authors】: Chao Yan ; Haifeng Xu ; Yevgeniy Vorobeychik ; Bo Li ; Daniel Fabbri ; Bradley A. Malin
【Abstract】: Routine operational use of sensitive data is often governed by law and regulation. For instance, in the medical domain, there are various statutes at the state and federal level that dictate who is permitted to work with patients' records and under what conditions. To screen for potential privacy breaches, logging systems are usually deployed to trigger alerts whenever a suspicious access is detected. However, such mechanisms are often inefficient because 1) the vast majority of triggered alerts are false positives, 2) small budgets make it unlikely that a real attack will be detected, and 3) attackers can behave strategically, such that traditional auditing mechanisms cannot easily catch them. To improve efficiency, information systems may invoke signaling, so that whenever a suspicious access request occurs, the system can, in real time, warn the user that the access may be audited. Then, at the close of a finite period, a selected subset of suspicious accesses is audited. This gives rise to an online problem in which one needs to determine 1) whether a warning should be triggered and 2) the likelihood that the data request event will be audited. In this paper, we formalize this auditing problem as a Signaling Audit Game (SAG), in which we model the interactions between an auditor and an attacker in the context of signaling, and the usability cost is represented as a factor of the auditor's payoff. We study the properties of its Stackelberg equilibria and develop a scalable approach to compute its solution. We show that a strategic presentation of warnings adds value in that SAGs realize significantly higher utility for the auditor than systems without signaling. We perform a series of experiments with 10 million real access events, containing over 26K alerts, from a large academic medical center to illustrate the value of the proposed auditing model and the consistency of its advantages over existing baseline methods.
【Keywords】: database auditing; privacy; signaling; Stackelberg game
【Paper Link】 【Pages】:493-504
【Authors】: Ios Kotsogiannis ; Stelios Doudalis ; Samuel Haney ; Ashwin Machanavajjhala ; Sharad Mehrotra
【Abstract】: We study the problem of privacy-preserving data sharing, wherein only a subset of the records in a database is sensitive, possibly based on predefined privacy policies. Existing solutions, viz. differential privacy (DP), are over-pessimistic as they treat all records as sensitive. Alternatively, techniques like access control and personalized differential privacy that reveal all non-sensitive records truthfully indirectly leak whether a record is sensitive and consequently the record's value. In this work we introduce one-sided differential privacy (OSDP) that offers provable privacy guarantees to the sensitive records. In addition, OSDP satisfies the sensitivity masking property which ensures that any algorithm satisfying OSDP does not allow an attacker to significantly decrease his/her uncertainty about whether a record is sensitive or not. We design OSDP algorithms that can truthfully release a sample of non-sensitive records. Such algorithms can be used to support applications that must output true data with little loss in utility, especially when using complex types of data like images or location trajectories. Additionally, we present OSDP algorithms for releasing count queries, which leverage the presence of non-sensitive records and are able to offer up to a 6× improvement in accuracy over state-of-the-art DP solutions.
【Keywords】: Sensitivity; Databases; Privacy; Access control; Uncertainty
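The release primitive described in the OSDP abstract above can be illustrated with a minimal sketch: suppress every sensitive record unconditionally and release each non-sensitive record independently with some probability p. This is a toy illustration of one-sided sampling, not the paper's exact mechanism or its privacy accounting; the `is_sensitive` policy and the sample data are hypothetical.

```python
import random

def one_sided_sample(records, is_sensitive, p, seed=0):
    """Release each non-sensitive record independently with probability p;
    sensitive records are never released (one-sided sampling sketch)."""
    rng = random.Random(seed)
    released = []
    for r in records:
        if is_sensitive(r):
            continue  # sensitive records are always suppressed
        if rng.random() < p:
            released.append(r)
    return released

# Hypothetical records: (name, diagnosis); treating "HIV" rows as sensitive
# is an illustrative policy, not one from the paper.
records = [("alice", "flu"), ("bob", "HIV"), ("carol", "cold"), ("dave", "flu")]
out = one_sided_sample(records, lambda r: r[1] == "HIV", p=0.8)
assert all(r[1] != "HIV" for r in out)  # no sensitive record is ever released
```

Note that, unlike DP noise addition, every released record is a true record, which is the "truthful release" property the abstract emphasizes.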
【Paper Link】 【Pages】:505-516
【Authors】: Xiaolan Gu ; Ming Li ; Li Xiong ; Yang Cao
【Abstract】: Local Differential Privacy (LDP) provides provable privacy protection for data collection without the assumption of the trusted data server. In the real-world scenario, different data have different privacy requirements due to the distinct sensitivity levels. However, LDP provides the same protection for all data. In this paper, we tackle the challenge of providing input-discriminative protection to reflect the distinct privacy requirements of different inputs. We first present the Input-Discriminative LDP (ID-LDP) privacy notion and focus on a specific version termed MinID-LDP, which is shown to be a fine-grained version of LDP. Then, we focus on the application of frequency estimation and develop the IDUE mechanism based on Unary Encoding for single-item input and the extended mechanism IDUE-PS (with Padding-and-Sampling protocol) for item-set input. The results on both synthetic and real-world datasets validate the correctness of our theoretical analysis and show that the proposed mechanisms satisfying MinID-LDP have better utility than the state-of-the-art mechanisms satisfying LDP due to the input-discriminative protection.
【Keywords】: local differential privacy; input-discriminative protection; frequency estimation
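For context on the IDUE entry above, the uniform-ε Unary Encoding baseline it builds on can be sketched as follows. This is the known Optimized Unary Encoding mechanism (keep the one-bit with probability 1/2, flip each zero-bit with probability 1/(e^ε+1)), not the input-discriminative IDUE mechanism itself.

```python
import math, random

def oue_perturb(item, domain_size, eps, rng):
    """One-hot encode `item`, then perturb: keep the 1-bit with prob 1/2,
    set each 0-bit to 1 with prob 1/(e^eps + 1) (Optimized Unary Encoding)."""
    p, q = 0.5, 1.0 / (math.exp(eps) + 1.0)
    return [
        (1 if rng.random() < p else 0) if i == item
        else (1 if rng.random() < q else 0)
        for i in range(domain_size)
    ]

def oue_estimate(bit_counts, n, eps):
    """Unbiased per-item frequency estimates from the aggregated bit vectors
    of n users: (observed count - n*q) / (p - q)."""
    p, q = 0.5, 1.0 / (math.exp(eps) + 1.0)
    return [(c - n * q) / (p - q) for c in bit_counts]
```

The estimator is exact in expectation: if n₁ users hold item i, the expected bit count is n₁p + (n−n₁)q, and plugging that into `oue_estimate` returns n₁.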
【Paper Link】 【Pages】:517-528
【Authors】: Qian Tao ; Yongxin Tong ; Zimu Zhou ; Yexuan Shi ; Lei Chen ; Ke Xu
【Abstract】: With spatial crowdsourcing applications such as Uber and Waze deeply penetrated into everyday life, there is a growing concern to protect user privacy in spatial crowdsourcing. Particularly, locations of workers and tasks should be properly processed via certain privacy mechanism before reporting to the untrusted spatial crowdsourcing server for task assignment. Privacy mechanisms typically perturb the location information, which tends to make task assignment ineffective. Prior studies only provide guarantees on privacy protection without assuring the effectiveness of task assignment. In this paper, we investigate privacy protection for online task assignment with the objective of minimizing the total distance, an important task assignment formulation in spatial crowdsourcing. We design a novel privacy mechanism based on Hierarchically Well-Separated Trees (HSTs). We prove that the mechanism is ε-Geo-Indistinguishable and show that there is a task assignment algorithm with a competitive ratio of O(ε^(1/4) log N log² k), where ε is the privacy budget, N is the number of predefined points on the HST, and k is the matching size. Extensive experiments on synthetic and real datasets show that online task assignment under our privacy mechanism is notably more effective in terms of total distance than under prior differentially private mechanisms.
【Keywords】: Task analysis; Privacy; Servers; Crowdsourcing; Data privacy; Extraterrestrial measurements
【Paper Link】 【Pages】:529-540
【Authors】: Noura Alghamdi ; Liang Zhang ; Huayi Zhang ; Elke A. Rundensteiner ; Mohamed Y. Eltabakh
【Abstract】: Scalable subsequence matching is critical for supporting analytics on big time series from mining, prediction to hypothesis testing. However, state-of-the-art subsequence matching techniques do not scale well to TB-scale datasets. Not only does index construction become prohibitively expensive, but also the query response time deteriorates quickly as the length of the query subsequence exceeds several hundreds of data points. Although Locality Sensitive Hashing (LSH) has emerged as a promising solution for indexing long time series, it relies on expensive hash functions that perform multiple passes over the data and thus is impractical for big time series. In this work, we propose a lightweight distributed indexing framework, called ChainLink, that supports approximate kNN queries over TB-scale time series data. As a foundation of ChainLink, we design a novel hashing technique, called Single Pass Signature (SPS), that successfully tackles the above problem. In particular, we prove theoretically and demonstrate experimentally that the similarity proximity of the indexed subsequences is preserved by our proposed single-pass SPS scheme. Leveraging this SPS innovation, ChainLink then adopts a three-step approach for scalable index building: (1) in-place data re-organization within each partition to enable efficient record-level random access to all subsequences, (2) parallel building of hash-based local indices on top of the re-organized data using our SPS scheme for efficient search within each partition, and (3) efficient aggregation of the local indices to construct a centralized yet highly compact global index for effective pruning of irrelevant partitions during query processing. ChainLink achieves the above three steps in one single map-reduce process. Our experimental evaluation shows that ChainLink indices are compact at less than 2% of dataset size while state-of-the-art index sizes tend to be almost the same size as the dataset.
Better still, ChainLink is up to 2 orders of magnitude faster in its index construction time compared to state-of-the-art techniques, while improving both the final query response time by up to 10× and the result accuracy by 15%.
【Keywords】: Time series analysis; Indexing; Feature extraction; Time factors; Distributed databases; Scalability
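A generic locality-sensitive signature for subsequences, such as sign-random-projection (SimHash-style) hashing, can be sketched as below. This is an illustrative stand-in only: the SPS scheme in the ChainLink entry above is a different, single-pass hashing technique, and the window length and bit count here are arbitrary choices.

```python
import random

def sign_projection_signature(window, planes):
    """LSH signature: one sign bit per random hyperplane (SimHash-style).
    Similar subsequences agree on most bits with high probability."""
    return tuple(1 if sum(w * r for w, r in zip(window, plane)) >= 0 else 0
                 for plane in planes)

def index_subsequences(series, w, num_bits, seed=0):
    """Slide a length-w window over the series and bucket each window's
    start offset by its signature, giving a hash-based candidate index."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(w)] for _ in range(num_bits)]
    buckets = {}
    for start in range(len(series) - w + 1):
        sig = sign_projection_signature(series[start:start + w], planes)
        buckets.setdefault(sig, []).append(start)
    return buckets, planes
```

A query subsequence is hashed with the same planes and only its bucket (or nearby buckets by Hamming distance) is verified, which is the pruning effect LSH-style indices provide.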
【Paper Link】 【Pages】:541-552
【Authors】: Trong Duc Nguyen ; Ming-Hung Shih ; Sai Sree Parvathaneni ; Bojian Xu ; Divesh Srivastava ; Srikanta Tirthapura
【Abstract】: Random sampling has been widely used in approximate query processing on large databases, due to its potential to significantly reduce resource usage and response times, at the cost of a small approximation error. We consider random sampling for answering the ubiquitous class of group-by queries, which first group data according to one or more attributes, and then aggregate within each group after filtering through a predicate. The challenge with group-by queries is that a sampling method cannot focus on optimizing the quality of a single answer (e.g. the mean of selected data), but must simultaneously optimize the quality of a set of answers (one per group). We present CVOPT, a query- and data-driven sampling framework for a set of group-by queries. To evaluate the quality of a sample, CVOPT defines a metric based on the norm (e.g. ℓ₂ or ℓ∞) of the coefficients of variation (CVs) of different answers, and constructs a stratified sample that provably optimizes the metric. CVOPT can handle group-by queries on data where groups have vastly different statistical characteristics, such as frequencies, means, or variances. CVOPT jointly optimizes for multiple aggregations and multiple group-by clauses, and provides a way to prioritize specific groups or aggregates. It can be tuned to cases when partial information about a query workload is known, such as a data warehouse where queries are run periodically. Our experimental results show that CVOPT outperforms the current state-of-the-art on sample quality and estimation accuracy for group-by queries. On a set of queries on two real-world data sets, CVOPT yields relative errors that are 5× smaller than competing approaches, under the same space budget.
【Keywords】: Resource management; Measurement; Aggregates; Optimization; Query processing; Sociology; Statistics
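The quality metric sketched in the CVOPT abstract above, a norm over the per-group coefficients of variation of the answers, can be written down directly. This is an illustrative rendition of the metric only; CVOPT's actual estimators and the sample-allocation optimization are the paper's contribution.

```python
import math

def cv_of_group_mean(values_sample, n_alloc):
    """Coefficient of variation of a group's estimated mean when that group
    is answered from a sample of size n_alloc: (s / sqrt(n)) / mean."""
    m = sum(values_sample) / len(values_sample)
    var = sum((v - m) ** 2 for v in values_sample) / (len(values_sample) - 1)
    return math.sqrt(var / n_alloc) / m

def sample_quality(groups, alloc, norm="l2"):
    """CVOPT-style quality metric: an l2 or l-infinity norm over the
    per-group CVs; lower is better."""
    cvs = [cv_of_group_mean(g, a) for g, a in zip(groups, alloc)]
    return max(cvs) if norm == "linf" else math.sqrt(sum(c * c for c in cvs))
```

Because a high-variance group dominates the norm, shifting sample budget toward it lowers the metric, which is the intuition behind optimizing a stratified allocation against this objective.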
【Paper Link】 【Pages】:553-564
【Authors】: Li Wang ; Zining Zhang ; Bingsheng He ; Zhenjie Zhang
【Abstract】: With the growing popularity of the Non-Volatile Memory (NVM) technology, NVM express (NVMe) is now becoming the de facto interface for high-end block devices. NVMe generally enables the applications to exploit the massive internal parallelism of new-generation solid-state drives (SSDs) by issuing simultaneous I/O requests. However, existing B+ Trees are unable to maximize the utilization of NVMe hardware performance because of their synchronous execution paradigm which is incompatible with the interface of NVMe. To tackle this problem, we propose PA-Tree, an NVMe-friendly B+ Tree with a novel, polled-mode, asynchronous execution paradigm to process multiple index operations in an interleaved and asynchronous manner. Such an execution paradigm allows PA-Tree to saturate NVMe hardware with sufficient asynchronous I/O operations, while avoiding the potential overhead of excessive multi-threading. To further unleash the power of the new paradigm, we devise a new workload-aware scheduling algorithm to optimize the access to the NVMe interface based on the unique characteristic of NVMe, which enhances throughput while minimizing processing latency as well as CPU consumption. Extensive experiments on both synthetic and real workloads demonstrate that PA-Tree achieves up to 5× improvement on throughput and 30% reduction on latency against state-of-the-art solutions.
【Keywords】: NVM; NVMe; B+ Tree; Index
【Paper Link】 【Pages】:565-576
【Authors】: Jianye Yang ; Wenjie Zhang ; Xiang Wang ; Ying Zhang ; Xuemin Lin
【Abstract】: With the prevalence of Internet access and user-generated content, a large number of documents/records, such as news and web pages, have been continuously generated in an unprecedented manner. In this paper, we study the problem of efficient stream set similarity join over distributed systems, which has broad applications in data cleaning and data integration tasks, such as on-line near-duplicate detection. In contrast to the prefix-based distribution strategy which is widely adopted in offline distributed processing, we propose a simple yet efficient length-based distribution framework which dispatches incoming records by their length. A load-aware length partition method is developed to find a balanced partition by effectively estimating local join cost to achieve good load balance. Our length-based scheme is surprisingly superior to its competitors since it has no replication, small communication cost, and high throughput. We further observe that the join results from the current incoming record can be utilized to guide the index construction, which in turn can facilitate the join processing of future records. Inspired by this observation, we propose a novel bundle-based join algorithm by grouping similar records on-the-fly to reduce filtering cost. A by-product of this algorithm is an efficient verification technique, which verifies a batch of records by utilizing their token differences to share verification costs, rather than verifying them individually. Extensive experiments conducted on Storm, a popular distributed stream processing system, suggest that our methods can achieve up to one order of magnitude throughput improvement over baselines.
【Keywords】: Distributed databases; Partitioning algorithms; Approximation algorithms; Throughput; Indexing; Australia
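The length-based scheme above rests on a standard fact: Jaccard(x, y) ≥ t implies t·|x| ≤ |y| ≤ |x|/t. A single-machine sketch of a streaming self-join that partitions the index by record length and probes only the admissible length range is shown below; the distributed load-aware dispatching and the bundle-based filtering are the paper's contributions and are not reproduced here.

```python
import math

def jaccard(a, b):
    return len(a & b) / len(a | b)

def stream_ssjoin(records, t):
    """Streaming set-similarity self-join with a length filter.
    An arriving record x only probes index partitions whose length L lies
    in [ceil(t*|x|), floor(|x|/t)], a necessary condition for
    Jaccard(x, y) >= t, then is itself indexed under its own length."""
    by_len, results = {}, []
    for rid, rec in enumerate(records):
        x = frozenset(rec)
        lo, hi = math.ceil(t * len(x)), math.floor(len(x) / t)
        for L in range(lo, hi + 1):
            for oid, y in by_len.get(L, []):
                if jaccard(x, y) >= t:
                    results.append((oid, rid))
        by_len.setdefault(len(x), []).append((rid, x))
    return results
```

Since every record is indexed in exactly one length partition, this mirrors the "no replication" property the abstract highlights for the length-based framework.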
【Paper Link】 【Pages】:577-588
【Authors】: Zhongle Xie ; Hongbin Ying ; Cong Yue ; Meihui Zhang ; Gang Chen ; Beng Chin Ooi
【Abstract】: With a huge volume and variety of data accumulated over the years, OnLine Analytical Processing (OLAP) systems are facing challenges in query efficiency. Furthermore, the design of OLAP systems cannot serve modern applications well due to their inefficiency in processing complex queries such as cohort queries with low query latency. In this paper, we present Cool, a cohort online analytical processing system. As an integrated system with the support of several newly proposed operators on top of a sophisticated storage layer, it processes both cohort queries and conventional OLAP queries with superb performance. Its distributed design contains minimal load balancing and fault tolerance support and is scalable. Our evaluation results show that Cool outperforms two state-of-the-art systems, MonetDB and Druid, by a wide margin in a single-node setting. The multi-node version of Cool can also beat the distributed Druid, as well as SparkSQL, by one order of magnitude in terms of query latency.
【Keywords】: Cohort Analysis; OLAP; Distributed System
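The shape of a cohort query, group users by the week of their first action and then aggregate by weeks elapsed since that cohort week, can be sketched as below. This toy retention count only illustrates the query semantics and has nothing to do with Cool's storage layer or operators.

```python
from collections import defaultdict

def cohort_retention(events):
    """events: (user, week, action) tuples. A user's cohort is the week of
    their first event; return, for each (cohort, age) cell, how many cohort
    users were active `age` weeks after the cohort week."""
    first = {}
    for user, week, _ in sorted(events, key=lambda e: e[1]):
        first.setdefault(user, week)  # earliest week wins
    table = defaultdict(set)
    for user, week, _ in events:
        cohort = first[user]
        table[(cohort, week - cohort)].add(user)
    return {cell: len(users) for cell, users in table.items()}
```

In a full OLAP system the same computation is pushed down to a columnar storage layer rather than evaluated tuple-at-a-time as here.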
【Paper Link】 【Pages】:589-600
【Authors】: Zijian Li ; Wenhao Zheng ; Xueling Lin ; Ziyuan Zhao ; Zhe Wang ; Yue Wang ; Xun Jian ; Lei Chen ; Qiang Yan ; Tiezheng Mao
【Abstract】: Learning network embeddings has attracted growing attention in recent years. However, most of the existing methods focus on homogeneous networks, which cannot capture the important type information in heterogeneous networks. To address this problem, in this paper, we propose TransN, a novel multi-view network embedding framework for heterogeneous networks. Compared with the existing methods, TransN is an unsupervised framework which does not require node labels or user-specified meta-paths as inputs. In addition, TransN is capable of handling more general types of heterogeneous networks than the previous works. Specifically, in our framework TransN, we propose a novel algorithm to capture the proximity information inside each single view. Moreover, to transfer the learned information across views, we propose an algorithm to translate the node embeddings between different views based on the dual-learning mechanism, which can both capture the complex relations between node embeddings in different views, and preserve the proximity information inside each view during the translation. We conduct extensive experiments on real-world heterogeneous networks, whose results demonstrate that the node embeddings generated by TransN outperform those of competitors in various network mining tasks.
【Keywords】: Heterogeneous Network Embedding; Representation Learning; Multi-View Network Embedding; Dual Learning
【Paper Link】 【Pages】:601-612
【Authors】: Jin Xu ; Jingbo Zhou ; Yongpo Jia ; Jian Li ; Hui Xiong
【Abstract】: Revenue prediction is an essential component in security analysis since the revenue of a company has a great impact on the performance of its stock. For investment, one of the most valuable pieces of information is the company's unexpected revenue, which is the difference between the officially reported revenue and the consensus estimate for revenue predicted by analysts. Since it is the unexpected revenue that indicates something exceeding or under analysts' expectation, it is an indispensable factor that influences the performance of a stock. Besides conventional trading data from the stock market and companies' financial reports, recent years have witnessed an extensive application of alternative data for gaining an information edge in stock investment. In this paper, we study the challenging problem of better predicting unexpected revenue of a company via machine learning with alternative data. To the best of our knowledge, this is the first work studying this problem in the literature. However, it is nontrivial to quantitatively model the relations between the unexpected revenue and the information provided by alternative data with a machine learning approach. Thus we propose an adaptive master-slave regularized model, called AMS for short, to effectively leverage alternative data for unexpected revenue prediction. AMS first trains a master model upon a company graph, which captures the relations among companies, using a graph neural network (GNN). Then for a target company, the master model generates an adaptive slave-model, which is specially optimized for this target company. Finally, we use this slave-model to predict the unexpected revenue of the target company. Besides its excellent prediction performance, another critical advantage of our AMS model lies in its superior interpretability, which is crucial for portfolio managers to understand the predicted results.
With extensive experiments using two real-world alternative datasets, we have demonstrated the effectiveness of our model against a set of competitors.
【Keywords】: Companies; Adaptation models; Predictive models; Data models; Machine learning; Investment
【Paper Link】 【Pages】:613-624
【Authors】: Zicun Cong ; Lingyang Chu ; Lanjun Wang ; Xia Hu ; Jian Pei
【Abstract】: More and more AI services are provided through APIs on cloud where predictive models are hidden behind APIs. To build trust with users and reduce potential application risk, it is important to interpret how such predictive models hidden behind APIs make their decisions. The biggest challenge of interpreting such predictions is that no access to model parameters or training data is available. Existing works interpret the predictions of a model hidden behind an API by heuristically probing the response of the API with perturbed input instances. However, these methods do not provide any guarantee on the exactness and consistency of their interpretations. In this paper, we propose an elegant closed-form solution named OpenAPI to compute exact and consistent interpretations for the family of Piecewise Linear Models (PLM), which includes many popular classification models. The major idea is to first construct a set of overdetermined linear equation systems with a small set of perturbed instances and the predictions made by the model on those instances. Then, we solve the equation systems to identify the decision features that are responsible for the prediction on an input instance. Our extensive experiments clearly demonstrate the exactness and consistency of our method.
【Keywords】: Predictive models; Computational modeling; Perturbation methods; Data models; Mathematical model; Training data; Training
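The probing idea in the OpenAPI entry above can be illustrated in its simplest, one-piece case: for a purely linear hidden model, the base prediction plus d axis-aligned perturbations determine the weights exactly. OpenAPI's contribution is the overdetermined-system treatment for piecewise-linear models; the sketch below is only this degenerate special case, with a hypothetical hidden model.

```python
def probe_linear_model(predict, x, dim):
    """Recover the local weights of a hidden *purely linear* model
    f(z) = w . z + b by black-box probing: f(x + e_i) - f(x) = w_i for each
    unit perturbation e_i, and b follows from the base prediction."""
    fx = predict(x)
    w = []
    for i in range(dim):
        xp = list(x)
        xp[i] += 1.0                # perturb coordinate i by one unit
        w.append(predict(xp) - fx)  # slope along axis i
    b = fx - sum(wi * xi for wi, xi in zip(w, x))
    return w, b
```

For a piecewise-linear model the probes must additionally stay inside one linear region, which is why the paper solves overdetermined systems over several perturbed instances instead of this two-point difference.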
【Paper Link】 【Pages】:625-636
【Authors】: Keqi Han ; Yuan Tian ; Yunjia Zhang ; Ling Han ; Hao Huang ; Yunjun Gao
【Abstract】: Reconstructing the topology of a diffusion network based on observed diffusion results is an open challenge in data mining. Existing approaches mostly assume that the observed diffusion results are available and consist of not only the final infection statuses of nodes, but also the exact timestamps that pinpoint when infections occur. Nonetheless, the exact infection timestamps are often unavailable in practice, due to a high cost and uncertainties in the monitoring of node infections. In this work, we investigate the problem of how to infer the topology of a diffusion network from only the final infection statuses of nodes. To this end, we propose a new scoring criterion for diffusion network reconstruction, which is able to estimate the likelihood of potential topologies of the objective diffusion network based on infection status results with a relatively low statistical error. As the proposed scoring criterion is decomposable, our problem is transformed into finding for each node in the network a set of most probable parent nodes that maximizes the value of a local score. Furthermore, to eliminate redundant computations during the search of most probable parent nodes, we identify insignificant candidate parent nodes by checking whether their infections have negative or extremely low positive correlations with the infections of a corresponding child node, and exclude them from the search space. Extensive experiments on both synthetic and real-world networks are conducted, and the results verify the effectiveness and efficiency of our approach.
【Keywords】: diffusion network; topology; influence relationship; infection timestamp
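The candidate-pruning step described in the abstract above can be sketched as a correlation filter over infection indicator vectors: candidates whose infections are negatively or only negligibly correlated with the child's are dropped from the parent search space. The 0.05 threshold and the data are illustrative choices; the paper's scoring criterion for ranking the remaining candidates is separate.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def prune_candidate_parents(child_obs, candidate_obs, min_corr=0.05):
    """Keep only candidate parents whose 0/1 infection vector (one entry per
    observed cascade) is positively correlated with the child's beyond an
    illustrative threshold."""
    return [name for name, obs in candidate_obs.items()
            if pearson(obs, child_obs) > min_corr]
```

Each vector entry records whether the node ended up infected in one cascade, so the filter uses only final infection statuses, matching the paper's setting of having no timestamps.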
【Paper Link】 【Pages】:637-648
【Authors】: Quang-Huy Duong ; Heri Ramampiaro ; Kjetil Nørvåg
【Abstract】: Dense subtensor detection is a well-studied area, with a wide range of applications, and numerous efficient approaches and algorithms have been proposed. Existing algorithms are generally efficient for dense subtensor detection and could perform well in many applications. However, the main drawback of most of these algorithms is that they can estimate only one subtensor at a time, with a low guarantee on the subtensor's density. While some methods can, on the other hand, estimate multiple subtensors, they can give a guarantee on the density with respect to the input tensor for the first estimated subtensor only. We address these drawbacks by providing both theoretical and practical solutions for estimating multiple dense subtensors in tensor data. In particular, we guarantee and prove a higher bound of the lower-bound density of the estimated subtensors. We also propose a novel approach to show that there are multiple dense subtensors with a guarantee on their density that is greater than the lower bound used in the state-of-the-art algorithms. We evaluate our approach with extensive experiments on several real-world datasets, which demonstrates its efficiency and feasibility.
【Keywords】: Tensor; Dense Subtensor; Dense Subgraph; Multiple Subtensor Detection; Density Guarantee
【Paper Link】 【Pages】:649-660
【Authors】: Keke Huang ; Jing Tang ; Xiaokui Xiao ; Aixin Sun ; Andrew Lim
【Abstract】: Given a social network G, the profit maximization (PM) problem asks for a set of seed nodes to maximize the profit, i.e., revenue of influence spread less the cost of seed selection. The target profit maximization (TPM) problem, which generalizes the PM problem, aims to select a subset of seed nodes from a target user set T to maximize the profit. Existing algorithms for PM mostly consider the nonadaptive setting, where all seed nodes are selected in one batch without any knowledge on how they may influence other users. In this paper, we study TPM in the adaptive setting, where the seed users are selected through multiple batches, such that the selection of a batch exploits the knowledge of actual influence in the previous batches. To acquire an overall understanding, we study the adaptive TPM problem under both the oracle model and the noise model, and propose ADG and AddATP algorithms to address them with strong theoretical guarantees, respectively. In addition, to better handle the sampling errors under the noise model, we propose the idea of hybrid error based on which we design a novel algorithm HATP that boosts the efficiency of AddATP significantly. We conduct extensive experiments on real social networks to evaluate the performance, and the experimental results strongly confirm the superiority and effectiveness of our solutions.
【Keywords】: target profit maximization; social networks; approximation algorithms
【Paper Link】 【Pages】:661-672
【Authors】: Kai Wang ; Xuemin Lin ; Lu Qin ; Wenjie Zhang ; Ying Zhang
【Abstract】: Cohesive subgraph mining in bipartite graphs has become a popular research topic recently. An important structure k-bitruss is the maximal cohesive subgraph where each edge is contained in at least k butterflies (i.e., (2,2)-bicliques). In this paper, we study the bitruss decomposition problem which aims to find all the k-bitrusses for k ≥ 0. The existing bottom-up techniques need to iteratively peel the edges with the lowest butterfly support. In this peeling process, these techniques are time-consuming to enumerate all the supporting butterflies for each edge. To address this issue, we first propose a novel online index - the BE-Index which compresses butterflies into k-blooms (i.e., (2,k)-bicliques). Based on the BE-Index, the new bitruss decomposition algorithm BiT-BU is proposed, along with two batch-based optimizations, to accomplish the butterfly enumeration of the peeling process in an efficient way. Furthermore, the BiT-PC algorithm is devised which is more efficient in handling edges with high butterfly support. We theoretically show that our new algorithms significantly reduce the time complexities of the existing algorithms. Also, we conduct extensive experiments on real datasets and the results demonstrate that our new techniques can speed up the state-of-the-art techniques by up to two orders of magnitude.
【Keywords】: Bipartite graph; Indexes; Color; Optimization; Time complexity; Explosions; Conferences
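A naive baseline makes the peeling cost discussed above concrete: compute each edge's butterfly support from scratch and repeatedly delete edges supported by fewer than k butterflies. The BE-Index in the paper exists precisely to avoid this per-round re-enumeration; the sketch below is the unoptimized bottom-up approach.

```python
from collections import defaultdict

def butterfly_support(edges):
    """Number of butterflies ((2,2)-bicliques) containing each edge of a
    bipartite graph given as (u, v) pairs: for edge (u, v), sum over the
    other neighbors u' of v the common neighbors of u and u' besides v."""
    Nu, Nv = defaultdict(set), defaultdict(set)
    for u, v in edges:
        Nu[u].add(v); Nv[v].add(u)
    return {(u, v): sum(len(Nu[u] & Nu[u2]) - 1 for u2 in Nv[v] if u2 != u)
            for u, v in edges}

def k_bitruss(edges, k):
    """Naive peeling: drop all edges with support < k and recompute until a
    fixed point; the survivors form the k-bitruss."""
    edges = set(edges)
    while True:
        sup = butterfly_support(edges)
        weak = {e for e in edges if sup[e] < k}
        if not weak:
            return edges
        edges -= weak
```

Every peeling round recomputes all supports, which is exactly the repeated butterfly enumeration the BE-Index compresses away.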
【Paper Link】 【Pages】:673-684
【Authors】: Cheng Zhao ; Zhibin Zhang ; Peng Xu ; Tianqi Zheng ; Jiafeng Guo
【Abstract】: Graph mining is one of the most important categories of graph algorithms. However, exploring the subgraphs of an input graph produces a huge amount of intermediate data. The "think like a vertex" programming paradigm, pioneered by Pregel and designed for graph computation problems like PageRank, cannot readily formulate mining problems. Existing mining systems like Arabesque and RStream need large amounts of computing and memory resources. In this paper, we present Kaleido, an efficient single machine, out-of-core graph mining system which treats disks as an extension of memory. Kaleido treats intermediate data in graph mining tasks as a tensor and adopts a succinct data structure for the intermediate data. Kaleido implements half-memory-half-disk storage for storing large intermediate data, which treats the disk as an extension of the memory. Kaleido adopts a lightweight isomorphism checking strategy which uses an eigenvalue-based algorithm for small graphs and solves tree isomorphism for the other graphs. Compared with two state-of-the-art mining systems, Arabesque and RStream, Kaleido outperforms them by geometric means of 13.2× and 64.8×, respectively.
【Keywords】: graph mining; exploration; isomorphism; out-of-core
【Paper Link】 【Pages】:685-696
【Authors】: Deming Chu ; Fan Zhang ; Xuemin Lin ; Wenjie Zhang ; Ying Zhang ; Yinglong Xia ; Chenyi Zhang
【Abstract】: The model of k-core and its hierarchical decomposition have been applied in many areas, such as sociology, the world wide web, and biology. Algorithms in related studies often need an input value of parameter k, while there is no existing solution other than manual selection. In this paper, given a graph and a scoring metric, we aim to efficiently find the best value of k such that the score of the k-core (or k-core set) is the highest. The problem is challenging because there are various community scoring metrics and the computation is costly on large datasets. With the well-designed vertex ordering techniques, we propose time and space optimal algorithms to compute the best k, which are applicable to most community metrics. The proposed algorithms can compute the score of every k-core (set) and can benefit the solutions to other k-core related problems. Extensive experiments are conducted on 10 real-world networks with size up to billion-scale, which validates both the efficiency of our algorithms and the effectiveness of the resulting k-cores.
【Keywords】: Measurement; Approximation algorithms; Fans; Time complexity; Conferences; Data engineering; Facebook
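The problem setting above can be sketched with the textbook peeling decomposition plus a pluggable scoring metric: compute every vertex's core number, then scan each k and score the corresponding k-core. The paper's contribution is doing this with optimal time and space; the quadratic peeling and the minimum-internal-degree score below are just one illustrative baseline and metric.

```python
def core_numbers(adj):
    """Peeling-based k-core decomposition: repeatedly remove a minimum-degree
    vertex; the running maximum of removal degrees is its core number
    (a naive O(n^2) version of the Batagelj-Zaversnik algorithm)."""
    deg = {v: len(ns) for v, ns in adj.items()}
    alive, core, k = set(adj), {}, 0
    while alive:
        v = min(alive, key=deg.get)
        k = max(k, deg[v])
        core[v] = k
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
    return core

def best_k(adj, score):
    """Scan every k-core (vertices with core number >= k) and return the k
    whose core maximizes the pluggable `score(nodes, adj)` metric."""
    core = core_numbers(adj)
    ks = range(1, max(core.values()) + 1)
    return max(ks, key=lambda k: score({v for v in core if core[v] >= k}, adj))
```

Since the k-cores are nested, one decomposition suffices to score them all, which is why computing the score of every k-core is feasible in a single pass over the hierarchy.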
【Paper Link】 【Pages】:697-708
【Authors】: Guohao Sun ; Guanfeng Liu ; Yan Wang ; Xiaofang Zhou
【Abstract】: Graph Pattern based Node Matching (GPNM) is to find all the matches of the nodes in a data graph G_D based on a given pattern graph G_P. GPNM has become increasingly important in many applications, e.g., group finding and expert recommendation. In real scenarios, both G_P and G_D are updated frequently. However, the existing GPNM methods either need to perform a new GPNM procedure from scratch to deliver the node matching results based on the updated G_P and G_D or incrementally perform the GPNM procedure for each of the updates, leading to low efficiency. Therefore, there is a pressing need for a new method to efficiently deliver the node matching results on the updated graphs. In this paper, we first analyze and detect the elimination relationships between the updates. Then, we construct an Elimination Hierarchy Tree (EH-Tree) to index these elimination relationships. In order to speed up the GPNM process, we propose a graph partition method and then propose a new updates-aware GPNM method, called UA-GPNM, considering the single-graph elimination relationships among the updates in a single graph of G_P or G_D, and also the cross-graph elimination relationships between the updates in G_P and the updates in G_D. UA-GPNM first delivers the GPNM result of an initial query, and then delivers the GPNM result of a subsequent query, based on the initial GPNM result and the multiple updates that occur between two queries. The experimental results on five real-world social graphs demonstrate that our proposed UA-GPNM is much more efficient than the state-of-the-art GPNM methods.
【Keywords】: Pattern matching; Indexes; Social network services; Collaboration; Query processing; Fans; Periodic structures
【Paper Link】 【Pages】:709-720
【Authors】: Alex Bogatu ; Alvaro A. A. Fernandes ; Norman W. Paton ; Nikolaos Konstantinou
【Abstract】: Data analytics stands to benefit from the increasing availability of datasets that are held without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which, by processes like data wrangling, specific target datasets can be constructed that enable value-adding analytics. Given the potential vastness of such data lakes, the issue arises of how to pull out of the lake those datasets that might contribute to wrangling out a given target. We refer to this as the problem of dataset discovery in data lakes and this paper contributes an effective and efficient solution to it. Our approach uses features of the values in a dataset to construct hash-based indexes that map those features into a uniform distance space. This makes it possible to define similarity distances between features and to take those distances as measurements of relatedness w.r.t. a target table. Given the latter (and exemplar tuples), our approach returns the most related tables in the lake. We provide a detailed description of the approach and report on empirical results for two forms of relatedness (unionability and joinability) comparing them with prior work, where pertinent, and showing significant improvements in all of precision, recall, target coverage, indexing and discovery times.
【Keywords】: data discovery; table search; data wrangling
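One common ingredient for value-based relatedness over data lakes is MinHash: hash each column's value set under many seeds, keep the per-seed minima as a signature, and estimate Jaccard overlap from the fraction of matching signature slots. This is a generic sketch of that ingredient, not the specific feature construction or index of the paper above.

```python
import hashlib

def minhash_signature(values, num_hashes=64):
    """MinHash signature of a column's value set: for each hash seed, keep
    the minimum hash over all values. The probability that two columns agree
    in one slot equals their Jaccard similarity."""
    return [min(int(hashlib.sha1(f"{i}:{v}".encode()).hexdigest(), 16)
                for v in set(values))
            for i in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature slots, an unbiased Jaccard estimate."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Signatures are fixed-size regardless of column cardinality, so they can be indexed (e.g., with LSH banding) to answer "most unionable tables" queries without scanning the whole lake.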
【Paper Link】 【Pages】:721-732
【Authors】: Yu Sun ; Shaoxu Song ; Chen Wang ; Jianmin Wang
【Abstract】: Misplaced data in a tuple are prevalent, e.g., a value "Passport" is misplaced in the passenger-name attribute, which should belong to the travel-document attribute instead. While repairing in-attribute errors has been widely studied, i.e., to repair the error by other values in the attribute domain, misplacement errors are surprisingly untouched, where the true value is simply misplaced in some other attribute of the same tuple. For instance, the true passenger-name is indeed misplaced in the travel-document attribute of the record. In this sense, we need a novel swapping repair model (to swap the misplaced passenger-name and travel-document values "Passport" and "John Adam" in the same tuple). Determining a proper swapping repair, however, is non-trivial. The minimum change criterion, evaluating the distance between the swapping repaired values, is obviously meaningless, since they are from different attribute domains. Intuitively, one may examine whether the swapped value ("John Adam") is similar to other values in the corresponding attribute domain (passenger-name). In a holistic view of all (swapped) attributes, we propose to evaluate the likelihood of a swapping repaired tuple by studying its distances (similarity) to neighbors. The rationale of distance likelihood refers to the Poisson process of nearest neighbor appearance. The optimum repair problem is to find a swapping repair with the maximum likelihood on distances. Experiments over datasets with real-world misplaced attribute values demonstrate the effectiveness of our proposal in repairing misplacement.
【Keywords】: Conferences; Data engineering
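The swapping repair model above can be illustrated with a toy sketch (an assumption-laden simplification: plain Hamming distance and an average k-nearest-neighbor score stand in for the paper's Poisson-based distance likelihood; all names are hypothetical):

```python
from itertools import combinations

def tuple_distance(t1, t2):
    # Hamming-style distance over attributes; a crude stand-in for the
    # paper's per-attribute similarity (assumption for illustration).
    return sum(a != b for a, b in zip(t1, t2))

def best_swapping_repair(tuple_, relation, k=3):
    """Try every pairwise value swap within the tuple and keep the
    candidate whose average k-nearest-neighbor distance in `relation`
    is smallest (i.e., the most 'likely' tuple)."""
    def knn_score(cand):
        dists = sorted(tuple_distance(cand, r) for r in relation)
        return sum(dists[:k]) / min(k, len(dists))
    best, best_score = tuple_, knn_score(tuple_)
    for i, j in combinations(range(len(tuple_)), 2):
        cand = list(tuple_)
        cand[i], cand[j] = cand[j], cand[i]
        cand = tuple(cand)
        score = knn_score(cand)
        if score < best_score:
            best, best_score = cand, score
    return best
```

On a toy relation of (passenger-name, travel-document) tuples, swapping "Passport" back into the document attribute brings the tuple closer to its neighbors, so the swap is chosen.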
【Paper Link】 【Pages】:733-744
【Authors】: Yuyu Luo ; Chengliang Chai ; Xuedi Qin ; Nan Tang ; Guoliang Li
【Abstract】: In this paper, we study the problem of interactive cleaning for progressive visualization (ICPV): Given a bad visualization V, obtain a "cleaned" visualization V′ whose distance from V is large, under a given (small) budget w.r.t. human cost. In ICPV, a system interacts with a user iteratively. During each iteration, it asks the user a data cleaning question such as "how to clean detected errors x?", and takes value updates from the user to clean V. Conventional wisdom typically picks a single question (e.g., "Are SIGMOD conference and SIGMOD the same?") with the maximum expected benefit in each iteration. We propose to use a composite question - i.e., a group of single questions treated as one question - in each iteration (for example, "Are SIGMOD conference in t1 and SIGMOD in t2 the same value, and are t1 and t2 duplicates?"). A composite question is presented to the user as a small connected graph through a novel GUI that the user can directly operate on. We propose algorithms to select the best composite question in each iteration. Experiments on real-world datasets verify that composite questions are more effective than single questions asked in isolation w.r.t. human cost.
【Keywords】: Data visualization; Cleaning; Bars; Maintenance engineering; Task analysis; Graphical user interfaces; Manganese
【Paper Link】 【Pages】:745-757
【Authors】: Kim-Hung Le ; Paolo Papotti
【Abstract】: Anomalies are pervasive in time series data, such as sensor readings. Existing methods for anomaly detection cannot distinguish between anomalies that represent data errors, such as incorrect sensor readings, and notable events, such as the watering action in soil monitoring. In addition, the detection quality of such methods highly depends on configuration parameters, which are dataset-specific. In this work, we exploit active learning to detect both errors and events in a single solution that aims at minimizing user interaction. For this joint detection, we introduce an algorithm that accurately detects and labels anomalies with a non-parametric concept of neighborhood and probabilistic classification. Given a desired quality, the confidence of the classification is then used as the termination condition for the active learning algorithm. Experiments on real and synthetic datasets demonstrate that our approach achieves an F-score above 80% in detecting errors by labeling 2 to 5 points in one data series. We also show the superiority of our solution compared to the state-of-the-art approaches for anomaly detection. Finally, we demonstrate the positive impact of our error detection methods on downstream data repairing algorithms.
【Keywords】: Time series analysis; Anomaly detection; Monitoring; Soil; Labeling; Computational modeling; Probabilistic logic
【Paper Link】 【Pages】:757-768
【Authors】: Hanbing Zhang ; Yazhong Zhang ; Zhenying He ; Yinan Jing ; Kai Zhang ; X. Sean Wang
【Abstract】: Agile analytics can help organizations to gain and sustain a competitive advantage by making timely decisions. Approximate query processing (AQP) is one of the useful approaches in agile analytics, which facilitates fast queries on big data by leveraging a pre-computed sample. One problem such a sample faces is that when new data is being imported, re-sampling is most likely needed to keep the sample fresh and AQP results accurate enough. Re-sampling from scratch for every batch of new data, called the full re-sampling method and adopted by many existing AQP works, is obviously a very costly process, and a much quicker incremental sampling process, such as reservoir sampling, may be used to cover the newly arrived data. However, incremental update methods suffer from the fact that the sample size cannot be increased, which is a problem when the underlying data distribution dramatically changes and the sample needs to be enlarged to maintain the AQP accuracy. This paper proposes an adaptive sample update (ASU) approach that avoids re-sampling from scratch as much as possible by monitoring the data distribution, and uses instead an incremental update method before a re-sampling becomes necessary. The paper also proposes an enhanced approach (T-ASU), which, when a small amount of query inaccuracy is tolerable, tries to enlarge the sample size without re-sampling from scratch, further reducing the sample update cost. These two approaches are integrated into a state-of-the-art AQP engine for an extensive experimental study. Experimental results on both real-world and synthetic datasets show that the two approaches are faster than the full re-sampling method while achieving almost the same AQP accuracy when the underlying data distribution continuously changes.
【Keywords】: Engines; Maintenance engineering; Reservoirs; Query processing; Data warehouses; Monitoring; Encyclopedias
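The incremental update that the abstract contrasts with full re-sampling can be sketched via classic reservoir sampling (textbook Algorithm R, which the abstract references; this is not the paper's ASU logic, and the names are illustrative):

```python
import random

def reservoir_update(sample, k, new_items, n_seen):
    """Fold a batch of newly arrived rows into a fixed-size uniform
    sample of capacity k. `n_seen` is the number of rows observed
    before this batch; the reservoir stays a uniform sample of all
    rows seen so far, but its size can never exceed k."""
    for item in new_items:
        n_seen += 1
        if len(sample) < k:
            sample.append(item)          # reservoir not yet full
        else:
            j = random.randrange(n_seen)  # uniform in [0, n_seen)
            if j < k:
                sample[j] = item          # replace with prob. k/n_seen
    return sample, n_seen
```

The hard cap at k is exactly the limitation the abstract points out: when the distribution shifts and a larger sample is needed, the reservoir alone cannot grow.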
【Paper Link】 【Pages】:769-780
【Authors】: Junzhou Zhao ; Pinghui Wang ; Jing Tao ; Shuo Zhang ; John C. S. Lui
【Abstract】: The sheer scale of big data causes the information overload issue, and there is an urgent need for tools that can draw valuable insights from massive data. This paper investigates the core items tracking (CIT) problem, where the goal is to continuously track representative items, called core items, in a data stream so as to best represent/summarize the stream. In order to simultaneously satisfy the recency and continuity requirements, we consider CIT over probabilistic-decaying streams, where items in the stream are forgotten gradually in a probabilistic manner. We first introduce an algorithm, called PNDCIT, to find core items in a special kind of probabilistic non-decaying stream. Furthermore, using PNDCIT as a building block, we design two novel algorithms, namely PDCIT and PDCIT+, to maintain core items over probabilistic-decaying streams with constant approximation ratios. Finally, extensive experiments on real data demonstrate that PDCIT+ achieves a speedup of up to one order of magnitude over a batch algorithm while providing solutions with comparable quality.
【Keywords】: Probabilistic logic; Approximation algorithms; Analytical models; History; Windows; Data models; Big Data
【Paper Link】 【Pages】:781-792
【Authors】: Puya Memarzia ; Suprio Ray ; Virendra C. Bhavsar
【Abstract】: Data analytics systems commonly utilize in-memory query processing techniques to achieve better throughput and lower latency. Modern computers increasingly rely on Non-Uniform Memory Access (NUMA) architectures to achieve scalability. A key drawback of NUMA architectures is that many existing software solutions are not aware of the underlying NUMA topology and thus do not take full advantage of the hardware. Modern operating systems are designed to provide basic support for NUMA systems. However, default system configurations are typically sub-optimal for large data analytics applications. Additionally, rewriting the application from the ground up is not always feasible. In this work, we evaluate a variety of strategies that aim to accelerate memory-intensive data analytics workloads on NUMA systems. Our findings indicate that the operating system default configurations can be detrimental to query performance. We analyze the impact of different memory allocators, memory placement strategies, thread placement, and kernel-level load balancing and memory management mechanisms. With extensive experimental evaluation, we demonstrate that the methodical application of these techniques can be used to obtain significant speedups in four commonplace in-memory query processing tasks, on three different hardware architectures. Furthermore, we show that these strategies can improve the performance of five popular database systems running a TPC-H workload. Lastly, we summarize our findings in a decision flowchart for practitioners.
【Keywords】: Query processing; Topology; Computer architecture; Hardware; Instruction sets; Data analysis
【Paper Link】 【Pages】:793-804
【Authors】: Lijun Chang ; Xing Feng ; Xuemin Lin ; Lu Qin ; Wenjie Zhang ; Dian Ouyang
【Abstract】: Graph similarity search retrieves from a database all graphs whose graph edit distance (GED) to a query graph is within a threshold. As GED computation is NP-hard, existing works adopt the filtering-and-verification paradigm to reduce the number of GED verifications, and they mainly focus on designing filtering techniques while using the now-outdated algorithm A*GED for verification. In this paper, we aim to speed up GED verification, which is orthogonal to the index structures used in the filtering phase. We propose a best-first search algorithm AStar+-LSa, which improves A*GED by (1) reducing memory consumption, (2) tightening lower bound estimation, and (3) improving the time complexity of lower bound computation. We formally show that AStar+-LSa has lower space and time complexity than A*GED. We further modify AStar+-LSa into a depth-first search algorithm to contrast these two search paradigms, and we extend our algorithms to exact GED computation. We conduct extensive empirical studies on real graph datasets, and show that our algorithm AStar+-LSa outperforms the state-of-the-art algorithms by several orders of magnitude for both GED verification and GED computation.
【Keywords】: Indexes; Time complexity; Transforms; Estimation; Search problems; Memory management
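To make "lower bound estimation" concrete, a filtering-style lower bound on GED can be sketched with the classic label-multiset bound (much looser than the paper's LSa estimation; shown only as an illustration of the idea, under unit edit costs):

```python
from collections import Counter

def label_multiset_lower_bound(v_labels1, e_labels1, v_labels2, e_labels2):
    """Classic label-multiset lower bound on graph edit distance:
    every vertex (or edge) whose label cannot be matched one-to-one
    across the two graphs must cost at least one edit operation."""
    def mismatch(labels1, labels2):
        c1, c2 = Counter(labels1), Counter(labels2)
        common = sum((c1 & c2).values())  # multiset intersection size
        return max(len(labels1), len(labels2)) - common
    return mismatch(v_labels1, v_labels2) + mismatch(e_labels1, e_labels2)
```

Any graph pair whose bound already exceeds the query threshold can be pruned without running the expensive A*-style verification.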
【Paper Link】 【Pages】:805-816
【Authors】: Damjan Gjurovski ; Sebastian Michel
【Abstract】: In this work, we consider computing natural joins over massive streams of JSON documents that do not adhere to a specific schema. We first propose an efficient and scalable partitioning algorithm that uses the main principles of association analysis to identify patterns of co-occurrence of the attribute-value pairs within the documents. Data is then accordingly forwarded to compute nodes and locally joined using a novel FP-tree-based join algorithm. By compactly storing the documents and efficiently traversing the FP-tree structure, the proposed join algorithm can operate on large input sizes and provide results in real-time. We discuss data-dependent scalability limitations that are inherent to natural joins over schema-free data and show how to practically circumvent them by artificially expanding the space of possible attribute-value pairs. The proposed algorithms are realized in the Apache Storm stream processing framework. Through extensive experiments with real-world as well as synthetic data, we evaluate the proposed algorithms and show that they outperform competing approaches.
【Keywords】: Partitioning algorithms; Scalability; Storms; Twitter; Computer architecture; Microsoft Windows; Task analysis
【Paper Link】 【Pages】:817-828
【Authors】: Tova Milo ; Yuval Moskovitch ; Brit Youngmann
【Abstract】: The use of probabilistic datalog programs has been recently advocated for applications that involve recursive computation and uncertainty. While using such programs allows for flexible knowledge derivation, it makes the analysis of query results a challenging task. Particularly, given a set O of output tuples and a number k, one would like to understand which k-size subset of the input tuples has contributed the most to the derivation of O. This is useful for multiple tasks, such as identifying the critical sources of errors and understanding surprising results. Previous works have mainly focused on the quantification of tuples' contribution to a query result in non-recursive SQL queries, very often disregarding probabilistic inference. To quantify the contribution in probabilistic datalog programs, one must account for the recursive relations between input and output data, and for the uncertainty. To this end, we formalize the Contribution Maximization (CM) problem. We then reduce CM to the well-studied Influence Maximization (IM) problem, showing that we can harness techniques developed for IM in our setting. However, we show that such naïve adoption results in poor performance. To overcome this, we propose an optimized algorithm which injects a refined variant of the classic Magic Sets technique, integrated with a sampling method, into IM algorithms, achieving significant savings in space and execution time. Our experiments demonstrate the effectiveness of our algorithm, even where the naïve approach is infeasible.
【Keywords】: Probabilistic logic; Databases; Approximation algorithms; Uncertainty; Task analysis; Semantics; Structured Query Language
【Paper Link】 【Pages】:829-840
【Authors】: Shahab Helmi ; Farnoush Banaei Kashani
【Abstract】: Thanks to the recent prevalence of location tracking technologies, collecting massive spatiotemporal datasets containing moving object trajectories has become possible, providing an exceptional opportunity to derive interesting insights about the behavior of moving objects such as people, animals, and vehicles. In particular, mining patterns from the "co-movements" of objects (such as movements by players of a sports team, joints of the human body while walking, and vehicles in a transportation network) can lead to the discovery of interesting patterns (e.g., offense tactics of a sports team, the gait signature of a person, and driving behaviors causing heavy traffic). Various trajectory mining and frequent pattern mining techniques have been proposed to discover patterns in trajectory datasets and, more generally, event sequences. However, existing approaches are inapplicable to co-movement pattern mining from multi-trajectory datasets. In this paper, we propose a novel and efficient framework for co-movement pattern mining. We also extend this framework for efficient mining of such patterns at multiple spatial scales. The performance of the proposed solutions is evaluated by conducting extensive experiments using two real datasets, a soccer game dataset and a human gait dataset. Our experimental results show that our proposed algorithms are promising.
【Keywords】: frequent pattern mining; spatiotemporal data mining; trajectory pattern mining
【Paper Link】 【Pages】:841-852
【Authors】: Zhining Liu ; Wei Cao ; Zhifeng Gao ; Jiang Bian ; Hechang Chen ; Yi Chang ; Tie-Yan Liu
【Abstract】: Many real-world applications reveal difficulties in learning classifiers from imbalanced data. The rising big data era has been witnessing more classification tasks with large-scale but extremely imbalanced and low-quality datasets. Most existing learning methods suffer from poor performance or low computational efficiency in such a scenario. To tackle this problem, we conduct a deep investigation into the nature of class imbalance, which reveals that not only the disproportion between classes, but also other difficulties embedded in the nature of the data, especially noises and class overlapping, prevent us from learning effective classifiers. Taking these factors into consideration, we propose a novel framework for imbalance classification that aims to generate a strong ensemble by self-paced harmonizing of data hardness via under-sampling. Extensive experiments have shown that this new framework, while being very computationally efficient, can lead to robust performance even under highly overlapping classes and extremely skewed distributions. Note that our methods can be easily adapted to most existing learning methods (e.g., C4.5, SVM, GBDT and Neural Network) to boost their performance on imbalanced data.
【Keywords】: imbalance learning; imbalance classification; ensemble learning; data re-sampling
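The under-sampling building block of such ensemble frameworks can be sketched as follows (a generic balanced random under-sampler, not the paper's self-paced, hardness-aware scheme; names are illustrative):

```python
import random

def balanced_undersample(X, y, minority_label):
    """Randomly under-sample the majority class so that both classes
    end up the same size; an ensemble method would repeat this with
    different random draws and train one base classifier per draw."""
    minority = [(x, label) for x, label in zip(X, y) if label == minority_label]
    majority = [(x, label) for x, label in zip(X, y) if label != minority_label]
    majority = random.sample(majority, len(minority))  # drop majority rows
    data = minority + majority
    random.shuffle(data)
    X_bal, y_bal = zip(*data)
    return list(X_bal), list(y_bal)
```

The paper's point is that a uniform draw like this ignores noise and class overlap; its framework instead chooses which majority rows to keep based on their hardness.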
【Paper Link】 【Pages】:853-864
【Authors】: Yash Garg ; K. Selçuk Candan ; Maria Luisa Sapino
【Abstract】: Deep neural networks (DNNs), especially convolutional neural networks (CNNs), have been effective in various data-driven applications. Yet, DNNs suffer from several major challenges; in particular, in many applications where the input data is relatively sparse, DNNs face the problems of overfitting to the input data and poor generalizability. This brings up several critical questions: "Are all inputs equally important?" "Can we selectively focus on parts of the input data in a way that reduces overfitting to irrelevant observations?" Recently, attention networks showed some success in helping the overall process focus onto parts of the data that carry higher importance in the current context. Yet, we note that the current attention network design approaches are not sufficiently informed about the key data characteristics in identifying salient regions in the data. We propose an innovative robust feature learning framework, scale-invariant attention networks (SAN), that identifies salient regions in the input data for the CNN to focus on. Unlike the existing attention networks, SAN concentrates attention on parts of the data where there is major change across space and scale. We argue, and experimentally show, that the salient regions identified by SAN lead to better network performance compared to state-of-the-art (attentioned and non-attentioned) approaches, including architectures such as LeNet, VGG, ResNet, and LSTM, with common benchmark datasets, MNIST, FMNIST, CIFAR10/20/100, GTSRB, ImageNet, Mocap, Aviage, and GTSDB for tasks such as image/time series classification, time series forecasting and object detection in images.
【Keywords】: attention module; attention networks; convolutional neural networks
【Paper Link】 【Pages】:865-876
【Authors】: Khanh Luong ; Richi Nayak
【Abstract】: Effective methods need to be developed to deal with the multi-faceted nature of multi-view data. We design a factorization-based method with a loss function that simultaneously learns two components encoding the consensus and complementary information present in multi-view data, using Coupled Matrix Factorization (CMF) and Non-negative Matrix Factorization (NMF). We propose a novel optimal manifold for multi-view data, which is the most consensed manifold embedded in the high-dimensional multi-view data. A new complementary enhancing term is added to the loss function to enhance the complementary information inherent in each view. Extensive experiments with diverse datasets, benchmarking the state-of-the-art multi-view clustering methods, demonstrate the effectiveness of the proposed method in obtaining accurate clustering solutions.
【Keywords】: Multi-view Clustering; Non-negative Matrix Factorization; Coupled Matrix Factorization; Manifold Learning; Consensus Manifold
【Paper Link】 【Pages】:877-888
【Authors】: Alexandra Kim ; Laks V. S. Lakshmanan ; Divesh Srivastava
【Abstract】: Data scientists typically analyze and extract insights from large multidimensional data sets such as US census data, enterprise sales data, and so on. But before sophisticated machine learning and statistical methods are employed, it is useful to build and explore concise summaries of the data set. While a variety of summaries have been proposed over the years, the goal of creating a concise summary of multidimensional data that can provide worst-case accuracy guarantees has remained elusive. In this paper, we propose Tree Summaries, which attain this challenging goal over arbitrary hierarchical multidimensional data sets. Intuitively, a Tree Summary is a weighted "embedded tree" in the lattice that is the cross-product of the dimension hierarchies; individual data values can be efficiently estimated by looking up the weight of their unique closest ancestor in the Tree Summary. We study the problems of generating lossless as well as (given a desired worst-case accuracy guarantee) lossy Tree Summaries. We develop a polynomial-time algorithm that constructs the optimal (i.e., most concise) Tree Summary for each of these problems; this is a surprising result given the NP-hardness of constructing a variety of other optimal summaries over multidimensional data. We complement our analytical results with an empirical evaluation of our algorithm, and demonstrate with a detailed set of experiments on real and synthetic data sets that our algorithm outperforms prior methods in terms of conciseness of summaries or accuracy of estimation.
【Keywords】: Aggregates; Lattices; Estimation error; Data mining; Machine learning; Insects
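The closest-ancestor lookup that makes Tree Summaries cheap to query can be sketched as follows (a minimal sketch under the assumption that summary nodes are keyed by their path in the dimension hierarchy; all names and values are illustrative):

```python
def estimate(value_path, summary):
    """Estimate a cell by the weight of its closest ancestor present
    in the Tree Summary. `value_path` is the root-to-leaf path in the
    hierarchy, e.g. ('US', 'CA', 'San Jose'); `summary` maps path
    prefixes (tuples) to weights, with () as the root."""
    for i in range(len(value_path), -1, -1):
        node = value_path[:i]          # walk up toward the root
        if node in summary:
            return summary[node]
    raise KeyError("summary must contain the root ()")
```

A cell covered by a deep summary node gets its exact weight, while uncovered cells fall back to the nearest coarser ancestor, which is where the worst-case accuracy guarantee comes in.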
【Paper Link】 【Pages】:889-900
【Authors】: Yue Kou ; Derong Shen ; Quinn Snell ; Dong Li ; Tiezheng Nie ; Ge Yu ; Shuai Ma
【Abstract】: Finding a team that is both competent in performing the task and compatible in working together has been extensively studied. However, most methods for team formation tend to rely on a set of skills only. In order to solve this problem, we present an efficient team formation method based on Constrained Pattern Graph (called CPG). Unlike traditional methods, our method takes into account both structure constraints and communication constraints on team members, which can better meet the requirements of users. First, a CPG preprocessing method is proposed to normalize a CPG and represent it as a CoreCPG in order to establish the basis for efficient matching. Second, a Communication Cost Index (called CCI) is constructed to speed up the matching between a CPG and its corresponding social network. Third, a CCI-based node matching algorithm is proposed to minimize the total number of intermediate results. Moreover, a set of incremental maintenance strategies for the changes of social networks are proposed. We conduct experimental studies based on two real-world social networks. The experiments demonstrate the effectiveness and the efficiency of our proposed method in comparison with traditional methods.
【Keywords】: team formation; social networks; Constrained Pattern Graph; Communication Cost Index
【Paper Link】 【Pages】:901-912
【Authors】: Yixing Yang ; Yixiang Fang ; Xuemin Lin ; Wenjie Zhang
【Abstract】: Recently, the topic of truss computation has gained plenty of attention, where the k-truss of a graph is the maximum subgraph in which each edge participates in at least (k-2) triangles. Existing solutions mainly focus on homogeneous networks, where vertices are of the same type, and thus cannot be applied to heterogeneous information networks (HINs), which consist of multi-typed and interconnected objects, such as bibliographic networks and knowledge graphs. In this paper, we study the problem of truss computation over HINs, which aims to find groups of vertices that are of the same type and densely connected. To model the relationship between two vertices of the same type, we adopt the well-known concept of meta-path, which is a sequence of vertex types and edge types between two given vertex types. We then introduce two kinds of HIN triangles for three vertices, regarding a specific meta-path P. The first one requires that each pair of vertices is connected by an instance of P, while the second one also has such a connectivity constraint but further requires that the three instances of P form a circle. Based on these two kinds of triangles, we propose two HIN truss models respectively. We further develop efficient truss computation algorithms. We have performed extensive experiments on five real large HINs, and the results show that the proposed solutions are highly effective and efficient.
【Keywords】: Computational modeling; Knowledge engineering; Measurement; Flickr; Conferences; Data engineering; Computer science
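The k-truss definition used above (every edge in at least k-2 triangles) can be sketched for plain homogeneous graphs with the standard peeling procedure (a textbook sketch, not the paper's HIN algorithms):

```python
def k_truss(edges, k):
    """Repeatedly remove edges supported by fewer than k-2 triangles
    until every remaining edge lies in at least k-2 triangles; the
    surviving edges form the k-truss."""
    E = {frozenset(e) for e in edges}
    changed = True
    while changed:
        changed = False
        adj = {}
        for e in E:                      # rebuild adjacency each pass
            u, v = tuple(e)
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        for e in list(E):
            u, v = tuple(e)
            support = len(adj[u] & adj[v])  # common neighbors = triangles
            if support < k - 2:
                E.discard(e)
                changed = True
    return {tuple(sorted(e)) for e in E}
```

For example, on a 4-clique with one pendant edge attached, the 4-truss keeps exactly the clique (every clique edge sits in two triangles) and peels away the pendant edge.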
【Paper Link】 【Pages】:913-924
【Authors】: Dandan Lin ; Raymond Chi-Wing Wong ; Min Xie ; Victor Junqiu Wei
【Abstract】: Due to the prevalence of graph data, graph analysis is very important nowadays. One popular analysis on graph data is Random Walk with Restart (RWR), since it provides a good metric for measuring the proximity of two nodes in a graph. Although RWR is important, it is challenging to design an algorithm for it. To the best of our knowledge, there are no existing RWR algorithms which, at the same time, (1) are index-free, (2) return answers with a theoretical guarantee, and (3) are efficient. Motivated by this, in this paper, we propose an index-free algorithm called the Residue-Accumulated approach (ResAcc), which returns answers with a theoretical guarantee efficiently. Our experimental evaluations on large-scale real graphs show that ResAcc is up to 4 times faster than the best-known previous algorithm while guaranteeing the same accuracy. Under typical settings, the best-known algorithm ran around 1000 seconds on a large dataset containing 41.7 million nodes, which is too time-consuming, while ResAcc finished in 275 seconds with the same accuracy. Moreover, ResAcc is up to 6 orders of magnitude more accurate than the best-known algorithm in practice given the same execution time, which is a substantial improvement.
【Keywords】: Radio frequency; Twitter; Indexes; Additives; Approximation algorithms; Monte Carlo methods; Conferences
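For reference, the RWR proximity that ResAcc approximates can be computed exactly on small graphs by power iteration (a textbook sketch, not the paper's index-free residue-accumulation method; `alpha` is the restart probability and the graph is assumed to have no dead-end nodes):

```python
def rwr(adj, source, alpha=0.15, iters=100):
    """Random walk with restart by power iteration: at each step the
    walker restarts at `source` with probability alpha, otherwise
    moves to a uniformly random out-neighbor. Returns the stationary
    visiting probabilities, the RWR proximity of every node to source."""
    nodes = list(adj)
    p = {v: 0.0 for v in nodes}
    p[source] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        nxt[source] += alpha                     # restart mass
        for u in nodes:
            share = (1 - alpha) * p[u] / len(adj[u])
            for v in adj[u]:                     # spread remaining mass
                nxt[v] += share
        p = nxt
    return p
```

This dense iteration touches every edge per step, which is exactly why it is infeasible on graphs with tens of millions of nodes and why approximate, index-free methods like ResAcc matter.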
【Paper Link】 【Pages】:925-936
【Authors】: Jeongmyung Lee ; Seokwon Kang ; Yongseung Yu ; Yong-Yeon Jo ; Sang-Wook Kim ; Yongjun Park
【Abstract】: Sparse matrix multiplication (spGEMM) is widely used to analyze sparse network data and extract important information based on matrix representation. As it contains a high degree of data parallelism, many efficient implementations using data-parallel programming platforms such as CUDA and OpenCL have been introduced on graphics processing units (GPUs). Several well-known spGEMM techniques, such as cuSPARSE and CUSP, often do not utilize the GPU resources fully, owing to the load imbalance between threads in the expansion process and high memory contention in the merge process. Furthermore, even though several outer-product-based spGEMM techniques have been proposed to solve the load balancing problem on expansion, they still do not utilize the GPU resources fully, because severe computation load variations exist among the multiple thread blocks. To solve these challenges, this paper proposes a new optimization pass called Block Reorganizer, which balances the total computations of each computing unit on target GPUs, based on the outer-product-based expansion process, and reduces the memory pressure during the merge process. For expansion, it first identifies the actual computation amount for each block, and then performs two thread block transformation processes based on their characteristics: 1) B-Splitting to transform a heavy-computation block into multiple small blocks and 2) B-Gathering to aggregate multiple small-computation blocks into a larger block. While merging, it improves the overall performance by performing B-Limiting to limit the number of blocks on each computing unit. Experimental results show that it improves the total performance of kernel execution by 1.43x on average, compared to the row-product-based spGEMM, for NVIDIA Titan Xp GPUs on real-world datasets.
【Keywords】: Sparse matrix multiplication; sparse network; GPU; linear algebra
【Paper Link】 【Pages】:937-948
【Authors】: Qing Liu ; Yifan Zhu ; Minjun Zhao ; Xin Huang ; Jianliang Xu ; Yunjun Gao
【Abstract】: Attributed community search aims to find the community with strong structure and attribute cohesiveness from attributed graphs. However, existing works suffer from two major limitations: (i) it is not easy to set the conditions on query attributes; (ii) the queries support only a single type of attributes. To make up for these deficiencies, in this paper, we study a novel attributed community search called vertex-centric attributed community (VAC) search. Given an attributed graph and a query vertex set, the VAC search returns the community which is densely connected (ensured by the k-truss model) and has the best attribute score. We show that the problem is NP-hard. To answer the VAC search, we develop both exact and approximate algorithms. Specifically, we develop two exact algorithms. One searches the community in a depth-first manner and the other is in a best-first manner. We also propose a set of heuristic strategies to prune the unqualified search space by exploiting the structure and attribute properties. In addition, to further improve the search efficiency, we propose a 2-approximation algorithm. Comprehensive experimental studies on various real-world attributed graphs demonstrate the effectiveness of the proposed model and the efficiency of the developed algorithms.
【Keywords】: Approximation algorithms; Search problems; Euclidean distance; Computer science; Complex networks; Conferences
【Paper Link】 【Pages】:949-960
【Authors】: Yiding Liu ; Kaiqi Zhao ; Gao Cong ; Zhifeng Bao
【Abstract】: Detecting anomalous trajectories has become an important and fundamental concern in many real-world applications. However, most of the existing studies 1) cannot handle the complexity and variety of trajectory data and 2) do not support efficient anomaly detection in an online manner. To this end, we propose a novel model, namely Gaussian Mixture Variational Sequence AutoEncoder (GM-VSAE), to tackle these challenges. Our GM-VSAE model is able to (1) capture complex sequential information enclosed in trajectories, (2) discover different types of normal routes from trajectories and represent them in a continuous latent space, and (3) support efficient online detection via trajectory generation. Our experiments on two real-world datasets demonstrate that GM-VSAE is more effective than the state-of-the-art baselines and is efficient for online anomalous trajectory detection.
【Keywords】: Trajectory; Anomaly detection; Global Positioning System; Public transportation; Computational modeling; Recurrent neural networks; Data models
【Paper Link】 【Pages】:961-972
【Authors】: Zhidan Liu ; Zengyang Gong ; Jiangzhou Li ; Kaishun Wu
【Abstract】: Taxi ridesharing becomes promising and attractive because of the wide availability of taxis in a city and the tremendous benefits of ridesharing, e.g., alleviating traffic congestion and reducing energy consumption. Existing taxi ridesharing schemes, however, are not efficient and practical, because they simply match ride requests and taxis based on partial trip information and ignore offline passengers, who hail a taxi at the roadside without submitting explicit requests to the system. In this paper, we consider the mobility-aware taxi ridesharing problem and present mT-Share to address these limitations. mT-Share fully exploits the mobility information of ride requests and taxis to achieve efficient indexing of taxis/requests and better passenger-taxi matching, while still satisfying the constraints on passengers' deadlines and taxis' capacities. Specifically, mT-Share indexes taxis and ride requests with both geographical information and travel directions, and supports both shortest-path-based routing and probabilistic routing to serve online and offline ride requests. Extensive experiments with a large real-world taxi dataset demonstrate the efficiency and effectiveness of mT-Share, which can respond to each ride request in milliseconds and with a moderate detour cost. Compared to state-of-the-art methods, mT-Share serves 42% and 62% more ride requests in peak and non-peak hours, respectively.
【Keywords】: taxi ridesharing; mobility; clustering; route planning
【Paper Link】 【Pages】:973-984
【Authors】: Bolong Zheng ; Chenze Huang ; Christian S. Jensen ; Lu Chen ; Nguyen Quoc Viet Hung ; Guanfeng Liu ; Guohui Li ; Kai Zheng
【Abstract】: In Pickup-and-Delivery problems (PDP), mobile workers are employed to pick up and deliver items with the goal of reducing travel and fuel consumption. Unlike most existing efforts that focus on finding a schedule that enables the delivery of as many items as possible at the lowest cost, we consider trichromatic (worker-item-task) utility that encompasses worker reliability, item quality, and task profitability. Moreover, we allow customers to specify keywords for desired items when they submit tasks, which may result in multiple pickup options, thus further increasing the difficulty of the problem. Specifically, we formulate the problem of Online Trichromatic Pickup and Delivery Scheduling (OTPD) that aims to find optimal delivery schedules with highest overall utility. In order to quickly respond to submitted tasks, we propose a greedy solution that finds the schedule with the highest utility-cost ratio. Next, we introduce a skyline kinetic tree-based solution that materializes intermediate results to improve the result quality. Finally, we propose a density-based grouping solution that partitions streaming tasks and efficiently assigns them to the workers with high overall utility. Extensive experiments with real and synthetic data offer evidence that the proposed solutions excel over baselines with respect to both effectiveness and efficiency.
【Keywords】: Spatial Crowdsourcing; pickup and delivery; scheduling; real-time; query optimization
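The greedy response step this abstract describes, picking the schedule with the highest utility-cost ratio, can be sketched as follows. The candidate schedules and their utility/cost numbers are hypothetical; this illustrates only the selection rule, not the paper's implementation:

```python
def pick_schedule(candidates):
    """Greedy OTPD-style choice: return the candidate with the highest
    utility/cost ratio. candidates: list of (id, utility, cost), cost > 0."""
    return max(candidates, key=lambda c: c[1] / c[2])

# Hypothetical candidate schedules for one incoming task.
candidates = [("s1", 10.0, 5.0), ("s2", 9.0, 3.0), ("s3", 12.0, 8.0)]
best = pick_schedule(candidates)  # "s2" wins with ratio 3.0
```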
【Paper Link】 【Pages】:985-996
【Authors】: Wangze Ni ; Peng Cheng ; Lei Chen ; Xuemin Lin
【Abstract】: Ubiquitous smart devices and high-quality wireless networks enable people to participate in spatial crowdsourcing tasks easily, which require workers to physically move to specific locations to conduct their assigned tasks. Spatial crowdsourcing has attracted much attention from both academia and industry. In this paper, we consider a spatial crowdsourcing scenario, where the tasks may have some dependencies among them. Specifically, one task can only be dispatched when its dependent tasks have already been assigned. In fact, task dependencies are quite common in many real-life applications, such as house repairing and holding sports games. We formally define the dependency-aware spatial crowdsourcing (DA-SC) problem, which focuses on finding an optimal worker-and-task assignment under the constraints of dependencies, skills of workers, moving distances and deadlines to maximize the number of successfully assigned tasks. We prove that the DA-SC problem is NP-hard and thus intractable. Therefore, we propose two approximation algorithms, including a greedy approach and a game-theoretic approach, which can guarantee the approximate bounds of the results in each batch process. Through extensive experiments on both real and synthetic data sets, we demonstrate the efficiency and effectiveness of our DA-SC approaches.
【Keywords】: Spatial Crowdsourcing; Task Assignment; Approximate Algorithm
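The DA-SC dispatch rule (a task becomes dispatchable only once all of its dependencies are assigned) can be illustrated with a minimal greedy sketch. The house-repair tasks and dependency graph below are hypothetical, and a real assignment would also score workers by skill, distance, and deadline:

```python
def assignable(tasks, deps, assigned):
    """Tasks that may be dispatched now: all their dependencies are assigned."""
    return [t for t in tasks
            if t not in assigned and all(d in assigned for d in deps.get(t, []))]

# Hypothetical house-repair tasks: wiring before plastering before painting.
deps = {"paint": ["plaster"], "plaster": ["wiring"], "wiring": []}
assigned, order = set(), []
while True:
    ready = assignable(deps, deps, assigned)
    if not ready:
        break
    task = ready[0]   # a real system would pick by worker distance/skill here
    assigned.add(task)
    order.append(task)
```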
【Paper Link】 【Pages】:997-1008
【Authors】: Lisi Chen ; Shuo Shang ; Christian S. Jensen ; Bin Yao ; Panos Kalnis
【Abstract】: Matching similar pairs of trajectories, called trajectory similarity join, is a fundamental functionality in spatial data management. We consider the problem of semantic trajectory similarity join (STS-Join). Each semantic trajectory is a sequence of Points-of-interest (POIs) with both location and text information. Thus, given two sets of semantic trajectories and a threshold θ, the STS-Join returns all pairs of semantic trajectories from the two sets with spatio-textual similarity no less than θ. This join targets applications such as term-based trajectory near-duplicate detection, geo-text data cleaning, personalized ridesharing recommendation, keyword-aware route planning, and travel itinerary recommendation. With these applications in mind, we provide a purposeful definition of spatio-textual similarity. To enable efficient STS-Join processing on large sets of semantic trajectories, we develop trajectory pair filtering techniques and consider the parallel processing capabilities of modern processors. Specifically, we present a two-phase parallel search algorithm. We first group semantic trajectories based on their text information. The algorithm's per-group searches are independent of each other and thus can be performed in parallel. For each group, the trajectories are further partitioned based on the spatial domain. We generate spatial and textual summaries for each trajectory batch, based on which we develop batch filtering and trajectory-batch filtering techniques to prune unqualified trajectory pairs in a batch mode. Additionally, we propose an efficient divide-and-conquer algorithm to derive bounds of spatial similarity and textual similarity between two semantic trajectories, which enables us to prune dissimilar trajectory pairs without computing the exact spatio-textual similarity.
An experimental study with large semantic trajectory data confirms that our algorithm for processing the semantic trajectory join outperforms a well-designed baseline by a factor of 8-12.
【Keywords】: Trajectory; Semantics; Measurement; Indexes; Partitioning algorithms; Complexity theory; Mathematical model
【Paper Link】 【Pages】:1009-1020
【Authors】: Min Xie ; Raymond Chi-Wing Wong ; Peng Peng ; Vassilis J. Tsotras
【Abstract】: When faced with a database containing millions of products, a user may be only interested in a (typically much) smaller representative subset. Various approaches have been proposed to create a good representative subset that fits the user's needs, which are expressed in the form of a utility function (e.g., the top-k and diversification query). Recently, a regret minimization query was proposed: it does not require users to provide their utility functions and returns a small set of tuples such that any user's favorite tuple in this subset is guaranteed to be not much worse than his/her favorite tuple in the whole database. In a sense, this query finds a small set of tuples that makes the user happy (i.e., not regretful) even if s/he gets the best tuple in the selected set but not the best tuple among all tuples in the database. In this paper, we study the min-size version of the regret minimization query; that is, we want to determine the fewest tuples needed to keep users happy at a given level. We term this problem the α-happiness query, where we quantify the user's happiness level by a criterion, called the happiness ratio, and guarantee that each user is at least α happy with the set returned (i.e., the happiness ratio is at least α), where α is a real number from 0 to 1. As this is an NP-hard problem, we derive an approximate solution with theoretical guarantee by considering the problem from a geometric perspective. Since in practical scenarios, users are interested in achieving higher happiness levels (i.e., α is closer to 1), we performed extensive experiments for these scenarios, using both real and synthetic datasets. Our evaluations show that our algorithm outperforms the best-known previous approaches in two ways: (i) it answers the α-happiness query by returning fewer tuples to users and, (ii) it answers much faster (up to two orders of magnitude improvement for large α).
【Keywords】: Databases; Automobiles; Minimization; Decision making; Conferences; Data engineering; NP-hard problem
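The happiness ratio that the α-happiness query bounds can be sketched for a finite set of linear utility functions as follows. The tuples and utility vectors are hypothetical, and the paper's algorithm guarantees the bound over all utility functions rather than a sampled handful:

```python
def dot(u, p):
    return sum(a * b for a, b in zip(u, p))

def happiness_ratio(subset, database, utilities):
    """Worst case, over the utility functions, of the ratio between the best
    tuple in the subset and the best tuple in the whole database."""
    return min(
        max(dot(u, p) for p in subset) / max(dot(u, p) for p in database)
        for u in utilities
    )

# Hypothetical 2-attribute tuples and linear utility functions.
db = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6)]
subset = [(0.6, 0.6)]
utils = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
ratio = happiness_ratio(subset, db, utils)  # every user is at least 0.6-happy
```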
【Paper Link】 【Pages】:1021-1032
【Authors】: Jun Kuang ; Yixin Cao ; Jianbing Zheng ; Xiangnan He ; Ming Gao ; Aoying Zhou
【Abstract】: Relation extraction (RE) aims at extracting the relation between two entities from the text corpora. It is a crucial task for Knowledge Graph (KG) construction. Most existing methods predict the relation between an entity pair by learning the relation from the training sentences that contain the targeted entity pair. In contrast to existing distant supervision approaches that suffer from insufficient training corpora to extract relations, our proposal of mining implicit mutual relations from the massive unlabeled corpora transfers the semantic information of entity pairs into the RE model, which is more expressive and semantically plausible. After constructing an entity proximity graph based on the implicit mutual relations, we preserve the semantic relations of entity pairs via embedding each vertex of the graph into a low-dimensional space. As a result, we can easily and flexibly integrate the implicit mutual relations and other entity information, such as entity types, into the existing RE methods. Our experimental results on the New York Times and Google Distant Supervision datasets suggest that our proposed neural RE framework provides a promising improvement for the RE task, and significantly outperforms the state-of-the-art methods. Moreover, the component for mining implicit mutual relations is flexible enough to significantly improve the performance of both CNN-based and RNN-based RE models.
【Keywords】: Relation extraction; implicit mutual relations; unlabeled data; entity information
【Paper Link】 【Pages】:1033-1044
【Authors】: Weijie Zhao ; Shulong Tan ; Ping Li
【Abstract】: Approximate nearest neighbor (ANN) searching is a fundamental problem in computer science with numerous applications in (e.g.,) machine learning and data mining. Recent studies show that graph-based ANN methods often outperform other types of ANN algorithms. For typical graph-based methods, the searching algorithm is executed iteratively and the execution dependency prohibits GPU adaptations. In this paper, we present a novel framework that decouples the graph-based search algorithm into three stages, in order to parallelize the performance-crucial distance computation. Furthermore, to obtain better parallelism on GPU, we propose novel ANN-specific optimization methods that eliminate dynamic GPU memory allocations and trade computations for less GPU memory consumption. The proposed system, SONG, is empirically compared against HNSW (the state-of-the-art ANN method on CPU) and Faiss (the popular GPU-accelerated ANN platform) on 6 datasets. The results confirm the effectiveness: SONG achieves around 50-180x speedup compared with single-threaded HNSW, while it substantially outperforms Faiss.
【Keywords】: Graphics processing units; Instruction sets; Memory management; Indexes; Data structures; Approximation algorithms
【Paper Link】 【Pages】:1045-1056
【Authors】: Kejing Lu ; Mineichi Kudo
【Abstract】: Locality sensitive hashing (LSH) is a widely practiced c-approximate nearest neighbor (c-ANN) search algorithm because of its appealing theoretical guarantee and empirical performance. However, available LSH-based solutions do not achieve a good balance between cost and quality because of certain limitations in their index structures. In this paper, we propose a novel and easy-to-implement disk- based method named R2LSH to answer ANN queries in high-dimensional spaces. In the indexing phase, R2LSH maps data objects into multiple two-dimensional projected spaces. In each space, a group of B+-trees is constructed to characterize the corresponding data distribution. In the query phase, by setting a query-centric ball in each projected space and using a dynamic counting technique, R2LSH efficiently determines candidates and returns query results with the required quality. Rigorous theoretical analysis reveals that the proposed algorithm supports c-ANN search for arbitrarily small c ≥ 1 with probability guarantee. Extensive experiments on real datasets verify the superiority of R2LSH over state-of-the-art methods.
【Keywords】: Locality Sensitive Hashing; Approximate Nearest Neighbor Search; High-Dimensional Spaces
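The query-phase idea (projecting data into 2D spaces and counting, per point, how often its image lands inside a query-centric ball) can be sketched as below. The projection vectors, data, and radius are hypothetical; R2LSH's actual index uses B+-trees and a dynamic counting technique rather than this linear scan:

```python
def project2d(p, planes):
    """Project a point onto two fixed vectors, giving its 2D image."""
    return tuple(sum(a * b for a, b in zip(v, p)) for v in planes)

def count_candidates(data, query, spaces, radius):
    """For each point, count the projected spaces in which its 2D image
    falls inside the ball of the given radius around the query's image."""
    counts = [0] * len(data)
    for planes in spaces:
        qx, qy = project2d(query, planes)
        for i, p in enumerate(data):
            x, y = project2d(p, planes)
            if (x - qx) ** 2 + (y - qy) ** 2 <= radius ** 2:
                counts[i] += 1
    return counts

# Two hypothetical projected spaces over 3-D data.
spaces = [((1, 0, 0), (0, 1, 0)), ((0, 1, 0), (0, 0, 1))]
data = [(0, 0, 0), (10, 10, 10)]
counts = count_candidates(data, (0.1, 0.1, 0.1), spaces, radius=1.0)
```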
【Paper Link】 【Pages】:1057-1068
【Authors】: Yan Li ; Tingjian Ge ; Cindy X. Chen
【Abstract】: Knowledge graphs have seen increasingly broad applications. However, they are known to be incomplete. We define the notion of a virtual knowledge graph which extends a knowledge graph with predicted edges and their probabilities. We focus on two important types of queries: top-k entity queries and aggregate queries. To improve query processing efficiency, we propose an incremental index on top of low dimensional entity vectors transformed from network embedding vectors. We also devise query processing algorithms with the index. Moreover, we provide theoretical guarantees of accuracy, and conduct a systematic experimental evaluation. The experiments show that our approach is very efficient and effective. In particular, with the same or better accuracy guarantees, it is one to two orders of magnitude faster in query processing than the closest previous work which can only handle one relationship type.
【Keywords】: Aggregates; Transforms; Query processing; Prediction algorithms; Probabilistic logic; Indexing
【Paper Link】 【Pages】:1069-1080
【Authors】: Feng Zhang ; Jidong Zhai ; Xipeng Shen ; Onur Mutlu ; Xiaoyong Du
【Abstract】: Recent studies have shown the promise of direct data processing on hierarchically-compressed text documents. By removing the need for decompressing data, the direct data processing technique brings large savings in both time and space. However, its benefits have been limited to data traversal operations; for random accesses, direct data processing is several times slower than the state-of-the-art baselines. This paper presents a set of techniques that successfully eliminate the limitation, and for the first time, establishes the feasibility of effectively handling both data traversal operations and random data accesses on hierarchically-compressed data. The work yields a new library, which achieves 3.1× speedup over the state-of-the-art on random data accesses to compressed data, while preserving the capability of supporting traversal operations efficiently and providing large (3.9×) space savings.
【Keywords】: Indexing; Data processing; Data structures; Technological innovation; Dictionaries; Law; Data mining
【Paper Link】 【Pages】:1081-1092
【Authors】: Zhong Yang ; Bolong Zheng ; Guohui Li ; Xi Zhao ; Xiaofang Zhou ; Christian S. Jensen
【Abstract】: The set similarity join (SSJ) is core functionality in a range of applications, including data cleaning, near-duplicate object detection, and data integration. Threshold-based SSJ queries return all pairs of sets with similarity no smaller than a given threshold. Because results, and their utility, are very sensitive to the choice of threshold, it is difficult to choose an appropriate value. Doing so requires prior knowledge of the data, which users often do not have. To avoid this problem, we propose a solution to the top-k overlap set similarity join (TkOSSJ), which returns the k pairs of sets with the highest overlap similarities. The state-of-the-art solution disregards the effect of the so-called step size, which is the number of elements accessed in each iteration of the algorithm; this affects its performance negatively. To address this issue, we first propose an algorithm that uses a fixed step size, thus taking advantage of the benefits of a large step size, and then we present an adaptive step size algorithm that is capable of automatically adjusting the step size, thus reducing redundant computations. An extensive empirical study offers insight into the new algorithms and indicates that they are capable of outperforming the state-of-the-art method on real, large-scale data sets.
【Keywords】: overlap set similarity; similarity join; top-k join, data lake
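The TkOSSJ semantics can be pinned down with a brute-force reference implementation. The two set collections are hypothetical, and the paper's algorithms avoid this quadratic enumeration via inverted lists and (fixed or adaptive) step sizes:

```python
import heapq
from itertools import product

def topk_overlap_join(R, S, k):
    """Brute-force TkOSSJ reference: the k pairs with the largest overlap.
    Returns (index_in_R, index_in_S, overlap) triples."""
    pairs = ((len(r & s), i, j)
             for (i, r), (j, s) in product(enumerate(R), enumerate(S)))
    return [(i, j, o) for o, i, j in heapq.nlargest(k, pairs)]

# Hypothetical set collections.
R = [{1, 2, 3}, {4, 5}]
S = [{2, 3, 4}, {5}]
top2 = topk_overlap_join(R, S, 2)
```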
【Paper Link】 【Pages】:1093-1104
【Authors】: Bo Zhao ; Nguyen Quoc Viet Hung ; Matthias Weidlich
【Abstract】: Complex event processing (CEP) systems that evaluate queries over streams of events may face unpredictable input rates and query selectivities. During short peak times, exhaustive processing is then no longer reasonable, or even infeasible, and systems shall resort to best-effort query evaluation and strive for optimal result quality while staying within a latency bound. In traditional data stream processing, this is achieved by load shedding that discards some stream elements without processing them based on their estimated utility for the query result. We argue that such input-based load shedding is not always suitable for CEP queries. It assumes that the utility of each individual element of a stream can be assessed in isolation. For CEP queries, however, this utility may be highly dynamic: Depending on the presence of partial matches, the impact of discarding a single event can vary drastically. In this work, we therefore complement input-based load shedding with a state-based technique that discards partial matches. We introduce a hybrid model that combines both input-based and state-based shedding to achieve high result quality under constrained resources. Our experiments indicate that such hybrid shedding improves the recall by up to 14× for synthetic data and 11.4× for real-world data, compared to baseline approaches.
【Keywords】: Query processing; Load modeling; Computational modeling; Correlation; Complexity theory; Transportation; Semantics
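State-based shedding as described (discarding partial matches by estimated utility when resources are constrained) can be sketched as follows. The partial matches, capacity bound, and utility scores are hypothetical stand-ins for the paper's utility model:

```python
def shed_partial_matches(partial, capacity, utility):
    """Keep only the `capacity` partial matches with the highest estimated
    utility; the rest are shed (discarded without further processing)."""
    if len(partial) <= capacity:
        return list(partial)
    return sorted(partial, key=utility, reverse=True)[:capacity]

# Hypothetical partial matches tagged with an estimated completion utility.
partial = [("m1", 0.9), ("m2", 0.1), ("m3", 0.5)]
kept = shed_partial_matches(partial, capacity=2, utility=lambda m: m[1])
```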
【Paper Link】 【Pages】:1105-1116
【Authors】: Nikos R. Katsipoulakis ; Alexandros Labrinidis ; Panos K. Chrysanthis
【Abstract】: Stream Processing Engines (SPEs) are used for realtime and continuous processing with stateful operations. This type of processing poses numerous challenges due to its associated complexity, unpredictable input, and need for timely results. As a result, users tend to overprovision resources, and online scaling is required in order to overcome overloaded situations. Current attempts for expediting stateful processing are impractical, due to their inability to uphold the quality of results, maintain performance, and reduce memory requirements. In this paper, we present the SPEAr system, which can expedite processing of stateful operations automatically by trading accuracy for performance. SPEAr detects when it can accelerate processing by employing online sampling and accuracy estimation at no additional cost. We built SPEAr on top of Storm and our experiments indicate that it can reduce processing times by more than an order of magnitude, use more than an order of magnitude less memory, and offer accuracy guarantees in real-world benchmarks.
【Keywords】: Watermarking; Buffer storage; Storms; Real-time systems; Memory management; Manuals; Silicon
【Paper Link】 【Pages】:1117-1128
【Authors】: Shixun Huang ; Zhifeng Bao ; Guoliang Li ; Yanghao Zhou ; J. Shane Culpepper
【Abstract】: Network embedding is an effective method to learn low-dimensional representations of nodes, which can be applied to various real-life applications such as visualization, node classification, and link prediction. Although significant progress has been made on this problem in recent years, several important challenges remain, such as how to properly capture temporal information in evolving networks. In practice, most networks are continually evolving. Some networks only add new edges or nodes such as authorship networks, while others support removal of nodes or edges such as internet data routing. If patterns exist in the changes of the network structure, we can better understand the relationships between nodes and the evolution of the network, which can be further leveraged to learn node representations with more meaningful information. In this paper, we propose the Embedding via Historical Neighborhoods Aggregation (EHNA) algorithm. More specifically, we first propose a temporal random walk that can identify relevant nodes in historical neighborhoods which have impact on edge formations. Then we apply a deep learning model which uses a custom attention mechanism to induce node embeddings that directly capture temporal information in the underlying feature representation. We perform extensive experiments on a range of real-world datasets, and the results demonstrate the effectiveness of our new approach in the network reconstruction task and the link prediction task.
【Keywords】: Task analysis; Collaboration; Heuristic algorithms; Machine learning; Social network services; Organizations; Feature extraction
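A temporal random walk of the kind the abstract describes (restricting each hop to edges that exist no later than the previous hop's timestamp, so the walk stays inside historical neighborhoods) might look like this sketch; the adjacency list and timestamps are hypothetical:

```python
import random

def temporal_random_walk(adj, start, t_start, length, seed=0):
    """From `start`, repeatedly follow an edge whose timestamp is no later
    than the previous hop's, moving back through earlier neighborhoods.

    adj: {node: [(neighbor, timestamp), ...]}
    """
    rng = random.Random(seed)
    walk, node, t = [start], start, t_start
    for _ in range(length):
        choices = [(v, ts) for v, ts in adj.get(node, []) if ts <= t]
        if not choices:
            break
        node, t = rng.choice(choices)
        walk.append(node)
    return walk

# Hypothetical temporal graph: edge (b, c) formed before edge (a, b).
adj = {"a": [("b", 5)], "b": [("c", 3), ("d", 7)]}
walk = temporal_random_walk(adj, "a", t_start=10, length=3)
```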
【Paper Link】 【Pages】:1129-1140
【Authors】: Swapnil Gandhi ; Yogesh Simmhan
【Abstract】: Algorithms for temporal property graphs may be time-dependent (TD), navigating the structure and time concurrently, or time-independent (TI), operating separately on different snapshots. Currently, there is no unified and scalable programming abstraction to design TI and TD algorithms over large temporal graphs. We propose an interval-centric computing model (ICM) for distributed and iterative processing of temporal graphs, where a vertex's time-interval is a unit of data-parallel computation. It introduces a unique time-warp operator for temporal partitioning and grouping of messages that hides the complexity of designing temporal algorithms, while avoiding redundancy in user logic calls and messages sent. GRAPHITE is our implementation of ICM over Apache Giraph, and we use it to design 12 TI and TD algorithms from literature. We rigorously evaluate its performance for diverse real-world temporal graphs - as large as 131M vertices and 5.5B edges, and as long as 219 snapshots. Our comparison with 4 baseline platforms on a 10-node commodity cluster shows that ICM shares compute and messaging across intervals to outperform them by up to 25×, and matches them even in worst-case scenarios. GRAPHITE also exhibits weak-scaling with near-perfect efficiency.
【Keywords】: Computational modeling; Data models; Partitioning algorithms; Scalability; Navigation; Programming; Graphite
【Paper Link】 【Pages】:1141-1152
【Authors】: Mo Li ; Farhana Murtaza Choudhury ; Renata Borovica-Gajic ; Zhiqiong Wang ; Junchang Xin ; Jianxin Li
【Abstract】: SimRank is a significant metric to measure the similarity of nodes in graph data analysis. The problem of SimRank computation has been studied extensively; however, no existing work provides a unified algorithm that supports SimRank computation on both static and temporal graphs. In this work, we first propose CrashSim, an index-free algorithm for single-source SimRank computation in static graphs. CrashSim can provide provable approximation guarantees for the computational results in an efficient way. In addition, as real-life graphs are often represented as temporal graphs, CrashSim enables efficient computation of SimRank in temporal graphs. We formally define two typical SimRank queries in temporal graphs, and then solve them by developing an efficient algorithm based on CrashSim, called CrashSim-T. The extensive experimental evaluation on five real-life and synthetic datasets shows that the CrashSim and CrashSim-T algorithms improve the efficiency of state-of-the-art SimRank algorithms by about 30%, while achieving about 97% precision on the result set.
【Keywords】: Computer crashes; Approximation algorithms; Indexes; Market research; Heuristic algorithms; Data analysis; Australia
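Index-free SimRank estimation can be illustrated with the classic Monte Carlo view: SimRank s(u, v) is the expected value of c^t over the first meeting time t of two reverse random walks. This is a far simpler estimator than CrashSim, and the toy graph is hypothetical:

```python
import random

def simrank_mc(in_nbrs, u, v, c=0.6, walks=1000, max_len=10, seed=1):
    """Estimate SimRank s(u, v) as the mean of c**t over the first meeting
    time t of two reverse random walks started at u and v."""
    if u == v:
        return 1.0
    rng = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        a, b = u, v
        for t in range(1, max_len + 1):
            if not in_nbrs.get(a) or not in_nbrs.get(b):
                break  # a walk got stuck; this pair contributes 0
            a, b = rng.choice(in_nbrs[a]), rng.choice(in_nbrs[b])
            if a == b:
                total += c ** t
                break
    return total / walks

# Toy graph: u and v share the single in-neighbor w, so s(u, v) = c exactly.
in_nbrs = {"u": ["w"], "v": ["w"]}
est = simrank_mc(in_nbrs, "u", "v")
```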
【Paper Link】 【Pages】:1153-1164
【Authors】: Dong Wen ; Yilun Huang ; Ying Zhang ; Lu Qin ; Wenjie Zhang ; Xuemin Lin
【Abstract】: Reachability is a fundamental problem in graph analysis. In applications such as social networks and collaboration networks, edges are always associated with timestamps. Most existing works on reachability queries in temporal graphs assume that two vertices are related if they are connected by a path with non-decreasing timestamps (time-respecting) of edges. This assumption fails to capture the relationship between entities involved in the same group or activity with no time-respecting path connecting them. In this paper, we define a new reachability model, called span-reachability, designed to relax the time order dependency and identify the relationship between entities in a given time period. We adopt the idea of two-hop cover and propose an index-based method to answer span-reachability queries. Several optimizations are also given to improve the efficiency of index construction and query processing. We conduct extensive experiments on 17 real-world datasets to show the efficiency of our proposed solution.
【Keywords】: Reachability; Temporal Graphs
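The 2-hop cover idea the paper builds on answers a reachability query by intersecting label sets; the toy DAG and hand-built labels below are hypothetical (the paper constructs labels for its span-reachability model rather than for plain reachability):

```python
def reach(u, v, L_out, L_in):
    """2-hop cover query: u reaches v iff their labels share a hop node."""
    return u == v or bool(L_out.get(u, set()) & L_in.get(v, set()))

# Toy DAG a -> b -> c with hand-built labels (hop node b covers a ~> c).
L_out = {"a": {"a", "b"}, "b": {"b"}, "c": {"c"}}
L_in = {"a": {"a"}, "b": {"b"}, "c": {"b", "c"}}
```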
【Paper Link】 【Pages】:1165-1176
【Authors】: Jia Yu ; Mohamed Sarwat
【Abstract】: In this paper, we present Tabula, a middleware framework that runs on top of a SQL data system with the purpose of increasing the interactivity of geospatial visualization dashboards. The proposed system adopts a sampling cube approach that stores pre-materialized spatial samples and allows users to define their own accuracy loss function such that the produced samples can be used for various user-defined visualization tasks. The system ensures that the difference between the sample fed into the visualization dashboard and the raw query answer never exceeds the user-specified loss threshold. To reduce the number of cells in the sampling cube and hence mitigate the initialization time and memory utilization, the system employs two main strategies: (1) a partially materialized cube that only materializes local samples of those queries for which the global sample (the sample drawn from the entire dataset) exceeds the required accuracy loss threshold; (2) a sample selection technique that finds similarities between different local samples and only persists a few representative samples. Based on the extensive experimental evaluation, Tabula can bring down the total data-to-visualization time (including both data-system and visualization times) of a heat map generated over 700 million taxi rides to 600 milliseconds with 250 meters user-defined accuracy loss. Besides, Tabula costs up to two orders of magnitude less memory footprint (e.g., only 800 MB for the running example) and one order of magnitude less initialization time than the fully materialized sampling cube.
【Keywords】: Data visualization; Data systems; Public transportation; Heating systems; Loss measurement; Geospatial analysis; Task analysis
【Paper Link】 【Pages】:1177-1188
【Authors】: Ibrahim Sabek ; Mohamed F. Mokbel
【Abstract】: This paper presents Sya; the first spatial probabilistic knowledge base construction system, based on Markov Logic Networks (MLN). Sya injects the awareness of spatial relationships inside the MLN grounding and inference phases, which are the pillars of the knowledge base construction process, and hence results in a better knowledge base output. In particular, Sya generates a probabilistic model that captures both logical and spatial correlations among knowledge base relations. Sya provides a simple spatial high-level language, a spatial variation of factor graph, a spatial rules-query translator, and a spatially-equipped statistical inference technique to infer the factual scores of relations. In addition, Sya provides an optimization that ensures scalable grounding and inference for large-scale knowledge bases. Experimental evidence, based on building two real knowledge bases with spatial nature, shows that Sya can achieve 70% higher F1-score on average over the state-of-the-art DeepDive system, while achieving at least 20% reduction in the execution times.
【Keywords】: Knowledge based systems; Grounding; Correlation; Probabilistic logic; Spatial databases; Standards; Data mining
【Paper Link】 【Pages】:1189-1200
【Authors】: Lei Li ; Mengxuan Zhang ; Wen Hua ; Xiaofang Zhou
【Abstract】: Shortest path query is a fundamental operation in various location-based services (LBS) and most of them process queries on the server-side. As the business expands, scalability becomes a severe issue. Instead of simply deploying more servers to cope with the quickly increasing query number, batch shortest path algorithms have been proposed recently to answer a set of queries together using shareable computation. Besides, they can also work in a highly dynamic environment as no index is needed. However, the existing batch algorithms either assume the batch queries are finely decomposed or just process them without differentiation, resulting in poor query efficiency. In this paper, we aim to improve the performance of batch shortest path algorithms by revisiting the problem of query clustering. Specifically, we first propose three query decomposition methods to cluster queries: Zigzag, which considers the 1-N shared computation; Search-Space Estimation, which further incorporates search space estimation; and Co-Clustering, which considers the spatial locality of sources and targets. After that, we propose two batch algorithms that take advantage of the previously decomposed query sets for efficient query answering: Local Cache, which improves the existing Global Cache with a higher cache hit ratio, and R2R, which finds a set of approximate shortest paths from one region to another with bounded error. Experiments on large real-world query sets verify the effectiveness and efficiency of our decomposition methods compared with the state-of-the-art batch algorithms.
【Keywords】: Roads; Heuristic algorithms; Clustering algorithms; Indexes; Approximation algorithms; Scalability; Query processing
【Paper Link】 【Pages】:1201-1212
【Authors】: Jiehuan Luo ; Xin Cao ; Xike Xie ; Qiang Qu ; Zhiqiang Xu ; Christian S. Jensen
【Abstract】: Networked data, notably social network data, often comes with a rich set of annotations, or attributes, such as documents (e.g., tweets) and locations (e.g., check-ins). Community search in such attributed networks has been studied intensively due to its many applications in friends recommendation, event organization, advertising, etc. We study the problem of attribute-constrained co-located community (ACOC) search, which returns a community that satisfies three properties: i) structural cohesiveness: the members in the community are densely connected; ii) spatial co-location: the members are close to each other; and iii) attribute constraint: a set of attributes are covered by the attributes associated with the members. The ACOC problem is shown to be NP-hard. We develop four efficient approximation algorithms with guaranteed error bounds in addition to an exact solution that works on relatively small graphs. Extensive experiments conducted with both real and synthetic data offer insight into the efficiency and effectiveness of the proposed methods, showing that they outperform three adapted state-of-the-art algorithms by an order of magnitude. We also find that the approximation algorithms are much faster than the exact solution and yet offer high accuracy.
【Keywords】: Approximation algorithms; Collaboration; Conferences; Search problems; Social network services; Task analysis; Measurement
【Paper Link】 【Pages】:1213-1224
【Authors】: Zijin Feng ; Tiantian Liu ; Huan Li ; Hua Lu ; Lidan Shou ; Jianliang Xu
【Abstract】: People have many activities indoors and there is an increasing demand of keyword-aware route planning for indoor venues. In this paper, we study the indoor top-k keyword-aware routing query (IKRQ). Given two indoor points s and t, an IKRQ returns k s-to-t routes that do not exceed a given distance constraint but have optimal ranking scores integrating keyword relevance and spatial distance. It is challenging to efficiently compute the ranking scores and find the best yet diverse routes in a large indoor space with complex topology. We propose prime routes to diversify top-k routes, devise mapping structures to organize indoor keywords and compute route keyword relevances, and derive pruning rules to reduce search space in routing. With these techniques, we design two search algorithms with different routing expansions. Experiments on synthetic and real data demonstrate the efficiency of our proposals.
【Keywords】: Routing; Planning; Computer science; Topology; Proposals; Airports; Logic gates
【Paper Link】 【Pages】:1225-1236
【Authors】: Jiajia Chu ; Yunshan Tu ; Yao Zhang ; Chuliang Weng
【Abstract】: Most database systems rely on complex multi-layer and compatibility-oriented storage stacks, which results in sub-optimal database management system (DBMS) performance and significant write amplification. A heavy storage stack could be tolerated in the slow disk era because its overhead was completely overlapped by hardware delay. However, with advances in storage technologies, emerging NVMe devices have reached the same level of latency as software, which in turn has caused the storage stack to become a new bottleneck. On the other hand, NVMe devices not only improve I/O efficiency but also introduce distinctive hardware features that require software modifications to take advantage of them. To fully exploit the hardware potential of NVMe devices, we propose a lightweight native storage stack called Lightstack to minimize the software overhead. The core of Lightstack is an efficient table storage engine, LATTE, which abstracts the essential data service of the database's 2D table. LATTE is designed from the ground up to use NVMe devices efficiently. It directly accesses NVMe devices to reduce single I/O latency and utilizes a parallel scheduling strategy to leverage multiple deep I/O queues and CPU cores. Besides, undo logging on heterogeneous storage is proposed to further mitigate the write amplification. We also implement a working prototype and evaluate it with standard benchmarks on the Intel Optane DC P4800X NVMe SSD and the DC P3608 Series NVMe SSD. Experimental results show that LATTE achieves up to 3.6-6.5× the throughput of MySQL's InnoDB and MyRocks engines, with latency as low as 28% of theirs in the same hardware environment.
【Keywords】: Nonvolatile memory; Hardware; Kernel; Databases; Performance evaluation; Engines
【Paper Link】 【Pages】:1237-1248
【Authors】: Sehyeon Oh ; Wook-Hee Kim ; Jihye Seo ; Hyeonho Song ; Sam H. Noh ; Beomseok Nam
【Abstract】: Various transactional systems use out-of-place updates such as logging or copy-on-write mechanisms to update data in a failure-atomic manner. Such out-of-place update methods double the I/O traffic due to back-up copies in the database layer and quadruple the I/O traffic due to the file system journaling. In mobile systems, transaction sizes of mobile apps are known to be tiny and transactions run at low concurrency. For such mobile transactions, legacy out-of-place update methods such as WAL are sub-optimal. In this work, we propose a crash consistent in-place update logging method - doubleheader logging (DHL) for SQLite. DHL prevents previous consistent records from being lost by performing a copy-on-write inside the database page and co-locating the metadata-only journal information within the page. This is done, in turn, with minimal sacrifice to page utilization. DHL is similar to when journaling is disabled, in the sense that it incurs almost no additional overhead in terms of both I/O and computation. Our experimental results show that DHL outperforms other logging methods such as out-of-place update write-ahead logging (WAL) and in-place update multi-version B-tree (MVBT).
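The crash-consistency idea behind an in-page header pair can be sketched abstractly: each commit writes its version descriptor into the *other* header slot, so a torn write can never destroy the last consistent version. This is only an illustration of the copy-on-write-inside-the-page principle; the actual DHL page layout and SQLite integration in the paper differ:

```python
import zlib

class Page:
    """Illustrative page with two header slots. A commit fills the inactive
    slot (version, payload, checksum) and then flips the active pointer, so
    a crash mid-write leaves the previous slot untouched. Not the paper's
    actual on-disk format."""
    def __init__(self):
        self.headers = [None, None]   # each slot: (version, payload, checksum)
        self.active = 0

    def commit(self, payload):
        cur = self.headers[self.active]
        version = cur[0] + 1 if cur else 1
        slot = 1 - self.active        # always write the *other* slot
        self.headers[slot] = (version, payload, zlib.crc32(payload))
        self.active = slot            # flip only after the write lands

    def recover(self):
        """After a crash, keep the checksum-valid header with the highest
        version; a torn latest write falls back to the previous one."""
        valid = [h for h in self.headers
                 if h is not None and zlib.crc32(h[1]) == h[2]]
        return max(valid, key=lambda h: h[0])[1] if valid else None
```

Because the journal metadata lives inside the page itself, no separate journal file write is needed, which is where DHL's I/O savings come from.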
【Keywords】: Metadata; Database systems; Concurrent computing; Performance evaluation; Computer crashes; Concurrency control
【Paper Link】 【Pages】:1249-1260
【Authors】: Li Zeng ; Lei Zou ; M. Tamer Özsu ; Lin Hu ; Fan Zhang
【Abstract】: Subgraph isomorphism is a well-known NP-hard problem that is widely used in many applications, such as social network analysis and querying over the knowledge graph. Due to the inherent hardness, its performance is often a bottleneck in various real-world applications. We address this by designing an efficient subgraph isomorphism algorithm leveraging features of GPU architecture, such as massive parallelism and memory hierarchy. Existing GPU-based solutions adopt a two-step output scheme, performing the same join twice in order to write intermediate results concurrently. They also lack GPU architecture-aware optimizations that allow scaling to large graphs. In this paper, we propose a GPU-friendly subgraph isomorphism algorithm, GSI. Different from existing edge join-based GPU solutions, we propose a Prealloc-Combine strategy based on the vertex-oriented framework, which avoids joining-twice in existing solutions. Also, a GPU-friendly data structure (called PCSR) is proposed to represent an edge-labeled graph. Extensive experiments on both synthetic and real graphs show that GSI outperforms the state-of-the-art algorithms by up to several orders of magnitude and has good scalability with graph size scaling to hundreds of millions of edges.
【Keywords】: GSI; GPU; Subgraph Isomorphism
【Paper Link】 【Pages】:1261-1272
【Authors】: Xuan Sun ; Jinghuan Yu ; Zimeng Zhou ; Chun Jason Xue
【Abstract】: With the rapid growth of big data, LSM-tree based key-value stores are widely applied due to their high write performance. Compaction, which merges old data, plays a critical role in the LSM-tree but can significantly reduce the overall throughput of the whole system, especially for write-intensive workloads. Hardware acceleration for databases has been a popular trend in recent years. In this paper, we design and implement an FPGA-based compaction engine to accelerate compaction in LSM-tree based key-value stores. To take full advantage of the pipeline mechanism on FPGA, the key-value separation and index-data block separation strategies are proposed. In order to improve the compaction performance, the bandwidth of the FPGA chip is fully utilized. In addition, the proposed acceleration engine is integrated with a classic LSM-tree based store without modifications to the original storage format. The experimental results demonstrate that the proposed FPGA-based compaction engine can achieve up to 92.0x acceleration ratio compared with the CPU baseline, and achieve up to 6.4x improvement on the throughput of random writes.
【Keywords】: compaction; LSM-tree; key-value; FPGA
【Paper Link】 【Pages】:1273-1284
【Authors】: Andrew Crotty ; Alex Galakatos ; Tim Kraska
【Abstract】: Code generation for in-memory query processing is now commonplace. While existing approaches use a wide range of techniques (e.g., inline expansion, pipelining, SIMD vectorization, prefetching) to reduce processing effort, we argue that generating code with better data access patterns is often more important. Therefore, we propose SWOLE, the first access-aware code generation strategy. Contrary to conventional wisdom, SWOLE heavily leverages predicate pullups to produce code with better access patterns, which outweighs the overhead of performing wasted work. Our experiments show that SWOLE can outperform the state-of-the-art approach by over 2.6×.
【Keywords】: Query processing; Lenses; Prefetching; Hybrid power systems; Optimization; Arrays; Pipeline processing
【Paper Link】 【Pages】:1285-1296
【Authors】: Nick Koudas ; Raymond Li ; Ioannis Xarchakos
【Abstract】: Recent advances in video processing utilizing deep learning primitives achieved breakthroughs in fundamental problems in video analysis such as frame classification and object detection, enabling an array of new applications. In this paper, we study the problem of interactive declarative query processing on video streams. In particular, we introduce a set of approximate filters to speed up queries that involve objects of specific type (e.g., cars, trucks, etc.) on video frames with associated spatial relationships among them (e.g., car left of truck). The resulting filters quickly assess whether the query predicates can be true; a frame proceeds to further analysis only if they can, and is otherwise discarded, avoiding costly object detection operations. We propose two classes of filters, IC and OD, that adapt principles from deep image classification and object detection. The filters utilize extensible deep neural architectures and are easy to deploy and utilize. In addition, we propose statistical query processing techniques to process aggregate queries involving objects with spatial constraints on video streams and demonstrate experimentally the increased accuracy of the resulting aggregate estimation. Combined, these techniques constitute a robust set of video monitoring query processing techniques. We demonstrate that the application of the techniques proposed in conjunction with declarative queries on video streams can dramatically increase the frame processing rate and speed up query processing by at least two orders of magnitude. We present the results of a thorough experimental study utilizing benchmark video data sets at scale demonstrating the performance benefits and the practical relevance of our proposals.
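The filter-then-detect control flow described above amounts to a two-stage cascade: a cheap approximate check gates the expensive detector. The stub functions and frame representation below are hypothetical stand-ins, not the paper's IC/OD networks:

```python
# Sketch of a filter cascade for the query "car left of truck".
# cheap_filter and expensive_detector are hypothetical stand-ins.
def cheap_filter(frame):
    """Approximate filter: could the frame contain both object classes?
    May return True spuriously, but a False reliably rejects the frame."""
    return "car" in frame["hints"] and "truck" in frame["hints"]

def expensive_detector(frame):
    """Full object detection with localization (stub): label -> x-center."""
    return frame["boxes"]

def query_matches(frame):
    if not cheap_filter(frame):       # fast reject: no detection at all
        return False
    boxes = expensive_detector(frame)
    return ("car" in boxes and "truck" in boxes
            and boxes["car"] < boxes["truck"])   # spatial predicate
```

The speedup comes from how rarely `expensive_detector` runs: on streams where most frames fail the predicate, almost all frames exit at the first line.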
【Keywords】: Streaming media; Object detection; Query processing; Convolution; Monitoring; Automobiles; Aggregates
【Paper Link】 【Pages】:1297-1308
【Authors】: Xiang Yu ; Guoliang Li ; Chengliang Chai ; Nan Tang
【Abstract】: Join order selection (JOS) - the problem of finding the optimal join order for an SQL query - is a primary focus of database query optimizers. The problem is hard due to its large solution space. Exhaustively traversing the solution space is prohibitively expensive, which is often combined with heuristic pruning. Despite decades-long effort, traditional optimizers still suffer from low scalability or low accuracy when handling complicated SQL queries. Recent attempts using deep reinforcement learning (DRL), by encoding join trees with fixed-length hand-tuned feature vectors, have shed some light on JOS. However, using fixed-length feature vectors cannot capture the structural information of a join tree, which may produce poor join plans. Moreover, it may also cause retraining the neural network when handling schema changes (e.g., adding tables/columns) or multi-alias table names that are common in SQL queries. In this paper, we present RTOS, a novel learned optimizer that uses Reinforcement learning with Tree-structured long short-term memory (LSTM) for join Order Selection. RTOS improves existing DRL-based approaches in two main aspects: (1) it adopts graph neural networks to capture the structures of join trees; and (2) it well supports the modification of database schema and multi-alias table names. Extensive experiments on Join Order Benchmark (JOB) and TPC-H show that RTOS outperforms traditional optimizers and existing DRL-based learned optimizers. In particular, the plans RTOS generates for JOB average 101% of the (estimated) cost and 67% of the latency (i.e., execution time) of those produced by dynamic programming, which is known to produce state-of-the-art join plans.
【Keywords】: Vegetation; Databases; Neural networks; Forestry; Training; Machine learning; Benchmark testing
【Paper Link】 【Pages】:1309-1320
【Authors】: Saravanan Thirumuruganathan ; Shohedul Hasan ; Nick Koudas ; Gautam Das
【Abstract】: Data is generated at an unprecedented rate surpassing our ability to analyze it. The database community has pioneered many novel techniques for Approximate Query Processing (AQP) that could give approximate results in a fraction of the time needed for computing exact results. In this work, we explore the usage of deep learning (DL) for answering aggregate queries specifically for interactive applications such as data exploration and visualization. We use deep generative models, an unsupervised learning based approach, to learn the data distribution faithfully such that aggregate queries could be answered approximately by generating samples from the learned model. The model is often compact - a few hundred KBs - so that arbitrary AQP queries could be answered on the client side without contacting the database server. Our other contributions include identifying model bias and minimizing it through a rejection sampling based approach and an algorithm to build model ensembles for AQP for improved accuracy. Our extensive experiments show that our proposed approach can provide answers with high accuracy and low latency.
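The answering step reduces to Monte Carlo estimation over model samples: draw rows from the learned distribution and aggregate only those satisfying the query predicate. In this minimal sketch the "model" is just an empirical sampler standing in for a trained deep generative model:

```python
import random

def make_model(rows):
    """Stand-in for a trained generative model: sample() returns one
    synthetic row. A real deployment would sample a neural model instead."""
    return lambda: random.choice(rows)

def approx_avg(sample, predicate, column, n=20000):
    """Monte Carlo estimate of AVG(column) WHERE predicate, computed
    entirely from model samples - no access to the base data."""
    vals = [row[column]
            for row in (sample() for _ in range(n))
            if predicate(row)]
    return sum(vals) / len(vals)
```

Because only the sampler ships to the client, arbitrary predicates can be evaluated locally; accuracy then depends on how faithfully the model captured the distribution (the bias the paper corrects via rejection sampling).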
【Keywords】: Data models; Aggregates; Encoding; Computational modeling; Query processing; Data visualization
【Paper Link】 【Pages】:1321-1332
【Authors】: Fotis Savva ; Christos Anagnostopoulos ; Peter Triantafillou
【Abstract】: Several data mining tasks focus on repeatedly inspecting multidimensional data regions summarized by a statistic. The value of this statistic (e.g., region-population sizes, order moments) is used to classify the region's interestingness. These regions can be naively extracted from the entire dataspace - however, this is extremely time-consuming and compute-resource demanding. This paper studies the reverse problem: analysts provide a cut-off value for a statistic of interest and in turn our proposed framework efficiently identifies multidimensional regions whose statistic exceeds (or is below) the given cut-off value (according to user's needs). However, as data dimensions and size increase, such a task inevitably becomes laborious and costly. To alleviate this cost, our solution, coined SuRF (SUrrogate Region Finder), leverages historical region evaluations to train surrogate models that learn to approximate the distribution of the statistic of interest. It then makes use of evolutionary multi-modal optimization to effectively and efficiently identify regions of interest regardless of data size and dimensionality. The accuracy, efficiency, and scalability of our approach are demonstrated with experiments using synthetic and real-world datasets and compared with other methods.
【Keywords】: Surrogate model estimation; statistical learning; swarm intelligence; evolutionary multimodal optimization
【Paper Link】 【Pages】:1333-1344
【Authors】: Xinyang Yu ; Yanqing Peng ; Feifei Li ; Sheng Wang ; Xiaowei Shen ; Huijun Mai ; Yue Xie
【Abstract】: The explosion of time series advances the development of time series databases. To reduce storage overhead in these systems, data compression is widely adopted. Most existing compression algorithms utilize the overall characteristics of the entire time series to achieve a high compression ratio, but ignore local contexts around individual points. In this way, they are effective for certain data patterns, and may suffer from the pattern changes inherent in real-world time series. It is therefore strongly desired to have a compression method that can always achieve a high compression ratio in the presence of pattern diversity. In this paper, we propose a two-level compression model that selects a proper compression scheme for each individual point, so that diverse patterns can be captured at a fine granularity. Based on this model, we design and implement the AMMMO framework, where a set of control parameters is defined to distill and categorize data patterns. At the top level, we evaluate each sub-sequence to fill in these parameters, generating a set of compression scheme candidates (i.e., major mode selection). At the bottom level, we choose the best scheme from these candidates for each data point respectively (i.e., sub-mode selection). To effectively handle diverse data patterns, we introduce a reinforcement learning based approach to learn parameter values automatically. Our experimental evaluation shows that our approach improves compression ratio by up to 120% (with an average of 50%), compared to other time-series compression methods.
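The bottom-level sub-mode selection can be illustrated with a toy per-point selector: each point independently picks whichever candidate encoding costs fewer bits. The two candidate schemes here (delta to the previous point vs. delta to the block's base value) and the bit-cost heuristic are illustrative assumptions, not AMMMO's actual parameterized modes:

```python
def bits(v):
    """Illustrative bit cost: magnitude bits plus one sign bit."""
    return abs(v).bit_length() + 1

def compress_plan(points):
    """For each point after the first, pick the cheaper candidate scheme:
    'delta' (relative to the previous point) or 'base' (relative to the
    block's first value)."""
    base = points[0]
    plan = []
    for prev, cur in zip(points, points[1:]):
        costs = {"delta": bits(cur - prev), "base": bits(cur - base)}
        plan.append(min(costs, key=costs.get))
    return plan
```

A smooth run favors `delta`, while a point that returns near the block base after a spike favors `base` - exactly the kind of local pattern change a single whole-series scheme would miss.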
【Keywords】: Time series analysis; Aerospace electronics; Data compression; Data models; Learning (artificial intelligence); Compression algorithms; Transforms
【Paper Link】 【Pages】:1345-1356
【Authors】: Mohammad Javad Amiri ; Sujaya Maiyya ; Divyakant Agrawal ; Amr El Abbadi
【Abstract】: Large scale data management systems utilize State Machine Replication to provide fault tolerance and to enhance performance. Fault-tolerant protocols are extensively used in the distributed database infrastructure of large enterprises such as Google, Amazon, and Facebook. However, and in spite of years of intensive research, existing fault-tolerant protocols do not adequately address hybrid cloud environments consisting of private and public clouds which are widely used by enterprises. In this paper, we consider a private cloud consisting of non-malicious nodes (crash-only failures) and a public cloud with possible malicious failures. We introduce SeeMoRe, a hybrid State Machine Replication protocol that uses the knowledge of where crash and malicious failures may occur in a public/private cloud environment to improve overall performance. SeeMoRe has three different modes that can be used depending on the private cloud load and the communication latency between the public and private clouds. SeeMoRe can dynamically transition from one mode to another. Furthermore, an extensive evaluation reveals that SeeMoRe's performance is close to state-of-the-art crash fault-tolerant protocols while tolerating malicious failures.
【Keywords】: Fault Tolerance; Consensus; Cloud Computing
【Paper Link】 【Pages】:1357-1368
【Authors】: Yuechen Tao ; Bo Li ; Jingjie Jiang ; Hok Chu Ng ; Cong Wang ; Baochun Li
【Abstract】: Current blockchain systems suffer from a number of inherent drawbacks in their scalability, latency, and processing throughput. By enabling parallel confirmations of transactions, sharding has been proposed to mitigate these drawbacks, which usually requires frequent communication among miners through a separate consensus protocol. In this paper, we propose, analyze, and implement a new distributed and dynamic sharding system to substantially improve the throughput of blockchain systems based on smart contracts, while requiring minimum cross-shard communication. Our key observation is that transactions sent by users who only participate in a single smart contract can be validated and confirmed independently without causing double spending. Therefore, the natural formation of a shard is to surround one smart contract to start with. The complication lies in the different sizes of shards being formed, in which a small shard with few transactions tends to generate a large number of empty blocks resulting in a waste of mining power, while a large shard adversely affects parallel confirmations. To overcome this problem, we propose an inter-shard merging algorithm with incentives to encourage small shards to merge with one another and form a larger shard, an intra-shard transaction selection mechanism to encourage miners to select different subsets of transactions for validation, as well as a parameter unification method to further improve these two algorithms to reduce the communication cost and improve system reliability. We analyze our proposed algorithms using the game theoretic approach, and prove that they converge to a Nash Equilibrium. We also present a security analysis on our sharding design, and prove that it resists adversaries who occupy at most 33% of the computation power. We have implemented our designs on go-Ethereum 1.8.0 and evaluated their performance using both real-world blockchain transactions and large-scale simulations. Our results show that throughput has been improved by 7.2×, and the number of empty blocks has been reduced by 90%.
【Keywords】: Contracts; Throughput; Merging; Protocols; Games
【Paper Link】 【Pages】:1369-1380
【Authors】: Da Yan ; Guimu Guo ; Md Mashiur Rahman Chowdhury ; M. Tamer Özsu ; Wei-Shinn Ku ; John C. S. Lui
【Abstract】: Mining from a big graph those subgraphs that satisfy certain conditions is useful in many applications such as community detection and subgraph matching. These problems have a high time complexity, but existing systems to scale them are all IO-bound in execution. We propose the first truly CPU-bound distributed framework called G-thinker that adopts a user-friendly subgraph-centric vertex-pulling API for writing distributed subgraph mining algorithms. To utilize all CPU cores of a cluster, G-thinker features (1) a highly-concurrent vertex cache for parallel task access and (2) a lightweight task scheduling approach that ensures high task throughput. These designs well overlap communication with computation to minimize the CPU idle time. Extensive experiments demonstrate that G-thinker achieves orders of magnitude speedup compared even with the fastest existing subgraph-centric system, and it scales well to much larger and denser real network data. G-thinker is open-sourced at http://bit.ly/gthinker with detailed documentation.
【Keywords】: graph mining; subgraph-centric; CPU-bound; compute-intensive; clique; triangle; subgraph matching
【Paper Link】 【Pages】:1381-1392
【Authors】: Michael Abebe ; Brad Glasbergen ; Khuzaima Daudjee
【Abstract】: Single-master replicated database systems strive to be scalable by offloading reads to replica nodes. However, single-master systems suffer from the performance bottleneck of all updates executing at a single site. Multi-master replicated systems distribute updates among sites but incur costly coordination for multi-site transactions. We present DynaMast, a lazily replicated, multi-master database system that guarantees one-site transaction execution while effectively distributing both reads and updates among multiple sites. DynaMast benefits from these advantages by dynamically transferring the mastership of data, or remastering, among sites using a lightweight metadata-based protocol. DynaMast leverages remastering to adaptively place master copies to balance load and minimize future remastering. Using benchmark workloads, we demonstrate that DynaMast delivers superior performance over existing replicated database system architectures.
【Keywords】: Protocols; Silicon; Database systems; Scalability; Distributed databases; Computer architecture
【Paper Link】 【Pages】:1393-1404
【Authors】: Jinkun Geng ; Dan Li ; Shuai Wang
【Abstract】: Distributed machine learning (DML) has become the common practice in industry, because of the explosive volume of training data and the growing complexity of training models. Traditional DML follows data parallelism but causes significant communication cost, due to the huge amount of parameter transmission. The recently emerging model-parallel solutions can reduce the communication workload, but lead to load imbalance and serious straggler problems. More importantly, the existing solutions, either data-parallel or model-parallel, ignore the nature of flexible parallelism for most DML tasks, thus failing to fully exploit the GPU computation power. Targeting these drawbacks, we propose Fela, which incorporates both a flexible parallelism and an elastic tuning mechanism to accelerate DML. In order to fully leverage GPU power and reduce communication cost, Fela adopts hybrid parallelism and uses flexible parallel degrees to train different parts of the model. Meanwhile, Fela designs a token-based scheduling policy to elastically tune the workload among different workers, thus mitigating the straggler effect and achieving better load balance. Our comparative experiments show that Fela can significantly improve the training throughput and outperforms the three main baselines (i.e., data-parallel, model-parallel, and hybrid-parallel) by up to 3.23×, 12.22×, and 1.85× respectively.
【Keywords】: Distributed machine learning; flexible parallelism; token-based scheduling; straggler effect
【Paper Link】 【Pages】:1405-1416
【Authors】: Tong Chen ; Hongzhi Yin ; Quoc Viet Hung Nguyen ; Wen-Chih Peng ; Xue Li ; Xiaofang Zhou
【Abstract】: In various web applications like targeted advertising and recommender systems, the available categorical features (e.g., product type) are often of great importance but sparse. As a widely adopted solution, models based on Factorization Machines (FMs) are capable of modelling high-order interactions among features for effective sparse predictive analytics. As the volume of web-scale data grows exponentially over time, sparse predictive analytics inevitably involves dynamic and sequential features. However, existing FM-based models assume no temporal orders in the data, and are unable to capture the sequential dependencies or patterns within the dynamic features, impeding the performance and adaptivity of these methods. Hence, in this paper, we propose a novel Sequence-Aware Factorization Machine (SeqFM) for temporal predictive analytics, which models feature interactions by fully investigating the effect of sequential dependencies. As static features (e.g., user gender) and dynamic features (e.g., user interacted items) express different semantics, we innovatively devise a multi-view self-attention scheme that separately models the effect of static features, dynamic features and the mutual interactions between static and dynamic features in three different views. In SeqFM, we further map the learned representations of feature interactions to the desired output with a shared residual network. To showcase the versatility and generalizability of SeqFM, we test SeqFM in three popular application scenarios for FM-based models, namely ranking, classification and regression tasks. Extensive experimental results on six large-scale datasets demonstrate the superior effectiveness and efficiency of SeqFM.
【Keywords】: Computational modeling; Frequency modulation; Predictive models; Task analysis; Analytical models; Encoding; Feature extraction
【Paper Link】 【Pages】:1417-1428
【Authors】: Jilin Hu ; Bin Yang ; Chenjuan Guo ; Christian S. Jensen ; Hui Xiong
【Abstract】: Origin-destination (OD) matrices are used widely in transportation and logistics to record the travel cost (e.g., travel speed or greenhouse gas emission) between pairs of OD regions during different intervals within a day. We model a travel cost as a distribution because when traveling between a pair of OD regions, different vehicles may travel at different speeds even during the same interval, e.g., due to different driving styles or different waiting times at intersections. This yields stochastic OD matrices. We consider an increasingly pertinent setting where a set of vehicle trips is used for instantiating OD matrices. Since the trips may not cover all OD pairs for each interval, the resulting OD matrices are likely to be sparse. We then address the problem of forecasting complete, near future OD matrices from sparse, historical OD matrices. To solve this problem, we propose a generic learning framework that (i) employs matrix factorization and graph convolutional neural networks to contend with the data sparseness while capturing spatial correlations and that (ii) captures spatio-temporal dynamics via recurrent neural networks extended with graph convolutions. Empirical studies using two taxi trajectory data sets offer detailed insight into the properties of the framework and indicate that it is effective.
【Keywords】: Sparse matrices; Stochastic processes; Correlation; Forecasting; Urban areas; Roads; Recurrent neural networks
【Paper Link】 【Pages】:1429-1440
【Authors】: Yvonne Mülle ; Michael H. Böhlen
【Abstract】: The ongoing time point now states that a tuple is valid from its start point onward. For database systems, ongoing time points have far-reaching implications since they change continuously as time passes by. State-of-the-art approaches deal with ongoing time points by instantiating them to the reference time. The instantiation yields query results that are only valid at the chosen time and get invalidated as time passes by. We propose a solution that keeps ongoing time points uninstantiated during query processing. We do so by evaluating predicates and functions at all possible reference times. This renders query results independent of a specific reference time and yields results that remain valid as time passes by. As query results, we propose ongoing relations that include a reference time attribute. The value of the reference time attribute is restricted by predicates and functions on ongoing attributes. We describe and evaluate an efficient implementation of ongoing data types and operations in PostgreSQL.
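The key move - evaluating a predicate at all reference times instead of instantiating now once - can be sketched with a comparison that returns the interval of reference times at which it holds. The representation below is an illustrative simplification, not the paper's PostgreSQL implementation:

```python
INF = float("inf")

def holds_before(a, b):
    """Interval of reference times rt at which a <= b holds, where each
    operand is either a fixed time point (a number) or the ongoing point
    "now", whose value at reference time rt is rt itself. Returns None if
    the predicate never holds. Illustrative sketch only."""
    if a == "now" and b == "now":
        return (-INF, INF)                  # now <= now holds at every rt
    if a == "now":
        return (-INF, b)                    # holds while rt <= b
    if b == "now":
        return (a, INF)                     # holds once rt >= a
    return (-INF, INF) if a <= b else None  # constants: fixed outcome
```

An ongoing relation then attaches such intervals to tuples via its reference time attribute, so the result stays correct as the clock advances instead of silently expiring.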
【Keywords】: Computer bugs; Query processing; Time-domain analysis; Database systems
【Paper Link】 【Pages】:1441-1452
【Authors】: Huan Li ; Hua Lu ; Muhammad Aamir Cheema ; Lidan Shou ; Gang Chen
【Abstract】: Indoor mobility semantics analytics can greatly benefit many pertinent applications. Existing semantic annotation methods mainly focus on outdoor space and require extra knowledge such as POI category or human activity regularity. However, these conditions are difficult to meet in indoor venues with relatively small extents but complex topology. This work studies the annotation of indoor mobility semantics that describe an object's mobility event (what) at a semantic indoor region (where) during a time period (when). A coupled conditional Markov network (C2MN) is proposed with a set of feature functions carefully designed by incorporating indoor topology and mobility behaviors. C2MN is able to capture probabilistic dependencies among positioning records, semantic regions, and mobility events jointly. Nevertheless, the correlation of regions and events hinders the parameters learning. Therefore, we devise an alternate learning algorithm to enable the parameter learning over correlated variables. The extensive experiments demonstrate that our C2MN-based semantic annotation is efficient and effective on both real and synthetic indoor mobility data.
【Keywords】: Semantics; Markov random fields; Labeling; Correlation; 1/f noise; Topology; Probabilistic logic
【Paper Link】 【Pages】:1453-1464
【Authors】: Keyu Yang ; Yunjun Gao ; Lei Liang ; Bin Yao ; Shiting Wen ; Gang Chen
【Abstract】: There is an emerging trend of integrating machine learning (ML) techniques into database systems (DB). Almost all ML toolkits assume that the input of ML algorithms is a single table, even though many real-world datasets are stored as multiple tables due to normalization in the DB. Thus, data scientists have to perform joins before learning an ML model. This strategy is called learning after joins, which incurs the redundancy avoided by normalization. In the area of ML, the Support Vector Machine (SVM) is one of the most standard classification tools. In this paper, we focus on the factorized SVM with Gaussian kernels over normalized data. We present factorized learning approaches for two main SVM optimization methods, i.e., Gradient Descent (GD) and Sequential Minimal Optimization (SMO), by factorizing the Gaussian kernel function computation. Furthermore, we transform the normalized data into matrices, and boost the efficiency of SVM learning via linear algebra operations. Extensive experiments with nine real normalized data sets demonstrate the efficiency and scalability of our proposed approaches.
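The factorization rests on an algebraic identity: since the squared distance over a joined feature vector splits per table, exp(-γ(‖x_R−y_R‖² + ‖x_S−y_S‖²)) = exp(-γ‖x_R−y_R‖²)·exp(-γ‖x_S−y_S‖²), so the R-side kernel can be computed once per distinct key pair and reused across all matching fact rows instead of recomputing it over a materialized join. The tiny schema below is an illustrative assumption:

```python
import math

def gauss(u, v, gamma=0.1):
    """Gaussian (RBF) kernel exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

# Normalized data: fact table S references dimension table R via a key.
R = {1: (3.0, 4.0), 2: (0.0, 1.0)}            # key -> R-side features
S = [(1, (2.0,)), (1, (5.0,)), (2, (2.0,))]   # (key, S-side features)

def kernel_joined(i, j, gamma=0.1):
    """Baseline: materialize the join, one kernel over the full vector."""
    (fi, si), (fj, sj) = S[i], S[j]
    return gauss(R[fi] + si, R[fj] + sj, gamma)

def kernel_factorized(i, j, gamma=0.1):
    """Factorized: product of per-table kernels. gauss(R[fi], R[fj]) depends
    only on the key pair, so it can be cached across all matching S rows."""
    (fi, si), (fj, sj) = S[i], S[j]
    return gauss(R[fi], R[fj], gamma) * gauss(si, sj, gamma)
```

In GD and SMO the kernel matrix dominates the cost, so caching the per-dimension-table factors is where the paper's savings over learning-after-joins come from.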
【Keywords】: Support vector machines; Kernel; Databases; Motion pictures; Optimization; Training; Machine learning
【Paper Link】 【Pages】:1465-1476
【Authors】: Jiyuan Zhang ; Yi Lu ; Daniele G. Spampinato ; Franz Franchetti
【Abstract】: Set intersection is an important operation and widely used in both database and graph analytics applications. However, existing state-of-the-art set intersection methods only consider the size of input sets and fail to optimize for the case in which the intersection size is small. In real-world scenarios, the size of most intersections is usually orders of magnitude smaller than the size of the input sets, e.g., keyword search in databases and common neighbor search in graph analytics. In this paper, we present FESIA, a new set intersection approach on modern CPUs. The time complexity of our approach is O(n/√w + r), in which w is the SIMD width, and n and r are the size of input sets and intersection size, respectively. The key idea behind FESIA is that it first uses bitmaps to filter out unmatched elements from the input sets, and then selects suitable specialized kernels (i.e., small function blocks) at runtime to compute the final intersection on each pair of bitmap segments. In addition, all data structures in FESIA are designed to take advantage of SIMD instructions provided by vector ISAs with various SIMD widths, including SSE, AVX, and the latest AVX512. Our experiments on both real-world and synthetic datasets show that our intersection method achieves more than an order of magnitude better performance than conventional scalar implementations, and up to 4x better performance than state-of-the-art SIMD implementations.
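The bitmap-filtering idea - skip element comparisons wherever the two sets' bit signatures cannot overlap - can be sketched in scalar Python. This is only a stand-in for FESIA's segmented bitmaps and runtime-selected SIMD kernels; the hash, segment, and fallback choices here are illustrative:

```python
def make_bitmap(s, bits=64):
    """Hash each element to one bit of a small signature."""
    bm = 0
    for x in s:
        bm |= 1 << (hash(x) % bits)
    return bm

def filtered_intersect(a, b, bits=64):
    """Bitmap-filtered intersection: an element whose hash bit is absent
    from the other set's bitmap cannot be in the intersection and is
    skipped before any element comparison. Surviving candidates (possible
    false positives) are verified exactly."""
    common = make_bitmap(a, bits) & make_bitmap(b, bits)
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    large = set(large)
    return sorted(x for x in small
                  if (common >> (hash(x) % bits)) & 1 and x in large)
```

When the true intersection is far smaller than the inputs - the common case the paper targets - most elements fail the one-bit test, which is why the r term in O(n/√w + r) stays small.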
【Keywords】: Data structures; Time complexity; Acceleration; Program processors; Parallel processing; Databases
【Paper Link】 【Pages】:1477-1488
【Authors】: Philipp Fent ; Alexander van Renen ; Andreas Kipf ; Viktor Leis ; Thomas Neumann ; Alfons Kemper
【Abstract】: While hardware and software improvements greatly accelerated modern database systems' internal operations, the decades-old stream-based Socket API for external communication is still unchanged. We show experimentally that, for modern high-performance systems, networking has become a performance bottleneck. Therefore, we argue that the communication stack needs to be redesigned to fully exploit modern hardware - as has already happened to most other database system components. We propose L5, a high-performance communication layer for database systems. L5 rethinks the flow of data in and out of the database system and is based on direct memory access techniques for intra-datacenter (RDMA) and intra-machine communication (Shared Memory). With L5, we provide a building block to accelerate ODBC-like interfaces with a unified and message-based communication framework. Our results show that using interconnects like RDMA (InfiniBand), RoCE (Ethernet), and Shared Memory (IPC), L5 can largely eliminate the network bottleneck for database systems.
【Keywords】: Sockets; Database systems; Servers; Hardware; Message systems; Throughput
【Paper Link】 【Pages】:1489-1500
【Authors】: Zoi Kaoudi ; Jorge-Arnulfo Quiané-Ruiz ; Bertty Contreras-Rojas ; Rodrigo Pardo-Meza ; Anis Troudi ; Sanjay Chawla
【Abstract】: Cost-based optimization is widely known to suffer from a major weakness: administrators spend a significant amount of time to tune the associated cost models. This problem only gets exacerbated in cross-platform settings as there are many more parameters that need to be tuned. In the era of machine learning (ML), the first step to remedy this problem is to replace the cost model of the optimizer with an ML model. However, such a solution brings in two major challenges. First, the optimizer has to transform a query plan into a vector millions of times during plan enumeration, incurring very high overhead. Second, a lot of training data is required to effectively train the ML model. We overcome these challenges in Robopt, a novel vector-based optimizer we have built for Rheem, a cross-platform system. Robopt not only uses an ML model to prune the search space but also bases the entire plan enumeration on a set of algebraic operations that operate on vectors, which are a natural fit to the ML model. This leads to both speed-up and scale-up of the enumeration process by exploiting modern CPUs via vectorization. We also accompany Robopt with a scalable training data generator for building its ML model. Our evaluation shows that (i) the vector-based approach is more efficient and scalable than simply using an ML model and (ii) Robopt matches and, in some cases, improves on Rheem's cost-based optimizer in choosing good plans without requiring any tuning effort.
【Keywords】: query optimization; machine learning; cross-platform data processing; polystores
【Paper Link】 【Pages】:1501-1512
【Authors】: Haitao Yuan ; Guoliang Li ; Ling Feng ; Ji Sun ; Yue Han
【Abstract】: Materializing views is an important method to reduce redundant computation in a DBMS, especially for processing large-scale analytical queries. However, many existing methods still need DBAs to manually generate materialized views, which does not scale to a large number of database instances, especially on cloud databases. To address this problem, we propose an automatic view generation method which judiciously selects "highly beneficial" subqueries to generate materialized views. However, there are two challenges. (1) How to estimate the benefit of using a materialized view for a query? (2) How to select optimal subqueries to generate materialized views? To address the first challenge, we propose a neural-network-based method to estimate the benefit of using a materialized view to answer a query. In particular, we extract significant features from different perspectives and design effective encoding models to transform these features into hidden representations. To address the second challenge, we model this problem as an ILP (Integer Linear Programming) problem, which aims to maximize the utility by selecting optimal subqueries to materialize. We design an iterative optimization method to select subqueries to materialize. However, this method cannot guarantee the convergence of the solution. To address this issue, we model the iterative optimization process as an MDP (Markov Decision Process) and use a deep reinforcement learning model to solve the problem. Extensive experiments show that our method outperforms existing solutions by 28.4%, 8.8%, and 31.7% on three real-world datasets.
【Keywords】: Feature extraction; Optimization; Machine learning; Computational modeling; Data models; Learning (artificial intelligence); Encoding
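To make the selection objective concrete, here is a hypothetical greedy sketch of benefit-driven subquery selection under a storage budget. The paper itself formulates this as an ILP solved via iterative optimization and deep reinforcement learning; greedy benefit-per-cost ranking stands in only as an illustration of the trade-off being optimized.

```python
def select_views(subqueries, budget):
    """Greedily pick subqueries with the best benefit-per-cost ratio
    until the storage budget is exhausted.
    subqueries: list of (name, estimated_benefit, storage_cost)."""
    chosen, used = [], 0
    # Rank candidates by estimated benefit per unit of storage.
    for name, benefit, cost in sorted(
            subqueries, key=lambda t: t[1] / t[2], reverse=True):
        if used + cost <= budget:
            chosen.append(name)
            used += cost
    return chosen
```

In the paper's setting, the `estimated_benefit` values would come from the neural benefit-estimation model rather than being given.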
【Paper Link】 【Pages】:1513-1524
【Authors】: Zhipeng Zhang ; Wentao Wu ; Jiawei Jiang ; Lele Yu ; Bin Cui ; Ce Zhang
【Abstract】: Distributed machine learning (ML) has triggered tremendous research interest in recent years. Stochastic gradient descent (SGD) is one of the most popular algorithms for training ML models, and has been implemented in almost all distributed ML systems, such as Spark MLlib, Petuum, MXNet, and TensorFlow. However, current implementations often incur huge communication and memory overheads when it comes to large models. One important reason for this inefficiency is the row-oriented scheme (RowSGD) that existing systems use to partition the training data, which forces them to adopt a centralized model management strategy that leads to a vast amount of data exchange over the network. We propose a novel, column-oriented scheme (ColumnSGD) that partitions training data by columns rather than by rows. As a result, the ML model can be partitioned by columns as well, leading to a distributed configuration where individual data and model partitions can be collocated on the same machine. Following this locality property, we develop a simple yet powerful computation framework that significantly reduces communication overheads and memory footprints compared to RowSGD, for large-scale ML models such as generalized linear models (GLMs) and factorization machines (FMs). We implement ColumnSGD on top of Apache Spark, and study its performance both analytically and experimentally. Experimental results on both public and real-world datasets show that ColumnSGD is up to 930× faster than MLlib, 63× faster than Petuum, and 14× faster than MXNet.
【Keywords】: Conferences; Data engineering; Radio frequency
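A toy Python/NumPy sketch of the column-partitioned SGD idea for a least-squares model. Each "worker" holds a column slice of the data and the matching slice of the model, so only the small partial-prediction vectors cross the network; the `X_parts`/`w_parts` layout is illustrative, and the real system runs distributed on Spark.

```python
import numpy as np

def columnsgd_step(X_parts, w_parts, y, lr=0.1):
    """One full-batch gradient step under column partitioning:
    worker k holds column slice X_parts[k] and model slice w_parts[k]."""
    # Each worker computes partial predictions from its local columns.
    partial = [Xk @ wk for Xk, wk in zip(X_parts, w_parts)]
    residual = sum(partial) - y  # aggregated across workers (small vector)
    # Each worker updates its own model slice locally; no model shipping.
    for Xk, wk in zip(X_parts, w_parts):
        wk -= lr * (Xk.T @ residual) / len(y)
    return w_parts
```

The key contrast with RowSGD is that the exchanged object is the length-|batch| residual, not the (potentially huge) model vector.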
【Paper Link】 【Pages】:1525-1536
【Authors】: Harald Bögeholz ; Michael Brand ; Radu-Alexandru Todor
【Abstract】: We describe a Big Data-practical, SQL-implementable algorithm for efficiently determining connected components for graph data stored in a Massively Parallel Processing (MPP) relational database. The algorithm described is a linear-space, randomised algorithm, always terminating with the correct answer but subject to a stochastic running time, such that for any ε > 0 and any input graph G = 〈V,E〉 the algorithm terminates after O(log|V|) SQL queries with probability of at least 1 - ε, which we show empirically to translate to a quasi-linear runtime in practice.
【Keywords】: Big Data; data science; relational databases; SQL; distributed databases; distributed algorithms; graph theory; blockchain
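A plain-Python sketch of the relational view of connected components: each round corresponds to a self-join of a label table with the edge table, keeping the minimum label seen. Note this simplified variant needs O(diameter) rounds, whereas the paper's randomized algorithm terminates after O(log|V|) SQL queries with high probability.

```python
def connected_components(edges, nodes):
    """Min-label propagation, the building block of SQL-style
    connected components: repeat 'label = min over neighbors'
    until no label changes."""
    label = {v: v for v in nodes}
    changed = True
    while changed:  # each pass ~ one join of labels with edges
        changed = False
        for u, v in edges:
            m = min(label[u], label[v])
            if label[u] != m or label[v] != m:
                label[u] = label[v] = m
                changed = True
    return label
```

In SQL, the inner loop becomes a single `JOIN` + `GROUP BY ... MIN(...)` query per round, which is what makes the approach natural on an MPP database.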
【Paper Link】 【Pages】:1537-1548
【Authors】: Shuhao Zhang ; Yingjun Wu ; Feng Zhang ; Bingsheng He
【Abstract】: Recent data stream processing systems (DSPSs) can achieve excellent performance when processing large volumes of data under tight latency constraints. However, they sacrifice support for concurrent state access that eases the burden of developing stateful stream applications. Recently, some have proposed managing concurrent state access during stream processing by modeling state accesses as transactions. However, these proposals are realized with locks, incurring serious contention overhead. The coarse-grained processing paradigm they adopt magnifies contention issues and does not exploit modern multicore architectures to their full potential. This paper introduces TStream, a novel DSPS supporting efficient concurrent state access on multicore processors. Transactional semantics is employed as in previous work, but scalability is greatly improved due to two novel designs: 1) dual-mode scheduling, which exposes more parallelism opportunities, and 2) dynamic restructuring execution, which aggressively exploits the parallelism opportunities from dual-mode scheduling without centralized lock contention. To validate our proposal, we evaluate TStream with a benchmark of four applications on a modern multicore machine. Experimental results show that 1) TStream achieves up to 4.8 times higher throughput with similar processing latency compared to the state of the art and 2) unlike prior solutions, TStream is highly tolerant of varying application workloads such as key skewness and multi-partition state accesses.
【Keywords】: Roads; Multicore processing; Schedules; Scalability; Semantics; Parallel processing; Dynamic scheduling
【Paper Link】 【Pages】:1549-1557
【Authors】: Jiawei Jiang ; Pin Xiao ; Lele Yu ; Xiaosen Li ; Jiefeng Cheng ; Xupeng Miao ; Zhipeng Zhang ; Bin Cui
【Abstract】: Spark has been extensively used in many applications at Tencent, due to its easy deployment, pipeline capability, and close integration with the Hadoop ecosystem. As the graph computing engine of Spark, GraphX is also widely deployed to process large-scale graph data in Tencent. However, when the size of the graph data reaches billion-scale, GraphX encounters serious performance degradation. Worse, GraphX cannot support the rising advancement of graph embedding (GE) and graph neural network (GNN) algorithms. To address these challenges, we develop a new graph processing system, called PSGraph, which uses Spark executors and PyTorch to perform computation, and develops a distributed parameter server to store frequently accessed models. PSGraph can train extremely large-scale graph data in Tencent with the parameter server architecture, and enables the training of GE and GNN algorithms. Moreover, PSGraph still benefits from the advantages of Spark by staying inside the Spark ecosystem, and can directly replace GraphX without modification to the existing application framework. Our experiments show that PSGraph outperforms GraphX significantly.
【Keywords】: graph algorithm; Spark; parameter server
【Paper Link】 【Pages】:1558-1569
【Authors】: Ruiyuan Li ; Huajun He ; Rubin Wang ; Yuchuan Huang ; Junwen Liu ; Sijie Ruan ; Tianfu He ; Jie Bao ; Yu Zheng
【Abstract】: With the prevalence of positioning techniques, a prodigious amount of spatio-temporal data is generated constantly. To effectively support sophisticated urban applications based on spatio-temporal data, e.g., location-based services, an efficient, scalable, update-enabled, and easy-to-use spatio-temporal data management system is desirable. This paper presents JUST, i.e., the JD Urban Spatio-Temporal data engine, which can efficiently manage big spatio-temporal data in a convenient way. JUST incorporates the distributed NoSQL data store Apache HBase as the underlying storage, GeoMesa as the spatio-temporal data indexing tool, and Apache Spark as the execution engine. We creatively design two indexing techniques, i.e., Z2T and XZ2T, which accelerate spatio-temporal queries tremendously. Furthermore, we introduce a compression mechanism, which not only greatly reduces the storage cost, but also improves the query efficiency. To make JUST easy to use, we design and implement a complete SQL engine, with which all operations can be performed through a SQL-like query language, i.e., JustQL. JUST also inherently supports new data insertions and historical data updates without index reconstruction. JUST is deployed as a PaaS in JD with multi-user support. Many applications have been developed based on the SDKs provided by JUST. Extensive experiments are carried out against six state-of-the-art distributed spatio-temporal data management systems based on two real datasets and one synthetic dataset. The results show that JUST has competitive query performance and is much more scalable than these systems.
【Keywords】: Sparks; Distributed databases; Engines; Spatial databases; Indexing; Scalability
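A small sketch of the space-filling-curve encoding behind indexes like Z2T: interleave the bits of the two spatial coordinates (Z-order) and prefix a coarse time bucket, so rows close in time and space sort near each other in the key space. The key layout below is illustrative, not JUST's exact on-disk format.

```python
def interleave(x, y, bits=16):
    """Morton/Z-order interleaving of two grid coordinates."""
    z = 0
    for i in range(bits):
        # bit i of x goes to position 2i, bit i of y to position 2i+1
        z |= (x >> i & 1) << (2 * i) | (y >> i & 1) << (2 * i + 1)
    return z

def z2t_key(t_bucket, x, y):
    """Time-prefixed Z-order key: the coarse time bucket dominates the
    sort order, and the Z value clusters spatially nearby points."""
    return (t_bucket << 32) | interleave(x, y)
```

A range scan on such keys touches contiguous key ranges for a spatio-temporal box query, which is what makes the scheme a good fit for an ordered store like HBase.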
【Paper Link】 【Pages】:1570-1578
【Authors】: Sukhada Pendse ; Vasudha Krishnaswamy ; Kartik Kulkarni ; Yunrui Li ; Tirthankar Lahiri ; Vivekanandhan Raja ; Jing Zheng ; Mahesh Girkar ; Akshay Kulkarni
【Abstract】: Oracle Database In-Memory (DBIM) provides orders of magnitude speedup for analytic queries with its highly compressed, transactionally consistent, memory-optimized Column Store. Customers can use Oracle DBIM for making real-time decisions by analyzing vast amounts of data at blazingly fast speeds. Active Data Guard (ADG) is Oracle's comprehensive solution for high-availability and disaster recovery for the Oracle Database. Oracle ADG eliminates the high cost of idle redundancy by allowing reporting applications, ad-hoc queries and data extracts to be offloaded to the synchronized, physical Standby database replicated using Oracle ADG. In Oracle 12.2, we extended the DBIM advantage to Oracle ADG architecture. DBIM-on-ADG significantly boosts the performance of analytic, read-only workloads running on the physical Standby database, while the Primary database continues to process high-speed OLTP workloads. Customers can partition their data across the In-Memory Column Stores on the Primary and Standby databases based on access patterns, and reap the benefits of fault-tolerance as well as workload isolation without compromising on critical performance SLAs. In this paper, we explore and address the key challenges involved in building the DBIM-on-ADG infrastructure, including synchronized maintenance of the In-Memory Column Store on the Standby database, with high-speed OLTP activity continuously modifying data on the Primary database.
【Keywords】: OLAP; Standby database; In-Memory Analytics
【Paper Link】 【Pages】:1579-1590
【Authors】: Arun Swami ; Sriram Vasudevan ; Joojay Huyn
【Abstract】: Many organizations process big data for important business operations and decisions. Hence, data quality greatly affects their success. Data quality problems continue to be widespread, costing US businesses an estimated $600 billion annually. To date, addressing data quality in production environments still poses many challenges: easily defining properties of high-quality data; validating production-scale data in a timely manner; debugging poor quality data; designing data quality solutions to be easy to use, understand, and operate; and designing data quality solutions to easily integrate with other systems. Current data validation solutions do not comprehensively address these challenges. To address data quality in production environments at LinkedIn, we developed Data Sentinel, a declarative production-scale data validation platform. In a simple and well-structured configuration, users declaratively specify the desired data checks. Then, Data Sentinel performs these data checks and writes the results to an easily understandable report. Furthermore, Data Sentinel provides well-defined schemas for the configuration and report. This makes it easy for other systems to interface or integrate with Data Sentinel. To make Data Sentinel even easier to use, understand, and operate in production environments, we provide Data Sentinel Service (DSS), a complementary system to help specify data checks; schedule, deploy, and tune data validation jobs; and understand data checking results. The contributions of this paper include the following: 1) Data Sentinel, a declarative production-scale data validation platform successfully deployed at LinkedIn; 2) a generic design to build and deploy similar systems for production environments; and 3) experiences and lessons learned that can benefit practitioners with similar objectives.
【Keywords】: Data integrity; Production; LinkedIn; Big Data; Business; Debugging; Schedules
【Paper Link】 【Pages】:1591-1602
【Authors】: Yuan Mei ; Luwei Cheng ; Vanish Talwar ; Michael Y. Levin ; Gabriela Jacques-Silva ; Nikhil Simha ; Anirban Banerjee ; Brian Smith ; Tim Williamson ; Serhat Yilmaz ; Weitao Chen ; Guoqiang Jerry Chen
【Abstract】: The demand for stream processing at Facebook has grown as services increasingly rely on real-time signals to speed up decisions and actions. Emerging real-time applications require strict Service Level Objectives (SLOs) with low downtime and processing lag, even in the presence of failures and load variability. Addressing this challenge at Facebook scale led to the development of Turbine, a management platform designed to bridge the gap between the capabilities of existing general-purpose cluster management frameworks and Facebook's stream processing requirements. Specifically, Turbine features a fast and scalable task scheduler; an efficient predictive auto scaler; and an application update mechanism that provides fault-tolerance, atomicity, consistency, isolation, and durability. Turbine has been in production for over three years and is one of the core technologies that enabled the booming growth of stream processing at Facebook. It is currently deployed on clusters spanning tens of thousands of machines, managing several thousand streaming pipelines that process terabytes of data per second in real time. Our production experience has validated Turbine's effectiveness: its task scheduler evenly balances workload fluctuation across clusters; its auto scaler effectively and predictively handles unplanned load spikes; and the application update mechanism consistently and efficiently completes high-scale updates within minutes. This paper describes the Turbine architecture, discusses the design choices behind it, and shares several case studies demonstrating Turbine's capabilities in production.
【Keywords】: Stream Processing; Cluster Management
【Paper Link】 【Pages】:1603-1608
【Authors】: Wolfram Wingerath ; Felix Gessert ; Erik Witt ; Hannes Kuhlmann ; Florian Bücklers ; Benjamin Wollmer ; Norbert Ritter
【Abstract】: Users leave when page loads take too long. This simple fact has complex implications for virtually all modern businesses, because accelerating content delivery through caching is not as simple as it used to be. As a fundamental technical challenge, the high degree of personalization in today's Web has seemingly outgrown the capabilities of traditional content delivery networks (CDNs) which have been designed for distributing static assets under fixed caching times. As an additional legal challenge for services with personalized content, an increasing number of regional data protection laws constrain the ways in which CDNs can be used in the first place. In this paper, we present Speed Kit as a radically different approach for content distribution that combines (1) a polyglot architecture for efficiently caching personalized content with (2) a natively GDPR-compliant client proxy that handles all sensitive information within the user device. We describe the system design and implementation, explain the custom cache coherence protocol to avoid data staleness and achieve Δ-atomicity, and we share field experiences from over a year of productive use in the e-commerce industry.
【Keywords】: Web Caching; Cache Coherence; CDNs; Dynamic Data; Personalized Content; Data Privacy; Polyglot Storage
【Paper Link】 【Pages】:1609-1620
【Authors】: Shouling Ji ; Qinchen Gu ; Haiqin Weng ; Qianjun Liu ; Pan Zhou ; Jing Chen ; Zhao Li ; Raheem Beyah ; Ting Wang
【Abstract】: In this paper, we study the privacy of online health data. We present a novel online health data De-Anonymization (DA) framework, named De-Health. Leveraging two real world online health datasets WebMD and HealthBoards, we validate the DA efficacy of De-Health. We also present a linkage attack framework which can link online health/medical information to real world people. Through a proof-of-concept attack, we link 347 out of 2805 WebMD users to real world people, and find the full names, medical/health information, birthdates, phone numbers, and other sensitive information for most of the re-identified users. This clearly illustrates the fragility of the privacy of those who use online health forums.
【Keywords】: Feature extraction; Data privacy; Writing; Privacy; Diseases; Correlation
【Paper Link】 【Pages】:1621-1632
【Authors】: Xuanhua Shi ; Yipeng Zhang ; Hong Huang ; Zhenyu Hu ; Hai Jin ; Huan Shen ; Yongluan Zhou ; Bingsheng He ; Ruibo Li ; Keyong Zhou
【Abstract】: JSON is a very popular data format in many Web and enterprise applications. Recently, many data analytical systems have supported loading and querying JSON data. However, JSON parsing can be costly, and often dominates the execution time of querying JSON data. Many previous studies focus on building efficient parsers to reduce this parsing cost, while little work has been done on how to reduce the number of parsing operations. In this paper, we start with a study of a real production workload at Alibaba, which consists of over 3 million queries on JSON data. Our study reveals significant temporal and spatial correlations among those queries, which result in massive redundant parsing operations across queries. Instead of repetitively parsing the JSON data, we propose a cache system named Maxson that caches JSON query results (the values evaluated from JSONPaths) for reuse. Specifically, we develop an effective machine-learning-based predictor combining an LSTM (long short-term memory) network and a CRF (conditional random field) to determine which JSONPaths to cache given the space budget. We have implemented Maxson on top of SparkSQL. We experimentally evaluate Maxson and show that 1) Maxson is able to eliminate most of the duplicate JSON parsing overhead, and 2) Maxson improves end-to-end workload performance by 1.5-6.5×.
【Keywords】: JSON parsing; semi-structured format; data analytics system
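A toy sketch of the Maxson caching idea: materialize the values of predicted-hot JSONPaths so repeated queries skip re-parsing the document. Here a fixed `hot_paths` set stands in for the paper's LSTM+CRF predictor, and the path evaluator only handles simple dotted paths; both are simplifications for illustration.

```python
import json

class JsonPathCache:
    """Cache JSONPath results so repeated queries avoid parsing."""
    def __init__(self, hot_paths):
        self.hot_paths = set(hot_paths)  # paths predicted worth caching
        self.cache = {}                  # (doc_id, path) -> value

    def lookup(self, doc_id, raw_json, path):
        key = (doc_id, path)
        if key in self.cache:
            return self.cache[key]        # hit: no parsing needed
        value = json.loads(raw_json)      # miss: parse the document
        for part in path.lstrip("$.").split("."):
            value = value[part]
        if path in self.hot_paths:
            self.cache[key] = value       # cache only predicted-hot paths
        return value
```

Restricting the cache to predicted-hot paths is what lets the system honor a space budget while still absorbing most of the redundant parsing.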
【Paper Link】 【Pages】:1633-1644
【Authors】: Lisheng Zhao ; Jiali Mao ; Min Pu ; Guoping Liu ; Cheqing Jin ; Weining Qian ; Aoying Zhou ; Xiang Wen ; Runbo Hu ; Hua Chai
【Abstract】: Inaccurate road intersections in digital road maps can seriously affect mobile navigation and other applications. Massive traveling trajectories from thousands of vehicles enable frequent updating of road intersection topology. In this paper, we expand the road intersection detection issue into a topology calibration problem for the road intersection influence zone. Distinct from existing road intersection update methods, we not only determine the location and coverage of a road intersection, but also identify incorrect or missing turning paths within the whole influence zone based on trajectories unmatched to the existing map. The important challenges of the calibration issue include that trajectories are mixed with exceptional data and that road intersections are of different sizes and shapes. To address these challenges, we propose a three-phase calibration framework, called CITT, composed of trajectory quality improvement, core zone detection, and topology calibration within the road intersection influence zone. With these components, it can automatically obtain a high-quality topology of the road intersection influence zone. Extensive experiments against state-of-the-art methods using trajectory data obtained from Didi Chuxing and Chicago campus shuttles demonstrate that CITT has strong stability and robustness and significantly outperforms the existing methods.
【Keywords】: road intersection; influence zone; core zone; quality improving; centerline fitting
【Paper Link】 【Pages】:1645-1656
【Authors】: Qitao Shi ; Ya-Lin Zhang ; Longfei Li ; Xinxing Yang ; Meng Li ; Jun Zhou
【Abstract】: Machine learning techniques have been widely applied in Internet companies for various tasks, acting as an essential driving force, and feature engineering has been generally recognized as a crucial task when constructing machine learning systems. Recently, a growing effort has been made in the development of automatic feature engineering methods, so that the substantial and tedious manual effort can be liberated. However, for industrial tasks, the efficiency and scalability of these methods are still far from satisfactory. In this paper, we propose a staged method named SAFE (Scalable Automatic Feature Engineering), which provides excellent efficiency and scalability, along with requisite interpretability and promising performance. Extensive experiments are conducted, and the results show that the proposed method provides prominent efficiency and competitive effectiveness compared with other methods. Moreover, the adequate scalability of the proposed method enables it to be deployed in large-scale industrial tasks.
【Keywords】: feature engineering; automatic machine learning
【Paper Link】 【Pages】:1657-1666
【Authors】: Tong Zhang ; Baoliang Cui ; Zhen Cui ; Haikuan Huang ; Jian Yang ; Hongbo Deng ; Bo Zheng
【Abstract】: In this work, a new e-commerce search service named text-picture shopping guide (TPSG) is investigated and deployed to one of the most popular shopping platforms, Taobao. Different from traditional services that only contain text options, the TPSG provides pairs of text terms and user-friendly pictures for shopping guidance, named text-picture options (TPOs). Instead of manually labeling pictures, we aim to automatically recommend personalized pictures in TPOs. To this end, we build a large-scale graph model on a great amount of data about users, pictures, and terms. Accordingly, a cross-graph convolution learning (CGCL) method is proposed to facilitate accurate and efficient inference on the constructed graph. To isolate the personalized preferences of users for commodities, we factorize the entire mixture-relation graph involving attributes/relations of users and commodities into the user graph, the commodity graph, and the cross user-commodity graph, which characterizes just the preferences. Further, we introduce powerful graph convolution to learn more effective representations of these graphs. To reduce the computation burden, we generalize graph convolution and propose a tensor graph convolution method to learn representations on cross graphs. We conduct extensive offline and online experiments on large-scale datasets. The results show that the proposed CGCL is very effective and that the TPO recommendation method outperforms manual/advanced selection methods.
【Keywords】: large-scale online e-commerce; text-picture shopping guide; text-picture option; cross-graph convolution learning
【Paper Link】 【Pages】:1667-1676
【Authors】: Andreas Pfadler ; Huan Zhao ; Jizhe Wang ; Lifeng Wang ; Pipei Huang ; Dik Lun Lee
【Abstract】: In recent years, embedding models based on the skip-gram algorithm have been widely applied to real-world recommendation systems (RSs). When designing embedding-based methods for recommendation at Taobao, there are three main challenges: scalability, sparsity, and cold start. The first problem is inherently caused by the extremely large numbers of users and items (on the order of billions), while the remaining two problems are caused by the fact that most items have only very few (or no) user interactions. To address these challenges, in this work, we present a flexible and highly scalable Side Information (SI) enhanced Skip-Gram (SISG) framework, which is deployed at Taobao. SISG overcomes the drawbacks of existing embedding-based models by modeling user metadata and capturing asymmetries of user behavior. Furthermore, as training SISG can be performed using any SGNS implementation, we present our production deployment of SISG on a custom-built word2vec engine, which allows us to compute item and SI embedding vectors for billion-scale sets of products in a joint semantic space on a daily basis. Finally, using offline and online experiments we demonstrate the significant superiority of SISG over our previously deployed framework, EGES, and a well-tuned CF, as well as present evidence supporting our scalability claims.
【Keywords】: large-scale recommendation; embedding-based methods
【Paper Link】 【Pages】:1677-1688
【Authors】: Zhao Li ; Xin Shen ; Yuhang Jiao ; Xuming Pan ; Pengcheng Zou ; Xianling Meng ; Chengwei Yao ; Jiajun Bu
【Abstract】: E-commerce appeals to a multitude of online shoppers by providing personalized experiences and has become indispensable in our daily life. Accurately predicting user preferences and recommending favorable items play a crucial role in improving key tasks such as Click-Through Rate (CTR) and Conversion Rate (CVR) in order to increase commercial value. Some state-of-the-art collaborative filtering methods exploiting non-linear interactions on a user-item bipartite graph are able to learn better user and item representations with Graph Neural Networks (GNNs), but they do not learn hierarchical representations of graphs because they are inherently flat. Hierarchical representation is reportedly favorable for making more personalized item recommendations, both in terms of behaviorally similar users in the same community and in the context of a topic-driven taxonomy. However, advanced approaches in this regard either consider only linear interactions, adopt a single-level community, or are computationally expensive. To address these problems, we propose a novel method with a Hierarchical bipartite Graph Neural Network (HiGNN) to handle large-scale e-commerce tasks. By alternately stacking multiple GNN modules and a deterministic clustering algorithm, HiGNN is able to efficiently obtain hierarchical user and item embeddings simultaneously, and effectively predict user preferences on a larger scale. Extensive experiments on real-world e-commerce datasets demonstrate that HiGNN achieves significant improvements over several popular methods. Moreover, we deploy HiGNN in Taobao, one of the largest e-commerce platforms, with hundreds of millions of users and items, for a series of large-scale item recommendation tasks. The results also illustrate that HiGNN is promising and scalable in real-world applications.
【Keywords】: Hierarchical Representation; Graph Neural Network; Bipartite Graph; E-commerce Recommendations
【Paper Link】 【Pages】:1689-1700
【Authors】: Chonggang Song ; Qian Lin ; Guohui Ling ; Zongyi Zhang ; Hongzhao Chen ; Jun Liao ; Chuan Chen
【Abstract】: Relationships in online social networks often imply social connections in real life. An accurate understanding of relationship types benefits many applications, e.g. social advertising and recommendation. Some recent attempts have been proposed to classify user relationships into predefined types with the help of pre-labeled relationships or abundant interaction features on relationships. Unfortunately, both relationship feature data and label data are very sparse in real social platforms like WeChat, rendering existing methods inapplicable. In this paper, we present an in-depth analysis of WeChat relationships to identify the major challenges for the relationship classification task. To tackle the challenges, we propose a Local Community-based Edge Classification (LoCEC) framework that classifies user relationships in a social network into real-world social connection types. LoCEC enforces a three-phase processing, namely local community detection, community classification and relationship classification, to address the sparsity issue of relationship features and relationship labels. Moreover, LoCEC is designed to handle large-scale networks by allowing parallel and distributed processing. We conduct extensive experiments on the real-world WeChat network with hundreds of billions of edges to validate the effectiveness and efficiency of LoCEC.
【Keywords】: WeChat; edge classification; social networks
【Paper Link】 【Pages】:1701-1712
【Authors】: Jiaping Gui ; Ding Li ; Zhengzhang Chen ; Junghwan Rhee ; Xusheng Xiao ; Mu Zhang ; Kangkook Jee ; Zhichun Li ; Haifeng Chen
【Abstract】: While backtracking analysis has been successful in assisting the investigation of complex security attacks, it faces a critical dependency explosion problem. To address this problem, security analysts currently need to tune backtracking analysis manually with different case-specific heuristics. However, existing systems fail to fulfill two important system requirements for effective backtracking analysis. First, flexible abstractions are needed to express various types of heuristics. Second, the system needs to be responsive in providing updates so that the progress of backtracking analysis can be frequently inspected, which typically involves multiple rounds of manual tuning. In this paper, we propose a novel system, APTrace, to meet both of the above requirements. As we demonstrate in the evaluation, security analysts can effectively express heuristics to reduce more than 99.5% of irrelevant events in the backtracking analysis of real-world attack cases. To improve the responsiveness of backtracking analysis, we present a novel execution-window partitioning algorithm that significantly reduces the waiting time between two consecutive updates (in particular, a 57-fold reduction for the top 1% of waiting times).
【Keywords】: Backtracking analysis; domain language; expressiveness; responsiveness
【Paper Link】 【Pages】:1713-1717
【Authors】: Qifei Li ; Zhicong Huang ; Wen-jie Lu ; Cheng Hong ; Hunter Qu ; Hui He ; Weizhe Zhang
【Abstract】: Homomorphic Encryption (HE) allows encrypted data to be processed without decryption, which could maximize the protection of user privacy without affecting data utility. Thanks to strides made by cryptographers in the past few years, the efficiency of HE has been drastically improved, and machine learning on homomorphically encrypted data has become possible. Several works have explored machine learning based on HE, but most of them are restricted to the outsourced scenario, where all the data comes from a single data owner. We propose HomoPAI, an HE-based secure collaborative machine learning system, enabling a more promising scenario, where data from multiple data owners can be securely processed. Moreover, we integrate our system with the popular MPI framework to achieve parallel HE computations. Experiments show that our system can train a logistic regression model on millions of homomorphically encrypted records in less than two minutes.
【Keywords】: homomorphic encryption; machine learning
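The additive homomorphism that makes such encrypted training possible can be illustrated with a textbook Paillier sketch; the tiny primes and fixed randomness below are for readability only and are not representative of HomoPAI's actual scheme or parameters.

```python
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

# Toy Paillier keypair with deliberately tiny primes (illustration only;
# real HE deployments use thousand-bit moduli and hardened parameters).
p, q = 17, 19
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = lcm(p - 1, q - 1)
mu = pow(lam, -1, n)  # valid because g = n + 1 makes L(g^lam mod n^2) = lam

def L(x):
    return (x - 1) // n

def encrypt(m, r):
    # r must be coprime to n; randomness is fixed here for determinism
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

a, b = encrypt(20, 5), encrypt(22, 7)
# Multiplying ciphertexts adds the underlying plaintexts (mod n)
assert decrypt((a * b) % n2) == 42
```

Because ciphertext multiplication adds plaintexts, a party can aggregate encrypted partial sums or gradients without ever seeing the underlying data, which is the property collaborative HE training builds on.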
【Paper Link】 【Pages】:1718-1721
【Authors】: Qian Lin ; Kaiyuan Yang ; Tien Tuan Anh Dinh ; Qingchao Cai ; Gang Chen ; Beng Chin Ooi ; Pingcheng Ruan ; Sheng Wang ; Zhongle Xie ; Meihui Zhang ; Olafs Vandans
【Abstract】: Data collaboration activities typically require systematic or protocol-based coordination to be scalable. Git, an effective enabler of collaborative coding, has proven successful in countless projects around the world. Hence, it is appealing to apply the Git philosophy to general data collaboration beyond coding; we call this Git for data. However, the original Git design handles data at file granularity, which is considered too coarse-grained for many database applications. We argue that Git for data should be co-designed with database systems. To this end, we developed ForkBase to make Git for data practical. ForkBase is a distributed, immutable storage system designed for data version management and collaborative data operations. In this demonstration, we show how ForkBase can greatly facilitate collaborative data management and how its novel data deduplication technique can improve storage efficiency for archiving massive data versions.
【Keywords】: Indexes; Collaboration; History; Distributed databases; Semantics; Access control
【Paper Link】 【Pages】:1722-1725
【Authors】: Boxuan Li ; Reynold Cheng ; Jiafeng Hu ; Yixiang Fang ; Min Ou ; Ruibang Luo ; Kevin Chen-Chuan Chang ; Xuemin Lin
【Abstract】: Large networks with labeled nodes are prevalent in various applications, such as biological graphs, social networks, and e-commerce graphs. To extract insight from this rich information source, we propose MC-Explorer, an advanced analysis and visualization system. A highlight of MC-Explorer is its ability to discover motif-cliques from a graph with labeled nodes. A motif, such as a 3-node triangle, is a fundamental building block of a graph. A motif-clique is a "complete" subgraph of a network with respect to a desired higher-order connection pattern. For example, on a large biological graph, we found motif-cliques that disclose new side effects of a drug and potential drugs for treating diseases. MC-Explorer includes online and interactive facilities for exploring a large labeled network through the use of motif-cliques. We will demonstrate how MC-Explorer can facilitate the analysis and visualization of a labeled biological network. An online demo video of MC-Explorer can be accessed at https://www.dropbox.com/s/vkalumc28wqp8yl/demo.mov.
【Keywords】: Drugs; Obesity; Breast cancer; Data analysis
【Paper Link】 【Pages】:1726-1729
【Authors】: Nico Schäfer ; Sebastian Michel
【Abstract】: We describe the demonstration of JODA (Json On Demand Analytics), an approach to handling large amounts of JSON documents in a vertically scalable manner. With JODA, the user can import, filter, transform, aggregate, group, and export documents with a simple PIG-style query language, offering fast execution speed. This is achieved by utilizing a multithreaded architecture over disjoint, read-only containers of data that are processed in parallel, similar to what RDDs are to Spark. Containers are augmented with auxiliary information like Bloom filters and adaptive indices, and all containers are processed in parallel by individual threads. By avoiding locks, latches, and synchronization beyond simple thread pooling, we do not risk contention and therefore maximize resource utilization. The demonstration scenarios aim at engaging visitors with several data analytics tasks around large, real-world datasets that are to be solved with the help of JODA, and further give insights into system internals and the installation/configuration process.
【Keywords】: in-memory; json; semi-structured
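The container design described in the abstract can be sketched as follows; the Bloom-filter parameters, hashing scheme, and API names are illustrative assumptions, not JODA's actual implementation.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

class Container:
    """A read-only batch of JSON-like documents with a Bloom filter
    over their top-level keys, so whole containers can be skipped."""

    def __init__(self, docs, m=256, k=3):
        self.docs = docs          # immutable after construction
        self.m, self.k = m, k
        self.bits = 0
        for doc in docs:
            for key in doc:
                for i in self._hashes(key):
                    self.bits |= 1 << i

    def _hashes(self, key):
        for salt in range(self.k):
            h = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def may_contain(self, key):
        return all(self.bits >> i & 1 for i in self._hashes(key))

    def select(self, key):
        # The Bloom filter never yields false negatives, so skipping
        # a rejected container is always safe
        if not self.may_contain(key):
            return []
        return [d for d in self.docs if key in d]

def query(containers, key):
    # Containers are disjoint and read-only, so threads need no locks
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda c: c.select(key), containers)
    return [d for part in parts for d in part]
```

Because the containers are immutable and disjoint, the per-container scans run without any synchronization beyond the thread pool itself, mirroring the contention-free design the abstract describes.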
【Paper Link】 【Pages】:1730-1733
【Authors】: Shangwei Guo ; Yang Ji ; Ce Zhang ; Cheng Xu ; Jianliang Xu
【Abstract】: We demonstrate vCBIR, a verifiable search engine for Content-Based Image Retrieval. vCBIR allows a small or medium-sized enterprise to outsource its image database to a cloud-based service provider and ensures the integrity of query processing. Like other common data-as-a-service (DaaS) systems, vCBIR consists of three parties: (i) the image owner, who outsources its database, (ii) the service provider, who executes authenticated query processing, and (iii) the client, who issues search queries. By employing a novel query authentication scheme proposed in our prior work [4], the system not only supports cloud-based image retrieval but also generates a cryptographic proof for each query, by which the client can verify the integrity of the query results. During the demonstration, we will showcase the usage of vCBIR and provide attendees with an interactive experience of verifying query results against an untrustworthy service provider through a graphical user interface (GUI).
【Keywords】: Feature extraction; Graphical user interfaces; Query processing; Indexes; Image retrieval; Browsers
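The abstract does not detail the authentication scheme of [4], but a Merkle-tree membership proof, a standard building block of verifiable query processing, conveys the idea of a proof that a client can check against a signed digest. This is a generic sketch, not vCBIR's scheme.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    level = [h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:          # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def prove(leaves, idx):
    """Collect the sibling hashes on the path from leaf idx to the root."""
    level = [h(x) for x in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append((level[idx ^ 1], idx % 2))   # (sibling, am-I-right-child)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        idx //= 2
    return proof

def verify(leaf, proof, root):
    node = h(leaf)
    for sibling, is_right_child in proof:
        node = h(sibling + node) if is_right_child else h(node + sibling)
    return node == root
```

The owner publishes (or signs) only the root; the untrusted provider returns each result together with its proof, and any tampered result fails verification.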
【Paper Link】 【Pages】:1734-1737
【Authors】: François Lévesque ; Marielle St-Germain ; Dominique Piché ; Jean-François Gauvin ; Michel Gagnon ; Thomas Hurtut
【Abstract】: This paper introduces MusX, a visualization-based system that helps users search and explore a large multivariate bipartite graph of artists and songs. An additional tree structure for the song nodes is inherited from the musical adaptation relations. In tight collaboration with a public national library and archives institution, we propose a novel artist-centered interactive set of representations, focusing on several identified user tasks. This online system is targeted at laypersons willing to quickly navigate an artist's body of songs and explore their relationships to other artists through their involvement in song creation. In this paper, we present a detailed description of MusX along with design and technical considerations, and the demonstration scenarios we intend to present to the audience.
【Keywords】: bipartite graph; knowledge graph; exploration task; visualization
【Paper Link】 【Pages】:1738-1741
【Authors】: Qiyu Liu ; Libin Zheng ; Lei Chen
【Abstract】: As an important operator for spatial data analytics, the Region of Interest (ROI) query is of great importance in many location-based services such as event detection, location recommendation, and smart transportation. To address the challenge of conducting ROI queries on increasingly complex spatial data, we present RIDE, an efficient and effective system for generalized ROI Discovery and Exploration. Different from existing studies and systems, RIDE supports a large spectrum of region score functions and query geometries, enabling customized ROI queries for different application scenarios. This demonstration proposal introduces the basic concept of ROI queries and the key components of the RIDE system, including data storage and indexing, ROI query processing and optimization, and the user interface.
【Keywords】: ROI Queries; Region Exploration; Big Spatial Analytics
【Paper Link】 【Pages】:1742-1745
【Authors】: Yihai Xi ; Ning Wang ; Shuang Hao ; Wenyang Yang ; Li Li
【Abstract】: A data summarization of a large table can be of great help: it provides a concise and informative overview and assists the user in quickly figuring out the subject of the data. However, a high-quality summarization needs two desirable properties: presenting notable entities and achieving broad domain coverage. In this demonstration, we propose a summarizer system called PocketView that is able to create a data summarization through a pocket view of the table. The attendees will experience the following features of our system: (1) time-sensitive notability evaluation - PocketView can automatically identify notable entities according to their significance and popularity in a user-defined time period; (2) broad-coverage pocket view - our system provides a pocket view of the table without losing any domain, making it much simpler and clearer for attendees to figure out the subject than from the original table.
【Keywords】: data summarization; pocket view; notability; domain coverage
【Paper Link】 【Pages】:1746-1749
【Authors】: Qixu Gong ; Jiefei Liu ; Huiping Cao
【Abstract】: Skyline queries find the representative data points of a multi-dimensional dataset, i.e., those that are not worse than any other point in all dimensions and better in at least one. A multi-cost transportation network (MCTN) can be modeled as a multi-dimensional dataset. The MCTN-constrained skyline query (CSQ) is a type of skyline query on an MCTN where the query point and the skyline answer-objects are points of interest (POIs) that are off the network, and the answer-points need to be reached from the query point by utilizing the MCTN. CSQ is useful in many applications such as trip planning and apartment selection. For example, when a person wants to find an apartment, he/she may consider not only the price and the number of rooms of the apartment but also the cost and travel time of using public transportation from the apartment to his/her office. In this paper, we present a system to answer MCTN-constrained skyline queries, namely the CSQ system. This system is implemented as a web application, which allows users to input a query point from a web interface, obtain the skyline result using several algorithms, and display the result on the web interface. We load the POIs and public transportation networks of three cities (Los Angeles, San Francisco, and New York) into the system, and explain how users can interact with the CSQ system.
【Keywords】: Skyline queries; Transportation networks; Multi-dimensional data
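Setting the network constraint aside, the core skyline operator can be sketched in a few lines; the apartment numbers below are made up for illustration.

```python
def dominates(a, b):
    # a dominates b if a is no worse on every cost dimension and
    # strictly better on at least one (lower is better here)
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Apartments as (monthly rent, transit minutes to the office)
apartments = [(900, 40), (1200, 15), (1000, 25), (1300, 20), (950, 45)]
print(skyline(apartments))  # (1300, 20) and (950, 45) are dominated
```

This quadratic scan is only the definition made executable; the CSQ system's contribution is answering such queries when each "dimension" is itself a multi-cost shortest path over the transportation network.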
【Paper Link】 【Pages】:1750-1753
【Authors】: Chao Zhang ; Farouk Toumani ; Bastien Doreau
【Abstract】: We present SUDAF (Sharing User-Defined Aggregate Functions), a declarative framework that allows users to formulate UDAFs as mathematical expressions and use them in SQL statements. SUDAF rewrites partial aggregates of UDAFs using built-in aggregate functions and supports efficient dynamic caching and reuse of partial aggregates. Our evaluation shows that using SUDAF on top of Spark SQL can lead to one to two orders of magnitude improvement in query execution times compared to the original Spark SQL.
【Keywords】: Query processing; Query Rewriting; User-defined aggregate functions
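The rewriting idea, expressing a UDAF such as avg through mergeable built-in partial aggregates that can be cached and reused across queries, can be sketched as follows; the function names are illustrative, not SUDAF's API.

```python
# avg rewritten over built-in partial aggregates (sum, count); partials
# merge associatively, so per-partition states can be cached once and
# reused by later queries instead of rescanning the data.
def partial(chunk):
    return (sum(chunk), len(chunk))   # built-in aggregates only

def merge(p1, p2):
    return (p1[0] + p2[0], p1[1] + p2[1])

def finalize(p):
    s, c = p
    return s / c

chunks = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
cached = [partial(c) for c in chunks]          # cacheable per-partition state
assert finalize(merge(merge(cached[0], cached[1]), cached[2])) == 3.5
```

Any UDAF that decomposes this way (a finalize step over mergeable built-in partials) gets distributed execution and caching for free, which is the property SUDAF exploits on top of Spark SQL.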
【Paper Link】 【Pages】:1754-1757
【Authors】: Leshang Chen ; Susan B. Davidson
【Abstract】: The ability to cite software and give credit to its authors and contributors is increasingly important. While the number of online open-source software repositories has grown rapidly over the past few years, few are properly cited when used, due to the difficulty of creating appropriate citations and the lack of automated techniques. This paper presents GitCite, a model for software citation with version control, which enables citations to be inferred for any project component based on a small number of explicit citations attached to subdirectories/files, and an implementation that integrates with Git and GitHub. The implementation includes a browser extension and a local executable tool, which enable citations to be added to, modified in, and deleted from software project repositories and managed through functions such as fork/merge/copy.
【Keywords】: Tools; Usability; Metadata; Browsers; Computer architecture; Control systems
【Paper Link】 【Pages】:1758-1761
【Authors】: Jiaye Liu ; Jiali Mao ; Jiajun Liao ; Huiqi Hu ; Ye Guo ; Aoying Zhou
【Abstract】: The rapid development of the steel logistics industry still has not effectively addressed issues such as truck overload, order overdue, and cargo overstock. One reason lies in the limited number of trucks available for transporting large-scale cargo. More importantly, traditional methods tend to distribute cargo to trucks with the aim of maximizing the loading of each truck, but they ignore the priority level of orders and the expiration dates of cargo stored in the warehouses, both of which critically influence the profits of the steel logistics industry. Hence, an appropriate cargo distribution mechanism is needed that, under the precondition of limited transportation capacity, maximizes the delivery proportion of high-priority cargo. Recently, tremendous amounts of logistics data have been produced and keep increasing hourly on steel logistics platforms. However, no existing solution transforms such data into actionable schemes that improve cargo distribution effectiveness. This paper presents a system implementation of a smart cargo loading plan decision framework (SCLPD for short) for the steel logistics industry. Through analysis of numerous real cargo loading plans and warehouse inventories, important rules related to the cargo distribution process are extracted. Additionally, considering that different numbers of trucks arrive in different time periods, a two-layer searching mechanism consisting of a genetic algorithm and the A* algorithm, built on an adaptive time window model, is designed to ensure global optimization of the cargo loading plan for trucks across all time periods. In our demonstration, we illustrate the procedure of matching cargo and trucks in various time windows, and showcase comparison experiments between the traditional method and SCLPD measured by the delivery proportion of high-priority cargo. The effectiveness and practicality of SCLPD enable efficient cargo loading plan generation, meeting real-world requirements from steel logistics platforms.
【Keywords】: steel logistics; cargo loading plan; match; dynamic time window
【Paper Link】 【Pages】:1762-1765
【Authors】: Feiyang Xu ; Yue Ding ; Zhenhua Ling ; Xin Li ; Yunxia Li ; Shijin Wang
【Abstract】: Alzheimer's disease is a chronic neurodegenerative disease that usually starts slowly and gradually worsens over time. Although there is no cure for Alzheimer's disease yet, a number of recent studies have shown that early diagnosis and intervention can not only improve quality of life but also help to slow the progression of the disease. The Clock Drawing Test (CDT) is one of the commonly used clinical methods for screening cognitive impairment, due to its simplicity and convenience. In this paper, we introduce DCDT, a novel Clock Drawing Test system based on digital collection and intelligent analysis. We first introduce the background of AD and CDT, and then describe the DCDT system from external and internal aspects. Finally, the demonstration scenario is described briefly.
【Keywords】: Digital Clock Drawing Test; Comprehensive Evaluation Index System; Machine Learning
【Paper Link】 【Pages】:1766-1769
【Authors】: Mohammad Hossein Namaki ; Xin Zhang ; Sukhjinder Singh ; Arman Ahmed ; Armina Foroutan ; Yinghui Wu ; Anurag K. Srivastava ; Anton Kocheturov
【Abstract】: We demonstrate Kronos, a framework and system that automatically extracts highly dynamic knowledge for complex event analysis in cyber-physical systems. Kronos captures events with an anomaly-based event model, and integrates various events by correlating their temporal associations in real time, from heterogeneous, continuous cyber-physical measurement data streams. It maintains a lightweight, highly dynamic knowledge base, enabled by online, window-based ensemble learning and incremental association analysis for event detection and linkage, respectively. These algorithms incur time costs determined by the available memory, independent of the size of the streams. Exploiting the highly dynamic knowledge, Kronos supports a rich set of stream event analytical queries, including event search (keywords and query-by-example), provenance queries ("which measurements or features are responsible for detected events?"), and root cause analysis. We demonstrate how the GUI of Kronos interacts with users to support both continuous and ad-hoc queries online and enables situational awareness in cyber-power systems, communication, and traffic networks.
【Keywords】: Correlation; Transmission line measurements; Data models; Data mining; Detectors; Event detection; Phasor measurement units
【Paper Link】 【Pages】:1770-1773
【Authors】: Abdulaziz Almaslukh ; Laila Abdelhafeez ; Amr Magdy
【Abstract】: This paper demonstrates DLEEL, a research system that supports scalable spatial queries with multiple predicates on user-generated data streams, such as social media streams. Supported queries include spatial-social queries and spatial-keyword queries, which are popular in different applications but have never been addressed in the challenging environment of streaming data, where data arrives at excessively high rates. DLEEL distinguishes itself with three novel contributions: (1) Indexing spatial-social data for personalized real-time search: DLEEL is the first to address personalized queries on streaming spatial-social data through novel low-overhead indexing that scales to large amounts of data and users. The novel indexing has a hybrid storage architecture that trades off indexing overhead, memory consumption, and query latency. (2) Indexing spatial-keyword data for real-time search: DLEEL is the first to enrich existing spatial-keyword indexes with novel streaming data components. The new components reveal performance losses and gains from a system perspective, trading off system overhead against the flexibility to support a variety of queries. (3) Scalable query processing: DLEEL exploits the indexes' content to smartly prune the search space along multiple dimensions and achieve efficient query latency for its different queries over excessive numbers of data records. DLEEL is demonstrated using a stream of 5 billion real tweets collected from Twitter APIs and real query locations obtained from a popular web search engine. DLEEL has shown superior performance, serving incoming queries with an average latency of a few milliseconds while digesting hundreds of thousands of data records every second.
【Keywords】: Indexing; Real-time systems; Twitter; Query processing; Memory management
【Paper Link】 【Pages】:1774-1777
【Authors】: Peng Gao ; Xusheng Xiao ; Ding Li ; Kangkook Jee ; Haifeng Chen ; Sanjeev R. Kulkarni ; Prateek Mittal
【Abstract】: The need for countering Advanced Persistent Threat (APT) attacks has led to solutions that ubiquitously monitor system activities in each enterprise host, and perform timely abnormal system behavior detection over the stream of monitoring data. However, existing stream-based solutions lack explicit language constructs for expressing anomaly models that capture abnormal system behaviors, thus facing challenges in incorporating expert knowledge to perform timely anomaly detection over the large-scale monitoring data. To address these limitations, we build SAQL, a novel stream-based query system that takes as input a real-time event feed aggregated from multiple hosts in an enterprise, and provides an anomaly query engine that queries the event feed to identify abnormal behaviors based on the specified anomaly models. SAQL provides a domain-specific query language, the Stream-based Anomaly Query Language (SAQL), that uniquely integrates critical primitives for expressing major types of anomaly models. In the demo, we aim to show the complete usage scenario of SAQL by (1) performing an APT attack in a controlled environment, and (2) using SAQL to detect the abnormal behaviors in real time by querying the collected stream of system monitoring data that contains the attack traces. The audience will have the option to interact with the system and detect the attack footprints in real time by issuing queries and checking the query results through a command-line UI.
【Keywords】: Monitoring; Microsoft Windows; Computational modeling; Servers; Data models; Databases; Real-time systems
【Paper Link】 【Pages】:1778-1781
【Authors】: Paul Boniol ; Michele Linardi ; Federico Roncallo ; Themis Palpanas
【Abstract】: Subsequence anomaly (or outlier) detection in long sequences is an important problem with applications in a wide range of domains. However, current approaches have severe limitations: they either require prior domain knowledge, or become cumbersome and expensive to use in situations with recurrent anomalies of the same type. We recently proposed NorM, a novel approach suitable for domain-agnostic anomaly detection, which addresses the aforementioned problems by detecting anomalies based on their (dis)similarity to a model that represents normal behavior. The experimental results on several real datasets demonstrate that the proposed approach outperforms the current state of the art in terms of both accuracy and execution time. In this demonstration, we present a system for unsupervised Subsequence Anomaly Detection (SAD) that uses the NorM method. Through various scenarios with real datasets, we showcase the challenges of the problem, and we demonstrate the advantages of the proposed system.
【Keywords】: Anomaly detection; Computational modeling; Data models; Time series analysis; Data visualization; Graphical user interfaces; Art
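A much-simplified sketch of the (dis)similarity-to-normal-model idea: score each subsequence by its distance to a z-normalized model of normal behavior. NorM's actual model construction is far more sophisticated; the fixed stride, the toy model, and the synthetic series below are all assumptions for illustration.

```python
import math

def znorm(seq):
    mu = sum(seq) / len(seq)
    sd = math.sqrt(sum((x - mu) ** 2 for x in seq) / len(seq)) or 1.0
    return [(x - mu) / sd for x in seq]

def score_subsequences(series, model, w):
    # Score each non-overlapping length-w window by Euclidean distance
    # to the normal model; the largest score flags the anomaly.
    return [math.dist(znorm(series[i:i + w]), model)
            for i in range(0, len(series) - w + 1, w)]

# A repeating "normal" pattern with one corrupted cycle
normal = znorm([0, 1, 0, -1])
series = [0, 1, 0, -1] * 3 + [0, 3, 3, -1] + [0, 1, 0, -1] * 2
scores = score_subsequences(series, normal, 4)
print(scores.index(max(scores)))  # window 3 (elements 12-15) is anomalous
```

Normal cycles z-normalize to exactly the model and score zero, so only the corrupted window stands out; no per-domain threshold or labeled anomaly is needed, which is the appeal of the model-of-normality approach.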
【Paper Link】 【Pages】:1782-1785
【Authors】: Ibrahim Sabek ; Mohamed F. Mokbel
【Abstract】: The proliferation in amounts of generated data has propelled the rise of scalable machine learning solutions to efficiently analyze and extract useful insights from such data. Meanwhile, spatial data has become ubiquitous, e.g., GPS data, with increasingly sheer sizes in recent years. The applications of big spatial data span a wide spectrum of interests including tracking infectious diseases, climate change simulation, and drug addiction, among others. Consequently, major research efforts are exerted to support efficient analysis and intelligence inside these applications by either providing spatial extensions to existing machine learning solutions or building new solutions from scratch. In this 90-minute tutorial, we comprehensively review the state-of-the-art work at the intersection of machine learning and big spatial data. We cover existing research efforts and challenges in three major areas of machine learning, namely, data analysis, deep learning and statistical inference. We also discuss the existing end-to-end systems, and highlight open problems and challenges for future research in this area.
【Keywords】: Spatial databases; Machine learning; Tutorials; Data analysis; Routing; Machine learning algorithms; Data mining
【Paper Link】 【Pages】:1786-1789
【Authors】: Sebastián Villarroya ; Peter Baumann
【Abstract】: Machine Learning is increasingly being applied to many different application domains. From cancer detection to weather forecasting, a large number of different applications leverage machine learning algorithms to get faster and more accurate results over huge datasets. Although many of these datasets are mainly composed of array data, the vast majority of machine learning applications do not use array databases. This tutorial focuses on the integration of machine learning algorithms with array databases. By implementing machine learning algorithms in array databases, users can augment native, efficient array data processing with machine learning methods to perform accurate and fast array data analytics.
【Keywords】: array data; array database managers; machine learning; efficient array machine learning
【Paper Link】 【Pages】:1790-1793
【Authors】: Maria Krommyda ; Verena Kantere
【Abstract】: The wide adoption of the RDF data model, as well as the Linked Open Data initiative, has made available large linked datasets that have the potential to offer invaluable knowledge. Accessing, evaluating and understanding these datasets as published, though, requires extensive training and experience in the field of the Semantic Web, making these valuable sources of information inaccessible to a wider audience. In recent years, there have been many efforts to create systems that allow the visualization and exploration of this information. Some of these systems rely on techniques that limit the volume of displayed information by providing aggregated, filtered or summarized access to the datasets, while others initialize the exploration of the dataset based on actions performed by the users, such as keyword searches and queries. The underlying technique is key for the sustainability of the system, the definition of the requirements that the input must comply with, the datasets that can be visualized, as well as the visualization types provided. We present here a survey of these techniques, their strengths and weaknesses, as well as the datasets that they can support. The survey will provide the reader with a deep understanding of the challenges regarding the visualization of large linked datasets, a categorization of the techniques developed to resolve them, as well as an overview of the available systems and their functionalities.
【Keywords】: Data visualization; Semantics; Navigation; Resource description framework; Tools; Linked data; Browsers
【Paper Link】 【Pages】:1794-1797
【Authors】: Mohammad Javad Amiri ; Divyakant Agrawal ; Amr El Abbadi
【Abstract】: Modern large-scale data management systems utilize consensus protocols to provide fault tolerance. Consensus protocols are extensively used in the distributed database infrastructure of large enterprises such as Google, Amazon, and Facebook as well as permissioned blockchain systems like IBM's Hyperledger Fabric. In the last four decades, numerous consensus protocols have been proposed to cover a broad spectrum of distributed database systems. On one hand, distributed networks might be synchronous, partially synchronous, or asynchronous, and on the other hand, infrastructures might consist of crash-only nodes, Byzantine nodes, or both. In addition, a consensus protocol might follow a pessimistic or optimistic strategy to process transactions. Furthermore, while traditional consensus protocols assume an a priori known set of nodes, in permissionless blockchains, nodes are assumed to be unknown. Finally, consensus protocols have explored a variety of performance trade-offs between the number of phases/messages (latency), the number of required nodes, message complexity, and the activity level of participants. In this tutorial, we discuss consensus protocols that are used in modern large-scale data management systems, classify them into different categories based on their assumptions on network synchrony, failure model of nodes, etc., and elaborate on their main advantages and limitations.
【Keywords】: Fault Tolerance; Consensus; Data Management
【Paper Link】 【Pages】:1798-1801
【Authors】: Shantanu Sharma ; Anton Burtsev ; Sharad Mehrotra
【Abstract】: Despite extensive research, secure outsourcing remains an open challenge. This tutorial focuses on recent advances in secure cloud-based data outsourcing based on cryptographic (encryption, secret-sharing, and multi-party computation (MPC)) and hardware-based approaches. We highlight the strengths and weaknesses of state-of-the-art techniques, and conclude that no single approach is likely to emerge as a silver bullet. Thus, the key is to merge different hardware and software techniques to work in conjunction, using partitioned computing wherein a computation is carefully split across different cryptographic techniques so as not to compromise security. We highlight some recent work in that direction.
【Keywords】: Hardware; Outsourcing; Tutorials; Cloud computing; Encryption
【Paper Link】 【Pages】:1802-1805
【Authors】: Xiangyao Yu ; Matt Youill ; Matthew E. Woicik ; Abdurrahman Ghanem ; Marco Serafini ; Ashraf Aboulnaga ; Michael Stonebraker
【Abstract】: This paper studies the effectiveness of pushing parts of DBMS analytics queries into the Simple Storage Service (S3) of Amazon Web Services (AWS), using a recently released capability called S3 Select. We show that some DBMS primitives (filter, projection, and aggregation) can always be cost-effectively moved into S3. Other more complex operations (join, top-K, and group-by) require reimplementation to take advantage of S3 Select and are often candidates for pushdown. We demonstrate these capabilities through experimentation using a new DBMS that we developed, PushdownDB. Experimentation with a collection of queries including TPC-H queries shows that PushdownDB is on average 30% cheaper and 6.7× faster than a baseline that does not use S3 Select.
【Keywords】: Cloud computing; Runtime; Computer architecture; Servers; Indexing; Filtering algorithms
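The benefit of pushing filter and projection into storage can be sketched with a local simulation (no S3 or network involved); serialized byte counts stand in for data transferred, and the table, predicate, and column names are invented for illustration.

```python
import json

# Simulated storage-side pushdown in the spirit of S3 Select: the storage
# layer applies filter and projection before shipping rows, so the query
# engine receives far fewer bytes than with a full-table transfer.
TABLE = [{"id": i, "price": i % 100, "payload": "x" * 50} for i in range(1000)]

def full_scan():
    rows = TABLE                              # everything crosses the "network"
    return rows, sum(len(json.dumps(r)) for r in rows)

def pushdown_scan(predicate, columns):
    # Filter rows and project columns where the data lives
    rows = [{c: r[c] for c in columns} for r in TABLE if predicate(r)]
    return rows, sum(len(json.dumps(r)) for r in rows)

rows, cheap = pushdown_scan(lambda r: r["price"] > 95, ["id", "price"])
_, expensive = full_scan()
print(len(rows), cheap, expensive)  # pushdown ships a small fraction of bytes
```

Filter, projection, and aggregation commute cleanly with this split, which is why the paper finds them always worth pushing down; join, top-K, and group-by need reimplementation because their storage-side partial results must be merged correctly by the engine.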
【Paper Link】 【Pages】:1806-1809
【Authors】: Dong Wei ; Senjuti Basu Roy ; Sihem Amer-Yahia
【Abstract】: We study the recommendation of deployment strategies to task requesters that are consistent with their deployment parameters: a lower bound on the quality of the crowd contribution, an upper bound on the latency of task completion, and an upper bound on the cost incurred by paying workers. We propose BatchStrat, an optimization-driven middle layer that recommends deployment strategies for a batch of requests by accounting for worker availability. We develop computationally efficient algorithms to recommend deployments that maximize task throughput and pay-off, and empirically validate their quality and scalability.
【Keywords】: Task analysis; Throughput; Optimization; Approximation algorithms; Linear programming; Force; Organizations
【Paper Link】 【Pages】:1810-1813
【Authors】: Ruochen Jiang ; Changbo Qu ; Jiannan Wang ; Chi Wang ; Yudian Zheng
【Abstract】: Live streaming platforms need to store a large number of recorded live videos on a daily basis. An important problem is how to automatically extract highlights (i.e., attractive short video clips) from these massive, long recorded live videos. However, algorithmic approaches are either domain-specific, requiring experts to spend a long time on design, or resource-intensive, requiring a lot of training data and/or computing resources. In this paper, we propose LIGHTOR, a novel implicit crowdsourcing approach to overcome these limitations. The key insight is to collect users' natural interactions with a live streaming platform, and then leverage them to detect highlights. We recruit hundreds of users from Amazon Mechanical Turk, and evaluate the performance of LIGHTOR using two popular games on Twitch. The results show that LIGHTOR can achieve high extraction precision with a small set of training data and low computing resources.
【Keywords】: Machine learning; Implicit crowdsourcing; Highlight detection
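A toy version of the implicit-signal idea: treat bursts of viewer interactions (e.g., chat messages) as highlight candidates. LIGHTOR's actual features and learned model are much richer; the window size, threshold, and event timestamps below are arbitrary assumptions.

```python
def highlight_windows(event_times, window=10.0, threshold=3.0):
    """Flag time windows whose interaction rate exceeds `threshold` times
    the overall average rate -- a crude proxy for implicit-signal spikes."""
    if not event_times:
        return []
    horizon = max(event_times) + window
    n_bins = int(horizon // window) + 1
    counts = [0] * n_bins
    for t in event_times:
        counts[int(t // window)] += 1
    avg = len(event_times) / n_bins
    return [i * window for i, c in enumerate(counts) if c > threshold * avg]

# Sparse background chat, with a burst around t=120s
events = [5, 31, 58, 84, 108, 119, 120, 121, 122, 123, 151, 180]
print(highlight_windows(events))  # only the window starting at 120s fires
```

The naturally occurring interactions cost the platform nothing to collect, which is the core economy of implicit crowdsourcing compared with paying workers to label highlights explicitly.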
【Paper Link】 【Pages】:1814-1817
【Authors】: Chengliang Chai ; Guoliang Li ; Ju Fan ; Yuyu Luo
【Abstract】: Visualization charts are widely utilized for presenting structured data. Under many circumstances, people want to extract the data in charts collected from various sources, such as papers and websites, so as to further analyze the data or create new charts. However, existing automatic and semi-automatic approaches are not always effective due to the variety of charts. In this paper, we introduce a crowdsourcing approach that leverages human ability to extract data from visualization charts. There are several challenges. The first is how to avoid tedious human interaction with charts and design simple crowdsourcing tasks. Second, it is challenging to evaluate workers' quality for truth inference, because workers may not only provide inaccurate values but also misalign values to the wrong data series. To address these challenges, we design an effective crowdsourcing task scheme that splits a chart into simple micro-tasks. We introduce a novel worker quality model that considers workers' accuracy and task difficulty. We also devise an effective early-stopping mechanism to save cost. We have conducted experiments on a real crowdsourcing platform, and the results show that our framework outperforms state-of-the-art approaches on both cost and quality.
【Keywords】: Task analysis; Data mining; Crowdsourcing; Feature extraction; Data visualization; Bars; Tools
【Paper Link】 【Pages】:1818-1821
【Authors】: Hongzhi Shi ; Quanming Yao ; Qi Guo ; Yaguang Li ; Lingyu Zhang ; Jieping Ye ; Yong Li ; Yan Liu
【Abstract】: Predicting Origin-Destination (OD) flow is a crucial problem for intelligent transportation. However, it is extremely challenging for three reasons: first, correlations exist between both origins and destinations; second, the correlations are dynamic across time; finally, there are multiple correlations arising from different aspects. To the best of our knowledge, existing models for OD flow prediction cannot tackle all three of these issues simultaneously. We propose Multi-Perspective Graph Convolutional Networks (MPGCN) to capture the complex dependencies. Our proposed model first utilizes a long short-term memory (LSTM) network to extract temporal features for each OD pair and then learns the spatial dependency of origins and destinations via a two-dimensional graph convolutional network. Furthermore, we design a dynamic graph together with two static graphs to capture the complicated spatial dependencies and use an averaging strategy to obtain the final predicted OD flow. We conduct extensive experiments on two large-scale real-world datasets, which not only demonstrate our design philosophy but also validate the effectiveness of the proposed model.
【Keywords】: Origin-destination traffic prediction; graph convolutional network; spatial-temporal data analysis
【Paper Link】 【Pages】:1822-1825
【Authors】: Guanjie Zheng ; Hanyang Liu ; Kai Xu ; Zhenhui Li
【Abstract】: Traffic simulations can help to explore novel and efficient transportation solutions that overcome traffic problems such as traffic jams and road planning. Traditional traffic simulators usually leverage a car-following model to simulate a vehicle's behavior in the real-world traffic environment. However, these calibrated, simplified physical models often fail to accurately predict the pattern of a vehicle's movement in complicated real-world traffic environments. Considering the complexity and non-linearity of real-world traffic, this paper is the first to treat traffic simulation as a learning problem, and proposes learning to simulate (L2S) vehicle trajectories. We use the generative adversarial imitation learning framework to estimate the policy that provides sequential decisions for the vehicle given real-world demonstrations. Experiments on real-world traffic data show that our method outperforms traditional traffic simulation approaches in simulating vehicle trajectories.
【Keywords】: Traffic simulation; imitation learning
【Paper Link】 【Pages】:1826-1829
【Authors】: Jiawei Zhang ; Bowen Dong ; Philip S. Yu
【Abstract】: In recent years, due to the booming development of online social networks, fake news created for various commercial and political purposes has been appearing in large numbers and spreading widely in the online world. Because of its deceptive wording, online social network users can easily be misled by such fake news, which has already had tremendous effects on offline society. An important goal in improving the trustworthiness of information in online social networks is to identify fake news in a timely manner. This paper investigates the principles, methodologies and algorithms for detecting fake news articles, creators and subjects in online social networks, and evaluates the corresponding performance. It addresses the challenges introduced by the unknown characteristics of fake news and the diverse connections among news articles, creators and subjects. This paper introduces a novel gated graph neural network, namely FAKEDETECTOR. Based on a set of explicit and latent features extracted from the textual information, FAKEDETECTOR builds a deep diffusive network model to learn the representations of news articles, creators and subjects simultaneously. Extensive experiments have been done on a real-world fake news dataset to compare FAKEDETECTOR with several state-of-the-art models, and the experimental results are provided in the full version of this paper at [13].
【Keywords】: Fake News Detection; Diffusive Network; Graph Neural Network; Text Mining; Data Mining
【Paper Link】 【Pages】:1830-1833
【Authors】: Jingping Liu ; Yuanfu Zhou ; Dan Wu ; Chao Wang ; Haiyun Jiang ; Sheng Zhang ; Bo Xu ; Yanghua Xiao
【Abstract】: Commonsense knowledge acquisition is one of the fundamental issues in the implementation of human-level AI. However, commonsense is difficult to obtain, because it is a human consensus and rarely appears explicitly in texts or other data. In this paper, we focus on the automatic acquisition of a typical kind of implicit verb-oriented commonsense knowledge (e.g., "person eats food"), which captures concept-level knowledge of verb phrases. For this purpose, we propose a knowledge-driven approach to mine verb-oriented commonsense knowledge from verb phrases with the help of a taxonomy. First, we design an entropy-based filter to cope with noisy input verb phrases. Then, we propose a joint model based on minimum description length and a neural language model to generate verb-oriented commonsense knowledge. We conduct extensive experiments showing that our solution mines verb-oriented commonsense knowledge more effectively than its competitors, and we ultimately harvest 18K pieces of verb-oriented commonsense knowledge.
【Keywords】: commonsense knowledge; verb phrases; probabilistic taxonomy
【Paper Link】 【Pages】:1834-1837
【Authors】: Paul Boniol ; Michele Linardi ; Federico Roncallo ; Themis Palpanas
【Abstract】: Subsequence anomaly (or outlier) detection in long sequences is an important problem with applications in a wide range of domains. However, current approaches have severe limitations: they either require prior domain knowledge, or become cumbersome and expensive to use in situations with recurrent anomalies of the same type. In this work, we address these problems, and propose NorM, a novel approach suitable for domain-agnostic anomaly detection. NorM is based on a new data series primitive, which permits detecting anomalies based on their (dis)similarity to a model that represents normal behavior. Experimental results on several real datasets demonstrate that the proposed approach outperforms the current state-of-the-art algorithms by a large margin in terms of accuracy, while being orders of magnitude faster.
【Keywords】: Data Series; Time Series; Anomaly discovery
【Paper Link】 【Pages】:1838-1841
【Authors】: Shuxiang Zhang ; David Tse Jung Huang ; Gillian Dobbie ; Yun Sing Koh
【Abstract】: Concept drift detection refers to the process of detecting changes in the underlying distribution of data. Interest in the data stream mining community has increased because of the role drift detectors play in improving the performance of online learning algorithms. Over the years, a myriad of drift detection methods have been proposed. However, most of these methods are single detectors, which usually work well only with a single type of drift. In this research, we propose a semi-supervised locally-weighted ensemble detector (SLED), where the relative performance among its base detectors is characterized by a set of weights learned in a semi-supervised manner. The aim of this technique is to effectively deal with both abrupt and gradual concept drifts. In our experiments, SLED is configured with ten well-known drift detectors. To evaluate the performance of SLED, we compare it with single detectors as well as state-of-the-art ensemble methods on both synthetic and real-world datasets using different performance measures. The experimental results show that SLED has fewer false positives, higher precision, and a higher Matthews correlation coefficient while maintaining reasonably good performance on other measures.
【Keywords】: concept drift; data stream; drift ensemble; drift detector; stream mining
【Paper Link】 【Pages】:1842-1845
【Authors】: Muzaffer Can Altinigneli ; Lukas Miklautz ; Christian Böhm ; Claudia Plant
【Abstract】: We propose a novel density-based mode-seeking Hierarchical Quick Shift clustering algorithm with an optional Recurrent Neural Network (RNN) to jointly learn the cluster assignments for every sample and the underlying dynamics of the mode-seeking clustering process. As a mode-seeking clustering algorithm, Hierarchical Quick Shift constrains data samples to stay on similar trajectories. All data samples converging to the same local mode are assigned to a common cluster. The RNN enables us to learn quasi-temporal structures during the mode-seeking clustering process. It supports variable density clusters with arbitrary shapes without requiring the expected number of clusters a priori. We evaluate our method in extensive experiments to show the advantages over other density-based clustering algorithms.
【Keywords】: quick shift; HDBSCAN; mode-seeking; hierarchical clustering; mean shift; medoid shift; recurrent neural network
【Paper Link】 【Pages】:1846-1849
【Authors】: Yan Zhu ; Chin-Chia Michael Yeh ; Zachary Zimmerman ; Eamonn J. Keogh
【Abstract】: Since its introduction several years ago, the Matrix Profile has received significant attention for two reasons. First, it is a very general representation, allowing for the discovery of time series motifs, discords, chains, joins, shapelets, segmentations, etc. Secondly, it can be computed very efficiently, allowing for fast exact computation and ultra-fast approximate computation. For analysts that use the Matrix Profile frequently, its incremental computability means that they can perform ad-hoc analytics at any time, with almost no delay. However, they can only issue global queries, that is, queries that consider all the data from time zero to the current time. This is a significant limitation, as they may be interested in localized questions about a contiguous subset of the data. For example, "do we have any unusual motifs that correspond with that unusually cool summer two years ago?" Such ad-hoc queries would require recomputing the Matrix Profile for the time period in question. This is not an untenable computation, but it could not be done in interactive time. In this work we introduce a novel indexing framework that allows queries about arbitrary ranges to be answered in quasilinear time, making such queries interactive for the first time.
【Keywords】: Time Series; Matrix Profile; Indexing
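The primitive this entry builds on can be illustrated with a brute-force sketch: for each subsequence, the matrix profile records the z-normalized distance to its nearest non-trivial match, so discords (anomalies) surface as the largest profile values. This is a generic illustration with names of our own choosing, not the paper's indexing framework, which accelerates exactly this computation.

```python
import numpy as np

def matrix_profile_naive(ts, m):
    """Brute-force matrix profile: for each length-m subsequence, the
    z-normalized Euclidean distance to its nearest non-trivial match."""
    n = len(ts) - m + 1
    subs = np.array([ts[i:i + m] for i in range(n)])
    # z-normalize each subsequence so matches are shape-based
    subs = (subs - subs.mean(axis=1, keepdims=True)) / subs.std(axis=1, keepdims=True)
    profile = np.full(n, np.inf)
    excl = m // 2  # exclusion zone to ignore trivial self-matches
    for i in range(n):
        for j in range(n):
            if abs(i - j) <= excl:
                continue
            profile[i] = min(profile[i], np.linalg.norm(subs[i] - subs[j]))
    return profile

ts = np.sin(np.linspace(0, 8 * np.pi, 200))
ts[100:110] += 2.0  # inject a bump the periodic signal cannot match
mp = matrix_profile_naive(ts, 20)
discord = int(np.argmax(mp))  # index of the most anomalous subsequence
```

A range query of the kind the paper targets would amount to running this (or an indexed equivalent) over a contiguous slice of `ts` only.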
【Paper Link】 【Pages】:1850-1853
【Authors】: Jun-Gi Jang ; U Kang
【Abstract】: Given a dense tensor, how can we find latent patterns and relations efficiently? Existing Tucker decomposition methods based on Alternating Least Squares (ALS) have limitations in terms of time and space since they directly handle large dense tensors to obtain the Tucker decomposition. Although a few methods have tried to reduce computational time by sampling tensors, sketching tensors, and using efficient matrix operations, their speed and memory efficiency remain limited. In this paper, we propose D-Tucker, a fast and memory-efficient method for Tucker decomposition of large dense tensors. D-Tucker consists of approximation, initialization, and iteration phases. D-Tucker 1) compresses an input tensor by computing the randomized singular value decomposition of matrices sliced from the input tensor, and 2) efficiently obtains orthogonal factor matrices and a core tensor by using the SVD results of the sliced matrices. Through experiments, we show that D-Tucker is up to 38.4× faster, and requires up to 17.2× less space, than existing methods with little sacrifice in accuracy.
【Keywords】: dense tensor; tucker decomposition; efficiency
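The slicing-and-compression step described above can be sketched generically: slice a 3-way tensor into frontal matrices and replace each with small truncated randomized-SVD factors. Function names, shapes, and the random range-finder variant used here are our own illustrative assumptions, not D-Tucker's actual implementation.

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, seed=0):
    """Truncated SVD via a random range finder (generic sketching idea)."""
    rng = np.random.default_rng(seed)
    Y = A @ rng.standard_normal((A.shape[1], rank + oversample))
    Q, _ = np.linalg.qr(Y)                      # orthonormal basis for range(A)
    U_s, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_s)[:, :rank], s[:rank], Vt[:rank]

# Build a synthetic low-rank 60x40x8 tensor so compression is near-lossless.
rng = np.random.default_rng(1)
U, V = rng.standard_normal((60, 5)), rng.standard_normal((5, 40))
X = np.stack([U @ np.diag(rng.uniform(1, 2, 5)) @ V for _ in range(8)], axis=2)

# Compress every frontal slice: 60x40 values -> (60x5, 5, 5x40) factors.
factors = [randomized_svd(X[:, :, k], rank=5) for k in range(X.shape[2])]
Uk, sk, Vtk = factors[0]
err = np.linalg.norm(Uk @ np.diag(sk) @ Vtk - X[:, :, 0])
```

Downstream, a Tucker-style method can work with the small factors instead of the raw slices, which is where the time and space savings come from.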
【Paper Link】 【Pages】:1854-1857
【Authors】: Guo Zhong ; Chi-Man Pun
【Abstract】: In the era of big data, multi-view clustering has drawn considerable attention in the machine learning and data mining communities due to the existence of a large amount of unlabeled multi-view data in reality. Traditional spectral graph theoretic methods have recently been extended to multi-view clustering and shown outstanding performance. However, most of them still consist of two separate stages: learning a fixed common real matrix (i.e., continuous labels) of all the views from the original data, and then applying K-means to the resulting common label matrix to obtain the final clustering results. To address this, we design a unified multi-view spectral clustering scheme that learns the discrete cluster indicator matrix in a single stage. Specifically, the proposed framework directly obtains clustering results without performing K-means clustering. Experimental results on several well-known benchmark datasets verify the effectiveness and superiority of the proposed method compared to state-of-the-art approaches.
【Keywords】: Optimization; Clustering algorithms; Clustering methods; Databases; Big Data; Laplace equations; Measurement
【Paper Link】 【Pages】:1858-1861
【Authors】: Mu Yuan ; Lan Zhang ; Xiang-Yang Li ; Hui Xiong
【Abstract】: Labeling data comprehensively and efficiently is a widely needed but challenging task. With limited computing resources, given a data stream and a collection of deep-learning models, we propose to adaptively select and schedule a subset of these models to execute, aiming to maximize the value of the model output. Achieving this goal is nontrivial since a model's output on any data item is content-dependent and hard to predict. In this paper, we present an Adaptive Model Scheduling framework, consisting of 1) a deep reinforcement learning-based approach to predict the value of unexecuted models by mining semantic relationship among diverse models, and 2) two heuristic algorithms to adaptively schedule models under deadline or deadline-memory constraints. The proposed framework does not require any prior knowledge of the data, which works as a powerful complement to existing model optimization technologies. We conduct extensive evaluations on 30 popular image labeling models to demonstrate the effectiveness of our design.
【Keywords】: multi-model inference; data labeling; adaptive model scheduling; deep reinforcement learning
【Paper Link】 【Pages】:1862-1865
【Authors】: Zhongyuan Jiang ; Lichao Sun ; Philip S. Yu ; Hui Li ; Jianfeng Ma ; Yulong Shen
【Abstract】: In this paper, we incorporate the realistic scenario of key protection into link privacy preserving and propose the target-link privacy preserving (TPP) model: target links, referred to as targets, are the most important and sensitive objectives that would be intentionally attacked by adversaries and thus need privacy protection, while other links of less privacy concern are properly released to maintain graph utility. The goal of TPP is to limit target disclosure by deleting a budget-limited set of alternative non-target links, referred to as protectors, to defend against adversarial link predictions for all targets. Traditional link privacy preserving treats all links as targets and concentrates on structural-level protections, where serious link disclosure and high graph utility loss remain the bottleneck of graph releasing today; TPP instead focuses on target-level protections, in which key protection is applied to a tiny fraction of critical targets to achieve better privacy protection and lower graph utility loss. Currently there is a lack of a clear TPP problem definition, provably optimal or near-optimal protector selection algorithms, and scalable implementations on large-scale social graphs. We first introduce the TPP model and propose a dissimilarity function for measuring the targets' ability to withstand adversarial privacy analysis. We consider two problems distinguished by their budget assignment settings: 1) we protect all targets and optimize the total dissimilarity of all targets under a single global budget; 2) besides the protection of all targets, we also care about the protection of each individual target by assigning a local budget to every target. Moreover, we propose two local protector selections, namely cross-target and within-target pickings; each problem combined with each protector selection corresponds to a greedy algorithm.
We also provide scalable implementations of all greedy algorithms by limiting the selection scale of protectors, and we prove that all greedy-based algorithms achieve approximation guarantees by establishing monotonicity and submodularity. Through experiments on large real social graphs, we demonstrate the effectiveness and efficiency of the proposed target-link protection methods.
【Keywords】: Target Privacy Preserving; Link Deletion; Budget; Graph Utility
【Paper Link】 【Pages】:1866-1869
【Authors】: Xiaolin Han ; Tobias Grubenmann ; Reynold Cheng ; Sze Chun Wong ; Xiaodong Li ; Wenya Sun
【Abstract】: Incident detection (ID), or the automatic discovery of anomalies from road traffic data (e.g., road sensor and GPS data), enables emergency actions (e.g., rescuing injured people) to be carried out in a timely fashion. Existing ID solutions based on data mining or machine learning often rely on dense traffic data; for instance, sensors installed in highways provide frequent updates of road information. In this paper, we ask the question: Can ID be performed on sparse traffic data (e.g., location data obtained from GPS devices equipped on vehicles)? As these data may not be enough to describe the state of the roads involved, they can undermine the effectiveness of existing ID solutions. To tackle this challenge, we borrow an important insight from the transportation area, which uses trajectories (i.e., moving histories of vehicles) to derive incident patterns. We study how to obtain incident patterns from trajectories and devise a new solution (called Filter-Discovery-Match (FDM)) to detect anomalies in sparse traffic data. Experiments on a taxi dataset in Hong Kong and a simulated dataset show that FDM is more effective than state-of-the-art ID solutions on sparse traffic data.
【Keywords】: Data Mining; Traffic Incident Detection; Sparsity
【Paper Link】 【Pages】:1870-1873
【Authors】: Weixin Zeng ; Xiang Zhao ; Jiuyang Tang ; Xuemin Lin
【Abstract】: Entity alignment (EA) identifies entities that refer to the same real-world object but are located in different knowledge graphs (KGs), and has been harnessed for KG construction and integration. When generating EA results, current solutions treat entities independently and fail to take into account the interdependence between entities. To fill this gap, we propose a collective EA framework. We first employ three representative features, i.e., structural, semantic and string signals, which are adapted to capture different aspects of the similarity between entities in heterogeneous KGs. In order to make collective EA decisions, we formulate EA as the classical stable matching problem, which is then solved effectively by the deferred acceptance algorithm. Our proposal is evaluated on both cross-lingual and mono-lingual EA benchmarks against state-of-the-art solutions, and the empirical results verify its effectiveness and superiority.
【Keywords】: Entity alignment; Stable matching
【Paper Link】 【Pages】:1874-1877
【Authors】: Wolfram Wingerath ; Felix Gessert ; Norbert Ritter
【Abstract】: Traditional databases are optimized for pull-based queries, i.e. they make information available in direct response to client requests. While this access pattern is adequate for mostly static domains, it requires inefficient and slow workarounds (e.g. periodic polling) when clients need to stay up-to-date. Acknowledging reactive and interactive workloads, modern real-time databases such as Firebase, Meteor, and RethinkDB proactively deliver result updates to their clients through push-based real-time queries. However, current implementations are only of limited practical relevance, since they are incompatible with existing technology stacks, fail under heavy load, or do not support complex queries to begin with. To address these issues, we propose the system design InvaliDB which combines linear read and write scalability for real-time queries with superior query expressiveness and legacy compatibility. We compare InvaliDB against competing system designs to emphasize the benefits of our approach that has been serving customers at the Database-as-a-Service company Baqend since July 2017.
【Keywords】: database systems; real-time queries; real-time databases; Firebase; Meteor; RethinkDB; push-based data access; result change notifications; NoSQL; document stores; MongoDB
【Paper Link】 【Pages】:1878-1881
【Authors】: Pei Li ; Jaroslaw Szlichta ; Michael H. Böhlen ; Divesh Srivastava
【Abstract】: We introduce band ODs to model the semantics of attributes that are monotonically related with small variations without there being an intrinsic violation of semantics. To make band ODs relevant to real-world applications, we make them less strict to hold approximately with some exceptions. Since formulating integrity constraints manually is cumbersome, we study the problem of automatic approximate band OD discovery. We devise an algorithm that determines the optimal solution in polynomial time. We perform a thorough experimental evaluation of our techniques over real-world and synthetic datasets.
【Keywords】: Semantics; Approximation algorithms; Data integrity; Conferences; Data engineering; Cleaning; Industries
【Paper Link】 【Pages】:1882-1885
【Authors】: Omar AlOmeir ; Eugenie Yujing Lai ; Mostafa Milani ; Rachel Pottinger
【Abstract】: Provenance information can be large and overwhelming to users. We present a set of criteria that any provenance exploration tool must have and introduce Pastwatch, a provenance exploration system that adheres to those criteria. We also address the issues associated with provenance of aggregation queries, including the creation of a summarization method that makes provenance of aggregation queries manageable for users. Finally, we conduct a quantitative user study to show statistically significant results that Pastwatch makes provenance information more efficient and easier to use than standard approaches.
【Keywords】: provenance; relational databases; summarization; visualization; usability
【Paper Link】 【Pages】:1886-1889
【Authors】: Stella Giannakopoulou ; Manos Karpathiotakis ; Anastasia Ailamaki
【Abstract】: Data cleaning is a time-consuming process that depends on the data analysis that users perform. Existing solutions treat data cleaning as a separate offline process that takes place before analysis begins. Applying data cleaning before analysis assumes a priori knowledge of the inconsistencies and the query workload, thereby requiring effort on understanding and cleaning the data that is unnecessary for the analysis. We propose an approach that performs probabilistic repair of functional dependency violations on-demand, driven by the exploratory analysis that users perform. We introduce Daisy, a system that seamlessly integrates data cleaning into the analysis by relaxing query results. Daisy executes analytical query-workloads over dirty data by weaving cleaning operators into the query plan. Our evaluation shows that Daisy adapts to the workload and outperforms traditional offline cleaning on both synthetic and real-world workloads.
【Keywords】: data cleaning; dependencies; query-driven
【Paper Link】 【Pages】:1890-1893
【Authors】: Shuang Hao ; Chengliang Chai ; Guoliang Li ; Nan Tang ; Ning Wang ; Xiang Yu
【Abstract】: Knowledge bases (KBs), which store high-quality information, are crucial for many applications, such as enhancing search results and serving as external sources for data cleaning. Not surprisingly, there exist outdated facts in most KBs due to the rapid change of information. Naturally, it is important to keep KBs up-to-date. Traditional wisdom has investigated the problem of using reference data (such as new facts extracted from the news) to detect outdated facts in KBs. However, existing approaches can only cover a small percentage of facts in KBs. In this paper, we propose a novel human-in-the-loop approach for outdated fact detection in KBs. It trains a binary classifier using features such as historical update frequency and existence time of a fact to compute the likelihood of a fact in a KB to be outdated. Then, it interacts with humans to verify whether a fact with high likelihood is indeed outdated. In addition, it also uses logical rules to detect more outdated facts based on human feedback. The outdated facts detected by the logical rules will also be fed back to train the ML model further for data augmentation. Extensive experiments on real-world KBs, such as Yago and DBpedia, show the effectiveness of our solution.
【Keywords】: Training; Predictive models; Feature extraction; Task analysis; Training data; Knowledge based systems; Data mining
【Paper Link】 【Pages】:1894-1897
【Authors】: Oksana Dolmatova ; Nikolaus Augsten ; Michael H. Böhlen
【Abstract】: There exist large amounts of numerical data that are stored in databases and must be analyzed. Database tables come with a schema and include non-numerical attributes; this is crucial contextual information that is needed for interpreting the numerical values. We propose relational matrix operations that support the analysis of data stored in tables and that preserve contextual information. The result of our approach are precisely defined relational matrix operations and a system implementation in MonetDB that illustrates the seamless integration of relational matrix operations into a relational DBMS.
【Keywords】: Matrix decomposition; Linear regression; Merging; Data structures; Databases
【Paper Link】 【Pages】:1898-1901
【Authors】: Lily Xu ; Shahrzad Gholami ; Sara Mc Carthy ; Bistra Dilkina ; Andrew J. Plumptre ; Milind Tambe ; Rohit Singh ; Mustapha Nsabuga ; Joshua Mabonga ; Margaret Driciru ; Fred Wanyama ; Aggrey Rwetsiba ; Tom Okello ; Eric Enyel
【Abstract】: Illegal wildlife poaching threatens ecosystems and drives endangered species toward extinction. However, efforts for wildlife protection are constrained by the limited resources of law enforcement agencies. To help combat poaching, the Protection Assistant for Wildlife Security (PAWS) is a machine learning pipeline that has been developed as a data-driven approach to identify areas at high risk of poaching throughout protected areas and compute optimal patrol routes. In this paper, we take an end-to-end approach to the data-to-deployment pipeline for anti-poaching. In doing so, we address challenges including extreme class imbalance (up to 1:200), bias, and uncertainty in wildlife poaching data to enhance PAWS, and we apply our methodology to three national parks with diverse characteristics. (i) We use Gaussian processes to quantify predictive uncertainty, which we exploit to improve robustness of our prescribed patrols and increase detection of snares by an average of 30%. We evaluate our approach on real-world historical poaching data from Murchison Falls and Queen Elizabeth National Parks in Uganda and, for the first time, Srepok Wildlife Sanctuary in Cambodia. (ii) We present the results of large-scale field tests conducted in Murchison Falls and Srepok Wildlife Sanctuary which confirm that the predictive power of PAWS extends promisingly to multiple parks. This paper is part of an effort to expand PAWS to 800 parks around the world through integration with SMART conservation software.
【Keywords】: wildlife protection; data mining; predictive modeling; patrol route planning; poaching; uncertainty; gaussian processes
【Paper Link】 【Pages】:1902-1905
【Authors】: André Bauer ; Marwin Züfle ; Nikolas Herbst ; Samuel Kounev ; Valentin Curtef
【Abstract】: One central problem of machine learning is the inherent limitation of predicting only what has been learned: stationarity. Any time series property that eludes stationarity poses a challenge for proper model building. Furthermore, existing forecasting methods lack reliable forecast accuracy and time-to-result if not applied in their sweet spot. In this paper, we propose a fully automated machine learning-based forecasting approach. Our Telescope approach extracts and transforms features from an input time series and uses them to generate an optimized forecast model. In a broad competition including the latest hybrid forecasters, established statistical methods, and machine learning-based methods, our Telescope approach shows the best forecast accuracy coupled with a lower and more reliable time-to-result.
【Keywords】: Automatic feature extraction; Combining forecasts; Comparative studies; Forecasting competitions; Long term time series forecasting; Time series
【Paper Link】 【Pages】:1906-1909
【Authors】: Chunnan Wang ; Hongzhi Wang ; Tianyu Mu ; Jianzhong Li ; Hong Gao
【Abstract】: In many fields, a mass of algorithms with completely different hyperparameters have been developed to address the same type of problems. Choosing the algorithm and hyperparameter setting correctly can greatly improve overall performance, but users often fail to do so due to the absence of knowledge. How to help users effectively and quickly select the suitable algorithm and hyperparameter settings for a given task instance is an important research topic nowadays, known as the CASH problem. In this paper, we design the Auto-Model approach, which makes full use of known information from related research papers and introduces hyperparameter optimization techniques, to solve the CASH problem effectively. Auto-Model tremendously reduces the cost of algorithm implementation and the hyperparameter configuration space, and is thus capable of dealing with the CASH problem efficiently and easily. To demonstrate the benefit of Auto-Model, we compare it with the classical Auto-Weka approach. The experimental results show that our proposed approach provides superior results and achieves better performance in a shorter time.
【Keywords】: Algorithm selection; Hyperparameter optimization; Combined algorithm selection and hyperparameter optimization problem; Auto-Weka; Classification algorithms
【Paper Link】 【Pages】:1910-1913
【Authors】: Parmita Mehta ; Stephen Portillo ; Magdalena Balazinska ; Andrew J. Connolly
【Abstract】: Deep learning (DL) models have achieved paradigm-changing performance in many fields with high-dimensional data, such as images, audio, and text. However, the black-box nature of deep neural networks is not only a barrier to adoption in applications such as medical diagnosis, where interpretability is essential, but it also impedes the diagnosis of underperforming models. The task of diagnosing or explaining DL models requires the computation of additional artifacts, such as activation values and gradients. These artifacts are large in volume, and their computation, storage, and querying raise significant data management challenges. In this paper, we develop a novel data sampling technique that produces approximate but accurate results for these model debugging queries. Our sampling technique utilizes the lower-dimensional representation learned by the DL model and focuses on model decision boundaries for the data in this lower-dimensional space.
【Keywords】: Data models; Computational modeling; Data visualization; Neurons; Machine learning; Buildings; Training
【Paper Link】 【Pages】:1910-1913
【Authors】: Simon Aagaard Pedersen ; Bin Yang ; Christian S. Jensen
【Abstract】: Increasingly available trajectory data enables detailed capture of traffic conditions. We consider an uncertain road network graph, where each graph edge is associated with a travel time distribution, and we study probabilistic budget routing that aims to find the path with the highest probability of arriving within a given time budget. In this setting, a fundamental operation is to compute the travel cost distribution of a path from the cost distributions of the edges in the path. Solutions that rely on convolution generally assume independence among the edges' distributions, which often does not hold and thus incurs poor accuracy. We propose a hybrid approach that combines convolution and machine learning-based estimation to take into account dependencies among distributions in order to improve accuracy. Next, we propose an efficient routing algorithm that is able to utilize the hybrid approach and that features effective pruning techniques to enable faster routing. Empirical studies on a substantial real-world trajectory set offer insight into the properties of the proposed solution, indicating that it is promising.
【Keywords】: Convolution; Estimation; Trajectory; Routing; Roads; Computational modeling; Histograms
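The convolution step this abstract builds on is easy to make concrete. Below is a minimal sketch of convolving two discrete per-edge travel-time histograms under the independence assumption the paper argues often fails in practice; the edge distributions are hypothetical examples, not data from the paper.

```python
def convolve(dist_a, dist_b):
    # Travel-time distribution of two consecutive edges, assuming the
    # edges' distributions are independent (the assumption the paper
    # notes often does not hold in practice).
    out = {}
    for ta, pa in dist_a.items():
        for tb, pb in dist_b.items():
            out[ta + tb] = out.get(ta + tb, 0.0) + pa * pb
    return out

def prob_within_budget(dist, budget):
    # Probability of arriving within the given time budget.
    return sum(p for t, p in dist.items() if t <= budget)

# Hypothetical per-edge travel-time histograms (minutes -> probability).
edge1 = {4: 0.5, 6: 0.5}
edge2 = {3: 0.7, 8: 0.3}
path = convolve(edge1, edge2)
arrival_prob = prob_within_budget(path, budget=10)
```

Probabilistic budget routing then prefers the path whose `arrival_prob` is highest; the paper's hybrid approach replaces this pure convolution with a learned estimator when edges are dependent.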
【Paper Link】 【Pages】:1914-1917
【Authors】: Gangmuk Lim ; Mohamed S. Hassan ; Ze Jin ; Stavros Volos ; Myeongjae Jeon
【Abstract】: Datacenter systems require real-time troubleshooting so as to minimize downtimes. In doing so, datacenter operators employ streaming analytics for collecting and processing datacenter telemetry over a temporal window. Quantile computation is key to this telemetry monitoring since it can summarize the typical and abnormal behavior of the monitored system. However, computing quantiles in real-time is resource-intensive as it requires processing hundreds of millions of events in seconds while providing high accuracy. To address these challenges, we propose AOMG, an efficient and accurate quantile approximation algorithm that capitalizes on insights from our workload study. AOMG improves performance through two-level hierarchical windowing while offering small value errors across a wide range of quantiles by taking into account the density of the underlying data distribution. Our evaluations show that AOMG estimates the exact quantiles with less than 5% relative value error for a variety of use cases while providing high throughput.
【Keywords】: approximate quantile; data stream; datacenter telemetry; datacenter monitoring
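To make the hierarchical-windowing idea tangible, here is a toy sketch: one sorted list per sub-window, so expiring old data evicts a whole sub-window at once instead of deleting events one by one. This is only an illustration of the windowing structure; AOMG's per-window summaries are approximate and far more compact than full sorted lists.

```python
import bisect

class WindowedQuantiles:
    """Toy two-level windowing: one sorted list per sub-window, so
    expiring old data drops an entire sub-window at once. (A sketch of
    the windowing idea only, not AOMG's summary data structure.)"""
    def __init__(self, num_subwindows):
        self.subs = [[] for _ in range(num_subwindows)]
        self.cur = 0
    def advance(self):
        # Move to the next sub-window, evicting its old contents.
        self.cur = (self.cur + 1) % len(self.subs)
        self.subs[self.cur] = []
    def insert(self, x):
        bisect.insort(self.subs[self.cur], x)
    def quantile(self, q):
        merged = sorted(v for sub in self.subs for v in sub)
        return merged[min(int(q * len(merged)), len(merged) - 1)]

w = WindowedQuantiles(num_subwindows=4)
for x in [5, 1, 9, 3, 7, 2, 8, 6]:
    w.insert(x)
median = w.quantile(0.5)
```

The exact-merge `quantile` here is what a real system replaces with mergeable approximate summaries to keep throughput high.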
【Paper Link】 【Pages】:1918-1921
【Authors】: James R. Foulds ; Rashidul Islam ; Kamrun Naher Keya ; Shimei Pan
【Abstract】: We propose differential fairness, a multi-attribute definition of fairness in machine learning which is informed by intersectionality, a critical lens arising from the humanities literature, leveraging connections between differential privacy and legal notions of fairness. We show that our criterion behaves sensibly for any subset of the set of protected attributes, and we prove economic, privacy, and generalization guarantees. We provide a learning algorithm which respects our differential fairness criterion. Experiments on the COMPAS criminal recidivism dataset and census data demonstrate the utility of our methods.
【Keywords】: fairness in AI; AI and society; 80% rule; privacy
【Paper Link】 【Pages】:1922-1925
【Authors】: Qingqing Ye ; Haibo Hu ; Man Ho Au ; Xiaofeng Meng ; Xiaokui Xiao
【Abstract】: Local differential privacy (LDP) is an emerging technique for privacy-preserving data collection without a trusted collector. Despite its strong privacy guarantee, LDP cannot be easily applied to real-world graph analysis tasks such as community detection and centrality analysis due to its high implementation complexity and low data utility. In this paper, we address these two issues by presenting LF-GDPR, the first LDP-enabled graph metric estimation framework for graph analysis. It collects two atomic graph metrics - the adjacency bit vector and node degree - from each node locally. LF-GDPR simplifies the job of implementing LDP-related steps (e.g., local perturbation, aggregation and calibration) for a graph metric estimation task by providing either a complete or a parameterized algorithm for each step.
【Keywords】: Local differential privacy; Graph metric; Privacy-preserving graph analysis
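The abstract does not specify LF-GDPR's perturbation mechanism, but the standard LDP primitive for an adjacency bit vector is randomized response followed by calibration. The sketch below is that textbook primitive, offered only as an illustration of the local perturbation/calibration steps the framework packages up.

```python
import math
import random

def perturb_bits(bits, epsilon):
    """Randomized response on an adjacency bit vector: keep each bit
    with probability p = e^eps / (e^eps + 1), flip it otherwise."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return [b if random.random() < p else 1 - b for b in bits]

def estimate_ones(noisy_count, n, epsilon):
    """Calibrate the noisy count of 1-bits into an unbiased estimate
    of the true count (e.g., a node's degree)."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return (noisy_count - n * (1 - p)) / (2 * p - 1)

random.seed(7)
bits = [1, 0, 0, 1, 0, 1, 0, 0]          # one node's adjacency bits
noisy = perturb_bits(bits, epsilon=2.0)   # what the node reports
degree_est = estimate_ones(sum(noisy), len(bits), epsilon=2.0)
```

A graph-metric estimator then aggregates such calibrated reports from all nodes; the privacy budget epsilon trades accuracy against plausible deniability of each bit.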
【Paper Link】 【Pages】:1926-1929
【Authors】: Xiaodong Qi ; Zhao Zhang ; Cheqing Jin ; Aoying Zhou
【Abstract】: The full-replication data storage mechanism commonly utilized in existing blockchain systems lacks sufficient storage scalability, since it reserves a copy of the whole block data in each node, so that the overall storage consumption per block is O(n) with n nodes. Moreover, due to the existence of Byzantine nodes, existing partitioning methods, though widely adopted in distributed systems for decades, cannot be applied to blockchain systems directly; it is therefore critical to devise a new storage mechanism. This paper proposes a novel storage engine, called BFT-Store, to enhance storage scalability by integrating erasure coding with the Byzantine Fault Tolerance (BFT) consensus protocol. First, the storage consumption per block can be reduced to O(1), which enlarges the overall storage capability as more nodes join the blockchain. Second, an efficient online re-encoding protocol is designed for storage scale-out, and a hybrid replication scheme is employed to improve read performance. Last, extensive experimental results illustrate the scalability, availability and efficiency of BFT-Store, which is implemented on the open-source permissioned blockchain Tendermint.
【Keywords】: Blockchain; erasure coding; Byzantine Fault Tolerance; storage scalability
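The O(n) to O(1) storage intuition can be shown with the simplest possible erasure code. The sketch below stripes blocks with a single XOR parity block; it tolerates one erasure and no Byzantine behavior, so it only illustrates the storage saving, not BFT-Store's actual coding scheme.

```python
def encode(blocks):
    # Stripe n data blocks plus one XOR parity block across n+1 nodes:
    # each node stores a single block, so per-block storage per node is
    # O(1) rather than every node keeping the whole chain.
    parity = bytes(len(blocks[0]))
    for b in blocks:
        parity = bytes(x ^ y for x, y in zip(parity, b))
    return list(blocks) + [parity]

def recover(stored, lost_index):
    # Rebuild the block at lost_index by XOR-ing all the other blocks.
    acc = bytes(len(stored[lost_index - 1]))
    for i, b in enumerate(stored):
        if i != lost_index:
            acc = bytes(x ^ y for x, y in zip(acc, b))
    return acc

stored = encode([b"abcd", b"efgh", b"ijkl"])  # 3 data blocks -> 4 nodes
rebuilt = recover(stored, 1)                  # survives loss of one node
```

Real deployments use codes such as Reed-Solomon with enough parity blocks to out-tolerate the Byzantine fault threshold of the consensus protocol.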
【Paper Link】 【Pages】:1930-1933
【Authors】: Sara Cohen ; Adam Rosenthal ; Aviv Zohar
【Abstract】: A key difference between using blockchains to store data and centrally controlled databases is that transactions are accepted to a blockchain via a consensus mechanism, and not by a controlling central party. Hence, once a user has issued a transaction, she cannot be certain if it will be accepted. Moreover, a yet unaccepted transaction cannot be retracted by the user, and may (or may not) be appended to the blockchain at any point in the future. This causes difficulties as the user may wish to formulate new transactions based on the knowledge of which previous transactions will be accepted. Yet this knowledge is inherently uncertain. We introduce a formal abstraction for blockchains as a data storage layer that underlies a database. The main issue that we tackle is the need to reason about possible worlds, due to the uncertainty in transaction appending. In particular, we consider the theoretical complexity of determining whether it is possible for a denial constraint to be contradicted, given the current state of the blockchain, pending transactions, and integrity constraints on blockchain data. We then present algorithms for this problem that work well in practice.
【Keywords】: Bitcoin; Complexity theory; Uncertainty; Cognition; Standards
【Paper Link】 【Pages】:1934-1937
【Authors】: Chenhao Huang ; Michael J. Cahill ; Alan D. Fekete ; Uwe Röhm
【Abstract】: MongoDB is a popular document store that is also available as a cloud-hosted service. MongoDB internally deploys primary-copy asynchronous replication, and it allows clients to vary the Read Preference, so reads can deliberately be directed to secondaries rather than the primary site. Doing this can sometimes improve performance, but the returned data might be stale, whereas the primary always returns the freshest data value. While the state of practice is for programmers to decide where to direct reads at application development time, they lack a full understanding of the workload and hardware capacity at that point. It is better to choose the appropriate Read Preference setting at runtime, as we describe in this paper. We show how a system can detect when the primary copy is saturated in MongoDB-as-a-Service, and use this to choose where reads should be done to improve overall performance. Our approach is aimed at a cloud consumer; it assumes access to only the limited diagnostic data provided to clients of the hosted service.
【Keywords】: Database-as-a-Service; performance; replication
【Paper Link】 【Pages】:1938-1941
【Authors】: Da Yan ; Wenwen Qu ; Guimu Guo ; Xiaoling Wang
【Abstract】: Frequent pattern mining (FPM) has been a focused theme in data mining research for decades, but there is no general programming framework that can be easily customized to mine different kinds of frequent patterns, and existing solutions to FPM over big transaction databases are IO-bound, leaving CPU cores underutilized even though FPM is NP-hard. This paper presents PrefixFPM, a general-purpose framework for FPM that is able to fully utilize the CPU cores in a multicore machine. PrefixFPM follows the idea of prefix projection to partition the workloads of FPM into independent tasks by divide and conquer. PrefixFPM exposes a unified programming interface to users, who can customize it to mine their desired patterns, and the parallel execution engine is transparent to end-users and can be reused for mining all kinds of patterns. We have adapted the state-of-the-art serial algorithms for mining frequent patterns, including subsequences, subtrees, and subgraphs, on top of PrefixFPM, and extensive experiments demonstrate that PrefixFPM achieves an excellent speedup ratio as the number of cores increases. A demo is available at https://youtu.be/PfioC0GDpsw; the code is available at https://github.com/yanlab19870714/PrefixFPM.
【Keywords】: frequent pattern mining; prefix projection; compute-intensive; CPU-bound; sequence; subgraph; tree
【Paper Link】 【Pages】:1942-1945
【Authors】: Shufeng Gong ; Yanfeng Zhang ; Ge Yu
【Abstract】: Existing graph partition methods are designed for round-robin synchronous distributed frameworks. They balance workload without discriminating by vertex importance and fail to consider the characteristics of priority-based scheduling, which may limit the benefit of prioritized graph computation. To accelerate prioritized iterative graph computations, we propose Hotness Balanced Partition (HBP) and a stream-based partition algorithm, Pb-HBP. Pb-HBP partitions a graph by distributing vertices according to their hotness rather than blindly distributing vertices with equal weights, which aims to evenly distribute the hot vertices among workers. Our results show that our proposed partition method outperforms the state-of-the-art partition methods, Fennel and HotGraph. Specifically, Pb-HBP reduces runtime by 40-90% relative to hash partitioning, by 5-75% relative to Fennel, and by 22-50% relative to HotGraph.
【Keywords】: Hotness balance partition; Graph partition; Distributed computing; Prioritized Computation
【Paper Link】 【Pages】:1946-1949
【Authors】: Junli Li ; Chaowei Zhang ; Jifu Zhang ; Xiao Qin
【Abstract】: This paper develops a parallel computing system - MiCS - for mutual information of big categorical data on the Spark computing platform. The MiCS algorithm is conducive to processing the large volume of highly repetitive mutual-information calculations among feature pairs by applying a column-wise transformation scheme. To improve the efficiency of MiCS and the utilization of Spark cluster resources, we adopt a virtual partitioning scheme to achieve load balance while mitigating the data skewness problem in the Spark shuffle process.
【Keywords】: Parallel Mutual-information Computation; Feature Grouping; Data Skewness; Big Categorical Data; Spark.
【Paper Link】 【Pages】:1950-1953
【Authors】: Xiaoshuang Chen ; Longbin Lai ; Lu Qin ; Xuemin Lin
【Abstract】: Structural node similarity is widely used in analyzing complex networks. As one of the structural node similarity metrics, role similarity has the merit of indicating automorphism (isomorphism). Existing algorithms to compute role similarity (e.g., RoleSim and NED) suffer from severe performance bottlenecks, and thus cannot handle large real-world graphs. In this paper, we propose a new framework, StructSim, to compute nodes' role similarity. Under this framework, we prove that StructSim is guaranteed to be an admissible role similarity metric based on maximum matching. Since maximum matching is too costly to scale, we devise the BinCount matching to speed up the computation. BinCount-based StructSim admits a precomputed index to query one single pair in O(k log D) time, where k is a small user-defined parameter and D is the maximum node degree. Extensive empirical studies show that StructSim is significantly faster than existing works for computing structural node similarities on real-world graphs, with comparable effectiveness.
【Keywords】: Node Similarity; Role Similarity; Efficiency; Scalability.
【Paper Link】 【Pages】:1954-1957
【Authors】: Yue Wang ; Zhe Wang ; Ziyuan Zhao ; Zijian Li ; Xun Jian ; Lei Chen ; Jianchun Song
【Abstract】: Heterogeneous information networks (HINs) are usually used to model information systems with multi-type objects and relations. Measuring the similarity among objects is an important task in data mining applications. Currently, several similarity measures are defined for HINs. Most of these measures are based on meta-paths, which show sequences of node classes and edge types along the paths between two nodes. However, meta-paths, which are often designed by domain experts, are hard to enumerate and choose w.r.t. the quality of the similarity scores. This makes the existing similarity measures difficult to use in real applications. To address this problem, we extend SimRank, a well-known similarity measure for homogeneous graphs, to HINs, by introducing the concept of decay graph. The newly proposed relevance measure is called HowSim, which has the property of being meta-path free, and capturing structural and semantic similarity simultaneously. The generality and effectiveness of HowSim are demonstrated by extensive experiments.
【Keywords】: Heterogeneous information networks; similarity measure; data mining; SimRank
【Paper Link】 【Pages】:1958-1961
【Authors】: Bencheng Yan ; Chaokun Wang
【Abstract】: Recently, learning embedding of nodes in graphs has attracted increasing research attention. There are two main kinds of graph embedding methods, i.e., the transductive embedding methods and the inductive embedding methods. The former focuses on directly optimizing the embedding vectors, and the latter tries to learn a mapping function for the given nodes and features. However, few works focus on applying the learned model from one graph to another, which is a pervasive idea in Computer Vision and Natural Language Processing. Although some of the graph neural networks (GNNs) present similar motivation, none of them considers the graph bias among graphs. In this paper, we present an interesting graph embedding problem called Adaptive Task (AT), and propose a unified framework for this adaptive task, which introduces two types of alignment to learn adaptive node embedding across graphs. Then, based on the proposed framework, a novel graph adaptive embedding network is designed to address the adaptive task. Extensive experimental results demonstrate that our model significantly outperforms the state-of-the-art methods.
【Keywords】: adaptive embedding; GNNs; graph bias
【Paper Link】 【Pages】:1962-1965
【Authors】: Yeting Li ; Jialun Cao ; Haiming Chen ; Tingjian Ge ; Zhiwu Xu ; Qiancheng Peng
【Abstract】: Getting high quality XML schemas to avoid or reduce application risks is an important problem in practice, for which some important aspects have yet to be addressed satisfactorily in existing work. In this paper, we propose a tool FlashSchema for high quality XML schema design, which supports both one-pass and interactive schema design and schema recommendation. To the best of our knowledge, no other existing tools support interactive schema design and schema recommendation. One salient feature of our work is the design of algorithms to infer k-occurrence interleaving regular expressions, which are not only more powerful in model capacity, but also more efficient. Additionally, such algorithms form the basis of our interactive schema design. The other feature is that, starting from large-scale schema data that we have harvested from the Web, we devise a new solution for type inference, as well as propose schema recommendation for schema design. Finally, we conduct a series of experiments on two XML datasets, comparing with 9 state-of-the-art algorithms and open-source tools in terms of running time, preciseness, and conciseness. Experimental results show that our work achieves the highest level of preciseness and conciseness within only a few seconds. Experimental results and examples also demonstrate the effectiveness of our type inference and schema recommendation methods.
【Keywords】: XML; Inference algorithms; Tools; Semantics; Libraries; Data models
【Paper Link】 【Pages】:1966-1969
【Authors】: Yongjiang Liang ; Tingting Hu ; Peixiang Zhao
【Abstract】: Clustering uncertain graphs based on the probabilistic graph model has sparked extensive research and widely varying applications. Existing structural clustering methods rely heavily on the computation of pairwise reliable structural similarity between vertices, which has proven to be extremely costly, especially in large uncertain graphs. In this paper, we develop a new, decomposition-based method, ProbSCAN, for efficient reliable structural similarity computation with theoretically improved complexity. We further design a cost-effective index structure UCNO-Index, and a series of powerful pruning strategies to expedite reliable structural similarity computation in uncertain graphs. Experimental studies on eight real-world uncertain graphs demonstrate the effectiveness of our proposed solutions, which achieve orders-of-magnitude improvements in clustering efficiency compared with the state-of-the-art structural clustering methods on large uncertain graphs.
【Keywords】: Uncertain Graphs; Structural Clustering
【Paper Link】 【Pages】:1970-1973
【Authors】: Weiguo Zheng ; Jiewei Gu ; Peng Peng ; Jeffrey Xu Yu
【Abstract】: As a well-known optimization problem, the maximum independent set (MIS) has attracted much effort due to its significance in graph theory and its wide applications. Nevertheless, the vertices of many graphs are weighted unequally in real scenarios, while previous studies ignore the intrinsic weights on the graphs. Therefore, the weight of an MIS may not necessarily be the largest. Generalizing the traditional MIS problem, in this paper we study the problem of the maximum weighted independent set (MWIS), which returns the set of independent vertices with the largest total weight and is computationally expensive. Following the reduction-and-branching strategy, we propose an exact algorithm to compute the maximum weighted independent set. Since it is intractable to deliver the exact solution for large graphs, we design an efficient greedy algorithm to compute a near-maximum weighted independent set. We devise a set of novel reductions for general weighted graphs. To confirm the effectiveness and efficiency of the proposed methods, we conduct extensive experimental studies over a collection of real graphs.
【Keywords】: Greedy algorithms; Task analysis; Micromechanical devices; Graph theory; Conferences; Data engineering; Data science
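The abstract does not spell out its greedy rule, but a textbook greedy for weighted independent sets conveys the flavor: repeatedly take the vertex with the best weight-to-neighborhood ratio and discard its neighbors. The sketch below is that standard heuristic, not the paper's algorithm or reductions.

```python
def greedy_mwis(adj, weight):
    # Repeatedly pick the remaining vertex maximizing weight/(degree+1),
    # then drop it and its neighbors; the chosen set stays independent.
    # (A common textbook heuristic; the paper's greedy rule and
    # reduction steps are more involved.)
    alive = set(adj)
    chosen = []
    while alive:
        v = max(alive, key=lambda u: weight[u] /
                (sum(1 for w in adj[u] if w in alive) + 1))
        chosen.append(v)
        alive -= {v} | set(adj[v])
    return chosen

# A 4-vertex path graph 1-2-3-4 with a heavy vertex 3.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
weight = {1: 2.0, 2: 1.0, 3: 5.0, 4: 1.0}
chosen = greedy_mwis(adj, weight)  # picks 3 first, then 1
```

On this example the unweighted MIS {1, 3} and {2, 4} have the same size, but only the weights reveal that {1, 3} (total weight 7.0) is the better answer, which is exactly the gap the paper highlights.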
【Paper Link】 【Pages】:1974-1977
【Authors】: Ting Deng ; Lei Hou ; Ziyan Han
【Abstract】: Keys for graphs aim to uniquely identify entities represented by vertices in a graph, using the combination of topological constraints and value equality constraints. This paper proposes graph matching keys, referred to as GMKs, an extension of graph keys with similarity predicates on values, supporting approximate entity matching. We treat entity matching as a classification problem, and propose GMKSLEM, a supervised learning method for graph entity matching. In GMKSLEM, a feature extraction method is provided to discover candidate GMKs (CGMKs) to construct features for vector representation, and then high-quality features and representations are generated by feature selection. Moreover, GMKSLEM provides support to explain the classification results. Using real-life data, we experimentally verify the effectiveness of GMKSLEM, as well as its interpretability.
【Keywords】: Feature extraction; Pattern matching; Supervised learning; Semantics; Measurement; Conferences; Data engineering
【Paper Link】 【Pages】:1978-1981
【Authors】: Chaoyue Niu ; Zhenzhe Zheng ; Fan Wu ; Shaojie Tang ; Guihai Chen
【Abstract】: Society's insatiable appetite for personal data is driving the emergence of data markets, allowing data consumers to launch customized queries over the datasets collected by a data broker from data owners. In this paper, we study how the data broker can maximize her cumulative revenue by posting reasonable prices for sequential queries. We thus propose a contextual dynamic pricing mechanism with the reserve price constraint, which features the properties of ellipsoid for efficient online optimization, and can support linear and non-linear market value models with uncertainty. In particular, under low uncertainty, our pricing mechanism provides a worst-case regret logarithmic in the number of queries. We further extend to other similar application scenarios, including hospitality service and online advertising, and extensively evaluate all three application instances over the MovieLens 20M dataset, Airbnb listings in U.S. major cities, and the Avazu mobile ad click dataset, respectively. The analysis and evaluation results reveal that our proposed pricing mechanism incurs low practical regret, online latency, and memory overhead, and also demonstrate that the existence of a reserve price can mitigate the cold-start problem in a posted price mechanism, and thus can reduce the cumulative regret.
【Keywords】: personal data market; revenue maximization; contextual dynamic pricing; reserve price
【Paper Link】 【Pages】:1982-1985
【Authors】: Wangda Zhang ; Kenneth A. Ross
【Abstract】: Analytic queries enable sophisticated large-scale data analysis within many commercial, scientific and medical domains today. Data skew is a ubiquitous feature of these real-world domains, but current systems do not make the most of caches for exploiting skew. In particular, a whole cache line may remain cache resident even though only a small part of the cache line corresponds to a popular data item. In this paper, we propose a novel index structure for repositioning data items to concentrate popular items into the same cache lines, resulting in better spatial locality, and better utilization of limited cache resources. We analyze cache behavior, and implement database operators that are efficient in the presence of skew. Experiments on real and synthetic data show that exploiting skew can significantly improve in-memory query performance. In some cases, our techniques can speed up queries by over an order of magnitude.
【Keywords】: Indexes; Arrays; Query processing; Buildings; Data warehouses; Layout
【Paper Link】 【Pages】:1986-1989
【Authors】: Chengcheng Yang ; Dong Deng ; Shuo Shang ; Ling Shao
【Abstract】: Approximate Nearest Neighbor (ANN) search in high-dimensional space is a fundamental task in many applications. Locality-Sensitive Hashing (LSH) is a well-known methodology to solve the ANN problem with theoretical guarantees and empirical performance. We observe that existing LSH-based approaches target the problem of designing search-optimized indexes, which require a number of separate indexes and incur high index maintenance overhead, and are hence impractical for high-dimensional streaming data processing. In this paper, we present PDA-LSH, a novel and practical disk-based LSH index that can offer efficient support for both updates and searches. Experiments on real-world datasets show that our proposal outperforms the state-of-the-art schemes by up to 10× on update performance and up to 2× on search performance.
【Keywords】: Search problems; Indexing; Gaussian distribution; Handheld computers; Closed-form solutions; Maintenance engineering
【Paper Link】 【Pages】:1990-1993
【Authors】: Abdullah Alfarrarjeh ; Seon Ho Kim ; Vinuta Hegde ; Akshansh ; Cyrus Shahabi ; Qingyun Xie ; Siva Ravada
【Abstract】: Due to the prevalence of GPS-equipped cameras (e.g., smartphones and surveillance cameras), massive amounts of geo-tagged images capturing urban streets are increasingly being collected. Consequently, many smart city applications have emerged, relying on efficient image search. Such searches include spatial-visual queries in which spatial and visual properties are used in tandem to retrieve similar images to a given query image within a given geographical region. Towards this end, new index structures that organize images based on both spatial and visual properties are needed to efficiently execute such queries. Based on our observation that street images are typically similar in the same spatial locality, index structures for spatial-visual queries can be effectively built on a spatial index (i.e., R-tree). Therefore, we propose a class of R-tree indexes, particularly, by associating each node with two separate minimum bounding rectangles (MBR), one for spatial and the other for (dimension-reduced) visual properties of their contained images, and adapting the R*-tree optimization criteria to both property types.
【Keywords】: geo-tagged images; geo-tagged street images; spatial-visual index; spatial-visual search
【Paper Link】 【Pages】:1994-1997
【Authors】: Chi Thang Duong ; Hongzhi Yin ; Dung Hoang ; Minn Hung Nguyen ; Matthias Weidlich ; Quoc Viet Hung Nguyen ; Karl Aberer
【Abstract】: Effective information retrieval (IR) relies on the ability to comprehensively capture a user's information needs. Traditional IR systems are limited to homogeneous queries that define the information to retrieve by a single modality. Support for heterogeneous queries that combine different modalities has been proposed recently. Yet, existing approaches for heterogeneous querying are computationally expensive, as they require several passes over the data to construct a query answer. In this paper, we propose an IR system that overcomes the computational challenges imposed by heterogeneous queries by adopting graph embeddings. Specifically, we propose graph-based models in which both data and queries incorporate information of different modalities. Then, we show how either representation is transformed into a graph embedding in the same space, capturing relations between information of different modalities. By grounding query processing in graph embeddings, we enable processing of heterogeneous queries with a single pass over the data representation. Our experiments on several real-world and synthetic datasets illustrate that our technique is able to return twice the amount of relevant information in comparison with several baselines, while being scalable to large-scale data.
【Keywords】: Peer-to-peer computing; Training; Decoding; Information retrieval; Data models; Query processing; Task analysis
【Paper Link】 【Pages】:1998-2001
【Authors】: Jin Wang ; Chunbin Lin
【Abstract】: Location-based services have become ubiquitous in smart life, but typing queries in mobile devices is tedious and error-prone. Therefore, query autocompletion is needed to instantly provide users with query suggestions based on the incomplete user input. A recent trend is to support error-tolerant autocompletion, which could improve the usability by allowing a small number of errors between the query input and prefixes of strings in database. In addition, the query autocompletion should be location-aware for location-based services since it makes more sense to provide query suggestions for nearby objects. Unfortunately, existing query autocompletion algorithms cannot efficiently support both error-tolerant and location-aware features at the same time. In this paper, we propose a novel framework AutoEL to support error-tolerant location-aware query autocompletion. The error-tolerant feature is enabled by applying edit distance to evaluate the textual similarity between the given query and the underlying data, while the location-aware feature is guaranteed by choosing the k-nearest neighbors. To improve the efficiency, we construct a hybrid data structure to jointly index spatial and textual information. We also propose several optimizations on data partition as well as the search algorithm. Extensive experiments on real datasets demonstrate that AutoEL outperforms the baseline methods by up to an order of magnitude.
【Keywords】: Indexes; Mobile handsets; Optimization; Upper bound; Search engines; Search problems
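The error-tolerant matching primitive this abstract relies on is prefix edit distance: the minimum edit distance between the user's partial query and any prefix of a candidate string. Below is a plain dynamic-programming sketch of that primitive with a hypothetical point-of-interest list; AutoEL's indexed version avoids scanning all candidates like this.

```python
def prefix_edit_distance(query, s):
    # Minimum edit distance between `query` and any prefix of `s`,
    # via the standard edit-distance DP, taking the best value in
    # the query's last row (i.e., over all prefixes of s).
    m, n = len(query), len(s)
    prev = list(range(m + 1))      # distances against the empty prefix
    best = prev[m]
    for j in range(1, n + 1):
        cur = [j]
        for i in range(1, m + 1):
            cost = 0 if query[i - 1] == s[j - 1] else 1
            cur.append(min(prev[i] + 1,        # delete from query
                           cur[i - 1] + 1,     # insert into query
                           prev[i - 1] + cost))  # match / substitute
        best = min(best, cur[m])
        prev = cur
    return best

def fuzzy_complete(query, strings, tau=1):
    # Brute-force error-tolerant autocompletion: keep strings whose
    # prefix edit distance to the query is within threshold tau.
    return [s for s in strings if prefix_edit_distance(query, s) <= tau]

# Hypothetical nearby POI names.
matches = fuzzy_complete("starbuks", ["starbucks", "star wars", "subway"])
```

An exact prefix (e.g., "star" against "starbucks") has distance 0, so error-tolerant completion strictly generalizes ordinary prefix completion.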
【Paper Link】 【Pages】:2002-2005
【Authors】: Ruiyuan Li ; Huajun He ; Rubin Wang ; Sijie Ruan ; Yuan Sui ; Jie Bao ; Yu Zheng
【Abstract】: Trajectory data is very useful for many urban applications. However, due to its spatio-temporal and high-volume properties, it is challenging to manage trajectory data. Existing trajectory data management frameworks suffer from scalability problems and only support limited trajectory queries. This paper proposes a holistic distributed NoSQL trajectory storage engine, TrajMesa, based on GeoMesa, an open-source indexing toolkit for spatio-temporal data. TrajMesa adopts a novel storage schema, which reduces the storage size tremendously. We also devise novel indexing key designs and propose a bunch of pruning strategies. TrajMesa efficiently supports a rich set of queries, including ID-temporal query, spatial range query, similarity query, and k-NN query. Experimental results demonstrate the high query efficiency and scalability of TrajMesa.
【Keywords】: Trajectory; Indexing; Global Positioning System; Distributed databases; Engines; Scalability
【Paper Link】 【Pages】:2006-2009
【Authors】: Sean Bin Yang ; Bin Yang
【Abstract】: Modern navigation services often provide multiple paths connecting the same source and destination for users to select. Hence, ranking such paths becomes increasingly important, which directly affects service quality. We present PathRank, a data-driven framework for ranking paths based on historical trajectories. If a trajectory used path P from source s to destination d, PathRank considers this as evidence that P is preferred over all other paths from s to d. Thus, a path that is similar to P should have a larger ranking score than a path that is dissimilar to P. Based on this intuition, PathRank models path ranking as a regression problem that assigns each path a ranking score. We first propose an effective method to generate a compact set of diversified paths using trajectories as training data. Next, we propose an end-to-end deep learning framework to solve the regression problem. In particular, a spatial network embedding is proposed to embed each vertex to a feature vector by considering the road network topology. Since a path is represented by a sequence of vertices, which is now a sequence of feature vectors after embedding, a recurrent neural network is applied to model the sequence. Empirical studies on a substantial trajectory data set offer insight into the design properties of the proposed framework and indicate that it is effective and practical.
【Keywords】: Training; Trajectory; Routing; Roads; Vehicles; Machine learning; Training data
【Paper Link】 【Pages】:2014-2017
【Authors】: Tiantian Liu ; Zijin Feng ; Huan Li ; Hua Lu ; Muhammad Aamir Cheema ; Hong Cheng ; Jianliang Xu
【Abstract】: Indoor shortest path query (ISPQ) is of fundamental importance for indoor location-based services (LBS). However, existing ISPQs ignore indoor temporal variations, e.g., the open and close times associated with entities like doors and rooms. In this paper, we define a new type of query called Indoor Temporal-variation aware Shortest Path Query (ITSPQ). It returns the valid shortest path based on the up-to-date indoor topology at the query time. A set of techniques is designed to answer ITSPQ efficiently. We design a graph structure (IT-Graph) that captures indoor temporal variations. To process ITSPQ using IT-Graph, we design two algorithms that check a door's accessibility synchronously and asynchronously, respectively. We experimentally evaluate the proposed techniques using synthetic data. The results show that our methods are efficient.
【Keywords】: Topology; Partitioning algorithms; Computer science; Navigation; Buildings; Hospitals; Legged locomotion
【Paper Link】 【Pages】:2018-2019
【Authors】: Sin G. Teo ; Jianneng Cao ; Vincent C. S. Lee
【Abstract】: Secure multi-party computation (SMC) allows parties to jointly compute a function over their inputs, while keeping every input confidential. SMC has been extensively applied in tasks with privacy requirements, such as privacy-preserving data mining (PPDM), to learn task output and at the same time protect input data privacy. However, existing SMC-based solutions are ad-hoc - they are proposed for specific applications, and thus cannot be applied to other applications directly. To address this issue, we propose a privacy model DAG (Directed Acyclic Graph) that consists of a set of fundamental secure operators (e.g., +, -, ×, /, and power). Our model is general - its operators, if pipelined together, can implement various functions, even complicated ones. The experimental results also show that our DAG model can run in acceptable time.
【Keywords】: DAG; Secure; Operator; Privacy; and Data Mining
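The operator-pipelining idea above can be illustrated with a small evaluator. This is a hedged sketch: plaintext arithmetic stands in where the paper's model would execute each primitive as a secure multi-party protocol, and the function name `evaluate_dag` and the node encoding are hypothetical, not the paper's API.

```python
import operator

# Plaintext stand-ins for the model's secure primitives (+, -, *, /, power);
# in the actual DAG model each of these would run as a secure protocol.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "/": operator.truediv, "pow": operator.pow}

def evaluate_dag(nodes, inputs, output):
    """Evaluate a function expressed as a DAG of primitive operators.

    nodes:  name -> (op, [operand names]) for internal nodes (must be acyclic)
    inputs: name -> value for source nodes
    output: name of the node whose value is the function result
    """
    values = dict(inputs)

    def eval_node(name):
        if name not in values:
            op, args = nodes[name]
            values[name] = OPS[op](*(eval_node(a) for a in args))
        return values[name]

    return eval_node(output)
```

For example, the function (a + b) × (a − b) pipelines three operators: `{"s": ("+", ["a", "b"]), "d": ("-", ["a", "b"]), "out": ("*", ["s", "d"])}`.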
【Paper Link】 【Pages】:2020-2021
【Authors】: Woohwan Jung ; Suyong Kwon ; Kyuseok Shim
【Abstract】: Log data from mobile devices usually contain a series of events with time intervals. However, the problem of releasing differentially private time interval data has not been tackled yet. We propose the TIDY (publishing Time Intervals via Differential privacY) algorithm to release time interval data under differential privacy. We use the frequency vectors as a compact representation of the time interval data to reduce the aggregated noise. We also develop a new partitioning method adapted for the frequency vectors to balance the trade-off between the noise and structural errors. Our experiments confirm that TIDY outperforms the existing algorithms for releasing 2D histograms.
【Keywords】: Differential privacy; Time interval data; Temporal data
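The frequency-vector release step described above can be sketched with the standard Laplace mechanism. This is an illustrative sketch only: it shows generic differentially private counting on a frequency vector, not TIDY's adaptive partitioning, and the function names are hypothetical.

```python
import random

def laplace(scale):
    """Laplace(0, scale) noise, sampled as the difference of two exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_frequency_vector(events, n_bins, epsilon):
    """Release a frequency vector under epsilon-differential privacy.

    Each event contributes to exactly one bin, so the global sensitivity of
    the vector is 1 and Laplace noise of scale 1/epsilon per bin suffices.
    events: iterable of bin indices in [0, n_bins).
    """
    counts = [0.0] * n_bins
    for b in events:
        counts[b] += 1.0
    return [c + laplace(1.0 / epsilon) for c in counts]
```

Smaller epsilon (stronger privacy) means larger noise per bin, which is why TIDY's compact frequency-vector representation, with fewer cells to perturb, reduces the aggregated noise.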
【Paper Link】 【Pages】:2022-2023
【Authors】: Tsz Nam Chan ; Man Lung Yiu ; Leong Hou U
【Abstract】: The Earth Mover's Distance (EMD) is a robust similarity measure between two histograms (e.g., probability distributions). It has been extensively used in a wide range of applications, e.g., multimedia, data mining, computer vision, etc. As EMD is a computationally intensive operation, many efficient lower and upper bound functions of EMD have been developed. However, they provide no guarantee on the error. In this work, we study how to compute approximate EMD value with bounded error, using these bound functions. First, we propose an approximation framework that leverages on lower and upper bound functions to compute approximate EMD with error guarantee. Then, we present three solutions to solve our problem. Experimental results on real data demonstrate the efficiency of our proposed solutions.
【Keywords】: Earth; Upper bound; Robustness; Histograms; Multimedia databases; Computer vision; Image retrieval
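The bound-based approximation framework above can be sketched generically: tighten a lower/upper bound interval until its width meets the error budget, then return the midpoint. This hedged sketch shows only the refinement loop; the paper's three concrete solutions (which EMD bounds to invoke, and in what order) are not reproduced, and the names are hypothetical.

```python
def approximate_with_bounds(bound_pairs, abs_error):
    """Approximate an expensive value using progressively tighter bounds.

    bound_pairs: iterable of callables, each returning (lb, ub) with
                 lb <= true value <= ub; later pairs are tighter but costlier.
    Returns an estimate within abs_error of the true value, provided some
    pair meets the tolerance (otherwise fall back to exact computation).
    """
    lb, ub = float("-inf"), float("inf")
    for pair in bound_pairs:
        l, u = pair()
        lb, ub = max(lb, l), min(ub, u)   # intersect with the running interval
        if ub - lb <= 2 * abs_error:      # midpoint is within abs_error of
            break                         # every value in [lb, ub]
    return (lb + ub) / 2.0
```

The point of the framework is that cheap bounds often shrink the interval enough that the exact (expensive) EMD computation is never needed.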
【Paper Link】 【Pages】:2024-2025
【Authors】: Victor Junqiu Wei ; Raymond Chi-Wing Wong ; Cheng Long ; Pan Hui
【Abstract】: Geo-textual data is ubiquitous nowadays, where each object has a location and is associated with some keywords. Many types of queries based on geo-textual data, termed spatial keyword queries, have been proposed; they aim to find objects that are optimal in terms of both their locations and their keywords. In this paper, we propose a new type of query called the nearby-fit spatial keyword query (NSKQ), where an optimal object is defined based not only on the location and keywords of the object itself, but also on those of the objects nearby. For example, in an application for finding a hotel, not only the location of a hotel but also the objects near the hotel (e.g., shopping malls, restaurants, and bus stops nearby) might need to be taken into consideration. The query is proved to be NP-hard, and to process it efficiently, we developed two approximation algorithms with small constant approximation factors of 1.155 and 1.79. We conducted extensive experiments on both real and synthetic datasets, which verify the effectiveness of our algorithms.
【Keywords】: spatial keyword queries; proximity queries; spatial database; location-based services
【Paper Link】 【Pages】:2026-2027
【Authors】: Jaewook Byun ; Sungpil Woo ; Daeyoung Kim
【Abstract】: ChronoGraph is a novel system enabling temporal graph traversals. Compared to snapshot-oriented systems, this traversal-oriented system is suitable for analyzing information diffusion over time without violating time constraints on temporal paths. ChronoGraph aims to bridge the chasm between point-based and period-based semantics, and the gap between temporal and static graph traversals. Accordingly, our graph model and traversal language provide temporal syntax for both semantics, and we present a method for converting point-based semantics to period-based semantics. ChronoGraph also exploits temporal support and parallelism to handle the temporal degree, which grows explosively compared to static graphs. We demonstrate how three traversal recipes can be implemented on top of our system: temporal breadth-first search (tBFS), temporal depth-first search (tDFS), and temporal single-source shortest path (tSSSP). Our evaluation shows that the temporal support and parallelism enhance temporal graph traversals in terms of convenience and efficiency, and that ChronoGraph outperforms existing property graph databases on temporal graph traversals. We prototype ChronoGraph by extending TinkerPop, a de facto standard for property graphs, so we expect our system to be readily accessible to existing property graph users.
【Keywords】: ChronoGraph; Temporal Networks; Temporal Graph; Graph Traversal Language; Temporal Aggregation; Parallelism
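The time constraint on temporal paths mentioned above is that traversal times along a path must not decrease. A minimal earliest-arrival temporal BFS, in the spirit of the tBFS recipe (this is a generic sketch, not ChronoGraph's Gremlin-based implementation), can be written as:

```python
from collections import deque

def temporal_bfs(edges, source, t_start):
    """Earliest-arrival traversal over a temporal graph.

    edges: iterable of (u, v, t) meaning u -> v is traversable at time t.
    A temporal path must have non-decreasing edge times, so an edge is
    usable only if its time is >= the arrival time at its source node.
    Returns {node: earliest arrival time} for nodes reachable from source.
    """
    adj = {}
    for u, v, t in edges:
        adj.setdefault(u, []).append((t, v))
    arrival = {source: t_start}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for t, v in adj.get(u, []):
            # Respect the time constraint, and relabel only on improvement.
            if t >= arrival[u] and (v not in arrival or t < arrival[v]):
                arrival[v] = t
                queue.append(v)
    return arrival
```

Note that a static BFS over the same edge set would report every endpoint as reachable; the temporal constraint is what makes diffusion analysis meaningful.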
【Paper Link】 【Pages】:2028-2029
【Authors】: Jong-Ryul Lee ; Chin-Wan Chung
【Abstract】: A distance sensitivity oracle is a data structure answering queries that ask the shortest distance from one node to another in a network subject to node/edge failures. It has been studied mainly in the theory literature, but all the existing oracles for directed graphs suffer from prohibitive preprocessing time and space. Motivated by this, we develop two practical distance sensitivity oracles for directed graphs as variants of Transit Node Routing, together with effective speed-up techniques that incur a slight loss of accuracy. Extensive experiments demonstrate that our oracles greatly outperform all competitors in most cases. To the best of our knowledge, our oracles are the first distance sensitivity oracles that handle real-world graph data with million-level nodes.
【Keywords】: Graph algorithms; Path and circuit problems; Graphs and networks; Distance sensitivity oracle; Distance query; Shortest distance; Shortest path algorithms
【Paper Link】 【Pages】:2030-2031
【Authors】: Petr Lukás ; Radim Baca ; Michal Krátký ; Tok Wang Ling
【Abstract】: Structural XQuery and XPath queries are often modeled by twig pattern queries (TPQs) specifying predicates on XML nodes and structural relationships to be satisfied between them. This paper considers a TPQ model extended by a specification of output and non-output query nodes, since it complies with the XQuery and XPath semantics. There are two types of TPQ processing approaches: binary joins and holistic joins. The binary joins utilize a query plan of interconnected binary operators, whereas the holistic joins are based on one complex operator that processes the whole query. In recent years, the holistic joins have been considered the state-of-the-art TPQ processing method. However, a thorough analytical and experimental comparison of binary and holistic joins has been missing despite an enormous research effort in this area. In this paper, we try to fill this gap. We introduce several improvements of the binary join operators which enable us to build a so-called fully-pipelined (FP) query plan for any TPQ with the specification of output and non-output query nodes. We analytically show that, for a class of queries, the proposed approach has the same time and space complexity as holistic joins, and we experimentally demonstrate that the proposed approach outperforms holistic joins in many cases.
【Keywords】: XML; Computational modeling; Data engineering; Query processing; Complexity theory; Data collection; Pattern matching
【Paper Link】 【Pages】:2032-2033
【Authors】: Xiaoye Miao ; Yunjun Gao ; Su Guo ; Lu Chen ; Jianwei Yin ; Qing Li
【Abstract】: Due to the pervasiveness of incomplete data, incomplete data queries are vital in a large number of real-life scenarios. Current models and approaches for incomplete data queries mainly rely on machine power. In this paper, we study the problem of skyline queries over incomplete data with crowdsourcing. We propose a novel query framework, termed BayesCrowd, on top of a Bayesian network and the typical c-table model on incomplete data. Considering budget and latency constraints, we present a suite of effective task selection strategies. In particular, since computing the probability of each object being an answer object is at least as hard as the #SAT problem, we propose an adaptive DPLL (i.e., Davis-Putnam-Logemann-Loveland) algorithm to speed up the computation. Extensive experiments using both real and synthetic data sets confirm the superiority of BayesCrowd over the state-of-the-art method.
【Keywords】: Task analysis; Crowdsourcing; Computational modeling; Data models; Computer science; Bayes methods; Databases
【Paper Link】 【Pages】:2034-2035
【Authors】: Pengfei Li ; Hua Lu ; Qian Zheng ; Shijian Li ; Gang Pan
【Abstract】: This study explores the problem of co-location judgement, i.e., to decide whether two Twitter users are co-located at some point-of-interest (POI). We extract novel features, named HisRect, from users' historical visits and recent tweets: the former has impact on where a user visits in general, whereas the latter gives more hints about where a user is currently. To alleviate the issue of data scarcity, a semi-supervised learning (SSL) framework is designed to extract HisRect features. Moreover, we use an embedding neural network layer to decide co-location based on the difference between two users' HisRect features. Extensive experiments on real Twitter data suggest that our HisRect features and SSL framework are highly effective at deciding co-locations.
【Keywords】: Feature extraction; History; Twitter; Neural networks; Training; Computer science
【Paper Link】 【Pages】:2036-2037
【Authors】: Tenindra Abeywickrama ; Muhammad Aamir Cheema ; Arijit Khan
【Abstract】: Given the prevalence and volume of local search queries, today's search engines are required to find results by both spatial proximity and textual relevance at high query throughput. Existing techniques to answer such spatial keyword queries employ a keyword aggregation strategy that suffers from certain drawbacks when applied to road networks. Instead, we propose the K-SPIN framework, which uses an alternative keyword separation strategy that is more suitable on road networks. While this strategy was previously thought to entail prohibitive pre-processing costs, we further propose novel techniques to make our framework viable and even light-weight. Thorough experimentation shows that K-SPIN outperforms the state-of-the-art by up to two orders of magnitude on a wide range of settings and real-world datasets.
【Keywords】: Road networks; points of interest search; spatio-textual queries; network Voronoi diagrams
【Paper Link】 【Pages】:2038-2039
【Authors】: Hongmei Chen ; Yixiang Fang ; Ying Zhang ; Wenjie Zhang ; Lizhen Wang
【Abstract】: A huge volume of spatio-textual objects generated from location-based services enable a wide range of spatial keyword queries. Recently, researchers have proposed a novel query, called Spatial Pattern Matching (SPM), which uses a pattern to capture users' intention. It has been demonstrated to be useful but computationally intractable. Existing algorithms suffer from low efficiency, especially on large-scale datasets. To enhance the performance of SPM, in this paper we propose a novel Efficient Spatial Pattern Matching (ESPM) algorithm, which exploits the inverted linear quadtree index and computes matched node pairs and object pairs level by level in a top-down manner. In particular, it focuses on pruning unpromising nodes and node pairs at the high levels, allowing a large number of unpromising objects and object pairs to be pruned before they are accessed from disk. Our experimental results on real large datasets show that ESPM is over one order of magnitude faster than the state-of-the-art algorithm, and also incurs much lower I/O cost.
【Keywords】: Pattern matching; Indexes; Spatial databases; Time factors; Australia; Conferences; Data engineering
【Paper Link】 【Pages】:2040-2041
【Authors】: Yong Zhang ; Jiacheng Wu ; Jin Wang ; Chunxiao Xing
【Abstract】: Set similarity search is a fundamental operation in a variety of applications [3] , [5] , [2] , and there is a long stream of research on the problem. Given a collection of set records, a query, and a similarity function, the algorithm returns all set records that are similar to the query. There are many metrics to measure the similarity between two sets, such as Overlap, Jaccard, Cosine, and Dice. In this paper we use the widely applied Jaccard similarity to quantify the similarity between two sets, but our proposed techniques can be easily extended to other set-based similarity functions. Previous approaches require users to specify a similarity threshold. However, in many scenarios it is rather difficult to specify such a threshold. For example, when users type keywords into a search engine, they pay more attention to the top-ranked results, say the top five. In such cases, threshold-based search, unlike KNN similarity search, makes it difficult to find the results that are most attractive to users.
【Keywords】: Indexes; Approximation algorithms; Measurement; Transforms; Upper bound; Computer science; Search problems
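The contrast drawn above between threshold-based and KNN set similarity search can be made concrete with a brute-force baseline (the paper's indexing techniques accelerate exactly this computation; the helper names here are illustrative only):

```python
import heapq

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two sets."""
    if not a and not b:
        return 1.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def knn_sets(records, query, k):
    """Return indices of the k records most similar to the query under
    Jaccard -- no similarity threshold needs to be chosen up front."""
    q = set(query)
    scored = ((jaccard(set(r), q), i) for i, r in enumerate(records))
    return [i for _, i in heapq.nlargest(k, scored)]
```

A threshold-based variant would instead return `[i for i, r in enumerate(records) if jaccard(set(r), q) >= tau]`, which may return far too many or too few results depending on the chosen tau.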
【Paper Link】 【Pages】:2042-2043
【Authors】: Mao-Lin Li ; Francesco Di Mauro ; K. Selçuk Candan ; Maria Luisa Sapino
【Abstract】: With many applications relying on multi-dimensional datasets for decision making, matrix factorization (or decomposition) is becoming the basis for many knowledge discovery and machine learning tasks, from clustering, trend detection, and anomaly detection to correlation analysis. Unfortunately, a major shortcoming of matrix analysis operations is that, despite their effectiveness on scalar data, they are difficult to apply to data that include non-scalar observations, such as intervals, for which they were not designed. In this paper, we propose matrix decomposition techniques that account for interval-valued data. We show that naive ways of dealing with such imperfect data may introduce errors in analysis, and we present factorization techniques that are especially effective when the amount of imprecise information is large.
【Keywords】: Matrix decomposition; Semantics; Face; Probabilistic logic; Collaboration; Conferences; Data engineering
【Paper Link】 【Pages】:2044-2048
【Authors】: Zhenggang Wang ; Liu Zhong
【Abstract】: In this paper, by analyzing the advantages and disadvantages of existing clustering algorithms, we propose a new neighborhood density correlation clustering (NDCC) algorithm for quickly discovering arbitrarily shaped clusters that avoids the clustering errors caused by iso-density points between clusters. Because the density of a cluster's center region is greater than that of its edge region, data points can be divided into core, edge, and noise points, and the density correlation of the core points within their neighborhoods can then be used to form a cluster. The clustering result for a sample dataset is formed from the connections between core points and edge points. By constructing an objective function and optimizing its parameters automatically, a locally optimal result close to the globally optimal solution can be obtained. This algorithm alleviates the difficulties of parameter optimization and aspheric clustering in unsupervised clustering, and has excellent potential for application in data analysis and mining.
【Keywords】: aspheric cluster; parameter optimization; neighborhood connection; core data; clustering
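The core/edge/noise division described above can be sketched as follows. This is a hedged illustration of that one step only: the cluster formation via neighborhood connections and the objective-function-based parameter optimization of NDCC are not reproduced, and the parameter names (`eps`, `min_pts`) are generic density-clustering conventions, not the paper's notation.

```python
def classify_points(points, eps, min_pts):
    """Label each 2D point 'core', 'edge', or 'noise' by neighborhood density.

    Core points have at least min_pts neighbors within radius eps (the dense
    center region of a cluster); edge points are non-core points within eps
    of some core point; all remaining points are noise.
    """
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if j != i and (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2]

    nbrs = [neighbors(i) for i in range(len(points))]
    core = {i for i in range(len(points)) if len(nbrs[i]) >= min_pts}
    labels = []
    for i in range(len(points)):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nbrs[i]):
            labels.append("edge")
        else:
            labels.append("noise")
    return labels
```

Chaining density-connected core points (and attaching their edge points) then yields arbitrarily shaped clusters, which is what lets density-based methods handle aspheric clusters that centroid-based methods miss.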
【Paper Link】 【Pages】:2049-2053
【Authors】: Deleli Mesay Adinew ; Zhou Shijie ; Yongjian Liao
【Abstract】: As data grows in different dimensions, it is difficult to find appropriate data analytics tools. Spark is a high-speed, in-memory big data analytics tool designed to improve the efficiency of both batch and real-time data analytics. However, Spark has a memory bottleneck that degrades application performance, since it computes in memory and also stores intermediate and output results there. Our primary goal is to investigate how performance changes with the configuration of executor memory, the number of executors, the number of cores, and the deploy mode in a standalone cluster model. Three representative Spark applications are used as workloads to evaluate performance as these parameter values change. Experimental results show that under two of the workloads, a job submitted in cluster deploy mode finishes faster than one submitted in client deploy mode; this implies that Spark performance depends on the type of application rather than on the deploy mode alone. However, increasing the number of executors per worker, the number of cores per executor, and the memory fraction increases Spark performance under all workloads in either deploy mode.
【Keywords】: Spark Performance Tuning; Parameter configuration; Standalone cluster; Deploy mode; Memory Management; Performance improvement
【Paper Link】 【Pages】:2054-2058
【Authors】: Yifan Qiao
【Abstract】: This thesis advocates a novel DRAM-only strategy to reduce the computing system memory cost for the first time, and investigates its applications to database systems. This thesis envisions a low-cost DRAM module called block-protected DRAM, which reduces bit cost by significantly relaxing the DRAM raw reliability while employing a long error correction code (ECC) to ensure data integrity at small coding redundancy. Built upon exactly the same DRAM technology, today's byte-accessible DRAM and the envisioned block-protected DRAM strike different trade-offs between memory bit cost and native data access granularity, and naturally form a heterogeneous DRAM-only memory system. The practical feasibility of such heterogeneous memory systems is further strengthened by new media-agnostic and latency-oblivious CPU-memory interfaces such as IBM's OpenCAPI/OMI and Intel's CXL. This DRAM-only design approach perfectly leverages the existing DRAM manufacturing infrastructure and is not subject to any fundamental technology risk and uncertainty. Hence, before NVM technologies can eventually fulfill their long-awaited promise (i.e., DRAM-grade speed at flash-grade cost), this DRAM-only design framework can fill the gap to empower continuous progress and advances of computing systems. This thesis aims to develop techniques that enable relational and NoSQL databases to take full advantage of the envisioned low-cost heterogeneous DRAM system. As the first step, we studied how one could employ heterogeneous DRAM to implement a low-cost tiered caching solution for relational databases, and obtained encouraging results using MySQL as a test vehicle.
【Keywords】: Random access memory; Databases; Nonvolatile memory; Error correction codes; Redundancy; Industries
【Paper Link】 【Pages】:2059-2063
【Authors】: Botao Peng
【Abstract】: Data series similarity search is a core operation for several data series analysis applications across many different domains. However, the state-of-the-art techniques fail to deliver the time performance required for interactive exploration, or analysis of large data series collections. In this Ph.D. work, we present the first data series indexing solutions, for both on-disk and in-memory data, that are designed to inherently take advantage of multi-core architectures, in order to accelerate similarity search processing times. Our experiments on a variety of synthetic and real data demonstrate that our approaches are up to orders of magnitude faster than the alternatives. More specifically, our on-disk solution can answer exact similarity search queries on 100GB datasets in a few seconds, and our in-memory solution in a few milliseconds, which enables real-time, interactive data exploration on very large data series collections.
【Keywords】: Data series; Indexing; Modern hardware
【Paper Link】 【Pages】:2064-2068
【Abstract】: Area query, to find all elements contained in a specified area from a certain set of spatial objects, is a very important spatial query widely required in various fields. A number of approaches have been proposed to implement this query, the best known of which is to obtain a rough candidate set through spatial indexes and then refine the candidates through geometric validations to get the final result. When the shape of the query area is a rectangle, this method has very high efficiency. However, when the query area is irregular, the candidate set is usually much larger than the final result set, which means a lot of redundant detection needs to be done, thus the efficiency is greatly limited. In view of this issue, we propose a method of iteratively generating candidates based on Voronoi diagrams and apply it to area queries. The experimental results indicate that with our approach, the number of candidates in the process of area query is greatly reduced and the efficiency of the query is significantly improved.
【Keywords】: area query; spatial query; GIS; Voronoi diagram
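The filter-and-refine pipeline described above ends with geometric validation of each candidate against the query area. A minimal sketch of that refinement step for a polygonal area, using the standard ray-casting test, is shown below; the Voronoi-based candidate generation that the paper contributes is not reproduced here, and the function names are illustrative.

```python
def point_in_polygon(pt, polygon):
    """Ray-casting test: a point is inside iff a horizontal ray from it
    crosses the polygon boundary an odd number of times.
    (Points exactly on the boundary may be classified either way.)"""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                      # edge straddles the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def area_query(candidates, polygon):
    """Refinement step of an area query: keep candidates inside the area."""
    return [p for p in candidates if point_in_polygon(p, polygon)]
```

For irregular areas the candidate set from a spatial index can vastly exceed the final result, so reducing candidates before this per-point test, as the Voronoi-based method does, directly cuts the dominant cost.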
【Paper Link】 【Pages】:2069-2073
【Authors】: Wenxiao Fu
【Abstract】: Hierarchical aggregation supports fast exploration of large datasets by pre-aggregating data into a multi-scale data structure. While the pre-aggregation process is done offline, it can be quite expensive, and the resulting data-structure extremely large. When the data is multi-dimensional, this is greatly compounded. Data-cube-based approaches can result in extremely large cubic data-structures, aggregating more than is needed. Other approaches do not aggregate enough, and so do not offer the necessary flexibility for dimension-wise roll-ups and drill-downs. We design a hierarchical data-structure for aggregation that strikes a balance, and provides enough flexibility for different exploration scenarios with low-cost construction and reasonable size. Inductive aggregation is a methodology to compute levels of aggregations efficiently, while the resulting data-structure supports smooth data exploration. Inspired by this, we propose a partitioned, inductively aggregated data-cube, picube. A framework we call stratum space is presented with the model to express the dependencies across aggregation levels. Optimization choices are discussed, providing good design tradeoffs between storing and querying of the data.
【Keywords】: inductive aggregation; data-cube; partition; flexibility; efficiency; size
【Paper Link】 【Pages】:2075-2079
【Authors】: Cynthia Joesph ; A. Rajeswari ; B. Premalatha ; C. Balapriya
【Abstract】: Emotion recognition plays an important role in depression detection. The proposed system classifies emotions automatically by pre-processing physiological signals, extracting features, and performing classification, followed by an analysis of classification accuracy. A Support Vector Machine (SVM) is used for classification because of its high recognition rates and superior performance compared to Bayesian and regression-based classifiers. Data corresponding to eight emotions, available in the DEAP and MAHNOB-HCI databases, are used for training and testing the proposed system. The physiological signals considered are the Electromyogram (EMG), Blood Volume Pressure (BVP), and Galvanic Skin Response (GSR) from the eight-emotion Sentics database. A classification accuracy of 75% is obtained for five target emotions, namely Joy, Grief, Anger, Hate, and Reverence. An improved recognition rate of 91% is obtained by using k-fold leave-one-out cross-validation to reduce the observation dependence of prediction when training data is limited.
【Keywords】: Emotion Recognition; Support Vector Machine; Eight Emotion Sentics Data; k-fold leave out one cross-validation
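The leave-one-out validation scheme mentioned above can be sketched in a few lines. To keep the sketch dependency-free, a 1-nearest-neighbor classifier stands in for the paper's SVM; the function names and the toy feature vectors are hypothetical, and only the cross-validation mechanics, holding out one observation at a time and averaging accuracy, reflect the technique described.

```python
def nn_predict(train_x, train_y, x):
    """1-NN stand-in classifier (the paper uses an SVM instead)."""
    best = min(range(len(train_x)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(train_x[j], x)))
    return train_y[best]

def leave_one_out_accuracy(samples, labels, predict):
    """Leave-one-out cross-validation: train on all but one observation,
    test on the held-out one, and report the mean accuracy."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)
```

Because every observation serves as a test case exactly once, this estimate uses limited training data far more efficiently than a single train/test split, which is why it reduces the observation dependence of the reported recognition rate.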