35th ICDE 2019:Macao, China

35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019. IEEE 【DBLP Link

Paper Num: 268 || Session Num: 0

1. AsterixDB Mid-Flight: A Case Study in Building Systems in Academia.

Paper Link】 【Pages】:1-12

【Authors】: Michael J. Carey

【Abstract】: Building large software systems is always a challenging venture, but it is especially so in academia. This paper describes the experiences that the author and his (mostly UC-based) partners in software crime have had that culminated in the Big Data Management System now available as Apache AsterixDB. It covers a mix of the history and technical content of the nearly ten-year-old project, starting with its inception during the MapReduce craze. It describes the phases that the effort has gone through and some of the lessons learned along the way. The paper also covers some personal reflections and opinions about the challenges of systems-building, as well as writing about it, in our current academic culture. Included is the case for doing this sort of work at all - discussing the pitfalls of doing "systems" research in the absence of an actual system, and why the gain outweighs the pain of building and sharing database software in academia. As of late 2018, Apache AsterixDB is also having a commercial impact as the storage and parallel query engine underlying a new offering called Couchbase Analytics. The last part of the paper explains how we are attempting to balance the uses of AsterixDB as (i) a generally available open source Apache software platform, (ii) an end-to-end research testbed for universities, and (iii) the technology powering a commercial NoSQL product.

【Keywords】: Big Data; software development; academia

2. Data Management at Huawei: Recent Accomplishments and Future Challenges.

Paper Link】 【Pages】:13-24

【Authors】: Jianjun Chen ; Yu Chen ; Zhibiao Chen ; Ahmad Ghazal ; Guoliang Li ; Sihao Li ; Weijie Ou ; Yang Sun ; Mingyi Zhang ; Minqi Zhou

【Abstract】: Huawei is a leading global provider of information and communication technologies (ICT) infrastructure and smart devices. With integrated solutions across four key domains: telecommunication networks, IT, smart devices, and cloud services, Huawei is committed to bringing digital transformation to every person, home and organization for a fully connected and intelligent world. Founded in 1987, Huawei currently has more than 180,000 employees, and operates in more than 170 countries and regions with revenue over 100 billion USD in 2018. Data management plays a key role in all of the four key domains above. We have developed innovative products and solutions to support rapid business growth driven by customer requirements. While many data management problems are common, each domain also has its own special requirements and challenges. In this paper, we will go through recent advancements in Huawei data management technologies including a petabyte scale enterprise analytics platform (FusionInsight MPPDB) and a highly available in-memory database for telecommunication networks (GMDB). In addition, we discuss data management challenges that we are facing in the areas of autonomous databases and device-edge-cloud collaboration data platforms.

【Keywords】: Distributed Transaction Processing, Multi-Model, Auto-Tuning, HTAP, MPPDB, Autonomous DB, Distributed Data Collaboration

3. Building a Broad Knowledge Graph for Products.

Paper Link】 【Pages】:25

【Authors】: Xin Luna Dong

【Abstract】: A knowledge graph (KG) describes entities and relations between them; for example, between entities Amazon and Seattle, there can be a headquarter_located_at relation. Recent years have witnessed broad applications of KG in search (e.g., by Google and Bing) and question answering (e.g., by Amazon Alexa or Google Home). In this talk, we ask the question: Can one build a knowledge graph (KG) for all products in the world?

【Keywords】: Ontologies; Data mining; Prototypes; Google; Encyclopedias

4. Knowledge Graphs and Enterprise AI: The Promise of an Enabling Technology.

Paper Link】 【Pages】:26-37

【Authors】: Luigi Bellomarini ; Daniele Fakhoury ; Georg Gottlob ; Emanuel Sallinger

【Abstract】: Adopting a mature AI strategy is fundamental for modern knowledge companies to govern the proliferation of smart AI-driven applications and to coordinate them within coherent knowledge workflows. We propose knowledge graphs as the reference technology for the enterprise AI context, i.e., the complex of entities, properties and relationships that shape a business domain and constitute a common backbone for all AI-driven applications. We contribute and discuss principles to design software architectures for AI-driven applications based on knowledge graphs. We focus on the Vadalog system, a successful knowledge graph middleware from the University of Oxford and show knowledge graphs in action in a number of use cases from the financial domain.

【Keywords】: knowledge graph; enterprise ai; datalog; finance; knowledge graph management system; reasoning; ai-driven applications

5. Is There a Data Science and Engineering Brain Drain? If So, How Can We Rebalance Them?

Paper Link】 【Pages】:38-39

【Authors】: Jian Pei

【Abstract】: In the early days of computing, many breakthroughs happened in private companies, such as AT&T Bell Labs. Nowadays, a similar trend is repeating, but this time, in data science and engineering. In the last decade, many remarkable milestone innovations were developed in large private enterprises li ke Google, Microsoft, and Facebook. The tendency seems to be even sped up with the disruptive development of deep learning, possibly due to the unparalleled availability of abundant amounts of real -world data, latest computational facility and related infrastructure, and sufficient engineering workforce in industry.

【Keywords】: Industries; Data science; Blogs; Conferences; Data engineering; Companies

6. Answering Why-Questions for Subgraph Queries in Multi-attributed Graphs.

Paper Link】 【Pages】:40-51

【Authors】: Qi Song ; Mohammad Hossein Namaki ; Yinghui Wu

【Abstract】: Subgraph queries have been routinely used to search graphs e.g., social networks and knowledge bases. With little knowledge of underlying data, users often need to rewrite queries multiple times to reach desirable answers. Why-questions are studied to explain missing (as “Why-not” questions) or unexpected answers (as “Why” questions). This paper makes a first step to answer why-questions for subgraph queries in attributed graphs. (1) We approach query rewriting and construct query rewrites, which modify original subgraph queries to identify desired entities that are specified by Why questions. We introduce measures that characterize good query rewrites by incorporating both query editing cost and answer closeness. (2) While computing optimal query rewrite is intractable for Why-questions, we develop feasible algorithms, from exact algorithms to heuristics, and provide query rewrites with (near) optimality guarantees whenever possible, for both Why and Why-not questions. These algorithms dynamically select “picky” operators that ensure to change (estimated) answers closer to desired ones, and incur cost determined by the size of query results and questions only. We also show that these algorithms readily extend to other Why-questions such as Why-empty and Why-so-many. Using real-world graphs, we experimentally verify that our algorithms are effective and feasible for large graphs. Our case study also verifies their application in e.g., knowledge exploration.

【Keywords】: Graph-Queries; Why-question; Query-Rewriting

7. Enumerating k-Vertex Connected Components in Large Graphs.

Paper Link】 【Pages】:52-63

【Authors】: Dong Wen ; Lu Qin ; Ying Zhang ; Lijun Chang ; Ling Chen

【Abstract】: In social network analysis, structural cohesion (or vertex connectivity) is a fundamental metric in measuring the cohesion of social groups. Given an undirected graph, a k-vertex connected component (k-VCC) is a maximal connected subgraph whose structural cohesion is at least k. A k-VCC has many outstanding structural properties, such as high cohesiveness, high robustness, and subgraph overlapping. In this paper, given a graph G and an integer k, we study the problem of computing all k-VCCs in G. The general idea for this problem is to recursively partition the graph into overlapped subgraphs. We prove the upper bound of the number of partitions, which implies the polynomial running time algorithm for the k-VCC enumeration. However, the basic solution is costly in computing the vertex cut. To improve the algorithmic efficiency, we observe that the key is reducing the number of local connectivity testings. We propose two effective optimization strategies, namely neighbor sweep and group sweep, to significantly reduce the number of local connectivity testings. We conduct extensive performance studies using ten large real datasets to demonstrate the efficiency of our proposed algorithms. The experimental results demonstrate that our approach can achieve a speedup of up to two orders of magnitude compared to the state-of-the-art algorithm.

【Keywords】: social network; cohesive subgraph; vertex connectivity

8. Index-Based Optimal Algorithm for Computing K-Cores in Large Uncertain Graphs.

Paper Link】 【Pages】:64-75

【Authors】: Bohua Yang ; Dong Wen ; Lu Qin ; Ying Zhang ; Lijun Chang ; Rong-Hua Li

【Abstract】: Uncertainty in graph data occurs for a variety of reasons, such as noise and measurement errors. Recently, uncertain graph management and analysis have attracted many research attentions. Among them, computing k-cores in uncertain graphs (aka, (k, η)-cores) is an important problem and has emerged in many applications, for example, community detection, protein-protein interaction network analysis and influence maximization. Given an uncertain graph, the (k, η)-cores can be derived by iteratively removing the vertex with an η-degree of less than k and updating the η-degrees of its neighbors. However, the results heavily depend on the two input parameters k and η, and the settings for these parameters are unique to the specific graph structure and the user's subjective requirements. Additionally, computing and updating the η-degree for each vertex is the most costly component of the algorithm, and that cost is high. To overcome these drawbacks, we have developed an index-based solution for computing (k, η)-cores in this paper. The size of the index is well bounded by O(m), where m is the number of edges in the graph. Based on this index, queries for any k and η can be answered in optimal time. Further, the method is accompanied by several different optimizations to speed up construction of the index. We conduct extensive experiments on eight real-world datasets to practically evaluate the performance of all the proposed algorithms. The results demonstrate that this index-based approach is several orders of magnitude faster at processing queries than the traditional online approaches.?

【Keywords】: cohesive subgraph; uncertain graph; social network

9. Computing a Near-Maximum Independent Set in Dynamic Graphs.

Paper Link】 【Pages】:76-87

【Authors】: Weiguo Zheng ; Chengzhi Piao ; Hong Cheng ; Jeffrey Xu Yu

【Abstract】: As a fundamental NP-hard problem in graph theory, the maximum independent set (MIS) has attracted a lot of efforts to improve the time efficiency. However, most graphs in real scenarios are usually changing over time. But the previous studies take the stationary graphs as input, the computation of MIS in dynamic graphs receives little attention. Since computing the exact MIS is intractable, we compute the high-quality (large-size) independent set for dynamic graphs in this paper, where 4 graph updating operations are allowed: adding or deleting a vertex or an edge. Based on two state-of-the-art reduction rules that are designed for static graphs, we propose a novel scheme, i.e., dependency graph based independent set computation, which can support computing the high-quality independent set on the basis of the previous result rather than calculating from scratch. Moreover, a dynamic searching strategy is devised to improve time efficiency. In order to make it more useful in practical applications, we devise an effective yet efficient method to deal with the batch update. To confirm the effectiveness and efficiency of the proposed methods, we conduct extensive experiments over both real and synthetic datasets.

【Keywords】: Independent set; Dynamic graphs; Batch update

Paper Link】 【Pages】:88-99

【Authors】: Lu Chen ; Chengfei Liu ; Kewen Liao ; Jianxin Li ; Rui Zhou

【Abstract】: Community search on attributed networks has recently attracted great deal of research interest. However, most of existing works require query users to specify some community structure parameters. This may not be always practical as sometimes a user does not have the knowledge and experience to decide the suitable parameters. In this paper, we propose a novel parameter-free contextual community model for attributed community search. The proposed model only requires a query context, i.e., a set of keywords describing the desired matching community context, while the community returned is both structure and attribute cohesive w.r.t. the provided query context. We theoretically show that both our exact and approximate contextual community search algorithms can be executed in worst case polynomial time. The exact algorithm is based on an elegant parametric maximum flow technique and the approximation algorithm that significantly improves the search efficiency is analyzed to have an approximation factor of 1/3. In the experiment, we use six real networks with ground-truth communities to evaluate the effectiveness of our contextual community model. Experimental results demonstrate that the proposed model can find near ground-truth communities. We also test both our exact and approximate algorithms using eight large real networks to demonstrate the high efficiency of the proposed algorithms.

【Keywords】: Community search; Social network

11. Rima: An RDMA-Accelerated Model-Parallelized Solution to Large-Scale Matrix Factorization.

Paper Link】 【Pages】:100-111

【Authors】: Jinkun Geng ; Dan Li ; Shuai Wang

【Abstract】: Matrix factorization (MF) is a fundamental technique in machine learning and data mining, which gains wide application in many fields. When the matrix becomes large, MF cannot be processed on a single machine. Considering this, many distributed SGD algorithms (e.g. DSGD) have been developed to solve large-scale MF on multiple machines in a model-parallel way. Existing distributed algorithms are primarily implemented under Map/Reduce or PS (parameter server)-based architectures, which incur significant communication overheads. Besides, existing solutions cannot well embrace the benefit of RDMA/RoCE transport and suffer from scalability problems. Targeting at these drawbacks, we propose Rima, which uses ring-based model parallelism to solve large-scale MF with higher communication efficiency. Compared with PS-based SGD algorithms, Rima also consumes less queue pairs (QPs) and can thus better leverage the power of RDMA/RoCE to accelerate the training speed. Our experiment shows that, compared with PS-based DSGD when solving 1M × 1M MF, Rima achieves comparable convergence performance after equal number of iterations, but reduces the training time by 68.7% and 85.4% via TCP and RDMA respectively.

【Keywords】: matrix factorization, SGD, RDMA, communication efficiency

12. Accelerating Partial Evaluation in Distributed SPARQL Query Evaluation.

Paper Link】 【Pages】:112-123

【Authors】: Peng Peng ; Lei Zou ; Runyu Guan

【Abstract】: Partial evaluation has recently been used for processing SPARQL queries over a large resource description framework (RDF) graph in a distributed environment. However, the previous approach is inefficient when dealing with complex queries. In this study, we further improve the "partial evaluation and assembly" framework for answering SPARQL queries over a distributed RDF graph, while providing performance guarantees. Our key idea is to explore the intrinsic structural characteristics of partial matches to filter out irrelevant partial results while providing performance guarantees on the data shipment and the response time. We also propose an efficient assembly algorithm to utilize the characteristics of partial matches to merge them and form final results. To improve the efficiency of finding partial matches further, we propose an optimization that communicates variables' candidates among sites to avoid redundant computations. In addition, although our approach is partitioning-tolerant, different partitioning strategies result in different performances, and we evaluate different partitioning strategies for our approach. Experiments over both real and synthetic RDF datasets confirm the superiority of our approach.

【Keywords】: Partial Evaluation; SPARQL Query Processing; Distributed RDF Systems

13. Blockplane: A Global-Scale Byzantizing Middleware.

Paper Link】 【Pages】:124-135

【Authors】: Faisal Nawab ; Mohammad Sadoghi

【Abstract】: The byzantine fault-tolerance model captures a wide-range of failures-common in real-world scenarios-such as ones due to malicious attacks and arbitrary software/hardware errors. We propose Blockplane, a middleware that enables making existing benign systems tolerate byzantine failures. This is done by making the existing system use Blockplane for durability and as a communication infrastructure. Blockplane proposes the following: (1) A middleware and communication infrastructure to make an entire benign protocol byzantine fault-tolerant, (2)A hierarchical locality-aware design to minimize the number of wide-area messages, (3) A separation of fault-tolerance concerns to enable designs with higher performance.

【Keywords】: blockplane; byzantine agreement; middleware; global scale

14. BENU: Distributed Subgraph Enumeration with Backtracking-Based Framework.

Paper Link】 【Pages】:136-147

【Authors】: Zhaokang Wang ; Rong Gu ; Weiwei Hu ; Chunfeng Yuan ; Yihua Huang

【Abstract】: Given a small pattern graph and a large data graph, the task of subgraph enumeration is to find all the subgraphs of the data graph that are isomorphic to the pattern graph. The state-of-the-art distributed algorithms like SEED and CBF turn subgraph enumeration into a distributed multi-way join problem. They are inefficient in communication as they have to shuffle partial matching results that are much larger than the data graph itself during the join. They also spend non-trivial costs on constructing indexes for data graphs. Different from those join-based algorithms, we develop a new backtracking-based framework BENU for distributed subgraph enumeration. BENU divides a subgraph enumeration task into a group of local search tasks that can be executed in parallel. Each local search task follows a backtracking-based execution plan to enumerate subgraphs. The data graph is stored in a distributed database and is queried as needed. BENU only queries the necessary edges of the data graph and avoids shuffling partial matching results. We also develop an efficient implementation for BENU. We set up an in-memory database cache on each machine. Taking advantage of the inter-task and intra-task locality, the cache significantly reduces the communication cost with controllable memory usage. We conduct extensive experiments to evaluate the performance of BENU. The results show that BENU is scalable and outperforms the state-of-the-art methods by up to an order of magnitude.

【Keywords】: subgraph isomorphism; subgraph matching; task parallel; backtracking

15. Efficient Synchronization of State-Based CRDTs.

Paper Link】 【Pages】:148-159

【Authors】: Vitor Enes ; Paulo Sérgio Almeida ; Carlos Baquero ; João Leitão

【Abstract】: To ensure high availability in large scale distributed systems, Conflict-free Replicated Data Types (CRDTs) relax consistency by allowing immediate query and update operations at the local replica, with no need for remote synchronization. State-based CRDTs synchronize replicas by periodically sending their full state to other replicas, which can become extremely costly as the CRDT state grows. Delta-based CRDTs address this problem by producing small incremental states (deltas) to be used in synchronization instead of the full state. However, current synchronization algorithms for delta-based CRDTs induce redundant wasteful delta propagation, performing worse than expected, and surprisingly, no better than state-based. In this paper we: 1) identify two sources of inefficiency in current synchronization algorithms for delta-based CRDTs; 2) bring the concept of join decomposition to state-based CRDTs; 3) exploit join decompositions to obtain optimal deltas and 4) improve the efficiency of synchronization algorithms; and finally, 5) experimentally evaluate the improved algorithms.

【Keywords】: CRDTs; Optimal Deltas; Join Decomposition

16. Learning Individual Models for Imputation.

Paper Link】 【Pages】:160-171

【Authors】: Aoqian Zhang ; Shaoxu Song ; Yu Sun ; Jianmin Wang

【Abstract】: Missing numerical values are prevalent, e.g., owing to unreliable sensor reading, collection and transmission among heterogeneous sources. Unlike categorized data imputation over a limited domain, the numerical values suffer from two issues: (1) sparsity problem, the incomplete tuple may not have sufficient complete neighbors sharing the same/similar values for imputation, owing to the (almost) infinite domain; (2) heterogeneity problem, different tuples may not fit the same (regression) model. In this study, enlightened by the conditional dependencies that hold conditionally over certain tuples rather than the whole relation, we propose to learn a regression model individually for each complete tuple together with its neighbors. Our IIM, Imputation via Individual Models, thus no longer relies on sharing similar values among the k complete neighbors for imputation, but utilizes their regression results by the aforesaid learned individual (not necessary the same) models. Remarkably, we show that some existing methods are indeed special cases of our IIM, under the extreme settings of the number ℓ of learning neighbors considered in individual learning. In this sense, a proper number ℓ of neighbors is essential to learn the individual models (avoid over-fitting or under-fitting). We propose to adaptively learn individual models over various number ℓ of neighbors for different complete tuples. By devising efficient incremental computation, the time complexity of learning a model reduces from linear to constant. Experiments on real data demonstrate that our IIM with adaptive learning achieves higher imputation accuracy than the existing approaches.

【Keywords】: Missing values; Data imputation

17. CurrentClean: Spatio-Temporal Cleaning of Stale Data.

Paper Link】 【Pages】:172-183

【Authors】: Mostafa Milani ; Zheng Zheng ; Fei Chiang

【Abstract】: Data currency is imperative towards achieving up-to-date and accurate data analysis. Data is considered current if changes in real world entities are reflected in the database. When this does not occur, stale data arises. Identifying and repairing stale data goes beyond simply having timestamps. Individual entities each have their own update patterns in both space and time. These update patterns can be learned and predicted given available query logs. In this paper, we present CurrentClean, a probabilistic system for identifying and cleaning stale values. We introduce a spatio-temporal probabilistic model that captures the database update patterns to infer stale values, and propose a set of inference rules that model spatio-temporal update patterns commonly seen in real data. We recommend repairs to clean stale values by learning from past update values over cells. Our evaluation shows CurrentClean's effectiveness to identify stale values over real data, and achieves improved error detection and repair accuracy over state-of-the-art techniques.

【Keywords】: data currency, stale data, spatio temporal data cleaning

18. Fine-Grained Provenance for Matching & ETL.

Paper Link】 【Pages】:184-195

【Authors】: Nan Zheng ; Abdussalam Alawini ; Zachary G. Ives

【Abstract】: Data provenance tools capture the steps used to produce analyses. However, scientists must choose among workflow provenance systems, which allow arbitrary code but only track provenance at the granularity of files; provenance APIs, which provide tuple-level provenance, but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization, but support a limited subset of data science tasks. None of these solutions are well suited for tracing errors introduced during common ETL, record alignment, and matching tasks - for data types such as strings, images, etc. Scientists need new capabilities to identify the sources of errors, find why different code versions produce different results, and identify which parameter values affect output. We propose PROVision, a provenance-driven troubleshooting tool that supports ETL and matching computations and traces extraction of content within data objects. PROVision extends database-style provenance techniques to capture equivalences, support optimizations, and enable selective evaluation. We formalize our extensions, implement them in the PROVision system, and validate their effectiveness and scalability for common ETL and matching tasks.

【Keywords】: Provenance; Workflow; ETL; Record linking

19. Unsupervised String Transformation Learning for Entity Consolidation.

Paper Link】 【Pages】:196-207

【Authors】: Dong Deng ; Wenbo Tao ; Ziawasch Abedjan ; Ahmed K. Elmagarmid ; Ihab F. Ilyas ; Guoliang Li ; Samuel Madden ; Mourad Ouzzani ; Michael Stonebraker ; Nan Tang

【Abstract】: Data integration has been a long-standing challenge in data management with many applications. A key step in data integration is entity consolidation. It takes a collection of clusters of duplicate records as input and produces a single "golden record" for each cluster, which contains the canonical value for each attribute. Truth discovery and data fusion methods as well as Master Data Management (MDM) systems can be used for entity consolidation. However, to achieve better results, the variant values (i.e., values that are logically the same with different formats) in the clusters need to be consolidated before applying these methods. For this purpose, we propose a data-driven method to standardize the variant values based on two observations: (1) the variant values usually can be transformed to the same representation (e.g., "Mary Lee" and "Lee, Mary") and (2) the same transformation often appears repeatedly across different clusters (e.g., transpose the first and last name). Our approach first uses an unsupervised method to generate groups of value pairs that can be transformed in the same way. Then the groups are presented to a human for verification and the approved ones are used to standardize the data. In a real-world dataset with 17,497 records, our method achieved 75% recall and 99.5% precision in standardizing variant values by asking a human 100 yes/no questions, which completely outperformed a state of the art data wrangling tool.

【Keywords】: Data Integration; Entity Consolidation; Program Synthesis; String Transformation; Data Editing

20. A Semi-Supervised Framework of Clustering Selection for De-Duplication.

Paper Link】 【Pages】:208-219

【Authors】: Shrinu Kushagra ; Hemant Saxena ; Ihab F. Ilyas ; Shai Ben-David

【Abstract】: We view data de-duplication as a clustering problem. Recently, [1] introduced a framework called restricted correlation clustering (RCC) to model de-duplication problems. Given a set X, an unknown target clustering C* of X and a class F of clusterings of X, the goal is to find a clustering C from the set F which minimizes the correlation loss. The clustering algorithm is allowed to interact with a domain expert by asking whether a pair of records correspond to the same entity or not. Main drawback of the algorithm developed by [1] is that the pre-processing step had a time complexity of theta (|X|2) (where X is the input set). In this paper, we make the following contributions. We develop a sampling procedure (based on locality sensitive hashing) which requires a linear pre-processing time O(|X|). We prove that our sampling procedure can estimate the correlation loss of all clusterings in F using only a small number of labelled examples. In fact, the number of labelled examples is independent of |X| and depends only on the complexity of the class F. Further we show that to sample one pair, with high probability our procedure makes a constant number of queries to the domain expert. We then perform an extensive empirical evaluation of our approach which shows the efficiency of our method.

【Keywords】: De-duplication; clustering,; semi-supervised; locality-sensitive-hashing; correlation-clustering

21. Scaling Up Subgraph Query Processing with Efficient Subgraph Matching.

Paper Link】 【Pages】:220-231

【Authors】: Shixuan Sun ; Qiong Luo

【Abstract】: A subgraph query finds all data graphs in a graph database each of which contains the given query graph. Existing work takes the indexing-filtering-verification (IFV) approach to first index all data graphs, then filter out some of them based on the index, and finally test subgraph isomorphism on each of the remaining data graphs. This final test of subgraph isomorphism is a sub-problem of subgraph matching, which finds all subgraph isomorphisms from a query graph to a data graph. As such, in this paper, we study whether, and if so, how to utilize efficient subgraph matching to improve subgraph query processing. Specifically, we modify leading subgraph matching algorithms and integrate them with top-performing subgraph querying algorithms. Our results show that (1) the slow verification method in existing IFV algorithms can lead us to over-estimate the gain of filtering; and (2) our modified subgraph querying algorithms with efficient subgraph matching are competitive in time performance and can scale to hundreds of thousands of data graphs and graphs of thousands of vertices.

【Keywords】: subgraph isomorphism; subgraph query processing; subgraph matching

22. Efficient Parallel Subgraph Enumeration on a Single Machine.

Paper Link】 【Pages】:232-243

【Authors】: Shixuan Sun ; Yulin Che ; Lipeng Wang ; Qiong Luo

【Abstract】: Subgraph enumeration finds all subgraphs in an unlabeled graph that are isomorphic to another unlabeled graph. Existing depth-first search (DFS) based algorithms work on a single machine, but they are slow on large graphs due to the large search space. In contrast, distributed algorithms on clusters adopt a parallel breadth-first search (BFS) and improve the performance at the cost of large amounts of hardware resources, since the BFS approach incurs expensive data transfer and space cost due to the exponential number of intermediate results. In this paper, we develop an efficient parallel subgraph enumeration algorithm for a single machine, named LIGHT. Our algorithm reduces redundant computation in DFS by deferring the materialization of pattern vertices until necessary and converting the candidate set computation into finding a minimum set cover. Moreover, we parallelize our algorithm with both SIMD (Single-Instruction-Multiple-Data) instructions and SMT (Simultaneous Multi-Threading) technologies in modern CPUs. Our experimental results show that LIGHT running on a single machine outperforms existing single-machine DFS algorithms by more than three orders of magnitude, and is up to two orders of magnitude faster than the state-of-the-art distributed algorithms running on 12 machines. Additionally, LIGHT completed all test cases, whereas the existing algorithms fail in some cases due to either running out of time or running out of available hardware resources.

【Keywords】: subgraph isomorphism; subgraph enumeration; parallel processing; SIMD and SMT

23. Fast Dual Simulation Processing of Graph Database Queries.

Paper Link】 【Pages】:244-255

【Authors】: Stephan Mennicke ; Jan-Christoph Kalo ; Denis Nagel ; Hermann Kroll ; Wolf-Tilo Balke

【Abstract】: Graph database query languages feature expressive yet computationally expensive pattern matching capabilities. Answering optional query clauses in SPARQL for instance renders the query evaluation problem immediately PSPACE-complete. Light-weight graph pattern matching relations, such as simulation, have recently been investigated as promising alternatives to more expensive query mechanisms like, e.g., computing subgraph isomorphism. Still, pattern matching alone lacks expressive query capabilities: graph patterns may be combined by usual inner joins. However, including more sophisticated operators is inevitable to make solutions more useful for emerging applications. In this paper we bridge this gap by introducing a new dual simulation process operating on SPARQL queries. In addition to supporting the full syntactic structure of SPARQL queries, it features polynomial-time pattern matching to compute an overapproximation of the query results. Moreover, to achieve running times competing with state-of-the-art database systems, we develop a novel algorithmic solution to dual simulation graph pattern matching, based on a system of inequalities that allows for several optimization heuristics. Finally, we achieve soundness of our process for SPARQL queries including UNION, AND and OPTIONAL operators not restricted to well-designed patterns. Our experiments on synthetic and real-world graph data promise a clear gain for graph database systems when incorporating the new dual simulation techniques.

【Keywords】: graph databases; query processing; graph simulation; SPARQL

24. Efficient and Incremental Clustering Algorithms on Star-Schema Heterogeneous Graphs.

Paper Link】 【Pages】:256-267

【Authors】: Lu Chen ; Yunjun Gao ; Yuanliang Zhang ; Christian S. Jensen ; Bolong Zheng

【Abstract】: Many datasets including social media data and bibliographic data can be modeled as graphs. Clustering such graphs is able to provide useful insights into the structure of the data. To improve the quality of clustering, node attributes can be taken into account, resulting in attributed graphs. Existing attributed graph clustering methods generally consider attribute similarity and structural similarity separately. In this paper, we represent attributed graphs as star-schema heterogeneous graphs, where attributes are modeled as different types of graph nodes. This enables the use of personalized pagerank (PPR) as a unified distance measure that captures both structural and attribute similarity. We employ DBSCAN for clustering, and we update edge weights iteratively to balance the importance of different attributes. To improve the efficiency of the clustering, we develop two incremental approaches that aim to enable efficient PPR score computation when edge weights are updated. To boost the effectiveness of the clustering, we propose a simple yet effective edge weight update strategy based on entropy. In addition, we present a game theory based method that enables trading efficiency for result quality. Extensive experiments on real-life datasets offer insight into the effectiveness and efficiency of our proposals, compared with existing methods.

【Keywords】: Graph mining; Heterogeneous graph; Graph clustering; Algorithm

25. G*-Tree: An Efficient Spatial Index on Road Networks.

Paper Link】 【Pages】:268-279

【Authors】: Zijian Li ; Lei Chen ; Yue Wang

【Abstract】: In this paper, we propose an efficient hierarchical index, G-tree, to optimize spatial queries on road networks. Most existing graph indexes can only support one kind of query, and thus we need to build multiple indexes on a road network to handle various kinds of spatial queries, which is inefficient and unscalable for real-world applications. To address the problem, a recent study proposes G-tree to support multiple types of spatial queries on road networks within one framework. However, the assembly-based method on G-tree is not efficient enough to handle spatial queries when vertices, which are close in a road network, are distant in G-tree. To address the inefficiency problem of G-tree, in this paper, we propose a novel index structure on road networks, namely G-tree, whose key idea is to build shortcuts between selected leaf nodes. Based on G-tree, we propose three shortcut-based algorithms to answer distance queries, k-nearest neighbor queries and range queries, respectively, which are more efficient than the existing assembly-based algorithms on G-tree. Moreover, we propose a shortcut selection algorithm to optimize the performance of spatial queries on G-tree. We conduct extensive experiments to compare our G-tree and the state-of-the-art indexing methods on various large-scale road networks, where the results demonstrate that our G-tree has better efficiency and scalability than the competitors to handle spatial queries.

【Keywords】: G*-Tree; Spatial Query; Graph Index; Hierarchical Index

26. DBSVEC: Density-Based Clustering Using Support Vector Expansion.

Paper Link】 【Pages】:280-291

【Authors】: Zhen Wang ; Rui Zhang ; Jianzhong Qi ; Bo Yuan

【Abstract】: DBSCAN is a popular clustering algorithm that can discover clusters of arbitrary shapes with broad applications. However, DBSCAN is computationally expensive, as it performs range queries for all the points to determine their neighbors and grow the clusters. To address this problem, we propose a novel approximate density-based clustering algorithm named DBSVEC. DBSVEC introduces support vectors into density-based clustering, which allows performing range queries only on a small subset of points called the core support vectors. This technique significantly improves the efficiency while retaining high-quality cluster results. We evaluate the performance of DBSVEC via extensive experiments on real and synthetic datasets. The results show that DBSVEC is up to three orders of magnitude faster than DBSCAN. Compared with the state-of-the-art approximate density-based clustering methods, DBSVEC is up to two orders of magnitude faster, and the clustering results of DBSVEC are more similar to those of DBSCAN.

【Keywords】: density-based clustering, support vector expansion, scalable clustering

27. A Joint Context-Aware Embedding for Trip Recommendations.

Paper Link】 【Pages】:292-303

【Authors】: Jiayuan He ; Jianzhong Qi ; Kotagiri Ramamohanarao

【Abstract】: Trip recommendation is an important location-based service that helps relieve users from the time and efforts for trip planning. It aims to recommend a sequence of places of interest (POIs) for a user to visit that maximizes the user's satisfaction. When adding a POI to a recommended trip, it is essential to understand the context of the recommendation, including the POI popularity, other POIs co-occurring in the trip, and the preferences of the user. These contextual factors are learned separately in existing studies, while in reality, they jointly impact on a user's choice of POI visits. In this study, we propose a POI embedding model to jointly learn the impact of these contextual factors. We call the learned POI embedding a context-aware POI embedding. To showcase the effectiveness of this embedding, we apply it to generate trip recommendations given a user and a time budget. We propose two trip recommendation algorithms based on our context-aware POI embedding. The first algorithm finds the exact optimal trip by transforming and solving the trip recommendation problem as an integer linear programming problem. To achieve a high computation efficiency, the second algorithm finds a heuristically optimal trip based on adaptive large neighborhood search. We perform extensive experiments on real datasets. The results show that our proposed algorithms consistently outperform state-of-the-art algorithms in trip recommendation quality, with an advantage of up to 43% in F_1-score.

【Keywords】: Recommendation system; Data mining

28. AIR: Attentional Intention-Aware Recommender Systems.

Paper Link】 【Pages】:304-315

【Authors】: Tong Chen ; Hongzhi Yin ; Hongxu Chen ; Rui Yan ; Quoc Viet Hung Nguyen ; Xue Li

【Abstract】: The capability of extracting sequential patterns from the user-item interaction data is now becoming a key feature of recommender systems. Though it is important to capture the sequential effect, existing methods only focus on modelling the sparse item-wise sequential effect in user preference and only consider the homogeneous user interaction behaviors (i.e., a single type of user behavior). As a result, the data sparsity issue inevitably arises and makes the learned sequential patterns fragile and unreliable, impeding the sequential recommendation performance of existing methods. Hence, in this paper, we propose AIR, namely attentional intention-aware recommender systems to predict category-wise future user intention and collectively exploit the rich heterogeneous user interaction behaviors (i.e., multiple types of user behaviors). In AIR, we propose to represent user intention as an action-category tuple to discover category-wise sequential patterns and to capture varied effect of different types of actions for recommendation. A novel attentional recurrent neural network (ARNN) is proposed to model the intention migration effect and infer users' future intention. Besides, an intention-aware factorization machine (ITFM) is developed to perform intention-aware sequential recommendation. Experiments on two real-life datasets demonstrate the superiority and practicality of AIR in sequential top-k recommendation tasks.

【Keywords】: sequential recommender systems; deep neural networks; intention aware recommendation

29. No, That's Not My Feedback: TV Show Recommendation Using Watchable Interval.

Paper Link】 【Pages】:316-327

【Authors】: Kyung-Jae Cho ; Yeon-Chang Lee ; Kyungsik Han ; Jaeho Choi ; Sang-Wook Kim

【Abstract】: As the number of TV channels increases, it is becoming important to recommend TV shows that users prefer to watch. To this end, we investigate the inherent characteristics of implicit feedback given in the TV show domain, and identify the challenges for building an effective TV show recommendation. Based on the unique characteristics, we define a user's watchable interval, the most important and novel concept in understanding users' true preferences. In order to reflect this new concept into the TV show recommendation, we propose a novel framework based on collaborative filtering. Our framework is composed of (1) preference estimation based on a user's watchable interval, (2) preference prediction based on confidence exploiting watchable episodes, and (3) top-N recommendation considering TV show's staying and remaining times. Using a real-world TV show dataset, we demonstrate that our framework effectively solves the challenges and significantly outperforms other existing state-of-the-art methods.

【Keywords】: TV show recommendation; Implicit feedback; Watchable interval; Watchable episode; Recommender systems

30. Adaptive Wavelet Clustering for Highly Noisy Data.

Paper Link】 【Pages】:328-337

【Authors】: Zengjian Chen ; Jiayi Liu ; Yihe Deng ; Kun He ; John E. Hopcroft

【Abstract】: In this paper we make progress on the unsupervised task of mining arbitrarily shaped clusters in highly noisy datasets, which is a task present in many real-world applications. Based on the fundamental work that first applies a wavelet transform to data clustering, we propose an adaptive clustering algorithm, denoted as AdaWave, which exhibits favorable characteristics for clustering. By a self-adaptive thresholding technique, AdaWave is parameter free and can handle data in various situations. It is deterministic, fast in linear time, order-insensitive, shape-insensitive, robust to highly noisy data, and requires no pre-knowledge on data models. Moreover, AdaWave inherits the ability from the wavelet transform to cluster data in different resolutions. We adopt the "grid labeling" data structure to drastically reduce the memory consumption of the wavelet transform so that AdaWave can be used for relatively high dimensional data. Experiments on synthetic as well as natural datasets demonstrate the effectiveness and efficiency of our proposed method.

【Keywords】: Clustering, high noise data, wavelet transform, shape-insensitive

Paper Link】 【Pages】:338-349

【Authors】: Yueji Yang ; Divyakant Agrawal ; H. V. Jagadish ; Anthony K. H. Tung ; Shuang Wu

【Abstract】: Keyword search has recently become popular as a way to query relational databases, and even graphs, since it allows users to issue queries without learning a complex query language and data schema. Evaluating a keyword query is usually significantly more expensive than evaluating an equivalent selection query, since the query specification is less complete, and many alternative answers have to be considered by the system, requiring considerable effort to generate and compare. Current interest in big data and AI are putting even more demands on the efficiency of keyword search. In particular, searching of knowledge graphs is gaining popularity. As knowledge graphs often comprise many millions of nodes and edges, performing real-time search on graphs of this size is an open challenge. In this paper, we attempt to address this need by leveraging advances in hardware technologies, e.g. multi-core CPUs and GPUs. Specifically, we implement a parallel keyword search engine for Knowledge Bases (KB). To be able to do so, and to exploit parallelism, we devise a new approach to keyword search, based on a concept we introduce called Central Graph. Unlike the Group Steiner Tree (GST) model, widely used for keyword search, our approach can naturally work in parallel and still return compact answer graphs with rich information. Our approach can work in either multi-core CPUs or a single GPU. In particular, our GPU implementation is two to three orders of magnitudes faster than state-of-the-art keyword search method. We conduct extensive experiments to show that our approach is both efficient and effective.

【Keywords】: keyword search; knowledge base; Parallel search; modern hardwares

32. Towards Longitudinal Analytics on Social Media Data.

Paper Link】 【Pages】:350-361

【Authors】: Fan Xia ; Bin Yang ; Chengcheng Yu ; Weining Qian ; Aoying Zhou

【Abstract】: We are witnessing increasing interests in longitudinal analytics on social media data. Longitudinal analytics takes into account an interval and considers the temporal popularity of social media data in the interval, rather than only considering recently generated social media data in real-time search. We study a fundamental functionality in longitudinal analytics-the top-k temporal keyword (TkTK) querying. A TkTK query takes as input a set of query keywords and an interval, and returns the top-k most significant social items, e.g., tweets, where the significance of a social item is defined based on a combination of the textual relevance and temporal popularity. We model social media data as a forest of linkage trees along the time dimension, which well models the propagation processes, e.g., replies and forwards, among different social items. Based on the forest, we model the temporal popularity of a social item across time as a popularity time series. We design two indexing structures that index social items' popularity time series and textual content in a holistic manner-the temporal popularity inverted index (TPII) and the log-structured merge octree (LSMO). Empirical studies with two substantial social media data sets offer insight into the design properties of the indexes and confirm that LSMO enables both efficient query processing and indexing structure updates.

【Keywords】: time series; social media data; temporal keyword query

33. LCJoin: Set Containment Join via List Crosscutting.

Paper Link】 【Pages】:362-373

【Authors】: Dong Deng ; Chengcheng Yang ; Shuo Shang ; Fan Zhu ; Li Liu ; Ling Shao

【Abstract】: A set containment join operates on two set-valued attributes with a subset (⊆) relationship as the join condition. It has many real-world applications, such as in publish/subscribe services and inclusion dependency discovery. Existing solutions can be broadly classified into union-oriented and intersection-oriented methods. Based on several recent studies, union-oriented methods are not competitive as they involve an expensive subset enumeration step. Intersection-oriented methods build an inverted index on one attribute and perform inverted list intersection on another attribute. Existing intersection-oriented methods intersect inverted lists one-by-one. In contrast, in this paper, we propose to intersect all the inverted lists simultaneously while skipping many irrelevant entries in the lists. To share computation, we utilize the prefix tree structure and extend our novel list intersection method to operate on the prefix tree. To further improve the efficiency, we propose to partition the data and use different methods to process each partition. We evaluated our methods using both real-world and synthetic datasets. Experimental results show that our approach outperforms existing methods by up to 10×.

【Keywords】: Set Containment; Similarity Join; Approximate Query Processing; Data Integration

34. Bridging the Semantic Gap with SQL Query Logs in Natural Language Interfaces to Databases.

Paper Link】 【Pages】:374-385

【Authors】: Christopher Baik ; H. V. Jagadish ; Yunyao Li

【Abstract】: A critical challenge in constructing a natural language interface to database (NLIDB) is bridging the semantic gap between a natural language query (NLQ) and the underlying data. Two specific ways this challenge exhibits itself is through keyword mapping and join path inference. Keyword mapping is the task of mapping individual keywords in the original NLQ to database elements (such as relations, attributes or values). It is challenging due to the ambiguity in mapping the user's mental model and diction to the schema definition and contents of the underlying database. Join path inference is the process of selecting the relations and join conditions in the FROM clause of the final SQL query, and is difficult because NLIDB users lack the knowledge of the database schema or SQL and therefore cannot explicitly specify the intermediate tables and joins needed to construct a final SQL query. In this paper, we propose leveraging information from the SQL query log of a database to enhance the performance of existing NLIDBs with respect to these challenges. We present a system Templar that can be used to augment existing NLIDBs. Our extensive experimental evaluation demonstrates the effectiveness of our approach, leading up to 138% improvement in top-1 accuracy in existing NLIDBs by leveraging SQL query log information.

【Keywords】: natural language interface to database; query logs; nlq to sql

35. MF-Join: Efficient Fuzzy String Similarity Join with Multi-level Filtering.

Paper Link】 【Pages】:386-397

【Authors】: Jin Wang ; Chunbin Lin ; Carlo Zaniolo

【Abstract】: As an essential operation in data integration and data cleaning, similarity join has attracted considerable attention from the database community. In many application scenarios, it is essential to support fuzzy matching, which allows approximate matching between elements that improves the effectiveness of string similarity join. To describe the fuzzy matching between strings, we consider two levels of similarity, i.e., element-level and record-level similarity. Then the problem of calculating fuzzy matching similarity can be transformed into finding the weighted maximal matching in a bipartite graph. In this paper, we propose MF-Join, a multi-level filtering approach for fuzzy string similarity join. MF-Join provides a flexible framework that can support multiple similarity functions at both levels. To improve performance, we devise and implement several techniques to enhance the filter power. Specifically, we utilize a partition-based signature at the element-level and propose a frequency-aware partition strategy to improve the quality of signatures. We also devise a count filter at the record level to further prune dissimilar pairs. Moreover, we deduce an effective upper bound for the record-level similarity to reduce the computational overhead of verification. Experimental results on two popular datasets shows that our proposed method clearly outperforms state-of-the-art methods.

【Keywords】: similarity join; fuzzy matching; filter and verification

36. Finding Temporal Influential Users Over Evolving Social Networks.

Paper Link】 【Pages】:398-409

【Authors】: Shixun Huang ; Zhifeng Bao ; J. Shane Culpepper ; Bang Zhang

【Abstract】: Influence maximization (IM) continues to be a key research problem in social networks. The goal is to find a small seed set of target users that have the greatest influence in the network under various stochastic diffusion models. While significant progress has been made on the IM problem in recent years, several interesting challenges remain. For example, social networks in reality are constantly evolving, and "important" users with the most influence also change over time. As a result, several recent studies have proposed approaches to update the seed set as the social networks evolve. However, this seed set is not guaranteed to be the best seed set over a period of time. In this paper we study the problem of Distinct Influence Maximization (DIM) where the goal is to identify a seed set of influencers who maximize the number of distinct users influenced over a predefined window of time. Our new approach allows social network providers to make fewer incremental changes to targeted advertising while still maximizing the coverage of the advertisements. It also provides finer grained control over service level agreements where a certain number of impressions for an advertisement must be displayed in a specific time period. We propose two different strategies HCS and VCS with novel graph compression techniques to solve this problem. Additionally, VCS can also be applied directly to the traditional IM problem. Extensive experiments on real-world datasets verify the efficiency, accuracy and scalability of our solutions on both the DIM and IM problems.

【Keywords】: Influence maximization; Social network

37. Seed Selection and Social Coupon Allocation for Redemption Maximization in Online Social Networks.

Paper Link】 【Pages】:410-421

【Authors】: Tung-Chun Chang ; Yishuo Shi ; De-Nian Yang ; Wen-Tsuen Chen

【Abstract】: Online social networks have become the medium for efficient viral marketing exploiting social influence in information diffusion. However, the emerging application Social Coupon (SC) incorporating social referral into coupons cannot be efficiently solved by previous researches which do not take into account the effect of SC allocation. The number of allocated SCs restricts the number of influenced friends for each user. In the paper, we investigate not only the seed selection problem but also the effect of SC allocation for optimizing the redemption rate which represents the efficiency of SC allocation. Accordingly, we formulate a problem named Seed Selection and SC allocation for Redemption Maximization (S 3 CRM) and prove the hardness of S 3 CRM. We design an effective algorithm with a performance guarantee, called Seed Selection and Social Coupon allocation algorithm. For S 3 CRM, we introduce the notion of marginal redemption to evaluate the efficiency of investment in seeds and SCs. Moreover, for a balanced investment, we develop a new graph structure called guaranteed path, to explore the opportunity to optimize the redemption rate. Finally, we perform a comprehensive evaluation on our proposed algorithm with various baselines. The results validate our ideas and show the effectiveness of the proposed algorithm over baselines.

【Keywords】: Social Influence; Social Coupon; Social Networks; Approximation Algorithm

38. Keyword-Centric Community Search.

Paper Link】 【Pages】:422-433

【Authors】: Zhiwei Zhang ; Xin Huang ; Jianliang Xu ; Byron Choi ; Zechao Shang

【Abstract】: Community search that finds only the communities pertaining to the query input has been widely studied from simple graphs to attributed graphs. However, a significant limitation of previous studies is that they all require the input of query nodes, which makes it difficult for users to specify exact queries if they are unfamiliar with the queried graph. To address this issue, in this paper we study a novel problem of keyword-centric community search (KCCS) over attributed graphs. In contrast to prior studies, no query nodes, but only query keywords, need to be specified to discover relevant communities. Specifically, given an attributed graph G, a query Q consisting of query keywords W Q , and an integer k, KCCS serves to find the largest subgraph of k-core of G that achieves the strongest keyword closeness w.r.t. W Q . We design a new function of keyword closeness and propose efficient algorithms to solve the KCCS problem. Furthermore, a novel core-based inverted index is developed to optimize performance. Extensive experiments on large real networks demonstrate that our solutions are more than three times faster than the baseline approach, and can find cohesive communities closely related to the query keywords.

【Keywords】: keyword-centric; community search; attributed graph

39. Cohesive Group Nearest Neighbor Queries Over Road-Social Networks.

Paper Link】 【Pages】:434-445

【Authors】: Fangda Guo ; Ye Yuan ; Guoren Wang ; Lei Chen ; Xiang Lian ; Zimeng Wang

【Abstract】: The group nearest neighbor (GNN) search on a road network Gr, i.e., finding the spatial objects as activity assembly points with the smallest sum of distances to query users on Gr, has been extensively studied; however, previous works have neglected the fact that social relationships among query users, which ensure the maximally favorable atmosphere in the activity, can play an important role in GNN queries. Many real-world applications, such as location-based social networking services, require such queries. In this paper, we study a new problem: a GNN search on a road network that incorporates cohesive social relationships (CGNN). Specifically, both the query users of highest closeness and the corresponding top-j objects are retrieved. One critical challenge is to speed up the computation of CGNN queries over large social and road networks. To address this challenge, we propose a filtering-and-verification framework for efficient query processing. During filtering, we prune substantial unpromising users and objects using social and geographically spatial constraints. During verification, we obtain the object candidates, among which the top j are selected, with respect to the qualified users. Moreover, we further optimize search strategies to improve query performance. Finally, experimental results on real social and road networks significantly demonstrate the efficiency and efficacy of our solutions.

【Keywords】: Query processing; GNN query; k core; graph algorithm; road network; social network

40. Maximizing Multifaceted Network Influence.

Paper Link】 【Pages】:446-457

【Authors】: Yuchen Li ; Ju Fan ; George V. Ovchinnikov ; Panagiotis Karras

【Abstract】: An information dissemination campaign is often multifaceted, involving several facets or pieces of information disseminating from different sources. The question then arises, how should we assign such pieces to eligible sources so as to achieve the best viral dissemination results? Past research has studied the problem of Influence Maximization (IM), which is to select a set of k promoters that maximizes the expected reach of a message over a network. However, in this classical IM problem, each promoter spreads out the same unitary piece of information. In this paper, we propose the Optimal Influential Pieces Assignment (OIPA) problem, which is to assign k distinct pieces of an information campaign OIPA to k promoters, so as to achieve the highest viral adoption in a network. We express adoption by users with a logistic model, and show that approximating OIPA within any constant factor is NP-hard. Even so, we propose a branch-and-bound framework for problem with an (1-1/e) approximation ratio. We further optimize this framework with a pruning-intensive progressive upper-bound estimation approach, yielding a (1-1/e-ε) approximation ratio and significantly lower time complexity, as it relies on the power-law properties of real-world social networks to run efficiently. Our extensive experiments on several real-world datasets show that the proposed approaches consistently outperform intuitive baselines, adopted from state-of-the-art IM algorithms. Furthermore, the progressive approach demonstrates superior efficiency with an up to 24-fold speedup over the plain branch-and-bound approach.

【Keywords】: Social Network; Social Influence; Graph; Algorithm

41. GB-KMV: An Augmented KMV Sketch for Approximate Containment Similarity Search.

Paper Link】 【Pages】:458-469

【Authors】: Yang Yang ; Ying Zhang ; Wenjie Zhang ; Zengfeng Huang

【Abstract】: In this paper, we study the problem of approximate containment similarity search. Given two records Q and X, the containment similarity between Q and X with respect to Q is |Q intersect X|/ |Q|. Given a query record Q and a set of records S, the containment similarity search finds a set of records from S whose containment similarity regarding Q are not less than the given threshold. This problem has many important applications in commercial and scientific fields such as record matching and domain search. Existing solution relies on the asymmetric LSH method by transforming the containment similarity to well-studied Jaccard similarity. In this paper, we use a different framework by transforming the containment similarity to set intersection. We propose a novel augmented KMV sketch technique, namely GB-KMV, which is data-dependent and can achieve a good trade-off between the sketch size and the accuracy. We provide a set of theoretical analysis to underpin the proposed augmented KMV sketch technique, and show that it outperforms the state-of-the-art technique LSH-E in terms of estimation accuracy under practical assumption. Our comprehensive experiments on real-life datasets verify that GB-KMV is superior to LSH-E in terms of the space-accuracy trade-off, time-accuracy trade-off, and the sketch construction time. For instance, with similar estimation accuracy (F-1 score), GB-KMV is over 100 times faster than LSH-E on some real-life dataset.

【Keywords】: containment similarity; set intersection; data dependent

42. ARROW: Approximating Reachability Using Random Walks Over Web-Scale Graphs.

Paper Link】 【Pages】:470-481

【Authors】: Neha Sengupta ; Amitabha Bagchi ; Maya Ramanath ; Srikanta Bedathur

【Abstract】: Efficiently answering reachability queries on a directed graph is a fundamental problem and many solutions - theoretical and practical - have been proposed. A common strategy to make reachability query processing efficient, accurate and scalable is to precompute indexes on the graph. However this often becomes impractical, particularly when dealing with large graphs that are highly dynamic or when queries have additional constraints known only at the time of querying. In the former case, indexes become stale very quickly and keeping them up to date at the same speed as changes to the graph is untenable. For the latter setting, currently proposed indexes are often quite bulky and are highly customized to handle only a small class of constraints. In this paper, we propose a first practical attempt to address these issues by abandoning the traditional indexing approach altogether and operating directly on the graph as it evolves. Our approach, called ARROW, uses random walks to efficiently approximate reachability between vertices, building on ideas that have been prevalent in the theory community but ignored by practitioners. Not only is ARROW well suited for highly dynamic settings - as it is index-free, but it can also be easily adapted to handle many different forms of ad-hoc constraints while being competitive with custom-made index structures. In this paper, we show that ARROW, despite its simplicity, is near-accurate and scales to graphs with tens of millions of vertices and hundreds of millions of edges. We present extensive empirical evidence to illustrate these advantages.

【Keywords】: Reachability; graph mining; social networks; temporal reachability; random walks; index free; low memory; dynamic graphs; temporal graphs

43. Taster: Self-Tuning, Elastic and Online Approximate Query Processing.

Paper Link】 【Pages】:482-493

【Authors】: Matthaios Olma ; Odysseas Papapetrou ; Raja Appuswamy ; Anastasia Ailamaki

【Abstract】: Current Approximate Query Processing (AQP) engines are far from silver-bullet solutions, as they adopt several static design decisions that target specific workloads and deployment scenarios. Offline AQP engines target deployments with large storage budget, and offer substantial performance improvement for predictable workloads, but fail when new query types appear, i.e., due to shifting user interests. To the other extreme, online AQP engines assume that query workloads are unpredictable, and therefore build all samples at query time, without reusing samples (or parts of them) across queries. Clearly, both extremes miss out on different opportunities for optimizing performance and cost. In this paper, we present Taster, a self-tuning, elastic, online AQP engine that synergistically combines the benefits of online and offline AQP. Taster performs online approximation by injecting synopses (samples and sketches) into the query plan, while at the same time it strategically materializes and reuses synopses across queries, and continuously adapts them to changes in the workload and to the available storage resources. Our experimental evaluation shows that Taster adapts to shifting workload and to varying storage budgets, and always matches or significantly outperforms the state-of-the-art performing AQP approaches (online or offline).

【Keywords】: approximation; online; adaptive; tuning

44. An Iterative Scheme for Leverage-Based Approximate Aggregation.

Paper Link】 【Pages】:494-505

【Authors】: Shanshan Han ; Hongzhi Wang ; Jialin Wan ; Jianzhong Li

【Abstract】: The current data explosion poses great challenges to approximate aggregation with high efficiency and accuracy. To address this problem, we propose a novel approach to calculate the aggregation answers with a high accuracy using only a small portion of the data. We introduce leverages to reflect individual differences in the data from a statistical perspective. Two kinds of estimators, the leverage-based estimator, and the sketch estimator (a "rough picture" of the aggregation answer), are in constraint relations and iteratively improved according to the actual conditions until their difference is below a threshold. Due to the iteration mechanism and the leverages, our approach achieves a high accuracy. Moreover, some features, such as not requiring recording the sampled data and easy to extend to various execution modes, such as the online mode, make our approach well suited to deal with big data. Experiments show that our approach has an extraordinary performance, and when compared with the uniform sampling, our approach can achieve high-quality answers with only 1/3 sample size.

【Keywords】: approximate aggregation, leverage, iteration

45. Deletion Propagation for Multiple Key Preserving Conjunctive Queries: Approximations and Complexity.

Paper Link】 【Pages】:506-517

【Authors】: Zhipeng Cai ; Dongjing Miao ; Yingshu Li

【Abstract】: This paper studies the deletion propagation problem in terms of minimizing view side-effect. It is a problem funda-mental to data lineage and quality management which could be a key step in analyzing view propagation and repairing data. The investigated problem is a variant of the standard deletion propagation problem, where given a source database D, a set of key preserving conjunctive queries Q, and the set of views V obtained by the queries in Q, we try to identify a set T of tuples from D whose elimination prevents all the tuples in a given set of deletions on views △V while preserving any other results. The complexity of this problem has been well studied for the case with only a single query. Dichotomies, even trichotomies, for different settings are developed. However, no results on multiple queries are given which is a more realistic case. We study the complexity and approximations of optimizing the side-effect on the views, i.e., find T to minimize the additional damage on V after removing all the tuples of △V. We focus on the class of key-preserving conjunctive queries which is a dichotomy for the single query case. It is surprising to find that except the single query case, this problem is NP-hard to approximate within any constant even for a non-trivial set of multiple project-free conjunctive queries in terms of view side-effect. The proposed algorithm shows that it can be approximated within a bound depending on the number of tuples of both V and △V. We identify a class of polynomial tractable inputs, and provide a dynamic programming algorithm to solve the problem. Besides data lineage, study on this problem could also provide important foundations for the computational issues in data repairing. Furthermore, we introduce some related applications of this problem, especially for query feedback based data cleaning.

【Keywords】: key preserving; conjunctive query; deletion propagation; approximation

46. Enumerating Minimal Weight Set Covers.

Paper Link】 【Pages】:518-529

【Authors】: Zahi Ajami ; Sara Cohen

【Abstract】: The weighted set cover problem is defined over a universe U of elements, and a set S of subsets of U, each of which is associated with a weight. The goal is then to find a subset C of S that collectively covers U, while having minimal weight. The decision version of this well-known problem is NP-complete, but approximation algorithms have been presented that are guaranteed to find a θ S -approximation of the optimal solution, where θ S is the harmonic sum of the size of the largest set in S. Finding minimal weight set covers is an important problem, used, e.g., in facility location, team formation and transaction summarization. This paper studies the enumeration version of this problem. Thus, we present an algorithm that enumerates all minimal weight set covers in polynomial delay (i.e., with polynomial time between results) in θ S -approximate order. We also present a variant of this algorithm in order to enumerate non-redundant set covers in θ S -approximate order. Experimental results show that our algorithms run well in practice over both real and synthetic data.

【Keywords】: enumeration; set cover; approximation algorithm

47. Constraints-Based Explanations of Classifications.

Paper Link】 【Pages】:530-541

【Authors】: Daniel Deutch ; Nave Frost

【Abstract】: A main component of many Data Science applications is the invocation of Machine Learning (ML) classifiers. The typical complexity of these classification models makes it difficult to understand the reason for a result, and consequently to assess its trustworthiness and detect errors. We propose a simple generic approach for explaining classifications, by identifying relevant parts of the input whose perturbation would be significant in affecting the classification. In contrast to previous work, our solution makes use of constraints over the data, to guide the search for meaningful explanations in the application domain. Constraints may either be derived from the schema or specified by a domain expert for the purpose of computing explanations. We have implemented the approach for prominent ML models such as Random Forests and Neural Networks. We demonstrate, through examples and experiments, the effectiveness of our solution, and in particular of its novel use of constraints.

【Keywords】: Database constraints theory; Supervised learning by classification; Data provenance

48. KARL: Fast Kernel Aggregation Queries.

Paper Link】 【Pages】:542-553

【Authors】: Tsz Nam Chan ; Man Lung Yiu ; Leong Hou U

【Abstract】: Kernel functions support a broad range of applications that require tasks like density estimation, classification, or outlier detection. In these tasks, a common online operation is to compute the weighted aggregation of kernel function values with respect to a set of points. Scalable aggregation methods are still unknown for typical kernel functions (e.g., Gaussian kernel, polynomial kernel, and sigmoid kernel) and weighting schemes. In this paper, we propose a novel and effective bounding technique to speedup the computation of kernel aggregation. We further boost its efficiency by leveraging index structures and exploiting index tuning opportunities. In addition, our technique is extensible to different types of kernel functions and weightings. Experimental studies on many real datasets reveal that our proposed method achieves speedups of 2.5-738 over the state-of-the-art.

【Keywords】: Kernel functions; Kernel Aggregation Queries

49. Assessing and Remedying Coverage for a Given Dataset.

Paper Link】 【Pages】:554-565

【Authors】: Abolfazl Asudeh ; Zhongjun Jin ; H. V. Jagadish

【Abstract】: Data analysis impacts virtually every aspect of our society today. Often, this analysis is performed on an existing dataset, possibly collected through a process that the data scientists had limited control over. The existing data analyzed may not include the complete universe, but it is expected to cover the diversity of items in the universe. Lack of adequate coverage in the dataset can result in undesirable outcomes such as biased decisions and algorithmic racism, as well as creating vulnerabilities such as opening up room for adversarial attacks. In this paper, we assess the coverage of a given dataset over multiple categorical attributes. We first provide efficient techniques for traversing the combinatorial explosion of value combinations to identify any regions of attribute space not adequately covered by the data. Then, we determine the least amount of additional data that must be obtained to resolve this lack of adequate coverage. We confirm the value of our proposal through both theoretical analyses and comprehensive experiments on real data.

【Keywords】: Input Data Assessment; Disparate Prediction; Fairness; Bias; Discrimination

50. Social Influence-Based Group Representation Learning for Group Recommendation.

Paper Link】 【Pages】:566-577

【Authors】: Hongzhi Yin ; Qinyong Wang ; Kai Zheng ; Zhixu Li ; Jiali Yang ; Xiaofang Zhou

【Abstract】: As social animals, attending group activities is an indispensable part in people's daily social life, and it is an important task for recommender systems to suggest satisfying activities to a group of users. The major challenge in this task is how to aggregate personal preferences of group members to infer the decision of a group. Conventional group recommendation methods applied a predefined strategy for preference aggregation. However, these static strategies are too simple to model the real and complex process of group decision-making, especially for occasional groups which are formed ad-hoc. Moreover, group members should have non-uniform influences or weights in a group, and the weight of a user can be varied in different groups. Therefore, an ideal group recommender system should be able to accurately learn not only users' personal preferences but also the preference aggregation strategy from data. In this paper, we propose a novel group recommender system, namely SIGR (short for "Social Influence-based Group Recommender"), which takes an attention mechanism and a bipartite graph embedding model BGEM as building blocks. Specifically, we adopt an attention mechanism to learn each user's social influence and adapt their social influences to different groups and develop a novel deep social influence learning framework to exploit and integrate users' global and local social network structure information to further improve the estimation of users' social influences. BGEM is extended to model group-item interactions. In order to overcome the limitation and sparsity of the interaction data generated by occasional groups, we propose two model optimization approaches to seamlessly integrate the user-item interaction data. We create two large-scale benchmark datasets and conduct extensive experiments on them. The experimental results show the superiority of our proposed SIGR by comparing with state-of-the-art group recommender models.

【Keywords】: Group Recommendation; Representation Learning; Social Influence Learning; Network Embedding; Data Sparsity; Deep learning

51. MIDAS: Finding the Right Web Sources to Fill Knowledge Gaps.

Paper Link】 【Pages】:578-589

【Authors】: Xiaolan Wang ; Xin Luna Dong ; Yang Li ; Alexandra Meliou

【Abstract】: Knowledge bases, massive collections of facts (RDF triples) on diverse topics, support vital modern applications. However, existing knowledge bases contain very little data compared to the wealth of information on the Web. This is because the industry standard in knowledge base creation and augmentation suffers from a serious bottleneck: they rely on domain experts to identify appropriate web sources to extract data from. Efforts to fully automate knowledge extraction have failed to improve this standard: these automated systems are able to retrieve much more data and from a broader range of sources, but they suffer from very low precision and recall. As a result, these large-scale extractions remain unexploited. In this paper, we present MIDAS, a system that harnesses the results of automated knowledge extraction pipelines to repair the bottleneck in industrial knowledge creation and augmentation processes. MIDAS automates the suggestion of good-quality web sources and describes what to extract with respect to augmenting an existing knowledge base. We make three major contributions. First, we introduce a novel concept, web source slices, to describe the contents of a web source. Second, we define a profit function to quantify the value of a web source slice with respect to augmenting an existing knowledge base. Third, we develop effective and highly-scalable algorithms to derive high-profit web source slices. We demonstrate that MIDAS produces high-profit results and outperforms the baselines significantly on both real-world and synthetic datasets.

【Keywords】: source recommendation; knowledge base

52. Exploiting Centrality Information with Graph Convolutions for Network Representation Learning.

Paper Link】 【Pages】:590-601

【Authors】: Hongxu Chen ; Hongzhi Yin ; Tong Chen ; Quoc Viet Hung Nguyen ; Wen-Chih Peng ; Xue Li

【Abstract】: Network embedding has been proven effective to learn low-dimensional vector representations for network vertices, and recently received a tremendous amount of research attention. However, most of existing methods for network embedding merely focus on preserving the first and second order proximities between nodes, and the important properties of node centrality are neglected. Various centrality measures such as Degree, Closeness, Betweenness, Eigenvector and PageRank centralities have been designed to measure the importance of individual nodes. In this paper, we focus on a novel yet unsolved problem that aims to learn low-dimensional continuous nodes representations that not only preserve the network structure, but also keep the centrality information. We propose a generalizable model, namely GraphCSC, that utilizes both linkage information and centrality information to learn low-dimensional vector representations for network vertices. The learned embeddings by GraphCSC are able to preserve different centrality information of nodes. In addition, we further propose GraphCSC-M, a more comprehensive model that can preserve different centrality information simultaneously through learning multiple centrality-specific embeddings, and a novel attentive multi-view learning approach is developed to compress multiple embeddings of one node into a compact vector representation. Extensive experiments have been conducted to demonstrate that our model is able to preserve different centrality information of nodes, and achieves better performance on several benchmark tasks compared with recent state-of-the-art network embedding methods.

【Keywords】: Network Embedding; Graph Convolutional Networks; Network Centrality

53. Route Recommendations on Road Networks for Arbitrary User Preference Functions.

Paper Link】 【Pages】:602-613

【Authors】: Pranali Yawalkar ; Sayan Ranu

【Abstract】: Several services today are annotated with points of interest (PoIs) such as "coffee shop", "park", etc. In this paper, we study the query where a user provides a set of relevant PoIs and wants to identify the optimal route covering these PoIs. Ideally, the route should be small in length so that the user can conveniently explore the PoIs. On the other hand, the route should cover as many of the input PoIs as possible. These conflicting requirements of the optimal route raise an intriguing question: how do you balance the importance of route length vs. PoI coverage? If the route is to be covered on foot, and it is raining, length is critical for convenience. On the other hand, if the weather conditions are good, or the user is equipped with a vehicle, coverage is more important. In essence, the relative importance depends on several latent factors and we solve this dilemma through skyline route queries. Skyline routes subsume the choice of the optimization function for a route since the optimal route for any rational user is necessarily a part of the skyline set. Our analysis reveals that the problem is NP-hard. We overcome this bottleneck by designing an algorithm called SkyRoute. SkyRoute drastically prunes the search space through a goal-directed search, which is further empowered by Lipschitz embedding, and provides up to 4 orders of magnitude speed-up over baseline techniques.

【Keywords】: route recommendation; road network; POI; points of interest

54. NSCaching: Simple and Efficient Negative Sampling for Knowledge Graph Embedding.

Paper Link】 【Pages】:614-625

【Authors】: Yongqi Zhang ; Quanming Yao ; Yingxia Shao ; Lei Chen

【Abstract】: Knowledge graph (KG) embedding is a fundamental problem in data mining research with many real-world applications. It aims to encode the entities and relations in the graph into low dimensional vector space, which can be used for subsequent algorithms. Negative sampling, which samples negative triplets from non-observed ones in the training data, is an important step in KG embedding. Recently, generative adversarial network (GAN), has been introduced in negative sampling. By sampling negative triplets with large scores, these methods avoid the problem of vanishing gradient and thus obtain better performance. However, using GAN makes the original model more complex and harder to train, where reinforcement learning must be used. In this paper, motivated by the observation that negative triplets with large scores are important but rare, we propose to directly keep track of them with cache. However, how to sample from and update the cache are two important questions. We carefully design the solutions, which are not only efficient but also achieve good balance between exploration and exploitation. In this way, our method acts as a "distilled" version of previous GAN-based methods, which does not waste training time on additional parameters to fit the full distribution of negative triplets. The extensive experiments show that our method can gain significant improvement on various KG embedding models, and outperform the state-of-the-arts negative sampling methods based on GAN.

【Keywords】: Knowledge graph; Graph embedding; Negative sampling

55. ServeDB: Secure, Verifiable, and Efficient Range Queries on Outsourced Database.

Paper Link】 【Pages】:626-637

【Authors】: Songrui Wu ; Qi Li ; Guoliang Li ; Dong Yuan ; Xingliang Yuan ; Cong Wang

【Abstract】: Data outsourcing to cloud has been a common IT practice nowadays due to its significant benefits. Meanwhile, security and privacy concerns are critical obstacles to hinder the further adoption of cloud. Although data encryption can mitigate the problem, it reduces the functionality of query processing, e.g., disabling SQL queries. Several schemes have been proposed to enable one-dimensional query on encrypted data, but multi-dimensional range query has not been well addressed. In this paper, we propose a secure and scalable scheme that can support multi-dimensional range queries over encrypted data. The proposed scheme has three salient features: (1) Privacy: the server cannot learn the contents of queries and data records during query processing. (2) Efficiency: we utilize hierarchical cubes to encode multi-dimensional data records and construct a secure tree index on top of such encoding to achieve sublinear query time. (3) Verifiability: our scheme allows users to verify the correctness and completeness of the query results to address server's malicious behaviors. We perform formal security analysis and comprehensive experimental evaluations. The results on real datasets demonstrate that our scheme achieves practical performance while guaranteeing data privacy and result integrity.

【Keywords】: Multi-dimensional range query; Privacy; Efficiency; Verifiability

56. Collecting and Analyzing Multidimensional Data with Local Differential Privacy.

Paper Link】 【Pages】:638-649

【Authors】: Ning Wang ; Xiaokui Xiao ; Yin Yang ; Jun Zhao ; Siu Cheung Hui ; Hyejin Shin ; Junbum Shin ; Ge Yu

【Abstract】: Local differential privacy (LDP) is a recently proposed privacy standard for collecting and analyzing data, which has been used, e.g., in the Chrome browser, iOS and macOS. In LDP, each user perturbs her information locally, and only sends the randomized version to an aggregator who performs analyses, which protects both the users and the aggregator against private information leaks. Although LDP has attracted much research attention in recent years, the majority of existing work focuses on applying LDP to complex data and/or analysis tasks. In this paper, we point out that the fundamental problem of collecting multidimensional data under LDP has not been addressed sufficiently, and there remains much room for improvement even for basic tasks such as computing the mean value over a single numeric attribute under LDP. Motivated by this, we first propose novel LDP mechanisms for collecting a numeric attribute, whose accuracy is at least no worse (and usually better) than existing solutions in terms of worst-case noise variance. Then, we extend these mechanisms to multidimensional data that can contain both numeric and categorical attributes, where our mechanisms always outperform existing solutions regarding worst-case noise variance. As a case study, we apply our solutions to build an LDP-compliant stochastic gradient descent algorithm (SGD), which powers many important machine learning tasks. Experiments using real datasets confirm the effectiveness of our methods, and their advantages over existing solutions.

【Keywords】: Local differential privacy, multidimensional data, stochastic gradient descent

57. Partitioned Data Security on Outsourced Sensitive and Non-Sensitive Data.

Paper Link】 【Pages】:650-661

【Authors】: Sharad Mehrotra ; Shantanu Sharma ; Jeffrey D. Ullman ; Anurag Mishra

【Abstract】: Despite extensive research on cryptography, secure and efficient query processing over outsourced data remains an open challenge. This paper continues along the emerging trend in secure data processing that recognizes that the entire dataset may not be sensitive, and hence, non-sensitivity of data can be exploited to overcome limitations of existing encryption-based approaches. We propose a new secure approach, entitled query binning (QB) that allows non-sensitive parts of the data to be outsourced in clear-text while guaranteeing that no information is leaked by the joint processing of non-sensitive data (in clear-text) and sensitive data (in encrypted form). QB maps a query to a set of queries over the sensitive and non-sensitive data in a way that no leakage will occur due to the joint processing over sensitive and non-sensitive data. Interestingly, in addition to improve performance, we show that QB actually strengthens the security of the underlying cryptographic technique by preventing size, frequency-count, and workload-skew attacks.

【Keywords】: Data partitioning; Cryptographic techniques; Scalability; Inference attacks; Partitioned data security

58. SecEQP: A Secure and Efficient Scheme for SkNN Query Problem Over Encrypted Geodata on Cloud.

Paper Link】 【Pages】:662-673

【Authors】: Xinyu Lei ; Alex X. Liu ; Rui Li ; Guan-Hua Tu

【Abstract】: Nowadays, location-based services are proliferating and being widely deployed. For example, a Yelp user can obtain a list of the recommended restaurants near his/her current location. For some small or medium location service providers, they may rely on commercial cloud services, e.g., Dropbox, to store the tremendous geospatial data and deal with a number of user queries. However, it is challenging to achieve a secure and efficient location-based query processing over encrypted geospatial data stored on the cloud. In this paper, we propose the Secure and Efficient Query Processing (SecEQP) scheme to address the secure k nearest neighbor (SkNN) query problem. SecEQP employs the projection function-based approach to code neighbor regions of a given location. Given the codes of two locations, the cloud server only needs to compare whether codes equal or not to check the proximity of the two locations. The codes are further embedded into an indistinguishable Bloom filter tree to build a secure and efficient index. The security of SecEQP is formally proved in the random oracle model. We further prototype SecEQP scheme and evaluate its performance on both real-world and synthetic datasets. Our evaluation results show that SecEQP is a highly efficient approach, e.g., top-10 NN query over 1 million datasets only needs less than 40 msec to get queried results.

【Keywords】: secure k-nearest neighbor; projection function; provable security; efficient

59. Joins Over Encrypted Data with Fine Granular Security.

Paper Link】 【Pages】:674-685

【Authors】: Florian Hahn ; Nicolas Loza ; Florian Kerschbaum

【Abstract】: Performing joins over encrypted data is particularly challenging, since the query result or access pattern of 1 : n-joins reveals the frequency of each distinct element in the column. This frequency information is used in many easy, but very detrimental inference attacks. In this paper we present a different approach: Instead of implementing a stand-alone join operator that reveals the frequency of each element in the column, we show how to construct joins over encrypted data after selection operations have been applied. These joins only leak the fine granular access pattern and frequency of elements selected for the join. Our new fine-granularly secure joins use searchable encryption and key-policy attribute-based encryption and support dynamically adding and removing database rows. Their performance is practical and we present an implementation in MySQL.

【Keywords】: secure joins; encrypted databases; database as a service

60. Column-Oriented Database Acceleration Using FPGAs.

Paper Link】 【Pages】:686-697

【Authors】: Satoru Watanabe ; Kazuhisa Fujimoto ; Yuji Saeki ; Yoshifumi Fujikawa ; Hiroshi Yoshino

【Abstract】: The in-memory system is promising for improving the performance of column-oriented database management systems (DBMSs). However, in comparison with NAND-flash-based solid-state drives (SSDs), dynamic random access memories (DRAMs) are around ten times more expensive and tens to thousands of times more energy inefficient. To overcome this drawback, we developed a column-oriented DBMS and a field-programmable-gate-array-based acceleration engine. We integrated them in FCAccel, our prototype system. The acceleration engine accelerates data extraction from SSDs for SQL processing. We compared the performance of FCAccel with that of MonetDB, Impala and PostgreSQL. Performance was evaluated under the conditions that MonetDB and Impala stored all data in DRAMs and FCAccel stored all data in SSDs. With regard to data extraction, the performance of FCAccel ranged from 0.77 to 1.79 times in comparison with that of MonetDB and from 4.10 to 23.4 times in comparison with that of Impala. These experimental results imply that the acceleration engine can remove the necessity to store all data in DRAMs.

【Keywords】: Column-oriented Database; Acceleration; FPGA

61. Hardware-Conscious Hash-Joins on GPUs.

Paper Link】 【Pages】:698-709

【Authors】: Panagiotis Sioulas ; Periklis Chrysogelos ; Manos Karpathiotakis ; Raja Appuswamy ; Anastasia Ailamaki

【Abstract】: Traditionally, analytical database engines have used task parallelism provided by modern multisocket multicore CPUs for scaling query execution. Over the past few years, GPUs have started gaining traction as accelerators for processing analytical queries due to their massively data-parallel nature and high memory bandwidth. Recent work on designing join algorithms for CPUs has shown that carefully tuned join implementations that exploit underlying hardware can outperform naive, hardware-oblivious counterparts and provide excellent performance on modern multicore servers. However, there has been no such systematic analysis of hardware-conscious join algorithms for GPUs that systematically explores the dimensions of partitioning (partitioned versus non-partitioned joins), data location (data fitting and not fitting in GPU device memory), and access pattern (skewed versus uniform). In this paper, we present the design and implementation of a family of novel, partitioning-based GPU-join algorithms that are tuned to exploit various GPU hardware characteristics for working around the two main limitations of GPUs-limited memory capacity and slow PCIe interface. Using a thorough evaluation, we show that: i) hardware-consciousness plays a key role in GPU joins similar to CPU joins and our join algorithms can process 1 Billion tuples/second even if no data is GPU resident, ii) radix partitioning-based GPU joins that are tuned to exploit GPU hardware can substantially outperform non-partitioned hash joins, iii) hardware-conscious GPU joins can effectively overcome GPU limitations and match, or even outperform, state-of-the-art CPU joins.

【Keywords】: join; GPU; databases; analytics

62. TuFast: A Lightweight Parallelization Library for Graph Analytics.

Paper Link】 【Pages】:710-721

【Authors】: Zechao Shang ; Jeffrey Xu Yu ; Zhiwei Zhang

【Abstract】: Recently, there has been significant interest in large-scale graph analytics systems. However, most of the design efforts focus on accelerating graph analytics on giant graphs and/or in a distributed environment. Little attention focuses on the programmer usability perspective, which is critical to implementing ad-hoc analytics on moderate size graphs. In this paper, we present a lightweight transactional memory (TM) library TuFast which provides easy-to-use primitives for the end-user to agilely develop fast shared memory graph parallelization on a multi-core server. TuFast exploits recent CPU instructions set Hardware Transactional Memory (HTM), which has been available in off-the-shelf CPUs. HTM offers free transactional semantic but also suffers from capacity limitation. Our framework resolves the capacity challenge and efficiently utilizes HTM on graph parallelization by exploiting the graph degree information. Large scale graphs have a power-law degree distribution: a large proportion of the vertices with a small degree, fits in single HTM transactions; a small proportion of vertices with a big degree fits a pessimistic approach like locking; other vertices with a moderate degree can be processed with an optimistic approach with HTM acceleration. Our hybrid approach automatically adapts to the degree of graphs dynamically during the processing. The graph analytical jobs expressed via our library are straightforward and concise and outperform state-of-the-art distributed and multi-core graph analytical systems by up to 4 orders of magnitude.

【Keywords】: Hardware Transactional Memory; Graph Analytics; Hybrid Transactional Memory

63. LDC: A Lower-Level Driven Compaction Method to Optimize SSD-Oriented Key-Value Stores.

Paper Link】 【Pages】:722-733

【Authors】: Yunpeng Chai ; Yanfeng Chai ; Xin Wang ; Haocheng Wei ; Ning Bao ; Yushi Liang

【Abstract】: Log-structured merge (LSM) tree key-value (KV) stores have been widely deployed in many NoSQL and SQL systems, serving online big data applications such as social networking, bioinfomatics, graph processing, machine learning, etc. The batch processing of sorted data merging (i.e., compaction) in LSM-tree KV stores greatly improves the efficiency of writing, leading to good write performance and high space efficiency. Recently, some lazy compaction methods were proposed to further promote the system throughput through delaying the compaction to accumulate more data within a compaction batch. However, the batched writing manner also leads to significant tail latency, which is unacceptable for online processing, and the newly proposed lazy approaches worsen the tail latency problem. Furthermore, the unbalanced read/write performance of the widely deployed SSDs make the performance optimization harder. Aiming to optimize both the tail latency and the system throughput, in this paper, we propose a novel Lower-level Driven Compaction (LDC) method for LSM-tree KV stores. LDC breaks the limitations of the traditional upper-level driven compaction manner and triggers practical compaction actions by lower-level data. It has the benefits of both decreasing the compaction granularity effectively for smaller tail latency and reducing the write amplification of LSM-tree compaction for higher throughput. We have implemented LDC in LevelDB; the experimental results indicate that LDC can reduce the 99.9th percentile latency for 2.62 times compared with the traditional upper-level driven compaction mechanism, and achieve 56.7% ~ 72.3% higher system throughput at the same time.

【Keywords】: LSM-tree; compaction; SSD; KV store; tail latency

64. No False Negatives: Accepting All Useful Schedules in a Fast Serializable Many-Core System.

Paper Link】 【Pages】:734-745

【Authors】: Dominik Durner ; Thomas Neumann

【Abstract】: Concurrency control is one of the most performance critical steps in modern many-core database systems. Achieving higher throughput on multi-socket servers is difficult and many concurrency control algorithms reduce the amount of accepted schedules in favor of transaction throughput or relax the isolation level which introduces unwanted anomalies. Both approaches lead to unexpected transaction behavior that is difficult to understand by the database users. We introduce a novel multi-version concurrency protocol that achieves high performance while reducing the number of aborted schedules to a minimum and providing the best isolation level. Our approach leverages the idea of a graph-based scheduler that uses the concept of conflict graphs. As conflict serializable histories can be represented by acyclic conflict graphs, our scheduler maintains the conflict graph and allows all transactions that keep the graph acyclic. All conflict serializable schedules can be accepted by such a graph-based algorithm due to the conflict graph theorem. Hence, only transaction schedules that truly violate the serializability constraints need to abort. Our developed approach is able to accept the useful intersection of commit order preserving conflict serializable (COCSR) and recoverable (RC) schedules which are the two most desirable classes in terms of correctness and user experience. We show experimentally that our graph-based scheduler has very competitive throughput in pure transactional workloads while providing fewer aborts and improved user experience. Our multi-version extension helps to efficiently perform long-running read transactions on the same up-to-date database. Moreover, our graph-based scheduler can outperform the competitors on mixed workloads.

【Keywords】: transaction processing; concurrency control; Modern Hardware and In Memory Database Systems

65. Discovering Maximal Motif Cliques in Large Heterogeneous Information Networks.

Paper Link】 【Pages】:746-757

【Authors】: Jiafeng Hu ; Reynold Cheng ; Kevin Chen-Chuan Chang ; Aravind Sankar ; Yixiang Fang ; Brian Y. H. Lam

【Abstract】: We study the discovery of cliques (or "complete" subgraphs) in heterogeneous information networks (HINs). Existing clique-finding solutions often ignore the rich semantics of HINs. We propose motif clique, or m-clique, which redefines subgraph completeness with respect to a given motif. A motif, essentially a small subgraph pattern, is a fundamental building block of an HIN. The m-clique concept is general and allows us to analyse "complete" subgraphs in an HIN with respect to desired high-order connection patterns. We further investigate the maximal m-clique enumeration problem (MMCE), which finds all maximal m-cliques not contained in any other m-cliques. Because MMCE is NP-hard, developing an accurate and efficient solution for MMCE is not straightforward. We thus present the META algorithm, which employs advanced pruning strategies to effectively reduce the search space. We also design fast techniques to avoid generating duplicated maximal m-clique instances. Our extensive experiments on large real and synthetic HINs show that META is highly effective and efficient.

【Keywords】: Motif; Clique; Heterogeneous Information Networks

66. REPT: A Streaming Algorithm of Approximating Global and Local Triangle Counts in Parallel.

Paper Link】 【Pages】:758-769

【Authors】: Pinghui Wang ; Peng Jia ; Yiyan Qi ; Yu Sun ; Jing Tao ; Xiaohong Guan

【Abstract】: Recently, considerable efforts have been devoted to approximately computing the global and local (i.e., incident to each node) triangle counts of a large graph stream represented as a sequence of edges. Existing approximate triangle counting algorithms rely on sampling techniques to reduce the computational cost. However, their estimation errors are significantly determined by the covariance between sampled triangles. Moreover, little attention has been paid to developing parallel one-pass streaming algorithms that can be used to fast and approximately count triangles on a multi-core machine or a cluster of machines. To solve these problems, we develop a novel parallel method REPT to significantly reduce the covariance (even completely eliminate the covariance for some cases) between sampled triangles. We theoretically prove that REPT is more accurate than parallelizing existing triangle count estimation algorithms in a direct manner. In addition, we also conduct extensive experiments on a variety of real-world graphs, and the results demonstrate that our method REPT is several times more accurate than state-of-the-art methods.

【Keywords】: streaming algorithm; global and local triangle approximation; parallel

67. Information Diffusion Prediction via Recurrent Cascades Convolution.

Paper Link】 【Pages】:770-781

【Authors】: Xueqin Chen ; Fan Zhou ; Kunpeng Zhang ; Goce Trajcevski ; Ting Zhong ; Fengli Zhang

【Abstract】: Effectively predicting the size of an information cascade is critical for many applications spanning from identifying viral marketing and fake news to precise recommendation and online advertising. Traditional approaches either heavily depend on underlying diffusion models and are not optimized for popularity prediction, or use complicated hand-crafted features that cannot be easily generalized to different types of cascades. Recent generative approaches allow for understanding the spreading mechanisms, but with unsatisfactory prediction accuracy. To capture both the underlying structures governing the spread of information and inherent dependencies between re-tweeting behaviors of users, we propose a semi-supervised method, called Recurrent Cascades Convolutional Networks (CasCN), which explicitly models and predicts cascades through learning the latent representation of both structural and temporal information, without involving any other features. In contrast to the existing single, undirected and stationary Graph Convolutional Networks (GCNs), CasCN is a novel multi-directional/dynamic GCN. Our experiments conducted on real-world datasets show that CasCN significantly improves the prediction accuracy and reduces the computational cost compared to state-of-the-art approaches.

【Keywords】: information cascade, structural-temporal information, cascade size prediction, deep learning

68. Finding Densest Lasting Subgraphs in Dynamic Graphs: A Stochastic Approach.

Paper Link】 【Pages】:782-793

【Authors】: Xuanming Liu ; Tingjian Ge ; Yinghui Wu

【Abstract】: One important problem that is insufficiently studied is finding densest lasting-subgraphs in large dynamic graphs, which considers the time duration of the subgraph pattern. We propose a framework called Expectation-Maximization with Utility functions (EMU), a novel stochastic approach that nontrivially extends the conventional EM approach. EMU has the flexibility of optimizing any user-defined utility functions. We validate our EMU approach by showing that it converges to the optimum-by proving that it is a specification of the general Minorization-Maximization (MM) framework with convergence guarantees. We then devise EMU algorithms for the densest lasting subgraph problem. Using real-world graph data, we experimentally verify the effectiveness and efficiency of our techniques, and compare with two prior approaches on dense subgraph detection.

【Keywords】: dynamic graph; densest lasting subgraph; Expectation Maximization with Utility functions

69. Multicapacity Facility Selection in Networks.

Paper Link】 【Pages】:794-805

【Authors】: Alvis Logins ; Panagiotis Karras ; Christian S. Jensen

【Abstract】: Consider the task of selecting a set of facilities, e.g., hotspots, shops, or utility stations, each with a capacity to serve a certain number of customers. Given a set of customer locations, we have to minimize a cumulative distance between each customer and the facility earmarked to serve this customer within its capacity. This problem is known as the Capacitated k-Median (CKM) problem. In a data-intensive variant, distances are calculated over a network, while a data set associates each candidate facility location with a different capacity. In other words, going beyond positioning facilities in a metric space, the problem is to select a small subset out of a large data set of candidate network-based facilities with capacity constraints. We call this variant the Multicapacity Facility Selection (MCFS) problem. Linear Programming solutions are unable to contend with the network sizes and supplies of candidate facilities encountered in real-world applications; yet the problem may need to be solved scalably and repeatedly, as in applications requiring the dynamic reallocation of customers to facilities. We present the first, to our knowledge, solution to the MCFS problem that achieves both scalability and high quality, the Wide Matching Algorithm (WMA). WMA iteratively assigns customers to candidate facilities and leverages a data-driven heuristic for the SETCOVER problem inherent to the MCFS problem. An extensive experimental study with real-world and synthetic networks demonstrates that WMA scales gracefully to million-node networks and large facility and customer data sets; further, WMA provides a solution quality superior to scalable baselines (also proposed in the paper) and competitive vis-á-vis the optimal solution, returned by an off-the-shelf solver that runs only on small facility databases.

【Keywords】: facility selection; multicapacity

70. An MBR-Oriented Approach for Efficient Skyline Query Processing.

Paper Link】 【Pages】:806-817

【Authors】: Ji Zhang ; Wenlu Wang ; Xunfei Jiang ; Wei-Shinn Ku ; Hua Lu

【Abstract】: This research proposes an advanced approach that improves the efficiency of skyline query processing by significantly reducing the computational cost on object comparisons, i.e., dominance tests between objects. Our solutions are based on two novel concepts. The skyline query over Minimum Bounding Rectangles (MBRs) receives a set of MBRs and returns the MBRs that are not dominated by other MBRs. In the dominance test for MBRs, the detailed attribute values of objects in the MBRs are not accessed. Moreover, the dependent group of MBRs reduces the search space for dominance tests. Objects in an MBR are only compared with the ones in the corresponding dependent groups of the MBR rather than with the entire dataset. Our solutions apply the two concepts to the R-tree in order to use its hierarchical structure in which every node is a natural abstraction of an MBR. Specifically, given the R-tree index of an input dataset, we first eliminate unqualified objects by utilizing the skyline query over MBRs (i.e., intermediate nodes in the R-tree). Subsequently, we generate dependent groups for the skyline MBRs. Two dependent group generation methods that rely on either the sorting technique or the R-tree index are developed. Further, we apply an existing skyline algorithm to every dependent group, and the results of the original skyline query are the union of skyline objects in the dependent groups. In addition, we also analyze the cardinality of the two new concepts based on a probabilistic model, which enables us to analyze the computational complexity of the proposed solutions. Our experimental results show that the proposed solutions are clearly more efficient than the state-of-the-art approaches.

【Keywords】: Skyline; MBR

71. Dynamic Set kNN Self-Join.

Paper Link】 【Pages】:818-829

【Authors】: Daichi Amagata ; Takahiro Hara ; Chuan Xiao

【Abstract】: In many applications, data objects can be represented as sets. For example, in video on-demand and social network services, the user data consists of a set of movies that have been watched and a set of users (friends), respectively, and they can be used for recommendation and information extraction. The problem of set similarity self-join hence has been studied extensively. Existing studies assume that sets are static, but in the above applications, sets are dynamically updated, and this requires continuous updating the join result. In this paper, we study a novel problem, dynamic set kNN self-join, i.e., for each set, we continuously compute its k nearest neighbor sets. Our problem poses a challenge for the efficiency of computation, because just an element insertion (deletion) into (from) a set may affect the kNN results of many sets. To address this challenge, we first investigate the property of the dynamic set kNN self-join problem to observe the search space derived from a set update. Then, based on this observation, we propose an efficient algorithm. This algorithm employs an indexing technique that enables incremental similarity computation and prunes unnecessary similarity computation. Our empirical studies using real datasets show the efficiency and scalability of our algorithm.

【Keywords】: Social networking (online); Indexes; Motion pictures; Heuristic algorithms; Collaboration; Real-time systems; Sorting

72. Packed Memory Arrays - Rewired.

Paper Link】 【Pages】:830-841

【Authors】: Dean De Leo ; Peter A. Boncz

【Abstract】: The physical memory layout of a tree-based index structure deteriorates over time as it sustains more updates; such that sequential scans on the physical level become non-sequential, and therefore slower. Packed Memory Arrays (PMAs) prevent this by managing all data in a sequential sparse array. PMAs have been studied mostly theoretically but suffer from practical problems, as we show in this paper. We study and fix these problems, resulting in an improved data structure: the Rewired Memory Array (RMA). We compare RMA with the main previous PMA implementations as well as state-of-the-art tree index structures and show on a wide variety of data and query distributions that RMA can reach competitive update and point lookup performance, while always providing superior scan performance - close to dense column scans.

【Keywords】: Indexing; Data structure; PMA; Sparse array

73. GEM^2-Tree: A Gas-Efficient Structure for Authenticated Range Queries in Blockchain.

Paper Link】 【Pages】:842-853

【Authors】: Ce Zhang ; Cheng Xu ; Jianliang Xu ; Yuzhe Tang ; Byron Choi

【Abstract】: Blockchain technology has attracted much attention due to the great success of the cryptocurrencies. Owing to its immutability property and consensus protocol, blockchain offers a new solution for trusted storage and computation services. To scale up the services, prior research has suggested a hybrid storage architecture, where only small meta-data are stored onchain and the raw data are outsourced to off-chain storage. To protect data integrity, a cryptographic proof can be constructed online for queries over the data stored in the system. However, the previous schemes only support simple key-value queries. In this paper, we take the first step toward studying authenticated range queries in the hybrid-storage blockchain. The key challenge lies in how to design an authenticated data structure (ADS) that can be efficiently maintained by the blockchain, in which a unique gas cost model is employed. By analyzing the performance of the existing techniques, we propose a novel ADS, called GEM 2 -tree, which is not only gas-efficient but also effective in supporting authenticated queries. To further reduce the ADS maintenance cost without sacrificing much the query performance, we also propose an optimized structure, GEM 2 *-tree, by designing a two-level index structure. Theoretical analysis and empirical evaluation validate the performance of the proposed ADSs.

【Keywords】: Blockchain; Smart contract; Range query; Authenticated query

74. Effective Filters and Linear Time Verification for Tree Similarity Joins.

Paper Link】 【Pages】:854-865

【Authors】: Thomas Hütter ; Mateusz Pawlik ; Robert Loschinger ; Nikolaus Augsten

【Abstract】: The tree similarity join computes all similar pairs in a collection of trees. Two trees are similar if their edit distance falls within a user-defined threshold. Previous algorithms, which are based on a filter-verify approach, suffer from the following two issues. First, ineffective filters produce a large number of candidates that must be further verified. Second, the candidates are verified by computing the tree edit distance, which is cubic in the number of tree nodes. Thus, these techniques fail to scale to large tree collections and are infeasible even for small collections when the trees are large. In this paper, we present a scalable solution for the tree similarity join that is based on (1) an effective indexing technique that leverages both the labels and the structure of trees to reduce the number of candidates, (2) an efficient upper bound filter that moves many of the candidates directly to the join result without additional verification, (3) a linear time verification technique for the remaining candidates that avoids the expensive tree edit distance. We are the first to propose an efficient, non-trivial upper bound and linear time verification for the tree edit distance. Unlike previous solutions, our approach scales to collections with millions of trees. We improve the overall join time by up to two orders of magnitude w.r.t. the state of the art.

【Keywords】: tree similarity join; tree edit distance

75. KV-Match: A Subsequence Matching Approach Supporting Normalization and Time Warping.

Paper Link】 【Pages】:866-877

【Authors】: Jiaye Wu ; Peng Wang ; Ningting Pan ; Chen Wang ; Wei Wang ; Jianmin Wang

【Abstract】: The volume of time series data has exploded due to the popularity of new applications, such as data center management and IoT. Subsequence matching is a fundamental task in mining time series data. All index-based approaches only consider raw subsequence matching (RSM) and do not support subsequence normalization. UCR Suite can deal with normalized subsequence matching problem (NSM), but it needs to scan full time series. In this paper, we propose a novel problem, named constrained normalized subsequence matching problem (cNSM), which adds some constraints to NSM problem. The cNSM problem provides a knob to flexibly control the degree of offset shifting and amplitude scaling, which enables users to build the index to process the query. We propose a new index structure, KV-index, and the matching algorithm, KV-match. With a single index, our approach can support both RSM and cNSM problems under either ED or DTW distance. KV-index is a key-value structure, which can be easily implemented on local files or HBase tables. To support the query of arbitrary lengths, we extend KV-match to KV-match_DP, which utilizes multiple varied-length indexes to process the query. We conduct extensive experiments on synthetic and real-world datasets. The results verify the effectiveness and efficiency of our approach.

【Keywords】: timeseries; subsequence matching; similarity search; normalization; dynamic time warping

76. Efficient Maximal Spatial Clique Enumeration.

Paper Link】 【Pages】:878-889

【Authors】: Chen Zhang ; Ying Zhang ; Wenjie Zhang ; Lu Qin ; Jianye Yang

【Abstract】: Maximal clique enumeration is a fundamental problem in graph database. In this paper, we investigate this problem in the context of spatial database. Given a set P of spatial objects in a 2-dimensional space (e.g., geo-locations of users or point of interests) and a distance threshold r, we can come up with a spatial neighbourhood graph P r by connecting every pair of objects (vertices) in P within distance r. Given a clique S of P r , namely a spatial clique, it is immediate that any pairwise distance among objects in S is bounded by r. As the maximal pairwise distance has been widely used to capture the spatial cohesiveness of a group of objects, the maximal spatial clique enumeration technique can identify groups of spatially close objects in a variety of location-based-service (LBS) applications. In addition, we show that the maximal spatial clique enumeration can also be used to identify maximal clique pattern instances in the co-location pattern mining applications. Given the existing techniques for maximal clique enumeration, which can be immediately applied on the spatial neighbourhood graph P r , two questions naturally arise for the enumeration of maximal spatial cliques: (1) the maximal clique enumeration on general graph is NP hard, can we have a polynomial time solution on the spatial neighbourhood graph? and (2) can we exploit the geometric property of the spatial clique to speed up the computation? In this paper, we give a negative answer to the first question by an example where the number of maximal spatial cliques is exponential to the number of the objects. While the answer to the second question is rather positive: we indeed develop two pruning techniques based on geometric properties of the maximal spatial clique to significantly enhance the computing efficiency. Extensive experiments on real-life geolocation data demonstrate the superior performance of proposed methods compared with two baseline algorithms.

【Keywords】: spatial clique

77. Cluster-Based Subscription Matching for Geo-Textual Data Streams.

Paper Link】 【Pages】:890-901

【Authors】: Lisi Chen ; Shuo Shang ; Kai Zheng ; Panos Kalnis

【Abstract】: Geo-textual data that contain spatial, textual, and temporal information are being generated at a very high rate. These geo-textual data cover a wide range of topics. Users may be interested in receiving local popular topics from geo-textual messages. We study the cluster-based subscription matching (CSM) problem. Given a stream of geo-textual messages, we maintain up-to-date clustering results based on a threshold-based online clustering algorithm. Based on the clustering result, we feed subscribers with their preferred geo-textual message clusters according to their specified keywords and location. Moreover, we summarize each cluster by selecting a set of representative messages. The CSM problem considers spatial proximity, textual relevance, and message freshness during the clustering, cluster feeding, and summarization processes. To solve the CSM problem, we propose a novel solution to cluster, feed, and summarize a stream of geo-textual messages efficiently. We evaluate the efficiency of our solution on two real-world datasets and the experimental results demonstrate that our solution is capable of high efficiency compared with baselines.

【Keywords】: geo textual; stream; subscribe; event detection; clustering; summarisation

78. Time-Dependent Hop Labeling on Road Network.

Paper Link】 【Pages】:902-913

【Authors】: Lei Li ; Sibo Wang ; Xiaofang Zhou

【Abstract】: Route scheduling on time-dependent road network is slow due to its problem complexity of Ω(T(|V|log |V|+|E|)), where T is the size of the result's time-dependent function, |V| is the number of vertices and |E| is the number of edges. To make things worse, T grows larger as the route becomes longer or the query time interval becomes bigger, especially for a fastest path profile query whose time interval is 24 hours. In this paper, we aim to answer the fastest path profile query on time-dependent road network faster by extending the 2-hop labeling approach, which is fast in answering shortest distance query on the static graph. However, building an index on a time-dependent graph is both time and space consuming, so currently only online-search approach exist. Apparently, its query answering power is limited by the online searching. To solve this problem, we first propose the time-dependent hop on large road network by partitioning it into smaller sub-graphs. The index is built within and between the partitions, and is retrieved from disk during query answering with the help of sampling. Moreover, we propose an online approximation technique AT-Dijkstra and a bottom-up compression method to further reduce the label size, save construction time and speedup query answering. Experiments on real world road network show that our approach outperforms the state-of-art fastest path index approaches and can speed up the query answering by hundreds of times.

【Keywords】: Road Network; Time Dependent; 2 Hop

79. Weight-Constrained Route Planning Over Time-Dependent Graphs.

Paper Link】 【Pages】:914-925

【Authors】: Ye Yuan ; Xiang Lian ; Guoren Wang ; Lei Chen ; Yuliang Ma ; Yishu Wang

【Abstract】: Weight-constrained route planning (WRP) over static graphs has been extensively studied due to its wide application to transportation networks. However, real transportation networks often evolve over time and are thus modeled as time-dependent graphs. In this paper, we study the WRP problem over a large time-dependent graph by incorporating continuous time and weight functions into it. Most existing works regarding route planning over time-dependent graphs are based on the first-in-first-out (FIFO) property. Unfortunately, the FIFO property does not hold for our problem. To solve the problem, we propose two novel route planning algorithms, namely, a baseline algorithm and an advanced algorithm. Specifically, the advanced algorithm is even more efficient than the baseline algorithm, as the advanced algorithm incorporates a fast traversal scheme and tight bounds of time functions to terminate the traversal as early as possible. We confirm the effectiveness and efficiency of our algorithms by extensive experiments on real datasets.

【Keywords】: Time Dependent Graphs; Route Planning; Weight Constraint

80. Skyline Queries Constrained by Multi-cost Transportation Networks.

Paper Link】 【Pages】:926-937

【Authors】: Qixu Gong ; Huiping Cao ; Parth Nagarkar

【Abstract】: Skyline queries are used to find the Pareto optimal solution from datasets containing multi-dimensional data points. In this paper, we propose a new type of skyline queries whose evaluation is constrained by a multi-cost transportation network (MCTN) and whose answers are off the network. This type of skyline queries is useful in many applications. For example, a person wants to find an apartment by considering not only the price and the surrounding area of the apartment, but also the transportation cost, time, and distance between the apartment and his/her work place. Most existing works that evaluate skyline queries on multi-cost networks (MCNs), which are either MCTNs or road networks, find interesting objects that locate on edges of the networks. Formally, our new type of skyline queries takes as input an MCTN, a query point q, and a set of objects of interest D with spatial information, where q and the objects in D are off the network. The answers to such queries are objects in D that are not dominated by other D objects when considering the multiple attributes of these objects and the multiple network cost from q to the solution objects. To evaluate such queries, we propose an exact search algorithm and its improved version by implementing several properties. The space of the exact skyline solutions is huge and can easily reach the order of thousands and incur long evaluation time. We further design much more efficient heuristic methods to find approximate solutions. We run extensive experiments using both real and synthetic datasets to test the effectiveness and efficiency of our proposed approaches. The results show that the exact search algorithm can be dramatically improved by utilizing several properties. The heuristic approaches to find approximate answers can largely reduce the query time and retrieve results that are comparable to the exact solutions.

【Keywords】: Skyline Queries; Transportation Networks; Multi-dimensional Data

81. Online Social Media Recommendation Over Streams.

Paper Link】 【Pages】:938-949

【Authors】: Xiangmin Zhou ; Dong Qin ; Xiaolu Lu ; Lei Chen ; Yanchun Zhang

【Abstract】: As one of the most popular services over online platforms, social recommendation has attracted increasing research efforts recently. Among all the recommendation tasks, an important one is item recommendation over high speed social media streams. Existing stream recommendation techniques are not effective for handling social users with diverse interests. Meanwhile, approaches for recommending items to a particular user are not efficient when applied to a huge number of users over high speed streams. In this paper, we propose a novel framework for the social recommendation over streams. Specifically, we first propose a novel Bi-Layer Hidden Markov Model (BiHMM) that adaptively captures the users' behaviors and their interactions with influential official accounts to predict their long-term and short-term interests. Then, we design a new probabilistic entity matching scheme for identifying the relevance score of a streaming item to a user. Moreover, we propose a novel index scheme called CPPse-index for improving the efficiency of our solution. Extensive tests are conducted to prove the superiority of our approach in terms of the recommendation quality and time cost.

【Keywords】: User interests; Bi-Layer HMM; Social stream

82. Canonicalization of Open Knowledge Bases with Side Information from the Source Text.

Paper Link】 【Pages】:950-961

【Authors】: Xueling Lin ; Lei Chen

【Abstract】: Nowadays Open Information Extraction (Open IE) approaches, which extract <;noun phrase, relation phrase, noun phrase> triples from unstructured text, contribute to the construction of large Open Knowledge Bases (Open KBs). However, one crucial problem is that the noun phrases and relation phrases in the extracted triples are not well canonicalized, which leads to a large number of redundant and ambiguous facts. For example, both <;Barack Obama, was born in, Honolulu> and <;President Obama, has birthplace, Honolulu> may be extracted and stored in Open KBs. Recent research proposes to solve this problem by clustering over manually-defined feature spaces based on the similarity of the noun phrases and relation phrases. However, the performance of such techniques is limited, since only the information contained in the triples is utilized to measure their similarity. In this paper, we propose to perform canonicalization over Open IE triples by incorporating the side information from the original data sources, including the candidate entities of the noun phrases detected in the source text, the types of the candidate entities and the domain knowledge of the source text. We model the canonicalization problem of noun phrases and relation phrases jointly based on such side information, and demonstrate the effectiveness of our approach through extensive experiments on two real-world datasets.

【Keywords】: Open Knowledge Base Canonicalization; Information Extraction; Knowledge Base; Knowledge Representation

83. Walking with Perception: Efficient Random Walk Sampling via Common Neighbor Awareness.

Paper Link】 【Pages】:962-973

【Authors】: Yongkun Li ; Zhiyong Wu ; Shuai Lin ; Hong Xie ; Min Lv ; Yinlong Xu ; John C. S. Lui

【Abstract】: Random walk is widely applied to sample large-scale graphs due to its simplicity of implementation and solid theoretical foundations of bias analysis. However, its computational efficiency is heavily limited by the slow convergence rate (a.k.a. long burn-in period). To address this issue, we propose a common neighbor aware random walk framework called CNARW, which leverages weighted walking by differentiating the next-hop candidate nodes to speed up the convergence. Specifically, CNARW takes into consideration the common neighbors between previously visited nodes and next-hop candidate nodes in each walking step. Based on CNARW, we further develop two efficient "unbiased sampling" schemes. Experimental results on real-world network datasets show that our approach converges remarkably faster than the state-of-the-art random walk sampling algorithms. Furthermore, to achieve the same estimation accuracy, our approach reduces the query cost (a measure of sampling budget) significantly. Lastly, we also use two case studies to demonstrate the effectiveness of our sampling framework in solving large-scale graph analysis tasks.

【Keywords】: Random Walk; Graph Analysis; Convergence

Paper Link】 【Pages】:974-985

【Authors】: Tova Milo ; Amit Somech ; Brit Youngmann

【Abstract】: As more and more social network users interact through Internet Memes, an emerging popular type of captioned images, there is a growing need for users to quickly retrieve the right Meme for a given situation. As opposed conventional image search, visually similar Memes may reflect different concepts. Intent is sometimes captured by user annotations (e.g., tags), but these are often incomplete and ambiguous. Thus, a deeper analysis of the relations among Memes is required for an accurate, custom search. To address this problem, we present SimMeme, a Meme-dedicated search engine. SimMeme uses a generic graph-based data model that aligns various types of information about the Memes with a semantic ontology. A novel similarity measure that effectively considers all incorporated data is employed and serves as the foundation of our system. Our experimental results achieve using common evaluation metrics and crowd feedback, over a large repository of real-life annotated Memes, show that in the task of Meme retrieval, SimMeme outperforms state-of-the-art solutions for image retrieval.

【Keywords】: Internet Memes; Similarity; Semantics; Information Network

85. A Hierarchical Framework for Top-k Location-Aware Error-Tolerant Keyword Search.

Paper Link】 【Pages】:986-997

【Authors】: Junye Yang ; Yong Zhang ; Xiaofang Zhou ; Jin Wang ; Huiqi Hu ; Chunxiao Xing

【Abstract】: Location-aware services have become widely available on a variety of devices. The resulting fusion of spatio-textual data enables the kind of top-k query that takes into account both location proximity and text relevance. Considering both the misspellings in user input and the data quality issues of spatiotextual databases, it is necessary to support error-tolerant spatial keyword search for end-users. Existing studies mainly focused on set-based textual relevance, but they cannot find reasonable results when the input tokens are not exactly matched with those from records in the database. In this paper, we propose a novel framework to solve the problem of top-k location-aware similarity search with fuzzy token matching. We propose a hierarchical index HGR-Tree to capture signatures of both spatial and textual relevance. Based on such an index structure, we devise a best-first search algorithm to preferentially access nodes of HGR-Tree with more similar objects while those with dissimilar ones can be pruned. We further devise an incremental search strategy to reduce the overhead brought by supporting fuzzy token matching. Experimental results on real world POI datasets show that our framework outperforms state-of-the-art methods by one to two orders of magnitude.

【Keywords】: spatio textual similarity search; top-k search; index

86. 2ED: An Efficient Entity Extraction Algorithm Using Two-Level Edit-Distance.

Paper Link】 【Pages】:998-1009

【Authors】: Zeyi Wen ; Dong Deng ; Rui Zhang ; Ramamohanarao Kotagiri

【Abstract】: Entity extraction is fundamental to many text mining tasks such as organisation name recognition. A popular approach to entity extraction is based on string matching against a dictionary of known entities. For approximate entity extraction from free text, considering solely character-based or solely token-based similarity cannot simultaneously deal with minor name variations at token-level and typos at character-level. Moreover, the tolerance of mismatch in character-level may be different from that in token-level, and the tolerance thresholds of the two levels should be able to be customised individually. In this paper, we propose an efficient character-level and token-level edit-distance based algorithm called FuzzyED. To improve the efficiency of FuzzyED, we develop various novel techniques including (i) a spanning-based candidate sub-string producing technique, (ii) a lower bound dissimilarity to determine the boundaries of candidate sub-strings, (iii) a core token based technique that makes use of the importance of tokens to reduce the number of unpromising candidate sub-strings, and (iv) a shrinking technique to reuse computation. Empirical results on real world datasets show that FuzzyED can efficiently extract entities and produce a high F 1 score in the range of [0.91, 0.97].

【Keywords】: Entity extraction; Edit distance; Approximation

87. Bridging Quantities in Tables and Text.

Paper Link】 【Pages】:1010-1021

【Authors】: Yusra Ibrahim ; Mirek Riedewald ; Gerhard Weikum ; Demetrios Zeinalipour-Yazti

【Abstract】: There is a wealth of schema-free tables on the Web, holding valuable information about quantities on sales and costs, environmental footprint of cars, health data and more. Table content can only be properly interpreted in conjunction with the textual context that surrounds the tables. This paper introduces the quantity alignment problem: bidirectional linking between textual mentions of quantities and the corresponding table cells, in order to support advanced content summarization and faster navigation between explanations in text and details in tables. We present the BriQ system for computing such alignments. BriQ is designed to cope with the specific challenges of approximate quantities, aggregated quantities, and calculated quantities in text that are common but cannot be directly matched in table cells. We judiciously combine feature-based classification with joint inference by random walks over candidate alignment graphs. Experiments with a large collection of tables from the Common Crawl project demonstrate the viability of our methods.

【Keywords】: Information Extraction; Quantities in Text; Web Tables

88. An Efficient Insertion Operator in Dynamic Ridesharing Services.

Paper Link】 【Pages】:1022-1033

【Authors】: Yi Xu ; Yongxin Tong ; Yexuan Shi ; Qian Tao ; Ke Xu ; Wei Li

【Abstract】: Dynamic ridesharing refers to services that arrange one-time shared rides on short notice. It underpins various real-world intelligent transportation applications such as car-pooling, food delivery and last-mile logistics. A core operation in dynamic ridesharing is the "insertion operator". Given a worker and a feasible route which contains a sequence of origin-destination pairs from previous requests, the insertion operator inserts a new origin-destination pair from a newly arrived request into the current route such that certain objective is optimized. Common optimization objectives include minimizing the maximum flow time of all requests and minimizing the total travel time of the worker. Despite its frequent usage, the insertion operator has a time complexity of O(n 3 ), where n is the number of all requests assigned to the worker. The cubic running time of insertion fundamentally limits the efficiency of urban-scale dynamic ridesharing based applications. In this paper, we propose a novel partition framework and a dynamic programming based insertion with a time complexity of O(n 2 ). We further improve the time efficiency of the insertion operator to O(n) harnessing efficient index structures, such as fenwick tree. Evaluations on two real-world large-scale datasets show that our methods can accelerate insertion by 1.5 to 998.1 times.

【Keywords】: Ridesharing; Insertion; Dynamic programming

89. Auction-Based Order Dispatch and Pricing in Ridesharing.

Paper Link】 【Pages】:1034-1045

【Authors】: Libin Zheng ; Peng Cheng ; Lei Chen

【Abstract】: Ridesharing plays a more and more important role in modern transport. In this paper, we propose solutions for the bonus-offering scenario of ridesharing platforms (service providers). When vehicles are in shortage, requesters are allowed to offer bonus so that their orders can get prioritized in the dispatch process. To enable self-motivated bonus bidding of requesters, we devise an auction mechanism, where requesters are supposed to submit their bids truthfully and the platform conducts order dispatch and pricing. Our goal is to maximize the overall utility of the auction, while ensuring desirable auction properties such as truthfulness and individual rationality. To realize that, we propose a greedy and a ranking approach for order dispatch and their corresponding pricing strategies. Extensive experiments on real data suggest that the ranking approach is both effective and efficient.

【Keywords】: ridesharing; car-hailing; auction; order dispatch

90. When Geo-Text Meets Security: Privacy-Preserving Boolean Spatial Keyword Queries.

Paper Link】 【Pages】:1046-1057

【Authors】: Ningning Cui ; Jianxin Li ; Xiaochun Yang ; Bin Wang ; Mark Reynolds ; Yong Xiang

【Abstract】: In recent years, spatial keyword query has attracted wide-spread research attention due to the popularity of the location-based services. To efficiently support the online spatial keyword query processing, the data owners need to outsource their data and the query processing service to cloud platforms. However, the outsourcing services may raise privacy leaking issues because the cloud server on the platforms may not be trusted for both data owners and query users. Therefore, in this work, we first propose and formalize the problem of privacy-preserving boolean spatial keyword query under the widely accepted Known Background Thread Model. And then, we devise a novel privacy-preserving spatial-textual Bloom Filter encoding structure and an encrypted R-tree index. They can maintain both spatial and text information together in a secure way while answering the encrypted spatial keyword queries without the need for data decryption. To further accelerate the query processing, a compressed encrypted index is provided to deal with the challenges of the large dimension expansion and the expensive space consumption in the encrypted R-tree index. In addition, we develop the corresponding algorithms based on the designed index, and present the in-depth security analysis to show our work's satisfaction meeting the strong secure scheme. Finally, we demonstrate the performance of our proposed index and algorithms by conducting extensive experiments on four datasets under various system settings.

【Keywords】: spatial keyword; Bloom filter; Location based

91. Moving Object Linking Based on Historical Trace.

Paper Link】 【Pages】:1058-1069

【Authors】: Fengmei Jin ; Wen Hua ; Jiajie Xu ; Xiaofang Zhou

【Abstract】: The prevalent adoption of GPS-enabled devices has witnessed an explosion of various location-based services which produce a huge amount of trajectories monitoring an individual's movement. This triggers an interesting question: is movement history sufficiently representative and distinctive to identify an individual? In this work, we study the problem of moving object linking based on their historical traces. However, it is non-trivial to extract effective patterns from moving history and meanwhile conduct object linking efficiently. To this end, we propose four representation strategies (sequential, temporal, spatial, and spatiotemporal) and two quantitative criteria (commonality and unicity) to construct the personalised signature from the historical trace. Moreover, we formalise the problem of moving object linking as a k-nearest neighbour (k-NN) search on the collection of signatures, and aim to improve efficiency considering the high dimensionality of signatures and the large cardinality of the candidate object set. A simple but effective dimension reduction strategy is introduced in this work, which empirically outperforms existing algorithms including PCA and LSH. We propose a novel indexing structure, Weighted R-tree (WR-tree), and two pruning methods to further speed up k-NN search by combining weight and spatial information contained in the signature. Our extensive experimental results on a real world dataset verify the superiority of our proposals, in terms of both accuracy and efficiency, over state-of-the-art approaches.

【Keywords】: moving object linking; historical trace; signature; dimension reduction; k-NN search; weighted r-tree

92. ImageProof: Enabling Authentication for Large-Scale Image Retrieval.

Paper Link】 【Pages】:1070-1081

【Authors】: Shangwei Guo ; Jianliang Xu ; Ce Zhang ; Cheng Xu ; Tao Xiang

【Abstract】: With the explosive growth of online images and the popularity of search engines, a great demand has arisen for small and medium-sized enterprises to build and outsource large-scale image retrieval systems to cloud platforms. While reducing storage and retrieval burdens, enterprises are at risk of facing untrusted cloud service providers. In this paper, we take the first step in studying the problem of query authentication for large-scale image retrieval. Due to the large size of image files, the main challenges are to (i) design efficient authenticated data structures (ADSs) and (ii) balance search, communication, and verification complexities. To address these challenges, we propose two novel ADSs, the Merkle randomized k-d tree and the Merkle inverted index with cuckoo filters, to ensure the integrity of query results in each step of image retrieval. For each ADS, we develop corresponding search and verification algorithms on the basis of a series of systemic design strategies. Furthermore, we put together the ADSs and algorithms to design the final authentication scheme for image retrieval, which we name ImageProof. We also propose several optimization techniques to improve the performance of the proposed ImageProof scheme. Security analysis and extensive experiments are performed to show the robustness and efficiency of ImageProof.

【Keywords】: Verifiable; Content based Image Retrieval; Bag of Visual Word; Query authentication

Paper Link】 【Pages】:1082-1093

【Authors】: Youhuan Li ; Lei Zou ; M. Tamer Özsu ; Dongyan Zhao

【Abstract】: The growing popularity of dynamic applications such as social networks provides a promising way to detect valuable information in real time. These applications create high-speed data that can be easily modeled as streaming graph. Efficient analysis over these data is of great significance. In this paper, we study the subgraph (isomorphism) search over streaming graph data that obeys timing order constraints over the occurrence of edges in the stream. We propose a solution to efficiently answer subgraph search, introduce optimizations to greatly reduce the space cost, and design concurrency management to improve system throughput. Extensive experiments on real network traffic data and synthetic social streaming data confirms the efficiency and effectiveness of our solution.

【Keywords】: Streaming Graphs; Subgraph; Timing Order

94. Utilizing Dynamic Properties of Sharing Bits and Registers to Estimate User Cardinalities Over Time.

Paper Link】 【Pages】:1094-1105

【Authors】: Pinghui Wang ; Peng Jia ; Xiangliang Zhang ; Jing Tao ; Xiaohong Guan ; Don Towsley

【Abstract】: Online monitoring user cardinalities (or degrees) in graph streams is fundamental for many applications. For example in a bipartite graph representing user-website visiting activities, user cardinalities (the number of distinct visited websites) are monitored to report network anomalies. These real-world graph streams may contain user-item duplicates and have a huge number of distinct user-item pairs, therefore, it is infeasible to exactly compute user cardinalities when memory and computation resources are limited. Existing methods are designed to approximately estimate user cardinalities, whose accuracy highly depends on parameters that are not easy to set. Moreover, these methods cannot provide anytime-available estimation, as the user cardinalities are computed at the end of the data stream. Realtime applications such as anomaly detection require that user cardinalities are estimated on the fly. To address these problems, we develop novel bit and register sharing algorithms, which use a bit array and a register array to build a compact sketch of all users' connected items respectively. Compared with previous bit and register sharing methods, our algorithms exploit the dynamic properties of the bit and register arrays (e.g., the fraction of zero bits in the bit array at each time) to significantly improve the estimation accuracy, and have low time complexity (O(1)) to update the estimations each time they observe a new useritem pair. In addition, our algorithms are simple and easy to use, without requirements to tune any parameter. We evaluate the performance of our methods on real-world datasets. The experimental results demonstrate that our methods are several times more accurate and faster than state-of-the-art methods using the same amount of memory.

【Keywords】: dynamic properties; sharing bits and registers; cardinality estimation; data stream

95. Tracking Influential Nodes in Time-Decaying Dynamic Interaction Networks.

Paper Link】 【Pages】:1106-1117

【Authors】: Junzhou Zhao ; Shuo Shang ; Pinghui Wang ; John C. S. Lui ; Xiangliang Zhang

【Abstract】: Identifying influential nodes that can jointly trigger the maximum influence spread in networks is a fundamental problem in many applications such as viral marketing, online advertising, and disease control. Most existing studies assume that social influence is static and they fail to capture the dynamics of influence in reality. In this work, we address the dynamic influence challenge by designing efficient streaming methods that can identify influential nodes from highly dynamic node interaction streams. We first propose a general time-decaying dynamic interaction network (TDN) model to model node interaction streams with the ability to smoothly discard outdated data. Based on the TDN model, we design three algorithms, i.e., SieveADN, BasicReduction and HistApprox. SieveADN identifies influential nodes from a special kind of TDNs with efficiency. BasicReduction uses SieveADN as a basic building block to identify influential nodes from general TDNs. HistApprox significantly improves the efficiency of BasicReduction. More importantly, we theoretically show that all three algorithms enjoy constant factor approximation guarantees. Experiments conducted on various real interaction datasets demonstrate that our approach finds near-optimal solutions with speed at least 5 to 15 times faster than baseline methods.

【Keywords】: influence maximization; sreaming algorithms; submodular optimization

96. Fast and Accurate Graph Stream Summarization.

Paper Link】 【Pages】:1118-1129

【Authors】: Xiangyang Gou ; Lei Zou ; Chenxingyu Zhao ; Tong Yang

【Abstract】: A graph stream is a continuous sequence of data items, in which each item indicates an edge, including its two endpoints and edge weight. It forms a dynamic graph that changes with every item. Graph streams play important roles in cyber security, social networks, cloud troubleshooting systems and more. Due to the vast volume and high update speed of graph streams, traditional data structures for graph storage such as the adjacency matrix and the adjacency list are no longer sufficient. However, prior art of graph stream summarization, like CM sketches, gSketches, TCM and gMatrix, either supports limited kinds of queries or suffers from poor accuracy of query results. In this paper, we propose a novel Graph Stream Sketch (GSS for short) to summarize the graph streams, which has linear space cost O(|E|) (E is the edge set of the graph) and constant update time cost (O(1)) and supports most kinds of queries over graph streams with the controllable errors. Both theoretical analysis and experiment results confirm the superiority of our solution with regard to the time/space complexity and query results' precision compared with the state-of-the-art.

【Keywords】: graph, data stream, sketch, approximate query

97. Mining Periodic Cliques in Temporal Networks.

Paper Link】 【Pages】:1130-1141

【Authors】: Hongchao Qin ; Rong-Hua Li ; Guoren Wang ; Lu Qin ; Yurong Cheng ; Ye Yuan

【Abstract】: Periodicity is a frequently happening phenomenon for social interactions in temporal networks. Mining periodic communities are essential to understanding periodic group behaviors in temporal networks. Unfortunately, most previous studies for community mining in temporal networks ignore the periodic patterns of communities. In this paper, we study a problem of seeking periodic communities in a temporal network, where each edge is associated with a set of timestamps. We propose a novel model, called maximal σ-periodic k-clique, that represents a periodic community in temporal networks. Specifically, a maximal σ-periodic k-clique is a clique with size larger than k that appears at least σ times periodically in the temporal graph. We show that the problem of enumerating all those periodic cliques is NP-hard. To compute all of them efficiently, we first develop two effective graph reduction techniques to significantly prune the temporal graph. Then, we present an efficient enumeration algorithm to enumerate all maximal σ-periodic k-cliques in the reduced graph. The results of extensive experiments on five real-life datasets demonstrate the efficiency, scalability, and effectiveness of our algorithms.

【Keywords】: Periodic Community; Periodic Clique; Social Network; Temporal Network; Graph Mining

98. Coloring Embedder: A Memory Efficient Data Structure for Answering Multi-set Query.

Paper Link】 【Pages】:1142-1153

【Authors】: Tong Yang ; Dongsheng Yang ; Jie Jiang ; Siang Gao ; Bin Cui ; Lei Shi ; Xiaoming Li

【Abstract】: Multi-set query is a fundamental issue in data science. When the sizes of multi-sets are large, exact matching methods like hash tables need too much memory, and they cannot achieve high query speed. Bloom filters are recently used to handle big data query, but they cannot achieve high accuracy when the memory space is tight. In this paper, we propose a new data structure named coloring embedder, which is fast, accurate as well as memory efficient. The insight is to first map elements to a high dimensional space to almost eliminate hashing collisions, and then use a dimensional reduction representation, which is similar to coloring a graph, to save memory. Theoretical proofs and experimental results show that compared to the state-of-the[1]art, the error rate of the coloring embedder is thousands of times smaller even with much less memory usage, and the query speed of the coloring embedder is about 2 times faster. The source code of coloring embedder is released on Github.

【Keywords】: data structure; query processing; multi set query; coloring embedder; bloom filter

99. Mining Order-Preserving Submatrices Under Data Uncertainty: A Possible-World Approach.

Paper Link】 【Pages】:1154-1165

【Authors】: Ji Cheng ; Da Yan ; Xiaotian Hao ; Wilfred Ng

【Abstract】: Given a data matrix D, a submatrix S of D is an order-preserving submatrix (OPSM) if there is a permutation of the columns of S, under which the entry values of each row in S are strictly increasing. OPSM mining is widely used in real-life applications such as identifying coexpressed genes, and finding customers with similar preference. However, noise is ubiquitous in real data matrices due to variable experimental conditions and measurement errors, which makes conventional OPSM mining algorithms inapplicable. No previous work has ever combated uncertain value intervals using the possible world semantics. We establish two different definitions of significant OPSMs based on the possible world semantics: (1) expected support based and (2) probabilistic frequentness based. An optimized dynamic programming approach is proposed to compute the probability that a row supports a particular column permutation, and several effective pruning rules are introduced to efficiently prune insignificant OPSMs. These techniques are integrated into our two OPSM mining algorithms, based on prefix-projection and Apriori respectively. Extensive experiments on real microarray data demonstrate that the OPSMs found by our algorithms have a much higher quality than those found by existing approaches.

【Keywords】: Order-preserving submatrices; OPSM; possible world semantics; expected support; probabilistic frequentness; data mining

100. Multi-Dimensional Genomic Data Management for Region-Preserving Operations.

Paper Link】 【Pages】:1166-1177

【Authors】: Olha Horlova ; Abdulrahman Kaitoua ; Volker Markl ; Stefano Ceri

【Abstract】: In previous work, we presented GenoMetric Query Language (GMQL), an algebraic language for querying genomic datasets, supported by Genomic Data Management System (GDMS), an open-source big data engine implemented on top of Apache Spark. GMQL datasets are represented as genomic regions (i.e. intervals of the genome, included within a start and stop position) with an associated value, representing the signal associated to that region (the most typical signals represent gene expressions, peaks of expressions, and variants relative to a reference genome.) GMQL can process queries over billions of regions, organized within distinct datasets. In this paper, we focus on the efficient execution of region-preserving GMQL operations, in which the regions of the result are a subset of the regions of one of the operands; most GMQL operations are region-preserving. Chains of region-preserving operations can be efficiently executed by taking advantage of an array-based data organization, where region management can be separated from value management. We discuss this optimization in the context of the current GDMS system which has a row-based (relational) organization, and therefore requires dynamic data transformations. A similar approach applies to other application domains with interval-based data organization.

【Keywords】: Big data processing; data management; cloud computing; genomic computing

Paper Link】 【Pages】:1178-1189

【Authors】: Rong-Hua Li ; Qiangqiang Dai ; Guoren Wang ; Zhong Ming ; Lu Qin ; Jeffrey Xu Yu

【Abstract】: Enumerating maximal cliques from an uncertain graph is a fundamental problem in uncertain graph analysis. Given an uncertain graph G, a set of nodes C in G is a maximal (k, τ)-clique if (1) |C|>k and C is a clique with probability at least τ, and (2) C is a maximal node set meeting (1). The state-of-the-art algorithm for enumerating all maximal (k, τ)-cliques is very costly when handling large uncertain graphs, as its time complexity is proportional to 2^n where n is the number of nodes in the uncertain graph. To overcome this issue, we propose two new core-based pruning algorithms to reduce the uncertain graph size without missing any maximal (k, τ)-clique. We also develop a novel cut-based optimization technique to further improve the pruning performance of the core-based pruning algorithms. Based on these pruning techniques, we propose an improved algorithm to enumerate all maximal (k, τ)-cliques, and a new algorithm with several novel upper-bounding techniques to compute one of maximum (k, τ)-cliques from the pruned uncertain graph. The results of extensive experiments on six real-world datasets demonstrate the efficiency and effectiveness of the proposed algorithms.

【Keywords】: Uncertain Graph; Maximal Clique; Uncertain k Core; Dynamic Programming Algorithm

102. Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment.

Paper Link】 【Pages】:1190-1201

【Authors】: Raul Castro Fernandez ; Jisoo Min ; Demitri Nava ; Samuel Madden

【Abstract】: Data analysts often need to find datasets that are similar (i.e., have high overlap) or that are subsets of one another (i.e., one contains the other). Exactly computing such relationships is expensive because it entails an all-pairs comparison between all values in all datasets, an O(n 2 ) operation. Fortunately, it is possible to obtain approximate solutions much faster, using locality sensitive hashing (LSH). Unfortunately, LSH does not lend itself naturally to compute containment, and only returns results with a similarity beyond a pre-defined threshold; we want to know the specific similarity and containment score. The main contribution of this paper is LAZO, a method to simultaneously estimate both the similarity and containment of datasets, based on a redefinition of Jaccard similarity which takes into account the cardinality of each set. In addition, we show how to use the method to improve the quality of the original JS and JC estimates. Last, we implement LAZO as a new indexing structure that has these additional properties: i) it returns numerical scores to indicate the degree of similarity and containment between each candidate and the query-instead of only returning the candidate set; ii) it permits to query for a specific threshold on-the-fly, as opposed to LSH indexes that need to be configured with a pre-defined threshold a priori; iii) it works in a data-oblivious way, so it can be incrementally maintained. We evaluate LAZO on real-world datasets and show its ability to estimate containment and similarity better and faster than existing methods.

【Keywords】: containment; data discovery; estimate inclusion dependencies; aurum

103. TARDIS: Distributed Indexing Framework for Big Time Series Data.

Paper Link】 【Pages】:1202-1213

【Authors】: Liang Zhang ; Noura Alghamdi ; Mohamed Y. Eltabakh ; Elke A. Rundensteiner

【Abstract】: The massive amounts of time series data continuously generated and collected by applications warrant the need for large scale distributed time series processing systems. Indexing plays a critical role in speeding up time series similarity queries on which various analytics and applications rely. However, the state-of-the-art indexing techniques, which are iSAX-based structures, do not scale well due to the small adopted fan-out (binary) that leads to a highly deep index tree, and the expensive search cost through many internal nodes. More seriously, the iSAX character-level cardinality adopted by these indices suffers from a poor maintenance of the proximity relationships among the time series objects, which leads to severe accuracy degradation for approximate similarity queries. In this paper, we propose the TARDIS distributed indexing framework to overcome the aforementioned limitations. TARDIS introduces a novel iSAX index tree that is based on a new word-level variable cardinality. The proposed index ensures compact structure, efficient search and comparison, and good preservation of the similarity relationships. TARDIS is suitable for indexing and querying billion-scale time series datasets. TARDIS is composed of one centralized global index and local distributed indices-one per each data partition across the cluster. TARDIS uses both the global and local indices to efficiently support exact match and kNN approximate queries. The system is implemented using Apache Spark, and extensive experiments are conducted on benchmark and real-world datasets. Evaluation results demonstrate that for over one billion time series dataset (TB scale), the construction of a clustered index is about 83% faster than the existing techniques. Moreover, the average response time of exact match queries is decreased by 50%, and the accuracy of the kNN approximate queries has increased more than 10 fold (from 3% to 40%) compared to the existing techniques.

【Keywords】: Time Series, Whole Similarity Matching, Indexing, Exact Matching Query, kNN Approximate Query, iSAX-T, sigTree, Word-level Cardinality, TARDIS

104. Mostly Order Preserving Dictionaries.

Paper Link】 【Pages】:1214-1225

【Authors】: Chunwei Liu ; McKade Umbenhower ; Hao Jiang ; Pranav Subramaniam ; Jihong Ma ; Aaron J. Elmore

【Abstract】: Dictionary encoding, or domain encoding, is an important form of compression that uses a bijective mapping to replace attributes from a large domain (i.e. strings) with a finite domain (i.e. 32 bit integers). This encoding both reduces data storage and allows for more efficient query execution. Traditional dictionary encoding only supports efficient equality queries, while range queries require that encoded values are decoded for evaluating the predicates. An order preserving dictionary allows for range queries without decoding by ensuring that encoded keys follow the same order as the values in the dictionary. While this approach enables efficient queries it requires that the full set of values is known to create the mappings. In this work we bridge this gap by introducing mostly ordered dictionaries that use a best effort dictionary generation based on sampling the input dataset. Query evaluation on a mostly ordered dictionary avoids decoding when possible and gracefully degrades performance as the ratio of ordered values decreases.

【Keywords】: Dictionary; Encoding; Compression; Order preserving; Query

105. Multi-copy Cuckoo Hashing.

Paper Link】 【Pages】:1226-1237

【Authors】: Dagang Li ; Rong Du ; Ziheng Liu ; Tong Yang ; Bin Cui

【Abstract】: Cuckoo hashing is widely used for its worst-case constant lookup performance even at very high load. However at high load, collision resolution will involve lots of probes on item relocation and may still fail in the end. To address the problem, we propose an efficient Cuckoo hashing scheme called Multi-copy Cuckoo or McCuckoo. Different from the blind kick-outs of standard Cuckoo hashing during a collision, we can foresee which way to successfully kick items by using multiple copies. Furthermore, with the knowledge of how many copies each item has in the table, we can identify impossible buckets and skip them during a lookup. In order to avoid expensive rehashing during insertion failures, McCuckoo also supports more efficient stash strategy that minimizes stash checking. McCuckoo uses simple logic and simple data structure, so it is suitable for both software and hardware implementation on platforms where intensive access to the slow and bandwidth limited off-chip external memory is the main bottleneck.

【Keywords】: Cuckoo Hashing; Hashing algorithm; Multi-copy

106. Efficient Scalable Multi-attribute Index Selection Using Recursive Strategies.

Paper Link】 【Pages】:1238-1249

【Authors】: Rainer Schlosser ; Jan Kossmann ; Martin Boissier

【Abstract】: An efficient selection of indexes is indispensable for database performance. For large problem instances with hundreds of tables, existing approaches are not suitable: They either exhibit prohibitive runtimes or yield far from optimal index configurations by strongly limiting the set of index candidates or not handling index interaction explicitly. We introduce a novel recursive strategy that does not exclude index candidates in advance and effectively accounts for index interaction. Using large real-world workloads, we demonstrate the applicability of our approach. Further, we evaluate our solution end to end with a commercial database system using a reproducible setup. We show that our solutions are near-optimal for small index selection problems. For larger problems, our strategy outperforms state-of-the-art approaches in both scalability and solution quality.

【Keywords】: Index Selection

107. To Index or Not to Index: Optimizing Exact Maximum Inner Product Search.

Paper Link】 【Pages】:1250-1261

【Authors】: Firas Abuzaid ; Geet Sethi ; Peter Bailis ; Matei Zaharia

【Abstract】: Exact Maximum Inner Product Search (MIPS) is an important task that is widely pertinent to recommender systems and high-dimensional similarity search. The brute-force approach to solving exact MIPS is computationally expensive, thus spurring recent development of novel indexes and pruning techniques for this task. In this paper, we show that a hardware-efficient brute-force approach, blocked matrix multiply (BMM), can outperform the state-of-the-art MIPS solvers by over an order of magnitude, for some-but not all-inputs. In this paper we also present a novel MIPS solution, MAX-IMUS, that takes advantage of hardware efficiency and pruning of the search space. Like BMM, MAXIMUS is faster than other solvers by up to an order of magnitude, but again only for some inputs. Since no single solution offers the best runtime performance for all inputs, we introduce a new data-dependent optimizer, OPTIMUS, that selects online with minimal overhead the best MIPS solver for a given input. Together, OPTIMUS and MAXIMUS outperform state-of-the-art MIPS solvers by 3.2× on average, and up to 10.9×, on widely studied MIPS datasets.

【Keywords】: Maximum Inner Product Search; Matrix Factorization; Model Serving; Inference; Recommendations; Top K; BLAS; Hardware Efficiency; Index; Machine Learning

Paper Link】 【Pages】:1262-1273

【Authors】: Haitao Yuan ; Guoliang Li

【Abstract】: Many applications, e.g., Uber, collect large-scale trajectory data from moving vehicles on road network. Trajectory data analytics can benefit many real-world applications, such as route planning and transportation optimizations. Two core operations in trajectory data analytics are trajectory similarity search and join, and both of them rely on a trajectory similarity function to measure the similarity between two trajectories. However, existing similarity functions focus on trajectory points distance and neglect the fact the trajectories should be on road network. Obviously aligning trajectories on road network can remove the noise points introduced by system errors. Toward this goal, we define a road-network-aware trajectory similarity function to measure trajectory similarity. To support trajectory similarity search and join, we propose a filtering-refine framework. In the filtering step, we compute a signature of each trajectory such that if two trajectories are similar, they must share a common signature. We utilize the signatures to prune a huge number of dissimilar pairs. In the refine step, we design effective algorithms to verify the candidates that are not pruned in the filtering step. To support large-scale trajectories, we develop a system DISON for Distributed In-Memory Trajectory Similarity Search and Join on Road Network. DISON splits trajectories into disjoint partitions by considering load balance and locality, and designs effective global index to prune irrelevant partitions. Extensive experiments on real datasets showed that our method achieved high effectiveness, efficiency, and scalability and outperformed existing solutions significantly.

【Keywords】: spatial; trajectory; similarity search; similarity join; road network

109. Stochastic Weight Completion for Road Networks Using Graph Convolutional Networks.

Paper Link】 【Pages】:1274-1285

【Authors】: Jilin Hu ; Chenjuan Guo ; Bin Yang ; Christian S. Jensen

【Abstract】: Innovations in transportation, such as mobility-on-demand services and autonomous driving, call for high-resolution routing that relies on an accurate representation of travel time throughout the underlying road network. Specifically, the travel time of a road-network edge is modeled as a time-varying distribution that captures the variability of traffic over time and the fact that different drivers may traverse the same edge at the same time at different speeds. Such stochastic weights may be extracted from data sources such as GPS and loop detector data. However, even very large data sources are incapable of covering all edges of a road network at all times. Yet, high-resolution routing needs stochastic weights for all edges. We solve the problem of filling in the missing weights. To achieve that, we provide techniques capable of estimating stochastic edge weights for all edges from traffic data that covers only a fraction of all edges. We propose a generic learning framework called Graph Convolutional Weight Completion (GCWC) that exploits the topology of a road network graph and the correlations of weights among adjacent edges to estimate stochastic weights for all edges. Next, we incorporate contextual information into GCWC to further improve accuracy. Empirical studies using loop detector data from a highway toll gate network and GPS data from a large city offer insight into the design properties of GCWC and its effectiveness.

【Keywords】: uncertain graph; travel cost estimation; graph convolutional neural network; travel time prediction; graph completion

110. Identifying the Most Interactive Object in Spatial Databases.

Paper Link】 【Pages】:1286-1297

【Authors】: Daichi Amagata ; Takahiro Hara

【Abstract】: This paper investigates a new query, called an MIO query, that retrieves the Most Interactive Object in a given spatial dataset. Consider that an object consists of many spatial points. Given a distance threshold, we say that two objects interact with each other if they have a pair of points whose distance is within the threshold. An MIO query outputs the object that interacts with other objects the most, and it is useful for analytical applications e.g., neuroscience and trajectory databases. The MIO query processing problem is challenging: a nested loop algorithm is computationally inefficient and a theoretical algorithm is computationally efficient but incurs a quadratic space cost. Our solution efficiently processes MIO queries with a novel index, BIGrid (a hybrid index of compressed Bitset, Inverted list, and spatial Grid structures), with a practical memory cost. Furthermore, our solution is designed so that previous query results and multi-core environments can be exploited to accelerate query processing efficiency. Our experiments on synthetic and real datasets demonstrate the efficiency of our solution.

【Keywords】: MIO query; Spatial databases

111. An Efficient Framework for Correctness-Aware kNN Queries on Road Networks.

Paper Link】 【Pages】:1298-1309

【Authors】: Dan He ; Sibo Wang ; Xiaofang Zhou ; Reynold Cheng

【Abstract】: Given a set O of objects and a query point q on a road network, the k Nearest Neighbor (kNN) query returns the k nearest objects in O with the shortest road network distance to q. These kNN queries find many applications in location-based services, e.g., ride-hailing services, where each taxi is regarded as an object. In such applications, objects are constantly moving such that even for the same query point, the correct answer of a kNN query may vary with time. Ideally, the returned answer should be adequately correct with respect to the moving object set. However, in literature, all existing solutions for kNN queries mainly focus on reducing the query time, indexing storage, or throughput of the kNN queries with little focus on their correctness. Motivated by this, we propose a framework on correctness-aware kNN queries which aim to optimize system throughput while guaranteeing query correctness on moving objects. We formally define the serializable-kNN query that ensures the correctness of the query answer when considering moving objects and dependencies of different queries. We propose several techniques to optimize the throughput of serializable-kNN queries: firstly, we propose efficient index structures and new query algorithms that significantly improve the throughput; we further present novel scheduling algorithms that aim to avoid conflicts and improve the system throughput. Moreover, we devise approximate solutions that provide a controllable trade-off between the correctness of kNN queries and system throughput. Extensive experiments on real-world data demonstrate the effectiveness and efficiency of our proposed solutions over alternatives.

【Keywords】: kNN query; serializable kNN; moving object index; conflict-aware scheduling

Paper Link】 【Pages】:1310-1321

【Authors】: Siqiang Luo ; Ben Kao ; Xiaowei Wu ; Reynold Cheng

【Abstract】: We study the problem of executing road-network k-nearest-neighbor (kNN) search on multi-core machines. State-of-the-art kNN algorithms on road networks often involve elaborate index structures and complex computational logic. Moreover, most kNN algorithms are inherently sequential. These make the traditional approach of parallel programming very costly, laborious, and ineffective when they are applied to kNN algorithms. We propose the MPR (Multi-layer Partitioning-Replication) mechanism that orchestrates CPU cores and schedules kNN query and index update processes to run on the cores. The MPR mechanism performs workload analysis to determine the best arrangement of the cores with the objective of optimizing quality-of-service (QoS) measures, such as system throughput and query response time. We demonstrate the effectiveness of MPR by applying it to a number of state-of-the-art kNN indexing methods running on a multi-core machine. Our experiments show that multi-processing using our MPR approach requires minimal programming effort. It also leads to significant improvements in query response time and system throughput compared with other baseline parallelization methods.

【Keywords】: kNN search; road network; adaptive approach

113. AppUsage2Vec: Modeling Smartphone App Usage for Prediction.

Paper Link】 【Pages】:1322-1333

【Authors】: Sha Zhao ; Zhiling Luo ; Ziwen Jiang ; Haiyan Wang ; Feng Xu ; Shijian Li ; Jianwei Yin ; Gang Pan

【Abstract】: App usage prediction, i.e. which apps will be used next, is very useful for smartphone system optimization, such as operating system resource management, battery energy consumption optimization, and user experience improvement as well. However, it is still challenging to achieve usage prediction of high accuracy. In this paper, we propose a novel framework for app usage prediction, called AppUsage2Vec, inspired by Doc2Vec. It models app usage records by considering the contribution of different apps, user personalized characteristics, and temporal context. We measure the contribution of each app to the target app by introducing an app-attention mechanism. The user personalized characteristics in app usage are learned by a module of dual-DNN. Furthermore, we encode the top-k supervised information in loss function for training the model to predict the app most likely to be used next. The AppUsage2Vec was evaluated on a dataset of 10,360 users and 46,434,380 records in three months. The results demonstrate the state-of-the-art performance.

【Keywords】: Smartphone applications; app usage prediction; user modeling

114. iFair: Learning Individually Fair Data Representations for Algorithmic Decision Making.

Paper Link】 【Pages】:1334-1345

【Authors】: Preethi Lahoti ; Krishna P. Gummadi ; Gerhard Weikum

【Abstract】: People are rated and ranked, towards algorithmic decision making in an increasing number of applications, typically based on machine learning. Research on how to incorporate fairness into such tasks has prevalently pursued the paradigm of group fairness: giving adequate success rates to specifically protected groups. In contrast, the alternative paradigm of individual fairness has received relatively little attention, and this paper advances this less explored direction. The paper introduces a method for probabilistically mapping user records into a low-rank representation that reconciles individual fairness and the utility of classifiers and rankings in downstream applications. Our notion of individual fairness requires that users who are similar in all task-relevant attributes such as job qualification, and disregarding all potentially discriminating attributes such as gender, should have similar outcomes. We demonstrate the versatility of our method by applying it to classification and learning-to-rank tasks on a variety of real-world datasets. Our experiments show substantial improvements over the best prior work for this setting.

【Keywords】: Algorithmic Fairness; Learning Fair Representations; Individual Fairness

115. DBSCAN-MS: Distributed Density-Based Clustering in Metric Spaces.

Paper Link】 【Pages】:1346-1357

【Authors】: Keyu Yang ; Yunjun Gao ; Rui Ma ; Lu Chen ; Sai Wu ; Gang Chen

【Abstract】: DBSCAN is one of important density-based clustering methods, which has a wide range of applications in machine learning and data mining, to name but a few. However, the rapid growing volume and variety of data nowadays challenges traditional DBSCAN, and thus, distributed DBSCAN in metric spaces is required. In this paper, we propose DBSCAN-MS, a distributed density-based clustering in metric spaces. To ensure load balancing, we present a k-d tree based partitioning approach. It utilizes pivots to map the data in metric spaces to vector spaces, and employs k-d tree partitioning technique to equally divide the data. To avoid unnecessary computation and communication cost, we propose a framework that divides data into partitions, find out local DBSCAN result, and merge local result based on a merging graph. In addition, the pivot filtering and the sliding window techniques are also used in the framework for pruning. Extensive experiments with both real and synthetic datasets demonstrate the efficiency and scalability of our proposed DBSCAN-MS.

【Keywords】: DBSCAN, metric space, spark, algorithm

116. Computing Trajectory Similarity in Linear Time: A Generic Seed-Guided Neural Metric Learning Approach.

Paper Link】 【Pages】:1358-1369

【Authors】: Di Yao ; Gao Cong ; Chao Zhang ; Jingping Bi

【Abstract】: Trajectory similarity computation is a fundamental problem for various applications in trajectory data analysis. However, the high computation cost of existing trajectory similarity measures has become the key bottleneck for trajectory analysis at scale. While there have been many research efforts for reducing the complexity, they are specific to one similarity measure and often yield limited speedups. We propose NeuTraj to accelerate trajectory similarity computation. NeuTraj is generic to accommodate any existing trajectory measure and fast to compute the similarity of a given trajectory pair in linear time. Furthermore, NeuTraj is elastic to collaborate with all spatial-based trajectory indexing methods to reduce the search space. NeuTraj samples a number of seed trajectories from the given database, and then uses their pair-wise similarities as guidance to approximate the similarity function with a neural metric learning framework. NeuTraj features two novel modules to achieve accurate approximation of the similarity function: (1) a spatial attention memory module that augments existing recurrent neural networks for trajectory encoding; and (2) a distance-weighted ranking loss that effectively transcribes information from the seed-based guidance. With these two modules, NeuTraj can yield high accuracies and fast convergence rates even if the training data is small. Our experiments on two real-life datasets show that NeuTraj achieves over 80% accuracy on Fre chet, Hausdorff, ERP and DTW measures, which outperforms state-of-the-art baselines consistently and significantly. It obtains 50x-1000x speedup over bruteforce methods and 3x-500x speedup over existing approximate algorithms, while yielding more accurate approximations of the similarity functions.

【Keywords】: deep metric learning; trajectory similarity; linear time

117. Bursty Event Detection Throughout Histories.

Paper Link】 【Pages】:1370-1381

【Authors】: Debjyoti Paul ; Yanqing Peng ; Feifei Li

【Abstract】: The widespread use of social media and the active trend of moving towards more web-and mobile-based reporting for traditional media outlets have created an avalanche of information streams. These information streams bring in first-hand reporting on live events to massive crowds in real time as they are happening. It is important to study the phenomenon of burst in this context so that end-users can quickly identify important events that are emerging and developing in their early stages. In this paper, we investigate the problem of bursty event detection where we define burst as the acceleration over the incoming rate of an event mentioning. Existing works focus on the detection of current trending events, but it is important to be able to go back in time and explore bursty events throughout the history, while without the needs of storing and traversing the entire information stream from the past. We present a succinct probabilistic data structure and its associated query strategy to find bursty events at any time instance for the entire history. Extensive empirical results on real event streams have demonstrated the effectiveness of our approach.

【Keywords】: data streams; bursty event detection; data sketch; synopsis; data summaries

118. RUM: Network Representation Learning Using Motifs.

Paper Link】 【Pages】:1382-1393

【Authors】: Yanlei Yu ; Zhiwu Lu ; Jiajun Liu ; Guoping Zhao ; Ji-Rong Wen

【Abstract】: We bring the novel idea of exploiting motifs into network embedding, in a dual-level network representation learning model called RUM (network Representation learning Using Motifs). Towards the leveraging of graph motifs that constitute higher-order organizations in a network, we propose two strategies, namely MotifWalk and MotifRe-weighting for learning motif-aware network embeddings. Motif-based and node-based representations are simultaneously generated, so that both the high-order structures and each node's individual properties are preserved in the final embeddings. We demonstrate that RUM has strong and well-balanced capability of preserving lowerorder proximities while discovering and capturing higher-order network structures. In empirical evaluation, RUM is tested on multiple public datasets, that range from small to medium citation networks to a large social network with more than a million nodes. Results show that the use of motifs in the representation learning process brings substantial benefits in reallife tasks, resulting in up to 12% microF1 and 8% macroF1 relative gains for node classification performance over the bestperforming competing methods.

【Keywords】: Network Representation Learning; Motif Walk; MotifRe weighting

119. Finding Significant Items in Data Streams.

Paper Link】 【Pages】:1394-1405

【Authors】: Tong Yang ; Haowei Zhang ; Dongsheng Yang ; Yucheng Huang ; Xiaoming Li

【Abstract】: Finding top-k frequent items has been a hot issue in databases. Finding top-k persistent items is a new issue, and has attracted increasing attention in recent years. In practice, users often want to know which items are significant, i.e., not only frequent but also persistent. No prior art can address both of the above two issues at the same time. Also, for high-speed data streams, they cannot achieve high accuracy when the memory is tight. In this paper, we define a new issue, named finding top-k significant items, and propose a novel algorithm namely LTC to address this issue. It includes two key techniques: Long-tail Replacement and a modified CLOCK algorithm. We theoretically prove there is no overestimation error and derive the correct rate and error bound. We conduct extensive experiments on three real datasets. Our experimental results show that LTC achieves 300~10^8 and in average 10^5 times higher accuracy than other related algorithms.

【Keywords】: data streams; significant items; long tail distribution; clock

120. Knowledge-Aware Deep Dual Networks for Text-Based Mortality Prediction.

Paper Link】 【Pages】:1406-1417

【Authors】: Ning Liu ; Pan Lu ; Wei Zhang ; Jianyong Wang

【Abstract】: Mortality prediction is one of the essential tasks in medical data mining and is significant for inferring clinical outcomes. With a large number of medical notes collected from hospitals, there is an urgent need for developing effective models for predicting mortality based on them. In contrast to structured electronic health records, medical notes are unstructured texts written by experienced caregivers and contain more complicated information about patients, posing more challenges for modeling. Most previous studies rely on tedious hand-crafted features or generating indirect features based on some statistical models such as topic modeling, which might incur information loss for later model training. Recently, some deep models have been proposed to unify the stages of feature construction and model training. However, domain concept knowledge has been neglected, which is important to gain a better understanding of medical notes. To address the above issues, we propose novel Knowledge-aware Deep Dual Networks (K-DDN) for the text-based mortality prediction task. Specifically, a simple deep dual network is first proposed to fuse the representations of medical knowledge and raw text for prediction. Afterward, we incorporate a co-attention mechanism into the basic model, guiding the knowledge and text representation learning with the help of each other. Experimental results on two publicly real-world datasets show the proposed deep dual networks outperform state-of-the-art methods and the co-attention mechanism can further improve the performance.

【Keywords】: mortality prediction; deep learning; medical concept

121. Robust High Dimensional Stream Classification with Novel Class Detection.

Paper Link】 【Pages】:1418-1429

【Authors】: Zhuoyi Wang ; Zelun Kong ; Swarup Chandra ; Hemeng Tao ; Latifur Khan

【Abstract】: A primary challenge in label prediction over a data stream is the emergence of instances belonging to unknown or novel class over time. Traditionally, studies addressing this problem aim to detect such instances using cluster-based mechanisms. They typically assume that instances from the same class are closer to each other than those belonging to different classes in observed feature space. Unfortunately, this may not hold true in higher-dimensional feature space such as images. In recent years, Convolutional neural network (CNN) have emerged as a leading system to be employed in many real-world application. Yet, based on the assumption of closed world dataset with a fixed number of categories, CNN lacks robustness for novel class detection, so it is unclear on how such models can be used to deal with novel class instances along a high-dimensional image stream. In this paper, we focus on addressing this challenge by proposing an effective learning framework called CNN-based Prototype Ensemble (CPE) for novel class detection and correction. Our framework includes a prototype ensemble loss (PE) to improve the intra-class compactness and expand inter-class separateness in the output feature representation, thereby enabling the robustness of novel class detection. Moreover, we provide an incremental learning strategy which maintains a constant amount of exemplars to update the network, making it more practical for real-world application. We empirically demonstrate the effectiveness of our framework by comparing its performance over multiple realworld image benchmark data streams with existing state-of-theart data stream detection techniques. The implementation of CPE is on: https://github.com/Vitvicky/Convolutional-Net-PrototypeEnsemble

【Keywords】: Novel Class Detection; Stream Classification; Covolutional Neural Network; Prototype Ensemble; Incremental Learning

122. Towards the Completion of a Domain-Specific Knowledge Base with Emerging Query Terms.

Paper Link】 【Pages】:1430-1441

【Authors】: Sihang Jiang ; Jiaqing Liang ; Yanghua Xiao ; Haihong Tang ; Hai-Kuan Huang ; Jun Tan

【Abstract】: Domain-specific knowledge bases play an increasingly important role in a variety of real applications. In this paper, we use the product knowledge base in the largest Chinese e-commerce platform, Taobao, as an example to investigate a completion procedure of a domain-specific knowledge base. We argue that the domain-specific knowledge bases tend to be incomplete, and are oblivious to their incompleteness, without a continuous completion procedure in place. The key component of this completion procedure is the classification of emerging query terms into corresponding properties of categories in existing taxonomy. Our proposal is that we use query logs to complete the product knowledge base of Taobao. However, the query driven completion usually faces many challenges including distinguishing the fine-grained semantic of unrecognized terms, handling the sparse data and so on. We propose a graph based solution to overcome these challenges. We first construct a lot of positive evidence to establish the semantical similarity between terms, and then run a shortest path or alternatively a random walk on the similarity graph under a set of constraints derived from a set of negative evidence to find the best candidate property for emerging query terms. We finally conduct extensive experiments on real data of Taobao and a subset of CN-DBpedia. The results show that our solution classifies emerging query terms with a good performance. Our solution is already deployed in Taobao, helping it find nearly 7 million new values for properties. The complete product knowledge base significantly improves the ratio of recognized queries and recognized terms by more than 25% and 32%, respectively.

【Keywords】: Product Knowledge Base; E-commerce; Knowledge Base Completion

123. Cooperation-Aware Task Assignment in Spatial Crowdsourcing.

Paper Link】 【Pages】:1442-1453

【Authors】: Peng Cheng ; Lei Chen ; Jieping Ye

【Abstract】: With the popularity of smart devices and the development of high-speed wireless networks, the spatial crowdsourcing has attracted much attention from both academia and industry (e.g., Uber and TaskRabbit). Specifically, a spatial crowdsourcing platform assigns workers to location-based tasks according to their current positions, then the workers need to physically move to the specified locations to conduct the assigned tasks. In this paper, we consider an important spatial crowdsourcing problem, namely cooperation-aware spatial crowdsourcing (CA-SC), where spatial tasks (e.g., collecting the Wi-Fi signal strength in one building) are time-constrained and require more than one worker to complete thus the cooperation among assigned workers is essential to the result. Our CA-SC problem is to assign workers to spatial tasks such that the overall cooperation quality is maximized. We prove that the CA-SC problem is NP-hard by reducing from the k-set packing problem, thus intractable. To tackle the CA-SC problem, we propose task-priority greedy (TPG) approach and game theoretic (GT) approach with two optimization methods to quickly solve the CA-SC problem and achieve high total cooperation quality scores. Through extensive experiments, we demonstrate the efficiency and effectiveness of our proposed approaches over both real and synthetic datasets.

【Keywords】: Spatial Crowdsourcing; Game Theoretic Algorithm; Task Assignment

124. Minimizing Maximum Delay of Task Assignment in Spatial Crowdsourcing.

Paper Link】 【Pages】:1454-1465

【Authors】: Zhao Chen ; Peng Cheng ; Yuxiang Zeng ; Lei Chen

【Abstract】: Spatial crowdsourcing services, such as Uber and Grabhub, become popular recently. Task assignment plays an important role in offering high-quality services. However, most of the existing solutions for task assignment only focus on the entire performance of the platform and do not optimize the maximum assignment delay. As a result, they cannot handle some real world scenarios which require minimizing the maximum delay in task assignment. In this paper, we study the minimizing maximum delay spatial crowdsourcing (MMD-SC) problem and propose solutions aiming at achieving a worst case controlled task assignment. The MMD-SC problem assumes that both workers and requesters come dynamically and considers not only the workers' travel costs but also the buffering time of tasks, thus it is very challenging due to two-sided online setting. To address these challenges, in this work, we propose a space embedding based online random algorithm with a competitive ratio of O(log n) and two efficient heuristic algorithms, namely the threshold based greedy approach and the batch-based approach. In addition, we demonstrate the effectiveness and efficiency of our methods via extensive experiments on both synthetic and real datasets.

【Keywords】: spatial crowdsourcing; online matching; task assignment

125. Physical Representation-Based Predicate Optimization for a Visual Analytics Database.

Paper Link】 【Pages】:1466-1477

【Authors】: Michael R. Anderson ; Michael J. Cafarella ; German Ros ; Thomas F. Wenisch

【Abstract】: Querying the content of images and video requires expensive content extraction methods. Modern extraction techniques are based on deep convolutional neural networks (CNNs) and can classify objects within images with astounding accuracy. Unfortunately, these methods are slow: processing a single image can take about 10 milliseconds on modern GPU-based hardware. As massive video libraries become ubiquitous, running a content-based query over millions of video frames is prohibitive. One promising approach to reduce the runtime cost of queries of visual content is to use a hierarchical model, such as a cascade, where simple cases are handled by an inexpensive classifier. Prior work has sought to design cascades that optimize the computational cost of inference by, for example, using smaller CNNs. However, we observe that there are critical factors besides the inference time that dramatically impact the overall query time. Notably, by treating the physical representation of the input image as part of our query optimization-that is, by including image transformations such as resolution scaling or color-depth reduction within the cascade-we can optimize data handling costs and enable drastically more efficient classifier cascades. In this paper, we propose TAHOMA, which generates and evaluates many potential classifier cascades that jointly optimize the CNN architecture and input data representation. Our experiments on a subset of ImageNet show that TAHOMA's input transformations speed up cascades by up to 35 times. We also find up to a 98x speedup over the ResNet50 classifier with no loss in accuracy and a 280x speedup if some accuracy is sacrificed.

【Keywords】: image classification; deep learning; classifier cascades; visual analytics

126. Adaptive Dynamic Bipartite Graph Matching: A Reinforcement Learning Approach.

Paper Link】 【Pages】:1478-1489

【Authors】: Yansheng Wang ; Yongxin Tong ; Cheng Long ; Pan Xu ; Ke Xu ; Weifeng Lv

【Abstract】: Online bipartite graph matching is attracting growing research attention due to the development of dynamic task assignment in sharing economy applications, where tasks need be assigned dynamically to workers. Past studies lack practicability in terms of both problem formulation and solution framework. On the one hand, some problem settings in prior online bipartite graph matching research are impractical for real-world applications. On the other hand, existing solutions to online bipartite graph matching are inefficient due to the unnecessary real-time decision making. In this paper, we propose the dynamic bipartite graph matching (DBGM) problem to be better aligned with real-world applications and devise a novel adaptive batch-based solution framework with a constant competitive ratio. As an effective and efficient implementation of the solution framework, we design a reinforcement learning based algorithm, called Restricted Q-learning (RQL), which makes near-optimal decisions on batch splitting. Extensive experimental results on both real and synthetic datasets show that our methods outperform the state-of-the-arts in terms of both effectiveness and efficiency.

【Keywords】: bipartite graph matching; online algorithm; reinforcement learning

127. Scalable Frequent Sequence Mining with Flexible Subsequence Constraints.

Paper Link】 【Pages】:1490-1501

【Authors】: Alexander Renz-Wieland ; Matthias Bertsch ; Rainer Gemulla

【Abstract】: We study scalable algorithms for frequent sequence mining under flexible subsequence constraints. Such constraints enable applications to specify concisely which patterns are of interest and which are not. We focus on the bulk synchronous parallel model with one round of communication; this model is suitable for platforms such as MapReduce or Spark. We derive a general framework for frequent sequence mining under this model and propose the D-SEQ and D-CAND algorithms within this framework. The algorithms differ in what data are communicated and how computation is split up among workers. To the best of our knowledge, D-SEQ and D-CAND are the first scalable algorithms for frequent sequence mining with flexible constraints. We conducted an experimental study on multiple real-world datasets that suggests that our algorithms scale nearly linearly, outperform common baselines, and offer acceptable generalization overhead over existing, less general mining algorithms.

【Keywords】: data mining; frequent sequence mining; sequential pattern mining; large scale sequence mining; scalable data analysis

128. Adaptive Influence Blocking: Minimizing the Negative Spread by Observation-Based Policies.

Paper Link】 【Pages】:1502-1513

【Authors】: Qihao Shi ; Can Wang ; Deshi Ye ; Jiawei Chen ; Yan Feng ; Chun Chen

【Abstract】: Spread of negative influence (N-Inf) in a networked system seems to be inevitable, e.g., epidemic spread in human networks, rumors in an online social network and computer virus plaguing the Internet etc. The widespread of N-Inf might cause severe damage and hence the Influence Blocking (IB) problem is attracting ample research interest. The IB problem aims at minimizing the N-Inf spread by immunization, i.e. selecting k (budget size) immunization nodes (Imm-nodes) to prevent the N-Inf from spreading. However, existing works for IB problem are all formulated as a one-shot task: selecting all the k Imm-nodes at the very beginning of N-Inf spread. In real world, unforeseen events might occur and one-shot policies will lack reserved measures to handle these situations. A more reasonable policy is to adaptively invest the budget based on the observation of N-Inf spread along as the time goes by. With the adaptive policy, we can both reserve resources for handling unforeseen events and save unnecessary costs if the spread of N-Inf dies out quickly. Motivated by the above considerations, we propose a novel Adaptive Influence Blocking (AIB) problem. Given the intermediate observations of N-Inf spread, the AIB problem aims at selecting Imm-nodes adaptively. We design a k-R (k-nodes-per-Round) policy which selects k Imm-nodes for each round until the budget is exhausted, and an α-T (α-Tolerance) policy which selects a new Imm-node if the expected N-Inf spread exceeds a threshold α. Scalable algorithms with provable approximation guarantees and error bounds are implemented for these policies and significant improvements on time complexity are achieved. Experimental results on real-world datasets demonstrate the effectiveness and scalability of the proposed methods.

【Keywords】: influence blocking; adaptive policy; immunization

129. Fraction-Score: A New Support Measure for Co-location Pattern Mining.

Paper Link】 【Pages】:1514-1525

【Authors】: Harry Kai-Ho Chan ; Cheng Long ; Da Yan ; Raymond Chi-Wing Wong

【Abstract】: Co-location patterns are well-established on spatial objects with categorical labels, which capture the phenomenon that objects with certain labels are often located in close geographic proximity. Similar to frequent itemsets, co-location patterns are defined based on a support measure which quantifies the popularity (or prevalence) of a pattern candidate (a label set). Quite a few support measures exist for defining co-location patterns and they share an idea of counting the number of instances of a given label set C as its support, where an instance of C is an object set whose objects carry all the labels in C and are located close to one another. Unfortunately, these measures suffer from various weaknesses, e.g., some fail to capture all possible instances while some others overlook the cases when multiple instances overlap. In this paper, we propose a new measure called Fraction-Score whose idea is to count instances fractionally if they overlap. Compared to existing measures, Fraction-Score not only captures all possible instances, but also handles the cases where instances overlap appropriately (so that the supports defined are more meaningful and consistent with the desirable anti-monotonicity property). To solve the co-location pattern mining problem based on Fraction-Score, we develop efficient algorithms which are significantly faster than a baseline that adapts the state-of-the-art. We conduct extensive experiments using both real and synthetic datasets, which verified the superiority of Fraction-Score and also the efficiency of our developed algorithms.

【Keywords】: co-location pattern mining; fraction score

130. Discovery and Ranking of Functional Dependencies.

Paper Link】 【Pages】:1526-1537

【Authors】: Ziheng Wei ; Sebastian Link

【Abstract】: Computing the functional dependencies that hold on a given data set is one of the most important problems in data profiling. Utilizing new data structures and original techniques for the dynamic computation of stripped partitions, we devise a new hybridization strategy that outperforms the best algorithms in terms of efficiency, column-, and row-scalability. This is demonstrated on real-world benchmark data. We further propose the number of redundant data values for ranking the output of discovery algorithms. Our ranking assesses the relevance of functional dependencies for the given data set.

【Keywords】: Algorithm; data-profiling; discovery; functional-dependency; hybrid; memory-consumption; missing-data; ranking; redundancy; runtime-efficiency

131. Adaptive Deep Reuse: Accelerating CNN Training on the Fly.

Paper Link】 【Pages】:1538-1549

【Authors】: Lin Ning ; Hui Guan ; Xipeng Shen

【Abstract】: This work proposes adaptive deep reuse, a method for accelerating CNN training by identifying and avoiding the unnecessary computations contained in each specific training on the fly. It makes two-fold major contributions. (1) It empirically proves the existence of a lot of similarities among neuron vectors in both forward and backward propagation of CNN. (2) It introduces the first adaptive strategy for translating the similarities into computation reuse in CNN training. The strategy adaptively adjusts the strength of reuse based on the different tolerance of precision relaxation in different CNN training stages. Experiments show that adaptive deep reuse saves 69% CNN training time with no accuracy loss.

【Keywords】: CNN; neuron vector; similarity; training; adaptive; deep reuse

132. Slice Finder: Automated Data Slicing for Model Validation.

Paper Link】 【Pages】:1550-1553

【Authors】: Yeounoh Chung ; Tim Kraska ; Neoklis Polyzotis ; Ki Hyun Tae ; Steven Euijong Whang

【Abstract】: As machine learning (ML) systems become democratized, it becomes increasingly important to help users easily debug their models. However, current data tools are still primitive when it comes to helping users trace model performance problems all the way to the data. We focus on the particular problem of slicing data to identify subsets of the validation data where the model performs poorly. This is an important problem in model validation because the overall model performance can fail to reflect that of the smaller subsets, and slicing allows users to analyze the model performance on a more granular-level. Unlike general techniques (e.g., clustering) that can find arbitrary slices, our goal is to find interpretable slices (which are easier to take action compared to arbitrary subsets) that are large and problematic. We propose Slice Finder, which is an interactive framework for identifying such slices using statistical techniques. Applications include diagnosing model fairness and fraud detection, where identifying slices that are interpretable to humans is crucial.

【Keywords】: data slicing; model validation

133. Neural Multi-task Recommendation from Multi-behavior Data.

Paper Link】 【Pages】:1554-1557

【Authors】: Chen Gao ; Xiangnan He ; Dahua Gan ; Xiangning Chen ; Fuli Feng ; Yong Li ; Tat-Seng Chua ; Depeng Jin

【Abstract】: Most existing recommender systems leverage user behavior data of one type, such as the purchase behavior data in E-commerce. We argue that other types of user behavior data also provide valuable signal, such as views, clicks, and so on. In this work, we contribute a new solution named NMTR (short for Neural Multi-Task Recommendation) for learning recommender systems from user multi-behavior data. In particular, our model accounts for the cascading relationship among different types of behaviors (e.g., a user must click on a product before purchasing it). We perform a joint optimization based on the multi-task learning framework, where the optimization on a behavior is treated as a task. Extensive experiments on the real-world dataset demonstrate that NMTR significantly outperforms state-of-the-art recommender systems that are designed to learn from both single-behavior data and multi-behavior data.

【Keywords】: Multi-Behavior Recommendation; Collaborative Filtering; Deep Learning

134. AUC-MF: Point of Interest Recommendation with AUC Maximization.

Paper Link】 【Pages】:1558-1561

【Authors】: Peng Han ; Shuo Shang ; Aixin Sun ; Peilin Zhao ; Kai Zheng ; Panos Kalnis

【Abstract】: The task of point of interest (POI) recommendation aims to recommend unvisited places to users based on their check-in history. A major challenge in POI recommendation is data sparsity, because a user typically visits only a very small number of POIs among all available POIs. In this paper, we propose AUC-MF to address the POI recommendation problem by maximizing Area Under the ROC curve (AUC). AUC has been widely used for measuring classification performance with imbalanced data distributions. To optimize AUC, we transform the recommendation task to a classification problem, where the visited locations are positive examples and the unvisited are negative ones. We define a new lambda for AUC to utilize the LambdaMF model, which combines the lambda-based method and matrix factorization model in collaborative filtering. Experiments on two datasets show that the proposed AUC-MF outperforms state-of-the-art methods significantly in terms of recommendation accuracy.

【Keywords】: POI recommendaton; matrix factorization; AUC

135. Efficient Batch One-Hop Personalized PageRanks.

Paper Link】 【Pages】:1562-1565

【Authors】: Siqiang Luo ; Xiaokui Xiao ; Wenqing Lin ; Ben Kao

【Abstract】: Personalized PageRank (PPR) is a classic measure of the relevance among different nodes in a graph. Existing work on PPR has mainly focused on three general types of queries, namely, single-pair PPR, single-source PPR, and all-pair PPR. However, there are applications that rely on a new query type (referred to as batch one-hop PPR), which takes as input a set S of source nodes and, for each node s in S and each of s's neighbor v, asks for the PPR value of v with respect to s. None of the existing PPR algorithms is able to efficiently process batch one-hop queries, due to the inherent differences between batch one-hop PPR and the three general query types. To address the limitations of existing algorithms, this paper presents Baton, an algorithm for batch one-hop PPR that offers strong practical efficiency.

【Keywords】: Personalized PageRank; Graph Algorithm; Query Processing; Social Networks; Indexing

136. BB-Tree: A Main-Memory Index Structure for Multidimensional Range Queries.

Paper Link】 【Pages】:1566-1569

【Authors】: Stefan Sprenger ; Patrick Schäfer ; Ulf Leser

【Abstract】: We present the BB-Tree, a fast and space-efficient index structure for processing multidimensional workloads in main memory. It uses a k-ary search tree for pruning and searching while keeping all data in leaf nodes. It linearizes the inner search tree and manages it in a cache-optimized array, with occasional re-organizations when data changes. To reduce the frequency of re-organizations, the BB-Tree introduces a novel architecture for leaf nodes, called bubble buckets, which automatically morphs between different representations based on their fill degree and are thus able to buffer large numbers of insertions in-place. We compare the BB-Tree to scanning, mainmemory variants of the R*-tree, the kd-tree, and the VA-file, and the PH-tree using workloads over real and synthetic data. The BB-Tree is the fastest index for range queries up to a selectivity of 20%, and achieves an exact-match query performance similar to that of the best point access method, and is the most spaceefficient index structure.

【Keywords】: Multidimensional; Index Structure; Range Queries; Main-Memory

137. Explaining Queries Over Web Tables to Non-experts.

Paper Link】 【Pages】:1570-1573

【Authors】: Jonathan Berant ; Daniel Deutch ; Amir Globerson ; Tova Milo ; Tomer Wolfson

【Abstract】: Designing a reliable natural language (NL) interface for querying tables has been a longtime goal of researchers in both the data management and natural language processing (NLP) communities. Such an interface receives as input an NL question, translates it into a formal query, executes the query and returns the results. Errors in the translation process are not uncommon, and users typically struggle to understand whether their query has been mapped correctly. We address this problem by explaining the obtained formal queries to non-expert users. We introduce novel query explanations that provide a graphic representation of the query cell-based provenance (in its execution on a given table). Our solution augments a state-of-the-art NL interface over web tables, enhancing it in both its training and deployment phase. Experiments, including a user study conducted on Amazon Mechanical Turk, show our solution to improve both the correctness and reliability of the NL interface.

【Keywords】: NL interfaces; semantic parsing; provenance; natural language processing; NLP; weak supervision; annotations; query explanations

138. Nonlinear Models Over Normalized Data.

Paper Link】 【Pages】:1574-1577

【Authors】: Zhaoyue Cheng ; Nick Koudas

【Abstract】: Machine Learning (ML) applications are proliferating in the enterprise. Increasingly enterprise data are used to build sophisticated ML models to assist critical business functions. Relational data which are prevalent in enterprise applications are typically normalized; as a result data have to be denormalized via primary/foreign-key joins to be provided as input to ML algorithms. In this paper we study the implementation of popular nonlinear ML models and in particular independent Gaussian Mixture Models (IGMM) over normalized data. For the case of IGMM we propose algorithms taking the statistical properties of the Gaussians into account to construct mixture models, factorizing the computation. In that way we demonstrate that we can conduct the training of the models much faster compared to other applicable approaches, without any loss in accuracy. We present the results of a thorough experimental evaluation, varying several parameters of the input relations involved and demonstrate that our proposals both for the case of IGMM yield drastic performance improvements which become increasingly higher as parameters of the underlying data vary, without any loss in accuracy.

【Keywords】: normalized databases; no linear models; machine learning; databases; data management

Paper Link】 【Pages】:1578-1581

【Authors】: Yuyang Dong ; Hanxiong Chen ; Hiroyuki Kitagawa

【Abstract】: As the popularity of SNS and the number of GPS-equipped mobile devices increases, a large number of web users frequently change their location (spatial attribute) and interesting keywords (keyword attribute) in real-time. An example of such would be when a user watches the news, videos, and blogs while moving. Many location-based web applications can benefit from continuously searching for these dynamic spatial keyword objects. In this paper, we define a novel query problem to continuously search for dynamic spatial keyword objects. To the best of our knowledge, this is the first work to consider dynamic spatial keyword objects. We employ a novel grid-based index to manage both queries and dynamic spatial keyword objects. With the proposed index, we develop a buffer named partial cell list to reduce the computation cost in the top-k reevaluation. The experiments confirm the superiorities of our proposed methods.

【Keywords】: spatial keyword search; dynamic object; Grid based index; stream data

140. Top-K Frequent Term Queries on Streaming Data.

Paper Link】 【Pages】:1582-1585

【Authors】: Sara Farazi ; Davood Rafiei

【Abstract】: With rapidly increasing user-generated geo-tagged content in social media, location-based queries have received more attention lately. This paper studies the problem of efficiently answering top-K spatial term queries on streaming data where given a term, the goal is to find top K locations where the term is frequent. We propose two variations of reverse spatial term queries on streaming data and an approach for efficiently evaluating them with some bounds on the error. We conduct experiments on a relatively large dataset of Flickr photos, reporting a preliminary evaluation of our approach. We also discuss the limitations of the approaches in the literature for answering reverse spatial term queries and some of the advantages and disadvantages of our proposed approach.

【Keywords】: spatial queries; top K frequent terms; spatial stream data

141. Parallel and Distributed Processing of Reverse Top-k Queries.

Paper Link】 【Pages】:1586-1589

【Authors】: Panagiotis Nikitopoulos ; Georgios A. Sfyris ; Akrivi Vlachou ; Christos Doulkeridis ; Orestis Telelis

【Abstract】: In this paper, we address the problem of processing reverse top-k queries in a parallel and distributed setting. Given a database of objects, a set of user preferences, and a query object q, the reverse top-k query returns the subset of user preferences for which the query object belongs to the top-k results. Although recently, the reverse top-k query operator has been studied extensively, its CPU-intensive nature results in prohibitively expensive processing cost, when applied on vast-sized data sets. This limitation motivates us to explore a parallel processing solution, to enable reverse top-k query evaluation over GBs of data in reasonable execution time. To the best of our knowledge, this is the first work that addresses the problem of parallel reverse top-k query processing. We propose a solution to this problem, called DiPaRT, which is based on MapReduce and is provably correct. DiPaRT is empirically evaluated using GB-sized data sets.

【Keywords】: reverse top-k; distributed; parallel

142. Distributed Discovery of Functional Dependencies.

Paper Link】 【Pages】:1590-1593

【Authors】: Hemant Saxena ; Lukasz Golab ; Ihab F. Ilyas

【Abstract】: We address the problem of discovering functional dependencies from distributed big data. Existing (non-distributed) algorithms such as FastFDs focus on minimizing computation. However, distributed algorithms must also optimize data communication costs, especially in shared-nothing settings. We propose a distributed version of FastFDs that is communication-efficient and we experimentally show significant performance improvements over a straightforward distributed implementation.

【Keywords】: data profiling, functional dependencies, distributed algorithms

143. AID: An Adaptive Image Data Index for Interactive Multilevel Visualization.

Paper Link】 【Pages】:1594-1597

【Authors】: Saheli Ghosh ; Ahmed Eldawy ; Shipra Jais

【Abstract】: Visualization has become an integral part of big data management and exploration. Big spatial data is visualized on a map by processing the geometry of the data as well as other attributes. To speed up big spatial data visualization, two visualization indexes are currently available, image indexes and data indexes. Image indexes provide an interactive visualization but require a long indexing time, while data indexes are fast to build but are not interactive for big data. This paper introduces the first adaptive visualization index that combines both data and images to provide a scalable, interactive visualization while minimizing the index size and index construction time. They key idea is to identify the regions that are costly to visualize and store them as partial images. The remaining regions are stored as raw data and are visualized on-the-fly at query time. The preliminary results show that the proposed index can provide highly interactive visualization with a minimal indexing time.

【Keywords】: Visualizations; Multilevel Visualizations; Image Index; Data Index

144. Collecting Preference Rankings Under Local Differential Privacy.

Paper Link】 【Pages】:1598-1601

【Authors】: Jianyu Yang ; Xiang Cheng ; Sen Su ; Rui Chen ; Qiyu Ren ; Yuhan Liu

【Abstract】: In this paper, we initiate the study of collecting preference rankings under local differential privacy. The key technical challenge comes from the fact that the number of possible rankings increases factorially in the number of items to rank. In practical settings, this number could be large, leading to excessive injected noise. To solve this problem, we present a novel approach called SAFARI. The general idea is to collect a set of distributions over small domains which are carefully chosen based on the riffle independent model to approximate the overall distribution of users' rankings, and then generate a synthetic ranking dataset from the obtained distributions. By working on small domains instead of a large domain, SAFARI can significantly reduce the magnitude of added noise. Extensive experiments on real datasets confirm the effectiveness of SAFARI.

【Keywords】: Data Collection; Preference Rankings; Local Differential Privacy; Riffle Independent Model

145. Muses: Distributed Data Migration System for Polystores.

Paper Link】 【Pages】:1602-1605

【Authors】: Abdulrahman Kaitoua ; Tilmann Rabl ; Asterios Katsifodimos ; Volker Markl

【Abstract】: Large datasets can originate from various sources and are being stored in heterogeneous formats, schemas, and locations. Typical data science tasks need to combine those datasets in order to increase their value and extract knowledge. This is done in various data processing systems with diverse execution engines. In order to take advantage of each execution engine's characteristics and APIs data scientists need to migrate and transform their datasets at a very high computational cost and manual labor. Data migration is challenging for two main reasons: i) execution engines expect specific types/shapes of the data as input; ii) there are various physical representations of the data (e.g., partitions). Therefore, migrating data efficiently requires knowledge of systems internals and assumptions. In this paper we present Muses, a distributed, high-performance data migration engine that is able to forward, transform, repartition, and broadcast data between distributed engines' instances efficiently. Muses does not require any changes in the underlying execution engines. In an experimental evaluation, we show that migrating data from one execution engine to another (in order to take advantage of faster, native operations) can increase a pipeline's performance by 30%.

【Keywords】: Distributed systems, data migration, data transformation, big data engine, data integration.

146. PriSTE: From Location Privacy to Spatiotemporal Event Privacy.

Paper Link】 【Pages】:1606-1609

【Authors】: Yang Cao ; Yonghui Xiao ; Li Xiong ; Liquan Bai

【Abstract】: Location privacy-preserving mechanisms (LPPMs) have been extensively studied for protecting a user's location at each time point or a sequence of locations with different timestamps (i.e., a trajectory). We argue that existing LPPMs are not capable of protecting the sensitive information in user's spatiotemporal activities, such as "visited hospital in the last week" or "regularly commuting between Address 1 and Address 2 every morning and afternoon" (it is easy to infer that Addresses 1 and 2 may be home and office). To address this problem, we define the spatiotemporal event as a new privacy goal, which can be formalized as Boolean expressions between location and time predicates. We show that the spatiotemporal event is a generalization of a single location or a trajectory which is protected by existing LPPMs, while some types of spatiotemporal event may not be protected by the existing LPPMs. Hence, we formally define -spatiotemporal event privacy which is an indistinguishability-based privacy metric. It turns out that, interestingly, such privacy metric is orthogonal to the existing indistinguishability-based location privacy metric such as Geo-indistinguishability. We also discussed the potential solution to achieve both -spatiotemporal event privacy and Geo-indistinguishability.

【Keywords】: location privacy; spatiotemporal data; spatiotemporal event; differential privacy; LPPM

147. Continuous Range Queries Over Multi-attribute Trajectories.

Paper Link】 【Pages】:1610-1613

【Authors】: Jianqiu Xu ; Zhifeng Bao ; Hua Lu

【Abstract】: A multi-attribute trajectory consists of a sequence of time-stamped locations and a set of attributes that characterize diverse aspects of the corresponding moving object. In this paper, we study continuous range queries over multi-attribute trajectories. Such a query returns the objects whose attributes contain expected values and whose locations are always within a distance threshold to the query trajectory during the entire overlapping time period. To efficiently answer the query, an optimal method of partitioning the trajectories is proposed and an index structure is developed to support the combined search of spatio-temporal parameters and attribute values. We provide a general solution that is able to process multi-attribute trajectories as well as traditional trajectories without attributes. We carry out comprehensive experiments in a prototype database system to evaluate the efficiency and scalability of our designs. The experimental results show that our approach outperforms five alternative approaches by a factor of 5-50x on large datasets.

【Keywords】: multi-attribute; trajectories; range queries; partition; algorithms

148. Insecurity and Hardness of Nearest Neighbor Queries Over Encrypted Data.

Paper Link】 【Pages】:1614-1617

【Authors】: Rui Li ; Alex X. Liu ; Ying Liu ; Huanle Xu ; Huaqiang Yuan

【Abstract】: Nearest neighbor query processing is a fundamental problem that arises in many fields such as spatial databases and machine learning. ASPE, which uses invertible matrices to encrypt data, is a widely adopted Secure Nearest Neighbor (SNN) query scheme. Encrypting data by matrices is actually a linear combination of the multiple dimensions of the data, which is completely consistent with the relationship between the source signals and observed signals in the signal processing. By viewing dimensions of the data and the encrypted data as source signals and observed signals, respectively, we formally prove and experimentally demonstrate that ASPE is actually insecure against even ciphertext only attacks, using signal processing theory. Prior work proved that it is impossible to construct an SNN scheme even in much relaxed standard security models, we invalidate this hardness understanding by pointing out the incorrectness of the hardness proof.

【Keywords】: Secure Nearest Neighbor Queries; Independent Component Analysis; Insecurity of ASPE

149. Modeling Multidimensional User Preferences for Collaborative Filtering.

Paper Link】 【Pages】:1618-1621

【Authors】: Farhan Khawar ; Nevin L. Zhang

【Abstract】: A popular idea in collaborative filtering is to map users and items to latent vectors in the same Euclidean space and make recommendations based on their inner products. The idea of user/item clustering has also been exploited. However, the possibility of obtaining latent user and item feature vectors from user/item clusters has not been investigated. In this paper, we propose such a method for implicit feedback data. We cluster users along multiple latent dimensions, with each latent dimension being defined by a distinct subset of items. User clustering along a latent dimension results in two soft groups of users: those who have a tendency to consume the corresponding items and those who do not. The first group is called a taste group. As there are multiple latent dimensions, we get multiple taste groups. We map users and items to latent feature vectors based on the taste groups such that the vector for a user tells us what tastes she possesses, and the vector for an item tells us how popular it is for users with various tastes. We call the method Multidimensional User Clustering for Collaborative Filter (MUC-CF). In comparison with other methods, MUC-CF leads to more meaningful latent factors and hence its recommendations are easier to explain. MUC-CF is also scalable and in empirical evaluations, it outperforms state-of-the-art baselines.

【Keywords】: Multidimensional clustering, Recommender Systems, Collaborative Filtering, Implicit Feedback

150. A Queueing-Theoretic Framework for Vehicle Dispatching in Dynamic Car-Hailing.

Paper Link】 【Pages】:1622-1625

【Authors】: Peng Cheng ; Chao Feng ; Lei Chen ; Zheng Wang

【Abstract】: With the rapid development of smart mobile devices, the car-hailing platforms (e.g., Uber or Lyft) have attracted much attention from both the academia and the industry. In this paper, we consider an important dynamic car-hailing problem, namely maximum revenue vehicle dispatching (MRVD), in which rider requests dynamically arrive and drivers need to serve as many riders as possible such that the entire revenue of the platform is maximized. We prove that the MRVD problem is NP-hard and intractable. To handle the MRVD problem, we propose a queueing-based vehicle dispatching framework, which first uses existing machine learning algorithms to predict the future vehicle demand of each region, then estimates the idle time periods of drivers through a queueing model for each region. With the information of the predicted vehicle demands and estimated idle time periods of drivers, we propose one batch-based vehicle dispatching algorithm to efficiently assign suitable drivers to riders such that the expected entire revenue of the platform is maximized during each batch processing. Through experiments over real data sets, we demonstrate the efficiency and effectiveness of our proposed framework.

【Keywords】: Demand Prediction; Queueing Theoretic Algorithm; Car hailing

151. Maximizing the Utility in Location-Based Mobile Advertising.

Paper Link】 【Pages】:1626-1629

【Authors】: Peng Cheng ; Xiang Lian ; Lei Chen ; Siyuan Liu

【Abstract】: Nowadays, the locations and contexts of users are easily accessed by mobile advertising brokers, and the brokers can send customers related location-based advertisement. In this paper, we consider a location-based advertising problem, namely maximum utility advertisement assignment (MUAA) problem, with the estimation of the interests of customers and the contexts of the vendors, we want to maximize the overall utility of ads by determining the ads sent to each customer subject to the constraints of the capacities of customers, the distance ranges and the budgets of vendors. We prove that the MUAA problem is NP-hard and intractable. Thus, we propose one offline approach, namely the reconciliation approach, which has an approximation ratio of (1 - ε) · θ, where θ = min(a 1 /2n 1 c , a 2 /n 2 c , ⋯,a m /n m c ), and nz is the larger value between the number of valid vendors and the capacity a i of customer u i . Experiments on real data sets confirm the efficiency and effectiveness of our proposed approach.

【Keywords】: Location based mobile advertising; Task Assignment; Spatial Databases

152. Automated Grading of SQL Queries.

Paper Link】 【Pages】:1630-1633

【Authors】: Bikash Chandra ; Ananyo Banerjee ; Udbhas Hazra ; Mathew Joseph ; S. Sudarshan

【Abstract】: Grading student SQL queries manually is a tedious and error-prone process. The XData system, developed at IIT Bombay, can be used to test if a student query is correct or not. However, in case a student query is found to be incorrect, there is currently no way to automatically assign partial marks. Manually awarding partial marks is not scalable for classes with a large number of students, especially MOOCs, and is also prone to human errors. In this paper, we discuss techniques to award partial marks to student SQL queries, in case they are incorrect, based on a weighted equivalence edit distance metric. Our goal is to find a minimal sequence of edits on the student query such that it can be transformed to a query that is equivalent to a correct query. Our system can also be used in a learning mode where query edits can be suggested as feedback to students to guide them towards a correct query. Our automated partial marking system has been successfully used in courses at IIT Bombay and IIT Dharwad.

【Keywords】: SQL query grading; SQL query edits

153. Parameter Discovery in Unsupervised Clustering.

Paper Link】 【Pages】:1634-1637

【Authors】: Valentin Clement ; Thomas Heinis

【Abstract】: Analyzing massive amounts of data and extracting value has become key across different disciplines. A plethora of approaches has been developed to analyze the deluge of data. Using these approaches, however, is not straightforward and many require a priori knowledge of the dataset to set parameters, such as the number of clusters, making their use challenging. Lack of knowledge about the dataset means either that the clustering algorithm has to be run multiple times with different parameters or expensive human intervention and in-depth analysis is required, significantly delaying the analysis and reducing its reproducibility. In this paper, we introduce the idea of simple assumptions about the global distribution of some property of the data leading to local, actionable insights. More specifically, we derive configuration parameters for a clustering method from global distribution properties of a dataset.

【Keywords】: Clustering; Parameter Discovery

154. Interaction-Aware Arrangement for Event-Based Social Networks.

Paper Link】 【Pages】:1638-1641

【Authors】: Feifei Kou ; Zimu Zhou ; Hao Cheng ; Junping Du ; Yexuan Shi ; Pan Xu

【Abstract】: The last decade has witnessed the emergence and popularity of event-based social networks (EBSNs), which extend online social networks to the physical world. Fundamental on EBSN platforms is to appropriately assign EBSN users to events they are interested to attend, known as event-participant arrangement. Previous event-participant arrangement studies either fail to avoid conflicts among events or ignore the social interactions among participants. In this work, we propose a new event-participant arrangement problem called Interaction-aware Global Event-Participant Arrangement (IGEPA). It globally optimizes arrangements between events and participants to avoid conflicts in events, and not only accounts for user interests, but also encourages socially active participants to join. To solve the IGEPA problem, we design an approximation algorithm which has an approximation ratio of at least 1\4. Experimental results validate the effectiveness of our solution.

【Keywords】: Social Network; Matching

155. Optimizing Cross-Platform Data Movement.

Paper Link】 【Pages】:1642-1645

【Authors】: Sebastian Kruse ; Zoi Kaoudi ; Jorge-Arnulfo Quiané-Ruiz ; Sanjay Chawla ; Felix Naumann ; Bertty Contreras-Rojas

【Abstract】: Data analytics are moving beyond the limits of a single data processing platform. A cross-platform query optimizer is necessary to enable applications to run their tasks over multiple platforms efficiently and in a platform-agnostic manner. For the optimizer to be effective, it must consider data movement costs across different data processing platforms. In this paper, we present the graph-based data movement strategy used by Rheem, our open-source cross-platform system. In particular, we (i) model the data movement problem as a new graph problem, which we prove to be NP-hard, and (ii) propose a novel graph exploration algorithm, which allows Rheem to discover multiple hidden opportunities for cross-platform data processing.

【Keywords】: data movement; cross-platform; polystore; query opimization

156. Highly Efficient Pattern Mining Based on Transaction Decomposition.

Paper Link】 【Pages】:1646-1649

【Authors】: Youcef Djenouri ; Jerry Chun-Wei Lin ; Kjetil Nørvåg ; Heri Ramampiaro

【Abstract】: This paper introduces a highly efficient pattern mining technique called Clustering-Based Pattern Mining (CBPM). This technique discovers relevant patterns by studying the correlation between transactions in transaction databases using clustering techniques. The set of transactions are first clus-tered using the k-means algorithm, where highly correlated transactions are grouped together. Next, the relevant patterns are derived by applying a pattern mining algorithm to each cluster. We present two different pattern mining algorithms, one approximate and one exact. We demonstrate the efficiency and effectiveness of CBPM through a thorough experimental evaluation.

【Keywords】: Clustering, Pattern Mining, Decomposition, Scalability.

157. Procrastination-Aware Scheduling: A Bipartite Graph Perspective.

Paper Link】 【Pages】:1650-1653

【Authors】: Libin Wang ; Yongxin Tong ; Chunming Hu ; Lei Chen ; Yiming Li

【Abstract】: Procrastination is a prevalent form of self-control failure. As it often concerns with the individual's ability to meet the deadline, an efficient time management is crucial for overcoming it. Though a considerable amount of work in behavioral economics provides useful insights, there is not a computational way to guide us how to obtain an appropriate schedule for all the things to be done, especially when the relationship of the deadlines is intrinsic. In this paper, we first propose the Procrastination-aware Scheduling Problem (PSP) to model an appropriate schedule. A bipartite graph formulation is then developed to further illustrate the concepts. We find the PSP is NP-hard in the strong sense and design an approximation algorithm. In addition, we note the significance of the PSP under the online scenario (called OnlinePSP). Finally, we verify the effectiveness and efficiency of the proposed algorithms through extensive experiments on real datasets.

【Keywords】: Scheduling; Bipartite Graph

158. Hankel Matrix Factorization for Tagged Time Series to Recover Missing Values During Blackouts.

Paper Link】 【Pages】:1654-1657

【Authors】: Simeng Wu ; Liang Wang ; Tianheng Wu ; Xianping Tao ; Jian Lu

【Abstract】: Recovering missing values in time series is critical when performing time series analysis. And the blackouts issue studied in this paper, described as losing all the data during a certain period, is among the most urgent and challenging issues. While the existing approaches for missing value recovery in time series could not handle this issue properly, in this work, we proposes a Hankel matrix factorization-based approach for tagged time series called HKMF-T, following the idea of decomposing a data sequence into the smooth trend and the external impact components. By transforming the data sequence into its Hankel matrix form, HKMF-T models the smooth trend implied by high-order temporal correlations as the product of two low-rank matrices, and learns the external impacts indicated by a corresponding tag sequence. Through extensive experiments conducted on three real-world data sets, HKMF-T shows its effectiveness by outperforming all baseline methods for blackouts with durations longer than nine sampling intervals.

【Keywords】: Time Series; Missing Value Recovery; Blackouts; Hankel Matrix Factorization

159. PerRD: A System for Personalized Route Description.

Paper Link】 【Pages】:1658-1661

【Authors】: Han Su ; Guanglin Cong ; Wei Chen ; Qinyuan Su ; Bolong Zheng ; Kai Zheng

【Abstract】: Nowadays, mobile devices are already seen everywhere in life, which makes the application of vehicle navigation more and more widely. The traditional turn-by-turn navigation does the path planning just based on the characteristics of the roads themselves, and then gives mechanized steering instructions at each corner. For those roads people are familiar with in this route, path descriptions which provide detailed route description information, will become redundant and verbose. In this paper, we study a Personalized Route Description system dubbed PerRD - with which the goal is to generate more customized and intuitive route descriptions based on user generated content. The goal is to optimize a given route description with paths which users know well, which makes the route more consistent with users' driving habits, and to create a concise and meaningful route descriptions with POIs and street names.

【Keywords】: route description

160. Scalable Metric Similarity Join Using MapReduce.

Paper Link】 【Pages】:1662-1665

【Authors】: Jiacheng Wu ; Yong Zhang ; Jin Wang ; Chunbin Lin ; Yingjia Fu ; Chunxiao Xing

【Abstract】: Given two collections of objects, metric similarity join finds all similar pairs of objects according to a particular distance function in metric space. There is an increasing demand to provide a scalable similarity join algorithm which can support efficient query and analytical services in the era of Big Data. In this paper, we propose SMS-Join, a parallel framework to support similarity join in metric space based on the MapReduce paradigm. The overall workflow of SMS-Join is that it first finds some records as pivots in the preprocessing phase and then splits the data into partitions based on them with a map job. Finally the join results are obtained via a reduce job. To ensure load balancing between the partitions, we devise a light-weighted sampling technique to obtain high quality samples while maintaining the high performance. To reduce the partition cost, we develop an iterative partition strategy in the map phase. We implement our framework upon Apache Spark platform and conduct extensive experiments on four real world datasets. The results show that our method significantly outperforms state-of-the-art methods.

【Keywords】: similarity join; MapReduce; metric space

Paper Link】 【Pages】:1666-1669

【Authors】: Chaohui Wang ; Miao Xie ; Sourav S. Bhowmick ; Byron Choi ; Xiaokui Xiao ; Shuigeng Zhou

【Abstract】: Although exploratory search has received significant attention recently in the context of structured data, scant attention has been paid for graph-structured data. In this paper, we present two novel index structures called VACCINE and ADVISE to efficiently support exploratory subgraph search in a visual environment (VESS). VACCINE is an offline, feature-based index that stores rich information related to frequent and infrequent subgraphs in the underlying graph database and how they can be transformed from one subgraph to another. ADVISE, on the other hand, is an adaptive, compact, on-the-fly index instantiated during iterative visual formulation/reformulation of a subgraph query for exploratory search and records relevant information to efficiently support its repeated evaluation. These indexes engender more efficient and scalable visual exploratory subgraph search framework compared to a state-of-the-art technique.

【Keywords】: exploratory search, graph database, visual interface, indexing

Paper Link】 【Pages】:1670-1673

【Authors】: Wanqi Liu ; Hanchen Wang ; Ying Zhang ; Wei Wang ; Lu Qin

【Abstract】: Nearest Neighbor search has been well solved in low-dimensional space, but is challenging in high-dimensional space due to the curse of dimensionality. As a trade-off between efficiency and result accuracy, a variety of c-approximate nearest neighbor (c-ANN) algorithms have been proposed to return a c-approximate NN with confident at least δ. We observe that existing c-ANN search algorithms have some limitations on I/O efficiency when their indexes are resided on the external memory, which is critical for handling large scale high-dimensional data. In this paper, we introduce an incremental search based c-ANN search algorithm, named I-LSH. Unlike the previous LSH methods, which expand the bucket width in an exponential way, I-LSH adopts a more natural search strategy to incrementally access the hash values of the objects. We provide rigorous theoretical analysis to underpin our incremental search strategy. Our comprehensive experiment results show that, compared with state-of-the-art I/O efficient c-ANN techniques, our algorithm can achieve much better I/O efficiency under the same theoretical guarantee.

【Keywords】: NNS; LSH; CNN

163. T-Sample: A Dual Reservoir-Based Sampling Method for Characterizing Large Graph Streams.

Paper Link】 【Pages】:1674-1677

【Authors】: Lingling Zhang ; Hong Jiang ; Fang Wang ; Dan Feng ; Yanwen Xie

【Abstract】: Reservoir sampling is widely employed to characterize connectivity of large graph streams by producing edge samples. However, existing reservoir-based sampling methods mainly characterize large graph streams by a measure of counting triangles but perform poorly in accuracy when used to analyze the topological characteristics reflected by node degrees because they produce disconnected edge samples, making them ineffective in many applications that require both types of connectivity estimation simultaneously in real time. This paper proposes a new method, called triangle-induced reservoir sampling, or T-Sample, to produce connected edge samples. While every edge in a graph stream is still processed only once by T-Sample, a dual sampling mechanism performing both uniform sampling and non-uniform sampling is carefully designed with a base reservoir and an incremental reservoir. Specifically, the uniform sampling can be used to count triangles by employing the existing algorithms while the non-uniform sampling ensures that the edge samples are connected. Experimental results driven by real datasets show that T-Sample can obtain much more accurate estimations on the distributions of node degrees than the existing reservoir-based sampling methods.

【Keywords】: Reservoir based sampling; Dual sampling; Characterizing graph streams

164. Real Time Principal Component Analysis.

Paper Link】 【Pages】:1678-1681

【Authors】: Ranak Roy Chowdhury ; Muhammad Abdullah Adnan ; Rajesh K. Gupta

【Abstract】: By processing the data in motion, real-time data processing enables us to extract instantaneous results from online input data that ensures timely responsiveness to events as well as a much enhanced capacity to process large data sets. This is especially important when decision loops include querying and processing data on the web where size and latency considerations make it impossible to process raw data in real-time. This makes dimensionality reduction techniques, like principal component analysis (PCA), an important data preprocessing tool to gain insights into data. In this paper, we propose a variant of PCA, that is suited for real-time applications. In the real-time version of the PCA problem, we maintain a window over the most recent data and project every incoming row of data into lower dimensional subspace, which we generate as the output of the model. The goal is to minimize the reconstruction error of the output from the input. We use the reconstruction error as the termination criteria to update the eigenspace as new data arrives. To verify whether our proposed model can capture the essence of the changing distribution of large datasets in real-time, we have implemented the algorithm and evaluated performance against carefully designed simulations that change distributions of data sources over time in a controllable manner. Furthermore, we have demonstrated that our algorithm can capture the changing distributions of real-life datasets by running simulations on datasets from a variety of real-time applications e.g. localization, customer expenditure, etc. We propose algorithmic enhancements that rely upon spectral analysis to improve dimensionality reduction. Results show that our method can successfully capture the changing distribution of data in a real-time scenario, thus enabling real-time PCA.

【Keywords】: Big Data; Real Time; Dimensionality Reduction; PCA

165. A Fast Sketch Method for Mining User Similarities Over Fully Dynamic Graph Streams.

Paper Link】 【Pages】:1682-1685

【Authors】: Peng Jia ; Pinghui Wang ; Jing Tao ; Xiaohong Guan

【Abstract】: Many real-world networks such as Twitter and YouTube are given as fully dynamic graph streams represented as sequences of edge insertions and deletions. (e.g., users can subscribe and unsubscribe to channels on YouTube). Existing similarity estimation methods such as MinHash and OPH are customized to static graphs. We observe that they are indeed sampling methods and exhibit a sampling bias when applied to fully dynamic graph streams, which results in large estimation errors. To solve this challenge, we develop a fast and accurate sketch method VOS. VOS processes each edge in the graph stream of interest with small time complexity O(1) and uses small memory space to build a compact sketch of the dynamic graph stream over time. Based on the sketch built on-the-fly, we develop a method to estimate user similarities over time. We conduct extensive experiments and the experimental results demonstrate the efficiency and efficacy of our method.

【Keywords】: sketch; similarity; dynamic graph streams

166. A Collaborative Framework for Similarity Enforcement in Synthetic Scaling of Relational Datasets.

Paper Link】 【Pages】:1686-1689

【Authors】: Jiangwei W. Zhang ; Yong Chiang Tay

【Abstract】: Researchers and developers use benchmarks to compare their algorithms and products. A database benchmark must have a dataset. To be application-specific, this dataset should be empirical. However, the dataset may be too small, or too large, for the benchmarking experiments. The dataset must, therefore, be scaled to the desired size. To ensure the scaled dataset is similar to the original dataset, previous work typically specifies or extracts a fixed set of properties from the original dataset, then uses these properties to generate synthetic data for the scaled dataset. However, this approach becomes increasingly intractable as the property set gets larger, so a new solution is necessary. This paper proposes ASPECT, which adopts a different approach, for relational datasets. The user first selects a sizescaler to synthetically scale the empirical dataset to a desired size, then uses a tool to tweak the scaled dataset to enforce each target property. ASPECT provides the interface for these tools to collaborate in this tweaking process. Extensive experiments on real datasets show that ASPECT can efficiently and effectively enforce similarity in such a generated dataset.

【Keywords】: synthetic data; benchmarking; similarity

167. Meta Diagram Based Active Social Networks Alignment.

Paper Link】 【Pages】:1690-1693

【Authors】: Yuxiang Ren ; Charu C. Aggarwal ; Jiawei Zhang

【Abstract】: Network alignment aims at inferring a set of anchor links matching the shared entities between different information networks, which has become a prerequisite step for effective fusion of multiple information networks. In this paper, we will study the network alignment problem to fuse online social networks specifically. Social network alignment is extremely challenging to address due to several reasons, i.e., lack of training data, network heterogeneity and one-to-one constraint. Existing network alignment works usually require a large number of training data, but such a demand can hardly be met in applications, as manual anchor link labeling is extremely expensive. Significantly different from other homogeneous network alignment works, information in online social networks is usually of heterogeneous categories, the incorporation of which in model building is not an easy task. Furthermore, the one-to-one cardinality constraint on anchor links renders their inference process intertwistingly correlated. To resolve these three challenges, a novel network alignment model, namely ActiveIter, is introduced in this paper. ActiveIter defines a set of inter-network meta diagrams for anchor link feature extraction, adopts active learning for effective label query and uses greedy link selection for anchor link cardinality filtering. Extensive experiments are conducted on real-world aligned networks datasets, and the experimental results have demonstrated the effectiveness of ActiveIter compared with other state-of-the-art baseline methods.

【Keywords】: Heterogeneous Network; Network Alignment; Active Learning; Data Mining

168. Entity Integrity, Referential Integrity, and Query Optimization with Embedded Uniqueness Constraints.

Paper Link】 【Pages】:1694-1697

【Authors】: Ziheng Wei ; Uwe Leck ; Sebastian Link

【Abstract】: Embedded uniqueness constraints represent unique column combinations embedded in complete fragments of incomplete data. In contrast to SQL UNIQUE constraints, they offer a principled separation of completeness and uniqueness requirements and are capable of exploiting more resource-conscious index structures. The latter help relational database systems to be more efficient in enforcing entity and referential integrity, and in evaluating common types of queries.

【Keywords】: Data integrity; Index; Missing data; SQL; Query; Unique constraint

169. Efficient Pattern Mining Based Cryptanalysis for Privacy-Preserving Record Linkage.

Paper Link】 【Pages】:1698-1701

【Authors】: Anushka Vidanage ; Thilina Ranbaduge ; Peter Christen ; Rainer Schnell

【Abstract】: Privacy-preserving record linkage (PPRL) is the process of identifying records that correspond to the same entities across several databases without revealing any sensitive information about these entities. One popular PPRL technique is Bloom filter (BF) encoding, with first applications of BF based PPRL now being employed in real-world linkage applications. Here we present a cryptanalysis attack that can re-identify attribute values encoded in BFs. Our method applies maximal frequent itemset mining on a BF database to first identify sets of frequently co-occurring bit positions that correspond to encoded frequent q-grams (character substrings extracted from plain-text values). Using a language model, we then identify additional q-grams by applying pattern mining on subsets of BFs that encode a previously identified frequent q-gram. Experiments on a real database show that our attack can successfully re-identify sensitive values even when each BF in a database is unique.

【Keywords】: Bloom filter; re-identification; pattern mining; Max-Miner; data linkage; probabilistic language model

170. ECOQUG: An Effective Ensemble Community Scoring Function.

Paper Link】 【Pages】:1702-1705

【Authors】: Chunnan Wang ; Hongzhi Wang ; Chang Zhou ; Jianzhong Li ; Hong Gao

【Abstract】: A reasonable and effective community scoring function is of great significance since it can measure the community quality of groups we found more properly and help us discover more valuable communities. In this paper, we propose a new community scoring function, ECOQUG. Different from the existing community scoring functions, ECOQUG is designed based on the experimental study and theoretical analysis of groups with different community qualities. ECOQUG is more convincing. In addition, we design a series of experiments to examine the effectiveness and accuracy of ECOQUG and 13 other classic community scoring functions comprehensively. The extensive experimental results show that ECOQUG is effective and better than other community scoring functions.

【Keywords】: Community scoring function; Network communities; Ground-truth communities; Community quality

171. CN-Probase: A Data-Driven Approach for Large-Scale Chinese Taxonomy Construction.

Paper Link】 【Pages】:1706-1709

【Authors】: Jindong Chen ; Ao Wang ; Jiangjie Chen ; Yanghua Xiao ; Zhendong Chu ; Jingping Liu ; Jiaqing Liang ; Wei Wang

【Abstract】: Taxonomies play an important role in machine intelligence. However, most well-known taxonomies are in English, and non-English taxonomies, especially Chinese ones, are still very rare. In this paper, we focus on automatic Chinese taxonomy construction and propose an effective generation and verification framework to build a large-scale and high-quality Chinese taxonomy. In the generation module, we extract isA relations from multiple sources of Chinese encyclopedia, which ensures the coverage. To further improve the precision of taxonomy, we apply three heuristic approaches in verification module. As a result, we construct the largest Chinese taxonomy with high precision about 95% called CN-Probase. Our taxonomy has been deployed on Aliyun, with over 82 million API calls in six months.

【Keywords】: Knowledge Base; Taxonomy construction

172. Understanding Data Science Lifecycle Provenance via Graph Segmentation and Summarization.

Paper Link】 【Pages】:1710-1713

【Authors】: Hui Miao ; Amol Deshpande

【Abstract】: Increasingly modern data science platforms today have non-intrusive and extensible provenance ingestion mechanisms to collect rich provenance and context information, handle modifications to the same file using distinguishable versions, and use graph data models (e.g., property graphs) and query languages (e.g., Cypher) to represent and manipulate the stored provenance/context information. Due to the schema-later nature of the metadata, multiple versions of the same files, and unfamiliar artifacts introduced by team members, the resulting "provenance graphs" are quite verbose and evolving; further, it is very difficult for the users to compose queries and utilize this valuable information just using standard graph query model. In this paper, we propose two high-level graph query operators to address the verboseness and evolving nature of such provenance graphs. First, we introduce a graph segmentation operator, which queries the retrospective provenance between a set of source vertices and a set of destination vertices via flexible boundary criteria to help users get insight about the derivation relationships among those vertices. We show the semantics of such a query in terms of a context-free grammar, and develop efficient algorithms that run orders of magnitude faster than state-of-the-art. Second, we propose a graph summarization operator that combines similar segments together to query prospective provenance of the underlying project. The operator allows tuning the summary by ignoring vertex details and characterizing local structures, and ensures the provenance meaning using path constraints. We show the optimal summary problem is PSPACE-complete and develop effective approximation algorithms. We implement the operators on top of Neo4j, evaluate our query techniques extensively, and show the effectiveness and efficiency of the proposed methods.

【Keywords】: provenance management; graph query; model management; context free language

173. Efficient Partitioning and Query Processing of Spatio-Temporal Graphs with Trillion Edges.

Paper Link】 【Pages】:1714-1717

【Authors】: Mengsu Ding ; Shimin Chen

【Abstract】: Real-world graphs often contain spatio-temporal information and evolve over time. Compared with static graphs, spatio-temporal graphs present more significant challenges in data volume, data velocity, and query processing. In this paper, we define a formal spatio-temporal graph model based on real-world applications, and propose PAST, a framework for efficient PArtitioning and query processing of Spatio-Temporal graphs. Our experimental results show that PAST improves query performance by orders of magnitude compared to state-of-the-art solutions.

【Keywords】: Spatio-temporal graphs; partitioning for spatio-temporal graphs; graph query processing; PAST

174. Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing.

Paper Link】 【Pages】:1718-1721

【Authors】: Vasilis Verroios ; Hector Garcia-Molina

【Abstract】: Given a set of records, entity resolution algorithms find all the records referring to each entity. In top-k entity resolution, the goal is to find all the records referring to the k largest (in terms of number of records) entities. Top-k entity resolution is driven by many modern applications that operate over just the few most popular entities in a dataset. In this paper we introduce the problem of top-k entity resolution and we summarize a novel approach for this problem; full details are presented in a technical report. Our approach is based on locality-sensitive hashing, and can very rapidly and accurately process massive datasets. Our key insight is to adaptively decide how much processing each record requires to ascertain if it refers to a top-k entity or not: the less likely a record is to refer to a top-k entity, the less it is processed. The heavily reduced amount of processing for the vast majority of records that do not refer to top-k entities, leads to significant speedups. Our experiments with images, web articles, and scientific publications show a 2× to 25× speedup compared to traditional approaches for high-dimensional data.

【Keywords】: top K entity resolution; adaptive locality sensitive hashing; LSH; clustering

175. Finding Average Regret Ratio Minimizing Set in Database.

Paper Link】 【Pages】:1722-1725

【Authors】: Sepanta Zeighami ; Raymond Chi-Wing Wong

【Abstract】: Selecting a certain number of data points (or records) from a database which "best" satisfy users' expectations is a very prevalent problem with many applications. One application is a hotel booking website showing a certain number of hotels on a single page. However, this problem is very challenging since the selected points should "collectively" satisfy the expectation of all users. Showing a certain number of data points to a single user could decrease the satisfaction of a user because the user may not be able to see his/her favorite point which could be found in the original database. In this paper, we would like to find a set of k points such that on average, the satisfaction (ratio) of a user is maximized. This problem takes into account the probability distribution of the users and considers the satisfaction (ratio) of all users, which is more reasonable in practice, compared with the existing studies that only consider the worst-case satisfaction (ratio) of the users, which may not reflect the whole population and is not useful in some applications. Motivated by this, in this paper, we propose algorithms for this problem. Finally, we conducted experiments to show the effectiveness and the efficiency of the algorithms.

【Keywords】: query processing; average regret ratio

176. HyMJ: A Hybrid Structure-Aware Approach to Distributed Multi-way Join Query.

Paper Link】 【Pages】:1726-1729

【Authors】: Guanghui Zhu ; Xiaoqi Wu ; Liangliang Yin ; Haogang Wang ; Rong Gu ; Chunfeng Yuan ; Yihua Huang

【Abstract】: The multi-way join query plays a fundamental role in many big data analytic scenarios. Recently, the hybrid join query is becoming increasingly important. However, the existing one-round and multi-round algorithms have limitations in the process of the hybrid query. In this paper, we present a novel hybrid structure-aware multi-way join algorithm called HyMJ, which combines the one-round and multi-round algorithms to compute the hybrid query efficiently. First, we propose the query structure graph (QSG) to represent the internal query structure of a given join query and the query structure decomposition tree (QSDT) to represent the structure-aware query plan. Each internal node of the QSDT denotes a subquery with a cyclic or acyclic query structure. Then, we design a graph contraction based algorithm to construct QSDT from QSG. Furthermore, to select the optimal join strategy for each subquery in the QSDT, we introduce a heuristic strategy selection model. Experimental results on Apache Spark reveal that HyMJ outperforms both the one-round and multi-round algorithms for hybrid multi-way join queries on real-world datasets.

【Keywords】: multi-way join; distributed computing; parallel query; Apache Spark

Paper Link】 【Pages】:1730-1733

【Authors】: Xuefeng Chen ; Xin Cao ; Zhiqiang Xu ; Ying Zhang ; Shuo Shang ; Wenjie Zhang

【Abstract】: Given a set of server points (e.g., locations) P and a set of client points (e.g., users) O, the problem of maximizing bichromatic reverse k-nearest neighbor (MaxBRkNN) aims to find a region for setting up a new service site such that it can influence the most clients, i.e., it is in the kNN results of most client points. All existing studies first compute the kNN of client points and then perform the MaxBRkNN search. However, computing kNN for all clients is extremely time consuming especially on large datasets. Observing this, we develop an approach which computes kNN for only promising clients by utilising a two-level grid index (ADPGI) to reduce the cost substantially. Empirical studies on both real and synthetic datasets show that our proposed exact algorithm is 3 to 5 times faster than two state-of-the-art MaxBRkNN algorithms.

【Keywords】: k-nearest neighbor; MaxBRkNN; Grid index

178. Efficient Bottom-Up Discovery of Multi-scale Time Series Correlations Using Mutual Information.

Paper Link】 【Pages】:1734-1737

【Authors】: Nguyen Ho ; Torben Bach Pedersen ; Mai Vu ; Van Long Ho ; Christophe A. N. Biscio

【Abstract】: Recent developments in computing and IoT technology have enabled the daily generation of enormous amounts of time series data. These time series have to be analyzed to create value. A fundamental type of analysis is to find temporal correlations between given sets of time series. To provide a robust method for solving this problem, several properties are desirable. First, the method should have a strong theoretical foundation. Second, since temporal correlations can occur at different temporal scales, e.g., sub-second versus weekly, it is important that the method is capable of discovering multitemporal scale correlations. Finally, the method should be efficient and scalable. This paper presents an approach to search for synchronous correlations in big time series that displays all three properties: the proposed method (i) utilizes the metric of mutual information from information theory, providing a strong theoretical foundation, (ii) is able to discover correlations at multiple temporal scales, and (iii) works in an efficient, bottom-up fashion, making it scalable to large datasets. Our experiments verify that the proposed approach can identify various types of correlation relations across multiple temporal scales, while achieving a performance of an order of magnitude faster than the state-of-the-art techniques.

【Keywords】: temporal correlation, hill climbing, sliding window, mutual information

179. Fingerprinting Big Data: The Case of KNN Graph Construction.

Paper Link】 【Pages】:1738-1741

【Authors】: Rachid Guerraoui ; Anne-Marie Kermarrec ; Olivier Ruas ; François Taïani

【Abstract】: We propose fingerprinting, a new technique that consists in constructing compact, fast-to-compute and privacy-preserving binary representations of datasets. We illustrate the effectiveness of our approach on the emblematic big data problem of K-Nearest-Neighbor (KNN) graph construction and show that fingerprinting can drastically accelerate a large range of existing KNN algorithms, while efficiently obfuscating the original data, with little to no overhead. Our extensive evaluation of the resulting approach (dubbed GoldFinger) on several realistic datasets shows that our approach delivers speedups of up to 78.9% compared to the use of raw data while only incurring a negligible to moderate loss in terms of KNN quality.

【Keywords】: KNN graphs; fingerprint; similarity

180. Outer and Anti Joins in Temporal-Probabilistic Databases.

Paper Link】 【Pages】:1742-1745

【Authors】: Katerina Papaioannou ; Martin Theobald ; Michael H. Böhlen

【Abstract】: The result of a temporal-probabilistic (TP) join with negation includes, at each time point, the probability with which a tuple of a positive relation p matches none of the tuples in a negative relation n, for a given join condition θ. For the computation of TP joins with negation, we introduce generalized lineage-aware temporal windows, a mechanism that binds an interval to the lineages of all the matching valid tuples of each input relation. We compute these windows in an incremental manner, and we show that pipelined computations allow for the direct integration of our approach into PostgreSQL. We thereby alleviate the prevalent redundancies in the interval computations of existing approaches, which is proven by an extensive experimental evaluation with real-world datasets.

【Keywords】: outer join; anti join; negation; time; lineage

181. Workload-Driven Fragment Allocation for Partially Replicated Databases Using Linear Programming.

Paper Link】 【Pages】:1746-1749

【Authors】: Stefan Halfpap ; Rainer Schlosser

【Abstract】: In replication schemes, replica nodes can process read-only queries on snapshots of the master node without violating transactional consistency. By analyzing the workload, we can identify query access patterns and replicate data depending to its access frequency. In this paper, we define a linear programming (LP) model to calculate the set of partial replicas with the lowest overall memory capacity while evenly balancing the query load. Furthermore, we propose a scalable decomposition heuristic to calculate solutions for larger problem sizes. While guaranteeing the same performance as state-of-the-art heuristics, our decomposition approach calculates allocations with up to 23% lower memory footprint for the TPC-H benchmark.

【Keywords】: database replication; allocation problem; linear programming

182. OSMAC: Optimizing Subgraph Matching Algorithms with Community Structure.

Paper Link】 【Pages】:1750-1753

【Authors】: Yunkai Lou ; Chaokun Wang

【Abstract】: Subgraph has gained increasing attention as it is an important query type on graphs. The efficiency of existing subgraph matching algorithms becomes unsatisfactory since graphs gradually get larger and more complex. This paper proposes an optimization method named OSMAC to accelerate subgraph matching algorithms with the community structure of data graphs. In essence, OSMAC changes the task of subgraph matching into dealing with all VC-mappings. An optimization method named community-structure-based boundary pruning is proposed to further improve the performance of OSMAC. It implements an efficient pruning method with the information of community structure and can reduce the search space. As a case study, we optimize TurboISO, one of the state-of-the-art subgraph matching algorithms, with OSMAC. The results of the experiments conducted on real-world data sets confirm that OSMAC is efficient and can improve the performance of subgraph matching algorithms significantly.

【Keywords】: subgraph matching, community structure, optimization, vertex-community mapping

183. Workload-Aware Subgraph Query Caching and Processing in Large Graphs.

Paper Link】 【Pages】:1754-1757

【Authors】: Yongjiang Liang ; Peixiang Zhao

【Abstract】: A subgraph query q that finds as output all its subgraph-isomorphic embeddings from a data graph g has been core to modern declarative querying in large graphs. In this paper, we address subgraph queries with the availability of query workload information, W = {w 1 ,..., w n }, where w i ∈ W is a previously issued query with all its subgraph-isomorphic embeddings cached beforehand. We introduce a workload-aware subgraph querying framework, WaSQ, that leverages query workload for subgraph query rewriting, search plan refinement, partial results reusing, and false positive filtering towards facilitating the whole subgraph querying process. Experimental studies in real-world graphs demonstrate that WaSQ achieves significant and consistent performance gains in comparison with state-of-the-art, workload-oblivious solutions for large-scale subgraph querying.

【Keywords】: graphs; subgraph queries; graph matching

184. How I Learned to Stop Worrying and Love Re-optimization.

Paper Link】 【Pages】:1758-1761

【Authors】: Matthew Perron ; Zeyuan Shang ; Tim Kraska ; Michael Stonebraker

【Abstract】: Cost-based query optimizers remain one of the most important components of database management systems for analytic workloads. Though modern optimizers select plans close to optimal performance in the common case, a small number of queries are an order of magnitude slower than they could be. In this paper we investigate why this is still the case, despite decades of improvements to cost models, plan enumeration, and cardinality estimation. We demonstrate why we believe that a re-optimization mechanism is likely the most cost-effective way to improve end-to-end query performance. We find that even a simple re-optimization scheme can improve the latency of many poorly performing queries. We demonstrate that re-optimization improves the end-to-end latency of the top 20 longest running queries in the Join Order Benchmark by 27%, realizing most of the benefit of perfect cardinality estimation.

【Keywords】: Cardinality Estimation; Query Optimization; Dynamic Query Re-optimization

185. CRA: Enabling Data-Intensive Applications in Containerized Environments.

Paper Link】 【Pages】:1762-1765

【Authors】: Ibrahim Sabek ; Badrish Chandramouli ; Umar Farooq Minhas

【Abstract】: Today, a modern data center hosts a wide variety of applications comprising batch, interactive, machine learning, and streaming applications. In this paper, we factor out the commonalities in a large majority of these applications, into a generic dataflow layer called Common Runtime for Applications (CRA). In parallel, another trend, with containerization technologies (e.g., Docker), has taken a serious hold on cloud-scale data centers, with direct implications on building next generation of data center applications. Container orchestrators (e.g., Kubernetes) have made deployment a lot easy, and they solve many infrastructure level problems, e.g., service discovery, auto-restart, and replication. For best in class performance, there is a need to marry the next generation applications with containerization technologies. To that end, CRA leverages and builds upon the containerization and resource orchestration capabilities of Kubernetes/Docker, and makes it easy to build a wide range of cloud-edge applications on top. To the best of our knowledge, we are the first to present a cloud native runtime for building data center applications. We show the efficiency of CRA through various micro-benchmarking experiments.

【Keywords】: dataflow processing; distributed; containerization; docker; kubernetes

186. Scalable Similarity Joins of Tokenized Strings.

Paper Link】 【Pages】:1766-1777

【Authors】: Ahmed Metwally ; Chun-Heng Huang

【Abstract】: This work tackles the problem of fuzzy joining of strings that naturally tokenize into meaningful substrings, e.g., full names. Tokenized-string joins have several established applications in the context of data integration and cleaning. This work is primarily motivated by fraud detection, where attackers slightly modify tokenized strings, e.g., names on accounts, to create numerous identities that she can use to defraud service providers, e.g., Google, and LinkedIn. To detect such attacks, all the accounts are pair-wise compared, and the resulting similar accounts are considered suspicious and are further investigated. Comparing the tokenized-string features of a large number of accounts requires an intuitive tokenized-string distance that can detect subtle edits introduced by an adversary, and a very scalable algorithm. This is not achievable by existing distance measure that are unintuitive, hard to tune, and whose join algorithms are serial and hence unscalable. We define a novel intuitive distance measure between tokenized strings, Normalized Setwise Levenshtein Distance (NSLD). To the best of our knowledge, NSLD is the first metric proposed for comparing tokenized strings. We propose a scalable distributed framework, Tokenized-String Joiner (TSJ), that adopts existing scalable string-join algorithms as building blocks to perform NSLD-joins. We carefully engineer optimizations and approximations that dramatically improve the efficiency of TSJ. The effectiveness of the TSJ framework is evident from the evaluation conducted on tens of millions of tokenized-string names from Google accounts. The superiority of the tokenized-string-specific TSJ framework over the general-purpose metric-spaces joining algorithms has been established.

【Keywords】: Distributed Algorithms; Fraud Detection; Data Cleaning; Experiments; Performance; Fuzzy Joins

187. MLlib*: Fast Training of GLMs Using Spark MLlib.

Paper Link】 【Pages】:1778-1789

【Authors】: Zhipeng Zhang ; Jiawei Jiang ; Wentao Wu ; Ce Zhang ; Lele Yu ; Bin Cui

【Abstract】: In Tencent Inc., more than 80% of the data are extracted and transformed using Spark. However, the commonly used machine learning systems are TensorFlow, XGBoost, and Angel, whereas Spark MLlib, an official Spark package for machine learning, is seldom used. One reason for this ignorance is that it is generally believed that Spark is slow when it comes to distributed machine learning. Users therefore have to undergo the painful procedure of moving data in and out of Spark. The question why Spark is slow, however, remains elusive. In this paper, we study the performance of MLlib with a focus on training generalized linear models using gradient descent. Based on a detailed examination, we identify two bottlenecks in MLlib, i.e., pattern of model update and pattern of communication. To address these two bottlenecks, we tweak the implementation of MLlib with two state-of-the-art and well-known techniques, model averaging and AllReduce. We show that, the new system that we call MLlib*, can significantly improve over MLlib and achieve similar or even better performance than other specialized distributed machine learning systems (such as Petuum and Angel), on both public and Tencent's workloads.

【Keywords】: Distributed Machine Learning, Spark, Generalized Linear Models, Gradient Descent

188. DirectLoad: A Fast Web-Scale Index System Across Large Regional Centers.

Paper Link】 【Pages】:1790-1801

【Authors】: An Qin ; Mengbai Xiao ; Jin Ma ; Dai Tan ; Rubao Lee ; Xiaodong Zhang

【Abstract】: The freshness of web page indices is the key to improving searching quality of search engines. In Baidu, the major search engine in China, we have developed DirectLoad, an index updating system for efficiently delivering the webscale indices to nationwide data centers. However, the web-scale index updating suffers from increasingly high data volumes during network transmission and inefficient I/O transactions due to slow disk operations. DirectLoad accelerates the index updating streams from two aspects: 1) DirectLoad effectively cuts down the overwhelmingly high volume of indices in transmission by removing the redundant data across versions, and mutates regular operations in a key-value storage system for successful accesses to the deduplicated datasets. 2) DirectLoad significantly improves the I/O efficiency by replacing the LSMTree with a memory-resident table (memtable) and appendingonly-files (AOFs) on disk. Specifically, the write amplification stemming from sorting operations on disk is eliminated, and a lazy garbage collection policy further improves the I/O performance at the software level. In addition, DirectLoad directly manipulates the SSD native interfaces to remove the write amplification at the hardware level. In practice, 63% updating bandwidth has been saved due to the deduplication, and the write throughput to SSDs is increased by 3x. The index updating cycle of our production workloads has been compressed from 15 days to 3 days after deploying DirectLoad. In this paper, we show the effectiveness and efficiency of an in-memory index updating system, which is disruptive to the framework in a conventional memory hierarchy. We hope that this work contributes a strong case study in the system research literature.

【Keywords】: index update; LSM tree storage engine; search engine; distributed storage

189. Presto: SQL on Everything.

Paper Link】 【Pages】:1802-1813

【Authors】: Raghav Sethi ; Martin Traverso ; Dain Sundstrom ; David Phillips ; Wenlei Xie ; Yutian Sun ; Nezih Yegitbasi ; Haozhun Jin ; Eric Hwang ; Nileema Shingte ; Christopher Berner

【Abstract】: Presto is an open source distributed query engine that supports much of the SQL analytics workload at Facebook. Presto is designed to be adaptive, flexible, and extensible. It supports a wide variety of use cases with diverse characteristics. These range from user-facing reporting applications with sub-second latency requirements to multi-hour ETL jobs that aggregate or join terabytes of data. Presto's Connector API allows plugins to provide a high performance I/O interface to dozens of data sources, including Hadoop data warehouses, RDBMSs, NoSQL systems, and stream processing systems. In this paper, we outline a selection of use cases that Presto supports at Facebook. We then describe its architecture and implementation, and call out features and performance optimizations that enable it to support these use cases. Finally, we present performance results that demonstrate the impact of our main design decisions.

【Keywords】: SQL; query engine; big data; data warehouse

190. Improving RDF Query Performance Using In-memory Virtual Columns in Oracle Database.

Paper Link】 【Pages】:1814-1819

【Authors】: Eugene Inseok Chong ; Matthew Perry ; Souripriya Das

【Abstract】: Many RDF Knowledge Graph Stores use IDs to represent triples to save storage and for ease of maintenance. Oracle is no exception. While this design is good for a small footprint on disk, it incurs overhead in query processing, as it requires joins with the value table to return results or process aggregates/filter/order-by queries. It becomes especially problematic as the result size increases or the number of projected variables increases. Depending on queries, the value table join could take up most of the query processing time. In this paper, we propose to use in-memory virtual columns to avoid value table joins. It has advantages in that it does not increase the footprint on disk, and at the same time it avoids the value table joins by utilizing in-memory virtual columns. The idea is to materialize the values only in memory and utilize the compression and vector processing that come with Oracle Database In-Memory technology. Typically, the value table in RDF is small compared to the triples table. Therefore, its footprint is manageable in memory, especially with compression in columnar format. The same mechanism can be applied to any application where there exists a one-to-one mapping between an ID and its value, such as data warehousing or data marts. The mechanism has been implemented in Oracle 18c. Experimental results using the LUBM1000 benchmark show up to two orders of magnitude query performance improvement.

【Keywords】: Knowledge Graphs, Semantic Technologies, RDF, Query Processing, Graph Databases, In-Memory Processing, Virtual Columns

191. SEBDB: Semantics Empowered BlockChain DataBase.

Paper Link】 【Pages】:1820-1831

【Authors】: Yanchao Zhu ; Zhao Zhang ; Cheqing Jin ; Aoying Zhou ; Ying Yan

【Abstract】: Blockchain has been adopted in many applications to construct trust among multiple participants, such as supply chain management, digital assets transfer, philanthropy, etc. Blockchain platforms are often used as decentralized databases. However, existing blockchain platforms are far less convenient to use than traditional databases. They are lack of the capability of modelling complex tasks conveniently and efficiently, especially when both on-chain and off-chain data are involved at the same time. In this paper, we propose and implement a novel blockchain database, called SEBDB, which leverages the existing databases' functionality which are optimized for decades. Comparing to existing works, SEBDB is the first platform which considers both useability and scalability. Specifically, first, weaddrelationaldata semantics into blockchain platform, where each transaction is a tuple with multiple attributes in a pre-defined table. Second, we use SQL-like language as the general interface, instead of code-level APIs, to support convenient application development, in which intrinsic operations are re-defined and re-implemented to suit for blockchain platform. Third, as RDBMS has achieved great success in the past decades, our system, though not relying on RDBMS, treats it as an important component. Finally, we define a mini-benchmark to evaluate the performance of the blockchain database. Extensive experiments demonstrate the effectiveness and efficiency of our proposed system.

【Keywords】: Blockchaindatabase, relationalsemantics, query processing, data storage

192. Large Scale Traffic Signal Network Optimization - A Paradigm Shift Driven by Big Data.

Paper Link】 【Pages】:1832-1840

【Authors】: Liang Yu ; Jinqiang Yu ; Maolei Zhang ; Xin Zhang ; Yuehu Liu ; Hui Zhang ; Wanli Min

【Abstract】: Traffic signal is the key method for city traffic control. Existing signal control systems use the loop detector data as the main input which is nearsighted in terms of both space and time. Lacking of effective data collection methods has hindered the development of more sophisticated models. It is therefore very hard to develop an optimization model considering all signals in a region or even a city. In this paper, we will introduce our method for large scale traffic signal optimization, which is the major module of Alibaba's city brain solution. By integrating multiple data sources to sense the whole city's traffic conditions, a layered model is developed based on the divide-and-conquer paradigm to gradually apply different types of data-driven optimization algorithms. It is a paradigm shift to use big data to improve a traditionally closed signal control system, and its effectiveness has been proven in a field test in Shanghai city.

【Keywords】: Signal control, green-wave, graph partition, optimization

193. Domain-Independent Automated Processing of Free-Form Text Data in Telecom.

Paper Link】 【Pages】:1841-1849

【Authors】: Rajarshi Bhowmik ; Ahmet Akyamaç

【Abstract】: Free-form, unstructured and semi-structured textual data has become increasingly more prevalent in the telecommunications industry, with service and equipment providers alike. Some typical examples include textual data from customer care tickets, machine logs, alarm and alerting systems, and diagnostics. There is a growing business need to rapidly and automatically understand the underlying key topics and categories of this bulk collection of text. With the present mode of operation of relying on domain experts to analyze textual data, there is a clear need to apply text analytics to automate the process. Difficulties arise due to the jargon-filled and fragmented, incomplete nature of textual data in this field. In this paper, we propose a domain-agnostic, unsupervised approach that deploys a multi-stage text processing pipeline for automatically discovering the key topics and categories from free-form text documents. Using anonymized datasets retrieved from actual customer care tickets and system logs, we show that our approach outperforms traditional text mining approaches, and performs comparably to manual categorization tasks that were undertaken by domain experts with full system knowledge.

【Keywords】: Free-form textual data; key-phrase extraction; text processing pipeline; telecommunications industry

194. DRIVEN: a Framework for Efficient Data Retrieval and Clustering in Vehicular Networks.

Paper Link】 【Pages】:1850-1861

【Authors】: Bastian Havers ; Romaric Duvignau ; Hannaneh Najdataei ; Vincenzo Gulisano ; Ashok Chaitanya Koppisetty ; Marina Papatriantafilou

【Abstract】: Applications for adaptive (sometimes also called smart) Cyber-Physical Systems are blossoming thanks to the large volumes of data, sensed in a continuous fashion, in large distributed systems. The benefits of these applications come nonetheless with a price: the need for jointly addressing challenges in efficient data communication and analysis (among others). The goal of the DRIVEN framework, presented here, is to address these challenges for a data gathering and distance-based clustering tool in the context of vehicular networks. Because of the limited communication bandwidth (compared to the volume of sensed data) of vehicular networks and the monetary costs of data transmission, the intuition behind DRIVEN is to avoid gathering the data to be clustered in a raw format from each vehicle, but rather to allow for a streaming-based error-bounded approximation, through Piecewise Linear Approximation, to compress the volumes of data to be gathered. At the same time, rather than relying on a batch-based clustering algorithm that requires all the data to be first gathered (and then clustered), DRIVEN relies on and extends a streaming-based clustering algorithm that leverages the inherent ordering of the spatial and temporal data being collected, to perform the clustering in an online fashion, while data is being retrieved. As we show, based on our prototype implementation using Apache Flink and our evaluation with real-world data such as GPS and LiDAR, the accuracy loss for the clustering performed on the reconstructed data can be small, even when the raw data is compressed to 10-35% of its original size, and the transferring of data itself can be completed in up to one-tenth of the duration observed when gathering raw data.

【Keywords】: compression; streaming data; clustering; edge computing; fog computing

195. Accurate Product Attribute Extraction on the Field.

Paper Link】 【Pages】:1862-1873

【Authors】: Martin Rezk ; Laura Alonso Alemany ; Lasguido Nio ; Ted Zhang

【Abstract】: In this paper we present a bootstrapping approach for attribute value extraction that minimizes the need for human intervention. Our approach automatically extracts attribute names and values from semi-structured text, generates a small labelled dataset, and bootstraps it by extracting new values from unstructured text. It is domain/language-independent, relying only on existing semi-structured text to create the initial labeled dataset. We assess the impact of different machine learning approaches to increase precision of the core approach without compromising coverage. We perform an extensive evaluation using e-commerce product data across different categories in two languages and hundreds of thousands of product pages. We show that our approach provides high precision and good coverage. In addition, we study the impact of different methods that address specific sources of error. With error analysis we highlight how these methods complement each other, obtaining insights about the individual methods and the ensamble as a whole.

【Keywords】: information extraction; attribute extraction; semantic information extraction; crf; bootstrapping; lstm

196. CATS: Cross-Platform E-Commerce Fraud Detection.

Paper Link】 【Pages】:1874-1885

【Authors】: Haiqin Weng ; Shouling Ji ; Fuzheng Duan ; Zhao Li ; Jianhai Chen ; Qinming He ; Ting Wang

【Abstract】: Nowadays, the popularity of e-commerce has brought huge economic benefits to factories, third-party merchants, and e-commerce service providers. Driven by such huge economic benefits, malicious merchants attempt to promote items through inserting fraudulent purchases, fake review scores, and/or feedback, into them. Mitigating this threat is challenging due to the difficulty of obtaining internal e-commerce data, the variance of e-commerce services used by malicious merchants, and the reluctance of service providers in cooperation. In this paper, we present an efficient, platform-independent, and robust e-commerce fraud detection system, CATS, to detect frauds for different large-scale e-commerce platforms. We implement the design of CATS into a prototype system and evaluate this prototype on the world's popular e-commerce platform Taobao. The evaluation result on Taobao shows that CATS can achieve a high accuracy of 91% in detecting frauds. Based on this success, we then apply CATS on another large-scale e-commerce platforms, and again CATS achieves an accuracy of 96%, suggesting that CATS is very effective on real e-commerce platforms. Based on the cross-platform evaluation results, we conduct a comprehensive analysis on the reported frauds and reveal several abnormal yet interesting behaviors of those reported frauds. Our study in this paper is expected to shed light on defending against frauds for various e-commerce platforms.

【Keywords】: fraud detection; e commerce fraud

197. Caladrius: A Performance Modelling Service for Distributed Stream Processing Systems.

Paper Link】 【Pages】:1886-1897

【Authors】: Faria Kalim ; Thomas Cooper ; Huijun Wu ; Yao Li ; Ning Wang ; Neng Lu ; Maosong Fu ; Xiaoyao Qian ; Hao Luo ; Da Cheng ; Yaliang Wang ; Fred Dai ; Mainak Ghosh ; Beinan Wang

【Abstract】: Real-time stream processing has become increasingly important in recent years and has led to the development of a multitude of stream processing systems. Given the varying job workloads that characterize stream processing, these systems need to be tuned and adjusted to maintain performance targets in the face of variation in incoming traffic. Current auto-scaling systems adopt a series of trials to approach a job's expected performance due to a lack of performance modelling tools. We find that general traffic trends in most jobs lend themselves well to prediction. Based on this premise, we built a system called Caladrius that forecasts the future traffic load of a stream processing job and predicts its processing performance after a proposed change to the parallelism of its operators. Experimental results show that Caladrius is able to estimate a job's throughput performance and CPU load under a given scaling configuration.

【Keywords】: Stream Processing; Performance Prediction

Paper Link】 【Pages】:1898-1903

【Authors】: Zhao Li ; Junshuai Song ; Shichang Hu ; Shasha Ruan ; Long Zhang ; Zehong Hu ; Jun Gao

【Abstract】: Fraud sellers in e-commerce usually promote their products via fake transactions. Such behaviors damage the reputation of the e-commerce platform and jeopardize the business environment in the platform. The search engine of existing e-commerce platforms mainly focuses on generating transactions by matching users' queries and sellers' products. The most common method to defense fraud sellers is to set up a blacklist based on fraud detection and manual investigation, and then punish those sellers in the list, which is inefficient and can only cover a small fraction of potential fraud sellers. In this paper, we propose the first fraud aware impression regulation system (FAIR) which is data-driven and can work in large-scale e-commerce platforms. Its main function is to actively regulate the impressions received by all potential fraud sellers in a real-time fashion. It utilizes the reinforcement learning architecture to dynamically adjust the impression regulation strategy under different reward settings, which can not only promote the impression regulation effects, but also improve the revenue of the platform simultaneously. We deploy FAIR on the Taobao platform of Alibaba, one of the world's largest e-commerce search platform, and perform an A/B test for two weeks. The results show that FAIR can effectively reduce the fraud impressions and improve the overall platform revenue at the same time.

【Keywords】: E-Commerce; Search Engine; Impression Regulation; Fraud Sellers

Paper Link】 【Pages】:1904-1909

【Authors】: Muhammad Asiful Islam ; Ramakrishnan Srikant ; Sugato Basu

【Abstract】: Click-through rate (CTR) is a key signal of relevance for search engine results, both organic and sponsored. CTR of a result has two core components: (a) the probability of examination of a result by a user, and (b) the perceived relevance of the result given that it has been examined by the user. There has been considerable work on user browsing models, to model and analyze both the examination and the relevance components of CTR. In this paper, we propose a novel formulation: a micro-browsing model for how users read result snippets. The snippet text of a result often plays a critical role in the perceived relevance of the result. We study how particular words within a line of snippet can influence user behavior. We validate this new micro-browsing user model by considering the problem of predicting which snippet will yield higher CTR, and show that classification accuracy is dramatically higher with our micro-browsing user model. The key insight in this paper is that varying relatively few words within a snippet, and even their location within a snippet, can have a significant influence on the clickthrough of a snippet.

【Keywords】: Sponsored Search, Machine Learning

200. Interpretable Multi-task Learning for Product Quality Prediction with Attention Mechanism.

Paper Link】 【Pages】:1910-1921

【Authors】: Cheng-Han Yeh ; Yao-Chung Fan ; Wen-Chih Peng

【Abstract】: In this paper, we investigate the problem of mining multivariate time series data generated from sensors mounted on manufacturing stations for early product quality prediction. In addition to accurate quality prediction, another crucial requirement for industrial production scenarios is model interpretability, i.e., to understand the significance of an individual time series with respect to the final quality. Aiming at the goals, this paper proposes a multi-task learning model with an encoder-decoder architecture augmented by the matrix factorization technique and the attention mechanism. Our model design brings two major advantages. First, by jointly considering the input multivariate time series reconstruction task and the quality prediction in a multi-task learning model, the performance of the quality prediction task is boosted. Second, by incorporating the matrix factorization technique, we enable the proposed model to pay/learn attentions on the component of the multivariate time series rather than on the time axis. With the attention on components, the correlation between a sensor reading and a final quality measure can be quantized to improve the model interpretability. Comprehensive performance evaluation on real data sets is conducted. The experimental results validate that strengths of the proposed model on quality prediction and model interpretability.

【Keywords】: Early stage prediction, quality prediction, neural network, deep learning, multi task learning, encoder decoder, attention mechanism

201. Learning Effective Embeddings From Crowdsourced Labels: An Educational Case Study.

Paper Link】 【Pages】:1922-1927

【Authors】: Guowei Xu ; Wenbiao Ding ; Jiliang Tang ; Songfan Yang ; Gale Yan Huang ; Zitao Liu

【Abstract】: Learning representation has been proven to be helpful in numerous machine learning tasks. The success of the majority of existing representation learning approaches often requires a large amount of consistent and noise-free labels. However, labels are not accessible in many real-world scenarios and they are usually annotated by the crowds. In practice, the crowdsourced labels are usually inconsistent among crowd workers given their diverse expertise and the number of crowdsourced labels is very limited. Thus, directly adopting crowdsourced labels for existing representation learning algorithms is inappropriate and suboptimal. In this paper, we investigate the above problem and propose a novel framework of Representation Learning with crowdsourced Labels, i.e., "RLL", which learns representation of data with crowdsourced labels by jointly and coherently solving the challenges introduced by limited and inconsistent labels. The proposed representation learning framework is evaluated in two real-world education applications. The experimental results demonstrate the benefits of our approach on learning representation from limited labeled data from the crowds, and show RLL is able to outperform state-of-the-art baselines. Moreover, detailed experiments are conducted on RLL to fully understand its key components and the corresponding performance.

【Keywords】: representation learning; education data mining; crowdsourcing

202. A Prescription Trend Analysis using Medical Insurance Claim Big Data.

Paper Link】 【Pages】:1928-1939

【Authors】: Kazutoshi Umemoto ; Kazuo Goda ; Naohiro Mitsutake ; Masaru Kitsuregawa

【Abstract】: Understanding the spread of diseases and the use of medicines is of practical importance for various organizations, such as medical providers, medical payers, and national governments. This study aims to detect the change in the prescription trends and to identify its cause through an analysis of Medical Insurance Claims (MICs), which comprise the specifications of medical fees charged to health insurers. Our approach is two-fold. (1) We propose a latent variable model that simulates the medication behavior of physicians to accurately reproduce monthly prescription time series from the MIC data, where prescription links between the diseases and medicines are missing. (2) We apply a state space model with intervention variables to decompose the monthly prescription time series into different components including seasonality and structural changes. Using a large dataset consisting of 3.5-year MIC records, we conduct experiments to evaluate our approach in terms of accuracy, usefulness, and efficiency. We also demonstrate three applications for our medical analysis.

【Keywords】: medical insurance claim; time series; change detection; link prediction; medical data mining

203. Differential Data Quality Verification on Partitioned Data.

Paper Link】 【Pages】:1940-1945

【Authors】: Sebastian Schelter ; Stefan Grafberger ; Philipp Schmidt ; Tammo Rukat ; Mario Kießling ; Andrey Taptunov ; Felix Bießmann ; Dustin Lange

【Abstract】: Modern companies and institutions rely on data to guide every single decision. Missing or incorrect information seriously compromises any decision process. In previous work, we presented Deequ, a Spark-based library for automating the verification of data quality at scale. Deequ provides a declarative API, which combines common quality constraints with user-defined validation code, and thereby enables "unit tests for data". However, we found that the previous computational model of Deequ is not flexible enough for many scenarios in modern data pipelines, which handle large, partitioned datasets. Such scenarios require the evaluation of dataset-level quality constraints after individual partition updates, without having to re-read already processed partitions. Additionally, such scenarios often require the verification of data quality on select combinations of partitions. We therefore present a differential generalization of the computational model of Deequ, based on algebraic states with monoid properties. We detail how to efficiently implement the corresponding operators and aggregation functions in Apache Spark. Furthermore, we show how to optimize the resulting workloads to minimize the required number of passes over the data, and empirically validate that our approach decreases the runtimes for updating data metrics under data changes and for different combinations of partitions.

【Keywords】: data quality; data validation; incremental data processing

204. Logan: A Distributed Online Log Parser.

Paper Link】 【Pages】:1946-1951

【Authors】: Amey Agrawal ; Rohit Karlupia ; Rajat Gupta

【Abstract】: Logs serve as a critical tool for debugging and monitoring applications. However, gaining insights from unstructured logs is difficult. Hence, many log management and analysis applications first parse logs into structured templates. In this paper, we train a data-driven log parser on our new Apache Spark dataset, the largest application log dataset yet. We implement a distributed online algorithm to accommodate for the large volume of data. We also devise a new metric for evaluation of parsers when labeled data is unavailable. We show that our method generalizes over diverse datasets without any parameter tuning or domain-specific inputs from the user. When evaluated on publicly available HDFS dataset our method performs 13x faster than the previous state-of-the-art.

【Keywords】: Log parsing; Online algorithm; Distributed processing

205. WebPut: A Web-Aided Data Imputation System for the General Type of Missing String Attribute Values.

Paper Link】 【Pages】:1952-1955

【Authors】: Shuangli Shan ; Zhixu Li ; Yang Li ; Qiang Yang ; Jia Zhu ; Mohamed A. Sharaf ; Xiaofang Zhou

【Abstract】: In this demonstration, we present an end-to-end web-aided data imputation prototype system named WebPut. WebPut consults the Web for imputing the missing values in a local database when the traditional inferring-based imputation method has difficulties in getting the right answers. Specifically, WebPut investigates the interaction between the local inferring-based imputation methods and the web-based retrieving methods and shows that retrieving a small number of selected missing values can greatly improve the imputation recall of the inferring-based methods. Besides, WebPut also incorporates a crowd intervention component that can get advice from humans in case that the web-based imputation methods may have difficulties in making the right decisions. We demonstrate, step by step, how WebPut fills an incomplete table with each of its components.

【Keywords】: Data Imputation, WebPut, Incomplete Data

206. FGreat: Focused Graph Query Autocompletion.

Paper Link】 【Pages】:1956-1959

【Authors】: Nathan Ng ; Peipei Yi ; Zhiwei Zhang ; Byron Choi ; Sourav S. Bhowmick ; Jianliang Xu

【Abstract】: Composing queries is evidently a tedious task. This is particularly true of graph queries as they are typically complex and prone to errors. This is compounded by the fact that graph schemas can be missing or too loose to be helpful for query formulation. Graph Query AutoCompletion (gQAC) alleviates users from the potentially painstaking task of graph query formulation. This demonstration presents an interactive visual Focused GRaph quEry AutocompleTion framework, called FGreat. Its novelty relies on the user focus for gQAC, which is a subgraph of the current query that a user is focusing on. FGreat automatically computes a focus and completes the query at the focus, as opposed to an arbitrary query subgraph. This demonstration presents two complementary approaches to compute the user focus for different circumstances. It computes the focus from either (i) the sequence of edges that a user recently added to his/her query, or (ii) the position of the mouse cursor, if it is available. We demonstrate that the user focus enhances both the effectiveness and efficiency of graph query autocompletion.

【Keywords】: subgraph query; graph autocompletion; graphs; database usability

207. Aucher: Multi-modal Queries on Live Audio Streams in Real-Time.

Paper Link】 【Pages】:1960-1963

【Authors】: Zeyi Wen ; Mingyu Liang ; Bingsheng He ; Zexin Xia ; Bo Li

【Abstract】: This paper demonstrates a real-time search system called Aucher for live audio streams. Audio streaming services (e.g., Mixlr, Ximalaya, Lizhi and Facebook Live Audio) have become increasingly popular with the wide use of smart phones. Because of the popularity of audio broadcasting, the data volume of live audio streams is also ever increasing. Searching and indexing these audio streams is an important and challenging problem. Aucher is a system prototype which can support both voice search and keyword search on audio streams. We achieve the real-time response for queries by our novel index which exploits log structured merge-trees and supports multi-modal search. Moreover, our system can handle insertion about four times faster and more memory efficient than the state-of-the-art solution. We plan to demonstrate searching live audio streams by keywords and voice, illustrate the trade-off of freshness, popularity and relevance on query results, perform searching hot terms, and show the ability of searching live audio streams in real-time.

【Keywords】: Real Time Search; Audio Streams; Multi Modal Query; Indexing

208. SAC: A System for Big Data Lineage Tracking.

Paper Link】 【Pages】:1964-1967

【Authors】: MingJie Tang ; Saisai Shao ; Weiqing Yang ; Yanbo Liang ; Yongyang Yu ; Bikas Saha ; Dongjoon Hyun

【Abstract】: In the era of big data, a data processing flow contains various types of tasks. It is nontrivial to discover the data flow/movement from its source to destination, such that monitoring different transformations and hops on its way in an enterprise environment. Therefore, data lineage or provenance is useful to learn how the data gets transformed along the way, how the representation and parameters change, and how the data splits or converges after each hop. However, existing systems offer limited support for such use cases in a distributed computing setup. To address this issue, we build Spark-Atlas-Connector (short as SAC), a new system to track data lineage in a distributed computation platform, e.g., Spark. SAC tracks different processes involved in the data flow and their dependencies, supporting different data storage (e.g., HBase, HDFS, and Hive) and data processing paradigms (e.g., SQL, ETL, machine learning, and streaming). SAC provides a visual representation of data lineage to track data from its origin to downstreams, and is deployed in a distributed production environment for demonstrating its efficiency and scalability.

【Keywords】: big data; data lineage; data provenance

209. A Gossip-Based System for Fast Approximate Score Computation in Multinomial Bayesian Networks.

Paper Link】 【Pages】:1968-1971

【Authors】: Arun Zachariah ; Praveen Rao ; Anas Katib ; Monica Senapati ; Kobus Barnard

【Abstract】: In this paper, we present a system for fast approximate score computation, a fundamental task for score-based structure learning of multinomial Bayesian networks. Our work is motivated by the fact that exact score computation on large datasets is very time consuming. Our system enables approximate score computation on large datasets in an efficient and scalable manner with probabilistic error bounds on the statistics required for score computation. Our system has several novel features including gossip-based decentralized computation of statistics, lower resource consumption via a probabilistic approach of maintaining statistics, and effective distribution of tasks for score computation using hashing techniques. The demo will provide a real-time and interactive experience to a user on how our system employs the principle of gossiping and hashing techniques in a novel way for fast approximate score computation. The user will be able to control different aspects of our system's execution on a cluster with up to 32 nodes. The approximate scores output by our system can be then used by existing score-based structure learning algorithms.

【Keywords】: Bayesian networks; gossip algorithms; approximate score computation; large scale data

210. Faster, Higher, Stronger: Redesigning Spreadsheets for Scale.

Paper Link】 【Pages】:1972-1975

【Authors】: Mangesh Bendre ; Tana Wattanawaroon ; Sajjadur Rahman ; Kelly Mack ; Yuyang Liu ; Shichu Zhu ; Yu Lu ; Ping-Jing Yang ; Xinyan Zhou ; Kevin Chen-Chuan Chang ; Karrie Karahalios ; Aditya G. Parameswaran

【Abstract】: Spreadsheet tools are ubiquitous for interactive adhoc data management and analysis. With increasing dataset sizes, spreadsheet tools fall short-they freeze during heavy computation within the sheet (interactivity); they are hard to navigate when datasets go beyond a certain size (navigability); they only support cell-at-a-time computation, severely limiting analysis capabilities (expressiveness). We have been developing DATASPREAD to holistically unify databases and spreadsheets to leverage the benefits of both, with a spreadsheet-like front-end and a database-like backend. We demonstrate three key features of DATASPREAD to address the aforementioned spreadsheet scalability challenges in interactivity, navigability, and expressiveness1. Our demonstration will let attendees perform typical analysis tasks on Microsoft Excel and DATASPREAD side-by-side, providing a clear understanding of the improvements offered by DATASPREAD over traditional spreadsheet tools.

【Keywords】: Spreadsheets; Scalability; Expressivity; Navigation

211. RecovDB: Accurate and Efficient Missing Blocks Recovery for Large Time Series.

Paper Link】 【Pages】:1976-1979

【Authors】: Ines Arous ; Mourad Khayati ; Philippe Cudré-Mauroux ; Ying Zhang ; Martin L. Kersten ; Svetlin Stalinlov

【Abstract】: With the emergence of the Internet of Things (IoT), time series data has become ubiquitous in our daily life. Making sense of time series is a topic of great interest in many domains. Existing time series analysis applications generally assume or even require perfect time series (i.e. regular time intervals without unknown values), but real-world time series are rarely so neat. They often contain "holes" of different sizes (i.e., single missing values, or blocks of consecutive missing values) due to some failures or irregular time intervals. Hence, missing value recovery is a prerequisite for many time series analysis applications. In this demo, we present RecovDB, a relational database system enhanced with advanced matrix decomposition technology for missing blocks recovery. This demo will show the main features of RecovDB that are important for today's time series analysis but are lacking in state-of-the-art technologies: i) recovering large missing blocks in multiple time series at once; ii) achieving high recovery accuracy by benefiting from different correlations across time series; iii) maintaining recovery accuracy under increasing size of missing blocks; iv) maintaining recovery efficiency with increasing time series' lengths and the number of time series; and iv) supporting all these features while being parameter-free. In this paper, we also compare the efficiency and accuracy of RecovDB against state-of-the-art recovery systems.

【Keywords】: Time Series; missing values recovery; Centroid Decomposition

212. AI Pro: Data Processing Framework for AI Models.

Paper Link】 【Pages】:1980-1983

【Authors】: Richie Frost ; Debjyoti Paul ; Feifei Li

【Abstract】: We present AI Pro, an open-source framework for data processing with Artificial Intelligence (AI) models. Our framework empowers its users with immense capability to transform raw data into meaningful information with a simple configuration file. AI Pro's configuration file generates a data pipeline from start to finish with as many data transformations as desired. AI Pro supports major deep learning frameworks and Open Neural Network Exchange (ONNX), which allows users to choose models from any AI frameworks supported by ONNX. Its wide range of features and user friendly web interface grants everyone the opportunity to broaden their AI application horizons, irrespective of the user's technical expertise. AI Pro has all the quintessential features to perform end-to-end data processing, which we demonstrate using two real world scenarios.

【Keywords】: data processing; pipeline; framework; realtime; AI Processing

213. IVLG: Interactive Visualization of Large Graphs.

Paper Link】 【Pages】:1984-1987

【Authors】: Maria Krommyda ; Verena Kantere ; Yannis Vassiliou

【Abstract】: There has been significant effort in recent years to explore and navigate very large linked datasets, due to the increase of their availability. Many techniques have been developed that extract the information from such datasets and present it to the user as diagrams, while others take advantage of the hierarchies of the datasets to filter and aggregate them, allowing the users to access specific information. In order to overcome the limitations regarding the volume of the presented information, we have developed a novel technique that enables the interactive visualization as one continuous graph of datasets with millions of elements. IVLG is a fully fledged prototype system that implements this technique based on a client-server architecture, enabling many users to have concurrently access to the information through a user-friendly interface. It allows the user to navigate the dataset through different levels of abstraction and locate information using innovative exploration techniques. A carefully designed storage schema along with an API that takes advantage of the appropriate indexing handles datasets with millions of elements without raising any performance issues, even when accessed from devices with limited computational resources. The proposed demonstration showcases the advantages and technical features of IVLG. A series of demonstration scenarios will show how IVLG can adapt accordingly and handle diverse real and synthetic datasets that vary on size and average node degree and how the system functionalities support the user experience and the exploration of the information.

【Keywords】: data visualization; spatial storage; spatial indexing; data exploration

214. Just in Time: Personal Temporal Insights for Altering Model Decisions.

Paper Link】 【Pages】:1988-1991

【Authors】: Naama Boer ; Daniel Deutch ; Nave Frost ; Tova Milo

【Abstract】: The interpretability of complex Machine Learning models is coming to be a critical social concern, as they are increasingly used in human-related decision-making processes such as resume filtering or loan applications. Individuals receiving an undesired classification are likely to call for an explanation - preferably one that specifies what they should do in order to alter that decision when they reapply in the future. Existing work focuses on a single ML model and a single point in time, whereas in practice, both models and data evolve over time: an explanation for an application rejection in 2018 may be irrelevant in 2019 since in the meantime both the model and the applicant's data can change. To this end, we propose a novel framework that provides users with insights and plans for changing their classification in particular future time points. The solution is based on combining state-of-the-art algorithms for (single) model explanations, ones for predicting future models, and database-style querying of the obtained explanations. We propose to demonstrate the usefulness of our solution in the context of loan applications, and interactively engage the audience in computing and viewing suggestions tailored for applicants based on their unique characteristic.

【Keywords】: Interpretability; Accountability; Temporal

215. GeoSparkViz in Action: A Data System with Built-in Support for Geospatial Visualization.

Paper Link】 【Pages】:1992-1995

【Authors】: Jia Yu ; Anique Tahir ; Mohamed Sarwat

【Abstract】: Visualizing data on maps is deemed a powerful tool for data scientists to make sense of geospatial data. The geospatial map visualization (abbr. MapViz) process first loads the designated geospatial data, processes the data and then applies the map visualization effect. Guaranteeing detailed and accurate geospatial MapViz (e.g., at multiple zoom levels) requires extremely high-resolution maps. Classic solutions suffer from limited computation resources while scalable MapViz system architectures are not able to co-optimize the data management and visualization phases in the same system. This paper demonstrates GeoSparkViz, a full-fledged system that allows the user to load, prepare, integrate and execute MapViz tasks in the same system. For demonstration purpose, we implemented a web interface using a node.js web server, Baidu echarts library, and MapBox on top of GeoSparkViz to visually explore patterns in the New York City Taxi Trips dataset. The demonstration scenarios show how the data preparation and map visualization phases are combined in GeoSparkViz.

【Keywords】: geospatial; visualization; distributed computing

216. Hybrid.Poly: A Consolidated Interactive Analytical Polystore System.

Paper Link】 【Pages】:1996-1999

【Authors】: Maksim Podkorytov ; Michael N. Gubanov

【Abstract】: Anecdotal evidence suggests the Variety of Big data is one of the most challenging problems in Computer Science research today [1]. First, Big data arrives from a myriad of data sources, hence its shape and flavor differ. Second, hundreds of different Big data management systems support different APIs, storage/indexing schemes, and expose data to the users through their data model lens, each specific to their own system. All of these offer a significant impediment for Big data users who just want an easy to use interface to all relevant data regardless of its shape, format, size, and a back-end system used to store it. Naturally, these differences also complicate development of any analytical algorithms on top of large-scale, heterogeneous datasets. Here we describe HYBRID.POLY- a consolidated in-memory polystore engine [2], designed to support heterogeneous large-scale data and interactively process complex analytical work-loads. We execute and evaluate several popular analytical work-loads including Data Fusion, Machine Learning, and Music search at scale.

【Keywords】: Polystore; Database; System; Big Data; Analytics; Music Retrieval; Machine Learning; Data Fusion; Data Integration; Data Model; Information Retrieval; Web Search; Natural Language Processing; Deep Learning; Similarity Search; Interactive Analytics; Large scale Data Management

217. EXPLAINER: Entity Resolution Explanations.

Paper Link】 【Pages】:2000-2003

【Authors】: Amr Ebaid ; Saravanan Thirumuruganathan ; Walid G. Aref ; Ahmed K. Elmagarmid ; Mourad Ouzzani

【Abstract】: Entity Resolution is a fundamental data cleaning and integration problem that has received considerable attention in the past few decades. While rule-based methods have been used in many practical scenarios and are often easy to understand, machine-learning-based methods provide the best accuracy. However, the state-of-the-art classifiers are very opaque. There has been some work towards understanding and debugging the early stages of the entity resolution pipeline, e.g. blocking and generating features (similarity scores). However, there are no such efforts for explaining the model or its predictions. In this demo, we propose ExplainER, a tool to understand and explain entity resolution classifiers with different granularity levels of explanations. Using several benchmark datasets, we will demonstrate how ExplainER can handle different scenarios for a variety of classifiers.

【Keywords】: Entity Resolution, Data Cleaning, Data Integration, Data Cleaning Explanations, Explainable AI

218. CEP-Wizard: Automatic Deployment of Distributed Complex Event Processing.

Paper Link】 【Pages】:2004-2007

【Authors】: Yooju Shin ; Susik Yoon ; Patara Trirat ; Jae-Gil Lee

【Abstract】: Complex event processing (CEP) is defined as event processing for multiple stream sources to infer events that suggest complicated circumstances. As the size of stream data becomes larger, CEP engines have been parallelized to take advantage of distributed computing. Typically, deployment of such a distributed CEP engine involves manual configuration, which has been regarded as an obstacle to its widespread adoption. In this demonstration, we present CEP-Wizard, a framework of automatically configuring and deploying a distributed CEP engine with minimum effort. The demonstration shows that even inexperienced users can easily configure and deploy it on Apache Storm with achieving high performance and low resource usage.

【Keywords】: Complex event processing; Distributed system; Apache Storm

219. A Comparison of Allocation Algorithms for Partially Replicated Databases.

Paper Link】 【Pages】:2008-2011

【Authors】: Stefan Halfpap ; Rainer Schlosser

【Abstract】: Increasing demand for analytical processing capabilities can be managed by replication approaches. However, to evenly balance the replicas' workload shares while at the same time minimizing the data replication factor is a highly challenging allocation problem. As optimal solutions are only applicable for small problem instances, effective heuristics are indispensable. In this paper, we test and compare state-of-the-art allocation algorithms for partial replication. By visualizing and exploring their (heuristic) solutions for different benchmark workloads, we are able to derive structural insights and to detect an algorithm's strengths as well as its potential for improvement. Further, our application enables end-to-end evaluations of different allocations to verify their theoretical performance.

【Keywords】: database replication; allocation problem

220. PePPer: Fine-Grained Personal Access Control via Peer Probing.

Paper Link】 【Pages】:2012-2015

【Authors】: Yael Amsterdamer ; Osnat Drein

【Abstract】: Users share data on multiple platforms where access control is managed according to the platform-specific policy. However, some data is conceptually owned by peers in a manner that is platform-independent. To enable peers to manage access control rights on such data we introduce PePPer, a tool for fine-grained, personal access control. This system, which runs on the client side, enables loading data items from different sources and annotating them with fine-grained access control requirements via provenance-style Boolean expressions. These expressions are evaluated to decide whether the client is allowed to share the data with a given peer, using a taxonomy that compactly captures data ownership and access control policies. If credentials depend on the unknown access policy of other peers, PePPer probes them to obtain relevant access permissions. For that, we employ efficient algorithms that minimizes the expected number of probes. Our algorithm adapt techniques from stochastic Boolean evaluation to our setting by accounting for multiple peers, policies, expressions and the access control taxonomy. Throughout this process, PePPer further presents to the client a continually updated partial view of the data currently known to be safely shareable. We demonstrate PePPer for the sharing of calendar entries among peers, using as example real-life calendars of politicians and public figures.

【Keywords】: access control; fine grained provenance; Boolean evaluation; personal data

221. COBRA: Compression Via Abstraction of Provenance for Hypothetical Reasoning.

Paper Link】 【Pages】:2016-2019

【Authors】: Daniel Deutch ; Yuval Moskovitch ; Noam Rinetzky

【Abstract】: Data analytics often involves hypothetical reasoning: repeatedly modifying the data and observing the induced effect on the computation result of a data-centric application. Recent work has proposed to leverage ideas from data provenance tracking towards supporting efficient hypothetical reasoning: instead of a costly re-execution of the underlying application, one may assign values to a pre-computed provenance expression. A prime challenge in leveraging this approach for large-scale data and complex applications lies in the size of the provenance. To this end, we present a framework that allows to reduce provenance size. Our approach is based on reducing the provenance granularity using abstraction.We propose a demonstration of COBRA, a system that allows examine the effect of the provenance compression on the anticipated analysis results. We will demonstrate the usefulness of COBRA in the context of business data analysis.

【Keywords】: Cognition; Cost accounting; Standards; Companies; Databases; Data analysis

222. CogLearn: A Cognitive Graph-Oriented Online Learning System.

Paper Link】 【Pages】:2020-2023

【Authors】: Yang Pian ; Yu Lu ; Penghe Chen ; Qinglong Duan

【Abstract】: We propose and implement a novel online learning system, called CogLearn, to support learner's self-awareness and reflective thinking, which urges a proper form of knowledge representation together with individual learner's cognitive status. We thus design and employ the machine learning techniques to estimate learner's cognitive status and identify educational relations to construct the desired knowledge representation, namely cognitive graph in our system. We further demonstrate the system by presenting two practical services, i.e., learning obstacle diagnosis and learning path planning, to demonstrate how the constructed cognitive graph effectively and adaptively supports individual system user's learning process.

【Keywords】: Cognitive Graph; Educational Data Mining; Knowledge Tracing; Adaptive Learning System

223. GRIT: Consistent Distributed Transactions Across Polyglot Microservices with Multiple Databases.

Paper Link】 【Pages】:2024-2027

【Authors】: Guogen Zhang ; Kun Ren ; Jung-Sang Ahn ; Sami Ben-Romdhane

【Abstract】: The popular microservice architecture for applications brings new challenges for consistent distributed transactions across multiple microservices. These microservices may be implemented in different languages, and access multiple underlying databases. Consistent distributed transactions are a real requirement but are very hard to achieve with existing technologies in these environments. In this demo we present GRIT: a system that resolves this challenge by cleverly leveraging deterministic database technologies and optimistic concurrency control protocol(OCC). A transaction is optimistically executed with its read-set and write-set captured during the execution phase. Then at the commit time, conflict checking is performed and a global commit decision is made. A logically committed transaction is persisted into logs first, and then asynchronously applied to the physical databases deterministically. GRIT is able to achieve consistent, high throughput and serializable distributed transactions for any applications invoking microservices. The demonstration offers a walk-through of how GRIT can easily support distributed transactions across multiple microservices and databases.

【Keywords】: microservice, distributed transactions, OCC, deterministic

Paper Link】 【Pages】:2028-2031

【Authors】: Yang Ji ; Cheng Xu ; Jianliang Xu ; Haibo Hu

【Abstract】: With the proliferation of cloud computing and data-as-a-service (DaaS), more and more organizations and individuals outsource their data to a third-party service provider. While enjoying the benefits of cloud-based data outsourcing, the data owners are at the risk of losing control of data integrity and access management. In this demonstration, we present a system called vABS, which enables verifiable Attribute-Based Search over shared cloud data. The vABS system adopts the common DaaS architecture, in which the server provides search services to users on behalf of data owners. By employing a novel zero-knowledge approach proposed in our prior work [1], vABS not only provides users with good search experiences, but also supports authenticated query processing with fine-grained access control, which is crucial to many high-security applications.

【Keywords】: Data Integrity; Access Control; Zero Knowledge; Verifiable Search

225. An Environment-Aware Market Strategy for Data Allocation and Dynamic Migration in Cloud Database.

Paper Link】 【Pages】:2032-2035

【Authors】: Tengjiao Wang ; Binyang Li ; Wei Chen ; Yuxiao Zhang ; Ying Han ; Jinzhong Niu ; Kam-Fai Wong

【Abstract】: Currently, a cloud database is employed to serve on-line query-intensive applications. It inevitably happens that some cloud data nodes storing hot records are facing high frequent query requests while others are rarely visited or even idle. Therefore, how data are dynamically allocated and migrated at runtime has significant impact on query load distribution and system performance. Existing system adopt centralized approaches, and they face two main challenges: (1) Query load on individual node cannot be always balancing even if the data are fairly distributed; (2) For each node, the dynamic changes of configuration resources cannot be captured during the runtime. To this end, this paper presents an environment-aware market strategy based system, named e-MARS, for reasonable data migration to achieve query load balance in cloud database. In e-MARS, cloud database is modeled as a cloudDB market, while data nodes are regarded as intelligent traders and the query load as commodity. Each trader is aware of its local environmental re-sources, such as computing capacity, disk volume, based on which the trader itself decides how to trade the query load and migrates the corresponding data. In this way the cloudDB market will achieve equilibrium. Experiments are conducted on the real communication data, and e-MARS significantly enhances the efficiency. Compared with HBase Balancer, more than 65% improvement is achieved in terms of query response time.

【Keywords】: cloud database, environment-aware market strategy, data allocation, dynamic migration

226. Vaite: A Visualization-Assisted Interactive Big Urban Trajectory Data Exploration System.

Paper Link】 【Pages】:2036-2039

【Authors】: Chuang Yang ; Yilan Zhang ; Bo Tang ; Min Zhu

【Abstract】: Big urban trajectory exploration extracts insights from trajectories. It has many smart-city applications, e.g., traffic jam detection, taxi movement pattern analysis. The challenges of big urban trajectory data exploration are: (i) the data analysts probably do not have experience or knowledge on issuing their analysis tasks by SQL-like queries or analysis operations accurately; and (ii) big urban trajectory data is naturally complex, e.g., unpredictability, interrelation, etc. In this work, we architect and implement a visualization-assisted big urban trajectory data exploration system (Vaiet) to address these chanllenges. Vaiet includes three layers, from data collection to results visualization. We devise novel visualization views in Vaiet to support interactive big urban trajectory exploratory analysis. We demonstrate the effectiveness of Vaiet by the real world applications.

【Keywords】: trajectory data exploration; data visualization

227. SciDetector: Scientific Event Discovery by Tracking Variable Source Data Streaming.

Paper Link】 【Pages】:2040-2043

【Authors】: Zhiqiang Duan ; Chen Yang ; Xiaofeng Meng ; Yongjie Du ; Jiaming Qiu ; Xiaobin Ma ; Zhihui Du ; Xukang Zhang ; Baoning Niu ; Chao Wu

【Abstract】: We present the design of and a demonstration for SciDetector, a system of scientific research for online analysis. SciDetector can efficiently online analyze scientific events on large-scale newly arriving data. In modern scientific research, especially astronomy survey, with the development of the observation infrastructure, Short-Timescale and Large Field-of-view (STLF) sky survey has been the major topic. However, existing scientific system focus on offline analysis of long-term historical data, not real-time analysis of large-scale scientific data. Using real data, we demonstrate how SciDetector processes the data and generate the scientific product.

【Keywords】: scientific event analysis; pipeline; astronomy

228. Demonstrating Spindra: A Geographic Knowledge Graph Management System.

Paper Link】 【Pages】:2044-2047

【Authors】: Yuhan Sun ; Jia Yu ; Mohamed Sarwat

【Abstract】: Knowledge Graphs are widely used to store facts about real-world entities and events. With the ubiquity of spatial data, vertexes or edges in knowledge graphs can possess spatial location attributes side by side with other non-spatial attributes. For instance, as of June 2018 the Wikidata knowledge graph contains 48; 547; 142 data items (i.e., vertexes) to date and ≈13% of them have spatial location attributes. The co-existence of the graph and spatial data in the same geographic knowledge graph allows users to search the graph with local intent. Many location-based services such as UberEats, GrubHub, and Yelp already employ similar knowledge graphs to enhance the location search experience for their end-users. In this paper, we demonstrate a system, namely Spindra, that provides efficient management of geographic knowledge graphs. We demonstrate the system using an interactive map-based web interface that allows users to issue location-aware search queries over the WikiData knowledge graph. The Front end will then visualize the returned geographic knowledge to the user using OpenStreetMaps.

【Keywords】: Knowledge Graph; Spatial Data; Graph Search; Index

229. Native Storage Techniques for Data Management.

Paper Link】 【Pages】:2048-2051

【Authors】: Ilia Petrov ; Andreas Koch ; Sergey Hardock ; Tobias Vinçon ; Christian Riegger

【Abstract】: In the present tutorial we perform a cross-cut analysis of database storage management from the perspective of modern storage technologies. We argue that neither the design of modern DBMS, nor the architecture of modern storage technologies are aligned with each other. Moreover, the majority of the systems rely on a complex multi-layer and compatibility-oriented storage stack. The result is needlessly suboptimal DBMS performance, inefficient utilization, or significant write amplification due to outdated abstractions and interfaces. In the present tutorial we focus on the concept of native storage, which is storage operated without intermediate abstraction layers over an open native storage interface and is directly controlled by the DBMS. We cover the following aspects of native storage: (i) architectural approaches and techniques; (ii) interfaces; (iii) storage abstractions; (iv) DBMS/system integration; (v) in-storage processing.

【Keywords】: DBMS; Native Storage; Near Data Processing

230. Crowdsourcing Database Systems: Overview and Challenges.

Paper Link】 【Pages】:2052-2055

【Authors】: Chengliang Chai ; Ju Fan ; Guoliang Li ; Jiannan Wang ; Yudian Zheng

【Abstract】: Many data management and analytics tasks, such as entity resolution, cannot be solely addressed by automated processes. Crowdsourcing is an effective way to harness the human cognitive ability to process these computer-hard tasks. Thanks to public crowdsourcing platforms, e.g., Amazon Mechanical Turk and CrowdFlower, we can easily involve hundreds of thousands of ordinary workers (i.e., the crowd) to address these computer-hard tasks. However it is rather inconvenient to interact with the crowdsourcing platforms, because the platforms require one to set parameters and even write codes. Inspired by traditional DBMS, crowdsourcing database systems have been proposed and widely studied to encapsulate the complexities of interacting with the crowd. In this tutorial, we will survey and synthesize a wide spectrum of existing studies on crowdsourcing database systems. We first give an overview of crowdsourcing, and then summarize the fundamental techniques in designing crowdsourcing databases, including task design, truth inference, task assignment, answer reasoning and latency reduction. Next we review the techniques on designing crowdsourced operators, including selection, join, sort, top-k, max/min, count, collect, and fill. Finally, we discuss the emerging challenges.

【Keywords】: Crowdsourcing, Database

231. Telco Big Data Research and Open Problems.

Paper Link】 【Pages】:2056-2059

【Authors】: Constantinos Costa ; Demetrios Zeinalipour-Yazti

【Abstract】: A telecommunication company (telco) is traditionally only perceived as the entity that provides telecommunication services, such as telephony and data communication access to users. However, the radio and backbone infrastructure of such entities spanning densely most urban spaces and widely most rural areas, provides nowadays a unique opportunity to collect immense amounts of data that capture a variety of natural phenomena on an ongoing basis, e.g., traffic, commerce, mobility patterns and user service experience. The ability to perform analytics on the generated big data within tolerable elapsed time and share it with key smart city enablers (e.g., municipalities, public services, startups, authorities, and companies), elevates the role of telcos in the realm of future smart cities from pure network access providers to information providers. In this tutorial, we overview the state-of-the-art in telco big data analytics by focusing on a set of basic principles, namely: (i) real-time analytics and detection; (ii) experience, behavior and retention analytics; (iii) privacy; and (iv) storage. We also present experiences from developing an innovative such architecture and conclude with open problems and future directions.

【Keywords】: Telco, Big Data, Queries, Analytics, Storage, Privacy

232. Geospatial Data Management in Apache Spark: A Tutorial.

Paper Link】 【Pages】:2060-2063

【Authors】: Jia Yu ; Mohamed Sarwat

【Abstract】: The volume of spatial data increases at a staggering rate. This tutorial comprehensively studies how existing works extend Apache Spark to uphold massive-scale spatial data. During this 1.5 hour tutorial, we first provide a background introduction of the characteristics of spatial data and the history of distributed data management systems. A follow-up section presents the common approaches used by the practitioners to extend Spark and introduces the vital components in a generic spatial data management system. The third, fourth and fifth sections then discuss the ongoing efforts and experience in spatial-temporal data, spatial data analytics and streaming spatial data, respectively. The sixth part finally concludes this tutorial to help the audience better grasp the overall content and points out future research directions.

【Keywords】: geospatial data; apache spark; distributed computing

233. Hierarchical Decomposition of Big Graphs.

Paper Link】 【Pages】:2064-2067

【Authors】: Ying Zhang ; Lu Qin ; Fan Zhang ; Wenjie Zhang

【Abstract】: Graph decomposition has been widely used to analyze real-life networks from different perspectives. Recent studies focus on the hierarchical graph decomposition methods to handle big graphs in many real-life applications such as community detection, network analysis, network visualization, internet topology analysis and protein function prediction. In this tutorial, we first highlight the importance of hierarchical graph decomposition in a variety of applications and the unique challenges that need to be addressed. Subsequently, we provide an overview of the existing models and the computation algorithms under different computing environments. Then we discuss the integration of existing models with other approaches to better capture the cohesiveness of subgraphs in real-life scenarios. Finally, we discuss the future research directions in this important and growing research area.

【Keywords】: graph; cohesive subgraph; hierarchical decomposition

234. Cohesive Subgraph Computation Over Large Sparse Graphs.

Paper Link】 【Pages】:2068-2071

【Authors】: Lijun Chang ; Lu Qin

【Abstract】: With the rapid development of information technology, huge volumes of graph data are accumulated. Real graphs are usually sparsely connected from a global point of view, but typically contain subgraphs that are locally densely connected. It is of great importance to identify dense (i.e., cohesive) subgraphs in a large sparse graph. Cohesive subgraph computation can either be the main goal of a graph analysis task, or act as a preprocessing step aiming to reduce/trim the graph by removing sparse/unimportant parts such that more complex and time-consuming analysis can be conducted. In the literature, the cohesiveness of a subgraph is usually measured by the minimum degree, the average degree, or their higher-order variants. Cohesive subgraph computation based on different cohesiveness measures extracts subgraphs with different properties, and also requires different levels of computational efforts. In this tutorial, we survey the models and state-of-the-art algorithms for efficient cohesive subgraph computation based on different cohesiveness measures. We discuss details of the algorithms, including time complexity and implementation matters. Finally, we present open problems for future research.

【Keywords】: cohesive subgraph, core decomposition, truss de composition, higher-order structure, densest subgraph, triangle-densest subgraph

235. Robust Query Processing: Mission Possible.

Paper Link】 【Pages】:2072-2075

【Authors】: Jayant R. Haritsa

【Abstract】: Robust query processing with strong performance guarantees is an extremely desirable objective in the design of industrial-strength database engines. However, it has proved to be a largely intractable and elusive challenge in spite of sustained efforts spanning several decades. The good news is that in recent times, there have been a host of exciting technical advances, at different levels in the database architecture, that collectively promise to materially address this problem. In this tutorial, we will present these novel research approaches, characterize their strengths and limitations, and enumerate open technical problems that remain to be solved to make robust query processing a contemporary reality.

【Keywords】: Declarative Queries; Optimization; Robust Execution; Performance Guarantees

236. Automated Documentation of End-to-End Experiments in Data Science.

Paper Link】 【Pages】:2076-2080

【Authors】: Sergey Redyuk

【Abstract】: Reproducibility plays a crucial role in experimentation. However, the modern research ecosystem and the underlying frameworks are constantly evolving and thereby making it extremely difficult to reliably reproduce scientific artifacts such as data, algorithms, trained models and visualizations. We, therefore, aim to design a novel system for assisting data scientists with rigorous end-to-end documentation of data-oriented experiments. Capturing data lineage, metadata, and other artifacts helps reproducing and sharing experimental results. We summarize this challenge as automated documentation of data science experiments. We aim at reducing manual overhead for experimenting researchers, and intend to create a novel approach in dataflow and metadata tracking based on the analysis of the experiment source code. The envisioned system will accelerate the research process in general, and enable capturing fine-grained meta information by deriving a declarative representation of data science experiments.

【Keywords】: experimentation in data science; reproducibility; automatic tracking of metadata; responsible data management

237. Explaining Results of Data-Driven Applications.

Paper Link】 【Pages】:2081-2085

【Authors】: Nave Frost

【Abstract】: Albert Einstein once said "If you can't explain it simply, you don't understand it well enough". This is not only true for physical theories but also highly relevant to the results of data-driven applications. Data science is extensively used in various domains, achieving high performance at different tasks, e.g. question answering, and image recognition. Unfortunately, its results generally remain unexplainable, leaving its users to wonder why a specific result was returned instead of the other. My Ph.D. thesis focuses on providing explanations for data-driven applications. This paper demonstrates approaches for interpretability in two applications: Natural Language Queries, and Machine Learning Classifiers, followed by a discussion of open problems and future work.

【Keywords】: Data Science; Interpretability

238. Towards Explaining the Effects of Data Preprocessing on Machine Learning.

Paper Link】 【Pages】:2086-2090

【Authors】: Carlos Vladimiro Gonzalez Zelaya

【Abstract】: Ensuring the explainability of machine learning models is an active research topic, naturally associated with notions of algorithmic transparency and fairness. While most approaches focus on the problem of making the model itself explainable, we note that many of the decisions that affect the model's predictive behaviour are made during data preprocessing, and are encoded as specific data transformation steps as part of pre-learning pipelines. Our research explores metrics to quantify the effect of some of these steps. In this initial work we define a simple metric, which we call volatility, to measure the effect of including/excluding a specific step on predictions made by the resulting model. Using training set rebalancing as a concrete example, we report on early experiments on measuring volatility in two public benchmark datasets, Students' Academic Performance and German Credit, with the ultimate goal of identifying predictors for volatility that are independent of the dataset and of the specific preprocessing step.

【Keywords】: machine learning; fairness; transparency; preprocessing

239. Don't Fear the REAPER: A Framework for Materializing and Reusing Deep-Learning Models.

Paper Link】 【Pages】:2091-2095

【Authors】: Melanie B. Sigl

【Abstract】: Training of Deep-Learning models represents a computationally intensive task, especially with high-dimensional and high-volume data, where training may take days or weeks. Current research mainly focuses on model materialization, workflow optimization, and lifecylce management. However, searching for a similar dataset in a dataset repository to reuse a suitable pre-built model has received little attention by the data-management community. A remarkable feature of reusing an already built model is that the iterative process of finding a suitable model is shortened, and additionally, only a small set of training data is required. Thus, transferring such knowledge has become an important issue in the Deep-Learning community. However, the relationship between the source and target task is not well understood yet. The aim of this research is to reduce training time of machine learning from a data-management perspective through model reuse, and shed some light on the above relationship in the case when reusing a model is appropriate. We propose a framework to aid data scientists in searching for a similar dataset given a new dataset and selecting a suitable model. We discuss a novel approach on how to determine similarity among datasets and how to chose an appropriate model. We are confident that with recent advances of model materialization and dataset repositories our research improves the knowledge on this topic.

【Keywords】: Data Management, Data Profiling, Deep Learning, Transfer Learning

240. Knowledge Representation for Emotion Intelligence.

Paper Link】 【Pages】:2096-2100

【Authors】: Shuo Wang

【Abstract】: Emotion intelligence (EI) is a traditional topic for psychology, sociology, biology and medical science. Because emotion is related with the personality, interpersonal effect, social function, disease treatment, etc. Analyzing the emotion from the Web data by computer technology becomes more and more popular, and the scientists from the non-computer domains need more helpful computing models to deal with professional problems that are not traditional for computer science. Knowledge representation is a basic and possible solution as a bridge between emotion intelligence and artificial intelligence. For the sentiment words, word embedding can map the words to vectors that represent the semantic context of the words. Sentiment embedding based on the word embedding can capture both semantics and the emotion information. We have introduced two kinds of improving embedding methods (MEC and Emo2Vec) for the sentiment words embedding. For emotion structure based on the psychology of emotion, knowledge graph can represent the cognitive relations between different emotion types. The same emotional expressions can affect the reaction and behaviors of the recipient in different ways due to factors such as social relations, information processing, time pressure, etc. Knowledge graph can represent these complicated situations as the relations between the entities and attributes. Based on this graph, we make the inference or prediction of the emotion influence on decision making.

【Keywords】: emotion intelligence, sentiment embedding, cognitive structure of emotion, knowledge graph

Paper Link】 【Pages】:2101-2105

【Authors】: Niousha Hormozi

【Abstract】: Recently, there has been a significant growth in keyword search due to its wide range of use-cases in people's everyday life. While keyword search has been applied to different kinds of data, ambiguity always exists no matter what data the query is being asked from. Generally, when users submit a query they need an exact answer that perfectly meets their needs, rather than wondering about different possible answers that are retrieved by the system. To achieve this, search systems need a Disambiguation functionality that can efficiently filter out and rank all possible answers to a query, before showing them to user. In this paper, we are going to describe how we are improving state of the art in various stages of a keyword-search pipeline in order to retrieve the answers that best match the user's intent.

【Keywords】: Disambiguation, Relational Databases, Expansion, keyword search, Information Retrieval, Data Extraction

242. Event Recommendation using Social Media.

Paper Link】 【Pages】:2106-2110

【Authors】: Sreekanth Madisetty

【Abstract】: Every day, lots of events keep happening around the globe. Examples of events are festivals, concerts, shows, conferences, sports events, movie launches, etc. There are several event recommendation systems that suggest different events to the users. The popularity of an event plays an important role in event recommendation. If the system knows which event is popular in a region, then that can be recommended to more people in that region. However, since events take place at predefined location and time it is important to identify the popularity much before the occurrence of the event. Early prediction of future popularity is very crucial for the success of any event recommendation system. We plan to use event related discussions in social media as a signal for estimating the popularity of the events. Towards this task, given meta information (time, date, venue, title etc.) about an event, we first try to identify social media posts related to the event. These posts are then passed through several preprocessing stages in a pipeline to detect spam, identify the presence of emotions and their intensities, etc. The preprocessed posts can then be used along with several other contextual parameters to predict the future popularity of the events. Next, events can be recommended to the users by using a recommendation algorithm.

【Keywords】: Twitter, event recommendation, social media, event popularity, hashtags

243. Location Inference for Non-Geotagged Tweets in User Timelines [Extended Abstract].

Paper Link】 【Pages】:2111-2112

【Authors】: Pengfei Li ; Hua Lu ; Nattiya Kanhabua ; Sha Zhao ; Gang Pan

【Abstract】: This study explores the problem of inferring locations for individual tweets. We scrutinize Twitter user timelines in a novel fashion. First of all, we split each user's tweet timeline temporally into a number of clusters, each tending to imply a distinct location. Subsequently, we adapt machine learning models to our setting and design classifiers that classify each tweet cluster into one of the pre-defined location classes at the city level. Extensive experiments on a large set of real Twitter data suggest that our models are effective at inferring locations for non-geotagged tweets and outperform the state-of-the-art approaches significantly in terms of inference accuracy.

【Keywords】: Twitter, location inference, bayes, LSTM

244. Efficient Parallel Skyline Query Processing for High-Dimensional Data.

Paper Link】 【Pages】:2113-2114

【Authors】: Mingjie Tang ; Yongyang Yu ; Walid G. Aref ; Qutaibah M. Malluhi ; Mourad Ouzzani

【Abstract】: Given a set of multidimensional data points, skyline queries retrieve those points that are not dominated by any other points in the set. Due to the ubiquitous use of skyline queries, such as in preference-based query answering and decision making, and the large amount of data that these queries have to deal with, enabling their scalable processing is of critical importance. However, there are several outstanding challenges that have not been well addressed. More specifically, in this paper, we are tackling the data straggler and data skew challenges introduced by distributed skyline query processing, as well as the ensuing high computation cost of merging skyline candidates. We thus introduce a new efficient three-phase approach for large scale processing of skyline queries. In the first preprocessing phase, the data is partitioned along the Z-order curve. We utilize a novel data partitioning approach that formulates data partitioning as an optimization problem to minimize the size of intermediate data. In the second phase, each compute node partitions the input data points into disjoint subsets, and then performs the skyline computation on each subset to produce skyline candidates in parallel. In the final phase, we build an index and employ an efficient algorithm to merge the generated skyline candidates. Extensive experiments demonstrate that the proposed skyline algorithm achieves more than one order of magnitude enhancement in performance compared to existing state-of-the-art approaches.

【Keywords】: parallel computation; skyline query; big data

245. On Generalizing Collective Spatial Keyword Queries (Extended Abstract).

Paper Link】 【Pages】:2115-2116

【Authors】: Harry Kai-Ho Chan ; Cheng Long ; Raymond Chi-Wing Wong

【Abstract】: With the proliferation of spatial-textual data such as location-based services and geo-tagged websites, spatial keyword queries are ubiquitous in real life. One example of spatial-keyword query is the so-called collective spatial keyword query (CoSKQ) which is to find, for a given query consisting a query location and several query keywords, a set of objects which covers the query keywords collectively and has the smallest cost wrt the query location. Quite a few cost functions have been proposed for CoSKQ and correspondingly, different approaches have been developed. However, given these cost functions in different forms and approaches in different structures, one could hardly compare existing cost functions systematically and needs to implement all approaches in order to tackle the CoSKQ problem with different cost functions, which is effort-consuming. In this paper, we design a unified cost function which generalizes the majority of existing cost functions for CoSKQ and develop a unified approach which works as well as (and sometimes better than) best-known approaches based on different cost functions. Experiments were conducted on both real and synthetic datasets which verified our proposed approach.

【Keywords】: spatial keyword queries; unified framework

246. A Novel Representation and Compression for Queries on Trajectories in Road Networks (Extended Abstract).

Paper Link】 【Pages】:2117-2118

【Authors】: Xiaochun Yang ; Bin Wang ; Kai Yang ; Chengfei Liu ; Baihua Zheng

【Abstract】: Recording and querying time-stamped trajectories incurs high cost of data storage and computing. In this paper, we explore characteristics of the trajectories in road networks, which have motivated the idea of coding trajectories by associating timestamps with relative spatial path and locations. Such a representation contains large number of duplicate information to achieve a lower entropy compared with the existing representations, thereby drastically cutting the storage cost. We propose techniques to compress spatial path and locations separately, which can support fast positioning and achieve better compression ratio. For locations, we propose two novel encoding schemes such that the binary code can preserve distance information, which is very helpful for LBS applications. In addition, an unresolved question in this area is whether it is possible to perform search directly on the compressed trajectories, and if the answer is yes, then how. Here we show that directly querying compressed trajectories based on our encoding scheme is possible and can be done efficiently.We design a set of primitive operations for this purpose, and propose index structures to reduce query response time. We demonstrate the advantage of our method and compare it against existing ones through a thorough experimental study on real trajectories in road network.

【Keywords】: trajectory; compression; LBS; road network

247. Efficient Multi-Class Probabilistic SVMs on GPUs.

Paper Link】 【Pages】:2119-2120

【Authors】: Zeyi Wen ; Jiashuai Shi ; Bingsheng He ; Jian Chen ; Yawen Chen

【Abstract】: Multi-class SVMs with the probabilistic output (MP-SVMs) are important techniques in pattern recognition. Two key challenges for efficient GPU accelerations for MP-SVM are: (i) many kernel values are repeatedly computed as a binary SVM classifier is trained iteratively, resulting in repeated accesses to the high latency GPU memory; (ii) performing training or estimating probability in parallel requires a much larger memory footprint than the GPU memory. To overcome the challenges, we propose GMP-SVM to reduce high latency memory accesses and memory consumption through batch processing, computation/data reusing and sharing. Experimental results show that our solution (available in https://github.com/Xtra-Computing/thundersvm) outperforms LibSVM by 100 times while retaining the same accuracy.

【Keywords】: Graphics Processing Units; Classification; Machine Learning

248. C2Net: A Network-Efficient Approach to Collision Counting LSH Similarity Join(Extended Abstract).

Paper Link】 【Pages】:2121-2122

【Authors】: Hangyu Li ; Sarana Nutanong ; Hong Xu ; Chenyun Yu ; Foryu Ha

【Abstract】: Similarity join of two datasets P and Q is a primitive operation that is useful in many application domains. The operation involves identifying pairs (p, q), in the Cartesian product of P and Q such that (p, q) satisfies a stipulated similarity condition. In a high-dimensional space, an approximate similarity join based on locality-sensitive hashing (LSH) provides a good solution while reducing the processing cost with a predictable loss of accuracy. A distributed processing framework such as MapReduce allows the handling of large and high-dimensional datasets. However, network cost frequently turns into a bottleneck in a distributed processing environment, thus resulting in a challenge of achieving faster and more efficient similarity join [2]. This paper focuses on collision counting LSH-based similarity join in MapReduce and proposes a network-efficient solution called C2Net to improve the utilization of MapReduce combiners. The solution uses two graph partitioning schemes: (i) minimum spanning tree for organizing LSH buckets replication; and (ii) spectral clustering for runtime collision counting task scheduling. Experiments have shown that, in comparison to the state of the art, the proposed solution is able to achieve 20% data reduction and 50% reduction in shuffle time.

【Keywords】: Locality sensitive hashing; join processing; similarity search; distributed processing

Paper Link】 【Pages】:2123-2124

【Authors】: Jungeun Kim ; Sungsu Lim ; Jae-Gil Lee ; Byung Suk Lee

【Abstract】: This paper proposes LinkBlackHole, a novel algorithm for finding communities that are (i) overlapping in nodes and (ii) mixing (not separating clearly) in links. There has been a small body of work in each category, but this paper is the first one that addresses both. For this purpose, LinkBlackHole incorporates the advantages of both the link-space transformation and the black hole transformation. Thorough experiments show superior quality of the communities detected by LinkBlackHole* to those detected by other state-of-the-art algorithms.

【Keywords】: Community Detection, Graph Clustering, Overlapping Communities, Link Clustering, Graph Drawing, Link Embedding

250. Fusion OLAP: Fusing the Pros of MOLAP and ROLAP Together for In-memory OLAP (Extended Abstract).

Paper Link】 【Pages】:2125-2126

【Authors】: Yansong Zhang ; Yu Zhang ; Shan Wang ; Jiaheng Lu

【Abstract】: OLAP models can be categorized with two types: MOLAP (multidimensional OLAP) and ROLAP (relational OLAP). In particular, MOLAP is efficient in multidimensional computing at the cost of cube maintenance, while ROLAP reduces the data storage size at the cost of expensive multidimensional join operations. In this paper, we propose a novel Fusion OLAP model to fuse the multi-dimensional computing model and relational storage model together to make the best aspects of both MOLAP and ROLAP worlds. The Fusion OLAP model can be integrated into the state-of-the-art in-memory databases with additional surrogate key indexes and vector indexes. We compared the Fusion OLAP implementations with three leading analytical in-memory databases. Our comprehensive experimental results show that Fusion OLAP implementation can achieve up to 35%, 365% and 169% performance improvements based on the Hyper, Vectorwise and MonetDB databases respectively, for the Star Schema Benchmark (SSB) with scale factor 100.

【Keywords】: MOLAP, ROLAP, vector index, surrogate key index, vector referencing

Paper Link】 【Pages】:2127-2128

【Authors】: Huan Li ; Hua Lu ; Lidan Shou ; Gang Chen ; Ke Chen

【Abstract】: As people spend significant parts of daily lives indoors, it is useful and important to measure indoor densities and find the dense regions in many indoor scenarios like space management and security control. In this paper, we propose a data-driven approach that finds top-k indoor dense regions by using indoor positioning data. Such data is obtained by indoor positioning systems working at a relatively low frequency, and the reported locations in the data are discrete, from a preselected location set that does not continuously cover the entire indoor space. When a search is triggered, the object positioning information is already out-of-date and thus object locations are uncertain. To this end, we first integrate object location uncertainty into the definitions for counting objects in an indoor region and computing its density. Subsequently, we conduct a thorough analysis of the location uncertainty in the context of complex indoor topology, deriving upper and lower bounds of indoor region densities and introducing distance decaying effect into computing concrete indoor densities. Enabled by the uncertainty analysis outcomes, we design efficient search algorithms for solving the problem. Finally, we conduct extensive experimental studies on our proposals using synthetic and real data. The experimental results verify that the proposed search approach is efficient, scalable, and effective. The top-k indoor dense regions returned by our search are considerably consistent with ground truth, despite that the search uses neither historical data nor extra knowledge about objects.

【Keywords】: Indoor space; Indoor positioning data; Density queries

Paper Link】 【Pages】:2129-2130

【Authors】: Xiaoye Miao ; Yunjun Gao ; Linlin Zhou ; Wei Wang ; Qing Li

【Abstract】: Probabilistic queries usually suffer from the noisy query result sets, due to data uncertainty. In this paper, we propose an efficient optimization framework, termed as QueryClean, for both probabilistic skyline computation and probabilistic similarity search. Its goal is to optimize query quality by selecting a group of uncertain objects to clean under limited resource available, where an entropy based quality function is leveraged. We develop an efficient index to organize the possible result sets of probabilistic queries, which is able to help avoid multiple probabilistic query evaluations over a large number of possible worlds for quality computation. Moreover, using two newly presented heuristics, we present exact and approximate algorithms for the optimization problem. Extensive experiments on both real and synthetic data sets demonstrate the efficiency and scalability of QueryClean.

【Keywords】: Probabilistic skyline query; probabilistic similarity query; query quality; optimization algorithms

253. On Efficiently Answering Why-Not Range-Based Skyline Queries in Road Networks (Extended Abstract).

Paper Link】 【Pages】:2131-2132

【Authors】: Xiaoye Miao ; Yunjun Gao ; Su Guo ; Gang Chen

【Abstract】: The range-based skyline (r-skyline) query on road networks retrieves the skyline objects of taking each point within a road region as a query point, in terms of objects' spatial and non-spatial attributes. In this paper, we systematically carry out the study of why-not questions on the r-skyline query in the road network (abbreviated as the why-not RSQ problem). We present three modification strategies, including modifying the query range, modifying the why-not point, and modifying both of them, for the why-not RSQ problem. In particular, a suite of newly presented effective concepts/techniques are leveraged, such as the concepts of skyline scope and skyline dominance region, non-spatial attribute modification pruning, and G-tree index. Extensive experiments using both real and synthetic data sets demonstrate the performance of our proposed algorithms.

【Keywords】: Why not question; skyline query; road network; dominance relationship; algorithm

254. SLADE: A Smart Large-Scale Task Decomposer in Crowdsourcing.

Paper Link】 【Pages】:2133-2134

【Authors】: Yongxin Tong ; Lei Chen ; Zimu Zhou ; Hosagrahar Visvesvaraya Jagadish ; Lidan Shou ; Weifeng Lv

【Abstract】: A crowdsourcing task in real-world applications often consists of thousands of atomic tasks. A common practice to distribute a large-scale crowdsourcing task is to pack atomic tasks into task bins and send to crowd workers in batches. It is challenging to decompose a large-scale crowdsourcing task into task bins to ensure reliability at a minimal total cost. In this paper, we propose the Smart Large-scAle task DEcomposer (SLADE) problem, which aims to decompose a large-scale crowdsourcing task to achieve the desired reliability at a minimal cost. We prove its NP-hardness and study two variants of the problem. For the homogeneous SLADE problem, we propose a greedy algorithm and an approximation framework using an optimal priority queue (OPQ) structure with provable approximation ratio. For the heterogeneous SLADE problem, we extend this framework and prove its approximation guarantee. Extensive experiments validate the effectiveness and efficiency of the solutions.

【Keywords】: Crowdsourcing; task decomposition; task assignment

255. XINA: Explainable Instance Alignment using Dominance Relationship (Extended Abstract).

Paper Link】 【Pages】:2135-2136

【Authors】: Jinyoung Yeo ; Haeju Park ; Sanghoon Lee ; Eric Wonhee Lee ; Seung-won Hwang

【Abstract】: In this extended abstract, we present an instance alignment framework, namely XINA, for KB integration. We then show its effectiveness and efficiency on real-world KBs.

【Keywords】: Instance alignment; Explainable AI

256. A Hardware-Accelerated Solution for Hierarchical Index-Based Merge-Join(Extended Abstract).

Paper Link】 【Pages】:2137-2138

【Authors】: Zimeng Zhou ; Chenyun Yu ; Sarana Nutanong ; Yufei Cui ; Chenchen Fu ; Chun Jason Xue

【Abstract】: Hardware acceleration through field programmable gate arrays (FPGAs) has recently become a technique of growing interest for many data-intensive applications. Join query is one of the most fundamental database query types useful in relational database management systems. However, the available solutions so far have been beset by higher costs in comparison with other query types. In this paper, we develop a novel solution to accelerate the processing of sort-merge join queries with low match rates. Specifically, our solution makes use of hierarchical indexes to identify result-yielding regions in the solution space in order to take advantage of result sparseness. Further, in addition to one-dimensional equi-join query processing, our solution supports processing of multidimensional similarity join queries. Experimental results show that our solution is superior to the best existing method in a low match rate setting; the method achieves a speedup factor of 4.8 for join queries with a match rate of 5%.

【Keywords】: Sort merge join; hardware acceleration; FPGA; B+tree; low selectivity join

Paper Link】 【Pages】:2139-2140

【Authors】: Huan Li ; Hua Lu ; Lidan Shou ; Gang Chen ; Ke Chen

【Abstract】: Knowing popular indoor locations can benefit many applications like exhibition planning and location-based advertising, among others. In this work, we use uncertain historical indoor mobility data to find the top-k popular indoor semantic locations with the highest flow values. In the data we use, an object positioning report contains a set of samples, each consisting of an indoor location and a corresponding probability. The problem is challenging due to the difficulty in obtaining reliable flow values and the heavy computational workload on probabilistic samples for large numbers of objects. To address the first challenge, we propose an indoor flow definition that takes into account both data uncertainty and indoor topology. To efficiently compute flows for individual indoor semantic locations, we design data structures for facilitating accessing the relevant data, a data reduction method that reduces the intermediate data to process, and an overall flow computing algorithm. Furthermore, we design search algorithms for finding the top-k popular indoor semantic locations. All proposals are evaluated extensively on real and synthetic data. The evaluation results show that our data reduction method significantly reduces the data volume in computing, our search algorithms are efficient and scalable, and the top-k popular semantic locations returned are in good accord with ground truth.

【Keywords】: Indoor space; Indoor mobility data; Indoor flows

258. Uncertain Graph Sparsification (Extended Abstract).

Paper Link】 【Pages】:2141-2142

【Authors】: Panos Parchas ; Nikolaos Papailiou ; Dimitris Papadias ; Francesco Bonchi

【Abstract】: Uncertain graphs are prevalent in several applications including communications systems, biological databases and social networks. The ever increasing size of the underlying data renders both graph storage and query processing extremely expensive. Sparsification has often been used to reduce the size of deterministic graphs by maintaining only the important edges. However, adaptation of deterministic sparsification methods fails in the uncertain setting. To overcome this problem, we introduce the first sparsification techniques aimed explicitly at uncertain graphs. The proposed methods reduce the number of edges and redistribute their probabilities in order to decrease the graph size, while preserving its underlying structure. The resulting graph can be used to efficiently and accurately approximate any query and mining tasks on the original graph, including clustering coefficient, page rank, reliability and shortest path distance.

【Keywords】: Uncertain Graphs; Sparsification; Entropy

259. Rule-Based Entity Resolution on Database with Hidden Temporal Information (Extended Abstract).

Paper Link】 【Pages】:2143-2144

【Authors】: Hongzhi Wang ; Xiaoou Ding ; Jianzhong Li ; Hong Gao

【Abstract】: In this paper, we deal with the problem of rule-based entity resolution on imprecise temporal data. We use record matching dependencies and data currency constraints to derive temporal records' information and trend of their attributes' evolvement with elapsing of time. We firstly block records into smaller blocks, and then by exploring data currency constraints. We propose a temporal clustering approach with two steps, i.e., the skeleton clustering and the banding clustering. Experiments show that our method achieves both high accuracy and efficiency with hidden temporal information on datasets without imprecise timestamps.

【Keywords】: Entity resolution; imprecise temporal data; attribute evolvement; dynamic weight schema

260. BRIGHT - Drift-Aware Demand Predictions for Taxi Networks (Extended Abstract).

Paper Link】 【Pages】:2145-2146

【Authors】: Amal Saadallah ; Luís Moreira-Matias ; Ricardo Sousa ; Jihed Khiari ; Erik Jenelius ; João Gama

【Abstract】: The dynamic behavior of urban mobility patterns makes matching taxi supply with demand as one of the biggest challenges in this industry. Recently, the increasing availability of massive broadcast GPS data has encouraged the exploration of this issue under different perspectives. One possible solution is to build a data-driven real-time taxi-dispatching recommender system. However, existing systems are based on strong assumptions such as stationary demand distributions and finite training sets, which make them inadequate for modeling the dynamic nature of the network. In this paper, we propose BRIGHT: a drift-aware supervised learning framework which aims to provide accurate predictions for short-term horizon taxi demand quantities through a creative ensemble of time series analysis methods that handle distinct types of concept drift. A large experimental set-up which includes three real-world transportation networks and a synthetic test-bed with artificially inserted concept drifts, was employed to illustrate the advantages of BRIGHT when compared to S.o.A methods for this problem.

【Keywords】: Time-series forecasting, Concept drift, Ensemble learning, Taxi passenger demand

261. Order-Sensitive Imputation for Clustered Missing Values (Extended Abstract).

Paper Link】 【Pages】:2147-2148

【Authors】: Qian Ma ; Yu Gu ; Wang-Chien Lee ; Ge Yu

【Abstract】: To study the issue of missing values (MVs), we propose the Order-Sensitive Imputation for Clustered Missing values (OSICM) framework, in which missing values are imputed sequentially such that the values filled earlier in the process are also used for later imputation of other MVs. Obviously, the order of imputations is critical to the effectiveness and efficiency of OSICM framework. We formulate the searching of the optimal imputation order as an optimization problem, and show its NP-hardness. Furthermore, we devise an algorithm to find the exact optimal solution and propose two approximate/heuristic algorithms to trade off effectiveness for efficiency. Finally, we conduct extensive experiments on real and synthetic datasets to demonstrate the superiority of our OSICM framework.

【Keywords】: missing value; order sensitive imputation; clustered MVs phenomenon

262. DeepDirect: Learning Directions of Social Ties with Edge-Based Network Embedding (Extended Abstract).

Paper Link】 【Pages】:2149-2150

【Authors】: Chaokun Wang ; Changping Wang ; Zheng Wang ; Xiaojun Ye ; Jeffrey Xu Yu ; Bin Wang

【Abstract】: This paper presents the problem of tie direction learning which learns the directionality function of directed social networks. One way is based on hand-crafted features; the other called DeepDirect learns the social tie representation through the network topology. DeepDirect directly maps social ties to low-dimensional embedding vectors by preserving network topology, utilizing labeled data, and generating pseudo-labels based on observed directionality patterns. Experimental results on two tasks, i.e., direction discovery on undirected ties and direction quantification on bidirectional ties, demonstrate the proposed methods are effective and promising.

【Keywords】: tie direction learning, mixed social networks, edge based network embedding, social tie representation

263. A Utility-Optimized Framework for Personalized Private Histogram Estimation (Extended Abstract).

Paper Link】 【Pages】:2151-2152

【Authors】: Yiwen Nie ; Wei Yang ; Liusheng Huang ; Xike Xie ; Zhenhua Zhao ; Shaowei Wang

【Abstract】: Local differential privacy (LDP), as a strong and practical notion, has been applied to deal with privacy issues in data collection. However, existing LDP-based strategies mainly focus on utility optimization at a single privacy level while ignoring various privacy preferences of data providers and multilevel privacy demands for statistics. In this poster, we for the first time propose a framework to optimize the utility of histogram estimation with these two privacy requirements. To clarify the goal of privacy protection, we personalize the traditional definition of LDP. We design two independent approaches to minimize the utility loss: Advanced Combination, which composes multilevel results for utility optimization, and Data Recycle with Personalized Privacy, which enlarges sample size for an estimation. We demonstrate their effectiveness on privacy and utility. Moreover, we embed these approaches within a Recycle and Combination Framework and prove that the framework stably achieves the optimal utility by quantifying its error bounds. On real-world datasets, our approaches are experimentally validated and remarkably outperform baseline methods.

【Keywords】: Differential Privacy; Crowdsourcing; Histograms

264. Near-Accurate Multiset Reconciliation (Extended Abstract).

Paper Link】 【Pages】:2153-2154

【Authors】: Lailong Luo ; Deke Guo ; Xiang Zhao ; Jie Wu ; Ori Rottenstreich ; Xueshan Luo

【Abstract】: The mission of set reconciliation (also called set synchronization) is to identify those elements which appear only in exactly one of two given sets. In this paper, we extend the set reconciliation problem into three design rationales: (i) multiset support; (ii) near 100% reconciliation accuracy; (iii) communication-friendly and time-saving. Prior reconciliation methods fail to realize the three rationales simultaneously. To this end, we redesign Trie and Fenwick Tree (FT), to near-accurately represent and reconcile two types of multisets that we refer to as unsorted and sorted multisets, respectively. Comprehensive evaluations are conducted to quantify the performance of our proposals. The trace-driven evaluations demonstrate that Trie and FT achieve near-accurate multiset reconciliation, with 4.31 and 2.96 times faster than the CBF-based method, respectively.

【Keywords】: multiset reconciliation; trie; fenwick tree

265. Answering Why-Not Group Spatial Keyword Queries (Extended Abstract).

Paper Link】 【Pages】:2155-2156

【Authors】: Bolong Zheng ; Kai Zheng ; Christian S. Jensen ; Nguyen Quoc Viet Hung ; Han Su ; Guohui Li ; Xiaofang Zhou

【Abstract】: With the proliferation of geo-textual objects on the web, extensive efforts have been devoted to improving the efficiency of top-k spatial keyword queries in different settings. However, comparatively much less work has been reported on enhancing the quality and usability of such queries. In this context, we propose means of enhancing the usability of a top-k group spatial keyword query, where a group of users aim to find k objects that contain given query keywords and are nearest to the users. Specifically, when users receive the result of such a query, they may find that one or more objects that they expect to be in the result are in fact missing, and they may wonder why. To address this situation, we develop a so-called why-not query that is able to minimally modify the original query into a query that returns the expected, but missing, objects, in addition to other objects. Specifically, we formalize the why-not query in relation to the top-k group spatial keyword query, called the Why-not Group Spatial Keyword Query (WGSK) that is able to provide a group of users with a more satisfactory query result. We propose a three-phase framework for efficiently computing he WGSK. Extensive experiments with real and synthetic data offer evidence that the proposed solution excels over baselines with respect to both effectiveness and efficiency.

【Keywords】: spatial keyword queries; why not; top k query; query processing

Paper Link】 【Pages】:2157-2158

【Authors】: Yixiang Fang ; Zhongran Wang ; Reynold Cheng ; Hongzhi Wang ; Jiafeng Hu

【Abstract】: Communities are prevalent in social networks, knowledge graphs, and biological networks. Recently, the topic of community search (CS), extracting a dense subgraph containing a query vertex q from a graph, has received great attention. However, existing CS solutions are designed for undirected graphs, and overlook directions of edges which potentially lose useful information carried on directions. In many applications (e.g., Twitter), users' relationships are often modeled as directed graphs (e.g., if a user a follows another user b, then there is an edge from a to b). In this paper, we study the problem of CS on directed graph. Given a vertex q of a graph G, we aim to find a densely connected subgraph containing q from G, in which vertices have strong interactions and high similarities, by using the minimum in/out-degrees metric. We first develop a baseline algorithm based on the concept of D-core. We further propose three index structures and corresponding query algorithms. Our experimental results on seven real graphs show that our solutions are very effective and efficient.

【Keywords】: community search; directed graphs; D-core; online queries

267. Exploring Communities in Large Profiled Graphs (Extended Abstract).

Paper Link】 【Pages】:2159-2160

【Authors】: Yankai Chen ; Yixiang Fang ; Reynold Cheng ; Yun Li ; Xiaojun Chen ; Jie Zhang

【Abstract】: Given a graph G and a vertex q ϵ G, the community search (CS) problem aims to efficiently find a subgraph of G whose vertices are closely related to q. Communities are prevalent in social and biological networks, and can be used in product advertisement and social event recommendation. In this paper, we study profiled community search (PCS), where CS is performed on a profiled graph. This is a graph in which each vertex has labels arranged in a hierarchical manner. Compared with existing CS approaches, PCS can sufficiently identify vertices with semantic commonalities and thus find more high-quality diverse communities. As a naive solution for PCS is highly expensive, we have developed a tree index, which facilitates efficient and online solutions for PCS.

【Keywords】: community search; social networks; graph queries; profiled graph

Paper Link】 【Pages】:2161-2162

【Authors】: Long Yuan ; Lu Qin ; Wenjie Zhang ; Lijun Chang ; Jianye Yang

【Abstract】: Community search is important in graph analysis and can be used in many real applications. In the literature, various community models have been proposed. However, most of them cannot well identify the overlaps between communities which is an essential feature of real graphs. To address this issue, k-clique percolation community model was proposed and has been proven effective in many applications. Motivated by this, in this paper, we adopt the k-clique percolation community model and study the densest clique percolation community search problem which aims to find the k-clique percolation community with the maximum k value that contains a given set of query nodes. We adopt an index based approach to solve this problem. Based on the observation that a k-clique percolation community is a union of maximal cliques, we devise a novel compact index, DCPC-Index, to preserve the maximal cliques and their connectivity information of the input graph. With DCPC-Index, we can answer the densest clique percolation community query efficiently. Besides, we also propose an index construction algorithm based on the definition of DCPC-Index and further improve the algorithm in terms of efficiency and memory consumption. We conduct extensive performance studies on real graphs and the experimental results demonstrate the efficiency of our index-based query processing algorithm and index construction algorithm.

【Keywords】: k-clique percolation community, community search, social network