32. ICDE 2016:Helsinki, Finland

32nd IEEE International Conference on Data Engineering, ICDE 2016, Helsinki, Finland, May 16-20, 2016. IEEE Computer Society 【DBLP Link】

Paper Num: 212 || Session Num: 36

Demo Session 1D 6
Demo Session 1E 6
Demo Session 2D 6
Demo Session 2E 5
Industrial and Applications 1: Distributed/Parallel Systems 3
Industrial and Applications 2: Potpourri 1 3
Industrial and Applications 3: Potpourri 2 3
Industrial and Applications 4: Real Time Analytics 3
Panels 2
Research Session 1A: Graph Processing 4
Research Session 1B: Crowdsourcing 4
Research Session 2A: Graph Algorithmics 4
Research Session 2B: Beyond Relational Query Processing 4
Research Session 2C: Privacy 4
Research Session 3A: Graph Proximity 4
Research Session 3B: Scalable Query Processing 4
Research Session 3C: Preference and Trust 4
Research Session 4A: Graph Mining 4
Research Session 4B: Conscious Big Data Processing 4
Research Session 4C: Data Streams 4
Research Session 5A: Graph Patterns 4
Research Session 5B: Parallel and Distributed Big Data Processing 4
Research Session 5C: Clustering 4
Research Session 6A: Spatial Analytics 4
Research Session 6B: Analytics on Big Data 4
Research Session 6C: Uncertain and Probabilistic Data 4
Research Session 7A: Scalable Matrix-Based Analytics 4
Research Session 7B: Trajectories and Roads 3
Research Session 8A: Data Explorations and Event Analytics 4
Research Session 8B: Spatial Analytics 4
Research Session 8C: Web Data Processing 4
Research Session 9A: Visual Analytics in Social Networks 4
Research Session 9B: Optimization of Temporal, Spatial Data 4
Research Session 9C: Data Integration and Strings 4
TKDE Posters 67
Tutorials 8

Research Session 1A: Graph Processing 4

【Paper Link】【Pages】:1-12

【Authors】: Xiaoyang Wang ; Ying Zhang ; Wenjie Zhang ; Xuemin Lin

【Abstract】: Influence maximization is a key problem in viral marketing. Given a social network G and a positive integer k, it aims to identify a seed set of k nodes in G that can maximize the expected influence spread in a certain propagation model. With the proliferation of geo-social networks, location-aware product promotion is becoming more necessary in real applications. However, the importance of the distance between users and the promoted location is underestimated in existing models. For instance, when opening a restaurant in downtown, through online promotion, the owner may expect to influence more customers who are close to the restaurant, instead of people that are far away from it. In this paper, we formally define the distance-aware influence maximization problem, to find a seed set that maximizes the expected influence over users who are more likely to be the potential customers of the promoted location. To efficiently calculate the influence spread, we adopt the maximum influence arborescence (MIA) model for influence approximation. To speed up the search, we propose three pruning strategies to prune unpromising nodes from expensive evaluation and achieve potential early termination in each iteration without sacrificing the final result's approximation ratio. In addition, novel index structures are developed to compute the bounds used in the three pruning strategies. By integrating these pruning strategies, we propose a priority based algorithm which searches users based on their order of influence. The algorithm achieves an approximation ratio of 1 ¿¿¿ 1 over e under the MIA model. In the final, comprehensive experiments over real datasets demonstrate the efficiency and effectiveness of the proposed algorithms and pruning strategies.

【Keywords】: Approximation algorithms; Computational modeling; Indexes; Integrated circuit modeling; Silicon; Social network services

【Paper Link】【Pages】:13-24

【Authors】: Daehoon Kim ; Jae-Gil Lee ; Byung Suk Lee

【Abstract】: Social curation services are emerging social media platforms that enable users to curate their contents according to the topic and express their interests at the topic level by following curated collections of other users' contents rather than the users themselves. The topic-level information revealed through this new feature far exceeds what existing methods solicit from the traditional social networking services, to greatly enhance the quality of topic-sensitive influence modeling. In this paper, we propose a novel model called the topical influence with social curation (TISC) to find influential users from social curation services. This model, formulated by the continuous conditional random field, fully takes advantage of the explicitly available topic-level information reflected in both contents and interactions. In order to validate its merits, we comprehensively compare TISC with state-of-the-art models using two real-world data sets collected from Pinterest and Scoop.it. The results show that TISC achieves higher accuracy by up to around 80% and finds more convincing results in case studies than the other models. Moreover, we develop a distributed learning algorithm on Spark and demonstrate its excellent scalability on a cluster of 48 cores.

【Keywords】: Computational modeling; Data models; Graphical models; Heart; Media; Predictive models; Social network services

3. BlackHole: Robust community detection inspired by graph drawing.

【Paper Link】【Pages】:25-36

【Authors】: Sungsu Lim ; Junghoon Kim ; Jae-Gil Lee

【Abstract】: With regard to social network analysis, we concentrate on two widely-accepted building blocks: community detection and graph drawing. Although community detection and graph drawing have been studied separately, they have a great commonality, which means that it is possible to advance one field using the techniques of the other. In this paper, we propose a novel community detection algorithm for undirected graphs, called BlackHole, by importing a geometric embedding technique from graph drawing. Our proposed algorithm transforms the vertices of a graph to a set of points on a low-dimensional space whose coordinates are determined by a variant of graph drawing algorithms, following the overall procedure of spectral clustering. The set of points are then clustered using a conventional clustering algorithm to form communities. Our primary contribution is to prove that a common idea in graph drawing, which is characterized by consideration of repulsive forces in addition to attractive forces, improves the clusterability of an embedding. As a result, our algorithm has the advantages of being robust especially when the community structure is not easily detectable. Through extensive experiments, we have shown that BlackHole achieves the accuracy higher than or comparable to the state-of-the-art algorithms.

【Keywords】: Clustering algorithms; Detection algorithms; Force; Image edge detection; Layout; Robustness; Social network services

【Paper Link】【Pages】:37-48

【Authors】: Arijit Khan ; Benjamin Zehnder ; Donald Kossmann

【Abstract】: We study the novel problem of revenue maximization of a social network host that sells viral marketing campaigns to multiple competing campaigners. Each client campaigner informs the social network host about her target users in the network, as well as how much money she is willing to pay to the host if one of her target users buys her product. The social network host, in turn, assigns a set of seed users to each of her client campaigners. The seed set for a campaigner is a limited number of users to whom the campaigner provides free samples, discounted price etc. with the expectation that these seed users will buy her product, and would also be able to influence many of her target users in the network towards buying her product. Because of various product-adoption costs, it is very unlikely that an average user will purchase more than one of the competing products. Therefore, from the host's perspective, it is important to assign seed users to client campaigners in such a way that the seed assignment guarantees the maximum aggregated revenue for the host considering all her client campaigners. We formulate our problem by following two well-established influence cascading models: the independent cascade model and the linear threshold model. While our problem using both these models is NP-hard, and neither monotonic, nor sub-modular; we develop approximated algorithms with theoretical performance guarantees. However, as our approximated algorithms often incur higher running times, we also design efficient heuristic methods that empirically perform as good as our approximated algorithms. Our detailed experimental evaluation attests that the proposed techniques are effective and scalable over real-world datasets.

【Keywords】: Algorithm design and analysis; Approximation algorithms; Companies; Heuristic algorithms; Integrated circuit modeling; Social network services

Research Session 1B: Crowdsourcing 4

5. Online mobile Micro-Task Allocation in spatial crowdsourcing.

【Paper Link】【Pages】:49-60

【Authors】: Yongxin Tong ; Jieying She ; Bolin Ding ; Libin Wang ; Lei Chen

【Abstract】: With the rapid development of smartphones, spatial crowdsourcing platforms are getting popular. A foundational research of spatial crowdsourcing is to allocate micro-tasks to suitable crowd workers. Most existing studies focus on offline scenarios, where all the spatiotemporal information of micro-tasks and crowd workers is given. However, they are impractical since micro-tasks and crowd workers in real applications appear dynamically and their spatiotemporal information cannot be known in advance. In this paper, to address the shortcomings of existing offline approaches, we first identify a more practical micro-task allocation problem, called the Global Online Micro-task Allocation in spatial crowdsourcing (GOMA) problem. We first extend the state-of-art algorithm for the online maximum weighted bipartite matching problem to the GOMA problem as the baseline algorithm. Although the baseline algorithm provides theoretical guarantee for the worst case, its average performance in practice is not good enough since the worst case happens with a very low probability in real world. Thus, we consider the average performance of online algorithms, a.k.a online random order model.We propose a two-phase-based framework, based on which we present the TGOA algorithm with 1 over 4 -competitive ratio under the online random order model. To improve its efficiency, we further design the TGOA-Greedy algorithm following the framework, which runs faster than the TGOA algorithm but has lower competitive ratio of 1 over 8. Finally, we verify the effectiveness and efficiency of the proposed methods through extensive experiments on real and synthetic datasets.

【Keywords】: Algorithm design and analysis; Crowdsourcing; Heuristic algorithms; Mobile communication; Real-time systems; Resource management; Spatiotemporal phenomena

6. Crowdsourced POI labelling: Location-aware result inference and Task Assignment.

【Paper Link】【Pages】:61-72

【Authors】: Huiqi Hu ; Yudian Zheng ; Zhifeng Bao ; Guoliang Li ; Jianhua Feng ; Reynold Cheng

【Abstract】: Identifying the labels of points of interest (POIs), aka POI labelling, provides significant benefits in location-based services. However, the quality of raw labels manually added by users or generated by artificial algorithms cannot be guaranteed. Such low-quality labels decrease the usability and result in bad user experiences. In this paper, by observing that crowdsourcing is a best-fit for computer-hard tasks, we leverage crowdsourcing to improve the quality of POI labelling. To our best knowledge, this is the first work on crowdsourced POI labelling tasks. In particular, there are two sub-problems: (1) how to infer the correct labels for each POI based on workers' answers, and (2) how to effectively assign proper tasks to workers in order to make more accurate inference for next available workers. To address these two problems, we propose a framework consisting of an inference model and an online task assigner. The inference model measures the quality of a worker on a POI by elaborately exploiting (i) worker's inherent quality, (ii) the spatial distance between the worker and the POI, and (iii) the POI influence, which can provide reliable inference results once a worker submits an answer. As workers are dynamically coming, the online task assigner judiciously assigns proper tasks to them so as to benefit the inference. The inference model and task assigner work alternately to continuously improve the overall quality. We conduct extensive experiments on a real crowdsourcing platform, and the results on two real datasets show that our method significantly outperforms state-of-the-art approaches.

【Keywords】: Computational modeling; Computer science; Crowdsourcing; Labeling; Mobile radio mobility management; Random variables; Reliability

7. Mutual benefit aware task assignment in a bipartite labor market.

【Paper Link】【Pages】:73-84

【Authors】: Liu Zheng ; Lei Chen

【Abstract】: As one of the three major steps (question design, task assignment, answer aggregation) in crowdsourcing, task assignment directly affects the quality of the crowdsourcing result. A good assignment will not only improve the answers' quality, but also boost the workers' willingness to participate. Although a lot of works have been made to produce better assignment, most of them neglected one of its most important properties: the bipartition, which exists widely in real world scenarios. Such ignorance greatly limits their application under general settings.

【Keywords】: Algorithm design and analysis; Computer science; Crowdsourcing; Electronic mail; Graphical models; Labeling; Probabilistic logic

8. Computing Connected Components with linear communication cost in pregel-like systems.

【Paper Link】【Pages】:85-96

【Authors】: Xing Feng ; Lijun Chang ; Xuemin Lin ; Lu Qin ; Wenjie Zhang

【Abstract】: The paper studies two fundamental problems in graph analytics: computing Connected Components (CCs) and computing BiConnected Components (BCCs) of a graph. With the recent advent of Big Data, developing effcient distributed algorithms for computing CCs and BCCs of a big graph has received increasing interests. As with the existing research efforts, in this paper we focus on the Pregel programming model, while the techniques may be extended to other programming models including MapReduce and Spark. The state-of-the-art techniques for computing CCs and BCCs in Pregel incur O(m ¿¿ #supersteps) total costs for both data communication and computation, where m is the number of edges in a graph and #supersteps is the number of supersteps. Since the network communication speed is usually much slower than the computation speed, communication costs are the dominant costs of the total running time in the existing techniques. In this paper, we propose a new paradigm based on graph decomposition to reduce the total communication costs from O(m¿¿#supersteps) to O(m), for both computing CCs and computing BCCs. Moreover, the total computation costs of our techniques are smaller than that of the existing techniques in practice, though theoretically they are almost the same. Comprehensive empirical studies demonstrate that our approaches can outperform the existing techniques by one order of magnitude regarding the total running time.

【Keywords】: Algorithm design and analysis; Australia; Computational modeling; Data communication; Image color analysis; Labeling; Phase change random access memory

Research Session 2A: Graph Algorithmics 4

9. VColor: A practical vertex-cut based approach for coloring large graphs.

【Paper Link】【Pages】:97-108

【Authors】: Yun Peng ; Byron Choi ; Bingsheng He ; Shuigeng Zhou ; Ruzhi Xu ; Xiaohui Yu

【Abstract】: Graph coloring is a fundamental NP-hard problem in graph theory. It has a wide range of real applications, such as Operations Research, Communication Network, Computational Biology and Compiler Optimization. Notable efforts have been spent on designing its approximation algorithms. Halldrsson proposed the algorithm (denoted as SampleIS) with the current best known approximation ratio. However, its time complexity is O(|G|3), where |G| is the number of vertices of a graph G. It is clear that SampleIS is not practical for large graphs. In this paper, we propose a practical vertex-cut based coloring technique (VColor) for coloring large graphs. First, we partition G into k connected components (CCs) of a small size s by removing a vertex-cut component (VCC). For each CC, we apply our novel coloring algorithm, based on maximal independent set enumeration. The approximation ratio and the time complexity for coloring the k CCs are log s + 1 and O(ks23s/3), respectively, whereas those of SampleIS are ks(log log ks)2/ log3 ks and O(k3s3). For the VCC, we simply apply SampleIS. To combine the colorings of the CCs and the VCC, we propose a maximum matching based algorithm. Second, in the context of a database of graphs, users may color many graphs. We propose an optimization technique, inspired by multi-query optimization, for coloring a set of graphs. We design a VP hierarchy (VPH) to represent the common subgraphs as the common CCs. Third, we propose techniques for determining the optimal values of the parameters of VColor. Our extensive experimental evaluation on real-world graphs confirms the efficiency and/or effectiveness of our proposed techniques. In particular, VColor is more than 500 times faster than SampleIS, and the number of colors used are comparable on real graphs Yeast and LS.

【Keywords】: Algorithm design and analysis; Approximation algorithms; Color; Electronic mail; Interference; Optimization; Partitioning algorithms

10. Compressing graphs by grammars.

【Paper Link】【Pages】:109-120

【Authors】: Sebastian Maneth ; Fabian Peternek

【Abstract】: We present a new graph compressor that detects repeating substructures and represents them by grammar rules. We show that for a large number of graphs the compressor obtains smaller representations than other approaches. For RDF graphs and version graphs it outperforms the best known previous methods. Specific queries such as reachability between two nodes, can be evaluated in linear time over the grammar, thus allowing speed-ups proportional to the compression ratio.

【Keywords】: Approximation algorithms; Databases; Electronic mail; Grammar; Informatics; Maintenance engineering; Resource description framework

11. Planar: Parallel lightweight architecture-aware adaptive graph repartitioning.

【Paper Link】【Pages】:121-132

【Authors】: Angen Zheng ; Alexandros Labrinidis ; Panos K. Chrysanthis

【Abstract】: Graph partitioning is an essential preprocessing step in distributed graph computation and scientific simulations. Existing well-studied graph partitioners are designed for static graphs, but real-world graphs, such as social networks and Web networks, keep changing dynamically. In fact, the communication and computation patterns of some graph algorithms may vary significantly, even across their different computation phases. This means that the optimal partitioning changes over time, requiring the graph to be repartitioned periodically to maintain good performance. However, the state-of-the-art graph (re)partitioners are known for their poor scalability against massive graphs. Furthermore, they usually assume a homogeneous and contention-free computing environment, which is no longer true in modern high performance computing infrastructures. In this paper, we introduce Planar, a parallel lightweight graph repartitioner, which does not require full knowledge of the graph and incrementally adapts the partitioning to changes while considering the heterogeneity and contentiousness of the underlying computing infrastructure. Using a diverse collection of datasets, we showed that, in comparison with the de-facto standard and two state-of-the-art streaming graph partitioning heuristics, Planar improved the quality of graph partitionings by up to 68%, 46%, and 69%, respectively. Furthermore, our experiments with an MPI implementation of Breadth First Search and Single Source Shortest Path showed that Planar achieved up to 10x speedups against the state-of-the-art streaming and multi-level graph (re)partitioners. Finally, we scaled Planar up to a graph with 3.6 billion edges.

【Keywords】: Big data; Computational modeling; Data communication; Heuristic algorithms; Partitioning algorithms; Scalability; Social network services

12. I/O efficient Core Graph Decomposition at web scale.

【Paper Link】【Pages】:133-144

【Authors】: Dong Wen ; Lu Qin ; Ying Zhang ; Xuemin Lin ; Jeffrey Xu Yu

【Abstract】: Core decomposition is a fundamental graph problem with a large number of applications. Most existing approaches for core decomposition assume that the graph is kept in memory of a machine. Nevertheless, many real-world graphs are big and may not reside in memory. In the literature, there is only one work for I/O efficient core decomposition that avoids loading the whole graph in memory. However, this approach is not scalable to handle big graphs because it cannot bound the memory size and may load most parts of the graph in memory. In addition, this approach can hardly handle graph updates. In this paper, we study I/O efficient core decomposition following a semi-external model, which only allows node information to be loaded in memory. This model works well in many web-scale graphs. We propose a semi-external algorithm and two optimized algorithms for I/O efficient core decomposition using very simple structures and data access model. To handle dynamic graph updates, we show that our algorithm can be naturally extended to handle edge deletion. We also propose an I/O efficient core maintenance algorithm to handle edge insertion, and an improved algorithm to further reduce I/O and CPU cost by investigating some new graph properties. We conduct extensive experiments on 12 real large graphs. Our optimal algorithm significantly outperform the existing I/O efficient algorithm in terms of both processing time and memory consumption. In many memory-resident graphs, our algorithms for both core decomposition and maintenance can even outperform the in-memory algorithm due to the simple structures and data access model used. Our algorithms are very scalable to handle web-scale graphs. As an example, we are the first to handle a web graph with 978.5 million nodes and 42.6 billion edges using less than 4.2 GB memory.

【Keywords】: Algorithm design and analysis; Data models; Heuristic algorithms; Load modeling; Maintenance engineering; Memory management; Optimization

Research Session 2B: Beyond Relational Query Processing 4

13. Reachability and time-based path queries in temporal graphs.

【Paper Link】【Pages】:145-156

【Authors】: Huanhuan Wu ; Yuzhen Huang ; James Cheng ; Jinfeng Li ; Yiping Ke

【Abstract】: A temporal graph is a graph in which vertices communicate with each other at specific time, e.g., A calls B at 11 a.m. and talks for 7 minutes, which is modeled by an edge from A to B with starting time ¿¿¿11 a.m.¿¿¿ and duration ¿¿¿7 mins¿¿¿. Temporal graphs can be used to model many networks with time-related activities, but efficient algorithms for analyzing temporal graphs are severely inadequate. We study fundamental problems such as answering reachability and time-based path queries in a temporal graph, and propose an efficient indexing technique specifically designed for processing these queries in a temporal graph. Our results show that our method is efficient and scalable in both index construction and query processing.

【Keywords】: Algorithm design and analysis; Electronic mail; Indexing; Labeling; Query processing; Social network services

14. Scalable supergraph search in large graph databases.

【Paper Link】【Pages】:157-168

【Authors】: Bingqing Lyu ; Lu Qin ; Xuemin Lin ; Lijun Chang ; Jeffrey Xu Yu

【Abstract】: Supergraph search is a fundamental problem in graph databases that is widely applied in many application scenarios. Given a graph database and a query-graph, supergraph search retrieves all data-graphs contained in the query-graph from the graph database. Most existing solutions for supergraph search follow the pruning-and-verification framework, which prunes false answers based on features in the pruning phase and performs subgraph isomorphism testings on the remaining graphs in the verification phase. However, they are not scalable to handle large-sized data-graphs and query-graphs due to three drawbacks. First, they rely on a frequent subgraph mining algorithm to select features which is expensive and cannot generate large features. Second, they require a costly verification phase. Third, they process features in a fixed order without considering their relationship to the query-graph. In this paper, we address the three drawbacks and propose new indexing and query processing algorithms. In indexing, we select features directly from the data-graphs without expensive frequent subgraph mining. The features form a feature-tree that contains all-sized features and both the cost sharing and pruning power of the features are considered. In query processing, we propose a verification-free algorithm, where the order to process features is query-dependent by considering both the cost sharing and the pruning power. We explore two optimization strategies to further improve the algorithm efficiency. The first strategy applies a lightweight graph compression technique and the second strategy optimizes the inclusion of answers. Finally, we conduct extensive performance studies on two real large datasets to demonstrate the high scalability of our algorithms.

【Keywords】: Computational efficiency; Optimization; Partitioning algorithms; Query processing; Search problems; Testing

15. Hobbes3: Dynamic generation of variable-length signatures for efficient approximate subsequence mappings.

【Paper Link】【Pages】:169-180

【Authors】: Jongik Kim ; Chen Li ; Xiaohui Xie

【Abstract】: Recent advances in DNA sequencing have enabled a flood of sequencing-based applications for studying biology and medicine. A key requirement of these applications is to rapidly and accurately map DNA subsequences to a reference genome. This DNA subsequence mapping problem shares core technical challenges with the similarity query processing problem studied in the database research literature. To solve this problem, existing techniques first extract signatures from a query, then retrieve candidate mapping positions from an index using the extracted signatures, and finally verify the candidate positions. The efficiency of these techniques depends critically on signatures selected from queries, while signature selection relies on an indexing scheme of a reference genome. The q-gram inverted indexing, one of the most widely used indexing schemes, can discover candidate positions quickly, but has the limitation that signatures of queries are restricted to fixed-length q-grams. To address the problem, we propose a flexible way to generate variable-length signatures using a fixed-length q-gram index. The proposed technique groups a few q-grams into a variable-length signature, and generates candidate positions for the variable-length signature using the inverted lists of the q-grams. We also propose a novel dynamic programming algorithm to balance between the filtering power of signatures and the overhead of generating candidate positions for the signatures. Through extensive experiments on both simulated and real genomic data, we show that our technique substantially improves the performance of read mapping in terms of both mapping speed and accuracy.

【Keywords】: Bioinformatics; DNA; Genomics; Hamming distance; Indexing; Query processing

16. NoSE: Schema design for NoSQL applications.

【Paper Link】【Pages】:181-192

【Authors】: Michael J. Mior ; Kenneth Salem ; Ashraf Aboulnaga ; Rui Liu

【Abstract】: Database design is critical for high performance in relational databases and many tools exist to aid application designers in selecting an appropriate schema. While the problem of schema optimization is also highly relevant for NoSQL databases, existing tools for relational databases are inadequate for this setting. Application designers wishing to use a NoSQL database instead rely on rules of thumb to select an appropriate schema. We present a system for recommending database schemas for NoSQL applications. Our cost-based approach uses a novel binary integer programming formulation to guide the mapping from the application's conceptual data model to a database schema. We implemented a prototype of this approach for the Cassandra extensible record store. Our prototype, the NoSQL Schema Evaluator (NoSE) is able to capture rules of thumb used by expert designers without explicitly encoding the rules. Automating the design process allows NoSE to produce efficient schemas and to examine more alternatives than would be possible with a manual rule-based approach.

【Keywords】: Data models; Decision making; Nose; Optimization; Physical design; Relational databases

Research Session 2C: Privacy 4

17. Towards Virtual Private NoSQL datastores.

【Paper Link】【Pages】:193-204

【Authors】: Pietro Colombo ; Elena Ferrari

【Abstract】: Many modern applications use context related information to provide highly personalized services, and use NoSQL databases for data management, as these systems show outstanding performance and support high volumes of data. However, NoSQL databases integrate poor data protection features with basic coarse grained access control and no support for context aware policies. Therefore, we believe that a general approach is required to enhance NoSQL datastores with fine grained context aware access control. In this paper, we start to fill this void by targeting MongoDB, a very popular datastore. The contribution is twofold. We enhance MongoDB's access control model with advanced features and we define an enforcement monitor for the proposed enhanced model, which can be straightforwardly used in any MongoDB deployment. Technological limitations of MongoDB do not allow implementing the same efficient enforcement mechanism for all query types. As a consequence, experimental results show an enforcement overhead that is significant for aggregate queries, which contrasts with a low overhead measured for find and map-reduce queries.

【Keywords】: Access control; Aggregates; Context; Context-aware services; Databases; Electronic mail; Servers

18. Differentially private multi-party high-dimensional data publishing.

【Paper Link】【Pages】:205-216

【Authors】: Sen Su ; Peng Tang ; Xiang Cheng ; Rui Chen ; Zequn Wu

【Abstract】: In this paper, we study the novel problem of publishing high-dimensional data in a distributed multi-party environment under differential privacy. In particular, with the assistance of a semi-trusted curator, the involved parties (i.e., local data owners) collectively generate a synthetic integrated dataset while satisfying ¿¿-differential privacy for any local dataset. To solve this problem, we present a differentially private sequential update of Bayesian network (DP-SUBN) solution. In DP-SUBN, the parties and the curator collaboratively identify the Bayesian network ¿¿¿ that best fits the integrated dataset D in a sequential manner, from which a synthetic dataset can then be generated. The fundamental advantage of adopting the sequential update manner is that the parties can treat the statistical results provided by previous parties as their prior knowledge to direct how to learn ¿¿¿. The core of DP-SUBN is the construction of the search frontier, which can be seen as a priori knowledge to guide the parties to update ¿¿¿. To improve the fitness of ¿¿¿ and reduce the communication cost, we introduce a correlation-aware search frontier construction (CSFC) approach, where attribute pairs with strong correlations are used to construct the search frontier. In particular, to privately quantify the correlations of attribute pairs without introducing too much noise, we first propose a non-overlapping covering design (NOCD) method, and then introduce a dynamic programming method to find the optimal parameters used in NOCD to ensure that the injected noise is minimum. Through formal privacy analysis, we show that DP-SUBN satisfies ¿¿-differential privacy for any local dataset. Extensive experiments on a real dataset demonstrate that DP-SUBN offers desirable data utility with low communication cost.

【Keywords】: Privacy

19. Optimizing secure classification performance with privacy-aware feature selection.

【Paper Link】【Pages】:217-228

【Authors】: Erman Pattuk ; Murat Kantarcioglu ; Huseyin Ulusoy ; Bradley Malin

【Abstract】: Recent advances in personalized medicine point towards a future where clinical decision making will be dependent upon the individual characteristics of the patient, e.g., their age, race, genomic variation, and lifestyle. Already, there are numerous commercial entities working towards the provision of software to support such decisions as cloud-based services. However, deployment of such services in such settings raises important challenges for privacy. A recent attack shows that disclosing personalized drug dosage recommendations, combined with several pieces of demographic knowledge, can be leveraged to infer single nucleotide polymorphism variants of a patient. One manner to prevent such inference is to apply secure multi-party computation (SMC) techniques that hide all patient data, so that no information, including the clinical recommendation, is disclosed during the decision making process. Yet, SMC is a computationally cumbersome process and disclosing some information may be necessary for various compliance purposes. Additionally, certain information (e.g., demographic information) may already be publicly available. In this work, we provide a novel approach to selectively disclose certain information before the SMC process to significantly improve personalized decision making performance while preserving desired levels of privacy. To achieve this goal, we introduce mechanisms to quickly compute the loss in privacy due to information disclosure while considering its performance impact on SMC execution phase. Our empirical analysis show that we can achieve up to three orders of magnitude improvement compared to pure SMC solutions with only a slight increase in privacy risks.

【Keywords】: Bioinformatics; Decision trees; Genomics; Medical services; Privacy; Protocols; Servers

20. Differentially private frequent subgraph mining.

【Paper Link】【Pages】:229-240

【Authors】: Shengzhi Xu ; Sen Su ; Li Xiong ; Xiang Cheng ; Ke Xiao

【Abstract】: Mining frequent subgraphs from a collection of input graphs is an important topic in data mining research. However, if the input graphs contain sensitive information, releasing frequent subgraphs may pose considerable threats to individual's privacy. In this paper, we study the problem of frequent subgraph mining (FGM) under the rigorous differential privacy model. We introduce a novel differentially private FGM algorithm, which is referred to as DFG. In this algorithm, we first privately identify frequent subgraphs from input graphs, and then compute the noisy support of each identified frequent subgraph. In particular, to privately identify frequent subgraphs, we present a frequent subgraph identification approach which can improve the utility of frequent subgraph identifications through candidates pruning. Moreover, to compute the noisy support of each identified frequent subgraph, we devise a lattice-based noisy support derivation approach, where a series of methods has been proposed to improve the accuracy of the noisy supports. Through formal privacy analysis, we prove that our DFG algorithm satisfies ¿¿-differential privacy. Extensive experimental results on real datasets show that the DFG algorithm can privately find frequent subgraphs with high data utility.

【Keywords】: Algorithm design and analysis; Data privacy; Estimation; Itemsets; Noise measurement; Privacy

Research Session 3A: Graph Proximity 4

21. Being prepared in a sparse world: The case of KNN graph construction.

【Paper Link】【Pages】:241-252

【Authors】: Antoine Boutet ; Anne-Marie Kermarrec ; Nupur Mittal ; François Taïani

【Abstract】: K-Nearest-Neighbor (KNN) graphs have emerged as a fundamental building block of many on-line services providing recommendation, similarity search and classification. Constructing a KNN graph rapidly and accurately is, however, a computationally intensive task. As data volumes keep growing, speed and the ability to scale out are becoming critical factors when deploying a KNN algorithm. In this work, we present KIFF, a generic, fast and scalable KNN graph construction algorithm. KIFF directly exploits the bipartite nature of most datasets to which KNN algorithms are applied. This simple but powerful strategy drastically limits the computational cost required to rapidly converge to an accurate KNN solution, especially for sparse datasets. Our evaluation on a representative range of datasets show that KIFF provides, on average, a speed-up factor of 14 against recent state-of-the art solutions while improving the quality of the KNN approximation by 18%.

【Keywords】: Art; Artificial neural networks; Bipartite graph; IP networks; Measurement; Motion pictures; Search problems

22. pSCAN: Fast and exact structural graph clustering.

【Paper Link】【Pages】:253-264

【Authors】: Lijun Chang ; Wei Li ; Xuemin Lin ; Lu Qin ; Wenjie Zhang

【Abstract】: In this paper, we study the problem of structural graph clustering, a fundamental problem in managing and analyzing graph data. Given a large graph G = (V, E), structural graph clustering is to assign vertices in V to clusters and to identify the sets of hub vertices and outlier vertices as well, such that vertices in the same cluster are densely connected to each other while vertices in different clusters are loosely connected to each other. Firstly, we prove that the existing SCAN approach is worst-case optimal. Nevertheless, it is still not scalable to large graphs due to exhaustively computing structural similarity for every pair of adjacent vertices. Secondly, we make three observations about structural graph clustering, which present opportunities for further optimization. Based on these observations, in this paper we develop a new two-step paradigm for scalable structural graph clustering. Thirdly, following this paradigm, we present a new approach aiming to reduce the number of structural similarity computations. Moreover, we propose optimization techniques to speed up checking whether two vertices are structure-similar to each other. Finally, we conduct extensive performance studies on large real and synthetic graphs, which demonstrate that our new approach outperforms the state-of-the-art approaches by over one order of magnitude. Noticeably, for the twitter graph with 1 billion edges, our approach takes 25 minutes while the state-of-the-art approach cannot finish even after 24 hours.

【Keywords】: Australia; Biology; Collaboration; Data mining; Data models; Optimization; Social network services

23. CSI_GED: An efficient approach for graph edit similarity computation.

【Paper Link】【Pages】:265-276

【Authors】: Karam Gouda ; Mosab Hassaan

【Abstract】: Graph similarity is a basic and essential operation in many applications. In this paper, we are interested in computing graph similarity based on edit distance. Existing graph edit distance computation methods adopt the best first search paradigm A*. These methods are time and space bound. In practice, they can compute the edit distance of graphs containing 12 vertices at most. To enable graph edit similarity computation on larger and distant graphs, we present CSI_GED, a novel edge-based mapping method for computing graph edit distance through common sub-structure isomorphisms enumeration. CSI_GED utilizes backtracking search combined with a number of heuristics to reduce memory requirements and quickly prune away a large portion of the mapping search space. Experiments show that CSI_GED is highly efficient for computing the edit distance on small as well as large and distant graphs. Furthermore, we evaluated CSI_GED as a stand-alone graph edit similarity search query method. The experiments show that CSI_GED is effective and scalable, and outperforms the state-of-the-art indexing-based methods by over two orders of magnitude.

【Keywords】: Compounds; Drugs; Electronic mail; Informatics; Memory management; Search problems; Upper bound

24. Semantic proximity search on graphs with metagraph-based learning.

【Paper Link】【Pages】:277-288

【Authors】: Yuan Fang ; Wenqing Lin ; Vincent Wenchen Zheng ; Min Wu ; Kevin Chen-Chuan Chang ; Xiao-Li Li

【Abstract】: Given ubiquitous graph data such as the Web and social networks, proximity search on graphs has been an active research topic. The task boils down to measuring the proximity between two nodes on a graph. Although most earlier studies deal with homogeneous or bipartite graphs only, many real-world graphs are heterogeneous with objects of various types, giving rise to different semantic classes of proximity. For instance, on a social network two users can be close for different reasons, such as being classmates or family members, which represent two distinct classes of proximity. Thus, it becomes inadequate to only measure a ¿¿¿generic¿¿¿ form of proximity as previous works have focused on. In this paper, we identify metagraphs as a novel and effective means to characterize the common structures for a desired class of proximity. Subsequently, we propose a family of metagraph-based proximity, and employ a supervised technique to automatically learn the right form of proximity within its family to suit the desired class. As it is expensive to match (i.e., find the instances of) a metagraph, we propose the novel approaches of dual-stage training and symmetry-based matching to speed up. Finally, our experiments reveal that our approach is significantly more accurate and efficient. For accuracy, we outperform the baselines by 11% and 16% in NDCG and MAP, respectively. For efficiency, dual-stage training reduces the overall matching cost by 83%, and symmetry-based matching further decreases the cost of individual metagraphs by 52%.

【Keywords】: Companies; Context; Real-time systems; Search problems; Semantics; Social network services; Training

Research Session 3B: Scalable Query Processing 4

25. Private spatial data aggregation in the local setting.

【Paper Link】【Pages】:289-300

【Authors】: Rui Chen ; Haoran Li ; A. K. Qin ; Shiva Prasad Kasiviswanathan ; Hongxia Jin

【Abstract】: With the deep penetration of the Internet and mobile devices, privacy preservation in the local setting has become increasingly relevant. The local setting refers to the scenario where a user is willing to share his/her information only if it has been properly sanitized before leaving his/her own device. Moreover, a user may hold only a single data element to share, instead of a database. Despite its ubiquitousness, the above constraints make the local setting substantially more challenging than the traditional centralized or distributed settings. In this paper, we initiate the study of private spatial data aggregation in the local setting, which finds its way in many real-world applications, such as Waze and Google Maps. In response to users' varied privacy requirements that are natural in the local setting, we propose a new privacy model called personalized local differential privacy (PLDP) that allows to achieve desirable utility while still providing rigorous privacy guarantees. We design an efficient personalized count estimation protocol as a building block for achieving PLDP and give theoretical analysis of its utility, privacy and complexity. We then present a novel framework that allows an untrusted server to accurately learn the user distribution over a spatial domain while satisfying PLDP for each user. This is mainly achieved by designing a novel user group clustering algorithm tailored to our problem. We confirm the effectiveness and efficiency of our framework through extensive experiments on multiple real benchmark datasets.

【Keywords】: Data privacy; Distributed databases; Privacy; Protocols; Servers; Spatial databases

26. A novel, low-latency algorithm for multiple Group-By query optimization.

【Paper Link】【Pages】:301-312

【Authors】: Duy-Hung Phan ; Pietro Michiardi

【Abstract】: Data summarization is essential for users to interact with data. Current state of the art algorithms to optimize its most general form, the multiple Group By queries, have limitations in scalability. In this paper, we propose a novel algorithm, Top-Down Splitting, that scales to hundreds or even thousands of attributes and queries, and that quickly and efficiently produces optimized query execution plans. We analyze the complexity of our algorithm, and evaluate, empirically, its scalability and effectiveness through an experimental campaign. Results show that our algorithm is remarkably faster than alternatives in prior works, while generally producing better solutions. Ultimately, our algorithm reduces up to 34% the query execution time, when compared to un-optimized plans.

【Keywords】: Aggregates; Data analysis; Diseases; Optimization; Query processing; Scalability

27. Load balancing and skew resilience for parallel joins.

【Paper Link】【Pages】:313-324

【Authors】: Aleksandar Vitorovic ; Mohammed Elseidy ; Christoph Koch

【Abstract】: We address the problem of load balancing for parallel joins.We show that the distribution of input data received and the output data produced by worker machines are both important for performance. As a result, previous work, which optimizes either for input or output, stands ineffective for load balancing. To that end, we propose a multi-stage load-balancing algorithm which considers the properties of both input and output data through sampling of the original join matrix. To do this efficiently, we propose a novel category of equi-weight histograms. To build them, we exploit state-of-the-art computational geometry algorithms for rectangle tiling. To our knowledge, we are the first to employ tiling algorithms for join load-balancing. In addition, we propose a novel, join-specialized tiling algorithm that has drastically lower time and space complexity than existing algorithms. Experiments show that our scheme outperforms state-of-the-art techniques by up to a factor of 15.

【Keywords】: Complexity theory; Computational geometry; Computer architecture; Histograms; Load management; Partitioning algorithms; Sparse matrices

28. Platform-independent robust query processing.

【Paper Link】【Pages】:325-336

【Authors】: Srinivas Karthik ; Jayant R. Haritsa ; Sreyash Kenkre ; Vinayaka Pandit

【Abstract】: To address the classical selectivity estimation problem in databases, a radically different approach called PlanBouquet was recently proposed in [3], wherein the estimation process is completely abandoned and replaced with a calibrated discovery mechanism. The beneficial outcome of this new construction is that, for the first time, provable guarantees are obtained on worst-case performance, thereby facilitating robust query processing.

【Keywords】: Benchmark testing; Electronic mail; Engines; Query processing; Robustness

Research Session 3C: Preference and Trust 4

29. Authentication of function queries.

【Paper Link】【Pages】:337-348

【Authors】: Guolei Yang ; Ying Cai ; Zhenbi Hu

【Abstract】: Consider a database where each record represents a math function. A third party is in charge of processing queries over this database and we want to provide a mechanism for users to verify the correctness of their query results. Here each query, referred to as a Function Query (FQ), retrieves the functions whose computation results with user-supplied arguments satisfy certain conditions (e.g., within a certain range). We present authentication solutions that work on a variety of functions, including univariate linear function, multivariate linear function, and multivariate high degree function. Our solutions are based on the fact that the functions can be sorted in the subdomains defined by their intersections and thus can be chained to produce a signature mesh for query result verification. We study the performance of the proposed techniques through theoretical analysis, simulation and empirical study, and include the results in this paper.

【Keywords】: Analytical models; Authentication; Computer science; Databases; Electronic mail; Outsourcing; Servers

30. "Told you i didn't like it": Exploiting uninteresting items for effective collaborative filtering.

【Paper Link】【Pages】:349-360

【Authors】: Won-Seok Hwang ; Juan Parc ; Sang-Wook Kim ; Jongwuk Lee ; Dongwon Lee

【Abstract】: We study how to improve the accuracy and running time of top-N recommendation with collaborative filtering (CF). Unlike existing works that use mostly rated items (which is only a small fraction in a rating matrix), we propose the notion of pre-use preferences of users toward a vast amount of unrated items. Using this novel notion, we effectively identify uninteresting items that were not rated yet but are likely to receive very low ratings from users, and impute them as zero. This simple-yet-novel zero-injection method applied to a set of carefully-chosen uninteresting items not only addresses the sparsity problem by enriching a rating matrix but also completely prevents uninteresting items from being recommended as top-N items, thereby improving accuracy greatly. As our proposed idea is method-agnostic, it can be easily applied to a wide variety of popular CF methods. Through comprehensive experiments using the Movielens dataset and MyMediaLite implementation, we successfully demonstrate that our solution consistently and universally improves the accuracies of popular CF methods (e.g., item-based CF, SVD-based CF, and SVD++) by two to five orders of magnitude on average. Furthermore, our approach reduces the running time of those CF methods by 1.2 to 2.3 times when its setting produces the best accuracy. The datasets and codes that we used in experiments are available at: https://goo.gl/KUrmip

【Keywords】: Collaboration; Computer science; Motion pictures; Proposals; Recommender systems

31. Practical private shortest path computation based on Oblivious Storage.

【Paper Link】【Pages】:361-372

【Authors】: Dong Xie ; Guanru Li ; Bin Yao ; Xuan Wei ; Xiaokui Xiao ; Yunjun Gao ; Minyi Guo

【Abstract】: As location-based services (LBSs) become popular, location-dependent queries have raised serious privacy concerns since they may disclose sensitive information in query processing. Among typical queries supported by LBSs, shortest path queries may reveal information about not only current locations of the clients, but also their potential destinations and travel plans. Unfortunately, existing methods for private shortest path computation suffer from issues of weak privacy property, low performance or poor scalability. In this paper, we aim at a strong privacy guarantee, where the adversary cannot infer almost any information about the queries, with better performance and scalability. To achieve this goal, we introduce a general system model based on the concept of Oblivious Storage (OS), which can deal with queries requiring strong privacy properties. Furthermore, we propose a new oblivious shuffle algorithm to optimize an existing OS scheme. By making trade-offs between query performance, scalability and privacy properties, we design different schemes for private shortest path computation. Eventually, we comprehensively evaluate our schemes upon real road networks in a practical environment and show their efficiency.

【Keywords】: Computational modeling; Outsourcing; Privacy; Protocols; Roads; Scalability; Servers; Data Privacy; Oblivious Storage; Road Network; Shortest Path

【Paper Link】【Pages】:373-384

【Authors】: Xun Yi ; Elisa Bertino ; Fang-Yu Rao ; Athman Bouguettaya

【Abstract】: In this paper, we consider a scenario where a user queries a user profile database, maintained by a social networking service provider, to find out some users whose profiles are similar to the profile specified by the querying user. A typical example of this application is online dating. Most recently, an online data site, Ashley Madison, was hacked, which results in disclosure of a large number of dating user profiles. This serious data breach has urged researchers to explore practical privacy protection for user profiles in online dating. In this paper, we give a privacy-preserving solution for user profile matching in social networks by using multiple servers. Our solution is built on homomorphic encryption and allows a user to find out some matching users with the help of the multiple servers without revealing to anyone privacy of the query and the queried user profiles. Our solution achieves user profile privacy and user query privacy as long as at least one of the multiple servers is honest. Our implementation and experiments demonstrate that our solution is practical.

【Keywords】: Databases; Encryption; Privacy; Protocols; Servers; Social network services

Research Session 4A: Graph Mining 4

33. An embedding approach to anomaly detection.

【Paper Link】【Pages】:385-396

【Authors】: Renjun Hu ; Charu C. Aggarwal ; Shuai Ma ; Jinpeng Huai

【Abstract】: Network anomaly detection has become very popular in recent years because of the importance of discovering key regions of structural inconsistency in the network. In addition to application-specific information carried by anomalies, the presence of such structural inconsistency is often an impediment to the effective application of data mining algorithms such as community detection and classification. In this paper, we study the problem of detecting structurally inconsistent nodes that connect to a number of diverse influential communities in large social networks. We show that the use of a network embedding approach, together with a novel dimension reduction technique, is an effective tool to discover such structural inconsistencies. We also experimentally show that the detection of such anomalous nodes has significant applications: one is the specific use of detected anomalies, and the other is the improvement of the effectiveness of community detection.

【Keywords】: Algorithm design and analysis; Couplings; Data mining; Image edge detection; Optimization; Prediction algorithms; Social network services

34. Cross-layer betweenness centrality in multiplex networks with applications.

【Paper Link】【Pages】:397-408

【Authors】: Tanmoy Chakraborty ; Ramasuri Narayanam

【Abstract】: Several real-life social systems witness the presence of multiple interaction types (or layers) among the entities, thus establishing a collection of co-evolving networks, known as multiplex networks. More recently, there has been a significant interest in developing certain centrality measures in multiplex networks to understand the influential power of the entities (to be referred as vertices or nodes hereafter). In this paper, we consider the problem of studying how frequently the nodes occur on the shortest paths between other nodes in the multiplex networks. As opposed to simplex networks, the shortest paths between nodes can possibly traverse through multiple layers in multiplex networks. Motivated by this phenomenon, we propose a new metric to address the above problem and we call this new metric cross-layer betweenness centrality (CBC). Our definition of CBC measure takes into account the interplay among multiple layers in determining the shortest paths in multiplex networks. We propose an efficient algorithm to compute CBC and show that it runs much faster than the na¿¿ve computation of this measure. We show the efficacy of the proposed algorithm using thorough experimentation on two real-world multiplex networks. We further demonstrate the practical utility of CBC by applying it in the following three application contexts: discovering non-overlapping community structure in multiplex networks, identifying interdisciplinary researchers from a multiplex co-authorship network, and the initiator selection for message spreading. In all these application scenarios, the respective solution methods based on the proposed CBC are found to be significantly better performing than that of the corresponding benchmark approaches.

【Keywords】: Context; Detection algorithms; Electronic mail; Mathematical model; Multiplexing; Power measurement

35. NXgraph: An efficient graph processing system on a single machine.

【Paper Link】【Pages】:409-420

【Authors】: Yuze Chi ; Guohao Dai ; Yu Wang ; Guangyu Sun ; Guoliang Li ; Huazhong Yang

【Abstract】: Recent studies show that graph processing systems on a single machine can achieve competitive performance compared with cluster-based graph processing systems. In this paper, we present NXgraph, an efficient graph processing system on a single machine. We propose the Destination-Sorted Sub-Shard (DSSS) structure to store a graph. To ensure graph data access locality and enable fine-grained scheduling, NXgraph divides vertices and edges into intervals and sub-shards. To reduce write conflicts among different threads and achieve a high degree of parallelism, NXgraph sorts edges within each sub-shard according to their destination vertices. Then, three updating strategies, i.e., Single-Phase Update (SPU), Double-Phase Update (DPU), and Mixed-Phase Update (MPU), are proposed in this paper. NXgraph can adaptively choose the fastest strategy for different graph problems according to the graph size and the available memory resources to fully utilize the memory space and reduce the amount of data transfer. All these three strategies exploit streamlined disk access patterns. Extensive experiments on three real-world graphs and five synthetic graphs show that NXgraph outperforms GraphChi, TurboGraph, VENUS, and GridGraph in various situations. Moreover, NXgraph, running on a single commodity PC, can finish an iteration of PageRank on the Twitter [1] graph with 1.5 billion edges in 2.05 seconds; while PowerGraph, a distributed graph processing system, needs 3.6s to finish the same task on a 64-node cluster.

【Keywords】: Computational modeling; Data transfer; Memory management; Parallel processing; Spread spectrum communication; Twitter; Venus

【Paper Link】【Pages】:421-432

【Authors】: Hongwei Liang ; Ke Wang ; Feida Zhu

【Abstract】: Summarizing patterns of connections or social ties in a social network, in terms of attributes information on nodes and edges, holds a key to the understanding of how the actors interact and form relationships. We formalize this problem as mining top-k group relationships (GRs), which captures strong social ties between groups of actors. While existing works focus on patterns that follow from the well known homophily principle, we are interested in social ties that do not follow from homophily, thus, provide new insights. Finding top-k GRs faces new challenges: it requires a novel ranking metric because traditional metrics favor patterns that are expected from the homophily principle; it requires an innovative search strategy since there is no obvious anti-monotonicity for such GRs; it requires a novel data structure to avoid data explosion caused by multidimensional nodes and edges and many-to-many relationships in a social network. We address these issues through presenting an efficient algorithm, GRMiner, for mining top-k GRs and we evaluate its effectiveness and efficiency using real data.

【Keywords】: Data mining; Education; Facebook; Image edge detection; Measurement; Search problems

Research Session 4B: Conscious Big Data Processing 4

37. Self-Adaptive Linear Hashing for solid state drives.

【Paper Link】【Pages】:433-444

【Authors】: Chengcheng Yang ; Peiquan Jin ; Lihua Yue ; Dezhi Zhang

【Abstract】: Flash memory based solid state drives (SSDs) have emerged as a new alternative to replace magnetic disks due to their high performance and low power consumption. However, random writes on SSDs are much slower than SSD reads. Therefore, traditional index structures, which are designed based on the symmetrical I/O property of magnetic disks, cannot completely exert the high performance of SSDs. In this paper, we propose an SSD-optimized linear hashing index called Self-Adaptive Linear Hashing (SAL-Hashing) to reduce small random writes to SSDs that are caused by index operations. The contributions of our work are manifold. First, we propose to organize buckets into groups and sets to facilitate coarse-grained writes and lazy-split so as to avoid intermediate writes on the hash structure. A group consists of a fixed number of buckets and a set consists of a number of groups. Second, we attach a log region to each set, and amortize the cost of reads and writes by committing updates to the log region in batch. Third, in order to reduce search cost, each log region is equipped with Bloom filters to index update logs. We devise a cost-based online algorithm to adaptively merge the log region with the corresponding set when the set becomes search-intensive. Finally, in order to exploit the internal package-level parallelisms of SSDs, we apply coarse-grained writes for merging or split operations to achieve a high bandwidth. Our experimental results suggest that our proposal is self-adaptive according to the change of access patterns, and outperforms several competitors under various workloads on two commodity SSDs.

【Keywords】: Bandwidth; Flash memories; Indexes; Metadata; Proposals; Resource management; Linear hashing; Self-Adaptive; Solid State Drives

38. On main-memory flushing in microblogs data management systems.

【Paper Link】【Pages】:445-456

【Authors】: Amr Magdy ; Rami Alghamdi ; Mohamed F. Mokbel

【Abstract】: Searching microblogs, e.g., tweets and comments, is practically supported through main-memory indexing for scalable data digestion and efficient query evaluation. With continuity and excessive numbers of microblogs, it is infeasible to keep data in main-memory for long periods. Thus, once allocated memory budget is filled, a portion of data is flushed from memory to disk to continuously accommodate newly incoming data. Existing techniques come with either low memory hit ratio due to flushing items regardless of their relevance to incoming queries or significant overhead of tracking individual data items, which limit scalability of microblogs systems in either cases. In this paper, we propose kFlushing policy that exploits popularity of top-k queries in microblogs to smartly select a subset of microblogs to flush. kFlushing is mainly designed to increase memory hit ratio. To this end, it identifies and flushes in-memory data that does not contribute to incoming queries. The freed memory space is utilized to accumulate more useful data that is used to answer more queries from memory contents. When all memory is utilized for useful data, kFlushing flushes data that is less likely to degrade memory hit ratio. In addition, kFlushing comes with a little overhead that keeps high system scalability in terms of high digestion rates of incoming fast data. Extensive experimental evaluation shows the effectiveness and scalability of kFlushing to improve main-memory hit by 26¿¿¿330% while coping up with fast microblog streams of up to 100K microblog/second.

【Keywords】: Indexes

39. ICE: Managing cold state for big data applications.

【Paper Link】【Pages】:457-468

【Authors】: Badrish Chandramouli ; Justin J. Levandoski ; Eli Cortez

【Abstract】: The use of big data in a business revolves around a monitor-mine-manage (M3) loop: data is monitored in real-time, while mined insights are used to manage the business and derive value. While mining has traditionally been performed offline, recent years have seen an increasing need to perform all phases of M3 in real-time. A stream processing engine (SPE) enables such a seamless M3 loop for applications such as targeted advertising, recommender systems, risk analysis, and call-center analytics. However, these M3 applications require the SPE to maintain massive amounts of state in memory, leading to resource usage skew: memory is scarce and over-utilized, whereas CPU and I/O are under-utilized. In this paper, we propose a novel solution to scaling SPEs for memory-bound M3 applications that leverages natural access skew in data-parallel subqueries, where a small fraction of the state is hot (frequently accessed) and most state is cold (infrequently accessed). We present ICE (incremental coldstate engine), a framework that allows an SPE to seamlessly migrate cold state to secondary storage (disk or flash). ICE uses a novel architecture that exploits the semantics of individual stream operators to efficiently manage cold state in an SPE using an incremental log-structured store. We implemented ICE inside an SPE. Experiments using real data show that ICE can reduce memory usage significantly without sacrificing performance, and can sometimes even improve performance.

【Keywords】: Advertising; Engines; Ice; Memory management; Monitoring; Motion pictures; Real-time systems

40. HAWK: Hardware support for unstructured log processing.

【Paper Link】【Pages】:469-480

【Authors】: Prateek Tandon ; Faissal M. Sleiman ; Michael J. Cafarella ; Thomas F. Wenisch

【Abstract】: Rapidly processing high-velocity text data is critical for many technical and business applications. Widely used software solutions for processing these large text corpora target disk-resident data and rely on pre-computed indexes and large clusters to achieve high performance. However, greater capacity and falling costs are enabling a shift to RAM-resident data sets. The enormous bandwidth of RAM can facilitate scan operations that are competitive with pre-computed indexes for interactive, ad-hoc queries. However, software approaches for processing these large text corpora fall far short of saturating available bandwidth and meeting peak scan rates possible on modern memory systems. In this paper, we present HAWK, a hardware accelerator for ad hoc queries against large in-memory logs. HAWK comprises a stall-free hardware pipeline that scans input data at a constant rate, examining multiple input characters in parallel during a single accelerator clock cycle. We describe a 1GHz 32-characterwide HAWK design targeting ASIC implementation, designed to process data at 32GB/s (up to two orders of magnitude faster than software solutions), and demonstrate a scaled-down FPGA prototype that operates at 100MHz with 4-wide parallelism, which processes at 400MB/s (13¿¿ faster than software grep for large multi-pattern scans).

【Keywords】: Field programmable gate arrays; Indexes; Linux; Pattern matching; Servers; Standards; Testing

Research Session 4C: Data Streams 4

41. Efficient handling of concept drift and concept evolution over Stream Data.

【Paper Link】【Pages】:481-492

【Authors】: Ahsanul Haque ; Latifur Khan ; Michael Baron ; Bhavani M. Thuraisingham ; Charu C. Aggarwal

【Abstract】: To decide if an update to a data stream classifier is necessary, existing sliding window based techniques monitor classifier performance on recent instances. If there is a significant change in classifier performance, these approaches determine a chunk boundary, and update the classifier. However, monitoring classifier performance is costly due to scarcity of labeled data. In our previous work, we presented a semi-supervised framework SAND, which uses change detection on classifier confidence to detect a concept drift. Unlike most approaches, it requires only a limited amount of labeled data to detect chunk boundaries and to update the classifier. However, SAND is expensive in terms of execution time due to exhaustive invocation of the change detection module. In this paper, we present an efficient framework, which is based on the same principle as SAND, but exploits dynamic programming and executes the change detection module selectively. Moreover, we provide theoretical justification of the confidence calculation, and show effect of a concept drift on subsequent confidence scores. Experiment results show efficiency of the proposed framework in terms of both accuracy and execution time.

【Keywords】: Data mining; Data models; Dynamic programming; Electronic mail; Error analysis; Labeling; Training data; Classifier Confidence; Concept Drift; Dynamic Chunk

42. Quality-driven disorder handling for m-way sliding window stream joins.

【Paper Link】【Pages】:493-504

【Authors】: Yuanzhen Ji ; Jun Sun ; Anisoara Nica ; Zbigniew Jerzak ; Gregor Hackenbroich ; Christof Fetzer

【Abstract】: Sliding window join is one of the most important operators for stream applications. To produce high quality join results, a stream processing system must deal with the ubiquitous disorder within input streams which is caused by network delay, parallel processing, etc. Disorder handling involves an inevitable tradeoff between the latency and the quality of produced join results. To meet different requirements of stream applications, it is desirable to provide a user-configurable result-latency vs. result-quality tradeoff. Existing disorder handling approaches either do not provide such configurability, or support only user-specified latency constraints. In this work, we advocate the idea of quality-driven disorder handling, and propose a buffer-based disorder handling approach for sliding window joins, which minimizes sizes of input-sorting buffers, thus the result latency, while respecting user-specified result-quality requirements. The core of our approach is an analytical model which directly captures the relationship between sizes of input buffers and the produced result quality. Our approach is generic. It supports m-way sliding window joins with arbitrary join conditions. Experiments on real-world and synthetic datasets show that, compared to the state of the art, our approach can reduce the result latency incurred by disorder handling by up to 95% while providing the same level of result quality.

【Keywords】: Analytical models; Delays; Out of order; Productivity; Silicon; Sorting; Synchronization

【Paper Link】【Pages】:505-516

【Authors】: Yuchen Li ; Dongxiang Zhang ; Ziquan Lan ; Kian-Lee Tan

【Abstract】: Social media advertising is a multi-billion dollar market and has become the major revenue source for Facebook and Twitter. To deliver ads to potentially interested users, these social network platforms learn a prediction model for each user based on their personal interests. However, as user interests often evolve slowly, the user may end up receiving repetitive ads. In this paper, we propose a context-aware advertising framework that takes into account the relatively static personal interests as well as the dynamic news feed from friends to drive growth in the ad click-through rate. To meet the real-time requirement, we first propose an online retrieval strategy that finds k most relevant ads matching the dynamic context when a read operation is triggered. To avoid frequent retrieval when the context varies little, we propose a safe region method to quickly determine whether the top-k ads of a user are changed. Finally, we propose a hybrid model to combine the merits of both methods by analyzing the dynamism of news feed to determine an appropriate retrieval strategy. Extensive experiments conducted on multiple real social networks and ad datasets verified the efficiency and robustness of our hybrid model.

【Keywords】: Advertising; Context; Databases; Facebook; Feeds; Twitter

44. Tolerating correlated failures in Massively Parallel Stream Processing Engines.

【Paper Link】【Pages】:517-528

【Authors】: Li Su ; Yongluan Zhou

【Abstract】: Fault-tolerance techniques for stream processing engines can be categorized into passive and active approaches. A typical passive approach periodically checkpoints a processing task's runtime states and can recover a failed task by restoring its runtime state using its latest checkpoint. On the other hand, an active approach usually employs backup nodes to run replicated tasks. Upon failure, the active replica can take over the processing of the failed task with minimal latency. However, both approaches have their own inadequacies in Massively Parallel Stream Processing Engines (MPSPE). The passive approach incurs a long recovery latency especially when a number of correlated nodes fail simultaneously, while the active approach requires extra replication resources. In this paper, we propose a new fault-tolerance framework, which is Passive and Partially Active (PPA). In a PPA scheme, the passive approach is applied to all tasks while only a selected set of tasks will be actively replicated. The number of actively replicated tasks depends on the available resources. If tasks without active replicas fail, tentative outputs will be generated before the completion of the recovery process. We also propose effective and efficient algorithms to optimize a partially active replication plan to maximize the quality of tentative outputs. We implemented PPA on top of Storm, an open-source MPSPE and conducted extensive experiments using both real and synthetic datasets to verify the effectiveness of our approach.

【Keywords】: Engines; Fault tolerance; Fault tolerant systems; Semantics; Silicon compounds; Storms; Topology

Research Session 5A: Graph Patterns 4

45. Spatial influence - measuring followship in the real world.

【Paper Link】【Pages】:529-540

【Authors】: Huy Pham ; Cyrus Shahabi

【Abstract】: Finding influential people in a society has been the focus of social studies for decades due to its numerous applications, such as viral marketing or spreading ideas and practices. A critical first step is to quantify the amount of influence an individual exerts on another, termed pairwise influence. Early social studies had to confine themselves to surveys and manual data collections for this purpose; more recent studies have exploited web data (e.g., blogs). In this paper, for the first time, we utilize people's movement in the real world (aka spatiotemporal data) to derive pairwise influence. We first define followship to capture the phenomenon of an individual visiting a real-world location (e.g., restaurant) due the influence of another individual who has visited that same location in the past. Subsequently, we coin the term spatial influence as the concept of inferring pairwise influence from spatiotemporal data by quantifying the amount of followship influence that an individual has on others. We then propose the Temporal and Locational Followship Model (TLFM) to estimate spatial influence, in which we study three factors that impact followship: the time delay between the visits, the popularity of the location, and the inherent coincidences in individuals' visiting behaviors. We conducted extensive experiments using various real-world datasets, which demonstrate the effectiveness of our TLFM model in quantifying spatial influence.

【Keywords】: Blogs; Irrigation; Manuals

46. Durable graph pattern queries on historical graphs.

【Paper Link】【Pages】:541-552

【Authors】: Konstantinos Semertzidis ; Evaggelia Pitoura

【Abstract】: In this paper, we focus on labeled graphs that evolve over time. Given a sequence of graph snapshots representing the state of the graph at different time instants, we seek to find the most durable matches of an input graph pattern query, that is, the matches that exist for the longest period of time. The straightforward way to address this problem is by running a state-of-the-art graph pattern algorithm at each snapshot and aggregating the results. However, for large networks this approach is computationally expensive, since all matches have to be generated at each snapshot, including those appearing only once. We propose a new approach that uses a compact representation of the sequence of graph snapshots, appropriate time indexes to prune the search space and a threshold on the duration of the pattern to determine the search order. We also present experimental results using real datasets that illustrate the efficiency and effectiveness of our approach.

【Keywords】: Computer science; Electronic mail; Evolution (biology); IP networks; Indexes; Pattern matching; Proteins

47. Link prediction in graph streams.

【Paper Link】【Pages】:553-564

【Authors】: Peixiang Zhao ; Charu C. Aggarwal ; Gewen He

【Abstract】: Link prediction is a fundamental problem that aims to estimate the likelihood of the existence of edges (links) based on the current observed structure of a graph, and has found numerous applications in social networks, bioinformatics, E-commerce, and the Web. In many real-world scenarios, however, graphs are massive in size and dynamically evolving in a fast rate, which, without loss of generality, are often modeled and interpreted as graph streams. Existing link prediction methods fail to generalize in the graph stream setting because graph snapshots where link prediction is performed are no longer readily available in memory, or even on disks, for effective graph computation and analysis. It is therefore highly desirable, albeit challenging, to support link prediction online and in a dynamic way, which, in this paper, is referred to as the streaming link prediction problem in graph streams. In this paper, we consider three fundamental, neighborhood-based link prediction target measures, Jaccard coefficient, common neighbor, and Adamic-Adar, and provide accurate estimation to them in order to address the streaming link prediction problem in graph streams. Our main idea is to design cost-effective graph sketches (constant space per vertex) based on MinHash and vertex-biased sampling techniques, and to propose efficient sketch based algorithms (constant time per edge) with both theoretical accuracy guarantee and robust estimation results. We carry out experimental studies in a series of real-world graph streams. The results demonstrate that our graph sketch based methods are accurate, efficient, cost-effective, and thus can be practically employed for link prediction in real-world graph streams.

【Keywords】: Business; Data mining; Estimation; Prediction algorithms; Prediction methods; Robustness; Social network services

48. SimRank computation on uncertain graphs.

【Paper Link】【Pages】:565-576

【Authors】: Rong Zhu ; Zhaonian Zou ; Jianzhong Li

【Abstract】: SimRank is a similarity measure between vertices in a graph, which has become a fundamental technique in graph analytics. Recently, many algorithms have been proposed for efficient evaluation of SimRank similarities. However, the existing SimRank computation algorithms either overlook uncertainty in graph structures or is based on an unreasonable assumption (Du et al). In this paper, we study SimRank similarities on uncertain graphs based on the possible world model of uncertain graphs. Following the random-walk-based formulation of SimRank on deterministic graphs and the possible worlds model of uncertain graphs, we define random walks on uncertain graphs for the first time and show that our definition of random walks satisfies Markov's property. We formulate the SimRank measure based on random walks on uncertain graphs. We discover a critical difference between random walks on uncertain graphs and random walks on deterministic graphs, which makes all existing SimRank computation algorithms on deterministic graphs inapplicable to uncertain graphs. To efficiently compute SimRank similarities, we propose three algorithms, namely the baseline algorithm with high accuracy, the sampling algorithm with high efficiency, and the two-phase algorithm with comparable efficiency as the sampling algorithm and about an order of magnitude smaller relative error than the sampling algorithm. The extensive experiments and case studies verify the effectiveness of our SimRank measure and the efficiency of our SimRank computation algorithms.

【Keywords】: Algorithm design and analysis

Research Session 5B: Parallel and Distributed Big Data Processing 4

49. Input selection for fast feature engineering.

【Paper Link】【Pages】:577-588

【Authors】: Michael R. Anderson ; Michael J. Cafarella

【Abstract】: The application of machine learning to large datasets has become a vital component of many important and sophisticated software systems built today. Such trained systems are often based on supervised learning tasks that require features, signals extracted from the data that distill complicated raw data objects into a small number of salient values. A trained system's success depends substantially on the quality of its features. Unfortunately, feature engineering¿¿¿the process of writing code that takes raw data objects as input and outputs feature vectors suitable for a machine learning algorithm¿¿¿is a tedious, time-consuming experience. Because ¿¿¿big data¿¿¿ inputs are so diverse, feature engineering is often a trial-and-error process requiring many small, iterative code changes. Because the inputs are so large, each code change can involve a time-consuming data processing task (over each page in a Web crawl, for example). We introduce Zombie, a data-centric system that accelerates feature engineering through intelligent input selection, optimizing the ¿¿¿inner loop¿¿¿ of the feature engineering process. Our system yields feature evaluation speedups of up to 8¿¿ in some cases and reduces engineer wait times from 8 to 5 hours in others.

【Keywords】: Data mining; Data processing; Feature extraction; Indexes; Learning systems; Monitoring; Training

50. When two choices are not enough: Balancing at scale in Distributed Stream Processing.

【Paper Link】【Pages】:589-600

【Authors】: Muhammad Anis Uddin Nasir ; Gianmarco De Francisci Morales ; Nicolas Kourtellis ; Marco Serafini

【Abstract】: Carefully balancing load in distributed stream processing systems has a fundamental impact on execution latency and throughput. Load balancing is challenging because real-world workloads are skewed: some tuples in the stream are associated to keys which are significantly more frequent than others. Skew is remarkably more problematic in large deployments: having more workers implies fewer keys per worker, so it becomes harder to ¿¿¿average out¿¿¿ the cost of hot keys with cold keys. We propose a novel load balancing technique that uses a heavy hitter algorithm to efficiently identify the hottest keys in the stream. These hot keys are assigned to d ¿¿¿ 2 choices to ensure a balanced load, where d is tuned automatically to minimize the memory and computation cost of operator replication. The technique works online and does not require the use of routing tables. Our extensive evaluation shows that our technique can balance real-world workloads on large deployments, and improve throughput and latency by 150% and 60% respectively over the previous state-of-the-art when deployed on Apache Storm.

【Keywords】: Clustering algorithms; Digital signal processing; Engines; Load management; Scalability; Storms; Throughput

51. HadoopViz: A MapReduce framework for extensible visualization of big spatial data.

【Paper Link】【Pages】:601-612

【Authors】: Ahmed Eldawy ; Mohamed F. Mokbel ; Christopher Jonathan

【Abstract】: This paper introduces HadoopViz; a MapReduce-based framework for visualizing big spatial data. HadoopViz has three unique features that distinguish it from other techniques. (1) It exposes an extensible interface which allows users to define a new visualization types, e.g., scatter plot, road network, or heat map, by defining five abstract functions, without delving into the implementation details of the MapReduce algorithms. As it is open source, HadoopViz allows algorithm designers to focus on how the data should be visualized rather than performance or scalability issues. (2) HadoopViz is capable of generating big images with giga-pixel resolution by employing a three-phase technique, partition-plot-merge. (3) HadoopViz provides a smoothing functionality which can fuse nearby records together as the image is plotted. This makes it capable of generating more types of images with high quality as compared to existing work. Experimental results on real datasets of up to 14 Billion points show the extensibility, scalability, and efficiency of HadoopViz to handle different visualization types of spatial big data.

【Keywords】: Algorithm design and analysis; Brain modeling; Heating; Neurons; Partitioning algorithms; Smoothing methods; Spatial databases

52. Efficient fault-tolerance for iterative graph processing on distributed dataflow systems.

【Paper Link】【Pages】:613-624

【Authors】: Chen Xu ; Markus Holzemer ; Manohar Kaul ; Volker Markl

【Abstract】: Real-world graph processing applications often require combining the graph data with tabular data. Moreover, graph processing usually is part of a larger analytics workflow consiting of data preparation, analysis and model building, and model application. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically. This holistic view enables these systems to reason about and automatically optimize the processing. Most big graph processing algorithms are iterative and incur a long runtime, as they require multiple passes over the data until convergence. Thus, fault tolerance and quick recovery from any intermittent failure at any step of the workflow are crucial for effective and efficient analysis. In this work, we propose a novel fault-tolerance mechanism for iterative graph processing on distributed data-flow systems with the objective to reduce the checkpointing cost and failure recovery time. Rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner, without breaking pipelined tasks. In contrast to the typical unblocking checkpointing approaches (i.e., managing checkpoints independently for immutable datasets), we inject the checkpoints of mutable datasets into the iterative dataflow itself. Hence, our mechanism is iteration-aware by design. This simplifies the system architecture and facilitates coordinating the checkpoint creation during iterative graph processing. We achieve speedier recovery, i.e., confined recovery, by using the local log files on each node to avoid a complete re-computation from scratch. Our theoretical studies as well as our experimental analysis on Flink give further insight into our fault-tolerance strategies and show that they are more efficient than blocking checkpointing and complete recovery for iterative graph processing on dataflow systems.

【Keywords】: Analytical models; Checkpointing; Data models; Fault tolerance; Fault tolerant systems; Sparks; Writing

Research Session 5C: Clustering 4

53. A model-based approach for text clustering with outlier detection.

【Paper Link】【Pages】:625-636

【Authors】: Jianhua Yin ; Jianyong Wang

【Abstract】: Text clustering is a challenging problem due to the high-dimensional and large-volume characteristics of text datasets. In this paper, we propose a collapsed Gibbs Sampling algorithm for the Dirichlet Process Multinomial Mixture model for text clustering (abbr. to GSDPMM) which does not need to specify the number of clusters in advance and can cope with the high-dimensional problem of text clustering. Our extensive experimental study shows that GSDPMM can achieve significantly better performance than three other clustering methods and can achieve high consistency on both long and short text datasets. We found that GSDPMM has low time and space complexity and can scale well with huge text datasets. We also propose some novel and effective methods to detect the outliers in the dataset and obtain the representative words of each cluster.

【Keywords】: Bayes methods; Clustering algorithms; Clustering methods; Complexity theory; Data models; Mathematical model; Mixture models

54. Streaming spectral clustering.

【Paper Link】【Pages】:637-648

【Authors】: Shinjae Yoo ; Hao Huang ; Shiva Prasad Kasiviswanathan

【Abstract】: Clustering is a classical data mining task used for discovering interrelated pattern of similarities in the data. In many modern day domains, data is getting continuously generated as a stream. For scalability reasons, clustering the points in a data stream requires designing single pass, limited memory streaming clustering algorithms. However, the performance of the known streaming clustering algorithms, that typically use K-means (or its variants) on the original feature space, tend to suffer when the feature space is high-dimensional. To overcome this problem, we propose a streaming spectral clustering algorithm. Our algorithm maintains an approximation of the normalized Laplacian of the data stream over time and efficiently updates the changing eigenvectors of this Laplacian in a streaming fashion. It requires just one pass over the data, consumes limited memory, and is stable to the ordering of the data stream. We provide a theoretical analysis of our streaming spectral clustering algorithm and our experimental results show that while gaining in scalability, its performance compares well with other known batch/streaming clustering approaches.

【Keywords】: Algorithm design and analysis; Approximation algorithms; Clustering algorithms; Kernel; Laplace equations; Streaming media; Symmetric matrices

55. Accelerating large scale centroid-based clustering with locality sensitive hashing.

【Paper Link】【Pages】:649-660

【Authors】: Ryan McConville ; Xin Cao ; Weiru Liu ; Paul C. Miller

【Abstract】: Most traditional data mining algorithms struggle to cope with the sheer scale of data efficiently. In this paper, we propose a general framework to accelerate existing algorithms to cluster large-scale datasets which contain large numbers of attributes, items, and clusters. Our framework makes use of locality sensitive hashing to significantly reduce the cluster search space. We also theoretically prove that our framework has a guaranteed error bound in terms of the clustering quality. This framework can be applied to a set of centroid-based clustering algorithms that assign an object to the most similar cluster, and we adopt the popular K-Modes categorical clustering algorithm to present how the framework can be applied. We validated our framework with five synthetic datasets and a real world Yahoo! Answers dataset. The experimental results demonstrate that our framework is able to speed up the existing clustering algorithm between factors of 2 and 6, while maintaining comparable cluster purity.

【Keywords】: Algorithm design and analysis; Clustering algorithms

56. PurTreeClust: A purchase tree clustering algorithm for large-scale customer transaction data.

【Paper Link】【Pages】:661-672

【Authors】: Xiaojun Chen ; Joshua Zhexue Huang ; Jun Luo

【Abstract】: Clustering of customer transaction data is usually an important procedure to analyze customer behaviors in retail and e-commerce companies. Note that products from companies are often organized as a product tree, in which the leaf nodes are goods to sell, and the internal nodes (except root node) could be multiple product categories. Based on this tree, we present to use a ¿¿¿personalized product tree¿¿¿, called purchase tree, to represent a customer's transaction data. The customer transaction data set can be represented as a set of purchase trees. We propose a PurTreeClust algorithm for clustering of large-scale customers from purchase trees. We define a new distance metric to effectively compute the distance between two purchase trees from the entire levels in the tree. A cover tree is then built for indexing the purchase tree data and we propose a leveled density estimation method for selecting initial cluster centers from a cover tree. PurTreeClust, a fast clustering method for clustering of large-scale purchase trees, is then presented. Last, we propose a gap statistic based method for estimating the number of clusters from the purchase tree clustering results. A series of experiments were conducted on ten large-scale transaction data sets which contain up to four million transaction records, and experimental results have verified the effectiveness and efficiency of the proposed method. We also compared our method with three clustering algorithms, e.g., spectral clustering, hierarchical agglomerative clustering and DBSCAN. The experimental results have demonstrated the superior performance of the proposed method.

【Keywords】: Measurement

Research Session 6A: Spatial Analytics 4

57. TRANSFORMERS: Robust spatial joins on non-uniform data distributions.

【Paper Link】【Pages】:673-684

【Authors】: Mirjana Pavlovic ; Thomas Heinis ; Farhan Tauheed ; Panagiotis Karras ; Anastasia Ailamaki

【Abstract】: Spatial joins are becoming increasingly ubiquitous in many applications, particularly in the scientific domain. While several approaches have been proposed for joining spatial datasets, each of them has a strength for a particular type of density ratio among the joined datasets. More generally, no single proposed method can efficiently join two spatial datasets in a robust manner with respect to their data distributions. Some approaches do well for datasets with contrasting densities while others do better with similar densities. None of them does well when the datasets have locally divergent data distributions. In this paper we develop TRANSFORMERS, an efficient and robust spatial join approach that is indifferent to such variations of distribution among the joined data. TRANSFORMERS achieves this feat by departing from the state-of-the-art through adapting the join strategy and data layout to local density variations among the joined data. It employs a join method based on data-oriented partitioning when joining areas of substantially different local densities, whereas it uses big partitions (as in space-oriented partitioning) when the densities are similar, while seamlessly switching among these two strategies at runtime. We experimentally demonstrate that TRANSFORMERS outperforms state-of-the-art approaches by a factor of between 2 and 8.

【Keywords】: Axons; Layout; Robustness; Runtime; Spatial databases; Switches; Time measurement

58. Finding the minimum spatial keyword cover.

【Paper Link】【Pages】:685-696

【Authors】: Dong-Wan Choi ; Jian Pei ; Xuemin Lin

【Abstract】: The existing works on spatial keyword search focus on finding a group of spatial objects covering all the query keywords and minimizing the diameter of the group. However, we observe that such a formulation may not address what users need in some application scenarios. In this paper, we introduce a novel spatial keyword cover problem (SK-COVER for short), which aims to identify the group of spatio-textual objects covering all keywords in a query and minimizing a distance cost function that leads to fewer proximate objects in the answer set. We prove that SK-COVER is not only NP-hard but also does not allow an approximation better than O(logm) in polynomial time, where m is the number of query keywords. We establish an O(logm)-approximation algorithm, which is asymptotically optimal in terms of the approximability of SK-COVER. Furthermore, we devise effective accessing strategies and pruning rules to improve the overall efficiency and scalability. In addition to our algorithmic results, we empirically show that our approximation algorithm always achieves the best accuracy, and the efficiency of our algorithm is comparable to a state-of-the-art algorithm that is intended for mCK, a problem similar to yet theoretically easier than SK-COVER.

【Keywords】: Algorithm design and analysis; Approximation algorithms; Cost function; Keyword search; Planning; Scalability; Upper bound

59. Answering why-not spatial keyword top-k queries via keyword adaption.

【Paper Link】【Pages】:697-708

【Authors】: Lei Chen ; Jianliang Xu ; Xin Lin ; Christian S. Jensen ; Haibo Hu

【Abstract】: Web objects, often associated with descriptive text documents, are increasingly being geo-tagged. A spatial keyword top-k query retrieves the best k such objects according to a scoring function that considers both spatial distance and textual similarity. However, it is in some cases difficult for users to identify the exact keywords that describe their query intent. After a user issues an initial query and gets back the result, the user may find that some expected objects are missing and may wonder why. Answering the resulting why-not questions can aid users in retrieving better results. However, no existing techniques are able to answer why-not questions by adapting the query keywords. We propose techniques capable of adapting an initial set of query keywords so that expected, but missing, objects enter the result along with other relevant objects. We develop a basic algorithm with a set of optimizations that sequentially examines a sequence of candidate keyword sets. In addition, we present an index-based bound-and-prune algorithm that is able to determine the best sample out of a set of candidates in just one pass of index traversal, thus speeding up the query processing. We also extend the proposed algorithms to handle multiple missing objects. Extensive experimental results offer insight into the efficiency of the proposed techniques in terms of running time and I/O cost.

【Keywords】: Approximation algorithms; Computer science; Indexes; Optimization; Query processing; Spatial databases

60. Influence based cost optimization on user preference.

【Paper Link】【Pages】:709-720

【Authors】: Jianye Yang ; Ying Zhang ; Wenjie Zhang ; Xuemin Lin

【Abstract】: The popularity of e-business and preference learning techniques have contributed a huge amount of product and user preference data. Analyzing the influence of an existing or new product among the users is critical to unlock the great scientific and social-economic value of these data. In this paper, we advocate the problem of influence-based cost optimization for the user preference and product data, which is fundamental in many real applications such as marketing and advertising. Generally, we aim to find a cost optimal position for a new product such that it can attract at least k or a particular percentage of users for the given user preference functions and competitors' products. Although we show the solution space of our problem can be reduced to a finite number of possible positions (points) by utilizing the classical k-level computation techniques, the computation cost is still very expensive due to the nature of the high combinatorial complexity of the k-level problem. To alleviate this issue, we develop efficient pruning and query processing techniques to significantly improve the performance. In particular, our traverse-based 2-dimensional algorithm is very efficient with time complexity O(n) where n is the number of user preference functions. For general multi-dimensional spaces, we develop space partition based algorithm to significantly improve the performance by utilizing cost-based, influence-based and local dominance based pruning techniques. Then, we show that the performance of the partition based algorithm can be further enhanced by utilizing sampling approach, where the problem can be reduced to the classical half-space intersection problem. We demonstrate the efficiency of our techniques with extensive experiments over real and synthetic datasets.

【Keywords】: Cost function; Object recognition; Partitioning algorithms; Portable computers; Space heating; Time complexity

Research Session 6B: Analytics on Big Data 4

61. The CRO kernel: Using Concomitant Rank Order hashes for sparse high dimensional randomized feature maps.

【Paper Link】【Pages】:721-730

【Authors】: Kave Eshghi ; Mehran Kafai

【Abstract】: Kernel methods have been shown to be effective for many machine learning tasks such as classification, clustering and regression. In particular, support vector machines with the RBF kernel have proved to be powerful classification tools. The standard way to apply kernel methods is to use the ¿¿¿kernel trick¿¿¿, where the inner product of the vectors in the feature space is computed via the kernel function. Using the kernel trick for SVMs, however, leads to training that is quadratic in the number of input vectors and classification that is linear with the number of support vectors. We introduce a new kernel, the CRO (Concomitant Rank Order) kernel that approximates the RBF kernel for unit length input vectors. We also introduce a new randomized feature map, based on concomitant rank order hashing, that produces sparse, high dimensional feature vectors whose inner product asymptotically equals the CRO kernel. Using the Hadamard transform for computing the CRO hashes ensures that the cost of computing feature vectors is very low. Thus, for unit length input vectors, we get the accuracy of the RBF kernel with the efficiency of a sparse high dimensional linear kernel. We show the efficacy of our approach using a number of standard datasets.

【Keywords】: Approximation algorithms; Gaussian distribution; Kernel; Standards; Support vector machines; Training; Transforms

62. MuVE: Efficient Multi-Objective View Recommendation for Visual Data Exploration.

【Paper Link】【Pages】:731-742

【Authors】: Humaira Ehsan ; Mohamed A. Sharaf ; Panos K. Chrysanthis

【Abstract】: To support effective data exploration, there is a well-recognized need for solutions that can automatically recommend interesting visualizations, which reveal useful insights into the analyzed data. However, such visualizations come at the expense of high data processing costs, where a large number of views are generated to evaluate their usefulness. Those costs are further escalated in the presence of numerical dimensional attributes, due to the potentially large number of possible binning aggregations, which lead to a drastic increase in the number of possible visualizations. To address that challenge, in this paper we propose the MuVE scheme for Multi-Objective View Recommendation for Visual Data Exploration. MuVE introduces a hybrid multi-objective utility function, which captures the impact of binning on the utility of visualizations. Consequently, novel algorithms are proposed for the efficient recommendation of data visualizations that are based on numerical dimensions. The main idea underlying MuVE is to incrementally and progressively assess the different benefits provided by a visualization, which allows an early pruning of a large number of unnecessary operations. Our extensive experimental results show the significant gains provided by our proposed scheme.

【Keywords】: Aggregates; Atmospheric measurements; Data visualization; Databases; Particle measurements; Visualization

63. Collaborative analytics for data silos.

【Paper Link】【Pages】:743-754

【Authors】: Jinkyu Kim ; Heonseok Ha ; Byung-Gon Chun ; Sungroh Yoon ; Sang K. Cha

【Abstract】: As a great deal of data has been accumulated in various disciplines, the need for the integrative analysis of separate but relevant data sources is becoming more important. Combining data sources can provide global insight that is otherwise difficult to obtain from individual sources. Because of privacy, regulations, and other issues, many large-scale data repositories remain closed off from the outside, raising what has been termed the data silo issue. The huge volume of today's big data often leads to computational challenges, adding another layer of complexity to the solution. In this paper, we propose a novel method called collaborative analytics by ensemble learning (CABEL), which attempts to resolve the main hurdles regarding the silo issue: accuracy, privacy, and computational efficiency. CABEL represents the data stored in each silo as a compact aggregate of samples called the silo signature. The compact representation provides computational efficiency and privacy preservation but makes it challenging to produce accurate analytics. To resolve this challenge, we formulate the problem of attribute domain sampling and reconstruction, and propose a solution called the Chebyshev subset. To model collaborative efforts to analyze semantically linked but structurally disconnected databases, CABEL utilizes a new ensemble learning technique termed the weighted bagging of base classifiers. We demonstrate the effectiveness of CABEL by testing with a nationwide health-insurance data set containing approximately 4,182,000,000 records collected from the entire population of an Organisation for Economic Co-operation and Development (OECD) country in 2012. In our binary classification tests, CABEL achieved median recall, precision, and F-measure values of 89%, 64%, and 76%, respectively, although only 0.001¿¿¿0.00001% of the original data was used for model construction, while maintaining data privacy and computational efficiency.

【Keywords】: Bagging; Collaboration; Data models; Data privacy; Privacy; Protocols

64. Visualization-aware sampling for very large databases.

【Paper Link】【Pages】:755-766

【Authors】: Yongjoo Park ; Michael J. Cafarella ; Barzan Mozafari

【Abstract】: Interactive visualizations are crucial in ad hoc data exploration and analysis. However, with the growing number of massive datasets, generating visualizations in interactive timescales is increasingly challenging. One approach for improving the speed of the visualization tool is via data reduction in order to reduce the computational overhead, but at a potential cost in visualization accuracy. Common data reduction techniques, such as uniform and stratified sampling, do not exploit the fact that the sampled tuples will be transformed into a visualization for human consumption. We propose a visualization-aware sampling (VAS) that guarantees high quality visualizations with a small subset of the entire dataset. We validate our method when applied to scatter and map plots for three common visualization goals: regression, density estimation, and clustering. The key to our sampling method's success is in choosing a set of tuples that minimizes a visualization-inspired loss function. While existing sampling approaches minimize the error of aggregation queries, we focus on a loss function that maximizes the visual fidelity of scatter plots. Our user study confirms that our proposed loss function correlates strongly with user success in using the resulting visualizations. Our experiments show that (i) VAS improves user's success by up to 35% in various visualization tasks, and (ii) VAS can achieve a required visualization quality up to 400¿¿ faster.

【Keywords】: Data visualization; Estimation; Human computer interaction; Query processing; Software architecture; Visual databases

Research Session 6C: Uncertain and Probabilistic Data 4

65. Answering why-not questions on metric probabilistic range queries.

【Paper Link】【Pages】:767-778

【Authors】: Lu Chen ; Yunjun Gao ; Kai Wang ; Christian S. Jensen ; Gang Chen

【Abstract】: Metric probabilistic range queries (MPRQ) have received substantial attention due to their utility in multimedia and text retrieval, decision making, etc. Existing MPRQ studies generally aim to improve query efficiency and resource usage. In contrast, we define and offer solutions to why-not questions on MPRQ. Given an original metric probabilistic range query and a why-not set W of uncertain objects that are absent from the query result, a why-not question on MPRQ explains why the uncertain objects in W do not appear in the query result, and provides refinements of the original query and/or W with the minimal penalty, so that the uncertain objects in W appear in the result of the refined query. Specifically, we propose a framework that consists of three efficient solutions, one that modifies the original query, one that modifies the why-not set, and one that modifies both the original query and the why-not set. Extensive experiments using both real and synthetic data sets offer insights into the properties of the proposed algorithms, and show that they are effective and efficient.

【Keywords】: Computers; Databases; Extraterrestrial measurements; Multimedia communication; Probabilistic logic; Uncertainty

66. Analyzing data-centric applications: Why, what-if, and how-to.

【Paper Link】【Pages】:779-790

【Authors】: Pierre Bourhis ; Daniel Deutch ; Yuval Moskovitch

【Abstract】: We consider in this paper the analysis of complex applications that query and update an underlying database in their operation. We focus on three classes of analytical questions that are important for application owners and users alike: Why was a result generated? What would be the result if the application logic or database is modified in a particular way? How can one interact with the application to achieve a particular goal? Answering these questions efficiently is a fundamental step towards optimizing the application and its use. Noting that provenance was a key component in answering similar questions in the context of database queries, we develop a provenance-based model and efficient algorithms for these problems in the context of data-centric applications. Novel challenges here include the dynamic update of data, combined with the possibly complex workflows allowed by applications. We nevertheless achieve theoretical guarantees for the algorithms performance, and experimentally show their efficiency and usefulness, even in presence of complex applications and large-scale data.

【Keywords】: Complexity theory; Computational modeling; Context; Databases; Heuristic algorithms; Navigation; Standards

67. CLEAR: Clustering based on locality embedding and reconstruction.

【Paper Link】【Pages】:791-798

【Authors】: Zhonglong Zheng ; Minqi Mao ; Songxia Ma

【Abstract】: Spectral clustering (SC) is a fundamental technique in the field of data mining and information processing. The similarity matrix of the input data set plays a vital role in SC methods. Since the similarity matrix computation is independent of the clustering procedure, the final results may be suboptimal if the learned similarity is not optimal in SC methods. In this paper, we proposed a novel data Clustering based on Locality Embedding And Reconstruction, CLEAR for short. The proposed CLEAR model combines the graph similarity matrix computation and data clustering procedure together by assigning adaptive weights to the local neighbors of each data point based on locally linear embedding and reconstruction. It is worth noting that CLEAR imposes a rank constraint to the graph Laplacian matrix of the similarity matrix, which leads to a favorable number of graph components in the final clustering. We formulate the proposed CLEAR model in a efficient way which can be solved easily, and show the theoretical relationship to K-means and SC methods. Extensive results on both synthetic data and real-world data show the effectiveness of our method.

【Keywords】: Clustering algorithms; Image reconstruction; Laplace equations; Mathematical model; Optimization; Signal processing algorithms; Silicon

68. OLAP over probabilistic data cubes I: Aggregating, materializing, and querying.

【Paper Link】【Pages】:799-810

【Authors】: Xike Xie ; Xingjun Hao ; Torben Bach Pedersen ; Peiquan Jin ; Jinchuan Chen

【Abstract】: On-Line Analytical Processing (OLAP) enables powerful analytics by quickly computing aggregate values of numerical measures over multiple hierarchical dimensions for massive datasets. However, many types of source data, e.g., from GPS, sensors, and other measurement devices, are intrinsically inaccurate (imprecise and/or uncertain) and thus OLAP cannot be readily applied. In this paper, we address the resulting data veracity problem in OLAP by proposing the concept of probabilistic data cubes. Such a cube is comprised of a set of probabilistic cuboids which summarize the aggregated values in the form of probability mass functions (pmfs in short) and thus offer insights into the underlying data quality and enable confidence-aware query evaluation and analysis. However, the probabilistic nature of data poses computational challenges as even simple operations are #P-hard under the possible world semantics. Even worse, it is hard to share computations among different cuboids, as aggregation functions that are distributive for traditional data cubes, e.g., SUM and COUNT, become holistic in probabilistic settings. In this paper, we propose a complete set of techniques for probabilistic data cubes, from cuboid aggregation, over cube materialization, to query evaluation. For aggregation, we focus on how to maximize the sharing of computation among cells and cuboids. We present two aggregation methods: convolution and sketch-based. The two methods scale down the time complexities of building a probabilistic cuboid to polynomial and linear, respectively. Each of the two supports both full and partial data cube materialization. Then, we devise a cost model which guides the aggregation methods to be deployed and combined during the cube materialization. We further provide algorithms for probabilistic slicing and dicing queries on the data cube. Extensive experiments over real and synthetic datasets are conducted to show that the techniques are effective and scalable.

【Keywords】: Aggregates; Lattices; Probabilistic logic; Query processing; Semantics; Sensors

Research Session 7A: Scalable Matrix-Based Analytics 4

69. SCouT: Scalable coupled matrix-tensor factorization - algorithm and discoveries.

【Paper Link】【Pages】:811-822

【Authors】: ByungSoo Jeon ; Inah Jeon ; Lee Sael ; U. Kang

【Abstract】: How can we analyze very large real-world tensors where additional information is coupled with certain modes of tensors? Coupled matrix-tensor factorization is a useful tool to simultaneously analyze matrices and a tensor, and has been used for important applications including collaborative filtering, multi-way clustering, and link prediction. However, existing single machine or distributed algorithms for coupled matrix-tensor factorization do not scale for tensors with billions of elements in each mode. In this paper, we propose SCOUT, a large-scale coupled matrix-tensor factorization algorithm running on the distributed MAPREDUCE platform. By carefully reorganizing operations, and reusing intermediate data, SCOUT decomposes up to 100¿¿ larger tensors than existing methods, and shows linear scalability for order and machines while other methods are limited in scalability. We also apply SCOUT on real world tensors and discover interesting hidden patterns like seasonal spike, and steady attentions for healthy food on Yelp dataset containing user-business-yearmonth tensor and two coupled matrices.

【Keywords】: Algorithm design and analysis; Error correction; Error correction codes; Matrix converters; Matrix decomposition; Scalability; Tensile stress

70. Topology-aware optimization of big sparse matrices and matrix multiplications on main-memory systems.

【Paper Link】【Pages】:823-834

【Authors】: David Kernert ; Wolfgang Lehner ; Frank Köhler

【Abstract】: Since data sizes of analytical applications are continuously growing, many data scientists are switching from customized micro-solutions to scalable alternatives, such as statistical and scientific databases. However, many algorithms in data mining and science are expressed in terms of linear algebra, which is barely supported by major database vendors and big data solutions. On the other side, conventional linear algebra algorithms and legacy matrix representations are often not suitable for very large matrices. We propose a strategy for large matrix processing on modern multicore systems that is based on a novel, adaptive tile matrix representation (AT MATRIX). Our solution utilizes multiple techniques inspired from database technology, such as multidimensional data partitioning, cardinality estimation, indexing, dynamic rewrites, and many more in order to optimize the execution time. Based thereon we present a matrix multiplication operator ATMULT, which outperforms alternative approaches. The aim of our solution is to overcome the burden for data scientists of selecting appropriate algorithms and matrix storage representations. We evaluated AT MATRIX together with ATMULT on several real-world and synthetic random matrices.

【Keywords】: Data structures; Databases; Linear algebra; Matrix converters; Optimization; Partitioning algorithms; Sparse matrices

71. 2PCP: Two-phase CP decomposition for billion-scale dense tensors.

【Paper Link】【Pages】:835-846

【Authors】: Xinsheng Li ; Shengyu Huang ; K. Selçuk Candan ; Maria Luisa Sapino

【Abstract】: Tensors are multi-dimensional arrays - consequently, tensor decomposition operations (CP and Tucker) are the bases for many high-dimensional data analysis tasks, from clustering, trend detection, anomaly detection, to correlation analysis in various application domains, including science and engineering1. One key problem with tensor decomposition is its computational complexity and space requirements. Especially, as the relevant data sets get denser, in-memory schemes for tensor decomposition become increasingly ineffective; therefore out-of-core (secondary-memory supported, potentially parallel) computing is necessitated. However, existing techniques do not consider the I/O and network data exchange costs that out-of-core execution of the tensor decomposition operation will incur. In this paper, we note that when this operation is implemented with the help of secondary-memory and/or multiple servers to tackle the memory limitations, we would need intelligent buffer-management and task-scheduling techniques which take into account the cost of bringing the relevant blocks into the buffer to minimize I/O in the system. In this paper, we introduce 2PCP, a two-phase, block-based CP decomposition system with intelligent buffer sensitive task scheduling and buffer management mechanisms. 2PCP aims to reduce I/O costs in the analysis of relatively dense tensors common in scientific and engineering applications. Experiment results compare with current state of art tensor decomposition algorithms and show that our algorithms can significantly reduce the amount of I/O and execution time while maintaining decomposition accuracy.

【Keywords】: Algorithm design and analysis; Arrays; Data analysis; Matrix decomposition; Optimization; Servers; Tensile stress

72. Distributed low rank approximation of implicit functions of a matrix.

【Paper Link】【Pages】:847-858

【Authors】: David P. Woodruff ; Peilin Zhong

【Abstract】: We study distributed low rank approximation in which the matrix to be approximated is only implicitly represented across the different servers. For example, each of s servers may have an n¿¿d matrix At, and we may be interested in computing a low rank approximation to A = ¿¿(¿¿t=1sAt), where ¿¿ is a function which is applied entrywise to the matrix ¿¿t=1sAt. We show for a wide class of functions ¿¿ it is possible to efficiently compute a d ¿¿ d rank-k projection matrix P for which ¿¿¿A ¿¿¿ AP¿¿¿F2 ¿¿¿ ¿¿¿A ¿¿¿ [A]k¿¿¿F2 + ¿¿ ¿¿¿A¿¿¿F2, where AP denotes the projection of A onto the row span of P, and [A]k denotes the best rank-k approximation to A given by the singular value decomposition. The communication cost of our protocols is d¿¿(sk=¿¿)O(1), and they succeed with high probability. Our framework allows us to efficiently compute a low rank approximation to an entry-wise softmax, to a Gaussian kernel expansion, and to M-Estimators applied entrywise (i.e., forms of robust low rank approximation). We also show that our additive error approximation is best possible, in the sense that any protocol achieving relative error for these problems requires significantly more communication. Finally, we experimentally validate our algorithms on real datasets.

【Keywords】: Additives; Approximation algorithms; Computational modeling; Partitioning algorithms; Principal component analysis; Protocols; Servers

Research Session 7B: Trajectories and Roads 3

73. Fuzzy trajectory linking.

【Paper Link】【Pages】:859-870

【Authors】: Huayu Wu ; Mingqiang Xue ; Jianneng Cao ; Panagiotis Karras ; Wee Siong Ng ; Kee Kiat Koo

【Abstract】: Today, people can access various services with smart carry-on devices, e.g., surf the web with smart phones, make payments with credit cards, or ride a bus with commuting cards. In addition to the offered convenience, the access of such services can reveal their traveled trajectory to service providers. Very often, a user who has signed up for multiple services may expose her trajectory to more than one service providers. This state of affairs raises a privacy concern, but also an opportunity. On one hand, several colluding service providers, or a government agency that collects information from such service providers, may identify and reconstruct users' trajectories to an extent that can be threatening to personal privacy. On the other hand, the processing of such rich data may allow for the development of better services for the common good. In this paper, we take a neutral standpoint and investigate the potential for trajectories accumulated from different sources to be linked so as to reconstruct a larger trajectory of a single person. We develop a methodology, called fuzzy trajectory linking (FTL) that achieves this goal, and two instantiations thereof, one based on hypothesis testing and one on Na¿¿ve-Bayes. We provide a theoretical analysis for factors that affect FTL and use two real datasets to demonstrate that our algorithms effectively achieve their goals.

【Keywords】: Biological system modeling; Buildings; Couplings; Mobile communication

74. Keyword-aware continuous kNN query on road networks.

【Paper Link】【Pages】:871-882

【Authors】: Bolong Zheng ; Kai Zheng ; Xiaokui Xiao ; Han Su ; Hongzhi Yin ; Xiaofang Zhou ; Guohui Li

【Abstract】: It is nowadays quite common for road networks to have textual contents on the vertices, which describe auxiliary information (e.g., business, traffic, etc.) associated with the vertex. In such road networks, which are modelled as weighted undirected graphs, each vertex is associated with one or more keywords, and each edge is assigned with a weight, which can be its physical length or travelling time. In this paper, we study the problem of keyword-aware continuous k nearest neighbour (KCkNN) search on road networks, which computes the k nearest vertices that contain the query keywords issued by a moving object and maintains the results continuously as the object is moving on the road network. Reducing the query processing costs in terms of computation and communication has attracted considerable attention in the database community with interesting techniques proposed. This paper proposes a framework, called a Labelling AppRoach for Continuous kNN query (LARC), on road networks to cope with KCkNN query efficiently. First we build a pivot-based reverse label index and a keyword-based pivot tree index to improve the efficiency of keyword-aware k nearest neighbour (KkNN) search by avoiding massive network traversals and sequential probe of keywords. To reduce the frequency of unnecessary result updates, we develop the concepts of dominance interval and region on road network, which share the similar intuition with safe region for processing continuous queries in Euclidean space but are more complicated and thus require more dedicated design. For high frequency keywords, we resolve the dominance interval when the query results changed. In addition, a path-based dominance updating approach is proposed to compute the dominance region efficiently when the query keywords are of low frequency. We conduct extensive experiments by comparing our algorithms with the state-of-the-art methods on real data sets. The empirical observations have verified the superiority of our propos- d solution in all aspects of index size, communication cost and computation time.

【Keywords】: Algorithm design and analysis; Indexes; Keyword search; Probes; Query processing; Roads; Search problems

75. Crowdsourcing-based real-time urban traffic speed estimation: From trends to speeds.

【Paper Link】【Pages】:883-894

【Authors】: Huiqi Hu ; Guoliang Li ; Zhifeng Bao ; Yan Cui ; Jianhua Feng

【Abstract】: Real-time urban traffic speed estimation provides significant benefits in many real-world applications. However, existing traffic information acquisition systems only obtain coarse-grained traffic information on a small number of roads but cannot acquire fine-grained traffic information on every road. To address this problem, in this paper we study the traffic speed estimation problem, which, given a budget K, identifies K roads (called seeds) where the real traffic speeds on these seeds can be obtained using crowdsourcing, and infers the speeds of other roads (called non-seed roads) based on the speeds of these seeds. This problem includes two sub-problems: (1) Speed Inference - How to accurately infer the speeds of the non-seed roads; (2) Seed Selection - How to effectively select high-quality seeds. It is rather challenging to estimate the traffic speed accurately, because the traffic changes dynamically and the changes are hard to be predicted as many possible factors can affect the traffic. To address these challenges, we propose effective algorithms to judiciously select high-quality seeds and devise inference models to infer the speeds of the non-seed roads. On the one hand, we observe that roads have correlations and correlated roads have similar traffic trend: the speeds of correlated roads rise or fall compared with their historical average speed simultaneously. We utilize this property and propose a two-step model to estimate the traffic speed. The first step adopts a graphical model to infer the traffic trend and the second step devises a hierarchical linear model to estimate the traffic speed based on the traffic trend. On the other hand, we formulate the seed selection problem, prove that it is NP-hard, and propose several greedy algorithms with approximation guarantees. Experimental results on two large real datasets show that our method outperforms baselines by 2 orders of magnitude in efficiency and 40% in estimation accuracy.

【Keywords】: Correlation; Crowdsourcing; Estimation; Hidden Markov models; Market research; Real-time systems; Roads

Research Session 8A: Data Explorations and Event Analytics 4

76. Learning abstract snippet detectors with Temporal embedding in convolutional neural Networks.

【Paper Link】【Pages】:895-905

【Authors】: Jiajun Liu ; Kun Zhao ; Brano Kusy ; Ji-Rong Wen ; Kai Zheng ; Raja Jurdak

【Abstract】: The prediction of periodical time-series remains challenging due to various types of scaling, misalignments and distortion effects. Here, we propose a novel model called Temporal embedding-enhanced convolutional neural Network (TeNet) to learn repeatedly-occurring-yet-hidden structural elements in periodical time-series, called abstract snippet detectors, to predict future changes. Our model effectively learns a new feature space for a time-series dataset. In the new feature space, distorted time-series that have implicit similarity but substantial differences in value and sequence to regular patterns are re-aligned to the regular patterns in the dataset, and subsequently contribute to a robust prediction mode. The model is robust to various types of distortions and misalignments and demonstrates strong prediction power for periodical time-series. We conduct extensive experiments and discover that the proposed model shows significant and consistent advantages over existing methods on a variety of data modalities ranging from human mobility to household power consumption records, when evaluated under four metrics. The model is also robust to various factors such as number of samples, variance of data, numerical ranges of data etc. The experiments verify that the intuition behind the model can be generalized to multiple data types and applications and promises significant improvement in prediction performance across the datasets studied.

【Keywords】: Convolution; Detectors; Distortion; Neural networks; Predictive models; Robustness; Training

77. Interactive data exploration with smart drill-down.

【Paper Link】【Pages】:906-917

【Authors】: Manas Joglekar ; Hector Garcia-Molina ; Aditya G. Parameswaran

【Abstract】: We present smart drill-down, an operator for interactively exploring a relational table to discover and summarize ¿¿¿interesting¿¿¿ groups of tuples. Each group of tuples is described by a rule. For instance, the rule (a, b, ¿¿¿, 1000) tells us that there are a thousand tuples with value a in the first column and b in the second column (and any value in the third column). Smart drill-down presents an analyst with a list of rules that together describe interesting aspects of the table. The analyst can tailor the definition of interesting, and can interactively apply smart drill-down on an existing rule to explore that part of the table. We demonstrate that the underlying optimization problems are NP-HARD, and describe an algorithm for finding the approximately optimal list of rules to display when the user uses a smart drill-down, and a dynamic sampling scheme for efficiently interacting with large tables. Finally, we perform experiments on real datasets on our experimental prototype to demonstrate the usefulness of smart drill-down and study the performance of our algorithms.

【Keywords】: Aggregates; Algorithm design and analysis; Bicycles; Databases; Heuristic algorithms; Optimization; Prototypes

78. ClEveR: Clustering events with high density of true-to-false occurrence ratio.

【Paper Link】【Pages】:918-929

【Authors】: Georgios Theodoridis ; Thierry Benoist

【Abstract】: Leveraging the ICT evolution, the modern systems collect voluminous sets of monitoring data, which are analysed in order to increase the system's situational awareness. Apart from the regular activity this bulk of monitoring information may also include instances of anomalous operation, which need to be detected and examined thoroughly so as their root causes to be identified. Hence, for an alert mechanism it is crucial to investigate the cross-correlations among the suspicious monitoring traces not only with each other but also against the overall monitoring data, in order to discover any high spatio-temporal concentration of abnormal occurrences that could be considered as evidence of an underlying system malfunction. To this end, this paper presents a novel clustering algorithm that groups instances of problematic behaviour not only according to their concentration but also with respect to the presence of normal activity. On this basis, the proposed algorithm operates at two proximity scales, so as to allow for combining more distant anomalous observations that are not however interrupted by regular feedback. Regardless of the initial motivation, the clustering algorithm is applicable to any case of objects that share a common feature and for which areas of high density in comparison with the rest of the population are examined.

【Keywords】: Clustering algorithms; Europe; IP networks; Internet; Monitoring; Security

79. Event regularity and irregularity in a time unit.

【Paper Link】【Pages】:930-941

【Authors】: Lijian Wan ; Tingjian Ge

【Abstract】: In this paper, we study the problem of learning a regular model from a number of sequences, each of which contains events in a time unit. Assuming some regularity in such sequences, we determine what events should be deemed irregular in their contexts. We perform an in-depth analysis of the model we build, and propose two optimization techniques, one of which is also of independent interest in solving a new problem named the Group Counting problem. Our comprehensive experiments on real and hybrid datasets show that the model we build is very effective in characterizing regularities and identifying irregular events. One of our optimizations improves model building speed by more than an order of magnitude, and the other significantly saves space consumption.

【Keywords】: Adaptation models; Buildings; Computational modeling; Context; Data models; Optimization; Sensors

Research Session 8B: Spatial Analytics 4

【Paper Link】【Pages】:942-953

【Authors】: Hongzhi Yin ; Zhiting Hu ; Xiaofang Zhou ; Hao Wang ; Kai Zheng ; Nguyen Quoc Viet Hung ; Shazia Wasim Sadiq

【Abstract】: Social community detection is a growing field of interest in the area of social network applications, and many approaches have been developed, including graph partitioning, latent space model, block model and spectral clustering. Most existing work purely focuses on network structure information which is, however, often sparse, noisy and lack of interpretability. To improve the accuracy and interpretability of community discovery, we propose to infer users' social communities by incorporating their spatiotemporal data and semantic information. Technically, we propose a unified probabilistic generative model, User-Community-Geo-Topic (UCGT), to simulate the generative process of communities as a result of network proximities, spatiotemporal co-occurrences and semantic similarity. With a well-designed multi-component model structure and a parallel inference implementation to leverage the power of multicores and clusters, our UCGT model is expressive while remaining efficient and scalable to growing large-scale geo-social networking data. We deploy UCGT to two application scenarios of user behavior predictions: check-in prediction and social interaction prediction. Extensive experiments on two large-scale geo-social networking datasets show that UCGT achieves better performance than existing state-of-the-art comparison methods.

【Keywords】: Data models; Knowledge engineering; Network topology; Probabilistic logic; Semantics; Social network services; Spatiotemporal phenomena

81. SPORE: A sequential personalized spatial item recommender system.

【Paper Link】【Pages】:954-965

【Authors】: Weiqing Wang ; Hongzhi Yin ; Shazia Wasim Sadiq ; Ling Chen ; Min Xie ; Xiaofang Zhou

【Abstract】: With the rapid development of location-based social networks (LBSNs), spatial item recommendation has become an important way of helping users discover interesting locations to increase their engagement with location-based services. Although human movement exhibits sequential patterns in LBSNs, most current studies on spatial item recommendations do not consider the sequential influence of locations. Leveraging sequential patterns in spatial item recommendation is, however, very challenging, considering 1) users' check-in data in LBSNs has a low sampling rate in both space and time, which renders existing prediction techniques on GPS trajectories ineffective; 2) the prediction space is extremely large, with millions of distinct locations as the next prediction target, which impedes the application of classical Markov chain models; and 3) there is no existing framework that unifies users' personal interests and the sequential influence in a principled manner. In light of the above challenges, we propose a sequential personalized spatial item recommendation framework (SPORE) which introduces a novel latent variable topic-region to model and fuse sequential influence with personal interests in the latent and exponential space. The advantages of modeling the sequential effect at the topic-region level include a significantly reduced prediction space, an effective alleviation of data sparsity and a direct expression of the semantic meaning of users' spatial activities. Furthermore, we design an asymmetric Locality Sensitive Hashing (ALSH) technique to speed up the online top-k recommendation process by extending the traditional LSH. We evaluate the performance of SPORE on two real datasets and one large-scale synthetic dataset. The results demonstrate a significant improvement in SPORE's ability to recommend spatial items, in terms of both effectiveness and efficiency, compared with the state-of-the-art methods.

【Keywords】: Additives; Global Positioning System; Hidden Markov models; Markov processes; Predictive models; Semantics; Trajectory

82. Reverse nearest neighbor heat maps: A tool for influence exploration.

【Paper Link】【Pages】:966-977

【Authors】: Yu Sun ; Rui Zhang ; Andy Yuan Xue ; Jianzhong Qi ; Xiaoyong Du

【Abstract】: We study the problem of constructing a reverse nearest neighbor (RNN) heat map by finding the RNN set of every point in a two-dimensional space. Based on the RNN set of a point, we obtain a quantitative influence (i.e., heat) for the point. The heat map provides a global view on the influence distribution in the space, and hence supports exploratory analyses in many applications such as marketing and resource management. To construct such a heat map, we first reduce it to a problem called Region Coloring (RC), which divides the space into disjoint regions within which all the points have the same RNN set. We then propose a novel algorithm named CREST that efficiently solves the RC problem by labeling each region with the heat value of its containing points. In CREST, we propose innovative techniques to avoid processing expensive RNN queries and greatly reduce the number of region labeling operations. We perform detailed analyses on the complexity of CREST and lower bounds of the RC problem, and prove that CREST is asymptotically optimal in the worst case. Extensive experiments with both real and synthetic data sets demonstrate that CREST outperforms alternative algorithms by several orders of magnitude.

【Keywords】: Heating; Indexes; Radio frequency

83. Automatic user identification method across heterogeneous mobility data sources.

【Paper Link】【Pages】:978-989

【Authors】: Wei Cao ; Zhengwei Wu ; Dong Wang ; Jian Li ; Haishan Wu

【Abstract】: With the ubiquity of location based services and applications, large volume of mobility data has been generated routinely, usually from heterogeneous data sources, such as different GPS-embedded devices, mobile apps or location based service providers. In this paper, we investigate efficient ways of identifying users across such heterogeneous data sources. We present a MapReduce-based framework called Automatic User Identification (AUI) which is easy to deploy and can scale to very large data set. Our framework is based on a novel similarity measure called the signal based similarity (SIG) which measures the similarity of users' trajectories gathered from different data sources, typically with very different sampling rates and noise patterns. We conduct extensive experimental evaluations, which show that our framework outperforms the existing methods significantly. Our study on one hand provides an effective approach for the mobility data integration problem on large scale data sets, i.e., combining the mobility data sets from different sources in order to enhance the data quality. On the other hand, our study provides an in-depth investigation for the widely studied human mobility uniqueness problem under heterogeneous data sources.

【Keywords】: Buildings; Education; Mobile communication; Navigation; Noise measurement; Trajectory; Urban areas

Research Session 8C: Web Data Processing 4

84. Fast top-k search in knowledge graphs.

【Paper Link】【Pages】:990-1001

【Authors】: Shengqi Yang ; Fangqiu Han ; Yinghui Wu ; Xifeng Yan

【Abstract】: Given a graph query Q posed on a knowledge graph G, top-k graph querying is to find k matches in G with the highest ranking score according to a ranking function. Fast top-k search in knowledge graphs is challenging as both graph traversal and similarity search are expensive. Conventional top-k graph search is typically based on threshold algorithm (TA), which can no long fit the demand in the new setting. This work proposes STAR, a top-k knowledge graph search framework. It has two components: (a) a fast top-k algorithm for star queries, and (b) an assembling algorithm for general graph queries. The assembling algorithm uses star query as a building block and iteratively sweeps the star match lists with a dynamically adjusted bound. For top-k star graph query where an edge can be matched to a path with bounded length d, we develop a message passing algorithm, achieving time complexity O(d2|E| + md) and space complexity linear to d|V| (assuming the size of Q and k is bounded by a constant), where m is the maximum node degree in G. STAR can further be leveraged to answer general graph queries by decomposing a query to multiple star queries and joining their results later. Learning-based techniques to optimize query decomposition are also developed. We experimentally verify that STAR is 5¿¿¿10 times faster than the state-of-the-art TA-style graph search algorithm, and 10¿¿¿100 times faster than a belief propagation approach.

【Keywords】: Belief propagation; Heuristic algorithms; Motion pictures; Query processing; Time complexity; Upper bound

85. Learning to query: Focused web page harvesting for entity aspects.

【Paper Link】【Pages】:1002-1013

【Authors】: Yuan Fang ; Vincent W. Zheng ; Kevin Chen-Chuan Chang

【Abstract】: As the Web hosts rich information about real-world entities, our information quests become increasingly entity centric. In this paper, we study the problem of focused harvesting of Web pages for entity aspects, to support downstream applications such as business analytics and building a vertical portal. Given that search engines are the de facto gateways to assess information on the Web, we recognize the essence of our problem as Learning to Query (L2Q)¿¿¿to intelligently select queries so that we can harvest pages, via a search engine, focused on an entity aspect of interest. Thus, it is crucial to quantify the utilities of the candidate queries w.r.t. some entity aspect. In order to better estimate the utilities, we identify two opportunities and address their challenges. First, a target entity in a given domain has many peers. We leverage these peer entities to become domain aware. Second, a candidate query may ¿¿¿overlap¿¿¿ with the past queries that have already been fired. We account for these past queries to become context aware. Empirical results show that our approach significantly outperforms both algorithmic and manual baselines by 16% and 10% in F-scores, respectively.

【Keywords】: Automobiles; Business; Context; Context-aware services; Portals; Redundancy; Search engines

86. Discovering Neighborhood Pattern Queries by sample answers in knowledge base.

【Paper Link】【Pages】:1014-1025

【Authors】: Jialong Han ; Kai Zheng ; Aixin Sun ; Shuo Shang ; Ji-Rong Wen

【Abstract】: Knowledge bases have shown their effectiveness in facilitating services like Web search and question-answering. Nevertheless, it remains challenging for ordinary users to fully understand the structure of a knowledge base and to issue structural queries. In many cases, users may have a natural language question and also know some popular (but not all) entities as sample answers. In this paper, we study the Reverse top-k Neighborhood Pattern Query problem, with the aim of discovering structural queries of the question based on: (i) the structure of the knowledge base, and (ii) the sample answers of the question. The proposed solution contains two phases: filter and refine. In the filter phase, a search space of candidate queries is systematically explored. The invalid queries whose result sets do not fully cover the sample answers are filtered out. In the refine phase, all surviving queries are verified to ensure that they are sufficiently relevant to the sample answers, with the assumption that the sample answers are more well-known or popular than other entities in the results of relevant queries. Several optimization techniques are proposed to accelerate the refine phrase. For evaluation, we conduct extensive experiments using the DBpedia knowledge base and a set of real-life questions. Empirical results show that our algorithm is able to provide a small set of possible queries, which contains the query matching the user question in natural language.

【Keywords】: Acceleration; Europe; Knowledge based systems; Natural languages; Optimization; Pattern matching; Terminology

87. Incremental updates on compressed XML.

【Paper Link】【Pages】:1026-1037

【Authors】: Stefan Böttcher ; Rita Hartel ; Thomas Jacobs ; Sebastian Maneth

【Abstract】: XML tree structures can be effectively compressed using straight-line grammars. It has been an open problem how to update straight-line grammars, while keeping them compressed. Therefore, the best previous known methods resort to periodic decompression followed by compression from scratch. The decompression step is expensive, potentially with exponential running time. We present a method that avoids this expensive step. Our method recompresses the updated grammar directly, without prior decompression; it thus greatly outperforms the decompress-compress approach, in terms of both space and time. Our experiments show that the obtained grammars are similar or even smaller than those of the decompress-compress method.

【Keywords】: Ad hoc networks; Binary trees; Grammar; Jacobian matrices; Maintenance engineering; Mobile computing; XML

88. Edge classification in networks.

【Paper Link】【Pages】:1038-1049

【Authors】: Charu Aggarwal ; Gewen He ; Peixiang Zhao

【Abstract】: We consider in this paper the edge classification problem in networks, which is defined as follows. Given a graph-structured network G(N, A), where N is a set of vertices and A ¿¿¿ N ¿¿N is a set of edges, in which a subset Al ¿¿¿ A of edges are properly labeled a priori, determine for those edges in Au = AAl the edge labels which are unknown. The edge classification problem has numerous applications in graph mining and social network analysis, such as relationship discovery, categorization, and recommendation. Although the vertex classification problem has been well known and extensively explored in networks, edge classification is relatively unknown and in an urgent need for careful studies. In this paper, we present a series of efficient, neighborhood-based algorithms to perform edge classification in networks. To make the proposed algorithms scalable in large-scale networks, which can be either disk-resident or streamlike, we further devise efficient, cost-effective probabilistic edge classification methods without a significant compromise to the classification accuracy. We carry out experimental studies in a series of real-world networks, and the experimental results demonstrate both the effectiveness and efficiency of the proposed methods for edge classification in large networks.

【Keywords】: Algorithm design and analysis; Context; Data mining; Electronic mail; Gold; Probabilistic logic; Social network services

89. Minfer: A method of inferring motif statistics from sampled edges.

【Paper Link】【Pages】:1050-1061

【Authors】: Pinghui Wang ; John C. S. Lui ; Donald F. Towsley ; Junzhou Zhao

【Abstract】: Characterizing motif (i.e., locally connected sub-graph patterns) statistics is important for understanding complex networks such as online social networks and communication networks. Previous work made the strong assumption that the graph topology of interest is known in advance. In practice, sometimes researchers have to deal with the situation where the graph topology is unknown because it is expensive to collect and store all topological and meta information. Hence, typically what is available to researchers is only a snapshot of the graph, i.e., a subgraph of the graph. Crawling methods such as breadth first sampling can be used to generate the snapshot. However, these methods fail to sample a streaming graph represented as a high speed stream of edges. Therefore, graph mining applications such as network traffic monitoring use random edge sampling (i.e., sample each edge with a fixed probability) to collect edges and generate a sampled graph, which we called a ¿¿¿RESampled graph¿¿¿. Clearly, a RESampled graph's motif statistics may be quite different from those of the underlying original graph. To resolve this, we propose a framework and implement a system called Minfer, which takes the given RESampled graph and accurately infers the underlying graph's motif statistics. We also apply Fisher information to bound the errors of our estimates. Experiments using large scale datasets show the accuracy and efficiency of our method.

【Keywords】: Communication networks; Electronic mail; Intelligent networks; Network topology; Security; Social network services; Topology

【Paper Link】【Pages】:1062-1073

【Authors】: Lizi Liao ; Qirong Ho ; Jing Jiang ; Ee-Peng Lim

【Abstract】: Social networks are an important class of networks that span a wide variety of media, ranging from social websites such as Facebook and Google Plus, citation networks of academic papers and patents, caller networks in telecommunications, and hyperlinked document collections such as Wikipedia ¿¿¿ to name a few. Many of these social networks now exceed millions of users or actors, each of which may be associated with rich attribute data such as user profiles in social websites and caller networks, or subject classifications in document collections and citation networks. Such attribute data is often incomplete for a number of reasons ¿¿¿ for example, users may be unwilling to spend the effort to complete their profiles, while in the case of document collections, there may be insufficient human labor to accurately classify all documents. At the same time, the tie or link information in these networks may also be incomplete ¿¿¿ in social websites, users may simply be unaware of potential acquaintances, while in citation networks, authors may be unaware of appropriate literature that should be referenced. Completing and predicting these missing attributes and ties is important to a spectrum of applications, such as recommendation, personalized search, and targeted advertising, yet large social networks can pose a scalability challenge to existing algorithms designed for this task. Towards this end, we propose an integrative probabilistic model, SLR, that captures both attribute and tie information simultaneously, and can be used for attribute completion and tie prediction, in order to enable the above mentioned applications. A key innovation in our model is the use of triangle motifs to represent ties in the network, in order to scale to networks with millions of nodes and beyond. Experiments on real world datasets show that SLR significantly improves the accuracy of attribute prediction and tie prediction compared to well-known methods, and our distributed, multi-machine impl- mentation easily scales up to millions of users. In addition to fast and accurate attribute and tie prediction, we also demonstrate how SLR can identify the attributes most responsible for homophily within the network, thus revealing which attributes drive network tie formation.

【Keywords】: Algorithm design and analysis; Facebook; Patents; Predictive models; Probabilistic logic; Scalability

91. TOPIC: Toward perfect Influence Graph Summarization.

【Paper Link】【Pages】:1074-1085

【Authors】: Lei Shi ; Sibai Sun ; Yuan Xuan ; Yue Su ; Hanghang Tong ; Shuai Ma ; Yang Chen

【Abstract】: Summarizing large influence graphs is crucial for many graph visualization and mining tasks. Classical graph clustering and compression algorithms focus on summarizing the nodes by their structural-level or attribute-level similarities, but usually are not designed to characterize the flow-level pattern which is the centerpiece of influence graphs. On the other hand, the social influence analysis has been intensively studied, but little is done on the summarization problem without an explicit focus on social networks. Building on the recent study of the Influence Graph Summarization (IGS), this paper presents a new perspective of the underlying flow-based heuristic. It establishes a direct linkage between the optimal summarization and the classic eigenvector centrality of the graph nodes. Such a theoretic linkage has important implications on numerous aspects in the pursuit of a perfect influence graph summarization. In particular, it enables us to develop a suite of algorithms that can: 1) achieve a near-optimal IGS objective, 2) support dynamic summarizations balancing the IGS objective and the stability of transition in navigating the summarization, and 3) scale to million-node graphs with a near-linear computational complexity. Both quantitative experiments on real-world citation networks and the user studies on the task analysis experience demonstrate the effectiveness of the proposed summarization algorithms.

【Keywords】: Clustering algorithms; Computational modeling; Couplings; Heuristic algorithms; Navigation; Social network services; Visualization

Research Session 9B: Optimization of Temporal, Spatial Data 4

92. A GPU-based index to support interactive spatio-temporal queries over historical data.

【Paper Link】【Pages】:1086-1097

【Authors】: Harish Doraiswamy ; Huy T. Vo ; Cláudio T. Silva ; Juliana Freire

【Abstract】: There are increasing volumes of spatio-temporal data from various sources such as sensors, social networks and urban environments. Analysis of such data requires flexible exploration and visualizations, but queries that span multiple geographical regions over multiple time slices are expensive to compute, making it challenging to attain interactive speeds for large data sets. In this paper, we propose a new indexing scheme that makes use of modern GPUs to efficiently support spatio-temporal queries over point data. The index covers multiple dimensions, thus allowing simultaneous filtering of spatial and temporal attributes. It uses a block-based storage structure to speed up OLAP-type queries over historical data, and supports query processing over in-memory and disk-resident data. We present different query execution algorithms that we designed to allow the index to be used in different hardware configurations, including CPU-only, GPU-only, and a combination of CPU and GPU. To demonstrate the effectiveness of our techniques, we implemented them on top of MongoDB and performed an experimental evaluation using two real-world data sets: New York City's (NYC) taxi data - consisting of over 868 million taxi trips spanning a period of five years, and Twitter posts - over 1.1 billion tweets collected over a period of 14 months. Our results show that our GPU-based index obtains interactive, sub-second response times for queries over large data sets and leads to at least two orders of magnitude speedup over spatial indexes implemented in existing open-source and commercial database systems.

【Keywords】: Graphics processing units; Parallel processing; Synchronization; Syntactics; Urban areas

93. An interval join optimized for modern hardware.

【Paper Link】【Pages】:1098-1109

【Authors】: Danila Piatov ; Sven Helmer ; Anton Dignös

【Abstract】: We develop an algorithm for efficiently joining relations on interval-based attributes with overlap predicates, which, for example, are commonly found in temporal databases. Using a new data structure and a lazy evaluation technique, we are able to achieve impressive performance gains by optimizing memory accesses exploiting features of modern CPU architectures. In an experimental evaluation with real-world datasets our algorithm is able to outperform the state-of-the-art by an order of magnitude.

【Keywords】: Containers; Data structures; Hardware; Indexes; Partitioning algorithms; Spatial databases

94. Efficiently computing reverse k furthest neighbors.

【Paper Link】【Pages】:1110-1121

【Authors】: Shenlu Wang ; Muhammad Aamir Cheema ; Xuemin Lin ; Ying Zhang ; Dongxi Liu

【Abstract】: Given a set of facilities F, a set of users U and a query facility q, a reverse k furthest neighbors (RkFN) query retrieves every user u ¿¿¿ U for which q is one of its k-furthest facilities. RkFN query is the natural complement of reverse k-nearest neighbors (RkNN) query that returns every user u for which q is one of its k-nearest facilities. While RkNN query returns the users that are highly influenced by a query q, RkFN query aims at finding the users that are least influenced by a query q. RkFN query has many applications in location-based services, marketing, facility location, clustering, and recommendation systems etc. While there exist several algorithms that answer RkFN query for k = 1, we are the first to propose a solution for arbitrary value of k. Based on several interesting observations, we present an efficient algorithm to process the RkFN queries. We also present a rigorous theoretical analysis to study various important aspects of the problem and our algorithm. An extensive experimental study is conducted using both real and synthetic data sets, demonstrating that our algorithm outperforms the state-of-the-art algorithm even for k = 1. The accuracy of our theoretical analysis is also verified by the experiments.

【Keywords】: Algorithm design and analysis; Australia; Clustering algorithms; Computational geometry; Games; Mobile radio mobility management; Rendering (computer graphics)

95. Indexing multi-metric data.

【Paper Link】【Pages】:1122-1133

【Authors】: Maximilian Franzke ; Tobias Emrich ; Andreas Züfle ; Matthias Renz

【Abstract】: The proliferation of the Web 2.0 and the ubiquitousness of social media yield a huge flood of heterogenous data that is voluntarily published and shared by billions of individual users all over the world. As a result, the representation of an entity (such as a real person) in this data may consist of various data types, including location and other numeric attributes, textual descriptions, images, videos, social network information and other types of information. Searching similar entities in this multi-enriched data exploiting the information of multiple representations simultaneously promises to yield more interesting and relevant information than searching among each data type individually. While efficient similarity search on single representations is a well studied problem, existing studies lacks appropriate solutions for multi-enriched data taking into account the combination of all representations as a whole. In this paper, we address the problem of index-supported similarity search on multi-enriched (a.k.a. multi-represented) objects based on a set of metrics, one metric for each representation. We define multimetric similarity search queries by employing user-defined weight function specifying the impact of each metric at query time. Our main contribution is an index structure which combines all metrics into a single multi-dimensional access method that works for arbitrary weights preferences. The experimental evaluation shows that our proposed index structure is more efficient than existing multi-metric access methods considering different cost criteria and tremendously outperforms traditional approaches when querying very large sets of multi-enriched objects.

【Keywords】: Extraterrestrial measurements; Indexing; Search problems; Social network services

Research Session 9C: Data Integration and Strings 4

96. DataXFormer: A robust transformation discovery system.

【Paper Link】【Pages】:1134-1145

【Authors】: Ziawasch Abedjan ; John Morcos ; Ihab F. Ilyas ; Mourad Ouzzani ; Paolo Papotti ; Michael Stonebraker

【Abstract】: In data integration, data curation, and other data analysis tasks, users spend a considerable amount of time converting data from one representation to another. For example US dates to European dates or airport codes to city names. In a previous vision paper, we presented the initial design of DataXFormer, a system that uses web resources to assist in transformation discovery. Specifically, DataXFormer discovers possible transformations from web tables and web forms and involves human feedback where appropriate. In this paper, we present the full fledged system along with several extensions. In particular, we present algorithms to find (i) transformations that entail multiple columns of input data, (ii) indirect transformations that are compositions of other transformations, (iii) transformations that are not functions but rather relationships, and (iv) transformations from a knowledge base of public data. We report on experiments with a collection of 120 transformation tasks, and show our enhanced system automatically covers 101 of them by using openly available resources.

【Keywords】: Airports; Computer architecture; Indexing; Knowledge based systems; Search engines; Semantics; Urban areas

97. Joint repairs for web wrappers.

【Paper Link】【Pages】:1146-1157

【Authors】: Stefano Ortona ; Giorgio Orsi ; Tim Furche ; Marcello Buoncristiano

【Abstract】: Automated web scraping is a popular means for acquiring data from the web. Scrapers (or wrappers) are derived from either manually or automatically annotated examples, often resulting in under/over segmented data, together with missing or spurious content. Automatic repair and maintenance of the extracted data is thus a necessary complement to automatic wrapper generation. Moreover, the extracted data is often the result of a long-term data acquisition effort and thus jointly repairing wrappers together with the generated data reduces future needs for data cleaning. We study the problem of computing joint repairs for XPath-based wrappers and their extracted data. We show that the problem is NP-complete in general but becomes tractable under a few natural assumptions. Even tractable solutions to the problem are still impractical on very large datasets, but we propose an optimal approximation that proves effective across a wide variety of domains and sources. Our approach relies on encoded domain knowledge, but require no per-source supervision. An evaluation spanning more than 100k web pages from 100 different sites of a wide variety of application domains, shows that joint repairs are able to increase the quality of wrappers between 15% and 60% independently of the wrapper generation system, eliminating all errors in more than 50% of the cases.

【Keywords】: Cleaning; Computer science; Data acquisition; Data mining; Maintenance engineering; Runtime; Web pages

98. Fast motif discovery in short sequences.

【Paper Link】【Pages】:1158-1169

【Authors】: Honglei Liu ; Fangqiu Han ; Hongjun Zhou ; Xifeng Yan ; Kenneth S. Kosik

【Abstract】: Motif discovery in sequence data is fundamental to many biological problems such as antibody biomarker identification. Recent advances in instrumental techniques make it possible to generate thousands of protein sequences at once, which raises a big data issue for the existing motif finding algorithms: They either work only in a small scale of several hundred sequences or have to trade accuracy for efficiency. In this work, we demonstrate that by intelligently clustering sequences, it is possible to significantly improve the scalability of all the existing motif finding algorithms without losing accuracy at all. An anchor based sequence clustering algorithm (ASC) is thus proposed to divide a sequence dataset into multiple smaller clusters so that sequences sharing the same motif will be located into the same cluster. Then an existing motif finding algorithm can be applied to each individual cluster to generate motifs. In the end, the results from multiple clusters are merged together as final output. Experimental results show that our approach is generic and orders of magnitude faster than traditional motif finding algorithms. It can discover motifs from protein sequences in the scale that no existing algorithm can handle. In particular, ASC reduces the running time of a very popular motif finding algorithm, MEME, from weeks to a few minutes with even better accuracy.

【Keywords】: Big data; Clustering algorithms; Instruments; Proteins; Random sequences; Shape

99. A novel fast and memory efficient parallel MLCS algorithm for long and large-scale sequences alignments.

【Paper Link】【Pages】:1170-1181

【Authors】: Yanni Li ; Yuping Wang ; Zhensong Zhang ; Yaxin Wang ; Ding Ma ; Jianbin Huang

【Abstract】: Information usually can be abstracted as a character sequence over a finite alphabet. With the advent of the era of big data, the increasing length and size of the sequences from various application fields (e.g., biological sequences) result in the classical NP-hard problem, searching for the Multiple Longest Common Subsequences of multiple sequences (i.e., MLCS problem with many applications in the areas of bioinformatics, computational genomics, pattern recognition, etc.), becoming a research hotspot and facing severe challenges. In this paper, we firstly reveal that the leading dominant-point-based MLCS algorithms are very hard to apply to long and large-scale sequences alignments. To overcome their defects, based on the proposed problem-solving model and parallel topological sorting strategies, we present a novel efficient parallel MLCS algorithm. The comprehensive experiments on the benchmark datasets of both random and biological sequences demonstrate that both the time and space complexities of the proposed algorithm are only linearly related to the dominants from aligned sequences, and that the proposed algorithm greatly outperforms the existing state-of-the-art dominant-point-based MLCS algorithms, and hence it is very suitable for long and large-scale sequences alignments.

【Keywords】: Algorithm design and analysis; Bioinformatics; Biology; Complexity theory; Dynamic programming; Heuristic algorithms; Silicon; ICSG-PCC Model; Irredundant Common Subsequence Graph (ICSG); Multiple Longest Common Subsequences (MLCS); Parallel Algorithm; Parallel Collection Chain (PCC)

Industrial and Applications 1: Distributed/Parallel Systems 3

100. SQL-SA for big data discovery polymorphic and parallelizable SQL user-defined scalar and aggregate infrastructure in Teradata Aster 6.20.

【Paper Link】【Pages】:1182-1193

【Authors】: Xin Tang ; Robert M. Wehrmeister ; James Shau ; Abhirup Chakraborty ; Daley Alex ; Awny Al Omari ; Feven Atnafu ; Jeff Davis ; Litao Deng ; Deepak Jaiswal ; Chittaranjan Keswani ; Yafeng Lu ; Chao Ren ; Tom Reyes ; Kashif Siddiqui ; David E. Simmen ; Devendra Vidhani ; Ling Wang ; Shuai Yang ; Daniel Yu

【Abstract】: There is increasing demand to integrate big data analytic systems using SQL. Given the vast ecosystem of SQL applications, enabling SQL capabilities allows big data platforms to expose their analytic potential to a wide variety of end users, accelerating discovery processes and providing significant business value. Most existing big data frameworks are based on one particular programming model such as MapReduce or Graph. However, data scientists are often forced to manually create adhoc data pipelines to connect various big data tools and platforms to serve their analytic needs. When the analytic tasks change, these data pipelines may be costly to modify and maintain. In this paper we present SQL-SA, a polymorphic and parallelizable SQL scalar and aggregate infrastructure in Aster 6.20. This infrastructure extends Aster 6's MapReduce and Graph capabilities to support polymorphic user-defined scalar and aggregate functions using flexible SQL syntax. The implementation enhances main Aster components including query syntax, API, planning and execution extensively. Integrating these new user-defined scalar and aggregate functions with Aster MapReduce and Graph functions, Aster 6.20 enables data scientists to integrate diverse programming models in a single SQL statement. The statement is automatically converted to an optimal data pipeline and executed in parallel. Using a real world business problem and data, Aster 6.20 demonstrates a significant performance advantage (25%+) over Hadoop Pig and Hive.

【Keywords】: Aggregates; Big data; Business; Distributed databases; Java; Programming; Syntactics

101. Flow-Join: Adaptive skew handling for distributed joins over high-speed networks.

【Paper Link】【Pages】:1194-1205

【Authors】: Wolf Rödiger ; Sam Idicula ; Alfons Kemper ; Thomas Neumann

【Abstract】: Modern InfiniBand interconnects offer link speeds of several gigabytes per second and a remote direct memory access (RDMA) paradigm for zero-copy network communication. Both are crucial for parallel database systems to achieve scalable distributed query processing where adding a server to the cluster increases performance. However, the scalability of distributed joins is threatened by unexpected data characteristics: Skew can cause a severe load imbalance such that a single server has to process a much larger part of the input than its fair share and by this slows down the entire distributed query. We introduce Flow-Join, a novel distributed join algorithm that handles attribute value skew with minimal overhead. Flow-Join detects heavy hitters at runtime using small approximate histograms and adapts the redistribution scheme to resolve load imbalances before they impact the join performance. Previous approaches often involve expensive analysis phases, which slow down distributed join processing for non-skewed workloads. This is especially the case for modern high-speed interconnects, which are too fast to hide the extra computation. Other skew handling approaches require detailed statistics, which are often not available or overly inaccurate for intermediate results. In contrast, Flow-Join uses our novel lightweight skew handling scheme to execute at the full network speed of more than 6 GB/s for InfiniBand 4¿¿FDR, joining a skewed input at 11.5 billion tuples/s with 32 servers. This is 6.8¿¿ faster than a standard distributed hash join using the same hardware. At the same time, Flow-Join does not compromise the join performance for non-skewed workloads.

【Keywords】: Histograms; Probes; Query processing; Runtime; Scalability; Servers; Standards

102. Moolle: Fan-out control for scalable distributed data stores.

【Paper Link】【Pages】:1206-1217

【Authors】: SungJu Cho ; Andrew Carter ; Joshua Ehrlich ; Jane Alam Jan

【Abstract】: Many Online Social Networks horizontally partition data across data stores. This allows the addition of server nodes to increase capacity and throughput. For single key lookup queries such as computing a member's 1st degree connections, clients need to generate only one request to one data store. However, for multi key lookup queries such as computing a 2nd degree network, clients need to generate multiple requests to multiple data stores. The number of requests to fulfill the multi key lookup queries grows in relation to the number of partitions. Increasing the number of server nodes in order to increase capacity also increases the number of requests between the client and data stores. This may increase the latency of the query response time because of network congestion, tail-latency, and CPU bounding. Replication based partitioning strategies can reduce the number of requests in the multi key lookup queries. However, reducing the number of requests in a query can degrade the performance of certain queries where processing, computing, and filtering can be done by the data stores. A better system would provide the capability of controlling the number of requests in a query. This paper presents Moolle, a system of controlling the number of requests in queries to scalable distributed data stores. Moolle has been implemented in the LinkedIn distributed graph service that serves hundreds of thousands of social graph traversal queries per second. We believe that Moolle can be applied to other distributed systems that handle distributed data processing with a high volume of variable-sized requests.

【Keywords】: Data systems; Distributed databases; LinkedIn; Query processing; Servers; Time factors

Industrial and Applications 2: Potpourri 1 3

103. Personal recommendation using deep recurrent neural networks in NetEase.

【Paper Link】【Pages】:1218-1229

【Authors】: Sai Wu ; Weichao Ren ; Chengchao Yu ; Gang Chen ; Dongxiang Zhang ; Jingbo Zhu

【Abstract】: Each user session in an e-commerce system can be modeled as a sequence of web pages, indicating how the user interacts with the system and makes his/her purchase. A typical recommendation approach, e.g., Collaborative Filtering, generates its results at the beginning of each session, listing the most likely purchased items. However, such approach fails to exploit current viewing history of the user and hence, is unable to provide a real-time customized recommendation service. In this paper, we build a deep recurrent neural network to address the problem. The network tracks how users browse the website using multiple hidden layers. Each hidden layer models how the combinations of webpages are accessed and in what order. To reduce the processing cost, the network only records a finite number of states, while the old states collapse into a single history state. Our model refreshes the recommendation result each time when user opens a new web page. As user's session continues, the recommendation result is gradually refined. Furthermore, we integrate the recurrent neural network with a Feedfoward network which represents the user-item correlations to increase the prediction accuracy. Our approach has been applied to Kaola (http://www.kaola.com), an e-commerce website powered by the NetEase technologies. It shows a significant improvement over previous recommendation service.

【Keywords】: Computational modeling; History; Predictive models; Real-time systems; Recurrent neural networks; Web pages

104. Recommendations meet web browsing: enhancing collaborative filtering using internet browsing logs.

【Paper Link】【Pages】:1230-1238

【Authors】: Royi Ronen ; Elad Yom-Tov ; Gal Lavee

【Abstract】: Collaborative filtering (CF) recommendation systems are one of the most popular and successful methods for recommending products to people. CF systems work by finding similarities between different people according to their past purchases, and using these similarities to suggest possible items of interest. In this work we show that CF systems can be enhanced using Internet browsing data and search engine query logs, both of which represent a rich profile of individuals' interests.

【Keywords】: Collaboration; Filtering; History; Internet; Measurement; Motion pictures; Search engines

105. SPDO: High-throughput road distance computations on Spark using Distance Oracles.

【Paper Link】【Pages】:1239-1250

【Authors】: Shangfu Peng ; Jagan Sankaranarayanan ; Hanan Samet

【Abstract】: In the past decades, shortest distance methods for road networks have been developed that focus on how to speed up the latency of a single source-target pair distance query. Large analytical applications on road networks including simulations (e.g., evacuation planning), logistics, and transportation planning require methods that provide high throughput (i.e., distance computations per second) and the ability to ¿¿¿scale out¿¿¿ by using large distributed computing clusters. A framework called SPDO is presented which implements an extremely fast distributed algorithm for computing road network distance queries on Apache Spark. The approach extends our previous work of developing the ¿¿-distance oracle which has now been adapted to use Spark's resilient distributed dataset (RDD). Compared with state-of-the-art methods that focus on reducing latency, the proposed framework improves the throughput by at least an order of magnitude, which makes the approach suitable for applications that need to compute thousands to millions of network distances per second.

【Keywords】: Handheld computers; Road networks; Spark; analytical queries; high throughput; network distance computations

Industrial and Applications 3: Potpourri 2 3

【Paper Link】【Pages】:1251-1262

【Authors】: Christopher Jonathan ; Amr Magdy ; Mohamed F. Mokbel ; Albert Jonathan

【Abstract】: The recent wide popularity of microblogs (e.g., tweets, online comments) has empowered various important applications, including, news delivery, event detection, market analysis, and target advertising. A core module in all these applications is a frequent/trending query processor that aims to find out those topics that are highly frequent or trending in the social media through posted microblogs. Unfortunately current attempts for such core module suffer from several drawbacks. Most importantly, their narrow scope, as they focus only on solving trending queries for a very special case of localized and very recent microblogs. This paper presents GARNET; a holistic system equipped with one-stop efficient and scalable solution for supporting a generic form of context-aware frequent and trending queries on microblogs. GARNET supports both frequent and trending queries, any arbitrary time interval either current, recent, or past, of fixed granularity, and having a set of arbitrary filters over contextual attributes. From a system point of view, GARNET is very appealing and industry-friendly, as one needs to realize it once in the system. Then, a myriad of various forms of trending and frequent queries are immediately supported. Experimental evidence based on a real system prototype of GARNET and billions of real Twitter data show the scalability and efficiency of GARNET for various query types.

【Keywords】: Context; Data structures; Frequency measurement; Garnets; Indexes; Systems architecture; Twitter

107. Using SSDs to scale up Google Fusion Tables, a database-in-the-cloud.

【Paper Link】【Pages】:1263-1274

【Authors】: Yingyi Bu ; Felix Halim ; Changkyu Kim ; Hongrae Lee ; Jayant Madhavan

【Abstract】: Flash memory solid state drives (SSDs) have increasingly been advocated and adopted as a means of speeding up and scaling up data-driven applications. SSDs are becoming more widely available as an option in the cloud. However, when an application considers SSDs in the cloud, the best option for the application may not be immediate, among a number of choices for placing SSDs in the layers of the cloud. Although there have been many studies on SSDs, they often concern a specific setting, and how different SSD options in the cloud compare with each other is less well understood. In this paper, we describe how Google Fusion Tables (GFT) used SSDs and what optimizations were implemented to scale up its in-memory processing, clearly showing opportunities and limitations of SSDs in the cloud with quantitative analyses. We first discuss various SSD placement strategies and compare them with low-level measurements, and propose SSD-placement guidelines for a variety of cloud data services. We then present internals of our column engine and optimizations to better use the performance characteristics of SSDs. We empirically demonstrate that the optimizations enable us to scale our application to much larger datasets while retaining the low-latency and simple query processing architecture.

【Keywords】: Cloud computing; Data visualization; Google; Optimization; Query processing; Scalability; Servers

108. FastFunction: Replacing a herd of lemmings with a cheetah a ruby framework for interaction with PostgreSQL databases.

【Paper Link】【Pages】:1275-1286

【Authors】: Henrietta Dombrovskaya ; Srivathsava Rangarajan ; Jonathan Marks

【Abstract】: The ability of web applications to respond fast is critical to the success of any web-based business, and time spent on the interaction with the database is often the most time-consuming portion of the overall response time. Although recent research suggest lots of proven algorithms to improve this interaction, the technicalities of implementation are often considered too time consuming by application developers. In this paper we present a tool that eliminates a significant portion of these technical difficulties and allows almost seamless incorporation of complex database queries and functions into object-oriented applications.

【Keywords】: Analytical models; Databases; ORM; holistic application optimization; object-relational impedance mismatch

Industrial and Applications 4: Real Time Analytics 3

109. A column store engine for real-time streaming analytics.

【Paper Link】【Pages】:1287-1297

【Authors】: Alex Skidanov ; Anders J. Papito ; Adam Prout

【Abstract】: This paper describes novel aspects of the column store implemented in the MemSQL database engine and describes the design choices made to support real-time streaming workloads. Column stores have traditionally been restricted to data warehouse scenarios where low latency queries are a secondary goal, and where restricting data ingestion to be offline, batched, append-only, or some combination thereof is acceptable. In contrast, the MemSQL column store implementation treats low latency queries and ongoing writes as first class citizens, with a focus on avoiding interference between read, ingest, update, and storage optimization workloads through the use of fragmented snapshot transactions and optimistic storage reordering. This implementation broadens the range of serviceable column store workloads to include those with more stringent demands on query and data latency, such as those backing operational systems used by adtech, financial services, fraud detection and other real-time or data streaming applications.

【Keywords】: Distributed databases; Engines; Indexes; Metadata; Real-time systems; Sorting; analytics; columnar; columnstore; concurrency; memsql; realtime; streaming; tpch

110. Fault-tolerant real-time analytics with distributed Oracle Database In-memory.

【Paper Link】【Pages】:1298-1309

【Authors】: Niloy Mukherjee ; Shasank Chavan ; Maria Colgan ; Mike Gleeson ; Xiaoming He ; Allison Holloway ; Jesse Kamp ; Kartik Kulkarni ; Tirthankar Lahiri ; Juan Loaiza ; Neil MacNaughton ; Atrayee Mullick ; Sujatha Muthulingam ; Vivekanandhan Raja ; Raunak Rungta

【Abstract】: Modern data management systems are required to address new breeds of OLTAP applications. These applications demand real time analytical insights over massive data volumes not only on dedicated data warehouses but also on live mainstream production environments where data gets continuously ingested and modified. Oracle introduced the Database In-memory Option (DBIM) in 2014 as a unique dual row and column format architecture aimed to address the emerging space of mixed OLTAP applications along with traditional OLAP workloads. The architecture allows both the row format and the column format to be maintained simultaneously with strict transactional consistency. While the row format is persisted in underlying storage, the column format is maintained purely in-memory without incurring additional logging overheads in OLTP. Maintenance of columnar data purely in memory creates the need for distributed data management architectures. Performance of analytics incurs severe regressions in single server architectures during server failures as it takes non-trivial time to recover and rebuild terabytes of in-memory columnar format. A distributed and distribution aware architecture therefore becomes necessary to provide real time high availability of the columnar format for glitch-free in-memory analytic query execution across server failures and additions, besides providing scale out of capacity and compute to address real time throughput requirements over large volumes of in-memory data. In this paper, we will present the high availability aspects of the distributed architecture of Oracle DBIM that includes extremely scaled out application transparent column format duplication mechanism, distributed query execution on duplicated in-memory columnar format, and several scenarios of fault tolerant analytic query execution across the in-memory column format at various stages of redistribution of columnar data during cluster topology changes.

【Keywords】: Computer architecture; Context; Distributed databases; Fault tolerance; Fault tolerant systems; Real-time systems; Servers; OLTAP; Oracle Database In-memory; distributed architecture; distributed in-memory fault tolerant analytics; high availability; real-time analytics

111. Virtual lightweight snapshots for consistent analytics in NoSQL stores.

【Paper Link】【Pages】:1310-1321

【Authors】: Fernando Chirigati ; Jérôme Siméon ; Martin Hirzel ; Juliana Freire

【Abstract】: Increasingly, applications that deal with big data need to run analytics concurrently with updates. But bridging the gap between big and fast data is challenging: most of these applications require analytics' results that are fresh and consistent, but without impacting system latency and throughput. We propose virtual lightweight snapshots (VLS), a mechanism that enables consistent analytics without blocking incoming updates in NoSQL stores. VLS requires neither native support for database versioning nor a transaction manager. Besides, it is storage-efficient, keeping additional versions of records only when needed to guarantee consistency, and sharing versions across multiple concurrent snapshots. We describe an implementation of VLS in MongoDB and present a detailed experimental evaluation which shows that it supports consistency for analytics with small impact on query evaluation time, update throughput, and latency.

【Keywords】: Airports; Big data; Context; Databases; Middleware; Stability analysis; Throughput

Demo Session 1D 6

112. ALEX: Automatic Link Exploration in Linked Data.

【Paper Link】【Pages】:1322-1325

【Authors】: Ahmed El-Roby ; Ashraf Aboulnaga

【Abstract】: There has recently been an increase in the number of RDF knowledge bases published on the Internet. These rich RDF data sets can be useful in answering many queries, but much more interesting queries can be answered by integrating data from different data sets. This has given rise to research on automatically linking different RDF data sets representing different knowledge bases. This is challenging due to the scale and semantic heterogeneity of these data sets. Various approaches have been proposed, but there is room for improving the quality of the generated links. In this demonstration, we showcase ALEX, a system that aims at improving the quality of links between RDF data sets by using feedback provided by users on the answers to linked data queries. ALEX starts with multiple RDF data sets that are linked using any automatic linking algorithm. ALEX enables the user to issue queries that integrate data from different data sets, and to provide feedback on the answers to these queries. ALEX uses this feedback to eliminate incorrect links between the data sets and discover new links. In this demonstration, we show ALEX in action on multiple data sets from the Linked Open Data cloud.

【Keywords】: Joining processes; Knowledge based systems; Learning (artificial intelligence); Negative feedback; Query processing; Resource description framework; Semantics

【Paper Link】【Pages】:1326-1329

【Authors】: Dolan Antenucci ; Michael R. Anderson ; Penghua Zhao ; Michael J. Cafarella

【Abstract】: Social media nowcasting, the process of estimating real-world phenomena from social media data, has grown in popularity over the last several years as an alternative to traditional data collection methods like phone surveys. Unfortunately, current nowcasting methods depend on pre-existing, traditionally collected survey data as an aid to sift through the huge number of signals that can be derived from social media. This dependence severely limits the applicability of current nowcasting techniques. If we could remove this need for conventional data, social media signals could describe a much wider range of target phenomena. We have built a nowcasting querying system that estimates real-world phenomena without requiring any conventional data, relying instead upon an interactive exploration with users. Specifically, our system exploits a user-provided multi-part query consisting of semantic and signal components. The user can explore in real time the tradeoff between these two components to find the most relevant social media signals to estimate the target phenomenon. Our demonstration system lets users search for signals within a large Twitter corpus using a dynamic web-based interface. Also, users can share results with the general public, review and comment on others' shared results, and clone these results as starting points for further exploration and querying.

【Keywords】: Correlation; Interviews; Market research; Media; Prototypes; Semantics; Unemployment

【Paper Link】【Pages】:1330-1333

【Authors】: Leila Bahri ; Amira Soliman ; Jacopo Squillaci ; Barbara Carminati ; Elena Ferrari ; Sarunas Girdzijauskas

【Abstract】: Fake accounts in online social networks (OSNs) have known considerable sophistication and are now attempting to gain network trust by infiltrating within honest communities. Honest users have limited perspective on the truthfulness of new online identities requesting their friendship. This facilitates the task of fake accounts in deceiving honest users to befriend them. To address this, we have proposed a model that learns hidden correlations between profile attributes within OSN communities, and exploits them to assist users in estimating the trustworthiness of new profiles. To demonstrate our method, we suggest, in this demo, a game application through which players try to cheat the system and convince nodes in a simulated OSN to befriend them. The game deploys different strategies to challenge the players and to reach the objectives of the demo. These objectives are to make participants aware of how fake accounts can infiltrate within their OSN communities, to demonstrate how our suggested method could aid in mitigating this threat, and to eventually strengthen our model based on the data collected from the moves of the players.

【Keywords】: Correlation; Engines; Games; Joining processes; Media; Social network services; Urban areas

115. OptImatch: Semantic web system for query problem determination.

【Paper Link】【Pages】:1334-1337

【Authors】: Guilherme Damasio ; Piotr Mierzejewski ; Jaroslaw Szlichta ; Calisto Zuzarte

【Abstract】: Query performance problem determination is usually performed by analyzing query execution plans (QEPs). Analyzing complex QEPs is excessively time consuming and existing automatic problem determination tools do not provide ability to perform analysis with flexible user-defined problem patterns. We present the novel OptImatch system that allows a relatively naive user to search for patterns in QEPs and get recommendations from an expert and user customizable knowledge base. Our system transforms a QEP into an RDF graph. We provide a web graphical interface for the user to describe a pattern that is transformed with handlers into a SPARQL query. The SPARQL query is matched against the abstracted RDF graph and any matched parts of the graph are relayed back to the user. With the knowledge base the system automatically matches stored patterns to the QEPs by adapting dynamic context through developed tagging language and ranks recommendations using statistical correlation analysis.

【Keywords】: Context; Databases; Knowledge based systems; Pattern matching; Resource description framework; Tagging

116. INSQ: An influential neighbor set based moving kNN query processing system.

【Paper Link】【Pages】:1338-1341

【Authors】: Chuanwen Li ; Yu Gu ; Jianzhong Qi ; Ge Yu ; Rui Zhang ; Qingxu Deng

【Abstract】: We revisit the moving k nearest neighbor (MkNN) query, which computes one's k nearest neighbor set and maintains it while at move. Existing MkNN algorithms are mostly safe region based, which lack efficiency due to either computing small safe regions with a high recomputation frequency or computing larger safe regions but with a high cost for each computation. In this demonstration, we showcase a system named INSQ that adopts a novel algorithm called the Influential Neighbor Set (INS) algorithm to process the MkNN query in both two-dimensional Euclidean space and road networks. This algorithm uses a small set of safe guarding objects instead of safe regions. As long as the the current k nearest neighbors are closer to the query object than the safe guarding objects are, the current k nearest neighbors stay valid and no recomputation is required. Meanwhile, the region defined by the safe guarding objects is the largest possible safe region. This means that the recomputation frequency is also minimized and hence, the INS algorithm achieves high overall query processing efficiency.

【Keywords】: Artificial neural networks; Australia; Electronic mail; Information science; Query processing; Roads

117. graphVizdb: A scalable platform for interactive large graph visualization.

【Paper Link】【Pages】:1342-1345

【Authors】: Nikos Bikakis ; John Liagouris ; Maria Krommyda ; George Papastefanatos ; Timos K. Sellis

【Abstract】: We present a novel platform for the interactive visualization of very large graphs. The platform enables the user to interact with the visualized graph in a way that is very similar to the exploration of maps at multiple levels. Our approach involves an offline preprocessing phase that builds the layout of the graph by assigning coordinates to its nodes with respect to a Euclidean plane. The respective points are indexed with a spatial data structure, i.e., an R-tree, and stored in a database. Multiple abstraction layers of the graph based on various criteria are also created offline, and they are indexed similarly so that the user can explore the dataset at different levels of granularity, depending on her particular needs. Then, our system translates user operations into simple and very efficient spatial operations (i.e., window queries) in the backend. This technique allows for a fine-grained access to very large graphs with extremely low latency and memory requirements and without compromising the functionality of the tool. Our web-based prototype supports three main operations: (1) interactive navigation, (2) multi-level exploration, and (3) keyword search on the graph metadata.

【Keywords】: Data visualization; Geometry; Indexes; Layout; Navigation; Partitioning algorithms

Demo Session 1E 6

118. QB2OLAP: Enabling OLAP on Statistical Linked Open Data.

【Paper Link】【Pages】:1346-1349

【Authors】: Jovan Varga ; Lorena Etcheverry ; Alejandro A. Vaisman ; Oscar Romero ; Torben Bach Pedersen ; Christian Thomsen

【Abstract】: Publication and sharing of multidimensional (MD) data on the Semantic Web (SW) opens new opportunities for the use of On-Line Analytical Processing (OLAP). The RDF Data Cube (QB) vocabulary, the current standard for statistical data publishing, however, lacks key MD concepts such as dimension hierarchies and aggregate functions. QB4OLAP was proposed to remedy this. However, QB4OLAP requires extensive manual annotation and users must still write queries in SPARQL, the standard query language for RDF, which typical OLAP users are not familiar with. In this demo, we present QB2OLAP, a tool for enabling OLAP on existing QB data. Without requiring any RDF, QB(4OLAP), or SPARQL skills, it allows semi-automatic transformation of a QB data set into a QB4OLAP one via enrichment with QB4OLAP semantics, exploration of the enriched schema, and querying with the high-level OLAP language QL that exploits the QB4OLAP semantics and is automatically translated to SPARQL.

【Keywords】: Aggregates; Europe; Organizations; Resource description framework; Semantics; Standards; Vocabulary

119. SEED: A system for entity exploration and debugging in large-scale knowledge graphs.

【Paper Link】【Pages】:1350-1353

【Authors】: Jun Chen ; Yueguo Chen ; Xiaoyong Du ; Xiangling Zhang ; Xuan Zhou

【Abstract】: Large-scale knowledge graphs (KGs) contain massive entities and abundant relations among the entities. Data exploration over KGs allows users to browse the attributes of entities as well as the relations among entities. It therefore provides a good way of learning the structure and coverage of KGs. In this paper, we introduce a system called SEED that is designed to support entity-oriented exploration in large-scale KGs, based on retrieving similar entities of some seed entities as well as their semantic relations that show how entities are similar to each other. A by-product of entity exploration in SEED is to facilitate discovering the deficiency of KGs, so that the detected bugs can be easily fixed by users as they explore the KGs.

【Keywords】: Debugging; Generators; Itemsets; Knowledge engineering; Semantics; User interfaces; Visualization

120. Ranking support for matched patterns over complex event streams: The CEPR system.

【Paper Link】【Pages】:1354-1357

【Authors】: Jiaqi Gu ; Jin Wang ; Carlo Zaniolo

【Abstract】: There is a growing interest in pattern matching over complex event streams. While many bodies of techniques were proposed to search complex patterns and enhance the expressive power of query language, no previous work focused on supporting a well-defined ranking mechanism over answers using semantic ordering. To satisfy this need, we proposed CEPR, a CEP system capable of ranking matchings and emitting ordered results based on users' intentions via a novel query language. In this demo, we will (i) demonstrate language features, system architecture and functionalities, (ii) show examples of CEPR in various application domains and (iii) present a user-friendly interface to monitor query results and interact with the system in real time.

【Keywords】: Automata; DNA; Database languages; Engines; Measurement; Pattern matching; Semantics

121. QPlain: Query by explanation.

【Paper Link】【Pages】:1358-1361

【Authors】: Daniel Deutch ; Amir Gilad

【Abstract】: To assist non-specialists in formulating database queries, multiple frameworks that automatically infer queries from a set of input and output examples have been proposed. While highly useful, a shortcoming of the approach is that if users can only provide a small set of examples, many inherently different queries may qualify. We observe that additional information about the examples, in the form of their explanations, is useful in significantly focusing the set of qualifying queries. We propose to demonstrate QPlain, a system that learns conjunctive queries from examples and their explanations. We capture explanations of different levels of granularity and detail, by leveraging recently developed models for data provenance. Explanations are fed through an intuitive interface, are compiled to the appropriate provenance model, and are then used to derive proposed queries. We will demonstrate that it is feasible for non-specialists to provide examples with meaningful explanations, and that the presence of such explanations result in a much more focused set of queries which better match user intentions.

【Keywords】: Algorithm design and analysis; Computer science; Context; Databases; Graphical user interfaces; Planning; Standards

122. InVerDa - co-existing schema versions made foolproof.

【Paper Link】【Pages】:1362-1365

【Authors】: Kai Herrmann ; Hannes Voigt ; Thorsten Seyschab ; Wolfgang Lehner

【Abstract】: In modern software landscapes multiple applications usually share one database as their single point of truth. All these applications will evolve over time by their very nature. Often former versions need to stay available, so database developers find themselves maintaining co-existing schema version of multiple applications in multiple versions. This is highly error-prone and accounts for significant costs in software projects, as developers realize the translation of data accesses between schema versions with hand-written delta code. In this demo, we showcase INVERDA, a tool for integrated, robust, and easy to use database versioning. We rethink the way of specifying the evolution to new schema versions. Using the richer semantics of a descriptive database evolution language, we generate all required artifacts automatically and make database versioning foolproof.

【Keywords】: Backpropagation; Databases; Manuals; Semantics; Software; Syntactics; Writing

123. ANALOC: Efficient analytics over Location Based Services.

【Paper Link】【Pages】:1366-1369

【Authors】: Md Farhadur Rahman ; Saad Bin Suhaim ; Weimo Liu ; Saravanan Thirumuruganathan ; Nan Zhang ; Gautam Das

【Abstract】: Location Based Services (LBS), including standalone ones such as Google Maps and embedded ones such as ¿¿¿users near me¿¿¿ in the WeChat instant-messaging platform, provide great utility to millions of users. Not only that, they also form an important data source for geospatial and commercial information such as Point-Of-Interest (POI) locations, review ratings, user geo-distributions, etc. Unfortunately, it is not easy to tap into these LBS for tasks such as data analytics and mining, because the only access interface they offer is a limited k-Nearest-Neighbor (kNN) search interface - i.e., for a given input location, return the k nearest tuples in the database, where k is a small constant such as 50 or 100. This limited interface essentially precludes the crawling of an LBS' underlying database, as the small k mandates an extremely large number of queries that no real-world LBS would allow from an IP address or API account. We demonstrate ANALOC, a web based system that enables fast analytics over an LBS by issuing a small number of queries through its restricted kNN interface. ANALOC stands in sharp contrast with existing systems for analyzing geospatial data, as those systems mostly assume complete access to the underlying data. Specifically, ANALOC supports the approximate processing of a wide variety of SUM, COUNT and AVG aggregates over user-specified selection conditions. In the demonstration, we shall not only illustrate the design and accuracy of our underlying aggregate estimation techniques, but also showcase how these estimated aggregates can be used to enable exciting applications such as hotspot detection, infographics, etc. Our demonstration system is designed to query real-world LBS (systems or modules) such as Google Maps, WeChat and Sina Weibo at real time, in order to provide the audience with a practical understanding of the performance of ANALOC.

【Keywords】: Aggregates; Computer architecture; Databases; Estimation; Google; Microprocessors; Servers

Demo Session 2D 6

124. A new privacy-preserving solution for clustering massively distributed personal times-series.

【Paper Link】【Pages】:1370-1373

【Authors】: Tristan Allard ; Georges Hébrail ; Florent Masseglia ; Esther Pacitti

【Abstract】: New personal data fields are currently emerging due to the proliferation of on-body/at-home sensors connected to personal devices. However, strong privacy concerns prevent individuals to benefit from large-scale analytics that could be performed on this fine-grain highly sensitive wealth of data. We propose a demonstration of Chiaroscuro, a complete solution for clustering massively-distributed sensitive personal data while guaranteeing their privacy. The demonstration scenario highlights the affordability of the privacy vs. quality and privacy vs. performance tradeoffs by dissecting the inner working of Chiaroscuro - launched over energy consumption times-series -, by exposing the results obtained by the individuals participating in the clustering process, and by illustrating possible uses.

【Keywords】: Aggregates; Convergence; Data privacy; Encryption; Graphical user interfaces; Privacy; differential privacy; gossip; k-means; time-series

125. Mercury: Metro density prediction with recurrent neural network on streaming CDR data.

【Paper Link】【Pages】:1374-1377

【Authors】: Victor C. Liang ; Richard T. B. Ma ; Wee Siong Ng ; Li Wang ; Marianne Winslett ; Huayu Wu ; Shanshan Ying ; Zhenjie Zhang

【Abstract】: Telecommunication companies possess mobility information of their phone users, containing accurate locations and velocities of commuters travelling in public transportation system. Although the value of telecommunication data is well believed under the smart city vision, there is no existing solution to transform the data into actionable items for better transportation, mainly due to the lack of appropriate data utilization scheme and the limited processing capability on massive data. This paper presents the first ever system implementation of real-time public transportation crowd prediction based on telecommunication data, relying on the analytical power of advanced neural network models and the computation power of parallel streaming analytic engines. By analyzing the feeds of caller detail record (CDR) from mobile users in interested regions, our system is able to predict the number of metro passengers entering stations, the number of waiting passengers on the platforms and other important metrics on the crowd density. New techniques, including geographical-spatial data processing, weight-sharing recurrent neural network, and parallel streaming analytical programming, are employed in the system. These new techniques enable accurate and efficient prediction outputs, to meet the real-world business requirements from public transportation system.

【Keywords】: Companies; Data models; Mobile handsets; Predictive models; Public transportation; Recurrent neural networks

126. ORLF: A flexible framework for online record linkage and fusion.

【Paper Link】【Pages】:1378-1381

【Authors】: El Kindi Rezig ; Eduard C. Dragut ; Mourad Ouzzani ; Ahmed K. Elmagarmid ; Walid G. Aref

【Abstract】: With the exponential growth of data on the Web comes the opportunity to integrate multiple sources to give more accurate answers to user queries. Upon retrieving records from multiple Web databases, a key task is to merge records that refer to the same real-world entity. We demonstrate ORLF (Online Record Linkage and Fusion), a flexible query-time record linkage and fusion framework. ORLF deduplicates newly arriving query results jointly with previously processed query results. We use an iterative caching solution that leverages query locality to effectively deduplicate newly incoming records with cached records. ORLF aims to deliver timely query answers that are duplicate-free and reflect knowledge collected from previous queries.

【Keywords】: Cleaning; Couplings; Data integration; Data structures; Indexes; Monitoring

127. TemProRA: Top-k temporal-probabilistic results analysis.

【Paper Link】【Pages】:1382-1385

【Authors】: Katerina Papaioannou ; Michael H. Böhlen

【Abstract】: The study of time and probability, as two combined dimensions in database systems, has focused on the correct and efficient computation of the probabilities and time intervals. However, there is a lack of analytical information that allows users to understand and tune the probability of time-varying result tuples. In this demonstration, we present TemProRA, a system that focuses on the analysis of the top-k temporal probabilistic results of a query. We propose the Temporal Probabilistic Lineage Tree (TPLT), the Temporal Probabilistic Bubble Chart (TPBC) and the Temporal Probabilistic Column Chart (TPCC): for each output tuple these three tools are created to provide the user with the most important information to systematically modify the time-varying probability of result tuples. The effectiveness and usefulness of TemProRA are demonstrated through queries performed on a dataset created based on data from Migros, the leading Swiss supermarket branch.

【Keywords】: Algebra; Computer science; Database systems; Electronic mail; Probabilistic logic; Vegetation

128. Leveraging non-volatile memory for instant restarts of in-memory database systems.

【Paper Link】【Pages】:1386-1389

【Authors】: David Schwalb ; Martin Faust ; Markus Dreseler ; Pedro Flemming ; Hasso Plattner

【Abstract】: Emerging non-volatile memory technologies (NVM) offer fast and byte-addressable access, allowing to rethink the durability mechanisms of in-memory databases. Hyrise-NV is a database storage engine that maintains table and index structures on NVM. Our architecture updates the database state and index structures transactionally consistent on NVM using multi-version data structures, allowing to instantly recover data-bases independent of their size. In this paper, we demonstrate the instant restart capabilities of Hyrise-NV, storing all data on non-volatile memory. Recovering a dataset of size 92.2 GB takes about 53 seconds using our log-based approach, whereas Hyrise-NV recovers in under one second.

【Keywords】: Computer crashes; Data structures; Indexes; Nonvolatile memory; Throughput

129. Java2SDG: Stateful big data processing for the masses.

【Paper Link】【Pages】:1390-1393

【Authors】: Raul Castro Fernandez ; Panagiotis Garefalakis ; Peter R. Pietzuch

【Abstract】: Big data processing is no longer restricted to specially-trained engineers. Instead, domain experts, data scientists and data users all want to benefit from applying data mining and machine learning algorithms at scale. A considerable obstacle towards this ¿¿¿democratisation of big data¿¿¿ are programming models: current scalable big data processing platforms such as Spark, Naiad and Flink require users to learn custom functional or declarative programming models, which differ fundamentally from popular languages such as Java, Matlab, Python or C++. An open challenge is how to provide a big data programming model for users that are not familiar with functional programming, while maintaining performance, scalability and fault tolerance. We describe JAVA2SDG, a compiler that translates annotated Java programs to stateful dataflow graphs (SDGs) that can execute on a compute cluster in a data-parallel and fault-tolerant fashion. Compared to existing distributed dataflow models, a distinguishing feature of SDGs is that their computational tasks can access distributed mutable state, thus allowing SDGs to capture the semantics of stateful Java programs. As part of the demonstration, we provide examples of machine learning programs in Java, including collaborative filtering and logistic regression, and we explain how they are translated to SDGs and executed on a large set of machines.

【Keywords】: Big data; Collaboration; Computational modeling; Data models; Java; Mathematical model; Programming

Demo Session 2E 5

130. Flexible hybrid stores: Constraint-based rewriting to the rescue.

【Paper Link】【Pages】:1394-1397

【Authors】: Francesca Bugiotti ; Damian Bursztyn ; Alin Deutsch ; Ioana Manolescu ; Stamatis Zampetakis

【Abstract】: Data management goes through interesting times1, as the number of currently available data management systems (DMSs in short) is probably higher than ever before. This leads to unique opportunities for data-intensive applications, as some systems provide excellent performance on certain data processing operations. Yet, it also raises great challenges, as a system efficient on some tasks may perform poorly or not support other tasks, making it impossible to use a single DMS for a given application. It is thus desirable to use different DMSs side by side in order to take advantage of their best performance, as advocated under terms such as hybrid or poly-stores. We present ESTOCADA, a novel system capable of exploiting side-by-side a practically unbound variety of DMSs, all the while guaranteeing the soundness and completeness of the store, and striving to extract the best performance out of the various DMSs. Our system leverages recent advances in the area of query rewriting under constraints, which we use to capture the various data models and describe the fragments each DMS stores.

【Keywords】: Big data; Data integration; Data models; History; Optimization; Query processing; Sparks

131. WatCA: The Waterloo consistency analyzer.

【Paper Link】【Pages】:1398-1401

【Authors】: Hua Fan ; Shankha Chatterjee ; Wojciech M. Golab

【Abstract】: Today's online applications depend on fast storage and retrieval of up-to-date data at web scale. To meet this growing demand, the designers of distributed storage systems have devised a rich variety of data replication protocols, offering different trade-offs between consistency, latency, and availability. Understanding the sweet spot, and testing whether a system delivers a particular level of consistency, are challenging problems as consistency itself is difficult to reason about. This demo paper describes an interactive software tool for measuring and visualizing the consistency actually observed by client applications accessing a key-value storage system in real time. The tool can be used to evaluate performance trade-offs in a system with tunable consistency, or to verify the correctness of a storage system that guarantees certain forms of so-called ¿¿¿strong consistency¿¿¿.

【Keywords】: Atomic measurements; Graphical user interfaces; History; Registers; Synchronization; Throughput

132. DebEAQ - debugging empty-answer queries on large data graphs.

【Paper Link】【Pages】:1402-1405

【Authors】: Elena Vasilyeva ; Thomas Heinze ; Maik Thiele ; Wolfgang Lehner

【Abstract】: The large volume of freely available graph data sets impedes the users in analyzing them. For this purpose, they usually pose plenty of pattern matching queries and study their answers. Without deep knowledge about the data graph, users can create ¿¿¿failing¿¿¿ queries, which deliver empty answers. Analyzing the causes of these empty answers is a time-consuming and complicated task especially for graph queries. To help users in debugging these ¿¿¿failing¿¿¿ queries, there are two common approaches: one is focusing on discovering missing subgraphs of a data graph, the other one tries to rewrite the queries such that they deliver some results. In this demonstration, we will combine both approaches and give the users an opportunity to discover why empty results were delivered by the requested queries. Therefore, we propose DebEAQ, a debugging tool for pattern matching queries, which allows to compare both approaches and also provides functionality to debug queries manually.

【Keywords】: Data models; Databases; Debugging; Engines; Pattern matching; Topology; Urban areas

133. Cruncher: Distributed in-memory processing for location-based services.

【Paper Link】【Pages】:1406-1409

【Authors】: Ahmed S. Abdelhamid ; MingJie Tang ; Ahmed M. Aly ; Ahmed R. Mahmood ; Thamir Qadah ; Walid G. Aref ; Saleh M. Basalamah

【Abstract】: Advances in location-based services (LBS) demand high-throughput processing of both static and streaming data. Recently, many systems have been introduced to support distributed main-memory processing to maximize the query throughput. However, these systems are not optimized for spatial data processing. In this demonstration, we showcase Cruncher, a distributed main-memory spatial data warehouse and streaming system. Cruncher extends Spark with adaptive query processing techniques for spatial data. Cruncher uses dynamic batch processing to distribute the queries and the data streams over commodity hardware according to an adaptive partitioning scheme. The batching technique also groups and orders the overlapping spatial queries to enable inter-query optimization. Both the data streams and the offline data share the same partitioning strategy that allows for data co-locality optimization. Furthermore, Cruncher uses an adaptive caching strategy to maintain the frequently-used location data in main memory. Cruncher maintains operational statistics to optimize query processing, data partitioning, and caching at runtime. We demonstrate two LBS applications over Cruncher using real datasets from OpenStreetMap and two synthetic data streams. We demonstrate that Cruncher achieves order(s) of magnitude throughput improvement over Spark when processing spatial data.

【Keywords】: Distributed databases; Indexes; Optimization; Query processing; Sparks; Spatial databases; Throughput

134. A demonstration of GeoSpark: A cluster computing framework for processing big spatial data.

【Paper Link】【Pages】:1410-1413

【Authors】: Jia Yu ; Jinxuan Wu ; Mohamed Sarwat

【Abstract】: This paper demonstrates GEOSPARK a cluster computing framework for developing and processing large-scale spatial data analytics programs. GEOSPARK consists of three main layers: Apache Spark Layer, Spatial RDD Layer and Spatial Query Processing Layer. Apache Spark Layer provides basic Apache Spark functionalities as regular RDD operations. Spatial RDD Layer consists of three novel Spatial Resilient Distributed Datasets (SRDDs) which extend regular Apache Spark RDD to support geometrical and spatial objects with data partitioning and indexing. Spatial Query Processing Layer executes spatial queries (e.g., Spatial Join) on SRDDs. The dynamic status of SRDDs and spatial operations are visualized by GEOSPARK monitoring map interface. We demonstrate GEOSPARK using three spatial analytics applications (spatial aggregation, autocorrelation and co-location) to show how users can easily define their spatial analytics tasks and efficiently process such tasks on large-scale spatial data at interactive performance.

【Keywords】: Correlation; Data analysis; Heating; Monitoring; Sparks; Spatial databases; Vegetation

Tutorials 8

135. Indoor data management.

【Paper Link】【Pages】:1414-1417

【Authors】: Hua Lu ; Muhammad Aamir Cheema

【Abstract】: A large part of modern life is lived indoors such as in homes, offices, shopping malls, universities, libraries and airports. However, almost all of the existing location-based services (LBS) have been designed only for outdoor space. This is mainly because the global positioning system (GPS) and other positioning technologies cannot accurately identify the locations in indoor venues. Some recent initiatives have started to cross this technical barrier, promising huge future opportunities for research organizations, government agencies, technology giants, and enterprizing start-ups - to exploit the potential of indoor LBS. Consequently, indoor data management has gained significant research attention in the past few years and the research interest is expected to surge in the upcoming years. This will result in a broad range of indoor applications including emergency services, public services, in-store advertising, shopping, tracking, guided tours, and much more. In this tutorial, we first highlight the importance of indoor data management and the unique challenges that need to be addressed. Subsequently, we provide an overview of the existing research in indoor data management, covering modeling, cleansing, indexing, querying, and other relevant topics. Finally, we discuss the future research directions in this important and growing research area, discussing spatial-textual search, integrating outdoor and indoor spaces, uncertain indoor data, and indoor trajectory mining.

【Keywords】: Global Positioning System; Indexing; Indoor navigation; Radiofrequency identification; Trajectory; Tutorials

136. Scaling up truth discovery.

【Paper Link】【Pages】:1418-1419

【Authors】: Laure Berti-Equille

【Abstract】: The evolution of the Web from a technology platform to a social ecosystem has resulted in unprecedented data volumes being continuously generated, exchanged, and consumed. User-generated content on the Web is massive, highly dynamic, and characterized by a combination of factual data and opinion data. False information, rumors, and fake contents can be easily spread across multiple sources, making it hard to distinguish between what is true and what is not. Truth discovery (also known as fact-checking) has recently gained lot of interest from Data Science communities. This tutorial will attempt to cover recent work on truth-finding and how it can scale Big Data. We will provide a broad overview with new insights, highlighting the progress made on truth discovery from information extraction, data and knowledge fusion, as well as modeling of misinformation dynamics in social networks. We will review in details current models, algorithms, and techniques proposed by various research communities whose contributions converge towards the same goal of estimating the veracity of data in a dynamic world. Our aim is to bridge theory and practice and introduce recent work from diverse disciplines to database people to be better equipped for addressing the challenges of truth discovery in Big Data.

【Keywords】: Big data; Computational modeling; Data models; Heuristic algorithms; Information retrieval; Knowledge engineering; Tutorials

137. Scalable data management: NoSQL data stores in research and practice.

【Paper Link】【Pages】:1420-1423

【Authors】: Felix Gessert ; Norbert Ritter

【Abstract】: The unprecedented scale at which data is consumed and generated today has shown a large demand for scalable data management and given rise to non-relational, distributed ¿¿¿NoSQL¿¿¿ database systems. Two central problems triggered this process: 1) vast amounts of user-generated content in modern applications and the resulting requests loads and data volumes 2) the desire of the developer community to employ problem-specific data models for storage and querying. To address these needs, various data stores have been developed by both industry and research, arguing that the era of one-size-fits-all database systems is over. The heterogeneity and sheer amount of these systems - now commonly referred to as NoSQL data stores - make it increasingly difficult to select the most appropriate system for a given application. Therefore, these systems are frequently combined in polyglot persistence architectures to leverage each system in its respective sweet spot. This tutorial gives an in-depth survey of the most relevant NoSQL databases to provide comparative classification and highlight open challenges. To this end, we analyze the approach of each system to derive its scalability, availability, consistency, data modeling and querying characteristics. We present how each system's design is governed by a central set of trade-offs over irreconcilable system properties. We then cover recent research results in distributed data management to illustrate that some shortcomings of NoSQL systems could already be solved in practice, whereas other NoSQL data management problems pose interesting and unsolved research challenges.

【Keywords】: Data models; Database systems; Distributed databases; Generators; Scalability; Tutorials

138. The era of Big Spatial Data.

【Paper Link】【Pages】:1424-1427

【Authors】: Ahmed Eldawy ; Mohamed F. Mokbel

【Abstract】: In this tutorial, we present the recent work in the database community for handling Big Spatial Data. This topic became very hot due to the recent explosion in the amount of spatial data generated by smartphones, satellites and medical devices, among others. This tutorial goes beyond the use of existing systems as-is (e.g., Hadoop, Spark or Impala), and digs deep into the core components of big systems (e.g., indexing and query processing) to describe how they are designed to handle big spatial data. During this 90-minute tutorial, we review the state-of-the-art work in the area of Big Spatial Data while classifying the existing research efforts according to the implementation approach, underlying architecture, and system components. In addition, we provide case studies of full-fledged systems and applications that handle Big Spatial Data which allows the audience to better comprehend the whole tutorial.

【Keywords】: Big data; Buildings; Data visualization; Indexing; Spatial databases; Tutorials

139. Accelerating database workloads by software-hardware-system co-design.

【Paper Link】【Pages】:1428-1431

【Authors】: Rajesh R. Bordawekar ; Mohammad Sadoghi

【Abstract】: The key objective of this tutorial is to provide a broad, yet an in-depth survey of the emerging field of co-designing software, hardware, and systems components for accelerating enterprise data management workloads. The overall goal of this tutorial is two-fold. First, we provide a concise system-level characterization of different types of data management technologies, namely, the relational and NoSQL databases and data stream management systems from the perspective of analytical workloads. Using the characterization, we discuss opportunities for accelerating key data management workloads using software and hardware approaches. Second, we dive deeper into the hardware acceleration opportunities using Graphics Processing Units (GPUs) and Field-Programmable Gate Arrays (FPGAs) for the query execution pipeline. Furthermore, we explore other hardware acceleration mechanisms such as single-instruction multiple-data (SIMD) that enables short-vector data parallelism.

【Keywords】: Acceleration; Computer architecture; Databases; Field programmable gate arrays; Graphics processing units; Hardware; Programming

140. Data profiling.

【Paper Link】【Pages】:1432-1435

【Authors】: Ziawasch Abedjan ; Lukasz Golab ; Felix Naumann

【Abstract】: One of the crucial requirements before consuming datasets for any application is to understand the dataset at hand and its metadata. The process of metadata discovery is known as data profiling. Profiling activities range from ad-hoc approaches, such as eye-balling random subsets of the data or formulating aggregation queries, to systematic inference of structural information and statistics of a dataset using dedicated profiling tools. In this tutorial, we highlight the importance of data profiling as part of any data-related use-case, and discuss the area of data profiling by classifying data profiling tasks and reviewing the state-of-the-art data profiling systems and techniques. In particular, we discuss hard problems in data profiling, such as algorithms for dependency discovery and profiling algorithms for dynamic data and streams. We conclude with directions for future research in the area of data profiling. This tutorial is based on our survey on profiling relational data [1].

【Keywords】: Big data; Data integration; Data mining; Data visualization; Databases; Metadata; Tutorials

141. Blocking for large-scale Entity Resolution: Challenges, algorithms, and practical examples.

【Paper Link】【Pages】:1436-1439

【Authors】: George Papadakis ; Themis Palpanas

【Abstract】: Entity Resolution constitutes one of the cornerstone tasks for the integration of overlapping information sources. Due to its quadratic complexity, a large amount of research has focused on improving its efficiency so that it scales to Web Data collections, which are inherently voluminous and highly heterogeneous. The most common approach for this purpose is blocking, which clusters similar entities into blocks so that the pair-wise comparisons are restricted to the entities contained within each block. In this tutorial, we take a close look on blocking-based Entity Resolution, starting from the early blocking methods that were crafted for database integration. We highlight the challenges posed by contemporary heterogeneous, noisy, voluminous Web Data and explain why they render inapplicable these schema-based techniques. We continue with the presentation of blocking methods that have been developed for large-scale and heterogeneous information and are suitable for Web Data collections. We also explain how their efficiency can be further improved by meta-blocking and parallelization techniques. We conclude with a hands-on session that demonstrates the relative performance of several, state-of-the-art techniques. The participants of the tutorial will put in practice all the topics discussed in the theory part, and will get familiar with a reference toolbox, which includes the most prominent techniques in the area and can be readily used to tackle Entity Resolution problems.

【Keywords】: Couplings; Data collection; Data integration; Databases; Taxonomy; Tutorials; Web search

142. Microblogs data management and analysis.

【Paper Link】【Pages】:1440-1443

【Authors】: Amr Magdy ; Mohamed F. Mokbel

【Abstract】: Microblogs data, e.g., tweets, reviews, news comments, and social media comments, has gained considerable attention in recent years due to its popularity and rich contents. Nowadays, microblogs applications span a wide spectrum of interests, including detecting and analyzing events, user analysis for geo-targeted ads and political elections, and critical applications like discovering health issues and rescue services. Consequently, major research efforts are spent to analyze and manage microblogs data to support different applications. In this tutorial, we give a 1.5 hours overview about microblogs data analysis, management, and systems. The tutorial gives a comprehensive review for research efforts that are trying to analyze microblogs contents to build on them new functionality and use cases. In addition, the tutorial reviews existing research that propose core data management components to support microblogs queries at scale. Finally, the tutorial reviews system-level issues and on-going work on supporting microblogs data through the rising big data systems. Through its different parts, the tutorial highlights the challenges and opportunities in microblogs data research.

【Keywords】: Data analysis; Indexing; Media; Real-time systems; Tutorials; Twitter

Panels 2

143. Dark Data: Are we solving the right problems?

【Paper Link】【Pages】:1444-1445

【Authors】: Michael J. Cafarella ; Ihab F. Ilyas ; Marcel Kornacker ; Tim Kraska ; Christopher Re

【Abstract】: With the increasing urge of the enterprises to ingest as much data as they can in what's commonly referred to as ¿¿¿Data Lakes¿¿¿, the new environment presents serious challenges to traditional ETL models and to building analytic layers on top of well-understood global schema. With the recent development of multiple technologies to support this ¿¿¿load-first¿¿¿ paradigm, even traditional enterprises have fairly large HDFS-based data lakes now. They have even had them long enough that their first generation IT projects delivered on some, but not all, of the promise of integrating their enterprise's data assets. In short, we moved from no data to Dark data. Dark data is what enterprises might have in their possession, without the ability to access it or with limited awareness of what this data represents. In particular, business-critical information might still remain out of reach. This panel is about Dark Data and whether we have been focusing on the right data management challenges in dealing with it.

【Keywords】: Big data; Cleaning; Computer science; Data mining; Databases; Information retrieval; Lakes

144. Big data quality - whose problem is it?

【Paper Link】【Pages】:1446-1447

【Authors】: Shazia Sadiq ; Paolo Papotti

【Abstract】: The increased reliance on data driven enterprise has seen an unprecedented investment in big data initiatives. Organizations averaged US$8M in investments in big data-related initiatives and programs in 2014, with 70% of large enterprises and 56% of small and medium enterprises (SMEs) having already deployed, or planning to deploy, big-data projects [1]. As companies intensify their efforts to get value from big data, the growth in the amount of data being managed continues at an exponential rate, leaving organizations with a massive footprint of unexplored, unfamiliar datasets. On February 8th, 2015, a group of global thought leaders from the database research community outlined the grand challenges in getting value from big data [2]. The key message was the need to develop the capacity to ¿¿¿understand how the quality of data affects the quality of the insight we derive from it¿¿¿.

【Keywords】: Big data; Cleaning; Computer science; Databases; Investment

TKDE Posters 67

145. Similarity Group-By operators for multi-dimensional relational data.

【Paper Link】【Pages】:1448-1449

【Authors】: MingJie Tang ; Ruby Y. Tahboub ; Walid G. Aref ; Mikhail J. Atallah ; Qutaibah M. Malluhi ; Mourad Ouzzani ; Yasin N. Silva

【Abstract】: The SQL group-by operator plays an important role in summarizing and aggregating large datasets in a data analytics stack. The Similarity SQL-based Group-By operator (SGB, for short) extends the semantics of the standard SQL Group-by by grouping data with similar but not necessarily equal values. While existing similarity-based grouping operators efficiently realize these approximate semantics, they primarily focus on one-dimensional attributes and treat multi-dimensional attributes independently. However, correlated attributes, such as in spatial data, are processed independently, and hence, groups in the multi-dimensional space are not detected properly. To address this problem, we introduce two new SGB operators for multi-dimensional data. The first operator is the clique (or distance-to-all) SGB, where all the tuples in a group are within some distance from each other. The second operator is the distance-to-any SGB, where a tuple belongs to a group if the tuple is within some distance from any other tuple in the group. Since a tuple may satisfy the membership criterion of multiple groups, we introduce three different semantics to deal with such a case: (i) eliminate the tuple, (ii) put the tuple in any one group, and (iii) create a new group for this tuple. We implement and test the new SGB operators and their algorithms inside PostgreSQL. The overhead introduced by these operators proves to be minimal and the execution times are comparable to those of the standard Group-by. The experimental study, based on TPC-H and a social check-in data, demonstrates that the proposed algorithms can achieve up to three orders of magnitude enhancement in performance over baseline methods developed to solve the same problem.

【Keywords】: Data analysis; Indexes; Measurement; Runtime; Semantics; Standards; Syntactics

146. Using Memetic Algorithm for instance coreference resolution.

【Paper Link】【Pages】:1450-1451

【Authors】: Xingsi Xue ; Yuping Wang

【Abstract】: The Linked Open Data (LOD) community effort is a cornerstone in the realization of the Semantic Web vision [1]. However, since an instance in the LOD is likely to be denoted with many identifiers (e.g., URIs) by different parties, instance coreference resolution, which identifies different identifiers for the same instance and eliminates the inconsistency between the datasets, has become critical to the development of LOD.

【Keywords】: Atmospheric measurements; Benchmark testing; Convergence; Ontologies; Particle measurements; Sociology; Statistics

147. Crowdsourcing for top-K query processing over uncertain data.

【Paper Link】【Pages】:1452-1453

【Authors】: Eleonora Ciceri ; Piero Fraternali ; Davide Martinenghi ; Marco Tagliasacchi

【Abstract】: Both social media and sensing infrastructures are producing an unprecedented mass of data characterized by their uncertain nature, due to either the noise inherent in sensors or the imprecision of human contributions. Therefore query processing over uncertain data has become an active research field. In the well-known class of applications commonly referred to as ¿¿¿top-K queries¿¿¿, the objective is to find the best K objects matching the user's information need, formulated as a scoring function over the objects' attribute values. If both the data and the scoring function are deterministic, the best K objects can be univocally determined and totally ordered so as to produce a single ranked result set (as long as ties are broken by some deterministic rule). However, in application scenarios involving uncertain data and fuzzy information needs, this does not hold: when either the attribute values or the scoring function are nondeterministic, there may be no consensus on a single ordering, but rather a space of possible orderings. To determine the correct ordering, one needs to acquire additional information so as to reduce the amount of uncertainty associated with the queried data and consequently the number of orderings in such a space. An emerging trend in data processing is crowdsourcing, defined as the systematic engagement of humans in the resolution of tasks through online distributed work. Our approach combines human and automatic computation in order to solve complex problems: when data ambiguity can be resolved by human judgment, crowdsourcing becomes a viable tool for converging towards a unique or at least less uncertain query result. The goal of this paper is to define and compare task selection policies for uncertainty reduction via crowdsourcing, with emphasis on the case of top-K queries.

【Keywords】: Crowdsourcing; Entropy; Measurement uncertainty; Noise measurement; Probability density function; Query processing; Uncertainty

148. Adaptive noise immune cluster ensemble using affinity propagation.

【Paper Link】【Pages】:1454-1455

【Authors】: Zhiwen Yu ; Guoqiang Han ; Le Li ; Jiming Liu ; Jun Zhang

【Abstract】: Cluster ensemble, as one of the important research directions in the ensemble learning area, is gaining more and more attention, due to its powerful capability to integrate multiple clustering solutions and provide a more accurate, stable and robust result [1]. Cluster ensemble has a lot of useful applications in a large number of areas. Although most of traditional cluster ensemble approaches obtain good results, few of them consider how to achieve good performance for noisy datasets. Some noisy datasets have a number of noisy attributes which may degrade the performance of conventional cluster ensemble approaches. Some noisy datasets which contain noisy samples will affect the final results. Other noisy datasets may be sensitive to distance functions.

【Keywords】: Generators; Lungs

149. Efficient similarity join based on Earth Mover's Distance using MapReduce.

【Paper Link】【Pages】:1456-1457

【Authors】: Jia Xu ; Bin Lei ; Yu Gu ; Marianne Winslett ; Ge Yu ; Zhenjie Zhang

【Abstract】: Earth Mover's Distance (EMD) evaluates the similarity between probability distributions, known as a robust measure more consistent with human similarity perception than traditional similarity functions. EMD similarity join retrieves pairs of probability distributions with EMD below a specified threshold, supporting many important applications, such as duplicate image retrieval and sensor pattern recognition. This paper studies the possibility of using MapReduce to improve the scalability of EMD similarity join. Utilizing the dual-program mapping technique, we present a new general data partition framework to facilitate effective workload decomposition using MapReduce, ensuring similar distributions in terms of EMD are mapped to the same reduce task for further verification. New optimization strategies are also proposed to balance the workloads among reduce tasks and eliminate large unnecessary EMD evaluations. Our experiments verify the superiority of our proposal on system efficiency, with a huge advantage of at least one order of magnitude than the state-of-the-art solution, and on system effectiveness, with a real case study towards the abused image phenomenon on C2C website in China. Further details are reported in [4].

【Keywords】: Acceleration; Earth; Footwear; Histograms; Image retrieval; Optimization; Probability distribution

【Paper Link】【Pages】:1458-1459

【Authors】: Shan Jiang ; Hsinchun Chen

【Abstract】: Social media networks are dynamic. As such, the order in which network ties develop is an important aspect of the network dynamics.This study proposes a novel dynamic network model, the Nodal Attribute-based Temporal Exponential Random Graph Model (NATERGM) for dynamic network analysis. The proposed model focuses on how the nodal attributes of a network affect the order in which the network ties develop. Empirical results showed that the NATERGM demonstrated an enhanced pattern testing capability compared to benchmark models. The proposed NATERGM model helps explain the roles of nodal attributes in the formation process of dynamic networks.

【Keywords】: Analytical models; Benchmark testing; Knowledge engineering; Markov processes; Media; Monte Carlo methods; dynamic networks; knowledge sharing; social media

151. Tensor canonical correlation analysis for multi-view dimension reduction.

【Paper Link】【Pages】:1460-1461

【Authors】: Yong Luo ; Dacheng Tao ; Kotagiri Ramamohanarao ; Chao Xu ; Yonggang Wen

【Abstract】: Canonical correlation analysis (CCA) has proven an effective tool for two-view dimension reduction due to its profound theoretical foundation and success in practical applications. In respect of multi-view learning, however, it is limited by its capability of only handling data represented by two-view features, while in many real-world applications, the number of views is frequently many more. Although the ad hoc way of simultaneously exploring all possible pairs of features can numerically deal with multi-view data, it ignores the high order statistics (correlation information) which can only be discovered by simultaneously exploring all features. Therefore, in this work, we develop tensor CCA (TCCA) which straightforwardly yet naturally generalizes CCA to handle the data of an arbitrary number of views by analyzing the covariance tensor of the different views. TCCA aims to directly maximize the canonical correlation of multiple (more than two) views. Crucially, we prove that the main problem of multiview canonical correlation maximization is equivalent to finding the best rank-1 approximation of the data covariance tensor, which can be solved efficiently using the well-known alternating least squares (ALS) algorithm. As a consequence, the high order correlation information contained in the different views is explored and thus a more reliable common subspace shared by all features can be obtained.

【Keywords】: Approximation algorithms; Australia; Computers; Correlation; Electronic mail; Feature extraction; Tensile stress

152. TRIP: An interactive retrieving-inferring data imputation approach.

【Paper Link】【Pages】:1462-1463

【Authors】: Zhixu Li ; Lu Qin ; Hong Cheng ; Xiangliang Zhang ; Xiaofang Zhou

【Abstract】: Data imputation aims at filling in missing attribute values in databases. Existing imputation approaches to nonquantitive string data can be roughly put into two categories: (1) inferring-based approaches [2], and (2) retrieving-based approaches [1]. Specifically, the inferring-based approaches find substitutes or estimations for the missing ones from the complete part of the data set. However, they typically fall short in filling in unique missing attribute values which do not exist in the complete part of the data set [1]. The retrieving-based approaches resort to external resources for help by formulating proper web search queries to retrieve web pages containing the missing values from the Web, and then extracting the missing values from the retrieved web pages [1]. This webbased retrieving approach reaches a high imputation precision and recall, but on the other hand, issues a large number of web search queries, which brings a large overhead [1].

【Keywords】: Australia; Computer science; Electronic mail; Optimal scheduling; System recovery; Web pages; Web search

153. t-closeness through microaggregation: Strict privacy with enhanced utility preservation.

【Paper Link】【Pages】:1464-1465

【Authors】: Jordi Soria-Comas ; Josep Domingo-Ferrer ; David Sánchez ; Sergio Martínez

【Abstract】: This paper proposes and shows how to use microaggregation to attain t-closeness on top of k-anonymity to protect data releases. The advantages in terms of data utility preservation of microaggregation over classic approaches based on generalizing values are analyzed. Then several microaggregation algorithms for k-anonymous t-closeness are presented and empirically evaluated.

【Keywords】: Clustering algorithms; Computational modeling; Data privacy; Measurement; Merging; Partitioning algorithms; Privacy

154. Domain-sensitive Recommendation with user-item subgroup analysis.

【Paper Link】【Pages】:1466-1467

【Authors】: Jing Liu ; Yu Jiang ; Zechao Li ; Xi Zhang ; Hanqing Lu

【Abstract】: In this paper, we propose a Domain-sensitive Recommendation (DsRec) algorithm, to make the rating prediction by exploring the user-item subgroup analysis simultaneously, in which a user-item subgroup is deemed as a domain consisting of a subset of items with similar attributes and a subset of users who have interests in these items. The proposed framework of DsRec includes three components: a matrix factorization model for the observed rating reconstruction, a bi-clustering model for the user-item subgroup analysis, and two regularization terms to connect the above two components into a unified formulation. Extensive experiments on three real-world datasets show that our method achieves the better performance over some state-of-the-art methods.

【Keywords】: Algorithm design and analysis; Analytical models; Collaboration; Linear programming; Prediction algorithms; Predictive models; Recommender systems

155. Semantic-aware blocking for entity resolution.

【Paper Link】【Pages】:1468-1469

【Authors】: Qing Wang ; Mingyuan Cui ; Huizhi Liang

【Abstract】: In this work we propose a semantic-aware blocking framework for entity resolution (ER). The proposed framework is built using locality-sensitive hashing (LSH) techniques to efficiently unify both textual and semantic features into an ER blocking process. In order to understand how similarity metrics may affect the effectiveness of ER blocking, we study the robustness of similarity metrics and their properties in terms of LSH families. We further discuss how the semantic similarity of records can be captured, measured, and integrated with LSH techniques over multiple similarity spaces. We have evaluated our proposed framework over two real-world data sets, and compared it with the state-of-the-art blocking techniques. The experimental study shows that using a combination of semantic features and textual features can considerably improve the quality of blocking. Due to the probabilistic nature of LSH, this semantic-aware blocking framework also enables us to build fast and reliable blocking for performing entity resolution tasks in a large-scale data environment.

【Keywords】: Erbium; FAA; Semantics

156. Privacy-preserving indoor localization on smartphones.

【Paper Link】【Pages】:1470-1471

【Authors】: Andreas Konstantinidis ; Georgios Chatzimilioudis ; Demetrios Zeinalipour-Yazti ; Paschalis Mpeis ; Nikos Pelekis ; Yannis Theodoridis

【Abstract】: Predominant smartphone OS localization subsystems currently rely on server-side localization processes, allowing the service provider to know the location of a user at all times. In this paper, we propose an innovative algorithm for protecting users from location tracking by the localization service, without hindering the provisioning of fine-grained location updates on a continuous basis. Our proposed Temporal Vector Map (TVM) algorithm, allows a user to accurately localize by exploiting a k-Anonymity Bloom (kAB) filter and a bestNeighbors generator of camouflaged localization requests, both of which are shown to be resilient to a variety of privacy attacks. We have evaluated our framework using a real prototype developed in Android and Hadoop HBase as well as realistic Wi-Fi traces scaling-up to several GBs. Our study reveals that TVM can offer fine-grained localization in approximately four orders of magnitude less energy and number of messages than competitive approaches.

【Keywords】: Buildings; Databases; Global Positioning System; IEEE 802.11 Standard; Privacy; Smart phones; Urban areas

157. CRoM and HuspExt: Improving efficiency of high utility sequential pattern extraction.

【Paper Link】【Pages】:1472-1473

【Authors】: Oznur Kirmemis Alkan ; Pinar Karagoz

【Abstract】: This paper presents efficient data structures and a pruning technique in order to improve the efficiency of high utility sequential pattern mining. CRoM (Cumulated Rest of Match) based upper bound, which is a tight upper bound on the utility of the candidates is proposed in order to perform more conservative pruning before candidate pattern generation in comparison to the existing techniques. In addition, an efficient algorithm, HuspExt (High Utility Sequential Pattern Extraction), is presented which calculates the utilities of the child patterns based on that of the parents'. Substantial experiments on both synthetic and real datasets from different domains show that, the solution efficiently discovers high utility sequential patterns under low thresholds.

【Keywords】: Computers; Data mining; Data structures; Itemsets; Memory management; Upper bound

158. Joint structure feature exploration and regularization for multi-task graph classification.

【Paper Link】【Pages】:1474-1475

【Authors】: Shirui Pan ; Jia Wu ; Xingquan Zhu ; Chengqi Zhang ; Philip S. Yu

【Abstract】: We formulate a new multi-task graph classification (MTG) problem, where multiple graph classification tasks are jointly regularized to find discriminative subgraphs shared by all tasks for learning. More details can be found in [1].

【Keywords】: Cancer; Chemical compounds; Linear programming; Loss measurement; Malignant tumors; Prediction algorithms; Training

159. Efficient answering of why-not questions in similar graph matching.

【Paper Link】【Pages】:1476-1477

【Authors】: Md. Saiful Islam ; Chengfei Liu ; Jianxin Li

【Abstract】: Graph data management and matching similar graphs are very important for many applications including bioinformatics, computer vision, VLSI design, bug localization, road networks, social and communication networking. Many graph indexing and similarity matching techniques have already been proposed for managing and querying graph data. In similar graph matching, a user is returned with the database graphs whose distances with the query graph are below a threshold. In such query settings, a user may not receive certain database graphs that are very similar to the query graph if the initial query graph is inappropriate/imperfect for the expected answer set. To exemplify this, consider a drug designer who is looking for chemical compounds that could be the target of her hypothetical drug before realizing it. In response to her query, the traditional search system may return the structures from the database that are most similar to the query graph. However, she may get surprised if some of the expected targets are missing in the answer set. She may then seek assistance from the system by asking ¿¿¿Is there other query graph that can match my expected answer set?¿¿¿. The system may then modify her initial query graph to include the missing answers in the new answer set. Here, we study this kind of problem of answering why-not questions in similar graph matching for graph databases.

【Keywords】: Bioinformatics; Biological cells; Databases; Drugs; Electronic mail; Nitrogen; Three-dimensional displays

160. CloudKeyBank: Privacy and owner authorization enforced key management framework.

【Paper Link】【Pages】:1478-1479

【Authors】: XiuXia Tian ; Ling Huang ; Tony Wu ; Xiaoling Wang ; Aoying Zhou

【Abstract】: Outsourcing keys (including passwords and data encryption keys) to professional password managers (honest-butcurious service providers) is attracting more and more attention from the researchers and users in the era of cloud computing. However, existing solutions in traditional data outsourcing scenario are unable to simultaneously meet the following three security requirements for keys outsourcing: 1) Confidentiality and privacy of keys; 2) Search privacy on identity attributes tied to keys; 3) Owner controllable authorization over his/her shared keys. In this paper, we propose CloudKeyBank, the first unified key management framework that addresses all the three goals above. To implement CloudKeyBank efficiently, we propose a new cryptographic primitive named Searchable Conditional Proxy Re-Encryption (SC-PRE) which combines the techniques of Hidden Vector Encryption (HVE) and Proxy Re-Encryption (PRE) seamlessly.

【Keywords】: Authorization; Data privacy; Databases; Encryption; Privacy

161. Association discovery in two-view data.

【Paper Link】【Pages】:1480-1481

【Authors】: Matthijs van Leeuwen ; Esther Galbrun

【Abstract】: Two-view datasets are datasets whose attributes are naturally split into two sets, each providing a different view on the same set of objects. We introduce1 the exploratory data mining task of finding small and non-redundant sets of associations that describe how the two views are related. To achieve this, we propose a novel approach in which sets of rules are used to translate one view to the other and vice versa. Our models, dubbed translation tables, contain both unidirectional and bidirectional rules that span both views and provide lossless translation from either of the views to the opposite view. To be able to evaluate different translation tables and perform model selection, we present a score based on the Minimum Description Length (MDL) principle. Next, we introduce three TRANSLATOR algorithms to find good models according to this score. The first algorithm is parameter-free and iteratively adds the rule that improves compression most. The other two algorithms use heuristics to achieve better trade-offs between runtime and compression. The empirical evaluation on real-world data demonstrates that only modest numbers of associations are needed to characterize the two-view structure present in the data, while the obtained translation rules are easily interpretable and provide relevant insight into the data.

【Keywords】: Algorithm design and analysis; Complexity theory; Data mining; Data models; Electronic mail; Explosions; Itemsets; Association discovery; Association rule mining; Minimum description length; Redescription mining; Two-view data

162. i2MapReduce: Incremental mapreduce for mining evolving big data.

【Paper Link】【Pages】:1482-1483

【Authors】: Yanfeng Zhang ; Shimin Chen ; Qiang Wang ; Ge Yu

【Abstract】: As new data and updates are constantly arriving, the results of data mining applications become stale and obsolete over time. Incremental processing is a promising approach to refresh mining results. It utilizes previously saved states to avoid the expense of re-computation from scratch. In this paper, we propose i2MapReduce, a novel incremental processing extension to MapReduce. Compared with the state-of-the-art work on Incoop, i2MapReduce (i) performs key-value pair level incremental processing rather than task level re-computation, (ii) supports not only one-step computation but also more sophisticated iterative computation, and (iii) incorporates a set of novel techniques to reduce I/O overhead for accessing preserved fine-grain computation states. Experimental results on Amazon EC2 show significant performance improvements of i2MapReduce compared to both plain and iterative MapReduce performing re-computation.

【Keywords】: Big data; Computational modeling; Context; Data mining; Runtime; Web pages; Web search

163. Incremental semi-supervised clustering ensemble for high dimensional data clustering.

【Paper Link】【Pages】:1484-1485

【Authors】: Zhiwen Yu ; Peinan Luo ; Si Wu ; Guoqiang Han ; Jane You ; Hareton Leung ; Hau-San Wong ; Jun Zhang

【Abstract】: Recently, cluster ensemble approaches have gained more and more attention [1]¿¿¿[2], due to useful applications in the areas of pattern recognition, data mining, bioinformatics, and so on. When compared with traditional single clustering algorithms, cluster ensemble approaches are able to integrate multiple clustering solutions obtained from different data sources into a unified solution, and provide a more robust, stable and accurate final result.

【Keywords】: Cancer; Clustering algorithms; Computer science; Gene expression; Linear programming; Robots; Search problems

164. Virtual denormalization via array index reference for main memory OLAP.

【Paper Link】【Pages】:1486-1487

【Authors】: Yansong Zhang ; Xuan Zhou ; Ying Zhang ; Yu Zhang ; Mingchuan Su ; Shan Wang

【Abstract】: Denormalization is a common tactic for enhancing performance of data warehouses. However, it is rarely used in main memory databases, which regards storage space as scarce resource. In this paper, we demonstrate that MMDB can actually benefit from the strategy of denormalization. We have created A-Store, a prototypical main-memory database system customized for star and snowflake schemas, which applies the strategy of denormalization to achieve highly efficient OLAP. Instead of resorting to fully materialized denormalization, A-Store applies a method called virtual denomalization, which allows query processing to be performed in a denormalized way, while without incurring additional space consumption.

【Keywords】: Asia; Data warehouses; Encoding; Indexes; Optimization; Query processing

165. Reverse keyword search for spatio-textual top-k queries in location-based services.

【Paper Link】【Pages】:1488-1489

【Authors】: Xin Lin ; Jianliang Xu ; Haibo Hu

【Abstract】: This paper proposes a novel query paradigm, namely reverse keyword search for spatio-textual top-k queries (RST Q). It returns the keywords under which a target object will be a spatio-textual top-k result. To efficiently process the new query, we devise a novel hybrid index KcR-tree to store and summarize the spatial and textual information of objects. To further improve the performance, we propose three query optimization techniques, i.e., KcR*-tree, lazy upper-bound updating, and keyword set filtering. We also extend RST Q to allow the input location to be a spatial region instead of a point. Experimental results demonstrate the efficiency of our proposed query techniques in terms of both the computational cost and I/O cost.

【Keywords】: Indexes

166. Distributed in-memory processing of All K Nearest Neighbor queries.

【Paper Link】【Pages】:1490-1491

【Authors】: Georgios Chatzimilioudis ; Constantinos Costa ; Demetrios Zeinalipour-Yazti ; Wang-Chien Lee ; Evaggelia Pitoura

【Abstract】: A wide spectrum of Internet-scale mobile applications, ranging from social networking, gaming and entertainment to emergency response and crisis management, all require efficient and scalable All k Nearest Neighbor (AkNN) computations over millions of moving objects every few seconds to be operational. In this paper we present Spitfire, a distributed algorithm that provides a scalable and high-performance AkNN processing framework to our award-winning geo-social network named Rayzit. The proposed algorithm deploys a fast load-balanced partitioning along with an efficient replication-set selection, to provide fast main-memory computations of the exact AkNN results in a batch-oriented manner. We evaluate, both analytically and experimentally, how the pruning efficiency of the Spitfire algorithm plays a pivotal role in reducing communication and response time up to an order of magnitude, compared to three state-of-the-art distributed AkNN algorithms executed in distributed main-memory.

【Keywords】: Random access memory

167. Accelerated Continuous Conditional Random Fields for load forecasting.

【Paper Link】【Pages】:1492-1493

【Authors】: Hongyu Guo

【Abstract】: Enabling commercial buildings to use energy in a flexible way, such as reducing usage during consumption peak hours, is of crucial importance, not only for preventing the disruptions of the utility grid, but also for containing buildings' rapidly growing energy cost. Such smart energy consumption, however, heavily relies on accurate short-term energy load forecasting, such as hourly predictions for the next n (n ¿¿¿ 2) hours. To attain sufficient accuracy, we treat such multi-steps ahead regression task as a sequence labeling (regression) problem, and adopt the Continuous Conditional Random Fields (CCRF) [1] to explicitly model these interconnected outputs.

【Keywords】: Acceleration; Correlation; Load forecasting; Load modeling; Mathematical model; Sparse matrices; Training

168. Querying knowledge Graphs By Example entity tuples.

【Paper Link】【Pages】:1494-1495

【Authors】: Nandish Jayaram ; Arijit Khan ; Chengkai Li ; Xifeng Yan ; Ramez Elmasri

【Abstract】: We witness an unprecedented proliferation of knowledge graphs that record millions of entities and their relationships. While knowledge graphs are structure-flexible and content-rich, they are difficult to use. The challenge lies in the gap between their overwhelming complexity and the limited database knowledge of non-professional users. As an initial step toward improving the usability of knowledge graphs, we propose to query such data by example entity tuples, without requiring users to form complex graph queries. Our system, GQBE (Graph Query By Example), automatically discovers a weighted hidden maximum query graph based on input query tuples, to capture a user's query intent. It then efficiently finds top-ranked approximate answer graphs and answer tuples.

【Keywords】: Business; Correlation; Lattices; Query processing; Usability

169. Efficient top-k retrieval on massive data.

【Paper Link】【Pages】:1496-1497

【Authors】: Xixian Han ; Jianzhong Li ; Hong Gao

【Abstract】: Top-k query is an important operation to return a set of interesting points from a potentially huge data space. In top-k query, a ranking function is provided to determine the score of each tuple and k tuples with the largest scores are returned.

【Keywords】: Buffer storage; Computer science; Concrete; Conferences; Data structures; Electronic mail; Indexes

【Paper Link】【Pages】:1498-1499

【Authors】: Martin Jergler ; Hans-Arno Jacobsen ; Mohammad Sadoghi ; Richard Hull ; Roman Vaculín

【Abstract】: We present a unique representation of data-centric workflows, designed to exploit the loosely coupled nature of publish/subscribe systems to enable their safe distribution and parallel execution. We argue for the practicality of our approach by mapping a standard and industry-strength data-centric workflow model, namely, IBM Business Artifacts with Guard-Stage-Milestone (GSM), into the publish/subscribe abstraction.

【Keywords】: Business; Data models; GSM; Jacobian matrices; Process control; Semantics; Throughput

171. Top-k dominating queries on incomplete data.

【Paper Link】【Pages】:1500-1501

【Authors】: Xiaoye Miao ; Yunjun Gao ; Baihua Zheng ; Gang Chen ; Huiyong Cui

【Abstract】: The top-k dominating (TKD) query returns the k objects that dominate the maximum number of the objects in a given dataset. Incomplete data exists in a wide spectrum of real datasets, due to device failure, privacy preservation, data loss, etc. In this paper, for the first time, we carry out a systematic study of TKD queries on incomplete data, which involves the data having some missing dimensional value(s). We formalize this problem, and propose a suite of efficient algorithms for supporting it. Our methods utilize some novel techniques, such as upper bound score pruning and bitmap binning strategy, to boost query efficiency. Extensive experiments with both real and synthetic data sets demonstrate the efficiency of our presented algorithms.

【Keywords】: Data privacy; Indexes; Motion pictures; Query processing; Recommender systems; Upper bound

172. Subspace based network community detection using sparse linear coding.

【Paper Link】【Pages】:1502-1503

【Authors】: Arif Mahmood ; Michael Small

【Abstract】: Information mining from networks by identifying communities is an important problem across a number of research fields including social science, biology, physics, and medicine. Most existing community detection algorithms are graph theoretic and lack the ability to detect accurate community boundaries if the ratio of intra-community to inter-community links is low. Also, algorithms based on modularity maximization may fail to resolve communities smaller than a specific size if the community size varies significantly. We propose a fundamentally different community detection algorithm based on the fact that each network community spans a different subspace in the geodesic space. Therefore, each node can only be efficiently represented as a linear combination of nodes spanning the same subspace (Fig. 1). To make the process of community detection more robust, we use sparse linear coding with ¿¿1 norm constraint. In order to find a community label for each node, sparse spectral clustering algorithm is used. The proposed community detection technique is compared with more than ten state of the art methods on two benchmark networks (with known clusters) using normalized mutual information criterion. Our proposed algorithm outperformed existing methods with a significant margin on both benchmark networks.

【Keywords】: Benchmark testing; Clustering algorithms; Detection algorithms; Encoding; Mutual information; Sparse matrices; Symmetric matrices

173. DSP-CC: I/O efficient parallel computation of connected components in billion-scale networks.

【Paper Link】【Pages】:1504-1505

【Authors】: Min-Soo Kim ; Sangyeon Lee ; Wook-Shin Han ; Himchan Park ; Jeong-Hoon Lee

【Abstract】: Computing connected components (CC) is a core operation on graph data. Since billion-scale graphs cannot be resident in memory of a single machine, there have been proposed a number of distributed graph processing methods. The representative ones for CC are Hash-To-Min and PowerGraph. Hash-To-Min focuses on minimizing the number of MapReduce rounds, but is still slower than in-memory methods, PowerGraph is a fast and general in-memory graph method, but requires a lot of machines for handling billion-scale graphs. We propose an ultra-fast parallel method DSP-CC, using only a single PC that exploits secondary storage like a PCI-E SSD for handling billion-scale graphs. It can compute connected components I/O efficiently using only a limited size of memory. Our experimental results show that DSP-CC significantly outperforms the representative methods including Hash-To-Min and PowerGraph.

【Keywords】:

174. Mining temporal patterns in interval-based data.

【Paper Link】【Pages】:1506-1507

【Authors】: Yi-Cheng Chen ; Wen-Chih Peng ; Suh-Yin Lee

【Abstract】: Sequential pattern mining is an important subfield in data mining. Recently, discovering patterns from interval events has attracted considerable efforts due to its widespread applications. However, due to the complex relation between two intervals, mining interval-based sequences efficiently is a challenging issue. In this paper, we develop a novel algorithm, P-TPMiner, to efficiently discover two types of interval-based sequential patterns. Some pruning techniques are proposed to further reduce the search space of the mining process. Experimental studies show that proposed algorithm is efficient and scalable. Furthermore, we apply proposed method to real datasets to demonstrate the practicability of discussed patterns.

【Keywords】: Algorithm design and analysis; Computer science; Data mining; Databases; Iron; Libraries; Probabilistic logic; data mining; interval-based event; representation; sequential pattern; temporal pattern

175. Pattern-aided regression modeling and prediction model analysis.

【Paper Link】【Pages】:1508-1509

【Authors】: Guozhu Dong ; Vahid Taslimitehrani

【Abstract】: This paper first introduces a new style of regression models, namely pattern aided regression (PXR) models, aimed at representing accurate and interpretable prediction models. It also introduces a contrast pattern aided regression (CPXR) method, to build accurate PXR models in an efficient manner. In experiments, the PXR models built by CPXR are very accurate in general, often outperforming state-of-the-art regression methods by wide margins. From extensive experiments we also found that (1) regression modeling applications often involve complex diverse predictor-response relationships, which occur when the optimal regression models (of given regression model type) fitting distinct natural subgroups of data are highly different, and (2) state-of-the-art regression methods are often unable to adequately model such relationships. CPXR is also useful for analyzing how a given regression model makes prediction errors. This is an extended abstract of [6].

【Keywords】: Analytical models; Computational modeling; Data models; Linear regression; Prediction algorithms; Predictive models; Regression tree analysis

【Paper Link】【Pages】:1510-1511

【Authors】: Yafei Li ; Rui Chen ; Jianliang Xu ; Qiao Huang ; Haibo Hu ; Byron Choi

【Abstract】: In this paper, we study a new type of Geo-Social K-Cover Group (GSKCG) queries that, given a set of query points and a social network, retrieves a minimum user group in which each user is socially related to at least k other users and the users' associated regions (e.g., familiar regions or service regions) can jointly cover all the query points. Albeit its practical usefulness, the GSKCG query problem is NP-hard. We consequently explore a set of effective pruning strategies to derive an efficient algorithm for finding the optimal solution. Moreover, we design a novel index structure tailored to our problem to further accelerate query processing. Extensive experiments demonstrate that our algorithm achieves desirable performance on real-life datasets.

【Keywords】: Algorithm design and analysis; Collaboration; Computer science; Electronic mail; Indexes; Query processing; Social network services

177. Diversified hidden Markov models for sequential labeling.

【Paper Link】【Pages】:1512-1513

【Authors】: Maoying Qiao ; Wei Bian ; Richard Yi Da Xu ; Dacheng Tao

【Abstract】: Labeling of sequential data is a prevalent metaproblem in a wide range of real world applications. A first-order hidden Markov model (HMM) provides a fundamental approach for sequential labeling. However, it does not show satisfactory performance for real world problems, such as optical character recognition (OCR). Aiming at addressing this problem, important extensions of HMM have been proposed in literature. One of the common key features in these extensions is the incorporation of proper prior information. In this paper, we propose a new extension of HMM, termed diversified hidden Markov models (dHMM), with incorporating a diversity-encouraging prior. The prior is added over the state-transition probabilities and thus facilitates more dynamic sequential labelling. Specifically, the diversity is modeled with a continuous determinantal point process. An EM framework for parameter learning and MAP inference is derived, and empirical evaluation on OCR dataset verifies its effectiveness.

【Keywords】: Character recognition; Hidden Markov models; Kernel; Labeling; Maximum likelihood estimation; Optical character recognition software; Optical imaging

178. Metric all-k-nearest-neighbor search.

【Paper Link】【Pages】:1514-1515

【Authors】: Lu Chen ; Yunjun Gao ; Gang Chen ; Haida Zhang

【Abstract】: An all-k-nearest-neighbor (AkNN) query finds from a given object set O, k nearest neighbors for each object in a specified query set Q. This operation is common in many applications such as GIS, data mining, and image analysis. Although it has received much attention in the Euclidean space, there is little prior work on the metric space. In this paper, we study the problem of AkNN retrieval in metric spaces, termed metric AkNN (MAkNN) search, and propose efficient algorithms for supporting MAkNN queries with arbitrary k value. Our methods utilize dynamic disk-based metric indexes, employ a series of pruning rules, take advantage of grouping, reuse, pre-processing and progressive pruning techniques, require no detailed representations of objects, and can be applied as long as the distance metric satisfies the triangle inequality. Extensive experiments using both real and synthetic data sets verify the efficiency of our proposed algorithms, compared with state-of-the-art Euclidean AkNN and MAkNN algorithms.

【Keywords】: Algorithm design and analysis; Data mining; Extraterrestrial measurements; Indexes; Partitioning algorithms; Search problems

179. Efficient enforcement of action-aware purpose-based access control within relational database management systems.

【Paper Link】【Pages】:1516-1517

【Authors】: Pietro Colombo ; Elena Ferrari

【Abstract】: Although database management systems (DBMSs) enforce access control according to a variety of models (see [2] for an overview), the majority of them do not integrate native privacy protection mechanisms. This void has been partially filled out with the advent of purpose based access control, as this access control model has brought to the integration of basic privacy preservation functionalities into DBMSs. Even though purposes represent a key feature of privacy policies, DBMSs' privacy awareness can be significantly increased considering additional privacy related aspects. With this work we do a step to achieve this goal by focusing on the actions performed by queries on data and the categories of the accessed data. We propose an access control model that supports highly customized privacyaware access control policies and significantly improves the basic privacy preservation capabilities of the purpose based model. The proposed model is complemented with an efficient enforcement monitor, which can be easily integrated into relational DBMSs. Early experimental evaluations show the efficiency of the proposed framework.

【Keywords】: Access control; Data privacy; Privacy; Relational databases; Remuneration

180. A practical and effective sampling selection strategy for large scale deduplication.

【Paper Link】【Pages】:1518-1519

【Authors】: Guilherme Dal Bianco ; Renata Galante ; Carlos A. Heuser ; Marcos André Gonçalves ; Sérgio D. Canuto

【Abstract】: Record deduplication aims at identifying entities that are potentially the same in a data repository. A set of pairs that is manually labeled is generally used to tune the deduplication process, as each dataset has a particular dirtiness pattern. However, producing an informative set of pairs is a very costly task, especially in very large datasets (even for expert users). We propose a new sampling strategy that is able to select a very small and informative set of pairs from large datasets. Our results show that our approach reduces user effort substantially while achieving a competitive or superior matching quality.

【Keywords】: Computational efficiency; Genetic programming; Labeling; Learning systems; Manuals; Redundancy; Training

181. Anonymizing collections of tree-structured data.

【Paper Link】【Pages】:1520-1521

【Authors】: Olga Gkountouna ; Manolis Terrovitis

【Abstract】: Collections of real-world data usually have implicit or explicit structural relations. For example, databases link records through foreign keys, and XML documents express associations between different values through syntax. Privacy preservation, until now, has focused either on data with a very simple structure, e.g. relational tables, or on data with very complex structure e.g. social network graphs, but has ignored intermediate cases, which are the most frequent in practice. In this work, we focus on tree structured data. The paper defines k(m;n)-anonymity, which provides protection against identity disclosure and proposes a greedy anonymization heuristic that is able to sanitize large datasets. The algorithm and the quality of the anonymization are evaluated experimentally.

【Keywords】: Data models; Databases; Electronic mail; Information systems; Privacy; Transforms; XML

【Paper Link】【Pages】:1522-1523

【Authors】: Xike Xie ; Peiquan Jin ; Man Lung Yiu ; Jiang Du ; Christian S. Jensen ; Mingxuan Yuan

【Abstract】: We study safe zones for service subscriptions in a volunteered geographic service setting, covering the concepts, properties, and algorithms needed for the use of weighted imprecise Voronoi cells as safe zones. Empirical performance studies on both synthetic and real datasets offer insights into the efficiency and scalability of our proposal. For a comprehensive coverage, see the full version of the paper [1].

【Keywords】: Buildings; Computer science; Indexes; Lungs; Mobile communication; Shape; Spatial databases

183. Beyond millisecond latency kNN search on commodity machine.

【Paper Link】【Pages】:1524-1525

【Authors】: Bailong Liao ; Leong Hou U ; Man Lung Yiu ; Zhiguo Gong

【Abstract】: To the best of our knowledge, this is the first work that offers very low latency (0.1ms) per kNN query on 10-million-node networks on a commodity machine. This translates to a query throughput of 10,000 queries per second per commodity machine. Our experimental studies on large scale road networks show that our solutions are 1¿¿¿3 orders of magnitudes faster than existing methods while our indexes are compact and can fit into main memory. In the future, we plan to further enhance the performance of kNN search with keywords.

【Keywords】: Bidirectional control; Boosting; Indexes; Roads; Search problems; Throughput

184. Unsupervised ranking of multi-attribute objects based on principal curves.

【Paper Link】【Pages】:1526-1527

【Authors】: Chun-Guo Li ; Xing Mei ; Bao-Gang Hu

【Abstract】: Unsupervised ranking faces one critical challenge in evaluation applications, that is, no ground truth is available. While PageRank and its variants show a good solution in related objects, they are applicable only for ranking from link-structure data. In this work, we focus on unsupervised ranking from multi-attribute data which is also common in evaluation tasks. To overcome the challenge, we propose five essential meta-rules for the design and assessment of unsupervised ranking approaches: scale and translation invariance, strict monotonicity, compatibility of linearity and nonlinearity, smoothness, and explicitness of parameter size. These meta-rules are regarded as high level knowledge for unsupervised ranking tasks. Inspired by the works in [2] and [6], we propose a ranking principal curve (RPC) model, which learns a one-dimensional manifold function to perform unsupervised ranking tasks on multi-attribute observations. Furthermore, the RPC is modeled to be a cubic B¿¿zier curve with control points restricted in the interior of a hypercube, complying with all the five meta-rules to infer a reasonable ranking list. With control points as model parameters, one is able to understand the learned manifold and to interpret and visualize the ranking results. Numerical experiments of the presented RPC model are conducted on two open datasets of different ranking applications. In comparison with the state-of-the-art approaches, the new model is able to show more reasonable ranking lists.

【Keywords】: Data models; Economic indicators; Electronic mail; Linearity; Manifolds; Mathematics; Numerical models

185. Scalable algorithms for nearest-neighbor joins on big trajectory data.

【Paper Link】【Pages】:1528-1529

【Authors】: Yixiang Fang ; Reynold Cheng ; Wenbin Tang ; Silviu Maniu ; Xuan S. Yang

【Abstract】: Trajectory data are prevalent in systems that monitor the locations of moving objects. In a location-based service, for instance, the positions of vehicles are continuously monitored through GPS; the trajectory of each vehicle describes its movement history. We study joins on two sets of trajectories, generated by two sets M and R of moving objects. For each entity in M, a join returns its k nearest neighbors from R. We examine how this query can be evaluated in cloud environments. This problem is not trivial, due to the complexity of the trajectory, and the fact that both the spatial and temporal dimensions of the data have to be handled. To facilitate this operation, we propose a parallel solution framework based on MapReduce. We also develop a novel bounding technique, which enables trajectories to be pruned in parallel. Our approach can be used to parallelize existing single-machine trajectory join algorithms. To evaluate the efficiency and the scalability of our solutions, we have performed extensive experiments on a real dataset.

【Keywords】: Global Positioning System; Joining processes; Load management; Monitoring; Partitioning algorithms; Trajectory; Vehicles

186. Maximizing a record's standing in a relation.

【Paper Link】【Pages】:1530-1531

【Authors】: Yu Tang ; Yilun Cai ; Nikos Mamoulis

【Abstract】: Given a database table with records that can be ranked, an interesting problem is to identify selection conditions, which are qualified by an input record and render its ranking as high as possible among the qualifying tuples. In this paper, we study this standing maximization problem, which finds application in object promotion and characterization. We propose greedy methods, which are experimentally shown to achieve high accuracy compared to exhaustive enumeration, while scaling very well to the problem size. Our contributions include a lineartime algorithm for determining the optimal selection range for an attribute and techniques for choosing and prioritizing the most promising selection predicates to apply. Experiments on real datasets confirm the effectiveness and efficiency of our techniques.

【Keywords】: Asia; Europe

187. Structure-preserving subgraph query services.

【Paper Link】【Pages】:1532-1533

【Authors】: Zhe Fan ; Byron Choi ; Qian Chen ; Jianliang Xu ; Haibo Hu ; Sourav S. Bhowmick

【Abstract】: Subgraph query (via subgraph isomorphism) is a fundamental and powerful query in various real graph applications. It has actively been investigated for performance enhancements recently. However, due to the high complexity of subgraph query, hosting efficient subgraph query services has been a technically challenging task, because the owners of graph data may not always possess the IT expertise to offer such services and hence may outsource to query service providers (SP). SPs are often equipped with high performance computing utilities (e.g., a cloud) that offer better scalability, elasticity and IT management. Unfortunately, as SPs may not always be trusted, security (such as the confidentiality of messages exchanged) has been recognized as one of the critical attributes of Quality of Services (QoS) [4]. This influences the willingness of both data owners and query clients to use SP's services. Recently, there is a bloom on the research on query processing with privacy preservation1, e.g., in the context of relational databases, spatial databases and graph databases. However, up to date, private subgraph query has not yet been studied.

【Keywords】: Encryption; Indexes; Privacy; Quality of service; Query processing

188. TrGraph: Cross-network transfer learning via common signature subgraphs.

【Paper Link】【Pages】:1534-1535

【Authors】: Meng Fang ; Jie Yin ; Xingquan Zhu ; Chengqi Zhang

【Abstract】: In this paper, we present a novel transfer learning framework for network node classification. Our objective is to accurately predict node labels in a target network by leveraging information from an auxiliary source network. Such a transfer learning framework is potentially useful for broader areas of network classification, where emerging new networks might not have sufficient labeled information because node labels are either costly to obtain or simply not available, whereas many established networks from related domains are available to benefit the learning. In reality, the source and the target networks may not share common nodes or connections, so the major challenge of cross-network transfer learning is to identify knowledge/patterns transferable between networks and potentially useful to support cross-network learning. In this work, we propose to learn common signature subgraphs between networks, and use them as structure features for the target network. By combining the original node content features and the new structure features, we develop an iterative classification algorithm, TrGraph, that utilizes label dependency to jointly classify nodes in the target network. Experiments on real-world networks demonstrate that TrGraph achieves the superior performance compared to the state-of-the-art baseline methods.

【Keywords】: Artificial neural networks

189. Crawling hidden objects with kNN queries.

【Paper Link】【Pages】:1536-1537

【Authors】: Hui Yan ; Zhiguo Gong ; Nan Zhang ; Tao Huang ; Hua Zhong ; Jun Wei

【Abstract】: With rapidly growing popularity, Location Based Services (LBS), e.g., Google Maps, Yahoo Local, WeChat, FourSquare, etc., started offering web-based search features that resemble a kNN query interface. Specifically, for a user-specified query location q, these websites extract from the objects in their backend database the top-k nearest neighbors to q and return these k objects to the user through the web interface. Here k is often a small value like 50 or 100. For example, McDonald [1] returns the top 25 nearest restaurants for a user-specified location through its locations search webpage.

【Keywords】: Algorithm design and analysis; Computer science; Google; Knowledge based systems; Prediction algorithms; Spatial databases

190. Indexing evolving events from tweet streams.

【Paper Link】【Pages】:1538-1539

【Authors】: HongYun Cai ; Zi Huang ; Divesh Srivastava ; Qing Zhang

【Abstract】: Tweet streams provide a variety of real-time information on dynamic social events. Although event detection has been actively studied, most of the existing approaches do not address the issue of efficient event monitoring in the presence of a large number of events detected from continuous tweet streams. In this paper, we capture the dynamics of events using four event operations: creation, absorption, split and merge.We also propose a novel event indexing structure, named Multi-layer Inverted List (MIL), for the acceleration of large-scale event search and update. We thoroughly study the problem of nearest neighbour search using MIL based on upper bound pruning. Extensive experiments have been conducted on a large-scale tweet dataset. The results demonstrate the promising performance of our method in terms of both efficiency and effectiveness.

【Keywords】: Australia; Indexing; Monitoring; Real-time systems; Search problems; Twitter; Upper bound

191. App relationship calculation: An iterative process.

【Paper Link】【Pages】:1540-1541

【Authors】: Ming Liu ; Chong Wu ; Xiang-Nan Zhao ; Chin-Yew Lin ; Xiao-Long Wang

【Abstract】: Today, plenty of apps are released to help users make the best use of their mobile phones. Facing the large amount of apps, app retrieval and app recommendation are extensively adopted to help users obtain their favorite apps. To acquire the high-quality retrieval or recommending results, it needs to obtain the accurate app relationship calculating results in advance. Unfortunately, recent methods are conducted mostly depending on user's log or app's contexts, which can only detect whether two apps are downloaded, installed meanwhile or provide similar functions or not. In fact, apps contain many deep relationships other than similarity, e.g., one app needs another app to cooperate to fulfill its work. Obviously, app's reviews contain user's viewpoint. They are useful to help dig deep relationship between apps. Therefore, to calculate relationship between apps via reviews, we propose an iterative process by combining review similarity calculation and app relationship calculation together.

【Keywords】: Computer science; Context; Correlation; Dictionaries; Iterative methods; Semantics; Thesauri

192. Efficient probabilistic supergraph search.

【Paper Link】【Pages】:1542-1543

【Authors】: Wenjie Zhang ; Xuemin Lin ; Ying Zhang ; Ke Zhu ; Gaoping Zhu

【Abstract】: In this paper, we investigate the problem of supergraph containment search over uncertain data graphs gu where each edge in a gu has an occurrence probability, namely probabilistic supergraph search, which has a wide spectrum of applications.

【Keywords】: Australia; Electronic mail; Probabilistic logic; Search problems; Synchronization; Time factors; Upper bound

193. Improving accuracy and robustness of self-tuning histograms by subspace clustering.

【Paper Link】【Pages】:1544-1545

【Authors】: Andranik Khachatryan ; Emmanuel Müller ; Klemens Böhm ; Christian Stier

【Abstract】: We show both formally and by means of experiments that self-tuning histograms suffer from three major problems - sensitivity to learning, stagnation, and dimensionality. We also describe our solution to the problem - which is histogram initialization with subspace clustering.

【Keywords】: Correlation; Electronic mail; Estimation error; Histograms; Image color analysis; Sensitivity; Training

194. CrowdOp: Query optimization for declarative crowdsourcing systems.

【Paper Link】【Pages】:1546-1547

【Authors】: Ju Fan ; Meihui Zhang ; Stanley Kok ; Meiyu Lu ; Beng Chin Ooi

【Abstract】: We propose CROWDOP, a cost-based query optimization approach for declarative crowdsourcing systems. CROWDOP considers both cost and latency in the query optimization objectives and generates query plans that provide a good balance between the cost and latency. We develop efficient algorithms in CROWDOP for optimizing three types of queries: selection, join and complex selection-join queries. We validate our approach via extensive experiments by simulation as well as with the real crowd on Amazon Mechanical Turk.

【Keywords】: Algorithm design and analysis; Crowdsourcing; Heuristic algorithms; Minimization; Optimization; Query processing; Vegetation

195. Time-series classification with COTE: The collective of transformation-based ensembles.

【Paper Link】【Pages】:1548-1549

【Authors】: Anthony Bagnall ; Jason Lines ; Jon Hills ; Aaron Bostrom

【Abstract】: We have proposed an ensemble scheme for TSC based on constructing classifiers on different data representations. The standard baseline algorithms used in TSC research are 1-NN with Euclidean distance and/or Dynamic Time Warping. We have conclusively shown that COTE significantly out-performs both of these approaches, and that COTE it is significantly better than all of the competing algorithms that have been proposed in the literature. We believe the results we present represents a new state-of-the-art in TSC that new algorithms should be compared to in terms of accuracy.

【Keywords】: Benchmark testing; Correlation; Heuristic algorithms; Standards; Time series analysis; Time-domain analysis; Transforms

196. A framework for enabling user preference profiling through Wi-Fi logs.

【Paper Link】【Pages】:1550-1551

【Authors】: Yao-Chung Fan ; Yu-Chi Chen ; Kuan-Chieh Tung ; Kuo-Chen Wu ; Arbee L. P. Chen

【Abstract】: Understanding users is a key for many business applications. In this paper, we propose to pursue user preference understanding by their Wi-Fi logs collected from their mobile devices. As shown, Wi-Fi data are essentially of various information types and with noises. The challenges lie in how to refine relevant information from noisy Wi-Fi data. Aiming at the challenges, this paper proposes a data cleaning and information enrichment framework for enabling user preference understanding through Wi-Fi logs, and introduces a series of filters for cleaning, correcting, and refining Wi-Fi logs. A comprehensive experiment with real data collected from users is made to verify the effectiveness of the proposed techniques for cleaning noisy Wi-Fi data for user preference profiling. To the best of our knowledge, this work is the first attempt to study user behavior understanding by mining Wi-Fi logs.

【Keywords】: Cleaning; Computer science; Gain measurement; IEEE 802.11 Standard; Mobile handsets; Semantics; Web search

197. Understanding short texts through semantic enrichment and hashing.

【Paper Link】【Pages】:1552-1553

【Authors】: Zheng Yu ; Haixun Wang ; Xuemin Lin ; Min Wang

【Abstract】: Clustering short texts by their meaning is a challenging task. The semantic hashing approach encodes the meaning of a text into a compact binary code. Thus, to tell if two texts have similar meanings, we only need to check if they have similar codes. The encoding is created by a deep neural network, which is trained on texts represented by word-count vectors. Unfortunately, for short texts such as search queries, such representations are insufficient to capture the underlying semantics. We propose a method to add more semantic signals to enrich short texts. Furthermore, we introduce a simplified deep learning network constructed by stacked auto-encoders to do semantic hashing. Experiments show that our method significantly improves the understanding of short texts, including text retrieval, classification and other general text-related tasks.

【Keywords】: Binary codes; Electronic publishing; Encyclopedias; Internet; Neural networks; Semantics

198. Application sensitive energy management framework for storage systems.

【Paper Link】【Pages】:1554-1555

【Authors】: Norifumi Nishikawa ; Miyuki Nakano ; Masaru Kitsuregawa

【Abstract】: Rapidly escalating energy and cooling costs of storage systems have become a concern for data centers. In response, a multitude of energy saving approaches that take into account storage-device-level input/output (I/O) behaviors has been proposed. The trouble is that critical applications are in constant operation at data centers, and the conventional approaches do not produce sufficient energy savings. It may be possible to dramatically reduce storage energy consumption without degrading application performance levels by utilizing application level I/O behaviors. However, such behaviors differ from one application to another, and it would be too expensive to tailor methods to individual applications. We propose a universal storage energy management framework for runtime storage energy savings that can be applied to any type of application. The results of evaluations show that the use of this framework results in substantive energy savings compared with the traditional approaches.

【Keywords】: Buffer storage; Cooling; Degradation; Energy storage; Monitoring

199. Capacity-Constrained Network-Voronoi Diagram.

【Paper Link】【Pages】:1556-1557

【Authors】: KwangSoo Yang ; Apurv Hirsh Shekhar ; Dev Oliver ; Shashi Shekhar

【Abstract】: Given a graph and a set of service center nodes, a Capacity Constrained Network-Voronoi Diagram (CCNVD) partitions the graph into a set of contiguous service areas that meet service center capacities and minimize the sum of the shortest distances from graph-nodes to allotted service centers. The CCNVD problem is important for critical societal applications such as assigning evacuees to shelters and assigning patients to hospitals. This problem is NPO-hard; it is computationally challenging because of the large size of the transportation network and the constraint that service areas must be contiguous in the graph to simplify communication of allotments. Previous work has focused on honoring either service area contiguity (e.g., Network Voronoi Diagrams) or service center capacity constraints (e.g., min-cost flow), but not both. We introduced a novel Pressure Equalizer (PE) approach for CCNVD to meet the capacity constraints of service centers while maintaining the contiguity of service areas. However, we find that the main bottleneck of the PE algorithm is testing whether service areas are contiguous. We propose novel algorithms that reduce the computational cost. Experiments using road maps from five different regions demonstrate that the proposed approaches significantly reduce computational cost for the PE approach.

【Keywords】: Computational efficiency; Computer science; Hospitals; Partitioning algorithms; Roads; Scalability

200. Similar subtree search using extended tree inclusion.

【Paper Link】【Pages】:1558-1559

【Authors】: Tomoya Mori ; Atsuhiro Takasu ; Jesper Jansson ; Jaewook Hwang ; Takeyuki Tamura ; Tatsuya Akutsu

【Abstract】: In this paper, we have extended the concept of unordered tree inclusion to take the costs of insertions and substitutions into account. The resulting algorithm, MinCostIncl, has the same time complexity as the original algorithm of [4] for unordered tree inclusion (O(22Dmn)). Computational experiments on a large synthetic dataset as well as real datasets showed that our proposed algorithm is fast and scalable. Source codes of the implemented algorithms are available upon request.

【Keywords】: Bioinformatics; Chemicals; Electronic mail; Informatics; RNA; Vegetation

201. Change-point detection in a sequence of bags-of-data.

【Paper Link】【Pages】:1560-1561

【Authors】: Kensuke Koshijima ; Hideitsu Hino ; Noboru Murata

【Abstract】: In this paper, the limitation that is prominent in most existing works of change-point detection methods is addressed by proposing a nonparametric, computationally efficient method. The limitation is that most works assume that each data point observed at each time step is a single multi-dimensional vector. However, there are many situations where this does not hold. Therefore, a setting where each observation is a collection of random variables, which we call a bag of data, is considered.

【Keywords】: Bipartite graph; Computational modeling; Earth; Electronic mail; Extraterrestrial measurements; Random variables

202. Reputation aggregation in peer-to-peer network using differential gossip algorithm.

【Paper Link】【Pages】:1562-1563

【Authors】: Ruchir Gupta ; Yatindra Nath Singh

【Abstract】: In a peer-to-peer system, a node should estimate reputation of other peers not only on the basis of its own interaction, but also on the basis of experience of other nodes. Reputation aggregation mechanism implements strategy for achieving this. Reputation aggregation in peer to peer networks is generally a very time and resource consuming process. This paper proposes a reputation aggregation algorithm that uses a variant of gossip algorithm called differential gossip. In this paper, estimate of reputation is considered to be having two parts, one common component which is same with every node, and the other one is the information received from immediate neighbours based on the neighbours' direct interaction with the node. Theoretical analysis and numerical results show that differential gossip is fast and requires lesser amount of resources. The reputation computed using the proposed algorithm also shows a good amount of immunity to the collusion.

【Keywords】: Algorithm design and analysis; Computer science; Conferences; Electrical engineering; Peer-to-peer computing; Servers; Social network services

203. Differentially private frequent itemset mining via transaction splitting.

【Paper Link】【Pages】:1564-1565

【Authors】: Sen Su ; Shengzhi Xu ; Xiang Cheng ; Zhengyi Li ; Fangchun Yang

【Abstract】: Frequent itemset mining (FIM) is one of the most fundamental problems in data mining. It has practical importance in a wide range of application areas such as decision support, Web usage mining, bioinformatics, etc. Given a database, where each transaction contains a set of items, FIM tries to find itemsets that occur in transactions more frequently than a given threshold. Despite valuable insights the discovery of frequent itemsets can potentially provide, if the data is sensitive (e.g., web browsing history and medical records), releasing the discovered frequent itemsets might pose considerable threats to individual privacy.

【Keywords】: Algorithm design and analysis; Data privacy; Heuristic algorithms; Itemsets; Privacy

204. Location aware keyword query suggestion based on document proximity.

【Paper Link】【Pages】:1566-1567

【Authors】: Shuyao Qi ; Dingming Wu ; Nikos Mamoulis

【Abstract】: Consider a user who has issued a keyword query to a search engine. We study the effective suggestion of alternative keyword queries to the user, which are semantically relevant to the original query and they have as results documents that correspond to objects near the user's location. For this purpose, we propose a weighted keyword-document graph which captures semantic and proximity relevance between queries and documents. Then, we use the graph to suggest queries that are near in terms of graph distance to the original queries. To make our framework scalable, we propose a partition-based approach that greatly outperforms the baseline algorithm.

【Keywords】: Electronic mail; Engines; Fish; Flickr; Ink; Partitioning algorithms; Semantics

205. The prediction of venture capital co-investment based on structural balance theory.

【Paper Link】【Pages】:1568-1569

【Authors】: Yun Zhou ; Zhiyuan Wang ; Jie Tang ; Jarder Luo

【Abstract】: In this paper, we study the prediction of co-investment of VCs. We present a series of observation analysis, design a large number of features, and then select prominent features for co-investment by group Lasso. Then we propose a factor graph model SBFG based on structural balance theory to formalize the observation into a unified model. Experiment results show that the proposed method can accurately (around 90% in terms of accuracy) predict the co-investment in the near future with only 10 features selected by group Lasso, and obtains a significant improvement over the baselines.

【Keywords】: Electronic mail; High performance computing; Industries; Investment; Predictive models; Static VAr compensators; Venture capital; co-investment; factor graph model; group Lasso; prediction; venture capital

206. XACML policy evaluation with dynamic context handling.

【Paper Link】【Pages】:1570-1571

【Authors】: Nariman Ammar ; Zaki Malik ; Abdelmounaam Rezgui ; Elisa Bertino

【Abstract】: We provided an XACML-based implementation of a semantic-based privacy management framework that incorporates context into dynamic rule evaluation and decision enforcement. Our evaluation results are promising, and we believe that future enhancements on the current implementation can provide a foundation for modern health records infrastructures and inspire collaborative data sharing.

【Keywords】: Computer science; Context; Engines; Privacy; Scalability; Semantics; Web services

207. kNNVWC: An efficient k-nearest neighbours approach based on Various-Widths Clustering.

【Paper Link】【Pages】:1572-1573

【Authors】: Abdulmohsen Almalawi ; Adil Fahad ; Zahir Tari ; Muhammad Aamir Cheema ; Ibrahim Khalil

【Abstract】: In this paper, a novel k-NN approach based on Various-Widths Clustering, named kNNVWC, is proposed to efficiently find k-NNs for a query object from a given data set. kNNVWC does clustering using various widths, where a data set is clustered with a global width first and each produced cluster that meets the predefined criteria is recursively clustered with its own local width that suits its distribution. Experimental results demonstrate that kNNVWC performs well compared to state-ofart of k-NN search algorithms.

【Keywords】: Approximation algorithms; Australia; Clustering algorithms; Computer science; Measurement; Merging; Search problems

208. Fast best-effort search on graphs with multiple attributes.

【Paper Link】【Pages】:1574-1575

【Authors】: Senjuti Basu Roy ; Tina Eliassi-Rad ; Spiros Papadimitriou

【Abstract】: We address the problem of top-k search on graphs with multiple nodal attributes, which we call WAGs (short for Weighted Attribute Graphs). For example, a co-authorship network is a WAG, where each author is a node; each attribute corresponds to a particular topic (e.g., databases, data mining, and machine learning); and the amount of expertise in a particular topic is represented by a non-negative weight on that attribute. A typical search in this setting may be: find three coauthors (i.e., a triangle) where each author's expertise is greater than 50% in at least one topic area (i.e., attribute). We show that the problem of retrieving the optimal answer for graph search on WAGs is NP-complete. Moreover, we propose a fast and effective top-k graph search algorithm for WAGs. In an extensive experimental study, our proposed algorithm exhibits significant speed-up over competing approaches. On average, our proposed method achieves 7¿¿ faster query processing than the best competitor.

【Keywords】: Business; Computer science; Data mining; Electronic mail; Query processing; Search problems

209. Top-k spatio-textual similarity join.

【Paper Link】【Pages】:1576-1577

【Authors】: Huiqi Hu ; Guoliang Li ; Zhifeng Bao ; Jianhua Feng ; Yongwei Wu ; Zhiguo Gong ; Yaoqiang Xu

【Abstract】: With the rapid development of mobile Internet technology, Internet users are shifting from desktop to mobile devices. Modern mobile devices (e.g., smartphones and tablets) are equipped with GPS, which can help users to easily obtain their locations, and location-based services (LBS) have been widely deployed. LBS users are generating more and more spatio-textual data which contains both textual descriptions and geographical locations. In user-generated data, a spatiotextual entity may have different representations, possibly due to GPS deviations or typographical errors [6], [2], and it calls for effective methods to integrate the spatio-textual data from different data sources. A spatio-textual similarity join is an important operation in spatio-textual data integration, which, given two sets of spatio-textual objects, finds all similar pairs from the two sets, where the similarity can be quantified by combining spatial proximity and textual relevancy. There are many applications in spatio-textual similarity joins, e.g., user recommendation in location-based social networks, image duplication detection using spatio-textual tags, spatio-textual advertising, and location-based market analysis [6], [2]. For example, a house rental agency (e.g., rent.com) wants to perform a similarity join on the spatio-textual data of house requirements from renters and the data of house properties from owners. For another example, a startup company, e.g., Factual (factual.com), crawls spatio-textual records to generate points of interest (POIs). As the records are from multiple sources and may contain many duplicates, It needs to run similarity joins to remove the duplicates.

【Keywords】: Australia; Computer science; Global Positioning System; Internet; Mobile handsets; Tuning; Upper bound

【Paper Link】【Pages】:1578-1579

【Authors】: Kyriakos Mouratidis ; Jing Li ; Yu Tang ; Nikos Mamoulis

【Abstract】: The diffusion of social networks introduces new challenges and opportunities for advanced services, especially so with their ongoing addition of location-based features. We show how applications like company and friend recommendation could significantly benefit from incorporating social and spatial proximity, and study a query type that captures these two-fold semantics. We develop highly scalable algorithms for its processing, and use real social network data to empirically verify their efficiency and efficacy.

【Keywords】:

211. Scalable online betweenness centrality in evolving graphs.

【Paper Link】【Pages】:1580-1581

【Authors】: Nicolas Kourtellis ; Gianmarco De Francisci Morales ; Francesco Bonchi

【Abstract】: Betweenness centrality measures the importance of an element of a graph, either a vertex or an edge, by the fraction of shortest paths that pass through it [1]. This measure is notoriously expensive to compute, and the best known algorithm, proposed by Brandes [2], runs in O(nm) time. The problems of efficiency and scalability are exacerbated in a dynamic setting, where the input is an evolving graph seen edge by edge, and the goal is to keep the betweenness centrality up to date. In this paper [8] we propose the first truly scalable and practical framework for computing vertex and edge betweenness centrality of large evolving graphs, incrementally and online.

【Keywords】: Complexity theory; Data structures; Engines; Facebook; Heuristic algorithms; Scalability; Time measurement

32. ICDE 2016:Helsinki, Finland

Paper Num: 212 || Session Num: 36

Research Session 1A: Graph Processing 4

1. Distance-aware influence maximization in geo-social network.

2. Topical influence modeling via topic-level interests and interactions on social curation services.

3. BlackHole: Robust community detection inspired by graph drawing.

4. Revenue maximization by viral marketing: A social network host's perspective.

Research Session 1B: Crowdsourcing 4

5. Online mobile Micro-Task Allocation in spatial crowdsourcing.

6. Crowdsourced POI labelling: Location-aware result inference and Task Assignment.

7. Mutual benefit aware task assignment in a bipartite labor market.

8. Computing Connected Components with linear communication cost in pregel-like systems.

Research Session 2A: Graph Algorithmics 4

9. VColor: A practical vertex-cut based approach for coloring large graphs.

10. Compressing graphs by grammars.

11. Planar: Parallel lightweight architecture-aware adaptive graph repartitioning.

12. I/O efficient Core Graph Decomposition at web scale.

Research Session 2B: Beyond Relational Query Processing 4

13. Reachability and time-based path queries in temporal graphs.

14. Scalable supergraph search in large graph databases.

15. Hobbes3: Dynamic generation of variable-length signatures for efficient approximate subsequence mappings.

16. NoSE: Schema design for NoSQL applications.

Research Session 2C: Privacy 4

17. Towards Virtual Private NoSQL datastores.

18. Differentially private multi-party high-dimensional data publishing.

19. Optimizing secure classification performance with privacy-aware feature selection.

20. Differentially private frequent subgraph mining.

Research Session 3A: Graph Proximity 4

21. Being prepared in a sparse world: The case of KNN graph construction.

22. pSCAN: Fast and exact structural graph clustering.

23. CSI_GED: An efficient approach for graph edit similarity computation.

24. Semantic proximity search on graphs with metagraph-based learning.

Research Session 3B: Scalable Query Processing 4

25. Private spatial data aggregation in the local setting.

26. A novel, low-latency algorithm for multiple Group-By query optimization.

27. Load balancing and skew resilience for parallel joins.

28. Platform-independent robust query processing.

Research Session 3C: Preference and Trust 4

29. Authentication of function queries.

30. "Told you i didn't like it": Exploiting uninteresting items for effective collaborative filtering.

31. Practical private shortest path computation based on Oblivious Storage.

32. Practical privacy-preserving user profile matching in social networks.

Research Session 4A: Graph Mining 4

33. An embedding approach to anomaly detection.

34. Cross-layer betweenness centrality in multiplex networks with applications.

35. NXgraph: An efficient graph processing system on a single machine.

36. Mining social ties beyond homophily.

Research Session 4B: Conscious Big Data Processing 4

37. Self-Adaptive Linear Hashing for solid state drives.

38. On main-memory flushing in microblogs data management systems.

39. ICE: Managing cold state for big data applications.

40. HAWK: Hardware support for unstructured log processing.

Research Session 4C: Data Streams 4

41. Efficient handling of concept drift and concept evolution over Stream Data.

42. Quality-driven disorder handling for m-way sliding window stream joins.

43. Context-aware advertisement recommendation for high-speed social news feeding.

44. Tolerating correlated failures in Massively Parallel Stream Processing Engines.

Research Session 5A: Graph Patterns 4

45. Spatial influence - measuring followship in the real world.

46. Durable graph pattern queries on historical graphs.

47. Link prediction in graph streams.

48. SimRank computation on uncertain graphs.

Research Session 5B: Parallel and Distributed Big Data Processing 4

49. Input selection for fast feature engineering.

50. When two choices are not enough: Balancing at scale in Distributed Stream Processing.

51. HadoopViz: A MapReduce framework for extensible visualization of big spatial data.

52. Efficient fault-tolerance for iterative graph processing on distributed dataflow systems.

Research Session 5C: Clustering 4

53. A model-based approach for text clustering with outlier detection.

54. Streaming spectral clustering.

55. Accelerating large scale centroid-based clustering with locality sensitive hashing.

56. PurTreeClust: A purchase tree clustering algorithm for large-scale customer transaction data.

Research Session 6A: Spatial Analytics 4

57. TRANSFORMERS: Robust spatial joins on non-uniform data distributions.

58. Finding the minimum spatial keyword cover.

59. Answering why-not spatial keyword top-k queries via keyword adaption.

60. Influence based cost optimization on user preference.

Research Session 6B: Analytics on Big Data 4

61. The CRO kernel: Using Concomitant Rank Order hashes for sparse high dimensional randomized feature maps.

62. MuVE: Efficient Multi-Objective View Recommendation for Visual Data Exploration.