31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13-17, 2015. IEEE Computer Society 【DBLP Link】
【Paper Link】 【Pages】:1-4
【Authors】: Surajit Chaudhuri
【Abstract】: We review how the state of information technology has evolved for consumers vs. enterprises. We discuss some of the key challenges in enterprise search over structured data and suggest a few promising directions for the research community.
【Keywords】: business data processing; data mining; enterprise search; information technology; research community; structured data; Business; Data models; Data visualization; Databases; Internet; Search engines; Search problems
【Paper Link】 【Pages】:5
【Authors】: Hector Garcia-Molina
【Abstract】: The document was not made available for publication as part of the conference proceedings.
【Keywords】:
【Paper Link】 【Pages】:6-17
【Authors】: Chen Jason Zhang ; Lei Chen ; Yongxin Tong ; Zheng Liu
【Abstract】: Uncertain data has emerged as an important problem in database systems due to the imprecise nature of many applications. To handle the uncertainty, probabilistic databases can be used to store uncertain data, and querying facilities are provided to yield answers with confidence. However, the uncertainty may propagate, hence the returned results from a query or mining process may not be useful. In this paper, we leverage the power of crowdsourcing for cleaning uncertain data. Specifically, we design a set of Human Intelligence Tasks (HITs) to ask a crowd to improve the quality of uncertain data. Each HIT is associated with a cost; thus, we need to design solutions that maximize the data quality with a minimal number of HITs. There are two obstacles to this non-trivial optimization: first, the crowd may return incorrect answers; second, the HITs decomposed from uncertain data are often correlated. These two obstacles lead to very high computational cost for selecting the optimal set of HITs. Thus, in this paper, we address these challenges by designing an effective approximation algorithm and an efficient heuristic solution. To further improve the efficiency, we derive tight lower and upper bounds, which are used for effective filtering and estimation. We have verified the solutions with extensive experiments on both a simulated crowd and a real crowdsourcing platform.
【Keywords】: approximation theory; data mining; database management systems; optimisation; query processing; HIT; approximation algorithm; crowdsourcing platform; database system; human intelligence tasks; mining process; noisy crowd; nontrivial optimization; probabilistic database; query process; uncertain data cleaning; Accuracy; Cities and towns; Cleaning; Crowdsourcing; Entropy; Semantics; Uncertainty
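As a rough illustration of the budgeted HIT selection the abstract above describes (not the paper's approximation algorithm, which also handles crowd noise and correlations), the sketch below greedily asks about the uncertain tuples whose resolution removes the most entropy per unit cost. The tuple representation and cost model are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a Bernoulli variable with success probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def pick_hits(uncertain_tuples, budget):
    """Greedy HIT selection. `uncertain_tuples` is a hypothetical list of
    (tuple_id, confidence, cost); ask the crowd about the tuples with the
    highest uncertainty reduction per unit cost until the budget is spent."""
    ranked = sorted(uncertain_tuples,
                    key=lambda t: entropy(t[1]) / t[2], reverse=True)
    selected, spent = [], 0.0
    for tuple_id, confidence, cost in ranked:
        if spent + cost <= budget and entropy(confidence) > 0:
            selected.append(tuple_id)
            spent += cost
    return selected

# Example: three uncertain tuples with confidences and unit costs.
print(pick_hits([("t1", 0.5, 1.0), ("t2", 0.9, 1.0), ("t3", 0.6, 2.0)], budget=2.0))
```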
【Paper Link】 【Pages】:18-29
【Authors】: Matteo Interlandi ; Nan Tang
【Abstract】: One notoriously hard data cleaning problem is, given a database, how to precisely capture which value is correct (i.e., proof positive) or wrong (i.e., proof negative). Although integrity constraints have been widely studied to capture data errors as violations, the accuracy of data cleaning using integrity constraints has long been controversial. They all share one fundamental problem: given a set of data values that together forms a violation, there is no evidence of which value is proof positive or negative. Hence, it is known that integrity constraints themselves cannot guide dependable data cleaning. In this work, we introduce an automated method for proof positive and negative in data cleaning, based on Sherlock rules and reference tables. Given a tuple and reference tables, Sherlock rules tell us which attributes are proof positive, which attributes are proof negative, and (possibly) how to update them. We study several fundamental problems associated with Sherlock rules. We also present efficient algorithms for cleaning data using Sherlock rules. We experimentally demonstrate that our techniques can not only annotate data with proof positive and negative, but also repair data when enough information is available.
【Keywords】: data integrity; Sherlock rule; data cleaning; data errors; data integrity; proof negative; proof positive; reference table; Accuracy; Cleaning; Databases; Maintenance engineering; Mobile communication; Semantics; Silicon
【Paper Link】 【Pages】:30-41
【Authors】: Jianmin Wang ; Shaoxu Song ; Xuemin Lin ; Xiaochen Zhu ; Jian Pei
【Abstract】: Event data are often dirty owing to various recording conventions or simply system errors. These errors may cause serious damage to real applications, such as inaccurate provenance answers, poor profiling results, or the concealment of interesting patterns in the event data. Cleaning dirty event data is strongly demanded. While existing event data cleaning techniques view event logs as sequences, structural information does exist among events. We argue that such structural information enhances not only the accuracy of repairing inconsistent events but also the computation efficiency. It is notable that both the structure and the names (labeling) of events could be inconsistent. In real applications, while unsound structure is not repaired automatically (which needs manual effort from business actors to handle the structure error), it is highly desirable to repair inconsistent event names introduced by recording mistakes. In this paper, we propose a graph repair approach for 1) detecting unsound structure, and 2) repairing inconsistent event names.
【Keywords】: business data processing; data handling; graph theory; business actors; computation efficiency; dirty event data cleaning techniques; graph repair approach; inconsistent event name repairing; sequences; structural information; structured event log cleaning; system errors; unsound structure detection; Approximation algorithms; Cleaning; Databases; Insulation; Labeling; Maintenance engineering; Petri nets
【Paper Link】 【Pages】:42-53
【Authors】: El Kindi Rezig ; Eduard C. Dragut ; Mourad Ouzzani ; Ahmed K. Elmagarmid
【Abstract】: Data-intensive Web applications usually require integrating data from Web sources at query time. The sources may refer to the same real-world entity in different ways and some may even provide outdated or erroneous data. An important task is to recognize and merge the records that refer to the same real world entity at query time. Most existing duplicate detection and fusion techniques work in the off-line setting and do not meet the online constraint. There are at least two aspects that differentiate online duplicate detection and fusion from its off-line counterpart. (i) The latter assumes that the entire data is available, while the former cannot make such an assumption. (ii) Several query submissions may be required to compute the “ideal” representation of an entity in the online setting. This paper presents a general framework for the online setting based on an iterative record-based caching technique. A set of frequently requested records is deduplicated off-line and cached for future reference. Newly arriving records in response to a query are deduplicated jointly with the records in the cache, presented to the user and appended to the cache. Experiments with real and synthetic data show the benefit of our solution over traditional record linkage techniques applied to an online setting.
【Keywords】: Internet; data integration; database management systems; iterative methods; query processing; sensor fusion; Web databases; Web sources; data integration; data-intensive Web applications; iterative record-based caching technique; online duplicate detection technique; online duplicate fusion technique; query submissions; query-time record fusion; query-time record linkage; real-world entity; record merging; record recognition; Accuracy; Business; Couplings; Data structures; Indexing
【Paper Link】 【Pages】:54-65
【Authors】: Yazhe Wang ; Baihua Zheng
【Abstract】: Existing works on identity privacy protection in social networks assume that all the user identities in a social network are private, and ignore the fact that many real-world social networks contain a considerable number of users, such as celebrities, media users, and organization users, whose identities are public. In this paper, we demonstrate that the presence of public users can cause serious damage to the identity privacy of other ordinary users. Motivated attackers can utilize the connection information of a user to some known public users to perform re-identification attacks, namely connection fingerprint (CFP) attacks. We propose two k-anonymization algorithms to protect a social network against CFP attacks. One algorithm is based on adding dummy vertices. It can resist powerful attackers with the connection information of a user to the public users within n hops (n ≥ 1) and protect the centrality utility of public users. The other algorithm is based on edge modification. It is only able to resist attackers with the connection information of a user to the public users within 1 hop, but preserves a rich spectrum of network utility. We perform comprehensive experiments on real-world networks and demonstrate that our algorithms are very efficient in terms of running time and are able to generate k-anonymized networks with good utility.
【Keywords】: data privacy; social networking (online); connection fingerprint attacks; dummy vertices; edge modification; k-anonymization algorithms; network utility; privacy protection; social networks; Algorithm design and analysis; Data privacy; Fingerprint recognition; Privacy; Switches; YouTube
【Paper Link】 【Pages】:66-77
【Authors】: An Liu ; Kai Zheng ; Lu Li ; Guanfeng Liu ; Lei Zhao ; Xiaofang Zhou
【Abstract】: Outsourcing databases to clouds is a scalable and cost-effective way for large-scale data storage, management, and query processing. Trajectory data contain rich spatio-temporal relationships and reveal many forms of sensitive personal information (e.g., home address, health condition), which necessitates encrypting them before outsourcing for privacy reasons. However, efficient query processing over encrypted trajectory data is a very challenging task. Though some achievements have been reported very recently for simple queries (e.g., SQL queries, kNN queries) on encrypted data, there is rather limited progress on secure evaluation of trajectory queries because they are more complex and need special treatment. In this paper, we focus on secure trajectory similarity computation, which is the cornerstone of secure trajectory query processing. More specifically, we propose an efficient solution to securely compute the similarity between two encrypted trajectories, which reveals nothing about the trajectories but the final result. We theoretically prove that our solution is secure against the semi-honest adversary model, as all the intermediate information in our protocols can be simulated in polynomial time. Finally, we empirically study the efficiency of the proposed method, which demonstrates the feasibility of our solution.
【Keywords】: cloud computing; computational complexity; cryptographic protocols; data privacy; outsourcing; query processing; storage management; data management; database outsourcing; encrypted trajectory data; large scale data storage; polynomial time; privacy concerns; query processing; secure similarity computation; secure trajectory query evaluation; secure trajectory query processing; secure trajectory similarity computation; spatiotemporal relationships; Encryption; Protocols; Query processing; Trajectory
【Paper Link】 【Pages】:78-88
【Authors】: Erman Pattuk ; Murat Kantarcioglu ; Huseyin Ulusoy ; Bradley Malin
【Abstract】: Big data will enable the development of novel services that enhance a company's market advantage, competitiveness, or productivity. At the same time, the utilization of such a service could disclose sensitive data in the process, which raises significant privacy concerns. To protect individuals, various policies, such as the Code of Fair Information Practices, as well as recent laws, require organizations to capture only the minimal amount of data necessary to support a service. While this is a notable goal, choosing the minimal data is a non-trivial process, especially when considering privacy and utility constraints. In this paper, we introduce a technique to minimize sensitive data disclosure by focusing on privacy-aware feature selection. During model deployment, the service provider requests only a subset of the available features from the client, such that it can produce results with maximal confidence while minimizing its ability to violate a client's privacy. We propose an iterative approach, where the server requests information one feature at a time until the client-specified privacy budget is exhausted. The overall process is dynamic, such that the feature selected at each step depends on the previously selected features and their corresponding values. We demonstrate our technique with three popular classification algorithms and perform an empirical analysis over three real-world datasets to illustrate that, in almost all cases, classifiers that select features using our strategy have the same error rate as state-of-the-art static feature selection methods that fail to preserve privacy.
【Keywords】: Big Data; data privacy; feature selection; pattern classification; Big Data; classifier; empirical analysis; error-rate; iterative approach; privacy constraints; privacy-aware dynamic feature selection; sensitive data disclosure minimization; utility constraints; Data privacy; Decision trees; Measurement; Niobium; Privacy; Probability; Servers
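The iterative request loop described above can be pictured roughly as follows. This is a minimal sketch: the gain, cost, and client_value interfaces are hypothetical placeholders, not the paper's actual scoring functions or classifiers.

```python
def dynamic_feature_selection(gain, cost, budget, client_value):
    """Server-side loop sketch: request one feature at a time until the
    client's privacy budget is exhausted.

    gain(f, known)   -- assumed callback estimating the benefit of feature f
                        given the values revealed so far
    cost[f]          -- assumed privacy cost of disclosing feature f
    client_value(f)  -- assumed client callback that discloses the value of f
    """
    known, spent = {}, 0.0
    candidates = set(cost)
    while candidates:
        affordable = [f for f in candidates if spent + cost[f] <= budget]
        if not affordable:
            break
        f = max(affordable, key=lambda f: gain(f, known))
        known[f] = client_value(f)   # only this feature's value is disclosed
        spent += cost[f]
        candidates.remove(f)
    return known
```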
【Paper Link】 【Pages】:89-100
【Authors】: Xian Li ; Xin Luna Dong ; Kenneth B. Lyons ; Weiyi Meng ; Divesh Srivastava
【Abstract】: Recent research shows that copying is prevalent for Deep-Web data and considering copying can significantly improve truth finding from conflicting values. However, existing copy detection techniques do not scale for large sizes and numbers of data sources, so truth finding can be slowed down by one to two orders of magnitude compared with the corresponding techniques that do not consider copying. In this paper, we study how to improve scalability of copy detection on structured data. Our algorithm builds an inverted index for each shared value and processes the index entries in decreasing order of how much the shared value can contribute to the conclusion of copying. We show how we use the index to prune the data items we consider for each pair of sources, and to incrementally refine our results in iterative copy detection. We also apply a sampling strategy with which we are able to further reduce copy-detection time while still obtaining very similar results as on the whole data set. Experiments on various real data sets show that our algorithm can reduce the time for copy detection by two to three orders of magnitude; in other words, truth finding can benefit from copy detection with very little overhead.
【Keywords】: Internet; copy protection; conflicting values; copy detection techniques; copy-detection time reduction; data pruning; data sources; deep-Web data; index entry processing; inverted index; iterative copy detection; sampling strategy; scalability improvement; shared value; structured data; truth finding improvement; Accuracy; Buildings; Convergence; Distributed databases; Indexes; Knowledge based systems; Scalability
【Paper Link】 【Pages】:101-112
【Authors】: Ling Gu ; Minqi Zhou ; Zhenjie Zhang ; Ming-Chien Shan ; Aoying Zhou ; Marianne Winslett
【Abstract】: In the coming big data era, stress testing IT systems under extreme data volumes is crucial to the adoption of computing technologies in every corner of the cyber world. Appropriately generated benchmark datasets make it possible for administrators to evaluate the capacity of their systems when real datasets are hard to obtain or do not cover extreme cases. Traditional benchmark data generators, however, mainly target producing relation tables of arbitrary size that follow fixed distributions. The output of such generators is insufficient for measuring the stability of an architecture under extremely dynamic and heavy workloads, caused by complicated or hidden factors in the data generation mechanisms of the real world, e.g., dependencies between stocks in the trading market and collaborative human behaviors on social networks. In this paper, we present a new framework, called Chronos, to support new demands on streaming data benchmarking by generating and simulating realistic and fast data streams in an elastic manner. Given a small group of samples with timestamps, Chronos reproduces new data streams with characteristics similar to the samples, preserving column-wise correlations, temporal dependency, and order statistics of the snapshot distributions at the same time. To achieve such realistic requirements, we propose 1) a column decomposition optimization technique to partition the original relation table into small sub-tables with minimal correlation information loss, 2) a generative and extensible model based on Latent Dirichlet Allocation to capture temporal dependency while preserving order statistics of the snapshot distribution, and 3) a new generation and assembling method to efficiently build tuples following the expected distribution on the snapshots. To fulfill the vision of elasticity, we also present a new parallel stream data generation mechanism, facilitating distributed nodes to collaboratively generate tuples with minimal synchronization overhead and excellent load balancing. Our extensive experimental studies on real-world data domains confirm the efficiency and effectiveness of Chronos on stream benchmark generation and simulation.
【Keywords】: Big Data; optimisation; parallel processing; program assemblers; resource allocation; Big Data; Chronos framework; IT systems; assembling method; benchmark data generators; collaborative human behaviors; column decomposition optimization technique; column-wise correlations; cyber world; elastic parallel framework; extreme data volume; fast data streams; latent Dirichlet allocation; load balancing; minimal correlation information loss; minimal synchronization overhead; order statistics; parallel stream data generation mechanism; snapshot distributions; social network; stream benchmark generation; temporal dependency; timestamps; trading market; Benchmark testing; Complexity theory; Computational modeling; Correlation; Distributed databases; Generators
【Paper Link】 【Pages】:113-124
【Authors】: Sai Wu ; Gang Chen ; Xianke Zhou ; Zhenjie Zhang ; Anthony K. H. Tung ; Marianne Winslett
【Abstract】: Various big data management systems have emerged to handle different types of applications, which cast very different demands on storage, indexing and retrieval of large amount of data on distributed file system. Such diversity on demands has raised huge challenges to the design of new generation of data access service for big data. In this paper, we present PABIRS, a unified data access middleware to support mixed workloads. PABIRS encapsulates the underlying distributed file system (DFS) and provides a unified access interface to systems such as MapReduce and key-value stores. PABIRS achieves dramatic improvement on efficiency by employing a novel hybrid indexing scheme. Based on the data distribution, the indexing scheme adaptively builds bitmap index and Log Structured Merge Tree (LSM) index. Moreover, PABIRS distributes the computation to multiple index nodes and utilizes a Pregel-based algorithm to facilitate parallel data search and retrieval. We empirically evaluate PABIRS against other existing distributed data processing systems and verify the huge advantages of PABIRS on shorter response time, higher throughput and better scalability, over big data with real-life phone logs and TPC-H benchmark.
【Keywords】: cloud computing; distributed databases; middleware; LSM index; PABIRS; Pregel-based algorithm; big data management system; bitmap index; cloud computing; data access middleware; distributed file systems; log structured merge tree index; Benchmark testing; Computational modeling; Generators; Indexes; Merging
【Paper Link】 【Pages】:125-136
【Authors】: Akon Dey ; Alan Fekete ; Uwe Röhm
【Abstract】: Typical cloud computing systems provide highly scalable and fault-tolerant data stores that may sacrifice other features, such as general multi-item transaction support. Recently, techniques to implement multi-item transactions in these types of systems have focused on transactions across homogeneous data stores. Since applications access data in heterogeneous storage systems for legacy or interoperability reasons, we propose an approach that enables multi-item transactions with snapshot isolation across multiple heterogeneous data stores using only a minimal set of commonly implemented features such as single-item consistency, conditional updates, and the ability to store additional meta-data. We define a client-coordinated transaction commitment protocol that does not rely on a central coordinating infrastructure. The application can take advantage of the scalability and fault-tolerance characteristics of modern key-value stores and access existing data in them, and also have multi-item transactional access guarantees with little performance impact. We have implemented our design in a Java library called Cherry Garcia (CG), which supports data store abstractions for Windows Azure Storage (WAS), Google Cloud Storage (GCS) and our own high-performance key-value store called Tora.
【Keywords】: Java; cloud computing; distributed databases; meta data; open systems; software fault tolerance; software libraries; software maintenance; transaction processing; Cherry Garcia; Google Cloud Storage; Java library; Tora; Windows Azure Storage; client-coordinated transaction commitment protocol; cloud computing systems; conditional updates; fault-tolerant data stores; general multi item transaction support; heterogeneous storage systems; high-performance key-value store; meta-data; multiple heterogeneous data stores; scalable data stores; scalable distributed transactions; single item consistency; snapshot isolation; Cloud computing; Fault tolerance; Fault tolerant systems; Google; Java; Libraries; Protocols
【Paper Link】 【Pages】:137-148
【Authors】: Muhammad Anis Uddin Nasir ; Gianmarco De Francisci Morales ; David García-Soriano ; Nicolas Kourtellis ; Marco Serafini
【Abstract】: We study the problem of load balancing in distributed stream processing engines, which is exacerbated in the presence of skew. We introduce Partial Key Grouping (PKG), a new stream partitioning scheme that adapts the classical “power of two choices” to a distributed streaming setting by leveraging two novel techniques: key splitting and local load estimation. In so doing, it achieves better load balancing than key grouping while being more scalable than shuffle grouping. We test PKG on several large datasets, both real-world and synthetic. Compared to standard hashing, PKG reduces the load imbalance by up to several orders of magnitude, and often achieves nearly-perfect load balance. This result translates into an improvement of up to 60% in throughput and up to 45% in latency when deployed on a real Storm cluster.
【Keywords】: distributed processing; resource allocation; PKG; distributed stream processing engine; distributed streaming setting; local load estimation; partial key grouping; practical load balancing; standard hashing; stream partitioning scheme; Color; Digital signal processing; Engines; Estimation; Load management; Radiation detectors; Routing
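A minimal sketch of the "power of two choices" routing with key splitting that the abstract describes. The two hash functions and the per-worker load counters stand in for a real stream partitioner (e.g., a custom grouping in Storm); this is not the paper's implementation.

```python
import zlib

class PartialKeyGrouping:
    """Route each key to the less loaded of its two candidate workers.

    Every key is split across at most two workers, trading exact key
    locality for much better load balance than plain hashing.
    """
    def __init__(self, num_workers):
        self.load = [0] * num_workers   # local load estimate: tuples sent so far

    def _candidates(self, key):
        data = key.encode()
        h1 = zlib.crc32(data) % len(self.load)
        h2 = zlib.adler32(data) % len(self.load)
        return h1, h2

    def route(self, key):
        a, b = self._candidates(key)
        chosen = a if self.load[a] <= self.load[b] else b
        self.load[chosen] += 1
        return chosen

pkg = PartialKeyGrouping(num_workers=8)
print([pkg.route(k) for k in ["apple", "apple", "banana", "apple", "pear"]])
```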
【Paper Link】 【Pages】:149-160
【Authors】: Julian Shun ; Kanat Tangwongsan
【Abstract】: Triangle counting and enumeration has emerged as a basic tool in large-scale network analysis, fueling the development of algorithms that scale to massive graphs. Most of the existing algorithms, however, are designed for the distributed-memory setting or the external-memory setting, and cannot take full advantage of a multicore machine, whose capacity has grown to accommodate even the largest of real-world graphs. This paper describes the design and implementation of simple and fast multicore parallel algorithms for exact, as well as approximate, triangle counting and other triangle computations that scale to billions of nodes and edges. Our algorithms are provably cache-friendly, easy to implement in a language that supports dynamic parallelism, such as Cilk Plus or OpenMP, and do not require parameter tuning. On a 40-core machine with two-way hyper-threading, our parallel exact global and local triangle counting algorithms obtain speedups of 17-50x on a set of real-world and synthetic graphs, and are faster than previous parallel exact triangle counting algorithms. We can compute the exact triangle count of the Yahoo Web graph (over 6 billion edges) in under 1.5 minutes. In addition, for approximate triangle counting, we are able to approximate the count for the Yahoo graph to within 99.6% accuracy in under 10 seconds, and for a given accuracy we are much faster than existing parallel approximate triangle counting implementations.
【Keywords】: graph theory; multi-threading; multiprocessing systems; parallel algorithms; search engines; Cilk Plus; OpenMP; Yahoo Web graph; cache-friendly algorithms; distributed-memory system; dynamic parallelism; external-memory system; large-scale network analysis; massive graph scaling; multicore parallel algorithms; multicore triangle computations; parallel approximate triangle counting implementations; parallel exact global triangle counting algorithms; parallel exact local triangle counting algorithms; real-world graphs; synthetic graphs; triangle enumeration; two-way hyper-threading; Algorithm design and analysis; Approximation algorithms; Arrays; Complexity theory; Heuristic algorithms; Instruction sets; Multicore processing
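For reference, here is a compact sequential version of exact triangle counting by intersecting the adjacency lists of degree-ranked vertices; the parallel, cache-friendly variants in the paper build on the same core computation, so this is only a baseline sketch.

```python
from collections import defaultdict

def triangle_count(edges):
    """Count triangles in an undirected graph given as a list of edges.

    Each edge is oriented from its lower-ranked endpoint to its higher-ranked
    endpoint (rank = degree, ties broken by id), so each triangle is counted once.
    """
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    rank = lambda x: (deg[x], x)
    out = defaultdict(set)
    for u, v in edges:
        a, b = (u, v) if rank(u) < rank(v) else (v, u)
        out[a].add(b)
    return sum(len(out[u] & out.get(v, set())) for u in list(out) for v in out[u])

# Example: a 4-clique contains exactly 4 triangles.
print(triangle_count([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]))
```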
【Paper Link】 【Pages】:161-172
【Authors】: Yang Cao ; Wenfei Fan ; Jinpeng Huai ; Ruizhe Huang
【Abstract】: It is cost-prohibitive to find matches Q(G) of a pattern query Q in a big graph G. We approach this by fetching a small subgraph GQ of G such that Q(GQ) = Q(G). We show that many practical patterns are effectively bounded under access constraints A commonly found in real life, such that GQ can be identified in time determined by Q and A only, independent of the size |G| of G. This holds no matter whether pattern queries are localized (e.g., via subgraph isomorphism) or non-localized (graph simulation). We provide algorithms to decide whether a pattern Q is effectively bounded, and if so, to generate a query plan that computes Q(G) by accessing GQ, in time independent of |G|. When Q is not effectively bounded, we give an algorithm to extend access constraints and make Q bounded in G. Using real-life data, we experimentally verify the effectiveness of the approach, e.g., about 60% of queries are effectively bounded for subgraph isomorphism, and for such queries our approach outperforms the conventional methods by 4 orders of magnitude.
【Keywords】: graph theory; query processing; access constraints; big graph; pattern queries; query plan generation; subgraph isomorphism; Adaptation models; Approximation algorithms; Awards activities; Computational modeling; Indexes; Motion pictures; Pattern matching
【Paper Link】 【Pages】:173-184
【Authors】: Ge Luo ; Ke Yi ; Siu-Wing Cheng ; Zhenguo Li ; Wei Fan ; Cheng He ; Yadong Mu
【Abstract】: Given a time series S = ((x1, y1), (x2, y2), ...) and a prescribed error bound ε, the piecewise linear approximation (PLA) problem with max-error guarantees is to construct a piecewise linear function f such that |f(xi)-yi| ≤ ε for all i. In addition, we would like to have an online algorithm that takes the time series as the records arrive in a streaming fashion, and outputs the pieces of f on-the-fly. This problem has applications wherever time series data is being continuously collected, but the data collection device has limited local buffer space and communication bandwidth, so that the data has to be compressed and sent back during the collection process. Prior work addressed two versions of the problem, where either f consists of disjoint segments, or f is required to be a continuous piecewise linear function. In both cases, existing algorithms can produce a function f that has the minimum number of pieces while meeting the prescribed error bound ε. However, we observe that neither minimizes the true representation size of f, i.e., the number of parameters required to represent f. In this paper, we design an online algorithm that generates the optimal PLA in terms of representation size while meeting the prescribed max-error guarantee. Our experiments on many real-world data sets show that our algorithm can reduce the representation size of f by around 15% on average compared with the current best methods, while still requiring O(1) processing time per data record and small space.
【Keywords】: approximation theory; data compression; time series; PLA problem; communication bandwidth; data collection device; data compression; local buffer space; max-error guarantees; online algorithm; piecewise linear approximation; piecewise linear function; representation size; time series data streaming; Algorithm design and analysis; Approximation algorithms; Heuristic algorithms; Joints; Piecewise linear approximation; Programmable logic arrays; Time series analysis
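As a point of reference for the problem statement above, here is a simple greedy online algorithm for the disjoint-segment variant: it anchors each segment at its first point and maintains a feasible slope interval, closing the segment when the interval empties. It meets the max-error bound, but unlike the paper's algorithm it makes no attempt to minimize the representation size.

```python
def greedy_pla(points, eps):
    """Greedy disjoint-segment PLA: each output piece (x0, y0, slope, x_end)
    defines f(x) = y0 + slope * (x - x0) on [x0, x_end] with |f(x_i) - y_i| <= eps.
    Assumes points are sorted by strictly increasing x."""
    segments = []
    i, n = 0, len(points)
    while i < n:
        x0, y0 = points[i]
        lo, hi = float("-inf"), float("inf")
        j = i + 1
        while j < n:
            x, y = points[j]
            new_lo = max(lo, (y - eps - y0) / (x - x0))
            new_hi = min(hi, (y + eps - y0) / (x - x0))
            if new_lo > new_hi:          # no single slope fits point j as well
                break
            lo, hi, j = new_lo, new_hi, j + 1
        slope = 0.0 if lo == float("-inf") else (lo + hi) / 2.0
        segments.append((x0, y0, slope, points[j - 1][0]))
        i = j
    return segments

# Example: a noisy ramp compressed with max error 0.5 yields two pieces.
pts = [(0, 0.0), (1, 1.2), (2, 1.9), (3, 3.1), (4, 8.0), (5, 9.0)]
print(greedy_pla(pts, eps=0.5))
```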
【Paper Link】 【Pages】:185-194
【Authors】: Charu C. Aggarwal ; Philip S. Yu
【Abstract】: In this paper, we examine the problem of historical storage and diagnosis of massive numbers of simultaneous streams. Such streams are common in very large sensor systems which collect many data streams simultaneously. For example, in a typical monitoring application, we may desire to determine specific abnormalities at sensor nodes or diagnose local regions of abnormal behavior. In other applications, a user may wish to query the streams for specific behavior of the data over arbitrary time horizons. This can be a very difficult task if it is not possible to store the voluminous sensor information at different nodes. In many cases, it is only possible to store aggregated data over different nodes. In this paper, we discuss the problem of storage-efficient monitoring and diagnosis of sensor networks with the use of summary representations. The goal of the summary representation is to provide worst-case guarantees on query functions computed over the sensor stream, while storing the streams compactly. We present experimental results on a number of real data sets showing the effectiveness of the approach.
【Keywords】: information networks; query processing; abnormal behavior; query processing; sensor networks diagnosis; sensor streams historical diagnosis; summary representation; Accuracy; Aggregates; Clocks; Context; Estimation; Random variables; Signal resolution
【Paper Link】 【Pages】:195-206
【Authors】: Manas Joglekar ; Hector Garcia-Molina ; Aditya G. Parameswaran
【Abstract】: Evaluating workers is a critical aspect of any crowdsourcing system. In this paper, we devise techniques for evaluating workers by finding confidence intervals on their error rates. Unlike prior work, we focus on “conciseness”-that is, giving as tight a confidence interval as possible. Conciseness is of utmost importance because it allows us to be sure that we have the best guarantee possible on worker error rate. Also unlike prior work, we provide techniques that work under very general scenarios, such as when not all workers have attempted every task (a fairly common scenario in practice), when tasks have non-boolean responses, and when workers have different biases for positive and negative tasks. We demonstrate conciseness as well as accuracy of our confidence intervals by testing them on a variety of conditions and multiple real-world datasets.
【Keywords】: behavioural sciences computing; personnel; comprehensive crowd assessment algorithms; conciseness; confidence intervals; multiple real-world datasets; nonBoolean responses; reliable crowd assessment algorithms; worker error rate; Accuracy; Crowdsourcing; Error analysis; Gold; Random variables; Reliability; Standards
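To make the notion of a confidence interval on a worker's error rate concrete, here is the standard Wilson score interval computed from a worker's answers on gold-standard tasks. It is a textbook binomial interval, not the tighter, bias-aware intervals developed in the paper.

```python
import math

def wilson_interval(errors, n, z=1.96):
    """Wilson score confidence interval for an error rate, given `errors`
    mistakes out of `n` gold-standard tasks (z = 1.96 for ~95% confidence)."""
    if n == 0:
        return (0.0, 1.0)
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# A worker who erred on 8 of 100 gold tasks.
print(wilson_interval(8, 100))
```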
【Paper Link】 【Pages】:207-218
【Authors】: Chuanfei Xu ; Bo Tang ; Man Lung Yiu
【Abstract】: Commercial web search engines adopt parallel and replicated architecture in order to support high query throughput. In this paper, we investigate the effect of caching on the throughput in such a setting. A simple scheme, called uniform caching, would replicate the cache content to all servers. Unfortunately, it does not exploit the variations among queries, thus wasting memory space on caching the same cache content redundantly on multiple servers. To tackle this limitation, we propose a diversified caching problem, which aims to diversify the types of queries served by different servers, and maximize the sharing of terms among queries assigned to the same server. We show that it is NP-hard to find the optimal diversified caching scheme, and identify intuitive properties to seek good solutions. Then we present a framework with a suite of techniques and heuristics for diversified caching. Finally, we evaluate the proposed solution with competitors by using a real dataset and a real query log.
【Keywords】: cache storage; query processing; search engines; NP-hard; optimal diversified caching scheme; parallel architecture; real query log; replicated Web search engines; replicated architecture; Computer architecture; Indexes; Search engines; Servers; Silicon; Throughput; Training
【Paper Link】 【Pages】:219-230
【Authors】: Vasilis Verroios ; Hector Garcia-Molina
【Abstract】: Given a set of records, an Entity Resolution (ER) algorithm finds records that refer to the same real-world entity. Humans can often determine if two records refer to the same entity, and hence we study the problem of selecting questions to ask error-prone humans. We give a Maximum Likelihood formulation for the problem of finding the “most beneficial” questions to ask next. Our theoretical results lead to a lightweight and practical algorithm, bDENSE, for selecting questions to ask humans. Our experimental results show that bDENSE can more quickly reach an accurate outcome, compared to two approaches proposed recently. Moreover, through our experimental evaluation, we identify the strengths and weaknesses of all three approaches.
【Keywords】: data handling; maximum likelihood estimation; question answering (information retrieval); records management; bDENSE; computer algorithms; crowd errors; entity resolution algorithm; error-prone humans; maximum likelihood formulation; question selection; record detection; record finding; strength identification; weakness identification; Erbium
【Paper Link】 【Pages】:231-242
【Authors】: Nguyen Thanh Tam ; Nguyen Quoc Viet Hung ; Matthias Weidlich ; Karl Aberer
【Abstract】: The amount of information available on the Web has been growing dramatically, raising the importance of techniques for searching the Web. Recently, Web Tables emerged as a model, which enables users to search for information in a structured way. However, effective presentation of results for Web Table search requires (1) selecting a ranking of tables that acknowledges the diversity within the search result; and (2) summarizing the information content of the selected tables concisely but meaningful. In this paper, we formalize these requirements as the diversified table selection problem and the structured table summarization problem. We show that both problems are computationally intractable and, thus, present heuristic algorithms to solve them. For these algorithms, we prove salient performance guarantees, such as near-optimality, stability, and fairness. Our experiments with real-world collections of thousands of Web Tables highlight the scalability of our techniques. We achieve improvements up to 50% in diversity and 10% in relevance over baselines for Web Table selection, and reduce the information loss induced by table summarization by up to 50%. In a user study, we observed that our techniques are preferred over alternative solutions.
【Keywords】: Internet; Web searching; Web table search summarization; Web table selection; heuristic algorithms; information content; information search; structured table summarization problem; table selection problem; Africa; Biological system modeling; Data models; Europe
【Paper Link】 【Pages】:243-254
【Authors】: Arko Provo Mukherjee ; Pan Xu ; Srikanta Tirthapura
【Abstract】: We consider mining dense substructures (maximal cliques) from an uncertain graph, which is a probability distribution on a set of deterministic graphs. For a parameter 0 < α < 1, we consider the notion of an α-maximal clique in an uncertain graph. We present matching upper and lower bounds on the number of α-maximal cliques possible within an (uncertain) graph. We present an algorithm to enumerate α-maximal cliques whose worst-case runtime is near-optimal, and an experimental evaluation showing the practical utility of the algorithm.
【Keywords】: data mining; graph theory; probability; deterministic graphs; mining dense substructures; mining maximal cliques; probability distribution; uncertain graph; worst case runtime; Algorithm design and analysis; Communities; Data mining; Moon; Proteins; Runtime; Social network services
【Paper Link】 【Pages】:255-266
【Authors】: Lisi Chen ; Gao Cong ; Xin Cao ; Kian-Lee Tan
【Abstract】: Massive amount of data that are geo-tagged and associated with text information are being generated at an unprecedented scale. These geo-textual data cover a wide range of topics. Users are interested in receiving up-to-date tweets such that their locations are close to a user specified location and their texts are interesting to users. For example, a user may want to be updated with tweets near her home on the topic “food poisoning vomiting.” We consider the Temporal Spatial-Keyword Top-k Subscription (TaSK) query. Given a TaSK query, we continuously maintain up-to-date top-k most relevant results over a stream of geo-textual objects (e.g., geo-tagged Tweets) for the query. The TaSK query takes into account text relevance, spatial proximity, and recency of geo-textual objects in evaluating its relevance with a geo-textual object. We propose a novel solution to efficiently process a large number of TaSK queries over a stream of geotextual objects. We evaluate the efficiency of our approach on two real-world datasets and the experimental results show that our solution is able to achieve a reduction of the processing time by 70-80% compared with two baselines.
【Keywords】: geographic information systems; message passing; middleware; query processing; TaSK query; geo-textual data; temporal spatial-keyword Top-k publish-subscribe; up-to-date tweets; Continuous wavelet transforms; Social network services
【Paper Link】 【Pages】:267-278
【Authors】: Jinling Jiang ; Hua Lu ; Bin Yang ; Bin Cui
【Abstract】: Social network platforms and location-based services are increasingly popular in people's daily lives. The combination of them results in location-based social media where people are connected not only through the friendship in the social network but also by their geographical locations in reality. This duality makes it possible to query and make use of social media data in novel ways. In this work, we formulate a novel and useful problem called top-k local user search (TkLUS for short) from tweets with geo-tags. Given a location q, a distance r, and a set of keywords W, the TkLUS query finds the top-k users who have posted tweets relevant to the desired keywords in W at a place within the distance r from q. TkLUS queries are useful in many application scenarios such as friend recommendation, spatial decision, etc. We design a set of techniques to answer such queries efficiently. First, we propose two local user ranking methods that integrate text relevance and location proximity in a TkLUS query. Second, we construct a hybrid index under a scalable framework, which is aware of keywords as well as locations, to organize high volume geo-tagged tweets. Furthermore, we devise two algorithms for processing TkLUS queries. Finally, we conduct an experimental study using real tweet data sets to evaluate the proposed techniques. The experimental results demonstrate the efficiency, effectiveness and scalability of our proposals.
【Keywords】: geography; mobile computing; query processing; social networking (online); text analysis; TkLUS; geo-tagged social media data; geographical locations; hybrid index; local user ranking method; location proximity; location-based services; location-based social media; query answering; social network platform; text relevance; top-k local user search; Indexing; Instruction sets; Media; Twitter
【Paper Link】 【Pages】:279-290
【Authors】: Lei Chen ; Xin Lin ; Haibo Hu ; Christian S. Jensen ; Jianliang Xu
【Abstract】: Large volumes of geo-tagged text objects are available on the web. Spatial keyword top-k queries retrieve k such objects with the best score according to a ranking function that takes into account a query location and query keywords. In this setting, users may wonder why some known object is unexpectedly missing from a result; and understanding why may aid users in retrieving better results. While spatial keyword querying has been studied intensively, no proposals exist for how to offer users explanations of why such expected objects are missing from results. We provide techniques that allow the revision of spatial keyword queries such that their results include one or more desired, but missing objects. In doing so, we adopt a query refinement approach to provide a basic algorithm that reduces the problem to a two-dimensional geometrical problem. To improve performance, we propose an index-based ranking estimation algorithm that prunes candidate results early. Extensive experimental results offer insight into design properties of the proposed techniques and suggest that they are efficient in terms of both running time and I/O cost.
【Keywords】: Internet; query processing; I/O cost; Web; answering why not questions; geometrical problem; geotagged text objects; index based ranking estimation algorithm; query keywords; query location; query refinement approach; spatial keyword top-k queries; spatial keyword top-k queries retrieval; Algorithm design and analysis; Estimation; Indexes; Object recognition; Query processing; Spatial databases
【Paper Link】 【Pages】:291-302
【Authors】: Ziqiang Feng ; Eric Lo
【Abstract】: Modern CPUs have a word width of 64 bits, but real data values are usually represented using fewer bits than a CPU word. This underutilization of the CPU at the register level has motivated the recent development of bit-parallel algorithms that carry out data processing operations (e.g., filter scans) on CPU words packed with data values (e.g., 8 data values packed into one 64-bit word). Bit-parallel algorithms fully unleash the intra-cycle parallelism of modern CPUs, and they are especially attractive to main-memory column stores whose goal is to process data at the speed of the "bare metal". Main-memory column stores generally focus on analytical queries, where aggregation is a common operation. Current bit-parallel algorithms, however, have not yet covered aggregation. In this paper, we present a suite of bit-parallel algorithms to accelerate all standard aggregation operations: SUM, MIN, MAX, AVG, MEDIAN, and COUNT. The algorithms are designed to fully leverage the intra-cycle parallelism in CPU cores when aggregating words of packed values. Experimental evaluation shows that our bit-parallel aggregation algorithms exhibit significant performance benefits compared with non-bit-parallel methods.
【Keywords】: parallel algorithms; query processing; AVG operation; COUNT operation; CPU word; MAX operation; MEDIAN operation; MIN operation; SUM operation; analytical queries; bare metal; bit-parallel algorithms; data processing operations; data values; intracycle parallelism; main-memory column; register level CPU underutilization; Acceleration; Algorithm design and analysis; Central Processing Unit; Layout; Parallel processing; Registers; Standards
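To illustrate the flavor of SUM over packed values (the general SIMD-within-a-register idea, not the paper's exact algorithms), the sketch below sums eight unsigned 8-bit values packed into one 64-bit word with three mask-shift-add steps instead of eight separate extractions.

```python
MASK64 = (1 << 64) - 1

def pack8(values):
    """Pack eight unsigned 8-bit values into a single 64-bit word."""
    assert len(values) == 8 and all(0 <= v < 256 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (8 * i)
    return word & MASK64

def sum_packed8(word):
    """Sum the eight 8-bit lanes of a 64-bit word with three add steps."""
    m1 = 0x00FF00FF00FF00FF
    m2 = 0x0000FFFF0000FFFF
    m3 = 0x00000000FFFFFFFF
    x = (word & m1) + ((word >> 8) & m1)    # 8-bit lanes -> 16-bit partial sums
    x = (x & m2) + ((x >> 16) & m2)         # 16-bit -> 32-bit partial sums
    x = (x & m3) + ((x >> 32) & m3)         # 32-bit -> final 64-bit sum
    return x & MASK64

vals = [3, 200, 7, 0, 99, 1, 64, 255]
assert sum_packed8(pack8(vals)) == sum(vals)
print(sum_packed8(pack8(vals)))   # 629
```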
【Paper Link】 【Pages】:303-314
【Authors】: Yongming Luo ; George H. L. Fletcher ; Jan Hidders ; Paul De Bra
【Abstract】: Computing containment relations between massive collections of sets is a fundamental operation in data management, for example in graph analytics and data mining applications. Motivated by recent hardware trends, in this paper we present two novel solutions for computing set-containment joins over massive sets: the Patricia Trie-based Signature Join (PTSJ) and PRETTI+, a Patricia trie enhanced extension of the state-of-the-art PRETTI join. The compact trie structure not only enables efficient use of main-memory, but also significantly boosts the performance of both approaches. By carefully analyzing the algorithms and conducting extensive experiments with various synthetic and real-world datasets, we show that, in many practical cases, our algorithms are an order of magnitude faster than the state-of-the-art.
【Keywords】: data analysis; data structures; information retrieval; storage management; PRETTI join; PRETTI+; PTSJ; Patricia trie-based signature join; computing set containment relation; data management; data mining; graph analytics; scalable trie-based algorithm; Algorithm design and analysis; Data mining; Data structures; Estimation; Hardware; Indexes; Market research
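For orientation, a baseline set-containment join using only an inverted index (roughly the PRETTI idea without the Patricia trie) looks like this; for each probe set, the candidates are the intersection of the posting lists of its elements. The trie-based optimizations of PTSJ and PRETTI+ are not modeled here.

```python
from collections import defaultdict

def containment_join(R, S):
    """Find pairs (i, j) with R[i] a subset of S[j], via an inverted index on S."""
    inv = defaultdict(set)
    for j, s in enumerate(S):
        for e in s:
            inv[e].add(j)
    result = []
    for i, r in enumerate(R):
        if not r:
            result.extend((i, j) for j in range(len(S)))   # empty set is in every set
            continue
        lists = sorted((inv.get(e, set()) for e in r), key=len)
        cand = set(lists[0])            # intersect posting lists, rarest element first
        for l in lists[1:]:
            cand &= l
            if not cand:
                break
        result.extend((i, j) for j in cand)
    return result

R = [{1, 2}, {2, 5}]
S = [{1, 2, 3}, {2, 4, 5}, {1, 2, 5}]
print(sorted(containment_join(R, S)))   # [(0, 0), (0, 2), (1, 1), (1, 2)]
```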
【Paper Link】 【Pages】:315-326
【Authors】: Renata Borovica-Gajic ; Stratos Idreos ; Anastasia Ailamaki ; Marcin Zukowski ; Campbell Fraser
【Abstract】: Query optimizers depend heavily on statistics representing column distributions to create efficient query plans. In many cases, though, statistics are outdated or non-existent, and the process of refreshing statistics is very expensive, especially for ad-hoc workloads on ever bigger data. This results in suboptimal plans that severely hurt performance. The main problem is that any decision, once made by the optimizer, is fixed throughout the execution of a query. In particular, each logical operator translates into a fixed choice of a physical operator at run-time. In this paper, we advocate for continuous adaptation and morphing of physical operators throughout their lifetime, by adjusting their behavior in accordance with the statistical properties of the data. We demonstrate the benefits of the new paradigm by designing and implementing an adaptive access path operator called Smooth Scan, which morphs continuously within the space of traditional index access and full table scan. Smooth Scan behaves similarly to an index scan for low selectivity; if selectivity increases, however, Smooth Scan progressively morphs its behavior toward a sequential scan. As a result, a system with Smooth Scan requires no access path decisions up front nor does it need accurate statistics to provide good performance. We implement Smooth Scan in PostgreSQL and, using both synthetic benchmarks as well as TPC-H, we show that it achieves robust performance while at the same time being statistics-oblivious.
【Keywords】: query processing; relational databases; statistical analysis; PostgreSQL; Smooth Scan; TPC-H; ad-hoc workloads; adaptive access path operator; behavior morphs; column distributions; continuous adaptation; fixed throughout; full-table scan; index access; index scan; logical operator; physical operator morphing; query execution; query optimizers; query plans; sequential scan; statistical properties; statistics; statistics-oblivious access paths; suboptimal plans; synthetic benchmarks; Complexity theory; Estimation; Indexes; Probes; Query processing; Robustness; Switches
【Paper Link】 【Pages】:327-338
【Authors】: Hina A. Khan ; Mohamed A. Sharaf
【Abstract】: In Data Exploration platforms, diversification has become an essential method for extracting representative data, which provide users with a concise and meaningful view of the results to their queries. However, the benefits of diversification are achieved at the expense of an additional cost for the post-processing of query results. For high dimensional large result sets, the cost of diversification is further escalated due to massive distance computations required to evaluate the similarity between results. To address that challenge, in this paper we propose the Progressive Data Diversification (pDiverse) scheme. The main idea underlying pDiverse is to utilize partial distance computation to reduce the amount of processed data. Our extensive experimental results on both synthetic and real data sets show that our proposed scheme outperforms existing diversification methods in terms of both I/O and CPU costs.
【Keywords】: data handling; query processing; column-based data exploration platforms; pDiverse scheme; progressive data diversification scheme; progressive diversification; query processing; Data mining; Euclidean distance; Handheld computers; Heuristic algorithms; Memory; Query processing
【Paper Link】 【Pages】:339-350
【Authors】: Zhe Fan ; Byron Choi ; Jianliang Xu ; Sourav S. Bhowmick
【Abstract】: One fundamental type of query for graph databases is subgraph isomorphism queries (a.k.a subgraph queries). Due to the computational hardness of subgraph queries coupled with the cost of managing massive graph data, outsourcing the query computation to a third-party service provider has been an economical and scalable approach. However, confidentiality is known to be an important attribute of Quality of Service (QoS) in Query as a Service (QaaS). In this paper, we propose the first practical private approach for subgraph query services, asymmetric structure-preserving subgraph query processing, where the data graph is publicly known and the query structure/topology is kept secret. Unlike other previous methods for subgraph queries, this paper proposes a series of novel optimizations that only exploit graph structures, not the queries. Further, we propose a robust query encoding and adopt the novel cyclic group based encryption so that query processing is transformed into a series of private matrix operations. Our experiments confirm that our techniques are efficient and the optimizations are effective.
【Keywords】: graph theory; matrix algebra; optimisation; query processing; QaaS; asymmetric structure-preserving subgraph query processing; graph databases; novel cyclic group based encryption; novel optimizations; private matrix operations; quality of service; query as a service; robust query encoding; subgraph isomorphism queries; third-party service provider; Cascading style sheets; Computational modeling; Encoding; Encryption; Optimization; Privacy
【Paper Link】 【Pages】:351-362
【Authors】: Guanfeng Liu ; Kai Zheng ; Yan Wang ; Mehmet A. Orgun ; An Liu ; Lei Zhao ; Xiaofang Zhou
【Abstract】: Graph Pattern Matching (GPM) plays a significant role in social network analysis, and has been widely used in, for example, expert finding, social community mining, and social position detection. Given a pattern graph GQ and a data graph GD, a GPM algorithm finds those subgraphs, GM, that match GQ in GD. However, existing GPM methods do not consider multiple constraints on the edges of GQ, which commonly exist in various applications such as crowdsourcing travel, social-network-based e-commerce, and study group selection. In this paper, we first conceptually extend Bounded Simulation to Multi-Constrained Simulation (MCS), and propose a novel NP-Complete Multi-Constrained Graph Pattern Matching (MC-GPM) problem. Then, to address the efficiency issue in large-scale MC-GPM, we propose a new concept called Strong Social Component (SSC), consisting of participants with strong social connections. We propose an approach to identify SSCs, along with a novel index method and a graph compression method for SSCs. Moreover, we devise a heuristic algorithm to identify MC-GPM results effectively and efficiently without decompressing graphs. An extensive empirical study on five real-world large-scale social graphs has demonstrated the effectiveness, efficiency, and scalability of our approach.
【Keywords】: computational complexity; graph theory; pattern matching; social networking (online); GPM algorithm; NP-complete multi-constrained graph pattern matching; graph compression method; heuristic algorithm; index method; large-scale contextual social graphs; social network analysis; strong social component; subgraphs; Context; Data mining; Data models; Indexes; Pattern matching; Social network services; Time complexity
【Paper Link】 【Pages】:363-374
【Authors】: Peter Macko ; Virendra J. Marathe ; Daniel W. Margo ; Margo I. Seltzer
【Abstract】: We present LLAMA, a graph storage and analysis system that supports mutability and out-of-memory execution. LLAMA performs comparably to immutable main-memory analysis systems for graphs that fit in memory and significantly outperforms existing out-of-memory analysis systems for graphs that exceed main memory. LLAMA bases its implementation on the compressed sparse row (CSR) representation, which is a read-only representation commonly used for graph analytics. We augment this representation to support mutability and persistence using a novel implementation of multi-versioned array snapshots, making it ideal for applications that receive a steady stream of new data, but need to perform whole-graph analysis on consistent views of the data. We compare LLAMA to state-of-the-art systems on representative graph analysis workloads, showing that LLAMA scales well both out-of-memory and across parallel cores. Our evaluation shows that LLAMA's mutability introduces modest overheads of 3-18% relative to immutable CSR for in-memory execution and that it outperforms state-of-the-art out-of-memory systems in most cases, with a best case improvement of 5x on breadth-first-search.
【Keywords】: tree data structures; tree searching; CSR representation; LLAMA; analysis system; breadth-first-search; compressed sparse row; efficient graph analytics; graph storage; large multiversioned arrays; multiversioned array snapshots; mutability; out-of-memory analysis systems; out-of-memory execution; out-of-memory systems; representative graph analysis workloads; Arrays; Engines; Indexes; Memory management; Merging; Periodic structures; Writing
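For readers unfamiliar with the compressed sparse row layout that LLAMA extends, a minimal construction and neighbor lookup is shown below. This is plain CSR only; the paper's multi-versioned snapshots and mutability layers are not modeled.

```python
def build_csr(n, edges):
    """Build a compressed sparse row (CSR) adjacency from a directed edge list."""
    deg = [0] * n
    for u, _ in edges:
        deg[u] += 1
    offsets = [0] * (n + 1)
    for u in range(n):
        offsets[u + 1] = offsets[u] + deg[u]
    targets = [0] * len(edges)
    cursor = offsets[:-1]               # next write position for each vertex
    for u, v in edges:
        targets[cursor[u]] = v
        cursor[u] += 1
    return offsets, targets

def neighbors(offsets, targets, u):
    """Out-neighbors of u are the contiguous slice targets[offsets[u]:offsets[u+1]]."""
    return targets[offsets[u]:offsets[u + 1]]

offsets, targets = build_csr(4, [(0, 1), (0, 2), (1, 2), (2, 3)])
print(neighbors(offsets, targets, 0))   # [1, 2]
```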
【Paper Link】 【Pages】:375-386
【Authors】: Xiaocheng Huang ; Zhuowei Bao ; Susan B. Davidson ; Tova Milo ; Xiaojie Yuan
【Abstract】: This paper proposes a novel approach for efficiently evaluating regular path queries over provenance graphs of workflows that may include recursion. The approach assumes that an execution g of a workflow G is labeled with query-agnostic reachability labels using an existing technique. At query time, given g, G and a regular path query R, the approach decomposes R into a set of subqueries R1, ..., Rk that are safe for G. For each safe subquery Ri, G is rewritten so that, using the reachability labels of nodes in g, whether or not there is a path which matches Ri between two nodes can be decided in constant time. The results of each safe subquery are then composed, possibly with some small unsafe remainder, to produce an answer to R. The approach results in an algorithm that significantly reduces the number of subqueries k over existing techniques by increasing their size and complexity, and that evaluates each subquery in time bounded by its input and output size. Experimental results demonstrate the benefit of this approach.
【Keywords】: query processing; reachability analysis; input size; output size; query complexity; query size; query time; query-agnostic reachability labels; reachability labels; recursion; regular path query; regular path query answering; safe subquery; time bounded subquery; workflow execution; workflow provenance graphs; Context modeling; Decoding; Grammar; Labeling; Polynomials; Production; XML
【Paper Link】 【Pages】:387-398
【Authors】: Long Yuan ; Lu Qin ; Xuemin Lin ; Lijun Chang ; Wenjie Zhang
【Abstract】: Maximal clique enumeration is a fundamental problem in graph theory and has been extensively studied. However, maximal clique enumeration is time-consuming in large graphs and always returns an enormous number of cliques with large overlaps. Motivated by this, in this paper, we study the diversified top-k clique search problem, which is to find top-k maximal cliques that cover the largest number of nodes in the graph. Diversified top-k clique search can be widely used in many applications, including community search, motif discovery, and anomaly detection in large graphs. A naive solution for diversified top-k clique search is to keep all maximal cliques in memory and then find k of them that cover the most nodes in the graph by using the approximate greedy max k-cover algorithm. However, such a solution is impractical when the graph is large. In this paper, instead of keeping all maximal cliques in memory, we devise an algorithm to maintain k candidates in the process of maximal clique enumeration. Our algorithm has a limited memory footprint and can achieve a guaranteed approximation ratio. We also introduce a novel light-weight PNP-Index, based on which we design an optimal maximal clique maintenance algorithm. We further explore three optimization strategies to avoid enumerating all maximal cliques and thus largely reduce the computational cost. We conduct extensive performance studies on six real graphs, one of which contains 0.3 billion edges, and the results demonstrate the high efficiency and effectiveness of our approach.
【Keywords】: approximation theory; graph theory; search problems; anomaly detection; approximate greedy max k-cover algorithm; community search; diversified top-k clique search problem; graph nodes; graph theory; maximal clique enumeration; motif discovery; optimal maximal clique maintenance algorithm; top-k maximal cliques; Algorithm design and analysis; Approximation algorithms; Approximation methods; Communities; Computational efficiency; Optimization; Search problems
【Paper Link】 【Pages】:399-410
【Authors】: Pericles de Oliveira ; Altigran Soares da Silva ; Edleno Silva de Moura
【Abstract】: Relational keyword search (R-KwS) systems based on schema graphs take the keywords from the input query, find the tuples and tables where these keywords occur and look for ways to “connect” these keywords using information on referential integrity constraints, i.e., key/foreign key pairs. The result is a number of expressions, called Candidate Networks (CNs), which join relations where keywords occur in a meaningful way. These CNs are then evaluated, resulting in a number of join networks of tuples (JNTs) that are presented to the user as ranked answers to the query. As the number of CNs is potentially very high, handling them is very demanding, both in terms of time and resources, so that, for certain queries, current systems may take too long to produce answers, and for others they may even fail to return results (e.g., by exhausting memory). Moreover, the quality of the CN evaluation may be compromised when a large number of CNs is processed. Based on observations made by other researchers and on our own findings on representative workloads, we argue that, although the number of possible Candidate Networks can be very high, only very few of them produce answers relevant to the user and are indeed worth processing. Thus, R-KwS systems can greatly benefit from methods for assessing the relevance of Candidate Networks, so that only those deemed relevant might be evaluated. We propose in this paper an approach for ranking CNs based on their probability of producing relevant answers to the user. This relevance is estimated based on the current state of the underlying database using a probabilistic Bayesian model we have developed. Experiments that we performed indicate that this model is able to place the relevant CNs among the top-4 in the ranking produced. In these experiments we also observed that processing only a few relevant CNs has a considerable positive impact, not only on the performance of processing keyword queries, but also on the quality of the results obtained.
【Keywords】: Bayes methods; graph theory; probability; query processing; relational databases; CN evaluation; JNTs; R-KwS systems; candidate network ranking; input query; join networks of tuples; keyword query processing; probabilistic Bayesian model; referential integrity constraints; relational databases; relational keyword search system; schema graphs; Algebra; Bayes methods; Indexes; Joints; Probabilistic logic; Relational databases
【Paper Link】 【Pages】:411-422
【Authors】: Mehdi Kargar ; Aijun An ; Nick Cercone ; Parke Godfrey ; Jaroslaw Szlichta ; Xiaohui Yu
【Abstract】: Keyword search over relational databases offers an alternative to SQL for querying and exploring databases that is effective for lay users who may not be well versed in SQL or the database schema. This becomes more pertinent for databases with large and complex schemas. An answer in this context is a join tree spanning tuples that contain the query's keywords. As there are potentially many answers to the query, and the user is often only interested in seeing the top-k answers, how to rank the answers based on their relevance is of paramount importance. We focus on the relevance of joins as the fundamental means to rank answers. We devise means to measure the relevance of relations and foreign keys in the schema over the information content of the database. This can be done offline with no need for external models. We compare the proposed measures against a gold standard we derive from a real workload over TPC-E and evaluate the effectiveness of our methods. Finally, we test the performance of our measures against existing techniques to demonstrate a marked improvement, and perform a user study to establish the naturalness of the ranking of the answers.
【Keywords】: SQL; query processing; relational databases; trees (mathematics); SQL; TPC-E; answer ranking; complex schema; database querying; foreign keys; join tree spanning tuples; keyword search; large schema; query answering; relation relevance measurement; relational databases; Companies; Gold; Indexes; Keyword search; Relational databases; Security
【Paper Link】 【Pages】:423-434
【Authors】: Kai Zheng ; Han Su ; Bolong Zheng ; Shuo Shang ; Jiajie Xu ; Jiajun Liu ; Xiaofang Zhou
【Abstract】: Conventional top-k spatial keyword queries require users to explicitly specify their preferences between spatial proximity and keyword relevance. In this work we investigate how to eliminate this requirement by enhancing the conventional queries with interaction, resulting in Interactive Top-k Spatial Keyword (ITkSK) query. Having confirmed the feasibility by theoretical analysis, we propose a three-phase solution focusing on both effectiveness and efficiency. The first phase substantially narrows down the search space for subsequent phases by efficiently retrieving a set of geo-textual k-skyband objects as the initial candidates. In the second phase three practical strategies for selecting a subset of candidates are developed with the aim of maximizing the expected benefit for learning user preferences at each round of interaction. Finally we discuss how to determine the termination condition automatically and estimate the preference based on the user's feedback. Empirical study based on real PoI datasets verifies our theoretical observation that the quality of top-k results in spatial keyword queries can be greatly improved through only a few rounds of interactions.
【Keywords】: query processing; set theory; visual databases; ITkSK query; PoI datasets; geotextual k-skyband objects; interactive top-k spatial keyword queries; keyword relevance; spatial proximity; termination condition; three-phase solution; user feedback; user preferences; Australia; Business; Context; Databases; Music; Search problems; Weight measurement
【Paper Link】 【Pages】:435-446
【Authors】: Arvind Arasu ; Ken Eguro ; Manas Joglekar ; Raghav Kaushik ; Donald Kossmann ; Ravi Ramamurthy
【Abstract】: Cipherbase is a comprehensive database system that provides strong end-to-end data confidentiality through encryption. Cipherbase is based on a novel architecture that combines an industrial strength database engine (SQL Server) with lightweight processing over encrypted data that is performed in secure hardware. The overall architecture provides significant benefits over the state-of-the-art in terms of security, performance, and functionality. This paper presents a prototype of Cipherbase that uses FPGAs to provide secure processing and describes the system engineering details implemented to achieve competitive performance for transactional workloads. This includes hardware-software co-design issues (e.g. how to best offer parallelism), optimizations to hide the latency between the secure hardware and the main system, and techniques to cope with space inefficiencies. All these optimizations were carefully designed not to affect end-to-end data confidentiality. Our experiments with the TPC-C benchmark show that in the worst case when all data are strongly encrypted, Cipherbase achieves 40% of the throughput of plaintext SQL Server. In more realistic cases, if only critical data such as customer names are encrypted, the Cipherbase throughput is more than 90% of plaintext SQL Server.
【Keywords】: SQL; cryptography; database management systems; field programmable gate arrays; hardware-software codesign; optimisation; transaction processing; FPGA; TPC-C benchmark; cipherbase; comprehensive database system; encryption; hardware-software codesign issues; industrial strength database engine; latency hiding; optimizations; plaintext SQL Server; strong end-to-end data confidentiality; system engineering details; transaction processing; transactional workloads; Encryption; Hardware; Indexes; Servers
【Paper Link】 【Pages】:447-458
【Authors】: Lukas Kircher ; Michael Grossniklaus ; Christian Grün ; Marc H. Scholl
【Abstract】: In order to manage XML documents, native XML databases use specific encodings that map the hierarchical structure of a document to a flat representation. Several encodings have been proposed that differ in terms of their support for certain query workloads. While some encodings are optimized for query processing, others focus on data manipulation. For example, the Pre/Dist/Size XML encoding has been designed to support queries over all XPath axes efficiently, but processing atomic updates in XML documents can be costly. In this paper, we present a technique, so-called structural bulk updates, that works in concert with the XQuery Update Facility to support efficient updates on the Pre/Dist/Size encoding. We demonstrate the benefits of our technique in a detailed performance evaluation based on the XMark benchmark.
【Keywords】: XML; query processing; XML documents; XMark benchmark; XPath; XQuery update facility; data manipulation; efficient structural bulk updates; pre-dist-size XML encoding; query processing; Encoding
【Paper Link】 【Pages】:459-470
【Authors】: Michael Shekelyan ; Gregor Jossé ; Matthias Schubert
【Abstract】: In many graph applications, computing cost-optimal paths between two locations is an important task for routing and distance computation. Depending on the network multiple cost criteria might be of interest. Examples are travel time, energy consumption and toll fees in road networks. Path skyline queries compute the set of pareto optimal paths between two given locations. However, the number of skyline paths increases exponentially with the distance between the locations and the number of cost criteria. Thus, the result set might be too big to be of any use. In this paper, we introduce multicriteria linear path skyline queries. A linear path skyline is the subset of the conventional path skyline where the paths are optimal under a linear combination of their cost values. We argue that cost vectors being optimal with respect to a weighted sum are intuitive to understand and therefore, more interesting in many cases. We show that linear path skylines are convex hulls of an augmented solution space and propose an algorithm which utilizes this observation to efficiently compute the complete linear path skyline. To further control the size of the result set, we introduce an approximate version of our algorithm guaranteeing a certain level of optimality for each possible weighting. In our experimental evaluation, we show that our approach computes linear path skylines significantly faster than previous approaches, including those computing the complete path skyline.
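To make the linear skyline notion concrete, the following toy sketch (not the paper's convex-hull algorithm) checks membership for two cost criteria by sweeping a grid of weights: a cost vector belongs to the linear path skyline only if it minimizes some weighted sum of its costs.
    def linear_skyline_2d(cost_vectors, steps=1000):
        # Collect every vector that wins w*c1 + (1-w)*c2 for some sampled weight w.
        winners = set()
        for i in range(steps + 1):
            w = i / steps
            winners.add(min(cost_vectors, key=lambda c: w * c[0] + (1 - w) * c[1]))
        return sorted(winners)
    paths = [(10, 50), (40, 10), (28, 28)]   # e.g., (travel time, energy) per path
    print(linear_skyline_2d(paths))
    # (28, 28) is Pareto-optimal but never wins a weighted sum, so it is excluded.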
【Keywords】: Pareto optimisation; graph theory; network theory (graphs); query processing; set theory; Pareto optimal paths; cost-optimal path computation; distance computation; energy consumption; graph applications; multicriteria linear path skyline queries; multicriteria networks; road networks; toll fees; Approximation algorithms; Cost function; Energy consumption; Pareto optimization; Roads; Routing; Three-dimensional displays
【Paper Link】 【Pages】:471-482
【Authors】: Martin Kaufmann ; Peter M. Fischer ; Norman May ; Chang Ge ; Anil K. Goel ; Donald Kossmann
【Abstract】: Following the adoption of basic temporal features in the SQL:2011 standard, there has been a tremendous interest within the database industry in supporting bi-temporal features, as a significant number of real-life workloads would greatly benefit from efficient temporal operations. However, current implementations of bi-temporal storage systems and operators are far from optimal. In this paper, we present the Bi-temporal Timeline Index, which supports a broad range of temporal operators and exploits the special properties of an in-memory column store database system. Comprehensive performance experiments with the TPC-BiH benchmark show that algorithms based on the Bi-temporal Timeline Index outperform significantly both existing commercial database systems and state-of-the-art data structures from research.
【Keywords】: SQL; data structures; database management systems; query processing; SQL; TPC-BiH benchmark; bitemporal data; bitemporal features; bitemporal storage systems; bitemporal timeline index; commercial database systems; data structure; database industry; inmemory column store database system; query processing; temporal operations; temporal operators; Cities and towns; Data structures; Indexing; Maintenance engineering
【Paper Link】 【Pages】:483-494
【Authors】: Rong Zhang ; Zhenjie Zhang ; Xiaofeng He ; Aoying Zhou
【Abstract】: With the prosperity of online services enabled by Web 2.0, huge amounts of human-generated commentary data are now available on the Internet, covering a wide range of domains and products. Such comments contain valuable information for other customers, but are usually difficult to utilize due to the lack of a common description structure, the complexity of opinion expression and the fast-growing data volume. Comment-based restaurant summarization is even more challenging than summarization for other types of products and services, as users' comments on restaurants are usually mixed with opinions on different dishes but attached to only one overall evaluation score for the whole experience with the restaurant. It is thus crucial to distinguish well-made dishes from lousy ones by mining the comment archive, in order to generate meaningful and useful summaries for other potential customers. This paper presents a novel approach to the problem of restaurant comment summarization, with a core technique based on a new bilateral topic analysis model over the commentary text data. In the bilateral topic model, the attributes discussed in the comments on the dishes and the users' evaluations of the attributes are considered as two independent dimensions in the latent space. Combined with new opinionated word extraction and clustering-based representation selection algorithms, our new analysis technique is effective in generating high-quality summaries using representative snippets from the text comments. We evaluate our proposals on two real-world comment archives crawled from the most popular English and Chinese online restaurant review web sites, Yelp and Dianping. The experimental results verify that our proposals improve summarization quality by a large margin over baseline approaches in the literature.
【Keywords】: Internet; catering industry; computational complexity; text analysis; Chinese online restaurant review Web sites; English online restaurant review Web sites; Internet; Web 2.0; bilateral topic analysis; clustering-based representation selection algorithm; comment-based restaurant summarization; dish comment summarization; fast growing data volume; high-quality summary; online services; opinion expression complexity; opinionated word extraction; representative snippets; Algorithm design and analysis; Analytical models; Correlation; Data mining; Hidden Markov models; Proposals; Vocabulary
【Paper Link】 【Pages】:495-506
【Authors】: Wen Hua ; Zhongyuan Wang ; Haixun Wang ; Kai Zheng ; Xiaofang Zhou
【Abstract】: Understanding short texts is crucial to many applications, but challenges abound. First, short texts do not always observe the syntax of a written language. As a result, traditional natural language processing methods cannot be easily applied. Second, short texts usually do not contain sufficient statistical signals to support many state-of-the-art approaches for text processing such as topic modeling. Third, short texts are usually more ambiguous. We argue that knowledge is needed in order to better understand short texts. In this work, we use lexical-semantic knowledge provided by a well-known semantic network for short text understanding. Our knowledge-intensive approach disrupts traditional methods for tasks such as text segmentation, part-of-speech tagging, and concept labeling, in the sense that we focus on semantics in all these tasks. We conduct a comprehensive performance evaluation on real-life data. The results show that knowledge is indispensable for short text understanding, and our knowledge-intensive approaches are effective in harvesting semantics of short texts.
【Keywords】: natural language processing; statistical analysis; text analysis; concept labeling; knowledge-intensive approach; lexical-semantic analysis; natural language processing method; part-of-speech tagging; short text harvesting semantics; short text understanding; statistical signals; text processing; text segmentation; topic modeling; Approximation algorithms; Companies; Context; Labeling; Semantics; Tagging; Vocabulary
【Paper Link】 【Pages】:507-518
【Authors】: Georgia Koutrika ; Lei Liu ; Steven J. Simske
【Abstract】: Given a document collection, existing systems allow users to browse the collection or perform searches that return lists of documents ranked based on their relevance to the user query. While these approaches work fine when a user is trying to locate specific documents, they are insufficient when users need to access the pertinent documents in some logical order, for example for learning or editorial purposes. We present a system that automatically organizes a collection of documents in a tree from general to more specific documents, and allows a user to choose a reading sequence over the documents. This is a novel approach to content consumption that departs from the typical ranked lists of documents based on their relevance to a user query and from static navigational interfaces. We present a set of algorithms that solve the problem and we evaluate their performance as well as the reading trees they generate.
【Keywords】: document handling; query processing; trees (mathematics); user interfaces; content consumption; document collection; reading order generation; reading trees; static navigational interface; user query; Calibration; Data mining; Entropy; Navigation; Search engines; Sequential analysis
【Paper Link】 【Pages】:519-530
【Authors】: Jin Wang ; Guoliang Li ; Dong Deng ; Yong Zhang ; Jianhua Feng
【Abstract】: String similarity search is a fundamental operation in data cleaning and integration. It has two variants, threshold-based string similarity search and top-k string similarity search. Existing algorithms are efficient for either the former or the latter; most of them cannot support both variants. To address this limitation, we propose a unified framework. We first recursively partition strings into disjoint segments and build a hierarchical segment tree index (HS-Tree) on top of the segments. Then we utilize the HS-Tree to support similarity search. For threshold-based search, we identify appropriate tree nodes based on the threshold to answer the query and devise an efficient algorithm (HS-Search). For top-k search, we identify promising strings that are likely to be similar to the query, utilize these strings to estimate an upper bound which is used to prune dissimilar strings, and propose an algorithm (HS-Topk). We also develop effective pruning techniques to further improve the performance. Experimental results on real-world datasets show our method achieves high performance on the two problems and significantly outperforms state-of-the-art algorithms.
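The following sketch (illustrative, not the HS-Tree itself) shows the pigeonhole-style segment filter that underlies threshold-based search over disjoint segments: if a data string is within edit distance tau of the query, at least one of its tau+1 disjoint segments must occur verbatim in the query, so strings with no matching segment can be pruned before expensive verification.
    def segments(s, parts):
        # Split s into `parts` disjoint segments of (roughly) equal length.
        step = max(1, len(s) // parts)
        return [s[i * step:(i + 1) * step] if i < parts - 1 else s[(parts - 1) * step:]
                for i in range(parts)]
    def survives_filter(query, data_string, tau):
        # Keep data_string as a candidate only if one of its segments occurs in the query.
        return any(seg and seg in query for seg in segments(data_string, tau + 1))
    print(survives_filter("similariti", "similarity", 1))    # True: kept for verification
    print(survives_filter("similariti", "dissimilar!!", 1))  # False: pruned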
【Keywords】: Blogs; Heuristic algorithms; Indexes; Partitioning algorithms; Search problems; Silicon; Upper bound
【Paper Link】 【Pages】:531-542
【Authors】: Saad Aljubayrin ; Jianzhong Qi ; Christian S. Jensen ; Rui Zhang ; Zhen He ; Zeyi Wen
【Abstract】: We define and study Euclidean and spatial network variants of a new path finding problem: given a set of safe zones, find paths that minimize the distance traveled outside the safe zones. In this problem, the entire space with the exception of the safe zones is unsafe, but passable, and it differs from problems that involve unsafe regions to be strictly avoided. As a result, existing algorithms are not effective solutions to the new problem. To solve the Euclidean variant, we devise a transformation of the continuous data space with safe zones into a discrete graph upon which shortest path algorithms apply. A naive transformation yields a very large graph that is expensive to search. In contrast, our transformation exploits properties of hyperbolas in the Euclidean space to safely eliminate graph edges, thus improving performance without affecting the shortest path results. To solve the spatial network variant, we propose a different graph-to-graph transformation that identifies critical points that serve the same purpose as do the hyperbolas, thus avoiding the creation of extraneous edges. This transformation can be extended to support a weighted version of the problem, where travel in safe zones has non-zero cost. We conduct extensive experiments using both real and synthetic data. The results show that our approaches outperform baseline approaches by more than an order of magnitude in graph construction time, storage space and query response time.
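A simplified sketch of the Euclidean variant, under the extra assumption (made here for illustration, not by the paper) that safe zones are circular: model each zone as a node, weight the edge between two zones by the straight-line gap outside both circles, add source and target as zero-radius zones, and run Dijkstra to minimize the distance traveled outside safe zones.
    import heapq, math
    def min_unsafe_distance(source, target, zones):
        # zones: list of (x, y, r) disks; source/target: (x, y) points.
        nodes = [(source[0], source[1], 0.0), (target[0], target[1], 0.0)] + list(zones)
        def gap(a, b):   # unsafe distance between two disks (0 if they touch or overlap)
            return max(0.0, math.hypot(a[0] - b[0], a[1] - b[1]) - a[2] - b[2])
        dist, heap = {0: 0.0}, [(0.0, 0)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == 1:                      # reached the target node
                return d
            if d > dist.get(u, math.inf):
                continue
            for v in range(len(nodes)):
                if v != u and d + gap(nodes[u], nodes[v]) < dist.get(v, math.inf):
                    dist[v] = d + gap(nodes[u], nodes[v])
                    heapq.heappush(heap, (dist[v], v))
        return math.inf
    print(min_unsafe_distance((0, 0), (10, 0), [(3, 0, 2), (7, 0, 2)]))  # 2.0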
【Keywords】: graph theory; safety; traffic engineering computing; Euclidean space; continuous data space; discrete graph; extraneous edges; graph construction time; graph edges elimination; graph-to-graph transformation; path finding problem; query response time; safe zones; shortest path algorithms; storage space; unsafe regions; Buildings; Indexes; Joining processes; Query processing; Routing; Safety; Space exploration
【Paper Link】 【Pages】:543-554
【Authors】: Jian Dai ; Bin Yang ; Chenjuan Guo ; Zhiming Ding
【Abstract】: When planning routes, drivers usually consider a multitude of different travel costs, e.g., distances, travel times, and fuel consumption. Different drivers may choose different routes between the same source and destination because they may have different driving preferences (e.g., time-efficient driving vs. fuel-efficient driving). However, existing routing services offer little support for modeling multiple travel costs and for personalization: they usually deliver the same routes that minimize a single travel cost (e.g., the shortest routes or the fastest routes) to all drivers. We study the problem of how to recommend personalized routes to individual drivers using big trajectory data. First, we provide techniques capable of modeling and updating different drivers' driving preferences from the drivers' trajectories while considering multiple travel costs. To recommend personalized routes, we provide techniques that enable efficient selection of a subset of trajectories from all trajectories according to a driver's preference and the source, destination, and departure time specified by the driver. Next, we provide techniques that enable the construction of a small graph with appropriate edge weights reflecting how the driver would like to use the edges based on the selected trajectories. Finally, we recommend the shortest route in the small graph as the personalized route to the driver. Empirical studies with a large, real trajectory data set from 52,211 taxis in Beijing offer insight into the design properties of the proposed techniques and suggest that they are efficient and effective.
【Keywords】: Big Data; driver information systems; graph theory; set theory; Beijing; Big trajectory data; design properties; driver driving preferences; driver trajectories; graph construction; personalized route recommendation; routing services; single travel cost minimization; subset selection; taxis; Fuels; Global Positioning System; Indexes; Roads; Trajectory; Vehicles
【Paper Link】 【Pages】:555-566
【Authors】: Tanmoy Chakraborty ; Natwar Modani ; Ramasuri Narayanam ; Seema Nagar
【Abstract】: Performing a literature survey for scholarly activities has become a challenging and time-consuming task due to the rapid growth in the number of scientific articles. Thus, automatic recommendation of high-quality citations for a given scientific query topic is immensely valuable. The state-of-the-art on the problem of citation recommendation suffers from the following three limitations. First, most of the existing approaches for citation recommendation require input in the form of either the full article or a seed set of citations, or both. Nevertheless, obtaining recommendations for citations given a set of keywords is extremely useful for many scientific purposes. Second, the existing techniques for citation recommendation aim at suggesting prestigious and well-cited articles. However, we often need recommendations of diversified citations on the given query topic for many scientific purposes; for instance, it helps authors to write survey papers on a topic and it helps scholars to get a broad view of the key problems on a topic. Third, one of the problems in keyword-based citation recommendation is that the search results typically would not include semantically correlated articles if these articles do not use exactly the same keywords. To the best of our knowledge, there is no known citation recommendation system in the literature that addresses the above three limitations simultaneously. In this paper, we propose a novel citation recommendation system called DiSCern to precisely address the above research gap. DiSCern finds relevant and diversified citations in response to a search query, in terms of keyword(s) describing the query topic, while using only the citation graph and the keywords associated with the articles, and no latent information. We use a novel keyword expansion step, inspired by community finding in social network analysis, in DiSCern to ensure that semantically correlated articles are also included in the results. Our proposed approach primarily builds on the Vertex Reinforced Random Walk (VRRW) to balance prestige and diversity in the recommended citations. We demonstrate the efficacy of DiSCern empirically on two datasets: a large publication dataset of more than 1.7 million articles in the computer science domain and a dataset of more than 29,000 articles in the theoretical high-energy physics domain. The experimental results show that our proposed approach is quite efficient and outperforms the state-of-the-art algorithms in terms of both relevance and diversity.
【Keywords】: citation analysis; graph theory; query processing; recommender systems; DiSCern; VRRW; citation graph; diversified citation recommendation system; high quality citation automatic recommendation; keyword based citation recommendation; keyword expansion step; scientific articles; scientific query topic; search query; theoretical high-energy physics domain; vertex reinforced random walk; Approximation methods; Clustering algorithms; Communities; Computer science; Context; Markov processes; Mathematical model
【Paper Link】 【Pages】:567-578
【Authors】: Tuan-Anh Nguyen Pham ; Xutao Li ; Gao Cong ; Zhenjie Zhang
【Abstract】: Event-based social networks (EBSNs), such as Meetup and Plancast, which offer platforms for users to plan, arrange, and publish events, have gained increasing popularity and rapid growth. EBSNs capture not only the online social relationship, but also the offline interactions from offline events. They contain rich heterogeneous information, including multiple types of entities, such as users, events, groups and tags, and their interaction relations. Three recommendation tasks, namely recommending groups to users, recommending tags to groups, and recommending events to users, have been explored in three separate studies. However, none of the proposed methods can handle all the three recommendation tasks. In this paper, we propose a general graph-based model, called HeteRS, to solve the three recommendation problems on EBSNs in one framework. Our method models the rich information with a heterogeneous graph and considers the recommendation problem as a query-dependent node proximity problem. To address the challenging issue of weighting the influences between different types of entities, we propose a learning scheme to set the influence weights between different types of entities. Experimental results on two real-world datasets demonstrate that our proposed method significantly outperforms the state-of-the-art methods for all the three recommendation tasks, and the learned influence weights help understanding user behaviors.
【Keywords】: graph theory; learning (artificial intelligence); recommender systems; social networking (online); EBSN; HeteRS; event-based social networks; events recommendation; general graph-based model; groups recommendation; heterogeneous graph; learning scheme; query-dependent node proximity problem; tags recommendation; Irrigation
【Paper Link】 【Pages】:579-590
【Authors】: Yuhong Li ; Leong Hou U ; Man Lung Yiu ; Zhiguo Gong
【Abstract】: Discovering motifs in sequence databases has been receiving abundant attention from both the database and data mining communities, where the motif is the most correlated pair of subsequences in a sequence object. Motif discovery is expensive for emerging applications which may have very long sequences (e.g., a million observations per sequence) or rapidly arriving queries (e.g., every 10 seconds). Prior works cannot offer fast correlation computations and prune subsequence pairs at the same time, as these two techniques require different orderings on examining subsequence pairs. In this work, we propose a novel framework named Quick-Motif which adopts a two-level approach to enable batch pruning at the outer level and fast correlation calculation at the inner level. We further propose two optimization techniques for the outer and the inner level. In our experimental study, our method is up to 3 orders of magnitude faster than the state-of-the-art methods.
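For reference, the following brute-force sketch implements the problem definition rather than Quick-Motif: it finds the pair of length-m subsequences with the highest Pearson correlation while skipping trivially overlapping pairs; its quadratic cost is what motivates the batch pruning and fast correlation techniques described above.
    import numpy as np
    def naive_motif(x, m):
        x = np.asarray(x, dtype=float)
        windows = np.lib.stride_tricks.sliding_window_view(x, m)     # shape (n-m+1, m)
        z = (windows - windows.mean(axis=1, keepdims=True)) / windows.std(axis=1, keepdims=True)
        n = len(z)
        best, pair = -2.0, None
        for i in range(n):
            for j in range(i + m, n):          # skip self-overlapping (trivial) matches
                corr = float(z[i] @ z[j]) / m  # Pearson correlation of the two windows
                if corr > best:
                    best, pair = corr, (i, j)
        return pair, best
    rng = np.random.default_rng(0)
    print(naive_motif(rng.standard_normal(200), 16))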
【Keywords】: data mining; optimisation; query processing; Quick-Motif framework; batch pruning; correlation computations; data mining communities; database communities; exact motif discovery; inner level; optimization techniques; outer level; sequence databases; sequence object; subsequence pairs; two-level approach; Force; Silicon
【Paper Link】 【Pages】:591-602
【Authors】: Lu Chen ; Yunjun Gao ; Xinhan Li ; Christian S. Jensen ; Gang Chen
【Abstract】: The goal in similarity search is to find objects similar to a specified query object given a certain similarity criterion. Although useful in many areas, such as multimedia retrieval, pattern recognition, and computational biology, to name but a few, similarity search is not yet supported well by commercial DBMSs. This may be due to the complex data types involved and the need for flexible similarity criteria seen in real applications. We propose an efficient disk-based metric access method, the Space-filling curve and Pivot-based B+-tree (SPB-tree), to support a wide range of data types and similarity metrics. The SPB-tree uses a small set of so-called pivots to significantly reduce the number of distance computations, uses a space-filling curve to cluster the data into compact regions, thus improving storage efficiency, and utilizes a B+-tree with minimum bounding box information as the underlying index. The SPB-tree also employs a separate random access file to efficiently manage large and complex data. By design, it is easy to integrate the SPB-tree into an existing DBMS. We present efficient similarity search algorithms and corresponding cost models based on the SPB-tree. Extensive experiments using real and synthetic data show that the SPB-tree has much lower construction cost and smaller storage size, and supports more efficient similarity queries with highly accurate cost models, compared with competing techniques. Moreover, the SPB-tree scales sublinearly with growing dataset size.
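A toy sketch of the two mapping steps the SPB-tree rests on (illustrative only, not its exact construction): map each object to its distances to a few pivots, then interleave the discretized pivot distances into a space-filling-curve key (Z-order here) that an ordinary B+-tree can index; the metric below is edit distance, but any metric works.
    def pivot_map(obj, pivots, dist):
        # Map an object to the vector of its distances to the pivots.
        return [dist(obj, p) for p in pivots]
    def z_order_key(coords, bits=8):
        # Interleave the bits of the integer coordinates, most significant bit first.
        key = 0
        for b in range(bits):
            for c in coords:
                key = (key << 1) | ((int(c) >> (bits - 1 - b)) & 1)
        return key
    def edit_distance(a, b):
        # Standard dynamic-programming edit distance.
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
        return dp[-1]
    pivots = ["database", "stream"]
    for s in ["databases", "steam", "dataset"]:
        vec = pivot_map(s, pivots, edit_distance)
        print(s, vec, z_order_key(vec))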
【Keywords】: indexing; query processing; tree searching; DBMS; Pivot-based B+-tree; computational biology; disk-based metric access method; metric indexing; multimedia retrieval; pattern recognition; similarity search; space-filling curve; Algorithm design and analysis; Extraterrestrial measurements; Hafnium; Indexes; Object recognition; Search problems
【Paper Link】 【Pages】:603-614
【Authors】: Takanori Maehara ; Mitsuru Kusumoto ; Ken-ichi Kawarabayashi
【Abstract】: Similarity join finds all pairs of objects (i, j) with similarity score s(i, j) greater than some specified threshold θ. This is a fundamental query problem in the database research community, and is used in many practical applications, such as duplicate detection, merge/purge, record linkage, object matching, and reference reconciliation. In this paper, we propose a scalable approximation algorithm with arbitrary accuracy for the similarity join problem with the SimRank similarity measure. The algorithm consists of two phases: filter and verification. The filter phase enumerates similar pair candidates, and the similarity of each candidate is then assessed in the verification phase. The scalability of the proposed algorithm is experimentally verified on large real networks. The complexity depends only on the number of similar pairs, not on the total number of pairs O(n^2). The proposed algorithm scales up to a network of 5M vertices and 70M edges. Compared with the state-of-the-art algorithms, it is about 10 times faster and requires about 10 times less memory.
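As a baseline for the problem statement rather than the paper's scalable algorithm, the following sketch computes SimRank scores by plain iteration over all node pairs and then reports the pairs above θ; its O(n^2) per-iteration cost is precisely what the filter-and-verification approach avoids.
    def simrank_join(in_neighbors, theta, c=0.6, iters=10):
        # in_neighbors: {node: [in-neighbor, ...]}; returns pairs with SimRank >= theta.
        nodes = list(in_neighbors)
        s = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
        for _ in range(iters):
            s_new = dict(s)
            for u in nodes:
                for v in nodes:
                    if u == v or not in_neighbors[u] or not in_neighbors[v]:
                        continue
                    total = sum(s[(a, b)] for a in in_neighbors[u] for b in in_neighbors[v])
                    s_new[(u, v)] = c * total / (len(in_neighbors[u]) * len(in_neighbors[v]))
            s = s_new
        return [(u, v, round(s[(u, v)], 3)) for u in nodes for v in nodes
                if u < v and s[(u, v)] >= theta]
    graph = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
    print(simrank_join(graph, 0.3))   # [('b', 'c', 0.6)]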
【Keywords】: computational complexity; query processing; computational complexity; duplicate detection; fundamental query problem; merge-purge; object matching; record linkage; reference conciliation; scalable SimRank join algorithm; scalable approximation algorithm; Accuracy; Algorithm design and analysis; Approximation algorithms; Complexity theory; Couplings; Memory management; Monte Carlo methods
【Paper Link】 【Pages】:615-626
【Authors】: Jun Hou ; Richi Nayak
【Abstract】: High-Order Co-Clustering (HOCC) methods have attracted much attention in recent years because of their ability to cluster multiple types of objects simultaneously using all available information. During the clustering process, HOCC methods exploit object co-occurrence information, i.e., inter-type relationships amongst different types of objects, as well as object affinity information, i.e., intra-type relationships amongst the same type of objects. However, it is difficult to learn accurate intra-type relationships in the presence of noise and outliers. Existing HOCC methods consider the p nearest neighbours based on Euclidean distance for the intra-type relationships, which leads to incomplete and inaccurate intra-type relationships. In this paper, we propose a novel HOCC method that incorporates multiple subspace learning with a heterogeneous manifold ensemble to learn complete and accurate intra-type relationships. Multiple subspace learning reconstructs the similarity between any pair of objects that belong to the same subspace. The heterogeneous manifold ensemble is created based on two types of intra-type relationships learnt using the p-nearest-neighbour graph and multiple subspace learning. Moreover, in order to ensure the robustness of the clustering process, we introduce a sparse error matrix into the matrix decomposition and develop a novel iterative algorithm. Empirical experiments show that the proposed method achieves improved results over the state-of-the-art HOCC methods in terms of FScore and NMI.
【Keywords】: learning (artificial intelligence); matrix decomposition; pattern clustering; Euclidean distance; FScore; HOCC method; NMI; heterogeneous manifold ensemble; high-order co-clustering method; iterative algorithm; multiple subspace learning; multitype relational data; object affinity information; object co-occurrence information; p-nearest-neighbour graph; robust clustering; sparse error matrix decomposition; Clustering algorithms; Manifolds; Matrix decomposition; Noise; Robustness; Sparse matrices; Web pages
【Paper Link】 【Pages】:627-638
【Authors】: Norases Vesdapunt ; Hector Garcia-Molina
【Abstract】: We study the problem of Entity Resolution (ER) with limited information. ER is the problem of identifying and merging records that represent the same real-world entity. In this paper, we focus on the resolution of a single node g from one social graph (Google+ in our case) against a second social graph (Twitter in our case). We want to find the best match for g in Twitter, by dynamically probing the Twitter graph (using a public API), limited by the number of API calls that social systems allow. We propose two strategies that are designed for limited information and can be adapted to different limits. We evaluate our strategies against a naive one on a real dataset and show that our strategies can provide improved accuracy with significantly fewer API calls.
【Keywords】: application program interfaces; graph theory; social networking (online); API calls; ER; Google+; Twitter graph; entity resolution; limited information; naive strategies; public API; social graph; social networks; user identification; Accuracy; Erbium; Google; Logistics; Probes; Twitter
【Paper Link】 【Pages】:639-650
【Authors】: Yuchen Li ; Zhifeng Bao ; Guoliang Li ; Kian-Lee Tan
【Abstract】: Internet users are shifting from searching on traditional media to social network platforms (SNPs) to retrieve up-to-date and valuable information. SNPs have two unique characteristics: frequent content updates and the small-world phenomenon. However, existing works are not able to support these two features simultaneously. To address this problem, we develop a general framework to enable real-time personalized top-k queries. Our framework is based on a general ranking function that incorporates time freshness, social relevance and textual similarity. To ensure efficient update and query processing, there are two key challenges. The first is to design an index structure that is update-friendly while supporting instant query processing. The second is to efficiently compute the social relevance in a complex graph. To address these challenges, we first design a novel 3D cube inverted index to support efficient pruning on the three dimensions simultaneously. Then we devise a cube-based threshold algorithm to retrieve the top-k results, and propose several pruning techniques to optimize the social distance computation, whose cost dominates the query processing. Furthermore, we optimize the 3D index via a hierarchical partition method to enhance our pruning on the social dimension. Extensive experimental results on two large real-world datasets demonstrate the efficiency and the robustness of our proposed solution.
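For context, the following sketch shows the classic threshold-algorithm (TA) pattern that a cube-based top-k algorithm of this kind builds on: scan per-dimension ranked lists in round-robin, maintain the current top-k under a weighted aggregate of freshness, social and textual scores, and stop once the k-th best aggregate reaches the threshold formed by the list frontiers; all names, scores and weights below are illustrative.
    import heapq
    def threshold_topk(lists, weights, k):
        # lists: {dim: [(score, oid), ...] sorted by score descending}, scores >= 0.
        table = {}                            # random-access table: oid -> {dim: score}
        for dim, entries in lists.items():
            for score, oid in entries:
                table.setdefault(oid, {})[dim] = score
        aggregate = lambda oid: sum(weights[d] * table[oid].get(d, 0.0) for d in lists)
        topk, seen, depth = [], set(), 0
        while True:
            frontier, progressed = {}, False
            for dim, entries in lists.items():
                if depth < len(entries):
                    progressed = True
                    score, oid = entries[depth]
                    frontier[dim] = score
                    if oid not in seen:       # new object: compute its full score
                        seen.add(oid)
                        heapq.heappush(topk, (aggregate(oid), oid))
                        if len(topk) > k:
                            heapq.heappop(topk)
                else:
                    frontier[dim] = 0.0
            threshold = sum(weights[d] * frontier[d] for d in lists)
            if not progressed or (len(topk) == k and topk[0][0] >= threshold):
                return sorted(topk, reverse=True)
            depth += 1
    lists = {"fresh":  [(0.9, "m1"), (0.5, "m2"), (0.1, "m3")],
             "social": [(0.8, "m2"), (0.6, "m3"), (0.2, "m1")],
             "text":   [(0.7, "m3"), (0.4, "m1"), (0.3, "m2")]}
    print(threshold_topk(lists, {"fresh": 1, "social": 1, "text": 1}, k=2))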
【Keywords】: Internet; graph theory; indexing; query processing; social networking (online); 3D cube inverted index design; Internet users; complex graph; cube based threshold algorithm; frequent content update; hierarchical partition method; index structure design; information retrieval; instant query processing; personalized top-k query; pruning techniques; ranking function; real time personalized search; small world phenomenon; social distance computation; social networks; social relevance; textual similarity; time freshness; Algorithm design and analysis; Indexes; Query processing; Real-time systems; Three-dimensional displays; Twitter
【Paper Link】 【Pages】:651-662
【Authors】: Ke Wu ; Song Yang ; Kenny Q. Zhu
【Abstract】: This paper studies the problem of automatic detection of false rumors on Sina Weibo, the popular Chinese microblogging social network. Traditional feature-based approaches extract features from the false rumor message, its author, and the statistics of its responses to form a flat feature vector. This ignores the propagation structure of the messages and has not achieved very good results. We propose a graph-kernel based hybrid SVM classifier which captures the high-order propagation patterns in addition to semantic features such as topics and sentiments. The new model achieves a classification accuracy of 91.3% on a randomly selected Weibo dataset, significantly higher than state-of-the-art approaches. Moreover, our approach can be applied at the early stage of rumor propagation and is 88% confident in detecting an average false rumor just 24 hours after the initial broadcast.
【Keywords】: feature extraction; graph theory; pattern classification; social networking (online); support vector machines; Chinese microblogging social network; Sina Weibo; false rumors detection; feature extraction; feature-based approaches; flat feature vector; graph-kernel based hybrid SVM classifier; high-order propagation patterns; propagation structures; semantic features; Awards activities; Communities; Explosions; Feature extraction; Periodic structures; Silicon compounds; Skin
【Paper Link】 【Pages】:675-686
【Authors】: Kaiqi Zhao ; Gao Cong ; Quan Yuan ; Kenny Q. Zhu
【Abstract】: Many location based services, such as FourSquare, Yelp, TripAdvisor, Google Places, etc., allow users to compose reviews or tips on points of interest (POIs), each having geographical coordinates. These services have accumulated a large amount of such geo-tagged review data, which allows deep analysis of user preferences for POIs. This paper studies two types of user preferences for POIs: topical-region preference and category-aware topical-aspect preference. We propose a unified probabilistic model to capture these two preferences simultaneously. In addition, our model is capable of capturing the interaction of different factors, including topical aspect, sentiment, and spatial information. The model can be used in a number of applications, such as POI recommendation and user recommendation, among others. In addition, the model enables us to investigate whether people like an aspect of a POI, or whether people like a topical aspect of some type of POIs (e.g., bars) in a region, which offers explanations for recommendations. Experiments on real-world datasets show that the model achieves significant improvement in POI recommendation and user recommendation in comparison to the state-of-the-art methods. We also propose an efficient online recommendation algorithm based on our model, which saves up to 90% computation time.
【Keywords】: mobile computing; recommender systems; FourSquare; Google Places; POI recommendation; SAR; TripAdvisor; Yelp; category aware topical-aspect preference; geo-tagged reviews; geographical coordinates; location based services; online recommendation algorithm; points of interest; sentiment-aspect-region model; spatial information; topical aspect; topical-region preference; unified probabilistic model; user preference analysis; user recommendation; Analytical models; Computational modeling; Data mining; Gaussian distribution; Inference algorithms; Mathematical model; Proposals
【Paper Link】 【Pages】:687-698
【Authors】: Kenneth Fuglsang Christensen ; Lasse Linnerup Christiansen ; Torben Bach Pedersen ; Jeppe Pihl
【Abstract】: Increasingly, streaming positions from moving objects in blended indoor/outdoor spaces are used to deliver new types of real-time location-based services. To support such scenarios, this paper presents the Searchlight Graph (SLG) model and the associated Searchlight Continuous Query Processing Framework (CQPF) for (predictive) Continuous Query Processing (CQP) in symbolic indoor/outdoor spaces. The model captures both actual and predicted object movement, object-specific edge costs, and location/object context annotation with keywords, enabling context-aware (predictive) querying of both locations and objects. Furthermore, the paper proposes several types of continuous spatio-temporal queries, expressed in the declarative Searchlight Query Language (SLQL), along with novel query processing algorithms, and describes their implementation in the Searchlight CQPF. Finally, a novel location prediction algorithm is proposed. Extensive experimental studies show that Searchlight is scalable, efficient, and outperforms its main competitor.
【Keywords】: graph theory; query processing; ubiquitous computing; CQP; CQPF; SLG model; SLQL; context aware predictive continuous querying; indoor-outdoor spaces; location-object context annotation; moving objects; novel query processing algorithms; object-specific edge costs; predicted object movement; real-time location based services; searchlight continuous query processing framework; searchlight graph; searchlight query language; symbolic space; Aggregates; Biology; Receivers
【Paper Link】 【Pages】:699-710
【Authors】: Dong-Wan Choi ; Chin-Wan Chung
【Abstract】: This paper proposes a group version of the nearest neighbor (NN) query, called the nearest neighborhood (NNH) query, which aims to find the nearest group of points, instead of one nearest point. Given a set O of points, a query point q, and a ρ-radius circle C, the NNH query returns the nearest placement of C to q such that there are at least k points enclosed by C. We present a fast algorithm for processing the NNH query based on the incremental retrieval of nearest neighbors using the R-tree structure on O. Our solution includes several techniques, to efficiently maintain sets of retrieved nearest points and identify their validities in terms of the closeness constraint of their points. These techniques are devised from the unique characteristics of the NNH search problem. As a side product, we solve a new geometric problem, called the nearest enclosing circle (NEC) problem, which is of independent interest. We present a linear expected-time algorithm solving the NEC problem using the properties of the NEC similar to those of the smallest enclosing circle. We provide extensive experimental results, which show that our techniques can significantly improve the query performance.
【Keywords】: geometry; query processing; search problems; tree data structures; visual databases; ρ-radius circle; NEC problem; NNH query; NNH search problem; R-tree structure; geometric problem; incremental retrieval; linear expected-time algorithm; nearest enclosing circle; nearest neighborhood query; nearest neighborhood search; spatial databases; Artificial neural networks; Clustering algorithms; Data mining; Indexes; Mobile communication; Query processing; Spatial databases
【Paper Link】 【Pages】:711-722
【Authors】: Huiqi Hu ; Yiqun Liu ; Guoliang Li ; Jianhua Feng ; Kian-Lee Tan
【Abstract】: With the rapid progress of the mobile Internet and the growing popularity of smartphones, location-aware publish/subscribe systems have recently attracted significant attention. Different from traditional content-based publish/subscribe, subscriptions registered by subscribers and messages published by publishers include both spatial information and textual descriptions, and messages should be delivered to relevant subscribers whose subscriptions have high relevancy to the messages. To evaluate the relevancy between spatio-textual messages and subscriptions, we should combine spatial proximity and textual relevancy. Since subscribers have different preferences - some subscribers prefer messages with high spatial proximity and some subscribers pay more attention to messages with high textual relevancy - it calls for new location-aware publish/subscribe techniques to meet the various needs of different subscribers. In this paper, we allow subscribers to parameterize their subscriptions and study the location-aware publish/subscribe problem on parameterized spatio-textual subscriptions. One big challenge is to achieve high performance. To meet this requirement, we propose a filter-verification framework to efficiently deliver messages to relevant subscribers. In the filter step, we devise effective filters to prune large numbers of irrelevant results and obtain a set of candidates. In the verification step, we verify the candidates to generate the answers. We propose three effective filters by integrating prefix filtering and spatial pruning techniques. Experimental results show our method achieves higher performance and better quality than baseline approaches.
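The following sketch illustrates the prefix-filtering principle used for the textual part of the filter step (tokens, ordering and the overlap threshold are illustrative): under a fixed global token ordering, if two token sets must share at least t tokens, their prefixes of length len - t + 1 must share at least one token, so an inverted index over prefixes prunes most irrelevant message-subscription pairs.
    def prefix(tokens, t, order):
        # Sort by the global ordering (e.g., rarest first) and keep the first len-t+1 tokens.
        sorted_tokens = sorted(tokens, key=order.__getitem__)
        return sorted_tokens[:max(0, len(sorted_tokens) - t + 1)]
    order = {"coffee": 0, "wifi": 1, "cheap": 2, "downtown": 3, "pizza": 4}
    subscription = {"coffee", "wifi", "downtown"}
    message = {"pizza", "downtown", "cheap"}
    t = 2   # require an overlap of at least 2 tokens to be relevant
    p_sub, p_msg = prefix(subscription, t, order), prefix(message, t, order)
    print(p_sub, p_msg, bool(set(p_sub) & set(p_msg)))  # no shared prefix token -> pruned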
【Keywords】: message passing; middleware; mobile computing; smart phones; content-based publish/subscribe; filter-verification framework; location-aware publish/subscribe framework; message delivery; message subscription; parameterized spatio-textual subscriptions; prefix filtering; relevancy evaluate; spatial information; spatial proximity; spatial pruning technique; spatio-textual messages; textual descriptions; textual relevancy; Complexity theory; Filtering algorithms; Footwear; Mobile communication; Spatial indexes; Subscriptions
【Paper Link】 【Pages】:723-734
【Authors】: Yingjun Wu ; Kian-Lee Tan
【Abstract】: We introduce ChronoStream, a distributed system specifically designed for elastic stateful stream computation in the cloud. ChronoStream treats internal state as a first-class citizen and aims at providing flexible elastic support in both vertical and horizontal dimensions to cope with workload fluctuation and dynamic resource reclamation. With a clear separation between application-level computation parallelism and OS-level execution concurrency, ChronoStream enables transparent dynamic scaling and failure recovery by eliminating any network I/O and state-synchronization overhead. Our evaluation on dozens of computing nodes shows that ChronoStream can scale linearly and achieve transparent elasticity and high availability without sacrificing system performance or affecting collocated tenants.
【Keywords】: cloud computing; concurrency (computers); operating systems (computers); parallel processing; resource allocation; system recovery; ChronoStream; OS-level execution concurrency; application-level computation parallelism; distributed system; dynamic resource reclamation; elastic stateful stream computation; failure recovery; first-class citizen; flexible elastic support; network I/O; state-synchronization overhead; transparent dynamic scaling; Checkpointing; Computational modeling; Containers; Elasticity; Instruction sets; Peer-to-peer computing; Runtime
【Paper Link】 【Pages】:735-746
【Authors】: Jieying She ; Yongxin Tong ; Lei Chen ; Caleb Chen Cao
【Abstract】: With the rapid development of Web 2.0 and the Online To Offline (O2O) marketing model, various online event-based social networks (EBSNs), such as Meetup and Whova, are becoming popular. An important task of EBSNs is to facilitate the most satisfactory event-participant arrangement for both sides, i.e. events enroll more participants and participants are arranged with personally interesting events. Existing approaches usually focus on the arrangement of each single event for a set of potential users, and ignore the conflicts between different events, which leads to infeasible or redundant arrangements. In this paper, to address the shortcomings of existing approaches, we first identify a more general and useful event-participant arrangement problem, called the Global Event-participant Arrangement with Conflict and Capacity (GEACC) problem, which focuses on the conflicts between different events and makes event-participant arrangements in a global view. Unfortunately, we find that the GEACC problem is NP-hard due to the conflict constraints among events. Thus, we design two approximation algorithms with provable approximation ratios and an exact algorithm with a pruning technique to address this problem. Finally, we verify the effectiveness and efficiency of the proposed methods through extensive experiments on real and synthetic datasets.
【Keywords】: Internet; approximation theory; computational complexity; marketing; social networking (online); EBSN; GEACC problem; Meetup; NP-hard problem; Web 2.0; Whova; conflict-aware event-participant arrangement; global event-participant arrangement with conflict and capacity arrangement problem; online event-based social networks; online to offline marketing model; pruning technique; Algorithm design and analysis; Approximation algorithms; Approximation methods; Economic indicators; Noise measurement; Social network services; Web 2.0
【Paper Link】 【Pages】:747-758
【Authors】: Zheng Li ; Tingjian Ge
【Abstract】: Most of the recent complex event semantics is based on regular expressions, extended with additional filters such as window constraints. We observe that many applications today require parallel (interleaving) event patterns. Moreover, we observe that this matching needs to be approximate in terms of event orders and missing events. We first propose the query semantics. Then we devise a foundation algorithm, on top of which two optimization techniques are proposed. Finally, we perform a comprehensive experimental evaluation using three real-world datasets in different domains and synthetic datasets.
【Keywords】: optimisation; pattern matching; query processing; sequences; PIE; approximate interleaving event matching; complex event semantics; foundation algorithm; optimization techniques; query semantics; real-world datasets; sequences; synthetic datasets; Approximation algorithms; Automata; Couplings; Instruction sets; Markov processes; Pattern matching; Semantics
【Paper Link】 【Pages】:759-770
【Authors】: Chunyao Song ; Tingjian Ge
【Abstract】: Sequence data is prevalent, and event processing over sequences is increasingly important in this Big Data era, drawing much attention from both research and industry. In this paper, we address a novel problem, which is to find common event subsequences from two long sequences. This problem is well motivated, with applications in diverse domains. We propose the window-chained longest common subsequence (WCLCS) semantics, and argue that the traditional longest common subsequence (LCS) cannot serve this need. We then devise efficient algorithms to solve this problem by reducing it to a graph problem. We also propose two more methods to improve the performance: one is based on informed search and exploration, and the other is an approximation algorithm with accuracy guarantees. We finally carry out a systematic experimental evaluation using two real-world datasets and some synthetic datasets.
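For contrast with the proposed WCLCS semantics (whose window-chaining constraint is defined in the paper and not reproduced here), the following sketch is the textbook LCS dynamic program over two event sequences, which the abstract argues is insufficient on its own.
    def lcs_length(a, b):
        # Classic O(|a|*|b|) dynamic program for the longest common subsequence.
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, x in enumerate(a, 1):
            for j, y in enumerate(b, 1):
                dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(a)][len(b)]
    s1 = ["login", "click", "buy", "logout"]
    s2 = ["login", "search", "click", "buy"]
    print(lcs_length(s1, s2))   # 3: login, click, buy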
【Keywords】: Big Data; approximation theory; graph theory; sequences; Big Data; WCLCS semantics; approximation algorithm; common event matching; event processing; event subsequences; graph problem; real-world datasets; sequence data; synthetic datasets; systematic experimental evaluation; window-chained longest-common subsequence semantics; Algorithm design and analysis; Approximation algorithms; Companies; Diabetes; Heuristic algorithms; Semantics; Sensors
【Paper Link】 【Pages】:771-782
【Authors】: François Goasdoué ; Zoi Kaoudi ; Ioana Manolescu ; Jorge-Arnulfo Quiané-Ruiz ; Stamatis Zampetakis
【Abstract】: As increasing volumes of RDF data are being produced and analyzed, many massively distributed architectures have been proposed for storing and querying this data. These architectures are characterized first by their RDF partitioning and storage method, and second by their approach to distributed query optimization, i.e., determining which operations to execute on each node in order to compute the query answers. We present CliqueSquare, a novel optimization approach for evaluating conjunctive RDF queries in a massively parallel environment. We focus on reducing query response time, and thus seek to build flat plans, where the number of joins encountered on a root-to-leaf path in the plan is minimized. We present a family of optimization algorithms, relying on n-ary (star) equality joins to build flat plans, and compare their ability to find the flattest plans possible. We have deployed our algorithms in a MapReduce-based RDF platform and demonstrate experimentally the benefits of the flat plans built by our best algorithms.
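A tiny sketch of the n-ary star-join grouping intuition behind flat plans (the actual plan construction and the choice among groupings are the paper's contribution; this is only illustrative): triple patterns sharing a join variable are grouped so that each group can be evaluated as a single n-ary equality join, keeping root-to-leaf join paths short.
    def star_groups(triple_patterns):
        # Group triple patterns by each join variable they mention (subject or object).
        groups = {}
        for s, p, o in triple_patterns:
            for term in (s, o):
                if term.startswith("?"):            # a SPARQL-style variable
                    groups.setdefault(term, []).append((s, p, o))
        return groups
    bgp = [("?x", "knows", "?y"), ("?x", "worksAt", "?org"), ("?y", "livesIn", "?city")]
    for var, patterns in star_groups(bgp).items():
        print(var, "->", patterns)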
【Keywords】: data handling; optimisation; parallel processing; query processing; CliqueSquare; MapReduce-based RDF platform; distributed architectures; optimization approach; parallel RDF queries; Buildings; Distributed databases; Optimization; Parallel processing; Query processing; Resource description framework; Time factors
【Paper Link】 【Pages】:783-794
【Authors】: Mingzhu Wei ; Elke A. Rundensteiner ; Murali Mani
【Abstract】: Because of high volumes and unpredictable arrival rates, stream processing systems cannot always keep up with input data streams, resulting in buffer overflow and uncontrolled loss of data. Load shedding and spilling, the two prevalent technologies designed to solve this overflow problem by dropping or flushing data to disk, suffer from serious shortcomings. Dropping data means that partial output is lost forever, while flushing may waste precious resources due to the strong assumption that flushed data can and will eventually still be processed. In this paper, we propose our solution, INSURE, integrating structure-based drop and flush techniques within one unified framework for XML stream systems. Our INSURE framework provides an optimized fine-grained load reduction solution that achieves high-quality results. First, the fusion candidate lattice models the space of load reduction solutions incorporating both drop and flush decisions, called fusion candidates. Second, our systematic analysis of fusion candidates and their interrelationships in the fusion candidate lattice reveals important relationships, including the monotonicity of their feasibility and profitability properties. Third, based upon this fusion candidate lattice model, a family of optimization strategies for the selection of fusion candidates is designed to maximize the overall result quality. Experimental results demonstrate that INSURE consistently achieves higher quality results compared to the state-of-the-art techniques, yet with negligible overhead.
【Keywords】: XML; optimisation; INSURE; INSURE framework; XML stream processing; XML stream systems; buffer overflow; dropping data; flushed data; input data streams; integrated load reduction framework; load shedding; load spilling; optimization strategies; optimized fine grained load reduction solution; systematic analysis; Delays; Electronic mail; Engines; Lattices; Optimization; Query processing; XML
【Paper Link】 【Pages】:795-806
【Authors】: Buwen Wu ; Yongluan Zhou ; Pingpeng Yuan ; Ling Liu ; Hai Jin
【Abstract】: The emerging need for conducting complex analysis over big RDF datasets calls for scale-out solutions that can harness a computing cluster to process big RDF datasets. Queries over RDF data often involve complex self-joins, which would be very expensive to run if the data are not carefully partitioned across the cluster, in which case distributed joins over massive amounts of data become necessary. Existing RDF data partitioning methods can nicely localize simple queries but still need to resort to expensive distributed joins for more complex queries. In this paper, we propose a new data partitioning approach that makes use of the rich structural information in RDF datasets and minimizes the amount of data that has to be joined across different computing nodes. We conduct an extensive experimental study using two popular RDF benchmark datasets and one real RDF dataset that contain up to billions of RDF triples. The results indicate that our approach can produce a balanced and low-redundancy data partitioning scheme that can avoid or largely reduce the cost of distributed joins even for very complicated queries. In terms of query execution time, our approach can outperform the state-of-the-art methods by orders of magnitude.
【Keywords】: data handling; query languages; RDF data partitioning methods; big RDF datasets; complex self-joins; path partitioning; query execution time; redundant data partitioning scheme; scalable SPARQL querying; scale-out solutions; Approximation algorithms; Approximation methods; Data models; Distributed databases; Merging; Partitioning algorithms; Resource description framework
【Paper Link】 【Pages】:807-818
【Authors】: Günes Aluç ; M. Tamer Özsu ; Khuzaima Daudjee ; Olaf Hartig
【Abstract】: Recent advances in Linked Data Management and the Semantic Web have led to a rapid increase in both the quantity as well as the variety of Web applications that rely on the SPARQL interface to query RDF data. Thus, RDF data management systems are increasingly exposed to workloads that are far more diverse and dynamic than what these systems were designed to handle. The problem is that existing systems rely on a workload-oblivious physical representation that has a fixed schema, which is not suitable for diverse and dynamic workloads. To address these issues, we propose a physical representation that is schemaless. The resulting flexibility enables an RDF dataset to be clustered based purely on the workload, which is key to achieving good performance through optimized I/O and cache utilization. Consequently, given a workload, we develop techniques to compute a good clustering of the database. We also design a new query evaluation model, namely, schemaless-evaluation that leverages this workload-aware clustering of the database whereby, with high probability, each tuple in the result set of a query is expected to be contained in at most one cluster. Our query evaluation model exploits this property to achieve better performance while ensuring fast generation of query plans without being hindered by the lack of a fixed physical schema.
【Keywords】: cache storage; database management systems; pattern clustering; query processing; semantic Web; RDF data management systems; RDF dataset; SPARQL interface; Web applications; cache utilization; dynamic workloads; linked data management; query evaluation model; query execution; schemaless RDF databases; semantic Web; workload-aware clustering; workload-oblivious physical representation; Clustering algorithms; Computational modeling; Indexes; Query processing; Resource description framework; Standards
【Paper Link】 【Pages】:819-830
【Authors】: Mohammadreza Najafi ; Mohammad Sadoghi ; Hans-Arno Jacobsen
【Abstract】: The limitations of traditional general-purpose processors have motivated the use of specialized hardware solutions (e.g., FPGAs) to achieve higher performance in stream processing. However, state-of-the-art hardware-only solutions offer limited support for adapting to changes in the query workload. In this work, we present a reconfigurable hardware-based streaming architecture that offers the flexibility to accept new queries and to change existing ones without the need for expensive hardware reconfiguration. We introduce the Online Programmable Block (OP-Block), a "Lego-like" connectable stream processing element, for constructing a custom Flexible Query Processor (FQP), suitable for a wide range of data streaming applications, including real-time data analytics, information filtering, intrusion detection, algorithmic trading, targeted advertising, and complex event processing. Through evaluations, we conclude that updating OP-Blocks to support new queries takes on the order of nanoseconds to microseconds (e.g., 40 ns to realize a join operator on an OP-Block), a feature critical to supporting streaming applications on FPGAs.
【Keywords】: field programmable gate arrays; query processing; reconfigurable architectures; FPGA; FQP; Lego-like connectable stream processing element; OP-block; algorithmic trading; complex event processing; configurable hardware-based streaming architecture; data streaming applications; flexible query processor; general-purpose processors; hardware reconfiguration; information filtering; intrusion detection; online programmable block; online programmable-blocks; real-time data analytics; reconfigurable hardware-based streaming architecture; targeted advertising; Computer architecture; Field programmable gate arrays; Hardware; Logic gates; Pipeline processing; Random access memory; Software
【Paper Link】 【Pages】:831-842
【Authors】: Wenqing Lin ; Xiaokui Xiao ; Xing Xie ; Xiaoli Li
【Abstract】: The identification of network motifs has important applications in numerous domains, such as pattern detection in biological networks and graph analysis in digital circuits. However, mining network motifs is computationally challenging, as it requires enumerating subgraphs from a real-life graph, and computing the frequency of each subgraph in a large number of random graphs. In particular, existing solutions often require days to derive network motifs from biological networks with only a few thousand vertices. To address this problem, this paper presents a novel study on network motif discovery using Graphics Processing Units (GPUs). The basic idea is to employ GPUs to parallelize a large number of subgraph matching tasks in computing subgraph frequencies from random graphs, so as to reduce the overall computation time of network motif discovery. We explore the design space of GPU-based subgraph matching algorithms, with careful analysis of several crucial factors that affect the performance of GPU programs. Based on our analysis, we develop a GPU-based solution that (i) considerably differs from existing CPU-based methods, and (ii) exploits the strengths of GPUs in terms of parallelism while mitigating their limitations in terms of the computation power per GPU core. With extensive experiments on a variety of biological networks, we show that our solution is up to two orders of magnitude faster than the best CPU-based approach, and is around 20 times more cost-effective than the latter, when taking into account the monetary costs of the CPU and GPUs used.
【Keywords】: bioinformatics; data mining; graphics processing units; parallel processing; GPU program performance; GPU-based subgraph matching algorithms; biological networks; computation power; computation time reduction; graphical processing units; monetary costs; network motif discovery; network motif mining; random graphs; real-life graph; subgraph frequency computation; subgraph matching task parallelization; Central Processing Unit; Frequency estimation; Graphics processing units; Indexes; Instruction sets; Labeling; Parallel processing
【Paper Link】 【Pages】:843-854
【Authors】: Majed Sahli ; Essam Mansour ; Tariq Alturkestani ; Panos Kalnis
【Abstract】: This paper presents APlug, a framework for automatic tuning of large-scale applications consisting of many independent tasks. APlug suggests the best decomposition of the original computation into smaller tasks and the best number of CPUs to use, in order to meet user-specific constraints. We show that the problem is not trivial because there is large variability in the execution time of tasks, and it is possible for a task to occupy a CPU by performing useless computations. APlug collects a sample of task execution times and builds a model, which is then used by a discrete event simulator to calculate the optimal parameters. We provide a C++ API and a stand-alone implementation of APlug, and we integrate it with three typical applications from computational chemistry, bioinformatics, and data mining. A scenario for optimizing resource utilization is used to demonstrate our framework. We run experiments on 16,384 CPUs on a supercomputer, 480 cores on a Linux cluster and 80 cores on Amazon EC2, and show that APlug is very accurate with minimal overhead.
【Keywords】: C++ language; application program interfaces; resource allocation; task analysis; APlug; C++ API; CPU; automatic tuning; bag-of-tasks applications; resource utilisation optimisation; task execution time variability; user specific constraint; Bioinformatics; Lead; Parallel processing; Resource management; Supercomputers; Time-frequency analysis; Tuning
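The tuning decision described above (how many CPUs meet a deadline, given a sample of task execution times) can be made concrete with a toy replay-based estimator. This is a hedged sketch with hypothetical inputs, not APlug's fitted model or its discrete event simulator.

```python
import heapq
import random

def estimate_makespan(sampled_durations, num_tasks, num_cpus, seed=0):
    """Estimate completion time of `num_tasks` independent tasks on
    `num_cpus` CPUs by replaying durations drawn from an observed sample.
    Greedy list scheduling: each task goes to the CPU that frees up first.
    """
    rng = random.Random(seed)
    cpu_free_at = [0.0] * num_cpus            # min-heap of CPU availability times
    heapq.heapify(cpu_free_at)
    makespan = 0.0
    for _ in range(num_tasks):
        start = heapq.heappop(cpu_free_at)    # earliest-available CPU
        finish = start + rng.choice(sampled_durations)
        makespan = max(makespan, finish)
        heapq.heappush(cpu_free_at, finish)
    return makespan

def smallest_cpu_count(sampled_durations, num_tasks, deadline, max_cpus=1024):
    """Smallest CPU count whose estimated makespan meets the deadline."""
    for cpus in range(1, max_cpus + 1):
        if estimate_makespan(sampled_durations, num_tasks, cpus) <= deadline:
            return cpus
    return None

if __name__ == "__main__":
    sample = [1.0, 1.2, 0.8, 5.0, 1.1]        # seconds; highly variable tasks
    print(smallest_cpu_count(sample, num_tasks=2000, deadline=20.0))
```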
【Paper Link】 【Pages】:855-866
【Authors】: Arian Bär ; Lukasz Golab ; Stefan Ruehrup ; Mirko Schiavone ; Pedro Casas
【Abstract】: Shared workload optimization is feasible if the set of tasks to be executed is known in advance, as is the case in updating a set of materialized views or executing an extract-transform-load workflow. In this paper, we consider data-intensive workloads with precedence constraints arising from data dependencies. While there has been previous work on identifying common subexpressions and task re-ordering to enable shared scans, in this paper we solve the problem of scheduling shared data-intensive workloads in a cache-oblivious way. Our solution relies on a novel formulation of precedence constrained scheduling with the additional constraint that once a data item is in the cache, all tasks that require this item should execute as soon as possible thereafter. We give an optimal algorithm using A* search over the space of possible orderings, and we propose efficient and effective heuristics that obtain nearly-optimal schedules in much less time. We present experimental results on real-life data warehouse workloads and the TPC-DS benchmark to validate our claims.
【Keywords】: cache storage; data warehouses; scheduling; search problems; A* search; TPC-DS benchmark; cache-oblivious scheduling; data dependencies; data warehouse workloads; extract-transform-load workflow; heuristics; nearly-optimal schedules; optimal algorithm; ordering space; precedence constrained scheduling; shared data-intensive workload scheduling; shared workload optimization; task set; Bandwidth
【Paper Link】 【Pages】:867-878
【Authors】: Alexander Shkapsky ; Mohan Yang ; Carlo Zaniolo
【Abstract】: The exploding demand for analytics has refocused the attention of data scientists on applications requiring aggregation in recursion. After resisting the efforts of researchers for more than twenty years, this problem is being addressed by innovative systems that are raising logic-oriented data languages to the levels of generality and performance that are needed to support efficiently a broad range of applications. Foremost among these new systems, the Deductive Application Language System (DeALS) achieves superior generality and performance via new constructs and optimization techniques for monotonic aggregates which are described in the paper. The use of a special class of monotonic aggregates in recursion was made possible by recent theoretical results that proved that they preserve the rigorous least-fixpoint semantics of core Datalog programs. This paper thus describes how DeALS extends their definitions and modifies their syntax to enable a concise expression of applications that, without them, could not be expressed in performance-conducive ways, or could not be expressed at all. Then the paper turns to the performance issue, and introduces novel implementation and optimization techniques that outperform traditional approaches, including Semi-naive evaluation. An extensive experimental evaluation was executed comparing DeALS with other systems on large datasets. The results suggest that, unlike other systems, DeALS indeed combines superior generality with superior performance.
【Keywords】: DATALOG; deductive databases; query processing; DeALS; datalog programs; deductive application language system; least-fixpoint semantics; monotonic aggregates; recursive queries optimization; seminaive evaluation; Aggregates; Hidden Markov models; Indexes; Lattices; Optimization; Semantics; Synchronization
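The monotonic aggregates discussed above preserve least-fixpoint semantics because their values only improve during recursion. The generic sketch below (plain Python, not DeALS syntax or its optimized runtime) computes single-source shortest distances with a monotonically improving min aggregate in a semi-naive style, re-expanding only nodes whose value changed.

```python
def min_distances(edges, source):
    """Least-fixpoint computation of single-source shortest distances with a
    monotonic `min` aggregate: values only ever decrease, so the fixpoint is
    well defined. `edges` maps a node to a list of (neighbor, weight) pairs.
    Semi-naive flavor: only nodes whose value changed are re-expanded.
    """
    dist = {source: 0.0}
    delta = {source}                         # nodes improved in the last round
    while delta:
        new_delta = set()
        for u in delta:
            for v, w in edges.get(u, []):
                candidate = dist[u] + w
                if candidate < dist.get(v, float("inf")):
                    dist[v] = candidate      # monotonic improvement only
                    new_delta.add(v)
        delta = new_delta
    return dist

if __name__ == "__main__":
    graph = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 1.5)], "c": [("d", 1.0)]}
    print(min_distances(graph, "a"))   # {'a': 0.0, 'b': 1.0, 'c': 2.5, 'd': 3.5}
```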
【Paper Link】 【Pages】:879-890
【Authors】: Lukasz Golab ; Flip Korn ; Feng Li ; Barna Saha ; Divesh Srivastava
【Abstract】: In this paper, we introduce a natural generalization of Weighted Set Cover and Maximum Coverage, called Size-Constrained Weighted Set Cover. The input is a collection of n elements, a collection of weighted sets over the elements, a size constraint k, and a minimum coverage fraction ŝ; the output is a sub-collection of up to k sets whose union contains at least ŝn elements and whose sum of weights is minimal. We prove the hardness of approximation of this problem, and we present efficient approximation algorithms with provable quality guarantees that are the best possible. In many applications, the elements are data records, and the set collection to choose from is derived from combinations (patterns) of attribute values. We provide optimization techniques for this special case. Finally, we experimentally demonstrate the effectiveness and efficiency of our solutions.
【Keywords】: approximation theory; data mining; optimisation; approximation algorithms; maximum coverage; optimization techniques; size-constrained weighted set cover; Approximation algorithms; Approximation methods; Business; Heuristic algorithms; Optimization; Pattern matching; Polynomials
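To make the problem definition above concrete, the sketch below is the natural greedy baseline for Size-Constrained Weighted Set Cover: repeatedly pick the set with the lowest weight per newly covered element until either k sets are chosen or the coverage target is reached. It carries none of the paper's approximation guarantees and is shown only for orientation.

```python
import math

def greedy_sc_wsc(elements, weighted_sets, k, coverage_fraction):
    """Greedy baseline for Size-Constrained Weighted Set Cover.

    `elements`: all elements; `weighted_sets`: list of (weight, set) pairs.
    Chooses at most k sets trying to cover >= ceil(coverage_fraction * n)
    elements while keeping total weight low (no optimality guarantee here).
    """
    universe = set(elements)
    target = math.ceil(coverage_fraction * len(universe))
    covered, chosen, total_weight = set(), [], 0.0
    candidates = list(weighted_sets)
    while len(covered) < target and len(chosen) < k:
        best, best_ratio = None, float("inf")
        for weight, s in candidates:
            gain = len(s - covered)
            if gain and weight / gain < best_ratio:   # cost per newly covered element
                best, best_ratio = (weight, s), weight / gain
        if best is None:                              # nothing can add coverage
            break
        weight, s = best
        candidates.remove(best)
        chosen.append(s)
        covered |= s
        total_weight += weight
    return chosen, covered, total_weight

if __name__ == "__main__":
    sets = [(1.0, {1, 2, 3}), (0.5, {3, 4}), (2.0, {4, 5, 6, 7}), (0.2, {7})]
    print(greedy_sc_wsc(range(1, 8), sets, k=3, coverage_fraction=0.7))
```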
【Paper Link】 【Pages】:891-902
【Authors】: Xiang Ao ; Ping Luo ; Chengkai Li ; Fuzhen Zhuang ; Qing He
【Abstract】: Frequent episode mining is a popular framework for discovering sequential patterns from sequence data. Previous studies on this topic usually process data offline in a batch mode. However, for fast-growing sequence data, old episodes may become obsolete while new useful episodes keep emerging. More importantly, in time-critical applications we need a fast solution for discovering the latest frequent episodes from growing data. To this end, we formulate the problem of Online Frequent Episode Mining (OFEM). By introducing the concept of last episode occurrence within a time window, our solution can detect new minimal episode occurrences efficiently, based on which all recent frequent episodes can be discovered directly. Additionally, a trie-based data structure, episode trie, is developed to store minimal episode occurrences in a compact way. We also formally prove the soundness and completeness of our solution and analyze its time as well as space complexity. Experimental results of both online and offline FEM on real data sets show the superiority of our solution.
【Keywords】: data mining; finite element analysis; FEM; OFEM; episode trie; last episode occurrence; minimal episode occurrences; online frequent episode mining; sequence data; sequential pattern discovery; trie-based data structure; Complexity theory; Data mining; Data structures; Finite element analysis; Time factors; Time-frequency analysis
【Paper Link】 【Pages】:903-914
【Authors】: Marius Eich ; Guido Moerkotte
【Abstract】: We fill two gaps in the literature. First, we give a comprehensive set of equivalences allowing reordering of grouping with non-inner joins. Second, we show how to incorporate the optimal placement of grouping into a state-of-the-art dynamic programming (DP)-based plan generator.
【Keywords】: automatic programming; dynamic programming; dynamic programming-based plan generator; grouping rendering; optimal grouping placement; Aggregates; Context; Databases; Dynamic programming; Generators; Null value; Standards
【Paper Link】 【Pages】:915-926
【Authors】: Yubao Wu ; Ruoming Jin ; Xiaofeng Zhu ; Xiang Zhang
【Abstract】: Finding dense subgraphs is an important problem that has recently attracted a lot of interest. Most of the existing work focuses on a single graph (or network). In many real-life applications, however, there exist dual networks, in which one network represents the physical world and another network represents the conceptual world. In this paper, we investigate the problem of finding the densest connected subgraph (DCS) which has the largest density in the conceptual network and is also connected in the physical network. Such a pattern cannot be identified using the existing algorithms for a single network. We show that even though finding the densest subgraph in a single network is polynomial time solvable, the DCS problem is NP-hard. We develop a two-step approach to solve the DCS problem. In the first step, we effectively prune the dual networks while guaranteeing that the optimal solution is contained in the remaining networks. For the second step, we develop two efficient greedy methods based on different search strategies to find the DCS. Different variations of the DCS problem are also studied. We perform extensive experiments on a variety of real and synthetic dual networks to evaluate the effectiveness and efficiency of the developed methods.
【Keywords】: computational complexity; graph theory; greedy algorithms; network theory (graphs); search problems; DCS problem; NP-hard problem; conceptual network; conceptual world; densest-connected subgraph; dual networks; greedy methods; optimal solution; physical network; physical world; polynomial time solvable network; real dual networks; search strategies; synthetic dual networks; two-step approach; Approximation algorithms; Approximation methods; Complexity theory; Genetics; Polynomials; Proteins; Silicon
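As background for the single-network case the abstract contrasts with (which is polynomial-time solvable), the classic greedy peeling procedure below repeatedly removes a minimum-degree vertex and keeps the densest intermediate subgraph, a well-known 1/2-approximation for densest subgraph on one undirected graph. It is only a baseline; the paper's DCS problem on dual networks is NP-hard and requires the pruning and connectivity-aware greedy strategies the authors develop.

```python
def densest_subgraph_peeling(adj):
    """Greedy peeling for the densest subgraph of a single undirected graph:
    repeatedly remove a minimum-degree vertex and remember the densest
    intermediate subgraph (density = |E| / |V|); a 1/2-approximation.
    `adj` maps each vertex to a set of neighbors (no self-loops).
    """
    adj = {v: set(ns) for v, ns in adj.items()}        # local mutable copy
    num_edges = sum(len(ns) for ns in adj.values()) // 2
    best_density, best_vertices = 0.0, set(adj)
    while adj:
        density = num_edges / len(adj)
        if density > best_density:
            best_density, best_vertices = density, set(adj)
        v = min(adj, key=lambda u: len(adj[u]))        # min-degree vertex
        for u in adj[v]:
            adj[u].discard(v)
        num_edges -= len(adj[v])
        del adj[v]
    return best_vertices, best_density

if __name__ == "__main__":
    g = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3, 5}, 5: {4}}
    print(densest_subgraph_peeling(g))   # the 4-clique {1, 2, 3, 4}, density 1.5
```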
【Paper Link】 【Pages】:927-938
【Authors】: Rong-Hua Li ; Jeffrey Xu Yu ; Lu Qin ; Rui Mao ; Tan Jin
【Abstract】: Random walk based graph sampling has been recognized as a fundamental technique to collect uniform node samples from a large graph. In this paper, we first present a comprehensive analysis of the drawbacks of three widely-used random walk based graph sampling algorithms, called re-weighted random walk (RW) algorithm, Metropolis-Hastings random walk (MH) algorithm and maximum-degree random walk (MD) algorithm. Then, to address the limitations of these algorithms, we propose two general random walk based algorithms, named rejection-controlled Metropolis-Hastings (RCMH) algorithm and generalized maximum-degree random walk (GMD) algorithm. We show that RCMH balances the tradeoff between the limitations of RW and MH, and GMD balances the tradeoff between the drawbacks of RW and MD. To further improve the performance of our algorithms, we integrate the so-called delayed acceptance technique and the non-backtracking random walk technique into RCMH and GMD respectively. We conduct extensive experiments over four real-world datasets, and the results demonstrate the effectiveness of the proposed algorithms.
【Keywords】: graph theory; sampling methods; GMD algorithm; MH algorithm; Metropolis-Hastings random walk; RCMH algorithm; RW algorithm; delayed acceptance technique; generalized maximum-degree random walk algorithm; random walk based graph sampling; rejection-controlled Metropolis-Hastings algorithm; reweighted random walk; Algorithm design and analysis; Context; Estimation; Heuristic algorithms; Markov processes; Network topology; Social network services
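For readers unfamiliar with the MH baseline analyzed above: a Metropolis-Hastings random walk accepts a move from u to a neighbor v with probability min(1, deg(u)/deg(v)), which makes its stationary distribution uniform over nodes. The minimal sketch below shows only this baseline on an adjacency-list graph; the paper's RCMH and GMD refinements are not reproduced.

```python
import random

def mh_random_walk(adj, start, num_steps, seed=42):
    """Metropolis-Hastings random walk whose stationary distribution is
    uniform over nodes. `adj` maps each node to a list of its neighbors.
    Returns the visited nodes (repeats occur whenever a move is rejected).
    """
    rng = random.Random(seed)
    current = start
    visits = []
    for _ in range(num_steps):
        candidate = rng.choice(adj[current])
        accept_prob = min(1.0, len(adj[current]) / len(adj[candidate]))
        if rng.random() < accept_prob:
            current = candidate            # accept the move
        visits.append(current)             # on rejection, `current` is re-sampled
    return visits

if __name__ == "__main__":
    g = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
    visits = mh_random_walk(g, start=0, num_steps=100_000)
    print({v: round(visits.count(v) / len(visits), 3) for v in g})   # roughly uniform
```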
【Paper Link】 【Pages】:939-950
【Authors】: Junzhou Zhao ; John C. S. Lui ; Don Towsley ; Pinghui Wang ; Xiaohong Guan
【Abstract】: Random walk-based graph sampling methods have become increasingly popular and important for characterizing large-scale complex networks. While powerful, they are known to exhibit problems when the graph is loosely connected, which slows down the convergence of a random walk and can result in poor estimation accuracy. In this work, we observe that many graphs under study, called target graphs, usually do not exist in isolation. In many situations, a target graph is often related to an auxiliary graph and an affiliation graph, and the target graph becomes better connected when viewed from these three graphs as a whole, or what we called a hybrid social-affiliation network. This viewpoint brings extra benefits to the graph sampling framework, e.g., when directly sampling a target graph is difficult or inefficient, we can efficiently sample it with the assistance of auxiliary and affiliation graphs. We propose three sampling methods on such a hybrid social-affiliation network to estimate target graph characteristics, and conduct extensive experiments on both synthetic and real datasets, to demonstrate the effectiveness of these new sampling methods.
【Keywords】: graph theory; sampling methods; social networking (online); affiliation graph; auxiliary graph; convergence; hybrid social-affiliation networks; large-scale complex networks; loosely-connected graph; random walk; random walk-based graph sampling methods; real datasets; synthetic datasets; target graph characteristics estimation; Bipartite graph; Cities and towns; Communities; Design methodology; Motion pictures; Sampling methods; Social network services
【Paper Link】 【Pages】:951-962
【Authors】: Arlei Silva ; Petko Bogdanov ; Ambuj K. Singh
【Abstract】: Many real-world complex systems can be modeled as dynamic networks with real-valued vertex/edge attributes. Examples include users' opinions in social networks and average speeds in a road system. When managing these large dynamic networks, compressing attribute values becomes a key requirement, since it enables the answering of attribute-based queries regarding a node/edge or network region based on a compact representation of the data. To address this problem, we introduce a lossy network compression scheme called Slice Tree (ST), which partitions a network into smooth regions with respect to node/edge values and compresses each value as the average of its region. ST applies a compact representation for network partitions, called slices, that are defined by a center node and a radius distance. We propose an importance sampling algorithm to efficiently prune the search space of candidate slices in the ST construction by biasing the sampling process towards the node values that most affect the compression error. The effectiveness of ST in terms of compression error, compression rate, and running time is demonstrated using synthetic and real datasets. ST scales to million-node instances and removes up to 87% of the error in attribute values with a 10^3 compression ratio. We also illustrate how ST captures relevant phenomena in real networks, such as research collaboration patterns and traffic congestion.
【Keywords】: data structures; query processing; social networking (online); ST; center node; data compact representation; dynamic networks; hierarchical innetwork attribute compression; importance sampling; lossy network compression scheme; node values; node-edge values; radius distance; real-valued vertex edge attributes; road system; sampling algorithm; sampling process; slice tree; smooth regions; social networks; Approximation algorithms; Approximation methods; Greedy algorithms; Image coding; Monte Carlo methods; Partitioning algorithms; Probabilistic logic
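A slice in the scheme above is a center node plus a hop radius, and every node in the slice is compressed to the region average. The hedged sketch below evaluates the squared reconstruction error of one candidate slice with a breadth-first search; it does not implement the paper's importance-sampling search over candidate slices.

```python
from collections import deque

def slice_members(adj, center, radius):
    """Nodes within `radius` hops of `center` (BFS over an adjacency map)."""
    hops = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if hops[u] == radius:
            continue
        for v in adj.get(u, []):
            if v not in hops:
                hops[v] = hops[u] + 1
                queue.append(v)
    return set(hops)

def slice_squared_error(adj, values, center, radius):
    """Squared error if every node in the slice is replaced by the slice mean."""
    members = slice_members(adj, center, radius)
    mean = sum(values[v] for v in members) / len(members)
    return sum((values[v] - mean) ** 2 for v in members), members

if __name__ == "__main__":
    road = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
    speed = {"a": 50.0, "b": 52.0, "c": 48.0, "d": 90.0}
    print(slice_squared_error(road, speed, center="b", radius=1))   # (8.0, {'a', 'b', 'c'})
```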
【Paper Link】 【Pages】:963-974
【Authors】: Han Su ; Kai Zheng ; Kai Zeng ; Jiamin Huang ; Shazia Wasim Sadiq ; Nicholas Jing Yuan ; Xiaofang Zhou
【Abstract】: Due to the prevalence of GPS-enabled devices and wireless communication technology, spatial trajectories that describe the movement history of moving objects are being generated and accumulated at an unprecedented pace. However, a raw trajectory in the form of a sequence of timestamped locations does not make much sense for humans without semantic representation. In this work we aim to facilitate human understanding of a raw trajectory by automatically generating a short text to describe it. By formulating this task as the problem of adaptive trajectory segmentation and feature selection, we propose a partition-and-summarization framework. In the partition phase, we first define a set of features for each trajectory segment and then derive an optimal partition with the aim of making the segments within each partition as homogeneous as possible in terms of their features. In the summarization phase, for each partition we select the most interesting features by comparing against the common behaviours of historical trajectories on the same route and generate a short text description for these features. For empirical study, we apply our solution to a real trajectory dataset and have found that the generated text can effectively reflect the important parts in a trajectory.
【Keywords】: Global Positioning System; feature selection; text analysis; GPS-enabled devices; adaptive trajectory segmentation; automatic short text description generation; feature selection; moving object movement history; partition phase; partition-and-summarization framework; spatial trajectories; summarization phase; trajectory data; wireless communication technology; Feature extraction; History; Roads; Routing; Semantics; Silicon; Trajectory
【Paper Link】 【Pages】:975-986
【Authors】: Bolong Zheng ; Nicholas Jing Yuan ; Kai Zheng ; Xing Xie ; Shazia Wasim Sadiq ; Xiaofang Zhou
【Abstract】: Driven by the advances in location positioning techniques and the popularity of location sharing services, semantically enriched trajectory data have become unprecedentedly available. While finding relevant Points-of-Interest (POIs) based on users' locations and query keywords has been extensively studied in the past years, keyword queries in the context of semantic trajectory databases remain largely unexplored. In this paper, we study the problem of approximate keyword search in massive semantic trajectories. Given a set of query keywords, an approximate keyword query of semantic trajectory (AKQST) returns k trajectories that contain the most relevant keywords to the query and yield the least travel effort in the meantime. The main difference between AKQST and conventional spatial keyword queries is that there is no query location in AKQST, which means the search area cannot be localized. To capture the travel effort in the context of query keywords, a novel utility function, called spatio-textual utility function, is first defined. Then we develop a hybrid index structure called GiKi to organize the trajectories hierarchically, which enables pruning the search space by spatial and textual similarity simultaneously. Finally, an efficient search algorithm and a fast evaluation of the minimum value of the spatio-textual utility function are proposed. The results of our empirical studies based on real check-in datasets demonstrate that our proposed index and algorithms can achieve good scalability.
【Keywords】: database management systems; query processing; AKQST; GiKi; POIs; approximate keyword query of semantic trajectory database; hybrid index structure; location positioning techniques; location sharing services; point-of-interest; query keywords; search algorithm; semantic enriched trajectory data; spatial keyword query; spatial similarity; spatio-textual utility function; textual similarity; user locations; Approximation algorithms; Indexes; Keyword search; Partitioning algorithms; Semantics; Trajectory
【Paper Link】 【Pages】:987-998
【Authors】: Jiajun Liu ; Kun Zhao ; Philipp Sommer ; Shuo Shang ; Brano Kusy ; Raja Jurdak
【Abstract】: Long-term location tracking, where trajectory compression is commonly used, has gained high interest for many applications in transport, ecology, and wearable computing. However, state-of-the-art compression methods involve high space-time complexity or achieve unsatisfactory compression rates, leading to rapid exhaustion of memory, computation, storage and energy resources. We propose a novel online algorithm for error-bounded trajectory compression called the Bounded Quadrant System (BQS), which compresses trajectories with extremely small costs in space and time using convex hulls. In this algorithm, we build a virtual coordinate system centered at a start point, and establish a rectangular bounding box as well as two bounding lines in each of its quadrants. In each quadrant, the points to be assessed are bounded by the convex hull formed by the box and lines. Various compression error-bounds are therefore derived to quickly draw compression decisions without expensive error computations. In addition, we also propose a light version of BQS that achieves O(1) complexity in both time and space for processing each point to suit the most constrained computation environments. Furthermore, we briefly demonstrate how this algorithm can be naturally extended to the 3-D case. Using empirical GPS traces from flying foxes, cars and simulation, we demonstrate the effectiveness of our algorithm in significantly reducing the time and space complexity of trajectory compression, while greatly improving the compression rates of the state-of-the-art algorithms (up to 47%). We then show that with this algorithm, the operational time of the target resource-constrained hardware platform can be prolonged by up to 41%.
【Keywords】: Global Positioning System; computational complexity; convex programming; mobile computing; GPS; bounded quadrant system; bounding lines; convex hull; error bounded trajectory compression; long term location tracking; rectangular bounding box; space complexity; time complexity; virtual coordinate system; Algorithm design and analysis; Complexity theory; Compression algorithms; Global Positioning System; Hardware; Runtime; Trajectory
【Paper Link】 【Pages】:999-1010
【Authors】: Sayan Ranu ; Deepak P ; Aditya D. Telang ; Prasad Deshpande ; Sriram Raghavan
【Abstract】: Quantifying the similarity between two trajectories is a fundamental operation in the analysis of spatio-temporal databases. While a number of distance functions exist, the recent shift in the dynamics of the trajectory generation procedure violates one of their core assumptions: a consistent and uniform sampling rate. In this paper, we formulate a robust distance function called Edit Distance with Projections (EDwP) to match trajectories under inconsistent and variable sampling rates through dynamic interpolation. This is achieved by deploying the idea of projections that goes beyond matching only the sampled points while aligning trajectories. To enable efficient trajectory retrievals using EDwP, we design an index structure called TrajTree. TrajTree derives its pruning power by employing the unique combination of bounding boxes with Lipschitz embedding. Extensive experiments on real trajectory databases demonstrate EDwP to be up to 5 times more accurate than the state-of-the-art distance functions. Additionally, TrajTree increases the efficiency of trajectory retrievals by up to an order of magnitude over existing techniques.
【Keywords】: database management systems; indexing; information retrieval; interpolation; EDwP; consistent sampling rate; distance functions; dynamic interpolation; edit distance with projections; inconsistent sampling rates; indexing trajectories; matching trajectories; real trajectory databases; spatio temporal databases; trajectory generation procedure; uniform sampling rate; Indexing; Interpolation; Measurement; Robustness; Trajectory
【Paper Link】 【Pages】:1011-1022
【Authors】: Jianneng Cao ; Fang-Yu Rao ; Elisa Bertino ; Murat Kantarcioglu
【Abstract】: Private record linkage protocols allow multiple parties to exchange matching records, which refer to the same entities or have similar values, while keeping the non-matching ones secret. Conventional protocols are based on computationally expensive cryptographic primitives and therefore do not scale. To address these scalability issues, hybrid protocols have been recently proposed that combine differential privacy techniques with secure multiparty computation techniques. However, a drawback of such protocols is that they disclose to the parties both the matching records and the differentially private synopses of the datasets involved in the linkage. Consequently, differential privacy is no longer always satisfied. To address this issue, we propose a novel framework, which separates the private synopses from the matching records. The two parties do not access the synopses directly, but still use them to efficiently link records. We theoretically prove the security of our framework. In addition, we have developed a simple but effective strategy for releasing private synopses. Extensive experimental results show that our framework is superior to the existing methods in terms of both recall rate and efficiency.
【Keywords】: cryptography; data integration; data privacy; pattern matching; cryptographic primitives; differentially private synopses; hybrid private record linkage scheme; hybrid protocols; matching records; privacy techniques; private record linkage protocols; secure multiparty computation techniques; Couplings; Cryptography; Mathematical model; Noise; Privacy; Protocols
【Paper Link】 【Pages】:1023-1034
【Authors】: Zach Jorgensen ; Ting Yu ; Graham Cormode
【Abstract】: Differential privacy is widely accepted as a powerful framework for providing strong, formal privacy guarantees for aggregate data analysis. A limitation of the model is that the same level of privacy protection is afforded for all individuals. However, it is common that the data subjects have quite different expectations regarding the acceptable level of privacy for their data. Consequently, differential privacy may lead to insufficient privacy protection for some users, while over-protecting others. We argue that by accepting that not all users require the same level of privacy, a higher level of utility can often be attained by not providing excess privacy to those who do not want it. We propose a new privacy definition called personalized differential privacy (PDP), a generalization of differential privacy in which users specify a personal privacy requirement for their data. We then introduce several novel mechanisms for achieving PDP. Our primary mechanism is a general one that automatically converts any existing differentially private algorithm into one that satisfies PDP. We also present a more direct approach for achieving PDP, inspired by the well-known exponential mechanism. We demonstrate our framework through extensive experiments on real and synthetic data.
【Keywords】: data protection; PDP; aggregate data analysis; exponential mechanism; formal privacy guarantees; personalized differential privacy; primary mechanism; privacy level; privacy protection; real data; synthetic data; utility level; Lead; Privacy
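As a point of reference for the setting above (not the paper's sampling-based or exponential-mechanism-style solutions), the trivially correct way to satisfy per-user privacy demands is to calibrate Laplace noise to the smallest personal epsilon, which is exactly the over-protection the paper aims to avoid. The hedged sketch below shows this conservative baseline for a count query, with hypothetical user data.

```python
import random

def conservative_pdp_count(records, personal_epsilon, seed=7):
    """Count query under per-user privacy requirements, satisfied the naive
    way: add Laplace noise calibrated to the smallest personal epsilon.
    `personal_epsilon` maps each user id in `records` to that user's epsilon.
    A count has sensitivity 1, so the noise scale is 1 / min_i epsilon_i.
    """
    rng = random.Random(seed)
    eps_min = min(personal_epsilon[user] for user in records)
    scale = 1.0 / eps_min
    # Laplace(0, scale) noise as the difference of two i.i.d. exponentials.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return len(records) + noise

if __name__ == "__main__":
    users = ["u1", "u2", "u3", "u4"]
    eps = {"u1": 0.1, "u2": 1.0, "u3": 0.5, "u4": 1.0}
    print(conservative_pdp_count(users, eps))   # noisy count around 4
```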
【Paper Link】 【Pages】:1035-1046
【Authors】: Shengzhi Xu ; Sen Su ; Xiang Cheng ; Zhengyi Li ; Li Xiong
【Abstract】: In this paper, we study the problem of mining frequent sequences under the rigorous differential privacy model. We explore the possibility of designing a differentially private frequent sequence mining (FSM) algorithm which can achieve both high data utility and a high degree of privacy. We found that, in differentially private FSM, the amount of required noise is proportional to the number of candidate sequences. If we can effectively reduce the number of unpromising candidate sequences, the utility-privacy tradeoff can be significantly improved. To this end, by leveraging a sampling-based candidate pruning technique, we propose a novel differentially private FSM algorithm, which is referred to as PFS2. The core of our algorithm is to utilize sample databases to further prune the candidate sequences generated based on the downward closure property. In particular, we use the noisy local support of candidate sequences in the sample databases to estimate which sequences are potentially frequent. To improve the accuracy of such private estimations, a sequence shrinking method is proposed to enforce the length constraint on the sample databases. Moreover, to decrease the probability of misestimating frequent sequences as infrequent, a threshold relaxation method is proposed to relax the user-specified threshold for the sample databases. Through formal privacy analysis, we show that our PFS2 algorithm is ϵ-differentially private. Extensive experiments on real datasets illustrate that our PFS2 algorithm can privately find frequent sequences with high accuracy.
【Keywords】: data mining; data privacy; probability; PFS2 algorithm; candidate sequences; formal privacy analysis; high data utility; private FSM; private estimations; private frequent sequence mining algorithm; probability; sampling-based candidate pruning technique; sequence shrinking method; threshold relaxation method; Algorithm design and analysis; Data privacy; Databases; Noise; Privacy; Sensitivity
【Paper Link】 【Pages】:1047-1058
【Authors】: Inah Jeon ; Evangelos E. Papalexakis ; U. Kang ; Christos Faloutsos
【Abstract】: How can we find useful patterns and anomalies in large scale real-world data with multiple attributes? For example, network intrusion logs, with (source-ip, target-ip, port-number, timestamp)? Tensors are suitable for modeling these multi-dimensional data, and widely used for the analysis of social networks, web data, network traffic, and in many other settings. However, current tensor decomposition methods do not scale for tensors with millions and billions of rows, columns and 'fibers' that often appear in real datasets. In this paper, we propose HaTen2, a scalable distributed suite of tensor decomposition algorithms running on the MapReduce platform. By carefully reordering the operations, and exploiting the sparsity of real world tensors, HaTen2 dramatically reduces the intermediate data, and the number of jobs. As a result, using HaTen2, we analyze big real-world tensors that cannot be handled by the current state of the art, and discover hidden concepts.
【Keywords】: Internet; security of data; tensors; HaTen2; MapReduce platform; Web data; billion-scale tensor decompositions; columns; fibers; large scale real-world data; modeling; multidimensional data; network intrusion logs; network traffic; rows; scalable distributed suite; social networks; tensor decomposition algorithms; tensor decomposition methods; Algorithm design and analysis; Computer science; Data models; Matrix converters; Matrix decomposition; Scalability; Tensile stress
【Paper Link】 【Pages】:1059-1070
【Authors】: Liping Peng ; Vuk Ercegovac ; Kai Zeng ; Peter J. Haas ; Andrey Balmin ; Yannis Sismanis
【Abstract】: Shared-nothing systems such as Hadoop vastly simplify parallel programming when processing disk-resident data whose size exceeds aggregate cluster memory. Such systems incur a significant performance penalty, however, on the important class of “groupwise set-valued analytics” (GSVA) queries in which the data is dynamically partitioned into groups and then a set-valued synopsis is computed for some or all of the groups. Key examples of synopses include top-k sets, bottom-k sets, and uniform random samples. Applications of GSVA queries include micro-marketing, root-cause analysis for problem diagnosis, and fraud detection. A naive approach to executing GSVA queries first reshuffles all of the data so that all records in a group are at the same node and then computes the synopsis for the group. This approach can be extremely inefficient when, as is typical, only a very small fraction of the records in each group actually contribute to the final groupwise synopsis, so that most of the shuffling effort is wasted. We show how to significantly speed up GSVA queries by slightly modifying the shared-nothing environment to allow tasks to occasionally access a small, common data structure; we focus on the Hadoop setting and use the “Adaptive MapReduce” infrastructure of Vernica et al. to implement the data structure. Our approach retains most of the advantages of a system such as Hadoop while significantly improving GSVA query performance, and also allows for incremental updating of query results. Experiments show speedups of up to 5x. Importantly, our new technique can potentially be applied to other shared-nothing systems with disk-resident data.
【Keywords】: data handling; data structures; parallel programming; query processing; set theory; storage management; GSVA queries; Hadoop; adaptive MapReduce; aggregate cluster memory; bottom-k sets; disk-resident data; disk-resident data processing; fraud detection; groupwise set-valued analytics; micromarketing; parallel programming; problem diagnosis; root-cause analysis; set-valued synopsis; shared-nothing systems; top-k sets; uniform random samples; Aggregates; Approximation methods; Data structures; Electronic mail; Generators; Optimization; Standards
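The simplest groupwise set-valued synopsis mentioned above is per-group top-k. The single-process sketch below (heaps only, not the Adaptive MapReduce shared structure used in the paper) illustrates why a small shared summary helps: a record that cannot beat its group's current k-th best value can be discarded before any shuffling.

```python
import heapq
from collections import defaultdict

class GroupwiseTopK:
    """Maintain a top-k synopsis (largest values) per group.

    Each group's min-heap root is the current k-th best value; exposing
    that one number per group is enough to let producers drop records
    locally instead of shuffling them.
    """
    def __init__(self, k):
        self.k = k
        self.heaps = defaultdict(list)     # group -> min-heap of (value, record)

    def offer(self, group, value, record):
        heap = self.heaps[group]
        if len(heap) < self.k:
            heapq.heappush(heap, (value, record))
            return True
        if value > heap[0][0]:             # beats the current k-th best
            heapq.heapreplace(heap, (value, record))
            return True
        return False                       # safe to discard without shuffling

    def synopsis(self, group):
        return sorted(self.heaps[group], reverse=True)

if __name__ == "__main__":
    topk = GroupwiseTopK(k=2)
    for grp, val, rec in [("A", 5, "r1"), ("A", 9, "r2"), ("A", 1, "r3"),
                          ("B", 3, "r4"), ("A", 7, "r5")]:
        topk.offer(grp, val, rec)
    print(topk.synopsis("A"))              # [(9, 'r2'), (7, 'r5')]
```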
【Paper Link】 【Pages】:1071-1082
【Authors】: Gregory Buehrer ; Roberto L. de Oliveira Jr. ; David Fuhry ; Srinivasan Parthasarathy
【Abstract】: Extracting interesting patterns from large data stores efficiently is a challenging problem in many domains. In the data mining literature, pattern frequency has often been touted as a proxy for interestingness and has been leveraged as a pruning criteria to realize scalable solutions. However, while there exist many frequent pattern algorithms in the literature, all scale exponentially in the worst case, restricting their utility on very large data sets. Furthermore, as we theoretically argue in this article, the problem is very hard to approximate within a reasonable factor, with a polynomial time algorithm. As a counter point to this theoretical result, we present a practical algorithm called Localized Approximate Miner (LAM) that scales linearithmically with the input data. Instead of fully exploring the top of the search lattice to a user-defined point, as traditional mining algorithms do, we instead explore different parts of the complete lattice, efficiently. The key to this efficient exploration is the reliance on min-wise independent permutations to collect the data into highly similar subsets of a partition. It is straightforward to implement and scales to very large data sets. We illustrate its utility on a range of data sets, and demonstrate that the algorithm finds more patterns of higher utility in much less time than several state-of-the-art algorithms. Moreover, we realize a natural multi-level parallelization of LAM that further reduces runtimes by up to 193-fold when leveraging 256 CMP cores spanning 32 machines.
【Keywords】: computational complexity; data mining; parallel processing; CMP cores; LAM; data collection; data mining; data sets; frequent pattern algorithms; large data storage; linearithmic time; localized approximate miner; min-wise independent permutation; natural multilevel parallelization; parameter-free parallel itemset mining algorithm; pattern frequency; polynomial time algorithm; search lattice; user-defined point; Integrated circuits; Itemsets
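LAM's partitioning step, as described above, relies on min-wise independent permutations to collect transactions into highly similar subsets. The hedged sketch below approximates that idea with simple affine hash functions over CRC32 item codes, bucketing transactions by their minhash signatures; the local mining within each bucket is not shown.

```python
import random
import zlib

def make_minhash_funcs(num_hashes, seed=0, prime=2_147_483_647):
    """Random affine functions h(x) = (a*x + b) mod prime, used as cheap
    stand-ins for min-wise independent permutations."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(0, prime))
            for _ in range(num_hashes)]

def minhash_signature(itemset, funcs, prime=2_147_483_647):
    """Signature = per-function minimum over deterministically hashed items."""
    codes = [zlib.crc32(str(item).encode()) for item in itemset]
    return tuple(min((a * c + b) % prime for c in codes) for a, b in funcs)

def bucket_by_signature(transactions, num_hashes=2):
    """Group transactions whose minhash signatures collide; transactions
    with high Jaccard similarity tend to land in the same bucket."""
    funcs = make_minhash_funcs(num_hashes)
    buckets = {}
    for tid, items in transactions.items():
        buckets.setdefault(minhash_signature(items, funcs), []).append(tid)
    return buckets

if __name__ == "__main__":
    db = {"t1": {"a", "b", "c"}, "t2": {"a", "b", "c", "d"},
          "t3": {"x", "y"}, "t4": {"x", "y", "z"}}
    for tids in bucket_by_signature(db).values():
        print(tids)
```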
【Paper Link】 【Pages】:1083-1094
【Authors】: Sean Chester ; Darius Sidlauskas ; Ira Assent ; Kenneth S. Bøgh
【Abstract】: The skyline is an important query operator for multi-criteria decision making. It reduces a dataset to only those points that offer optimal trade-offs of dimensions. In general, it is very expensive to compute. Recently, multicore CPU algorithms have been proposed to accelerate the computation of the skyline. However, they do not sufficiently minimize dominance tests and so are not competitive with state-of-the-art sequential algorithms. In this paper, we introduce a novel multicore skyline algorithm, Hybrid, which processes points in blocks. It maintains a shared, global skyline among all threads, which is used to minimize dominance tests while maintaining high throughput. The algorithm uses an efficiently-updatable data structure over the shared, global skyline, based on point-based partitioning. Also, we release a large benchmark of optimized skyline algorithms, with which we demonstrate on challenging workloads a 100-fold speedup over state-of-the-art multicore algorithms and a 10-fold speedup with 16 cores over state-of-the-art sequential algorithms.
【Keywords】: multi-threading; multiprocessing systems; query processing; data structure; dominance test minimization; hybrid algorithm; multicore CPU algorithms; multicore processors; multicriteria decision making; optimized skyline algorithms; point-based partitioning; query operator; scalable parallelization; shared-global skyline; skyline computation; throughput; Data structures; Fuels; Parallel processing
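The operation being minimized above is the dominance test: p dominates q if p is at least as good in every dimension and strictly better in one. For reference, the sketch below is a plain sequential block-nested-loop skyline (assuming smaller is better in every dimension); the paper's Hybrid algorithm parallelizes this across cores with a shared global skyline and point-based partitioning, which this baseline does not attempt.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly
    better in at least one (smaller is better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Block-nested-loop skyline: keep a window of non-dominated points."""
    window = []
    for p in points:
        if any(dominates(s, p) for s in window):
            continue                                          # p is dominated; drop it
        window = [s for s in window if not dominates(p, s)]   # evict points p dominates
        window.append(p)
    return window

if __name__ == "__main__":
    pts = [(1, 9), (3, 3), (4, 8), (5, 1), (4, 4), (6, 6)]
    print(skyline(pts))   # [(1, 9), (3, 3), (5, 1)]
```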
【Paper Link】 【Pages】:1095-1106
【Authors】: Daniel Schall ; Theo Härder
【Abstract】: Traditional database management systems (DBMSs) running on powerful single-node servers are usually over-provisioned for most of their daily workloads and, because they do not show good-enough energy proportionality, waste a lot of energy while underutilized. A cluster of small (wimpy) servers, whose size can be dynamically adjusted to the current workload, offers better energy characteristics for those workloads. Yet, data migration, necessary to balance utilization among the nodes, is a non-trivial and time-consuming task that may consume the energy saved. For this reason, a sophisticated and easy-to-adjust partitioning scheme fostering dynamic reorganization is needed. In this paper, we adapt a technique originally created for SMP systems, called physiological partitioning, to distribute data among nodes in a way that allows data to be easily repartitioned without interrupting transactions. We dynamically partition DB tables based on the nodes' utilization and given energy constraints and compare our approach with physical partitioning and logical partitioning methods. To quantify the possible energy savings and their conceivable drawback on query runtimes, we evaluate our implementation on an experimental cluster and compare the results w.r.t. performance and energy consumption. Depending on the workload, we can substantially save energy without sacrificing too much performance.
【Keywords】: database management systems; energy conservation; query processing; DB tables partitioning; DBMS; SMP systems; data migration; database management systems; dynamic physiological partitioning; dynamic reorganization; energy characteristics; energy constraints; energy saving; logical partitioning method; node utilization; physical partitioning; query runtimes; shared-nothing database cluster; single-node servers; Hardware; Physiology; Query processing; Random access memory; Servers; Throughput
【Paper Link】 【Pages】:1107-1118
【Authors】: Xiang Wang ; Ying Zhang ; Wenjie Zhang ; Xuemin Lin ; Wei Wang
【Abstract】: We investigate the problem of processing a large number of continuous spatial-keyword queries over streaming data, which is essential in many applications such as location-based recommendation and advertising, thanks to the proliferation of geo-equipped devices and the ensuing location-based social media applications. For example, a location-based e-coupon system may allow potentially millions of users to register their continuous spatial-keyword queries (e.g., interests in nearby sales) by specifying a set of keywords and a spatial region; the system then delivers each incoming spatial-textual object (e.g., a geo-tagged e-coupon) to all the matched queries (i.e., users) whose spatial and textual requirements are satisfied. While there are several prior approaches aiming at providing efficient query processing techniques for the problem, their approaches follow a spatial-first indexing method, which cannot exploit the keyword distribution well. In addition, their textual filtering techniques are built upon simple variants of traditional inverted indexes, which do not perform well for the textual constraint imposed by the problem. In this paper, we address the above limitations and provide a highly efficient solution based on a novel adaptive index, named AP-Tree. The AP-Tree adaptively groups registered queries using keyword and spatial partitions, guided by a cost model. The AP-Tree also naturally indexes ordered keyword combinations. We present an index construction algorithm that seamlessly and effectively integrates keyword and spatial partitions. Consequently, our method adapts well to the underlying spatial and keyword distributions of the data. Our extensive experiments demonstrate that AP-Tree achieves up to an order of magnitude improvement in efficiency compared with prior state-of-the-art methods.
【Keywords】: advertising data processing; data handling; indexing; information filtering; query processing; recommender systems; tree data structures; trees (mathematics); AP-tree; adaptive index; advertising; continuous spatial-keyword queries; data streaming; geo-equipped devices proliferation; index construction algorithm; keyword partition; location-based e-coupon system; location-based recommendation; location-based social media applications; query processing techniques; spatial partition; spatial requirement; spatial-first indexing method; spatial-textual object; textual filtering techniques; textual requirement; Filtering; Indexing; Media; Servers
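The matching semantics above can be restated simply: a registered query is a keyword set plus a rectangle, and an incoming geo-tagged object matches when it falls inside the rectangle and contains all the query's keywords. The hedged sketch below is a naive matcher that indexes each query under one of its keywords (ideally the rarest); it conveys the problem, not the AP-Tree's cost-guided keyword/spatial partitioning.

```python
from collections import defaultdict

class NaiveSpatialKeywordMatcher:
    """Registered continuous queries: a keyword set plus a rectangle.
    An incoming object matches a query if it lies inside the rectangle and
    its text contains all of the query's keywords. Each query is indexed
    under a single keyword, so any matching object is guaranteed to probe
    that posting list."""

    def __init__(self):
        self.queries = {}                    # qid -> (keywords, rect)
        self.inverted = defaultdict(set)     # keyword -> qids

    def register(self, qid, keywords, rect):
        keywords = frozenset(keywords)
        self.queries[qid] = (keywords, rect)
        pivot = min(keywords)                # ideally the rarest keyword; a deterministic stand-in here
        self.inverted[pivot].add(qid)

    def match(self, obj_keywords, x, y):
        obj_keywords = set(obj_keywords)
        results = []
        for kw in obj_keywords:
            for qid in self.inverted.get(kw, ()):
                keywords, (x1, y1, x2, y2) = self.queries[qid]
                if keywords <= obj_keywords and x1 <= x <= x2 and y1 <= y <= y2:
                    results.append(qid)
        return results

if __name__ == "__main__":
    m = NaiveSpatialKeywordMatcher()
    m.register("q1", {"coffee", "sale"}, (0, 0, 10, 10))
    m.register("q2", {"pizza"}, (5, 5, 20, 20))
    print(m.match({"coffee", "sale", "organic"}, x=3, y=4))   # ['q1']
```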
【Paper Link】 【Pages】:1119-1130
【Authors】: Yagiz Kargin ; Martin L. Kersten ; Stefan Manegold ; Holger Pirk
【Abstract】: When addressing the problem of “big” data volume, preparation costs are one of the key challenges: the high costs of loading, aggregating and indexing data lead to a long data-to-insight time. In addition to being a nuisance to the end-user, this latency prevents real-time analytics on “big” data. Fortunately, data often comes in semantic chunks such as files that contain data items that share some characteristics such as acquisition time or location. A data management system that exploits this trait can significantly lower the data preparation costs and the associated data-to-insight time by only investing in the preparation of the relevant chunks. In this paper, we develop such a system as an extension of an existing relational DBMS (MonetDB). To this end, we develop a query processing paradigm and data storage model that are partial-loading aware. The result is a system that can make a 1.2 TB dataset (consisting of 4000 chunks) ready for querying in less than 3 minutes on a single server-class machine while maintaining good query processing performance.
【Keywords】: Big Data; data models; data preparation; query processing; relational databases; storage management; Big Data analytics; Big Data sommelier; Big Data volume preparation costs; MonetDB; associated data-to-insight time; data aggregation; data indexing; data items; data loading; data management system; data storage model; partial-loading aware; query processing performance; relational DBMS; semantic chunks; single server-class machine; Loading; Optimization; Query processing; Relational databases; Semantics; Transforms
【Paper Link】 【Pages】:1131-1142
【Authors】: Jiefeng Cheng ; Qin Liu ; Zhenguo Li ; Wei Fan ; John C. S. Lui ; Cheng He
【Abstract】: Recent studies show that disk-based graph computation on just a single PC can be highly competitive with cluster-based computing systems on large-scale problems. Inspired by this remarkable progress, we develop VENUS, a disk-based graph computation system which is able to handle billion-scale problems efficiently on a commodity PC. VENUS adopts a novel computing architecture that features vertex-centric “streamlined” processing - the graph is sequentially loaded and the update functions are executed in parallel on the fly. VENUS deliberately avoids loading batch edge data by separating read-only structure data from mutable vertex data on disk. Furthermore, it minimizes random IOs by caching vertex data in main memory. The streamlined processing is realized with an efficient sequential scan over massive structure data and fast feeding of a large number of update functions. Extensive evaluation on large real-world and synthetic graphs has demonstrated the efficiency of VENUS. For example, VENUS takes just 8 minutes with a hard disk for PageRank on the Twitter graph with 1.5 billion edges. In contrast, Spark takes 8.1 minutes with 50 machines and 100 CPUs, and GraphChi takes 13 minutes using a fast SSD drive.
【Keywords】: graph theory; microprocessor chips; Twitter graph; VENUS; billion-scale problems; cluster based computing systems; disk-based graph computation system; large-scale problems; read-only structure data; single PC; vertex centric streamlined graph computation; vertex-centric streamlined processing; Analytical models; Computational modeling; Load modeling; Memory management; Random access memory; Venus
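The execution model above, a sequential scan of read-only structure data feeding updates to a small set of mutable vertex values, can be illustrated with an in-memory PageRank whose every iteration is one ordered pass over the edge list. This is a hedged toy, ignoring VENUS's disk layout, sharding, and parallel update dispatch.

```python
def streamed_pagerank(edges, num_vertices, iterations=20, d=0.85):
    """Vertex-centric PageRank where each iteration is one sequential pass
    over the read-only edge list, while only the small mutable per-vertex
    arrays change, mirroring the structure/value split described above.
    `edges` is a list of (src, dst) pairs with vertex ids 0..num_vertices-1.
    """
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        incoming = [0.0] * num_vertices
        for src, dst in edges:               # sequential scan of structure data
            if out_degree[src]:
                incoming[dst] += rank[src] / out_degree[src]
        rank = [(1.0 - d) / num_vertices + d * contrib for contrib in incoming]
    return rank

if __name__ == "__main__":
    edge_list = [(0, 1), (1, 2), (2, 0), (2, 1)]
    print(streamed_pagerank(edge_list, num_vertices=3))
```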
【Paper Link】 【Pages】:1143-1154
【Authors】: Wenlei Xie ; Yuanyuan Tian ; Yannis Sismanis ; Andrey Balmin ; Peter J. Haas
【Abstract】: A large scale network of social interactions, such as mentions in Twitter, can often be modeled as a “dynamic interaction graph” in which new interactions (edges) are continually added over time. Existing systems for extracting timely insights from such graphs are based on either a cumulative “snapshot” model or a “sliding window” model. The former model does not sufficiently emphasize recent interactions. The latter model abruptly forgets past interactions, leading to discontinuities in which, e.g., the graph analysis completely ignores historically important influencers who have temporarily gone dormant. We introduce TIDE, a distributed system for analyzing dynamic graphs that employs a new “probabilistic edge decay” (PED) model. In this model, the graph analysis algorithm of interest is applied at each time step to one or more graphs obtained as samples from the current “snapshot” graph that comprises all interactions that have occurred so far. The probability that a given edge of the snapshot graph is included in a sample decays over time according to a user-specified decay function. The PED model allows controlled trade-offs between recency and continuity, and allows existing analysis algorithms for static graphs to be applied to dynamic graphs essentially without change. For the important class of exponential decay functions, we provide efficient methods that leverage past samples to incrementally generate new samples as time advances. We also exploit the large degree of overlap between samples to reduce memory consumption from O(N) to O(log N) when maintaining N sample graphs. Finally, we provide bulk-execution methods for applying graph algorithms to multiple sample graphs simultaneously without requiring any changes to existing graph-processing APIs. Experiments on a real Twitter dataset demonstrate the effectiveness and efficiency of our TIDE prototype, which is built on top of the Spark distributed computing framework.
【Keywords】: Internet; graph theory; probability; social networking (online); PED model; Spark distributed computing framework; TIDE prototype; Twitter; cumulative snapshot model; dynamic interaction graphs; exponential decay functions; graph analysis algorithm; large scale network; probabilistic edge decay; sliding window model; social interactions; specified decay function; Aggregates; Algorithm design and analysis; Analytical models; Computational modeling; Heuristic algorithms; Probabilistic logic; Twitter
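Under the PED model above, an edge that arrived at time t_e is included in a sample graph at time t with a probability given by the user-specified decay function, e.g. exp(-lambda * (t - t_e)) for exponential decay. The hedged sketch below draws one sample graph non-incrementally; the paper's contribution, incremental and bulk sample maintenance, is not attempted here.

```python
import math
import random

def sample_decayed_graph(edges, now, decay_lambda, seed=1):
    """Draw one sample graph under probabilistic edge decay with an
    exponential decay function: an edge (u, v, t_e) is kept with
    probability exp(-decay_lambda * (now - t_e)).
    """
    rng = random.Random(seed)
    kept = []
    for u, v, t_e in edges:
        keep_prob = math.exp(-decay_lambda * (now - t_e))
        if rng.random() < keep_prob:
            kept.append((u, v))
    return kept

if __name__ == "__main__":
    interactions = [("alice", "bob", 1.0), ("bob", "carol", 5.0),
                    ("alice", "carol", 9.5), ("carol", "dave", 9.9)]
    # Recent interactions are almost surely kept; older ones fade out gradually.
    print(sample_decayed_graph(interactions, now=10.0, decay_lambda=0.3))
```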
【Paper Link】 【Pages】:1155-1166
【Authors】: Aristides Gionis ; Michael Mathioudakis ; Antti Ukkonen
【Abstract】: We study the problem of discrepancy maximization on graphs: given a set of nodes Q of an underlying graph G, we aim to identify a connected subgraph of G that contains many more nodes from Q than other nodes. This variant of the discrepancy-maximization problem extends the well-known notion of “bump hunting” in the Euclidean space. We consider the problem under two access models. In the unrestricted-access model, the whole graph G is given as input, while in the local-access model we can only retrieve the neighbors of a given node in G using a possibly slow and costly interface. We prove that the basic problem of discrepancy maximization on graphs is NP-hard, and empirically evaluate the performance of four heuristics for solving it. For the local-access model we consider three different algorithms that aim to recover a part of G large enough to contain an optimal solution, while using only a small number of calls to the neighbor-function interface. We perform a thorough experimental evaluation in order to understand the trade-offs between the proposed methods and their dependencies on characteristics of the input graph.
【Keywords】: graph theory; optimisation; query processing; set theory; Euclidean space; NP-hard problem; bump hunting approach; connected subgraph identification; empirical analysis; graph nodes; heuristics; input graph characteristics; local discrepancy maximization problem; local-access model; neighbor-function interface; optimal solution; performance evaluation; unrestricted-access model; Approximation algorithms; Color; Complexity theory; Computational modeling; Heuristic algorithms; Twitter
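As a concrete reading of the objective above, the classical linear discrepancy used in bump hunting rewards nodes from Q and penalizes all others. The weights below are illustrative assumptions and not necessarily the paper's exact objective.

```python
def linear_discrepancy(subgraph_nodes, query_nodes, alpha=1.0, beta=1.0):
    """Linear discrepancy of a candidate node set S with respect to Q:
    g(S) = alpha * |S intersect Q| - beta * |S minus Q|
    (a sketch; alpha and beta are illustrative weights)."""
    s, q = set(subgraph_nodes), set(query_nodes)
    return alpha * len(s & q) - beta * len(s - q)

# Example: S = {1, 2, 3, 4}, Q = {2, 3} gives 2*1.0 - 2*1.0 = 0.0
print(linear_discrepancy({1, 2, 3, 4}, {2, 3}))
```

The hard part, which the sketch ignores, is the connectivity constraint: the maximizing set S must induce a connected subgraph of G.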
【Paper Link】 【Pages】:1167-1178
【Authors】: Sudip Roy ; Arnd Christian König ; Igor Dvorkin ; Manish Kumar
【Abstract】: Cloud platforms involve multiple independently developed components, often executing on diverse hardware configurations and across multiple data centers. This complexity makes tracking key performance indicators (KPIs) and manually diagnosing anomalies in system behavior both difficult and expensive. In this paper, we describe PerfAugur, an automated system for mining service logs to identify anomalies and help formulate data-driven hypotheses. PerfAugur includes a suite of efficient mining algorithms for detecting significant anomalies in system behavior, along with potential explanations for such anomalies, without the need for an explicit supervision signal. We perform an extensive experimental evaluation using both synthetic and real-life data sets, and present detailed case studies showing the impact of this technology on the operations of the Windows Azure Service.
【Keywords】: cloud computing; data mining; KPI tracking; PerfAugur; Windows Azure Service; anomaly diagnosis; anomaly identification; automated system; cloud platforms; cloud services; data-driven hypothesis; diverse hardware configurations; explicit supervision signal; key performance indicator tracking; multiple data centers; performance anomalies; real-life data sets; robust diagnostics; service log mining; synthetic data sets; system behavior; Aggregates; Atomic measurements; Context; Data mining; Electric breakdown; Robustness; Telemetry
【Paper Link】 【Pages】:1179-1190
【Authors】: Quan Pham ; Tanu Malik ; Boris Glavic ; Ian T. Foster
【Abstract】: We present a light-weight database virtualization (LDV) system that allows users to share and re-execute applications that operate on a relational database (DB). Previous methods for sharing DB applications, such as companion websites and virtual machine images (VMIs), support neither easy and efficient re-execution nor the sharing of only a relevant DB subset. LDV addresses these issues by monitoring application execution, including DB operations, and using the resulting execution trace to create a lightweight re-executable package. An LDV package includes, in addition to the application, either the DB management system (DBMS) and relevant data or, if the DBMS and/or data cannot be shared, just the application-DBMS communications for replay during re-execution. We introduce a linked DB-operating system provenance model and show how to infer data dependencies based on temporal information about the DB operations performed by the application's process(es). We use this model to determine the DB subset that needs to be included in a package in order to enable re-execution. We compare LDV with other sharing methods in terms of package size, monitoring overhead, and re-execution overhead. We show that LDV packages are often more than an order of magnitude smaller than a VMI for the same application, and have negligible re-execution overhead.
【Keywords】: relational databases; virtual machines; virtualisation; DB management system; DBMS; LDV system; VMI; light-weight database virtualization; linked DB-operating system provenance model; monitoring overhead; package size; reexecution overhead; relational database; virtual machine images; Application virtualization; Computational modeling; Data models; Joining processes; Monitoring; Servers; Virtualization
【Paper Link】 【Pages】:1191-1202
【Authors】: Sebastian Schelter ; Juan Soto ; Volker Markl ; Douglas Burdick ; Berthold Reinwald ; Alexandre V. Evfimievski
【Abstract】: Meta learning techniques such as cross-validation and ensemble learning are crucial for applying machine learning to real-world use cases. These techniques first generate samples from input data, and then train and evaluate machine learning models on these samples. For meta learning on large datasets, the efficient generation of samples becomes problematic, especially when the data is stored in a distributed, block-partitioned representation and processed on a shared-nothing cluster. We present a novel, parallel algorithm for efficient sample generation from large, block-partitioned datasets in a shared-nothing architecture. This algorithm executes in a single pass over the data, and minimizes inter-machine communication. The algorithm supports a wide variety of sample generation techniques through an embedded user-defined sampling function. We illustrate how to implement distributed sample generation for popular meta learning techniques such as hold-out tests, k-fold cross-validation, and bagging, using our algorithm, and present an experimental evaluation on datasets with billions of data points.
【Keywords】: learning (artificial intelligence); parallel algorithms; sampling methods; block-partitioned datasets; distributed sample generation; efficient sample generation; embedded user-defined sampling function; ensemble learning; intermachine communication; machine learning; meta learning techniques; parallel algorithm; scalable meta learning; shared-nothing architecture; Data models; Distributed databases; Electronic mail; Indexes; Partitioning algorithms; Predictive models; Training
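The embedded user-defined sampling function described above can be pictured as a per-record UDF that maps each record to the samples it belongs to. The sketch below, which hashes keys for k-fold assignment and approximates bootstrap sampling for bagging with Poisson(1) counts, illustrates the idea under our own assumptions; it is not the paper's algorithm.

```python
import hashlib
import numpy as np

def kfold_fold(record_key, k=5):
    """Map a record to one of k cross-validation folds by hashing its key
    (an illustrative sampling UDF, not the paper's)."""
    digest = hashlib.md5(str(record_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % k

def bagging_counts(num_bags=10, rng=None):
    """Number of times a record appears in each of num_bags bootstrap
    samples, approximated in a single pass by Poisson(1) counts (a common
    trick for distributed bagging; assumed here, not taken from the paper)."""
    rng = rng or np.random.default_rng()
    return rng.poisson(1.0, size=num_bags)

# Example: records streamed once, each tagged with its fold and bag counts.
for key in ("row-1", "row-2", "row-3"):
    print(key, kfold_fold(key), bagging_counts())
```

Both UDFs are stateless per record, which is what makes a single pass with minimal inter-machine communication possible.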
【Paper Link】 【Pages】:1203-1214
【Authors】: David B. Lomet ; Faisal Nawab
【Abstract】: Transaction time databases can be put to a number of valuable uses: auditing, regulatory compliance, readable backups, and enabling multi-version concurrency control. While additional storage for retaining multiple versions is unavoidable, compression and the declining cost of disk storage largely remove that impediment to supporting multi-version data. What has not been clear is whether effective indexing of historical versions can be achieved at high performance. This paper shows how temporal indexing can exploit the latch-free infrastructure provided for the Bw-tree by the LLAMA cache/storage subsystem to support high performance. Further, it demonstrates how the LLAMA mapping table can be exploited to simultaneously enable migration of historical data, e.g., to cloud storage, while overcoming the index node time splitting difficulty that has arisen in the past when historical nodes are migrated.
【Keywords】: cache storage; concurrency control; database indexing; temporal databases; transaction processing; Bw-tree; LLAMA cache-storage subsystem; LLAMA mapping table; cloud storage; disk storage; historical data migration; historical nodes; index node time splitting difficulty; latch-free infrastructure; temporal indexing; transaction time databases; Buffer storage; Cache storage; Cleaning; Hardware; Indexing; Modern hardware; latch-free; log structured; multi-version indexing; temporal database
【Paper Link】 【Pages】:1215-1226
【Authors】: Abdeltawab M. Hendawi ; Jie Bao ; Mohamed F. Mokbel ; Mohamed H. Ali
【Abstract】: Predictive queries on moving objects offer an important category of location-aware services based on the objects' expected future locations. A wide range of applications utilize this type of services, e.g., traffic management systems, location-based advertising, and ride sharing systems. This paper proposes a novel index structure, named Predictive tree (P-tree), for processing predictive queries against moving objects on road networks. The predictive tree: (1) provides a generic infrastructure for answering the common types of predictive queries including predictive point, range, KNN, and aggregate queries, (2) updates the probabilistic prediction of the object's future locations dynamically and incrementally as the object moves around on the road network, and (3) provides an extensible mechanism to customize the probability assignments of the object's expected future locations, with the help of user defined functions. The proposed index enables the evaluation of predictive queries in the absence of the objects' historical trajectories. Based solely on the connectivity of the road network graph and assuming that the object follows the shortest route to destination, the predictive tree determines the reachable nodes of a moving object within a specified time window T in the future. The predictive tree prunes the space around each moving object in order to reduce computation, and increase system efficiency. Tunable threshold parameters control the behavior of the predictive trees by trading the maximum prediction time and the details of the reported results on one side for the computation and memory overheads on the other side. The predictive tree is integrated in the context of the iRoad system in two different query processing modes, namely, the precomputed query result mode, and the on-demand query result mode. Extensive experimental results based on large scale real and synthetic datasets confirm that the predictive tree achieves better accuracy compared to the existing related work, and scales up to support a large number of moving objects and heavy predictive query workloads.
【Keywords】: mobile computing; pattern classification; query processing; road traffic; tree data structures; KNN; P-tree; aggregate queries; iRoad system; location-aware services; maximum prediction time; moving objects; predictive point; predictive queries; predictive tree; probabilistic prediction; probability assignments; query processing; road network graph; tunable threshold parameters; user defined functions; Artificial neural networks; Indexes; Prediction algorithms; Predictive models; Query processing; Roads; Trajectory
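The reachability step the predictive tree builds on, finding the nodes reachable within a time window T under the shortest-route assumption, is essentially a bounded shortest-path expansion. The sketch below shows only that step, with our own graph representation; the index itself adds incremental maintenance and probability assignment on top.

```python
import heapq

def reachable_within(graph, source, horizon):
    """Nodes reachable from `source` within travel time `horizon`.

    `graph` maps node -> list of (neighbor, travel_time). This is plain
    bounded Dijkstra, sketched for illustration; it is not the P-tree's
    incremental construction.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd <= horizon and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist  # node -> earliest arrival time within the horizon

# Example: a tiny road network with travel times in minutes.
g = {"a": [("b", 3), ("c", 10)], "b": [("c", 2)], "c": []}
print(reachable_within(g, "a", horizon=6))  # {'a': 0.0, 'b': 3.0, 'c': 5.0}
```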
【Paper Link】 【Pages】:1227-1238
【Authors】: Victor Alvarez ; Stefan Richter ; Xiao Chen ; Jens Dittrich
【Abstract】: With prices of main memory constantly decreasing, people nowadays are more interested in performing their computations in main memory, leaving the high I/O costs of traditional disk-based systems out of the equation. This change of paradigm, however, represents new challenges to the way data should be stored and indexed in main memory in order to be processed efficiently. Traditional data structures, like the venerable B-tree, were designed to work on disk-based systems, but they are no longer the way to go in main-memory systems, at least not in their original form, due to the poor cache utilization of the systems they run on. Because of this, during the last decade there has been a considerable amount of research on index data structures for main-memory systems. Among the most recent and most interesting data structures for main-memory systems is the recently proposed adaptive radix tree ARTful (ART for short). The authors of ART presented experiments that indicate that ART was clearly a better choice over other recent tree-based data structures like FAST and B+-trees. However, ART was not the first adaptive radix tree. To the best of our knowledge, the first was the Judy Array (Judy for short), and a comparison between ART and Judy was not shown. Moreover, the same set of experiments indicated that only a hash table was competitive to ART. The hash table used by the authors of ART in their study was a chained hash table, but this kind of hash table can be suboptimal in terms of space and performance due to its potentially high use of pointers. In this paper we present a thorough experimental comparison between ART, Judy, two variants of hashing via quadratic probing, and three variants of Cuckoo hashing. These hashing schemes are known to be very efficient. For our study we consider whether the data structures are to be used as a non-covering index (relying on an additional store), or as a covering index (covering key-value pairs). We consider both OLAP and OLTP scenarios. Our experiments strongly indicate that neither ART nor Judy is competitive to the aforementioned hashing schemes in terms of performance, and, in the case of ART, sometimes not even in terms of space.
【Keywords】: cache storage; cryptography; disc drives; ARTful; Cuckoo hashing; Judy Array; OLAP; OLTP; adaptive radix tree; adaptive radix trees; cache utilization; disk based systems; hash tables; index data structures; key value pairs; main memory; main memory systems; quadratic probing; tree based data structures; venerable B-tree; Arrays; Google; Indexes; Robustness; Subspace constraints
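One of the hashing schemes in the comparison above, quadratic probing, can be sketched in a few lines. This is a toy open-addressing table for illustration only (power-of-two capacity, no resizing, no deletion), not the benchmark code used in the study.

```python
class QuadraticProbingTable:
    """Tiny open-addressing hash table with quadratic (triangular) probing.

    With a power-of-two capacity and triangular increments, the probe
    sequence visits every slot, so lookups terminate. A minimal sketch.
    """
    def __init__(self, capacity=1024):
        assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
        self.cap = capacity
        self.slots = [None] * capacity  # each slot holds (key, value) or None

    def _probe(self, key):
        h = hash(key)
        for i in range(self.cap):
            yield (h + i * (i + 1) // 2) & (self.cap - 1)

    def put(self, key, value):
        for idx in self._probe(key):
            if self.slots[idx] is None or self.slots[idx][0] == key:
                self.slots[idx] = (key, value)
                return
        raise RuntimeError("table full")

    def get(self, key):
        for idx in self._probe(key):
            entry = self.slots[idx]
            if entry is None:
                return None
            if entry[0] == key:
                return entry[1]
        return None
```

Unlike a chained table, this layout stores entries inline in one array, which is the pointer-free, cache-friendly property the paper's comparison is about.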
【Paper Link】 【Pages】:1239-1245
【Authors】: Per-Åke Larson ; Eric N. Hanson ; Mike Zwilling
【Abstract】: The basic architecture of SQL Server, as well as that of other major database systems, goes back to a time when main memories were (very) small, data lived on disk, machines had a single (slow) processor, and OLTP was the only workload that mattered. This is not an optimal design for today's environment, with large main memories, plenty of cores, and where transactional and analytical processing are equally important. To adapt to these trends and take advantage of the opportunities they offer, SQL Server has added support for column store indexes and in-memory tables over the last two releases. The two features are aimed at dramatically improving performance on analytical and transactional workloads, respectively. This paper gives an overview of the design of the two features and the performance improvements they provide.
【Keywords】: SQL; data mining; transaction processing; OLTP; SQL server architecture; analytical processing; database systems; inmemory tables; modern hardware trends; optimal design; transactional processing; Buffer storage; Concurrency control; Dictionaries; Engines; Indexes; Market research; Servers
【Paper Link】 【Pages】:1246-1252
【Authors】: Ronald Barber ; Guy M. Lohman ; Vijayshankar Raman ; Richard Sidle ; Sam Lightstone ; Berni Schiefer
【Abstract】: Although the DRAM for main memories of systems continues to grow exponentially according to Moore's Law and to become less expensive, we argue that memory hierarchies will always exist for many reasons, both economic and practical, and in particular due to concurrent users competing for working memory to perform joins and grouping. We present the in-memory BLU Acceleration used in IBM's DB2 for Linux, UNIX, and Windows, and now also the dashDB cloud offering, which was designed and implemented from the ground up to exploit main memory but is not limited to what fits in memory and does not require manual management of what to retain in memory, as its competitors do. In fact, BLU Acceleration views memory as too slow, and is carefully engineered to work in higher levels of the system cache by keeping the data encoded and packed densely into bit-aligned vectors that can exploit SIMD instructions in processing queries. To achieve scalable multi-core parallelism, BLU assigns to each thread independent data structures, or partitions thereof, designed to have low synchronization costs, and doles out batches of values to threads. On customer workloads, BLU has improved performance on complex analytics queries by 10 to 50 times, compared to the legacy row-organized run-time, while also significantly simplifying database administration, shortening time to value, and improving data compression. UPDATE and DELETE performance was improved by up to 112 times with the new Cancun release of DB2 with BLU Acceleration, which also added Shadow Tables for high performance on mixed OLTP and BI analytics workloads, and extended DB2's High Availability Disaster Recovery (HADR) and SQL compatibility features to BLU's column-organized tables.
【Keywords】: DRAM chips; Linux; SQL; data compression; data structures; database management systems; multiprocessing systems; parallel architectures; query processing; storage management; synchronisation; BI analytic workload; BLU column-organized table; Cancun; DRAM; HADR; IBM DB2; Linux; Moore Law; OLTP; SIMD instruction; SQL compatibility feature; Shadow Tables; UNIX; Windows; bit-aligned vector; complex analytic query; dashDB cloud offering; data compression; data structure; database administration; hardware architecture; high availability disaster recovery; in-memory BLU acceleration; legacy row-organized run-time; memory hierarchies; multicore parallelism; query processing; synchronization cost; Acceleration; Hardware; Indexes; Memory management; Random access memory; BLU; Business Intelligence; DB2; SIMD; analytics; cache-conscious; compression; dashDB; in-memory; multi-core; query processing
【Paper Link】 【Pages】:1253-1258
【Authors】: Tirthankar Lahiri ; Shasank Chavan ; Maria Colgan ; Dinesh Das ; Amit Ganesh ; Mike Gleeson ; Sanket Hase ; Allison Holloway ; Jesse Kamp ; Teck-Hua Lee ; Juan Loaiza ; Neil MacNaughton ; Vineet Marwah ; Niloy Mukherjee ; Atrayee Mullick ; Sujatha Muthulingam ; Vivekanandhan Raja ; Marty Roth ; Ekrem Soylemez ; Mohamed Zaït
【Abstract】: The Oracle Database In-Memory Option allows Oracle to function as the industry-first dual-format in-memory database. Row formats are ideal for OLTP workloads which typically use indexes to limit their data access to a small set of rows, while column formats are better suited for Analytic operations which typically examine a small number of columns from a large number of rows. Since no single data format is ideal for all types of workloads, our approach was to allow data to be simultaneously maintained in both formats with strict transactional consistency between them.
【Keywords】: data mining; database indexing; information retrieval; storage management; transaction processing; OLTP workload; Oracle database in-memory; analytic operation; column format; data access; dual format in-memory database; index; row format; transactional consistency; Acceleration; Buffer storage; Indexes; Memory management; Optimization; Servers
【Paper Link】 【Pages】:1259-1267
【Authors】: Franz Faerber ; Jonathan Dees ; Martin Weidner ; Stefan Bäuerle ; Wolfgang Lehner
【Abstract】: Over the years, data management has diversified and moved into multiple directions, mainly caused by a significant growth in the application space with different usage patterns, a massive change in the underlying hardware characteristics, and, last but not least, growing data volumes to be processed. A solution matching these constraints has to cope with a multidimensional problem space including techniques dealing with a large number of domain-specific data types, data and consistency models, deployment scenarios, and processing, storage, and communication infrastructures on a hardware level. Specialized database engines are available and are positioned in the market, optimizing a particular dimension on the one hand while relaxing other aspects (e.g. web-scale deployment with relaxed consistency). Today it is common sense that there is no single engine which can handle all the different dimensions equally well, and therefore we have very good reasons to tackle this problem and optimize the dimensions with specialized approaches in a first step. However, we argue for a second step (in our opinion, the even harder problem) of a deep integration of individual engines into a single coherent and consistent data management ecosystem providing not only shared components but also a common understanding of the overall business semantics. More specifically, a data management ecosystem provides common “infrastructure” for software and data life cycle management, backup/recovery, replication and high availability, accounting and monitoring, and many other operational topics, where administrators and users expect a harmonized experience. More importantly, from an application perspective, customer experience teaches us to provide a consistent business view across all different components and the ability to seamlessly combine different capabilities. For example, within recent customer-based Internet of Things scenarios, a huge potential exists in combining graph-processing functionality with temporal and geospatial information and keywords extracted from high-throughput Twitter streams. Using SAP HANA as the running example, we want to demonstrate what moving a set of individual engines and infrastructural components towards a holistic but also flexible data management ecosystem could look like. Although some solutions for some problems are already visible on the horizon, we encourage the database research community in general to focus more on the Big Picture, providing a holistic/integrated approach to efficiently deal with different types of data, with different access methods, and different consistency requirements; research in this field would push the envelope far beyond the traditional notion of data management.
【Keywords】: Internet; Internet of Things; data handling; database management systems; information retrieval; social networking (online); software management; SAP HANA; Web-scale data management ecosystem; application space; backup; business semantics; coherent data management ecosystem; communication infrastructures; consistency model; consistent data management ecosystem; customer-based Internet of Things scenarios; data life cycle management; data model; database engines; deployment scenarios; domain-specific data types; geospatial information extraction; graph- processing functionality; high-throughput Twitter streams; keyword extraction; multidimensional problem space; recovery; replication; software management; temporal information extraction; usage patterns; Biological system modeling; Business; Data models; Databases; Ecosystems; Engines; Planning
【Paper Link】 【Pages】:1268-1279
【Authors】: Hao Zhang ; Gang Chen ; Beng Chin Ooi ; Weng-Fai Wong ; Shensen Wu ; Yubin Xia
【Abstract】: The increase in the capacity of main memory coupled with the decrease in cost has fueled the development of in-memory database systems that manage data entirely in memory, thereby eliminating the disk I/O bottleneck. However, as we shall explain, in the Big Data era, maintaining all data in memory is impossible, and even unnecessary. Ideally, we would like to have the high access speed of memory, with the large capacity and low price of disk. This hinges on the ability to effectively utilize both main memory and disk. In this paper, we analyze state-of-the-art approaches to achieving this goal for in-memory databases, collectively called “Anti-Caching” to distinguish them from traditional caching mechanisms. We conduct extensive experiments to study the effect of each fine-grained component of the entire “Anti-Caching” process on both performance and prediction accuracy. To avoid interference from unrelated components of specific systems, we implement these approaches on a uniform platform to ensure a fair comparison. We also study the usability of each approach, and how intrusive it is to the systems that intend to incorporate it. Based on our findings, we propose some guidelines on designing a good “Anti-Caching” approach, and sketch a general and efficient approach which can be utilized in most in-memory database systems without much code modification.
【Keywords】: cache storage; data handling; database management systems; anticaching based elastic memory management; big data; caching mechanisms; code modification; data management; disk I/O bottleneck; fine grained component; inmemory database systems; main memory; Big data; Database systems; Linux; Memory management; Real-time systems; Semantics
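The generic eviction/un-eviction cycle that anti-caching approaches share can be pictured with a toy store that keeps hot tuples in memory and pushes cold ones to a disk tier. The LRU policy, names, and in-process "disk" dictionary below are our own simplifications, not any specific system's design.

```python
from collections import OrderedDict

class AntiCachingStore:
    """Minimal anti-caching sketch: hot tuples in memory, cold ones on "disk".

    Illustrates the eviction/un-eviction cycle the survey studies, using
    LRU as the coldness predictor; real systems differ in granularity,
    logging, and how evicted tuples are tracked.
    """
    def __init__(self, memory_budget):
        self.memory_budget = memory_budget
        self.hot = OrderedDict()   # in-memory tuples, kept in LRU order
        self.cold = {}             # stand-in for the on-disk block store

    def put(self, key, tup):
        self.hot[key] = tup
        self.hot.move_to_end(key)
        while len(self.hot) > self.memory_budget:
            victim, value = self.hot.popitem(last=False)  # evict coldest tuple
            self.cold[victim] = value

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)            # refresh recency
            return self.hot[key]
        if key in self.cold:                     # "un-evict" on access
            self.put(key, self.cold.pop(key))
            return self.hot[key]
        return None
```

The fine-grained components the paper dissects (how coldness is predicted, at what granularity data is evicted, how accesses to evicted data are detected) all correspond to pieces of this loop.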
【Paper Link】 【Pages】:1280-1291
【Authors】: Robert Brunel ; Jan Finis ; Gerald Franz ; Norman May ; Alfons Kemper ; Thomas Neumann ; Franz Färber
【Abstract】: Managing hierarchies is an ever-recurring challenge for relational database systems. Through investigations of customer scenarios at SAP we found that today's RDBMSs still leave a lot to be desired in order to meet the requirements of typical applications. Our research puts a new twist on handling hierarchies in SQL-based systems. We present an approach for modeling hierarchical data natively, and we extend the SQL language with expressive constructs for creating, manipulating, and querying a hierarchy. The constructs can be evaluated efficiently by leveraging existing indexing and query processing techniques. We demonstrate the feasibility of our concepts with initial measurements on a HANA-based prototype.
【Keywords】: SQL; database indexing; relational databases; HANA-based prototype; RDBMS; SAP HANA; SQL language; SQL-based systems; hierarchical data handling; hierarchical data modeling; indexing techniques; query processing techniques; relational database systems; Bills of materials; Data models; Encoding; Engines; Indexing; Syntactics; XML
【Paper Link】 【Pages】:1292-1303
【Authors】: Akihiro Yamaguchi ; Yukikazu Nakamoto ; Kenya Sato ; Yoshiharu Ishikawa ; Yousuke Watanabe ; Shinya Honda ; Hiroaki Takada
【Abstract】: Data stream management systems (DSMSs) are useful for the management and processing of continuous data at a high input rate with low latency. In the automotive domain, embedded systems use a variety of sensor data and communications from outside the vehicle to promote autonomous and safe driving. Thus, the software developed for these systems must be capable of handling large volumes of data and complex processing. At present, we are developing a platform for the integration and management of data in an automotive embedded system using a DSMS. However, compared with conventional DSMS fields, we have encountered new challenges such as precompiling queries when designing automotive systems (which demands time predictability), distributed stream processing in in-vehicle networks, and real-time scheduling and sensor data fusion by stream processing. Therefore, we developed an automotive embedded DSMS (AEDSMS) to address these challenges. The main contributions of the present study are: (1) a clear understanding of the challenges faced when introducing DSMSs into the automotive field; (2) the development of AEDSMS to tackle these challenges; and (3) an evaluation of AEDSMS during runtime using a driving assistance application.
【Keywords】: automobiles; road safety; software engineering; traffic engineering computing; AEDSMS; automotive embedded data stream management system; distributed stream processing; driving assistance application; in-vehicle networks; real-time scheduling; sensor data fusion; Aggregates; Lasers; Reliability; Sensor systems; Software; Vehicles
【Paper Link】 【Pages】:1304-1315
【Authors】: Aditi Pandit ; Derrick Kondo ; David E. Simmen ; Anjali Norwood ; Tongxin Bai
【Abstract】: The volume, velocity, and variety of Big Data necessitate the development of new and innovative data processing software. A multitude of SQL implementations on distributed systems have emerged in recent years to enable large-scale data analysis. User-Defined Table operators (written in procedural languages) embedded in these SQL implementations are a powerful mechanism to succinctly express and perform analytic operations typical in Big Data discovery workloads. Table operators can be easily customized to implement different processing models such as map, reduce and graph execution. Despite an inherently parallel execution model, the performance and scalability of these table operators is greatly restricted as they appear as a black box to a typical SQL query optimizer. The optimizer is not able to infer even the basic properties of table operators, prohibiting the application of optimization rules and strategies. In this paper, we introduce an innovative concept of “Collaborative Planning”, which results in the removal of redundant operations and a more optimal rearrangement of query plan operators. The optimization of the query proceeds through a collaborative exchange between the planner and the table operator. Plan properties and context information of surrounding query plan operations are exchanged between the optimizer and the table operator. Knowing these properties also allows the author of the table operator to optimize its embedded logic. Our main contribution in this paper is the design and implementation of Collaborative Planning in the Teradata Aster 6 system. Using real-world workloads, we show that Collaborative Planning reduces query execution times as much as 90.0% in common use cases, resulting in a 24x speedup.
【Keywords】: Big Data; SQL; groupware; query processing; SQL query optimizer; Teradata Aster 6 system; collaborative exchange; collaborative planning; innovative data processing software; large-scale Big data analysis; query plan operators; user-defined table operators; Big data; Collaboration; Context; Contracts; Optimization; Planning; Runtime
【Paper Link】 【Pages】:1316-1327
【Authors】: Douglas Lee Schales ; Xin Hu ; Jiyong Jang ; Reiner Sailer ; Marc Ph. Stoecklin ; Ting Wang
【Abstract】: In this paper, we present the design, architecture, and implementation of a novel analysis engine, called Feature Collection and Correlation Engine (FCCE), that finds correlations across a diverse set of data types spanning over large time windows with very small latency and with minimal access to raw data. FCCE scales well to collecting, extracting, and querying features from geographically distributed large data sets. FCCE has been deployed in a large production network with over 450,000 workstations for 3 years, ingesting more than 2 billion events per day and providing low latency query responses for various analytics. We explore two security analytics use cases to demonstrate how we utilize the deployment of FCCE on large diverse data sets in the cyber security domain: 1) detecting fluxing domain names of potential botnet activity and identifying all the devices in the production network querying these names, and 2) detecting advanced persistent threat infection. Both evaluation results and our experience with real-world applications show that FCCE yields superior performance over existing approaches, and excels in the challenging cyber security domain by correlating multiple features and deriving security intelligence.
【Keywords】: Big Data; feature selection; query processing; security of data; FCCE; advanced persistent threat infection detection; cyber security domain; fluxing domain name detection; geographically distributed large data sets; highly scalable distributed feature collection and correlation engine analysis engine; low latency Big Data analytics; low latency query responses; production network query; security analytics; security intelligence; time windows; Computer security; Correlation; Data mining; Distributed databases; Feature extraction; IP networks; Real-time systems
【Paper Link】 【Pages】:1328-1339
【Authors】: Ioannis Koltsidas ; Slavisa Sarafijanovic ; Martin Petermann ; Nils Haustein ; Harald Seipp ; Robert Haas ; Jens Jelitto ; Thomas Weigold ; Edwin R. Childers ; David Pease ; Evangelos Eleftheriou
【Abstract】: The explosion of data volumes in enterprise environments and limited budgets have triggered the need for multi-tiered storage systems. With the bulk of the data being extremely infrequently accessed, tape is a natural fit for storing such data. In this paper we present our approach to a file storage system that seamlessly integrates disk and tape, enabling a bottomless and cost-effective storage architecture that can scale to accommodate Big Data requirements. The proposed system offers access to data through a POSIX filesystem interface under a single global namespace, optimizing the placement of data across disk and tape tiers. Using a self-contained, standardized and open filesystem format on the removable tape media, the proposed system avoids dependence on proprietary software and external metadata servers to access the data stored on tape. By internally managing the tape tier resources, such as tape drives and cartridges, the system relieves the user from the burden of dealing with the complexities of tape storage. Our implementation, which is based on the GPFS and LTFS filesystems, demonstrates the applicability of the proposed architecture in real-world environments. Our experimental evaluation has shown that this is a very promising approach in terms of scalability, performance and manageability. The proposed system has been productized by IBM as LTFS Enterprise Edition.
【Keywords】: Unix; disc storage; distributed databases; file organisation; network operating systems; parallel processing; Big Data requirements; GPFS filesystems; IBM; LTFS Enterprise Edition; LTFS filesystems; POSIX filesystem interface; disk; file storage system; general parallel file system; linear tape file system; multitiered distributed file system; open filesystem format; removable tape media; self-contained filesystem format; storage architecture; Computer architecture; Data transfer; File systems; Indexes; Libraries; Standards; Archival Storage; GPFS; LTFS; Multi-tiered Storage; Tape Storage
【Paper Link】 【Pages】:1340-1351
【Authors】: Songlin Hu ; Wantao Liu ; Tilmann Rabl ; Shuo Huang ; Ying Liang ; Zheng Xiao ; Hans-Arno Jacobsen ; Xubin Pei ; Jiye Wang
【Abstract】: Hive is the most mature and prevalent data warehouse tool providing a SQL-like interface in the Hadoop ecosystem. It is successfully used in many Internet companies and shows its value for big data processing in traditional industries. However, enterprise big data processing systems, as in Smart Grid applications, usually require complicated business logic and involve many data manipulation operations such as updates and deletes. Hive cannot offer sufficient support for these while preserving high query performance. Hive using the Hadoop Distributed File System (HDFS) for storage cannot implement data manipulation efficiently, and Hive on HBase suffers from poor query performance even though it can support faster data manipulation. There is a project based on Hive issue Hive-5317 to support update operations, but it has not been finished in Hive's latest version. Since this ACID-compliant extension adopts the same data storage format on HDFS, the update performance problem is not solved. In this paper, we propose a hybrid storage model called DualTable, which combines the efficient streaming reads of HDFS and the random write capability of HBase. Hive on DualTable provides better data manipulation support and preserves query performance at the same time. Experiments on a TPC-H data set and on a real smart grid data set show that Hive on DualTable is up to 10 times faster than Hive when executing update and delete operations.
【Keywords】: SQL; data handling; data warehouses; distributed databases; network operating systems; parallel processing; query processing; storage management; ACID compliant extension; DualTable; HBase; HDFS; Hadoop distributed file system; Hadoop ecosystem; Hive-5317; Internet company; SQL-like interface; TPC-H dataset; data manipulation operations; data storage format; data warehouse tool; enterprise Big Data processing systems; hybrid storage model; query performance; real smart grid data set; update optimization; Business; Cloud computing; Data models; Data processing; Smart grids; Smart meters; Writing
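The hybrid storage idea above can be pictured as a dispatcher that serves scans from an immutable, append-only base store (HDFS-like) while routing updates and deletes to a random-write delta store (HBase-like) that is consulted at read time. The sketch below is our own illustration of that split, not DualTable's implementation.

```python
class HybridTable:
    """Toy dual-store table: an append-only base (stand-in for HDFS files)
    plus a mutable delta (stand-in for HBase) holding updates and deletes.
    A sketch of the general idea, not DualTable's actual design."""

    _DELETED = object()

    def __init__(self, base_rows):
        self.base = dict(base_rows)   # written once, read via streaming scans
        self.delta = {}               # random writes: updates and delete markers

    def update(self, key, row):
        self.delta[key] = row

    def delete(self, key):
        self.delta[key] = self._DELETED

    def scan(self):
        """Merge the immutable base with the delta at read time."""
        for key, row in self.base.items():
            override = self.delta.get(key)
            if override is self._DELETED:
                continue
            yield key, override if override is not None else row
        for key, row in self.delta.items():
            if key not in self.base and row is not self._DELETED:
                yield key, row
```

The point of the split is that updates and deletes never rewrite the bulk HDFS data, while scans still stream it sequentially and only patch in the (much smaller) delta.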
【Paper Link】 【Pages】:1352-1363
【Authors】: Ahmed Eldawy ; Mohamed F. Mokbel
【Abstract】: This paper describes SpatialHadoop, a full-fledged MapReduce framework with native support for spatial data. SpatialHadoop is a comprehensive extension to Hadoop that injects spatial data awareness into each Hadoop layer, namely the language, storage, MapReduce, and operations layers. In the language layer, SpatialHadoop adds a simple and expressive high-level language for spatial data types and operations. In the storage layer, SpatialHadoop adapts traditional spatial index structures, Grid, R-tree and R+-tree, to form a two-level spatial index. SpatialHadoop enriches the MapReduce layer with two new components, SpatialFileSplitter and SpatialRecordReader, for efficient and scalable spatial data processing. In the operations layer, SpatialHadoop is already equipped with a dozen operations, including range query, kNN, and spatial join. Other spatial operations are also implemented following a similar approach. Extensive experiments on a real system prototype and real datasets show that SpatialHadoop achieves orders of magnitude better performance than Hadoop for spatial data processing.
【Keywords】: data handling; spatial data structures; MapReduce framework; R-tree; SpatialFileSplitter; SpatialHadoop; SpatialRecordReader; expressive high level language; kNN; language layer; range query; spatial data; spatial data awareness; spatial data processing; spatial index structures; two-level spatial index; Buildings; Computer architecture; Indexing; Partitioning algorithms; Spatial databases; Spatial indexes
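The global level of a two-level spatial index can be as simple as a uniform grid that maps each point to a partition; the sketch below shows only that step, with made-up cell sizes. SpatialHadoop's actual Grid, R-tree, and R+-tree indexes are considerably more elaborate.

```python
def grid_cell(x, y, min_x, min_y, cell_size, cells_per_row):
    """Assign a point to a uniform grid cell id (a toy version of the
    global partitioning step of a two-level spatial index)."""
    col = int((x - min_x) // cell_size)
    row = int((y - min_y) // cell_size)
    return row * cells_per_row + col

def partition(points, min_x=0.0, min_y=0.0, cell_size=10.0, cells_per_row=100):
    """Group points by grid cell; in a SpatialHadoop-style layout each group
    would become one storage partition (parameters are illustrative)."""
    parts = {}
    for x, y in points:
        cell = grid_cell(x, y, min_x, min_y, cell_size, cells_per_row)
        parts.setdefault(cell, []).append((x, y))
    return parts

# Example: three points, two of which land in the same cell.
print(partition([(1.0, 2.0), (3.5, 4.0), (25.0, 40.0)]))
```

A component in the role of the SpatialFileSplitter can then prune whole partitions whose cells do not intersect a query range, which is where the speedups over plain Hadoop come from.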
【Paper Link】 【Pages】:1364-1375
【Authors】: Petko Bakalov ; Erik G. Hoel ; Wee-Liang Heng
【Abstract】: Network data models are frequently used as a mechanism to solve a wide range of problems typical of GIS applications, and transportation planning in particular. They do this by modelling the two most important aspects of such systems: connectivity and attribution. For a long time, attributes such as travel time associated with a transportation network model have been considered static. With the advancement of technology, data vendors now have the capability to capture more accurate information about street speeds at different times of the day and provide this data to customers. The network attributes are no longer static but associated with a particular time instance (i.e., time-dependent). In this paper we describe our time-dependent network model tailored towards the needs of transportation network modelling. Our solution is based on the existing database functionality (tables, joins, sorting algorithms) provided by a standard relational DBMS, and it has been implemented, tested, and is currently being shipped as part of the ESRI ArcGIS 10.1 platform and all subsequent releases.
【Keywords】: geographic information systems; relational databases; GIS applications; database functionality; network data model; relational DBMS; time dependent transportation network models; transportation planning; Algorithm design and analysis; Analytical models; Data models; Geometry; Junctions; Roads
【Paper Link】 【Pages】:1376-1387
【Authors】: Yanhua Li ; Jun Luo ; Chi-Yin Chow ; Kam-Lam Chan ; Ye Ding ; Fan Zhang
【Abstract】: Electric vehicles (EVs) have undergone an explosive increase over recent years, due to their unparalleled advantages over gasoline cars in green transportation and cost efficiency. Such a drastic increase drives a growing need for widely deployed, publicly accessible charging stations. Thus, how to strategically deploy charging stations and charging points becomes an emerging and challenging question for urban planners and electric utility companies. In this paper, by analyzing a large-scale electric taxi trajectory dataset, we make the first attempt to investigate this problem. We develop an optimal charging station deployment (OCSD) framework that takes the historical EV taxi trajectory data, road map data, and existing charging station information as input, and performs optimal charging station placement (OCSP) and optimal charging point assignment (OCPA). The OCSP and OCPA optimization components are designed to minimize the average time to the nearest charging station and the average waiting time for an available charging point, respectively. To evaluate the performance of our OCSD framework, we conduct experiments on one month of real EV taxi trajectory data. The evaluation results demonstrate that our OCSD framework can achieve a 26%-94% reduction in the average time to find a charging station, and up to two orders of magnitude reduction in waiting time before charging, over baseline methods. Moreover, our results reveal interesting insights in answering the question “Super or small stations?”: when the number of deployable charging points is sufficiently large, more small stations are preferred; and when there are relatively few charging points to deploy, super stations are a wiser choice.
【Keywords】: automobiles; electric vehicles; EV; OCPA optimization; OCSD framework; OCSP optimization; charging station network; cost efficiency; electric utility company; electric vehicle; green transportation; large scale electric taxi trajectory data; optimal charging point assignment; optimal charging station deployment; optimal charging station placement; trajectory data analytics; urban planners; Charging stations; Cities and towns; Electric vehicles; Global Positioning System; Petroleum; Roads; Trajectory
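The placement objective stated above, minimizing the average time to the nearest charging station, admits a simple greedy baseline: repeatedly add the candidate site that most reduces the average distance. The sketch below shows that baseline with Euclidean distances and invented inputs; it is only a point of reference, not the paper's OCSP/OCPA optimization.

```python
import math

def avg_nearest(demands, stations):
    """Average distance from demand points to their nearest station."""
    if not stations:
        return float("inf")
    return sum(min(math.dist(d, s) for s in stations) for d in demands) / len(demands)

def greedy_place(demands, candidates, k):
    """Greedily pick k station sites minimizing the average nearest distance
    (a simple baseline for illustration, not the OCSD framework)."""
    chosen, remaining = [], list(candidates)
    for _ in range(k):
        best = min(remaining, key=lambda c: avg_nearest(demands, chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Example with toy coordinates (could stand in for trip destinations).
demands = [(0, 0), (10, 0), (0, 10)]
candidates = [(0, 0), (5, 5), (10, 10)]
print(greedy_place(demands, candidates, k=2))
```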
【Paper Link】 【Pages】:1388-1399
【Authors】: Abhay Jha ; Shubhankar Ray ; Brian Seaman ; Inderjit S. Dhillon
【Abstract】: Forecasting accurately is essential to successful inventory planning in retail. Unfortunately, there is not always enough historical data to forecast items individually; this is particularly true in e-commerce, where there is a long tail of low-selling items, and items are introduced and phased out quite frequently, unlike in physical stores. In such scenarios, it is preferable to forecast items in well-designed groups of similar items, so that data for different items can be pooled together to fit a single model. In this paper, we first discuss the desiderata for such a grouping and how it differs from the traditional clustering problem. We then describe our approach, which is a scalable local search heuristic that can naturally handle the constraints required in this setting, besides being capable of producing solutions competitive with well-known clustering algorithms. We also address the complementary problem of estimating similarity, particularly in the case of new items which have no past sales. Our solution is to regress the sales profile of items against their semantic features, so that given just the semantic features of a new item we can predict its relation to other items in terms of as yet unobserved sales. Our experiments demonstrate both the scalability of our approach and its implications for forecast accuracy.
【Keywords】: constraint handling; electronic commerce; forecasting theory; inventory management; pattern clustering; retail data processing; sales management; time series; clustering algorithms; complementary problem; constraint handling; e-commerce; inventory planning; retail; sales profile; scalable local search heuristics; semantic features; similarity estimation; sparse time-series data forecasting; Clustering algorithms; Correlation; Cost function; Data models; Forecasting; Robustness; Semantics
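For new items without sales history, the abstract describes regressing sales profiles on semantic item features. A minimal version of that step, using ridge regression from scikit-learn on made-up feature vectors, might look as follows; it illustrates the idea only and is not the paper's model.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Rows: existing items; columns: semantic features (e.g., category flags and
# price, encoded numerically). Targets: a summary of each item's sales profile.
X_train = np.array([[1.0, 0.0, 19.99],
                    [1.0, 1.0, 24.99],
                    [0.0, 1.0,  9.99]])
y_train = np.array([120.0, 95.0, 40.0])   # e.g., average weekly sales

model = Ridge(alpha=1.0).fit(X_train, y_train)

# A new item with no sales history: predict its sales level from features
# alone, which can then be used to relate it to existing item groups.
x_new = np.array([[0.0, 1.0, 14.99]])
print(model.predict(x_new))
```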
【Paper Link】 【Pages】:1400-1403
【Authors】: Shane Bracher ; Mark Holmes ; Liam Mischewski ; Asadul K. Islam ; Michael McClenaghan ; Daniel Ricketts ; Glenn Neuber ; Hoyoung Jeung ; Priya Vijayarajendran
【Abstract】: We present a demonstration based on SAP HANA to show a novel approach to churn risk scoring. Our focus is customer retention within a telecommunications setting. The purpose of this demonstration is to help identify customers who should be targeted for a customer retention marketing campaign. The data analysis considers multiple factors - churn likelihood (based on incoming and outgoing communications), customer influence (based on social connections) and the average revenue per customer. The results are presented using skyline visualization and advanced UI techniques to easily and intuitively interpret the analysis.
【Keywords】: customer services; data analysis; data visualisation; marketing data processing; social sciences computing; telecommunication industry; user interfaces; SAP HANA; advanced UI techniques; average revenue per customer; call network analysis; churn likelihood; churn risk scoring; customer influence; customer retention marketing campaign; data analysis; incoming communications; outgoing communications; skyline visualization; social connections; telecommunications setting; Analytical models; Companies; Data analysis; Data models; Electric breakdown; Social network services; Telecommunications
【Paper Link】 【Pages】:1404-1407
【Authors】: Hien To ; Gabriel Ghinita ; Cyrus Shahabi
【Abstract】: Spatial Crowdsourcing (SC) is a novel and transformative platform that engages individuals, groups and communities in the act of collecting, analyzing, and disseminating environmental, social and other spatio-temporal information. SC outsources a set of spatio-temporal tasks to a set of workers, i.e., individuals with mobile devices that perform the tasks by physically traveling to specified locations of interest. Protecting location privacy is an important concern in SC, as an adversary with access to individual whereabouts can infer sensitive details about a person (e.g., health status, political views). Due to the challenging nature of protecting worker privacy in SC, solutions for this problem are quite complex, and require tuning of several parameters to obtain satisfactory results. In this paper, we propose PrivGeoCrowd, a toolbox for interactive visualization and tuning of SC private task assignment methods. This toolbox is useful for several real-world entities that are involved in SC, such as: mobile phone operators that want to sanitize datasets with worker locations, spatial task requesters, and SC-service providers that match workers to tasks.
【Keywords】: data protection; geographic information systems; mobile computing; outsourcing; PrivGeoCrowd toolbox; SC private task assignment methods; SC-service providers; data analysis; data collection; data dissemination; environmental information; interactive visualization; location privacy protection; mobile devices; mobile phone operators; private spatial crowdsourcing; social information; spatial task requesters; spatio-temporal information; user health status; user political views; user sensitive detail inference; worker locations; Computer architecture; Crowdsourcing; Graphical user interfaces; Noise measurement; Privacy; Servers; Tuning
【Paper Link】 【Pages】:1408-1411
【Authors】: Feiran Huang ; Jia Li ; Jiaheng Lu ; Tok Wang Ling ; Zhaoan Dong
【Abstract】: In the world of academia, research documents enable the sharing and dissemination of scientific discoveries. In these “big data” times, academic search engines are widely used to find relevant research documents. In the domain of computer science, a researcher often inputs a query with a specific goal, such as finding an algorithm or a theorem. However, to date, most search engines return results simply as a list of related papers. Users have to browse the results, download the interesting papers and look for the desired information, which is laborious and inefficient. In this paper, we present a novel academic search system, called PandaSearch, that returns results through a fine-grained interface, where the results are organized by categories such as definitions, theorems, lemmas, algorithms and figures. The key technical challenges in our system include the automatic identification and extraction of different parts of a research document, the discovery of the main topic phrases for a definition or a theorem, and the recommendation of related definitions or figures to elegantly satisfy the search intention of users. Based on this, we have built a user-friendly search interface for users to conveniently explore the documents and find the relevant information.
【Keywords】: Big Data; query processing; search engines; user interfaces; Big Data; PandaSearch; academic search system; fine-grained academic search engine; fine-grained interface; research document; scientific discovery; user friendly search interface; Context; Detectors; Google; Portable document format; Search engines; XML
【Paper Link】 【Pages】:1412-1415
【Authors】: Chenjuan Guo ; Bin Yang ; Ove Andersen ; Christian S. Jensen ; Kristian Torp
【Abstract】: Reduction in greenhouse gas emissions from transportation attracts increasing interest from governments, fleet managers, and individual drivers. Eco-routing, which enables drivers to use eco-friendly routes, is a simple and effective approach to reducing emissions from transportation. We present EcoSky, a system that annotates edges of a road network with time dependent and uncertain eco-weights using GPS data and that supports different types of eco-routing. Basic eco-routing returns the most eco-friendly routes; skyline eco-routing takes into account not only fuel consumption but also travel time and distance when computing eco-routes; and personalized eco-routing considers each driver's past behavior and accordingly suggests different routes to different drivers.
【Keywords】: Global Positioning System; air pollution control; behavioural sciences; ecology; fuel economy; traffic engineering computing; vehicle routing; EcoSky; GPS data; driver past behavior; ecofriendly routes; fuel consumption; greenhouse gas emission reduction; personalized ecorouting; road network; skyline ecorouting; transportation; travel distance; travel time; uncertain ecoweights; vehicular environmental impact reduction; Fuels; Global Positioning System; Random variables; Roads; Routing; Three-dimensional displays; Vehicles
【Paper Link】 【Pages】:1416-1419
【Authors】: Amr Magdy ; Louai Alarabi ; Saif Al-Harthi ; Mashaal Musleh ; Thanaa M. Ghanem ; Sohaib Ghani ; Saleh M. Basalamah ; Mohamed F. Mokbel
【Abstract】: This paper demonstrates Taghreed, a full-fledged system for efficient and scalable querying, analyzing, and visualizing geotagged microblogs, such as tweets. Taghreed supports a wide variety of queries on all microblog attributes. In addition, it is able to manage a large number (billions) of microblogs for relatively long periods, e.g., months. Taghreed consists of four main components: (1) indexer, (2) query engine, (3) recovery manager, and (4) visualizer. The Taghreed indexer efficiently digests incoming microblogs with high arrival rates into lightweight main-memory indexes. When memory becomes full, the memory contents are flushed to disk indexes, which manage billions of microblogs efficiently. On memory failure, the recovery manager restores the memory contents from backup copies. The Taghreed query engine consists of two modules: a query optimizer and a query processor. The query optimizer generates an optimized query plan to be executed by the query processor to provide low-latency query responses. The Taghreed visualizer offers its users a wide variety of spatiotemporal queries and presents the answers on a map-based user interface that allows interactive exploration. Taghreed is the first system that addresses all these challenges collectively for geotagged microblog data. The system is demonstrated based on a real system implementation through different scenarios that show system functionality and internals.
【Keywords】: data visualisation; geographic information systems; query processing; social networking (online); user interfaces; Taghreed query engine; Tweets; geotagged microblogs visualization; map-based user interface; query processing; Data visualization; Engines; Indexing; Real-time systems; Twitter; Visualization
【Paper Link】 【Pages】:1420-1423
【Authors】: Shaoqing Wang ; Benyou Zou ; Cuiping Li ; Kankan Zhao ; Qiang Liu ; Hong Chen
【Abstract】: Reading news online has become popular, since news web sites provide access to articles from millions of sources around the world. For these news web sites, the key challenge is to help users find related news articles to read. In this paper, we present a system called CROWN (Context-aware RecOmmender for Web News) for Chinese news recommendation. The system retrieves fresh, relevant, and personalized news articles for mobile users according to their particular context. Unlike existing mobile news applications, which employ rather simple strategies for news recommendation, CROWN integrates contextual information into prediction by modeling the data as a tensor. Such context information typically includes time, location, etc. This demo paper presents the implementation of the complete news recommendation procedure in CROWN. Experimental results on a large corpus of newly published Chinese web news show that its performance is satisfactory.
【Keywords】: Web sites; information retrieval; mobile computing; recommender systems; CROWN; Chinese news recommendation; Web sites; context-aware recommender for Web news; contextual information; mobile users; personalized fresh news article retrieval; relevant news retrieval; Context; Context modeling; Data models; Mobile communication; Predictive models; Tensile stress; Tin
【Paper Link】 【Pages】:1424-1427
【Authors】: Lei Liu ; Georgia Koutrika ; Shanchan Wu
【Abstract】: Reading online content for educational, learning, training or recreational purposes has become a very popular activity. While reading, people may have difficulty understanding a passage or wish to learn more about the topics it covers, hence they may naturally seek additional or supplementary resources for that particular passage. These resources should be close to the passage both in terms of subject matter and reading level. However, using a search engine to find such resources interrupts the reading flow. It is also an inefficient, trial-and-error process, because existing web search and recommendation systems do not support large queries, do not understand semantic topics, and do not take into account the reading level of the original document a person is reading. In this demo, we present LearningAssistant, a novel system that enables online reading material to be smoothly enriched with additional resources that can supplement or explain any passage from the original material for a reader on demand. The system facilitates the learning process by recommending learning resources (documents, videos, etc.) for selected text passages of any length. The recommended resources are ranked based on two criteria: (a) how well they match the different topics covered within the selected passage, and (b) the reading level of the original text the selected passage comes from. User feedback from students who use our system for their courses in two real pilots, one with a high school and one with a university, suggests that our system is promising and effective.
【Keywords】: Internet; computer aided instruction; recommender systems; search engines; LearningAssistant; Web search; learning resource recommendation system; online content reading; online reading material; search engine; Books; Electronic publishing; Encyclopedias; Internet; Search engines; Videos
【Paper Link】 【Pages】:1428-1431
【Authors】: Gregor Jossé ; Maximilian Franzke ; Georgios Skoumas ; Andreas Züfle ; Mario A. Nascimento ; Matthias Renz
【Abstract】: Directions and paths, as commonly provided by route guidance systems, are usually derived considering absolute metrics, e.g., finding the shortest path within the underlying road network. This demo presents a framework which uses crowdsourced geospatial data to obtain paths that not only minimize travel time but also guide users along popular points of interest (POIs). By analyzing textual travel blog data and Flickr data, we define a measure of the popularity of POIs. This measure is used as an additional cost criterion in the underlying road network graph. Furthermore, we propose an approach that reduces the problem of finding paths which maximize popularity while minimizing travel time to the computation of bicriterion Pareto-optimal paths. The presented framework allows users to specify an origin and destination within a road network, returning the set of Pareto-optimal paths, or a subset thereof if a desired number of POIs along the path has been specified. Each of the returned routes is enriched with representative Flickr images and textual information from travel blogs. The framework and its results show that the computed paths yield competitive solutions in terms of travel time while also providing more “popular” paths, making routing easier and more informative for the user.
【Keywords】: Pareto optimisation; graph theory; network theory (graphs); outsourcing; social networking (online); Flickr data analysis; POI; absolute metrics; bicriterion Pareto optimal paths; cost criterion; crowdsourced geospatial data; path destination; path origin; point-of-interest; popular-path computation; road network graph; route guidance systems; shortest path finding; textual information; textual travel blog data analysis; travel time minimization; Blogs; Data mining; Estimation; Geospatial analysis; Pareto optimization; Roads; Routing
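The reduction to bicriterion Pareto-optimal paths comes down to keeping only routes that are not dominated on (travel time, popularity). A small dominance filter over candidate routes, with our own tuple representation, illustrates this; it is not the framework's path-search algorithm.

```python
def pareto_front(routes):
    """Keep routes not dominated on (travel_time, popularity): lower time is
    better, higher popularity is better. `routes` is a list of
    (route_id, travel_time, popularity) tuples (our own representation)."""
    front = []
    for rid, time, pop in routes:
        dominated = any(t <= time and p >= pop and (t < time or p > pop)
                        for _, t, p in routes)
        if not dominated:
            front.append((rid, time, pop))
    return front

# Example: route "b" is dominated by "a" (slower and less popular),
# so only "a" and "c" are returned to the user.
print(pareto_front([("a", 10, 5.0), ("b", 12, 4.0), ("c", 14, 7.0)]))
```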
【Paper Link】 【Pages】:1432-1435
【Authors】: Benjamin Djahandideh ; François Goasdoué ; Zoi Kaoudi ; Ioana Manolescu ; Jorge-Arnulfo Quiané-Ruiz ; Stamatis Zampetakis
【Abstract】: RDF is an increasingly popular data model for many practical applications, leading to large volumes of RDF data; efficient RDF data management methods are crucial to allow applications to scale. We propose to demonstrate CliqueSquare, an RDF data management system built on top of a MapReduce-like infrastructure. The main technical novelty of CliqueSquare resides in its logical query optimization algorithm, guaranteed to find a logical plan as flat as possible for a given query, meaning: a plan having the smallest possible number of join operators on top of each other. CliqueSquare's ability to build flat plans allows it to take advantage of a parallel processing framework in order to shorten response times. We demonstrate loading and querying the data, with a particular focus on query optimization, and on the performance benefits of CliqueSquare's flat plans.
【Keywords】: data handling; data models; parallel languages; query languages; query processing; CliqueSquare; MapReduce-like infrastructure; RDF data management methods; data model; logical query optimization algorithm; massively parallel RDF query; parallel processing framework; resource description framework; Distributed databases; Optimization; Parallel processing; Query processing; Resource description framework; Time factors
【Paper Link】 【Pages】:1436-1439
【Authors】: Xin Li ; Guandong Xu ; Enhong Chen ; Lin Li
【Abstract】: With the pervasive use of GPS-enabled smart phones, location-based services, e.g., Location-Based Social Networking (LBSN), have emerged. Point-of-Interest (POI) recommendation, a typical component of LBSN, provides additional value to both customers and merchants in terms of user experience and business turnover. Existing POI recommendation systems mainly adopt Collaborative Filtering (CF), which only exploits the ratings users give a merchant (i.e., overall evaluations) while disregarding differences in user preferences across the multiple aspects that commonly exist in real scenarios. Meanwhile, besides ratings, most LBSNs also provide a review function that allows customers to give their opinions when dealing with merchants, which is often overlooked by these recommender systems. In this demo, we present MARS, a novel POI recommender system based on multi-aspect user preference learning from reviews using utility theory. We first introduce the organization of our system, and then show how user preferences across multiple aspects are integrated into it, alongside several case studies of user preference mining and POI recommendation.
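The idea of multi-aspect utility can be sketched as a weighted sum of per-aspect ratings under user-specific aspect weights; the weights, ratings, and POIs below are invented for illustration and do not reflect MARS's actual preference-learning model.

```python
def utility(user_weights, poi_aspect_ratings):
    """Weighted-sum utility of a POI for a user over shared aspects."""
    return sum(user_weights.get(a, 0.0) * r for a, r in poi_aspect_ratings.items())

def recommend(user_weights, pois, k=2):
    """Return the top-k POIs by utility."""
    ranked = sorted(pois.items(), key=lambda kv: utility(user_weights, kv[1]), reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    user = {"food": 0.6, "service": 0.1, "price": 0.3}   # per-user aspect weights (hypothetical)
    pois = {
        "Cafe A": {"food": 4.5, "service": 3.0, "price": 2.0},
        "Diner B": {"food": 3.5, "service": 4.5, "price": 4.5},
    }
    print(recommend(user, pois))
```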
【Keywords】: collaborative filtering; recommender systems; social networking (online); ubiquitous computing; CF; GPS enabled smart phones; LBSN; MARS; POI recommendation systems; POI recommender system; business turnover; collaborative filtering; location based services; location based social networking; multiaspect recommender system; point-of-interest; utility theory; Collaboration; Mars; Radar; Recommender systems; Servers; Smart phones; Training
【Paper Link】 【Pages】:1440-1443
【Authors】: Dalia Attia Waguih ; Naman Goel ; Hossam M. Hammady ; Laure Berti-Equille
【Abstract】: On the Web, a massive amount of user-generated content is available through various channels, e.g., texts, tweets, Web tables, databases, multimedia-sharing platforms, etc. Conflicting information, rumors, and erroneous and fake content can easily spread across multiple sources, making it hard to distinguish between what is true and what is not. How do you figure out that a lie has been told often enough that it is now considered to be true? How many lying sources are required to introduce confusion into what you previously knew to be the truth? To answer these questions, we present AllegatorTrack, a system that discovers true claims among conflicting data from multiple sources.
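As a rough illustration of the truth-discovery setting, the sketch below runs a generic iterative voting scheme in which source trust and value confidence reinforce each other; this is a textbook-style baseline, not any of the algorithms AllegatorTrack actually implements, and the claims data is made up.

```python
def truth_discovery(claims, iterations=10):
    """claims: dict mapping object -> dict of source -> claimed value.
    Returns (estimated truths, source trust scores). A value's confidence is the
    total trust of its supporters; a source's trust is the fraction of its claims
    that agree with the current truth estimates."""
    sources = {s for votes in claims.values() for s in votes}
    trust = {s: 1.0 for s in sources}
    truths = {}
    for _ in range(iterations):
        # estimate truths from current source trust
        for obj, votes in claims.items():
            support = {}
            for src, val in votes.items():
                support[val] = support.get(val, 0.0) + trust[src]
            truths[obj] = max(support, key=support.get)
        # re-estimate trust from agreement with estimated truths
        for src in sources:
            judged = [1.0 if val == truths[obj] else 0.0
                      for obj, votes in claims.items()
                      for s, val in votes.items() if s == src]
            trust[src] = sum(judged) / len(judged) if judged else 0.0
    return truths, trust

if __name__ == "__main__":
    claims = {
        "capital_of_australia": {"s1": "Canberra", "s2": "Sydney", "s3": "Canberra"},
        "tallest_mountain":     {"s1": "Everest",  "s2": "Everest", "s3": "K2"},
    }
    print(truth_discovery(claims))
```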
【Keywords】: application program interfaces; data mining; API; AllegatorTrack; multisource data; truth discovery; Computational modeling; Computer architecture; Data mining; Data models; Gold; Maximum likelihood estimation; Standards
【Paper Link】 【Pages】:1444-1447
【Authors】: Ahmed Eldawy ; Saif Al-Harthi ; Abdulhadi Alzaidy ; Anas Daghistani ; Sohaib Ghani ; Saleh M. Basalamah ; Mohamed F. Mokbel
【Abstract】: Several space agencies, such as NASA, continuously collect datasets of earth dynamics (e.g., temperature, vegetation, and cloud coverage) through satellites. This data is stored in a publicly available archive for scientists and researchers and is very useful for studying climate, desertification, and land use change. The benefit of this data comes from its richness, as it provides an archived history of over 15 years of satellite observations. Unfortunately, the use of such data is very limited due to the huge size of the archives (> 500TB) and the limited capabilities of traditional applications. In this demo, we present Shahed, an interactive system which provides an efficient way to index, query, and visualize the satellite datasets available in the NASA archive. Shahed is composed of four main modules. The uncertainty module resolves data uncertainty introduced by the satellites. The indexing module organizes the data in a novel multi-resolution spatio-temporal index designed for satellite data. The querying module uses the indexes to answer both spatio-temporal selection and aggregate queries posed by the user. The visualization module generates images, videos, and multi-level images which give insight into data distribution and dynamics over time. This demo gives users hands-on experience with Shahed through a map-based web interface in which users can browse the available datasets on the map, issue spatio-temporal queries, and visualize the results as images or videos.
【Keywords】: data visualisation; database indexing; geophysical image processing; interactive systems; land use; parallel processing; query processing; terrain mapping; user interfaces; MapReduce-based system; NASA archive; Shahed; climate change; data distribution; data uncertainty; desertification; image generation; interactive system; land use change; map-based Web interface; multilevel image generation; multiresolution spatiotemporal index; publicly available archive; satellite dataset indexing module; satellite dataset querying module; satellite dataset visualization; spatiotemporal queries; uncertainty module; video generation; Aggregates; Data visualization; Heating; Indexes; Satellites; Uncertainty; Videos
【Paper Link】 【Pages】:1448-1451
【Authors】: Jiansong Ma ; Yu Cao ; Xiaoling Wang ; Chaoyong Wang ; Cheqing Jin ; Aoying Zhou
【Abstract】: In modern cities, more and more people drive vehicles equipped with GPS devices, which generate trajectories at large scale. Gathering and analyzing these large-scale trajectory data provides a new opportunity to understand city dynamics and to reveal hidden social and economic phenomena. This paper designs and implements a tool, named PGWinFunc, to analyze trajectory data by extending a traditional relational database. First, we introduce efficient query processing and optimization methods for SQL window aggregate functions in PostgreSQL. Second, we present how to mine LBS (Location-Based Service) patterns, such as average speed and traffic flow, from large-scale trajectories using SQL expressions with window aggregate functions. Finally, we demonstrate the effectiveness and efficiency of the PGWinFunc tool and visualize the results with BAIDU MAP.
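The kind of SQL window aggregate the tool targets can be illustrated with a per-vehicle moving average of speed over consecutive GPS readings. The gps_points schema is hypothetical, and SQLite is used only to keep the snippet self-contained; the demo itself runs on PostgreSQL.

```python
import sqlite3

# Hypothetical trajectory table: one row per GPS reading.
SQL_WINDOW_QUERY = """
SELECT vehicle_id, ts, speed,
       AVG(speed) OVER (PARTITION BY vehicle_id
                        ORDER BY ts
                        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_speed
FROM gps_points
ORDER BY vehicle_id, ts;
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stands in for a PostgreSQL connection
    conn.execute("CREATE TABLE gps_points (vehicle_id TEXT, ts INTEGER, speed REAL)")
    conn.executemany(
        "INSERT INTO gps_points VALUES (?, ?, ?)",
        [("car1", 1, 30.0), ("car1", 2, 50.0), ("car1", 3, 40.0),
         ("car2", 1, 60.0), ("car2", 2, 62.0)],
    )
    for row in conn.execute(SQL_WINDOW_QUERY):
        print(row)
```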
【Keywords】: SQL; application program interfaces; data mining; query processing; relational databases; road traffic; smart cities; socio-economic effects; BAIDU MAP; GPS devices; LBS pattern mining; PGWinFunc tool; PostgreSQL; SQL expression; SQL window aggregate function optimization; average speed; city dynamics; economic phenomena; large-scale trajectory data analysis; large-scale trajectory data gathering; location-based service pattern mining; query process; relational database; social phenomena; traffic flow; Aggregates; Cities and towns; Data visualization; Optimization; Roads; Sequential analysis; Trajectory
【Paper Link】 【Pages】:1452-1455
【Authors】: Manoranjan Dash ; Kee Kiat Koo ; James Decraene ; Ghim-Eng Yap ; Wei Wu ; João Bártolo Gomes ; Amy Shi Nash ; Xiaoli Li
【Abstract】: CDR (Call Detail Record) data is more easily available than other network-related data (such as GPS data), as most telecommunications service providers (TSPs) maintain such data. By analyzing it one can find mobility patterns for most of the population, leading to more efficient urban planning, disease and traffic control, etc. But its granularity is low, because the latitude and longitude (lat-lon) of a cell tower is used as the current location of all mobile phones connected to that tower at that time. Granularity can range from tens of metres to several kilometres, depending on the population density of a locality. This is one reason why, although there are many existing systems for visualizing people's mobility from GPS data, there is hardly any existing system for CDR data. We develop a Mobility Visualization System (MoVis) for visualizing people's mobility from their CDR records. First, given the CDR records of a user, we determine her stay regions (places where she stays for a significant amount of time). Trajectories of phone events (and the lat-lon of cell towers) between stay regions are extracted as her trips. Start and end times of a trip are estimated using linear extrapolation, and temporal patterns are extracted based on them. Trips with a sufficient number of intermediate points are mapped to the transport network, which consists of train lines, bus routes and expressways. We use kernel density estimation to visualize the most common path for a given origin and destination. Based on this, we create a round-the-clock visualization of people's mobility over the entire city, separately for weekdays and weekends. Finally, we present validation results.
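One step of the pipeline, stay-region detection, can be sketched as grouping consecutive points that remain within a distance threshold for a minimum duration. The thresholds, the planar distance, and the record layout are illustrative assumptions rather than MoVis's actual parameters.

```python
from math import hypot

def stay_regions(points, dist_m=500.0, min_stay_s=1800):
    """points: list of (timestamp_s, x_m, y_m) in projected coordinates, time-ordered.
    Returns (start_ts, end_ts, centroid) for every detected stay region."""
    regions, i, n = [], 0, len(points)
    while i < n:
        j = i
        # extend the window while every point stays near the anchor point i
        while j + 1 < n and hypot(points[j + 1][1] - points[i][1],
                                  points[j + 1][2] - points[i][2]) <= dist_m:
            j += 1
        duration = points[j][0] - points[i][0]
        if duration >= min_stay_s:
            cx = sum(p[1] for p in points[i:j + 1]) / (j - i + 1)
            cy = sum(p[2] for p in points[i:j + 1]) / (j - i + 1)
            regions.append((points[i][0], points[j][0], (cx, cy)))
            i = j + 1
        else:
            i += 1
    return regions

if __name__ == "__main__":
    trace = [(0, 0, 0), (900, 50, 20), (2400, 30, 10),   # ~40 min near origin -> stay
             (3000, 4000, 4000), (3300, 8000, 8000)]     # moving afterwards
    print(stay_regions(trace))
```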
【Keywords】: data visualisation; extrapolation; mobile computing; CDR data; CDR-to-MoVis; bus routes; call detail records; expressways; kernel density estimation; linear extrapolation; mobility visualization system; phone events trajectory; stay regions; temporal patterns; train lines; Computer architecture; Data visualization; Microprocessors; Poles and towers; Sociology; Statistics; Trajectory
【Paper Link】 【Pages】:1456-1459
【Authors】: Azhar Ait Ouassarah ; Nicolas Aversengy ; Xavier Fournety ; Jean-Marc Petit ; Romain Revol ; Vasile-Marian Scuturici
【Abstract】: Nowadays, every company can understand how its business evolves from the data deluge generated by its activities. Roughly speaking, two types of data co-exist: historical data and real-time data, from which business analysts have to make their decisions in a timely fashion. In this context, the notions of time (application time and transaction time) and traceability turn out to play a crucial role in understanding what happened in the company and what is currently happening. Tornado offers a full-fledged platform to deal with such data and is based on two key features: 1) a bi-temporal DB specifically designed for handling historical and real-time data, and 2) a GUI that aims to facilitate query formulation for business analysts. In this demonstration, we provide the key resources to let visitors play with Tornado's functionalities and interact with predefined data.
【Keywords】: business data processing; data handling; graphical user interfaces; GUI; Tornado functionalities; application time; bitemporal DB; business analysts; business trends; data evolution; historical data handling; real-time data handling; transaction time; Companies; Databases; Engines; Graphical user interfaces; Real-time systems; Tornadoes
【Paper Link】 【Pages】:1460-1463
【Authors】: Hannes Voigt ; Patrick Damme ; Wolfgang Lehner
【Abstract】: Relational database management systems build on the closed world assumption, requiring upfront modeling of a usually stable schema. However, a growing number of today's database applications are characterized by self-descriptive data. The schema of self-descriptive data is very dynamic and prone to frequent changes, a situation that is always troublesome to handle in relational systems. This demo presents the relational database management system FRDM. With flexible relational tables, FRDM greatly simplifies the management of self-descriptive data in a relational database system. Self-descriptive data can reside directly next to traditionally modeled data, and both can be queried together using SQL. This demo presents the various features of FRDM and provides first-hand experience of the newly gained freedom in relational database systems.
【Keywords】: SQL; relational databases; SQL; relational database management systems; relational tables FRDM; schema-flexible RDBMS; self-descriptive data; upfront modeling; Apertures; Cameras; Catalogs; Data models; Relational databases; Standards
【Paper Link】 【Pages】:1464-1467
【Authors】: Stefanie Scherzinger ; Thomas Cerqueus ; Eduardo Cunha de Almeida
【Abstract】: Building scalable web applications on top of NoSQL data stores is becoming common practice. Many of these data stores can easily be accessed programmatically and do not enforce a schema. Software engineers can design the data model on the go, a flexibility that is crucial in agile software development. The typical tasks of database schema management are now handled within the application code, usually involving object mapper libraries. However, today's Integrated Development Environments (IDEs) lack proper tool support for managing the combined evolution of the application code and the schema. Yet simple refactorings, such as renaming an attribute at the source code level, can cause irretrievable data loss or runtime errors once the application is serving in production. In this demo, we present ControVol, a framework for controlled schema evolution in application development against NoSQL data stores. ControVol is integrated into the IDE and statically type checks object mapper class declarations against the schema evolution history, as recorded by the code repository. ControVol is capable of warning about common yet risky cases of mismatched data and schema. ControVol is further able to suggest quick fixes by which developers can have these issues resolved automatically.
【Keywords】: Internet; SQL; data models; database management systems; software prototyping; source code (software); ControVol; IDE; NoSQL application development; NoSQL data stores; agile software development; application code; code repository; controlled schema evolution framework; data model; database schema management; integrated development environments; object mapper libraries; scalable Web applications; software engineers; source code level; Databases; Google; History; Java; Production; Runtime; Software
【Paper Link】 【Pages】:1468-1471
【Authors】: Amol Bhangdiya ; Bikash Chandra ; Biplab Kar ; Bharath Radhakrishnan ; K. V. Maheshwara Reddy ; Shetal Shah ; S. Sudarshan
【Abstract】: Grading of student SQL queries is usually done by executing the query on sample datasets (which may be unable to catch many errors) and/or by manually comparing or checking a student query against the correct query (which can be tedious and error prone). In this demonstration we present the XDa-TA system, which can be used by instructors and TAs for grading SQL query assignments automatically. Given one or more correct queries for an SQL assignment, the tool uses the XData system to automatically generate datasets that are designed specifically to catch common errors. The grading is then done by comparing the results of student queries with those of the correct queries on these generated datasets; instructors can optionally provide additional datasets for testing. The tool can also be used in a learning mode by students, where it provides immediate feedback with hints explaining possible reasons for erroneous output. This tool could be of great value to instructors, particularly instructors of MOOCs.
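Result-based grading can be pictured as running the correct and student queries against a generated dataset and comparing the answers as multisets. The toy schema and dataset below (chosen so that an INNER JOIN diverges from the intended LEFT OUTER JOIN) are illustrative; XData's contribution lies in generating such error-revealing datasets automatically, which this sketch does not attempt.

```python
import sqlite3
from collections import Counter

def grade(correct_sql, student_sql, setup_statements):
    """Return True if the two queries agree (as multisets of rows) on the dataset."""
    conn = sqlite3.connect(":memory:")
    for stmt in setup_statements:
        conn.execute(stmt)
    expected = Counter(conn.execute(correct_sql).fetchall())
    actual = Counter(conn.execute(student_sql).fetchall())
    return expected == actual

if __name__ == "__main__":
    # A dataset where one department has no employees, which distinguishes
    # an INNER JOIN from the intended LEFT OUTER JOIN.
    setup = [
        "CREATE TABLE dept(id INTEGER, name TEXT)",
        "CREATE TABLE emp(id INTEGER, dept_id INTEGER)",
        "INSERT INTO dept VALUES (1, 'CS'), (2, 'EE')",
        "INSERT INTO emp VALUES (10, 1)",
    ]
    correct = "SELECT d.name, COUNT(e.id) FROM dept d LEFT JOIN emp e ON e.dept_id = d.id GROUP BY d.name"
    student = "SELECT d.name, COUNT(e.id) FROM dept d JOIN emp e ON e.dept_id = d.id GROUP BY d.name"
    print("student query correct?", grade(correct, student, setup))  # False
```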
【Keywords】: SQL; computer aided instruction; query processing; MOOC; SQL query assignment; XDa-TA system; learning mode; query execution; sample dataset generation; student query checking; Database systems; Linear systems; Manuals; Remuneration; Syntactics; Testing
【Paper Link】 【Pages】:1472-1475
【Authors】: Yasin N. Silva ; Jaime Chon
【Abstract】: A key area of focus in recent Computer Science education research has been block-based programming. In this approach, program instructions are represented as visual blocks with shapes that control the way multiple instructions can be combined. Since programs are created by dragging and connecting blocks, the focus is on the program's logic rather than its syntax. In this demonstration we present DBSnap, a system that enables building database queries, specifically relational algebra queries, by connecting blocks. A differentiating property of DBSnap is that it uses a visual tree-based structure to represent queries. This structure is, in fact, very similar to the intuitive query trees commonly used by database practitioners and educators. DBSnap is also highly dynamic: it shows the query result and the corresponding relational algebra expression as the query is built, and enables the inspection of intermediate query results. This paper describes DBSnap's main design elements, its architecture and implementation guidelines, and the interactive demonstration scenarios. DBSnap is a publicly available system and aims to have a transformational effect on database learning.
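The tree-based query representation can be mimicked with a few node classes evaluated bottom-up over lists of dicts; the classes and sample relations are hypothetical stand-ins for DBSnap's visual blocks.

```python
class Relation:
    def __init__(self, rows): self.rows = rows
    def eval(self): return self.rows

class Select:
    def __init__(self, predicate, child): self.predicate, self.child = predicate, child
    def eval(self): return [r for r in self.child.eval() if self.predicate(r)]

class Join:
    def __init__(self, attr, left, right): self.attr, self.left, self.right = attr, left, right
    def eval(self):
        return [{**l, **r} for l in self.left.eval() for r in self.right.eval()
                if l[self.attr] == r[self.attr]]

if __name__ == "__main__":
    students = Relation([{"sid": 1, "name": "Ann"}, {"sid": 2, "name": "Bob"}])
    enrolled = Relation([{"sid": 1, "course": "DB"}, {"sid": 2, "course": "OS"}])
    # sigma_{course='DB'}(students join enrolled): built as a tree, evaluated bottom-up
    tree = Select(lambda r: r["course"] == "DB", Join("sid", students, enrolled))
    print(tree.eval())
```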
【Keywords】: Internet; query processing; relational algebra; DBSnap; block-based programming; computer science education; database learning; databases querying; interactive Web application; intermediate query inspection; program instructions; program logic; query trees; relational algebra query; snapping blocks; visual blocks; visual tree-based structure; Algebra; Buildings; Databases; Joining processes; Programming profession; Visualization
【Paper Link】 【Pages】:1476-1479
【Authors】: Xiufeng Liu ; Lukasz Golab ; Ihab F. Ilyas
【Abstract】: Smart electricity meters are replacing conventional meters worldwide and have enabled a new application domain: smart meter data analytics. In this paper, we introduce SMAS, our smart meter analytics system, which demonstrates the actionable insight that consumers and utilities can obtain from smart meter data. Notably, we implemented SMAS inside a relational database management system using open source tools: PostgreSQL and the MADLib machine learning toolkit. In the proposed demonstration, conference attendees will interact with SMAS as electricity providers, consultants and consumers, and will perform various analyses on real data sets.
【Keywords】: SQL; data analysis; learning (artificial intelligence); power engineering computing; relational databases; smart meters; MADLib machine learning toolkit; PostgreSQL; SMAS; open source tools; relational database management system; smart electricity meters; smart meter data analytics system; Cooling; Data analysis; Databases; Feature extraction; Forecasting; Smart meters; Temperature sensors
【Paper Link】 【Pages】:1480-1483
【Authors】: Sourav S. Bhowmick ; Huey-Eng Chua ; Benji Thian ; Byron Choi
【Abstract】: In [3], we laid out the vision of a novel graph query processing paradigm, where visual subgraph query formulation is interleaved (or “blended”) with query processing by exploiting the latency offered by the GUI. Our recent attempts at implementing this vision [6], [7] do not provide a robust framework to systematically investigate the performance of this novel paradigm, because it is prohibitively expensive to engage a large number of users to formulate a large number of visual queries in order to measure the performance of blending query formulation with query processing. In this demonstration, we present a novel synthetic visual subgraph query simulator called ViSual that can evaluate the performance of this paradigm for a large number of visual subgraph queries without requiring a large number of users to formulate them. Specifically, it leverages principles from HCI to quantify the GUI latency that is necessary to realistically simulate the blending of query formulation and query processing.
【Keywords】: digital simulation; graph theory; graphical user interfaces; human computer interaction; query processing; software performance evaluation; visual databases; GUI latency; HCI-inspired simulator; ViSual; graphical user interface; performance evaluation; synthetic visual subgraph query simulator; visual subgraph query construction blending; visual subgraph query formulation; visual subgraph query processing blending; Generators; Indexes; Mathematical model; Prefetching; Query processing; Simulation; Visualization
【Paper Link】 【Pages】:1484-1487
【Authors】: Daniel Deutch ; Amir Gilad ; Yuval Moskovitch
【Abstract】: Highly expressive declarative languages, such as Datalog, are now commonly used to model the operational logic of data-intensive applications. The typical complexity of such Datalog programs, and the large volume of data that they process, call for the tracking and presentation of data provenance. Provenance information is crucial for explaining and justifying the Datalog program results. However, the size of full provenance information is in many cases too large (and its concise representations are too complex) to allow its presentation to the user. To this end, we propose a demonstration of selP, a system that allows the selective presentation of provenance, based on user-specified top-k queries. We will demonstrate the usefulness of selP using a real-life program and data, in the context of Information Extraction.
【Keywords】: DATALOG; query processing; Datalog programs; data processing; data volume; data-intensive applications; information extraction; operational logic; provenance information; real-life program; selP; selective data provenance presentation; selective data provenance tracking; user-specified top-k queries; Complexity theory; Context; Data mining; Databases; Information retrieval; Knowledge based systems; Pattern matching
【Paper Link】 【Pages】:1488-1491
【Authors】: Nguyen Quoc Viet Hung ; Nguyen Thanh Tam ; Vinh Tuan Chau ; Tri Kurniawan Wijaya ; Zoltán Miklós ; Karl Aberer ; Avigdor Gal ; Matthias Weidlich
【Abstract】: Schema matching supports data integration by establishing correspondences between the attributes of independently designed database schemas. In recent years, various tools for automatic pair-wise matching of schemas have been developed. Since the matching process is inherently uncertain, the correspondences generated by such tools are often validated by a human expert. In this work, we consider scenarios in which attribute correspondences are identified in a network of schemas and not only in a pairwise setting. Here, correspondences between different schemas are interrelated, so that incomplete and erroneous matching results propagate in the network and the validation of a correspondence by an expert has ripple effects. To analyse and reconcile such matchings in schema networks, we present the Schema Matching Analyzer and Reconciliation Tool (SMART). It allows for the definition of network-level integrity constraints for the matching and, based thereon, detects and visualizes inconsistencies of the matching. The tool also supports the reconciliation of a matching by guiding an expert in the validation process and by offering semi-automatic conflict-resolution techniques.
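One concrete shape a network-level integrity constraint can take is cycle consistency: composing correspondences around a cycle of schemas should map an attribute back to itself. The check below over hypothetical correspondence dictionaries only illustrates that idea and is not SMART's constraint language or reconciliation procedure.

```python
def follow(attr, cycle, correspondences):
    """Follow attribute correspondences along a cycle of schema pairs.
    correspondences[(S, T)] maps attribute names of S to attribute names of T."""
    for edge in cycle:
        attr = correspondences[edge].get(attr)
        if attr is None:
            return None
    return attr

def inconsistent_attributes(cycle, correspondences, start_attrs):
    """Attributes of the starting schema that do not map back to themselves."""
    return [a for a in start_attrs
            if (end := follow(a, cycle, correspondences)) is not None and end != a]

if __name__ == "__main__":
    corr = {
        ("A", "B"): {"price": "cost"},
        ("B", "C"): {"cost": "amount"},
        ("C", "A"): {"amount": "total"},   # inconsistent: should map back to "price"
    }
    cycle = [("A", "B"), ("B", "C"), ("C", "A")]
    print(inconsistent_attributes(cycle, corr, ["price"]))  # ['price'] is flagged
```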
【Keywords】: Scheme; data integration; database management systems; network theory (graphs); pattern matching; SMART; automatic pair-wise schema matching process; data integration; database schemas; human expert; network-level integrity constraints; ripple effects; schema matching analyzer and reconciliation tool; schema matching networks; semiautomatic conflict-resolution techniques; validation process; Cognition; Context; Data integration; Databases; Maintenance engineering; Uncertainty; User interfaces
【Paper Link】 【Pages】:1492-1495
【Authors】: Xiaojing Xie ; Zhe Fan ; Byron Choi ; Peipei Yi ; Sourav S. Bhowmick ; Shuigeng Zhou
【Abstract】: Subgraph queries are a fundamental means of retrieving patterns from graph data. Due to the well-known NP-hardness of subgraph queries, such queries may sometimes take a long time to complete. Our recent investigation on real-world datasets revealed that the performance of queries on graphs generally varies greatly. In other words, query clients may occasionally encounter “unexpectedly” long executions from a subgraph query processor. This paper aims to demonstrate a tool that alleviates the problem by monitoring subgraph query progress. Specifically, we present a novel subgraph query progress indicator called PIGEON that exploits query-time information to report accurate estimated query progress to users. In the demonstration, users may interact with PIGEON to gain insights into query evaluation: they are enabled to (i) monitor query progress; (ii) analyze the causes of long query times; and (iii) abort queries that run abnormally long, which may sometimes be due to human errors.
【Keywords】: computational complexity; graph theory; query processing; user interfaces; NP hardness; PIGEON; graph data; long query time cause analysis; pattern retrieval; query abortion; query evaluation; query-time information; subgraph query progress indicator; subgraph query progress monitoring; Cascading style sheets; Databases; Estimation; Optimization; Runtime; Visualization; YouTube
【Paper Link】 【Pages】:1496-1499
【Authors】: Manoj K. Agarwal ; Krithi Ramamritham
【Abstract】: Classical XML keyword search based on the Lowest Common Ancestor (LCA) framework requires users to be well versed in the data and the semantic relationships between the query keywords in order to extract meaningful responses, restricting its applicability. GKS (Generic Keyword Search), on the other hand, allows users to browse and navigate XML data without such constraints. GKS enables the discovery of deeper insights (DI) in the XML data, found in the context of the search results. Such insights not only expose patterns hidden in the search results but also help users tune their queries, thus enabling the navigation of complex XML repositories with ease. We further show how, for a search query, different insights can be discovered from the data by varying a single parameter.
【Keywords】: XML; query processing; GKS; XML data; XML keyword search; complex XML repositories; deeper insights; generic keyword search; lowest common ancestor; search query; Engines; Indexes; Keyword search; Navigation; Semantics; XML; Analytics; Information Retrieval; XML
【Paper Link】 【Pages】:1500-1503
【Authors】: Jinbo Zhang ; Sourav S. Bhowmick ; Hong H. Nguyen ; Byron Choi ; Feida Zhu
【Abstract】: Due to the complexity of graph query languages, the need for visual query interfaces that can reduce the burden of query formulation is fundamental to spreading graph data management tools to a wider community. Despite significant progress towards building query interfaces that simplify the visual subgraph query formulation task, the construction of current-generation visual interfaces is not data-driven. That is, they do not exploit the underlying data graphs to automatically generate the contents of the various panels in the interface. Such data-driven construction has several benefits, such as superior support for subgraph query formulation and portability of the interface across different graph databases. In this demonstration, we present a novel data-driven visual subgraph query interface construction engine called DaVinci. Specifically, it automatically generates from the underlying database two key components of the visual interface to aid subgraph query formulation, namely canned patterns and node labels.
【Keywords】: graph theory; graphical user interfaces; query languages; query processing; DaVinci; canned patterns; data-driven visual interface construction; data-driven visual subgraph query interface construction engine; graph data management tools; graph databases; graph query languages; node labels; subgraph search; visual query interfaces; visual subgraph query formulation task; Database languages; Generators; Linear programming; Query processing; Visual databases; Visualization
【Paper Link】 【Pages】:1504-1507
【Authors】: Long Guo ; Lu Chen ; Dongxiang Zhang ; Guoliang Li ; Kian-Lee Tan ; Zhifeng Bao
【Abstract】: The prevalence of social networks and mobile devices has facilitated the real-time dissemination of local events such as sales, shows and exhibitions. To explore nearby events, mobile users can query a location-based search engine for the desired data. However, operating under such a pull-based model means that users may miss interesting events (because no explicit queries are issued) or that processing/communication overheads may be high (because users have to continuously issue queries). In this demo, we present Elaps, an efficient location-aware publish/subscribe system that can effectively disseminate interesting events to moving users. Elaps is based on the push model and notifies mobile users instantly whenever there is a matching event around their location. Through the demo, we will demonstrate that Elaps is scalable to a large number of subscriptions and events. Moreover, Elaps can effectively monitor subscribers without missing any event match, while incurring low communication overhead.
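The push-style matching can be pictured as follows: on each location update, find the subscribed event categories with an event within the subscription radius. The brute-force scan and data layout are illustrative; Elaps's point is precisely to avoid such per-update scans through its indexing and monitoring techniques.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def notify(user_location, subscription, events):
    """Return events matching the subscription's keywords within its radius."""
    lat, lon = user_location
    return [e for e in events
            if e["keyword"] in subscription["keywords"]
            and haversine_km(lat, lon, e["lat"], e["lon"]) <= subscription["radius_km"]]

if __name__ == "__main__":
    subscription = {"keywords": {"sale", "concert"}, "radius_km": 2.0}
    events = [
        {"keyword": "sale",    "lat": 1.3000, "lon": 103.8000},
        {"keyword": "concert", "lat": 1.4000, "lon": 103.9500},
    ]
    print(notify((1.2990, 103.8010), subscription, events))  # only the nearby sale
```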
【Keywords】: message passing; middleware; mobile computing; Elaps; communication overhead; efficient location-aware pub/sub system; event dissemination; location query; location-aware publish/subscribe system; matching event; mobile devices; mobile users; processing overhead; pull based model; real-time local event dissemination; search engine; social networks; Footwear; Indexes; Mobile communication; Mobile handsets; Monitoring; Real-time systems; Subscriptions
【Paper Link】 【Pages】:1508-1511
【Authors】: Xuchuan Shen ; Lei Zou ; M. Tamer Özsu ; Lei Chen ; Youhuan Li ; Shuo Han ; Dongyan Zhao
【Abstract】: In this demonstration, we present the gStore RDF triple store. gStore is based on graph encoding and subgraph matching, distinct from many other systems. More importantly, it can handle, in a uniform manner, different data types (strings and numerical data) and SPARQL queries with wildcard, aggregate, range and top-k operators over dynamic RDF datasets. We will demonstrate the main features of our system, show how to search Wikipedia documents using gStore, and show how users can build their own applications on top of gStore through its C++/Java API.
【Keywords】: data handling; graph theory; pattern matching; query processing; C++; Java API; SPARQL queries; data types; dynamic RDF datasets; graph encoding; graph-based gStore RDF triple store; numerical data; strings; subgraph match; top-k operators; Electronic publishing; Encoding; Encyclopedias; Indexes; Resource description framework
【Paper Link】 【Pages】:1512-1515
【Authors】: N. R. Ong ; S. E. Rojcewicz ; Nicholas L. Farnan ; Adam J. Lee ; Panos K. Chrysanthis ; Ting Yu
【Abstract】: PASQL is an extension to SQL that allows users of a distributed database to specify privacy constraints on an SQL query evaluation plan. However, privacy constraints can be difficult for users to specify, and worse yet, all possible situations that could lead to a privacy violation may not be known to the user a priori. To address these challenges, we propose a GUI-based interactive process for detecting such violations and generating appropriate constraints. In this work, we demonstrate two approaches to implementing such a GUI that provide different ways of analyzing and interactively optimizing a PASQL query plan.
【Keywords】: SQL; data privacy; graphical user interfaces; query processing; relational databases; GUI; GUI-based interactive process; PASQL query plan; SQL query evaluation plan; constraint generation; distributed database; interactive preference-aware query optimization; privacy constraints; violation detection; Data privacy; Graphical user interfaces; Image color analysis; Privacy; Query processing; Servers; Visualization
【Paper Link】 【Pages】:1516-1519
【Authors】: Nathan Bronson ; Thomas Lento ; Janet L. Wiener
【Abstract】: At Facebook, our data systems process huge volumes of data, ranging from hundreds of terabytes in memory to hundreds of petabytes on disk. We categorize our systems as “small data” or “big data” based on the type of queries they run. Small data refers to OLTP-like queries that process and retrieve a small amount of data, for example, the thousands of objects necessary to render Facebook's personalized News Feed for each person. These objects are requested by their ids; indexes limit the amount of data accessed during a single query, regardless of the total volume of data. Big data refers to queries that process large amounts of data, usually for analysis: troubleshooting, identifying trends, and making decisions. Big data stores are the workhorses for data analysis at Facebook. They grow by millions of events (inserts) per second and process tens of petabytes and hundreds of thousands of queries per day. In this tutorial, we will describe our data systems and the current challenges we face. We will lead a discussion on these challenges, approaches to solve them, and potential pitfalls. We hope to stimulate interest in solving these problems in the research community.
【Keywords】: Big Data; data analysis; query processing; social networking (online); Facebook; OLTP-like queries; big data; data analysis; personalized news feed; small data; Big data; Databases; Facebook; Feeds; Mobile handsets; Reliability; Time series analysis
【Paper Link】 【Pages】:1520-1522
【Authors】: Sangyeun Cho ; Sanghoan Chang ; Insoon Jo
【Abstract】: This tutorial overviews flash memory solid-state drive technologies and recent advances in the field.
【Keywords】: flash memories; SSD technology; flash memory solid-state drive technology; Flash memory cells; Nonvolatile memory; Programming; Standards; Synthetic aperture sonar; Tutorials
【Paper Link】 【Pages】:1523-1526
【Authors】: Chris Douglas ; Carlo Curino
【Abstract】: This tutorial is organized in two parts. In the first half, we will present an overview of applications and services in the BigData ecosystem. We will use known distributed database and systems literature as landmarks to orient the attendees in this fast-evolving space. Throughout, we will contrast models of resource management, performance, and the constraints that shape the architectures of prominent systems. We will also discuss the role of academia and industry in the development of open-source infrastructure, with an emphasis on open problems and strategies for collaboration. We assume only basic familiarity with distributed systems. In the second half, we will delve into Apache Hadoop YARN. YARN (Yet Another Resource Negotiator) transformed Hadoop from a MapReduce engine to a general-purpose cluster scheduler. Since its introduction, it has been deployed in production and extended to support use cases beyond large-scale batch processing. The tutorial will present the active research and development supporting such heterogeneous workloads, with particular attention to multi-tenant scheduling. Topics include security, resource isolation, protocols, and preemption. This portion will be detailed, but accessible to anyone with a background in distributed systems and all attendees of the first half of the tutorial.
【Keywords】: Big Data; batch processing (computers); data handling; distributed databases; parallel processing; public domain software; Apache Hadoop YARN; BigData ecosystem; MapReduce engine; distributed database; general-purpose cluster scheduler; large-scale batch processing; multitenant scheduling; open-source; resource management; yet another resource negotiator; Communities; Databases; Ecosystems; Engines; Resource management; Tutorials; Yarn
【Paper Link】 【Pages】:1527-1529
【Authors】: Lei Chen ; Dongwon Lee ; Tova Milo
【Abstract】: In this 3-hour tutorial, we present the landscape of recent developments in data management and mining research, and survey a selected set of state-of-the-art works that significantly extend existing database research in order to incorporate and exploit the novel notion of “crowdsourcing” in a creative fashion. In particular, three speakers take turns presenting the topics of human-powered database operations, crowdsourced data mining, and the application of crowdsourcing in social media, respectively.
【Keywords】: data mining; social networking (online); crowdsourced data mining; data management; data-driven crowdsourcing; human-powered database operations; social media; Crowdsourcing; Data mining; Databases; Media; Sociology; Statistics; Tutorials
【Paper Link】 【Pages】:1530-1533
【Authors】: Anastasia Ailamaki ; Erietta Liarou ; Pinar Tözün ; Danica Porobic ; Iraklis Psaroudakis
【Abstract】: Hardware trends oblige software to overcome three major challenges against systems scalability: (1) taking advantage of the implicit/vertical parallelism within a core that is enabled through the aggressive micro-architectural features, (2) exploiting the explicit/horizontal parallelism provided by multicores, and (3) achieving predictively efficient execution despite the variability in communication latencies among cores on multisocket multicores. In this three hour tutorial, we shed light on the above three challenges and survey recent proposals to alleviate them. The first part of the tutorial describes the instruction- and data-level parallelism opportunities in a core coming from the hardware and software side. In addition, it examines the sources of under-utilization in a modern processor and presents insights and hardware/software techniques to better exploit the micro-architectural resources of a processor by improving cache locality at the right level of the memory hierarchy. The second part focuses on the scalability bottlenecks of database applications at the level of multicore and multisocket multicore architectures. It first presents a systematic way of eliminating such bottlenecks in online transaction processing workloads, which is based on minimizing unbounded communication, and shows several techniques that minimize bottlenecks in major components of database management systems. Then, it demonstrates the data and work sharing opportunities for analytical workloads, and reviews advanced scheduling mechanisms that are aware of non-uniform memory accesses and alleviate bandwidth saturation.
【Keywords】: multiprocessing systems; parallel processing; communication latencies; data level parallelism; database applications; database management systems; explicit-horizontal parallelism; hardware-software techniques; implicit-vertical parallelism; instruction level parallelism; memory hierarchy; microarchitectural features; microarchitectural resources; multicores; multisocket multicore architectures; multisocket multicores; online transaction processing workloads; systems scalability; Database systems; Hardware; Multicore processing; Parallel processing; Scalability; Tutorials
【Paper Link】 【Pages】:1534-1537
【Authors】: Denilson Barbosa ; Haixun Wang ; Cong Yu
【Abstract】: Information extraction at Web scale has become one of the most important research topics in data management since major commercial search engines started incorporating knowledge into their search results a couple of years ago [1]. Users increasingly expect structured knowledge as answers to their search needs. Using Bing as an example, the result page for “Lionel Messi” is full of structured knowledge facts, such as his birthday and awards. The research efforts towards improving the accuracy and coverage of such knowledge bases have led to significant advances in Information Extraction techniques [2], [3]. As the initial challenges of accurately extracting facts for popular entities are being addressed, more difficult challenges have emerged, such as extending knowledge coverage to long-tail entities and domains, understanding the interestingness and usefulness of facts within a given context, and addressing information-seeking needs more directly and accurately. In this tutorial, we will survey the recent research efforts and provide an introduction to the techniques that address those challenges, and the applications that benefit from the adoption of those techniques. In particular, this tutorial will focus on a variety of techniques that can be broadly viewed as knowledge inferencing, i.e., combining multiple data sources and extraction techniques to verify existing knowledge and derive new knowledge. More specifically, we focus on four main categories of inferencing techniques: 1) deep natural language processing using machine learning techniques, 2) data cleaning using integrity constraints, 3) large-scale probabilistic reasoning, and 4) leveraging human expertise for domain knowledge extraction.
【Keywords】: Internet; data integrity; inference mechanisms; information retrieval; learning (artificial intelligence); natural language processing; search engines; Bing; Lionel Messi; Web scale; commercial search engines; data cleaning; data management; deep natural language processing; domain knowledge extraction; fact extraction; human expertise leverage; information extraction techniques; information-seeking needs; integrity constraints; knowledge coverage; knowledge inferencing; large-scale probabilistic reasoning; machine learning techniques; multiple data sources; Cleaning; Data mining; Google; Information retrieval; Knowledge based systems; Knowledge engineering; Tutorials
【Paper Link】 【Pages】:1538-1540
【Authors】: Dimitra Papadimitriou ; Yannis Velegrakis ; Georgia Koutrika ; John Mylopoulos
【Abstract】: This tutorial provides a comprehensive and cohesive overview of goal modeling and recognition approaches by the Information Retrieval, the Artificial Intelligence and the Social Media communities. We will examine how these fields restrict the domain of study and how they capture notions easily perceived by humans' intuition but difficult to be formally defined and handled algorithmically. It is the purpose of this tutorial to provide a solid framework for placing existing work into perspective and highlight critical open challenges that will act as a springboard for researchers and practitioners in database systems, social data, and the Web, as well as developers of web-based, database-driven, and social applications, to work towards more user-centric systems and applications.
【Keywords】: Internet; database management systems; information retrieval; multi-agent systems; social networking (online); Web; artificial intelligence; database systems; goal modelling; goal recognition approach; information retrieval; intelligent agents; social applications; social data; social media; user-centric systems; Analytical models; Computational modeling; Computer science; Media; Taxonomy; Tutorials
【Paper Link】 【Pages】:1541-1544
【Authors】: Damian Bursztyn ; François Goasdoué ; Ioana Manolescu ; Alexandra Roatis
【Abstract】: Techniques for efficiently managing Semantic Web data have attracted significant interest from the data management and knowledge representation communities. A great deal of effort has been invested, especially in the database community, into algorithms and tools for efficient RDF query evaluation. However, the main interest of RDF lies in its blending of heterogeneous data and semantics. Simple RDF graphs can be seen as collections of facts, which may be further enriched with ontological schemas, or semantic constraints, based on which reasoning can be applied to infer new information. Taking into account this implicit information is crucial for answering queries.
【Keywords】: knowledge representation; ontologies (artificial intelligence); query processing; semantic Web; data management; efficient RDF query evaluation; knowledge representation communities; ontological schemas; semantic Web; semantic constraints; Cognition; Query processing; Resource description framework; Semantics; Tutorials
【Paper Link】 【Pages】:1545-1548
【Authors】: Ilia Petrov ; Robert Gottstein ; Sergej Hardock
【Abstract】: In the present tutorial we perform a cross-cut analysis of database systems from the perspective of modern storage technology, namely Flash memory. We argue that neither the design of modern DBMS, nor the architecture of Flash storage technologies are aligned with each other. The result is needlessly suboptimal DBMS performance and inefficient Flash utilisation as well as low Flash storage endurance and reliability. We showcase new DBMS approaches with improved algorithms and leaner architectures, designed to leverage the properties of modern storage technologies. We cover the area of transaction management and multi-versioning, putting a special emphasis on: (i) version organisation models and invalidation mechanisms in multi-versioning DBMS; (ii) Flash storage management especially on append-based storage in tuple granularity; (iii) Flash-friendly buffer management; as well as (iv) improvements in the searching and indexing models. Furthermore, we present our NoFTL approach to native Flash access that integrates parts of the Flash-management functionality into the DBMS yielding significant performance increase and simplification of the I/O stack. In addition, we cover the basics of building large Flash storage for DBMS and revisit some of the RAID techniques and principles.
【Keywords】: RAID; buffer storage; database indexing; flash memories; memory architecture; I/O stack; NoFTL approach; RAID techniques; append-based storage; database systems; flash access; flash storage endurance; flash storage management; flash storage reliability; flash utilisation; flash-friendly buffer management; indexing model; invalidation mechanisms; modern DBMS performance; modern storage hardware; modern storage technologies; multiversioning DBMS; searching model; transaction management; tuple granularity; version organisation models; Buffer storage; Hardware; Indexing; Tutorials
【Paper Link】 【Pages】:1549-1560
【Authors】: David C. Anastasiu ; Al Mamunur Rashid ; Andrea Tagarelli ; George Karypis
【Abstract】: The proliferation of computing devices in recent years has dramatically changed the way people work, play, communicate, and access information. The personal computer (PC) now has to compete with smartphones, tablets, and other devices for tasks it used to be the default device for. Understanding how PC usage evolves over time can help provide the best overall user experience for current customers, can help determine when they need brand new systems vs. upgraded components, and can inform future product design to better anticipate user needs.
【Keywords】: microprocessor chips; PC; computer usage evolution; computing devices proliferation; personal computer; smartphones; tablets; Clustering algorithms; Computational modeling; Dynamic programming; Heuristic algorithms; Market research; Minimization; Polynomials
【Paper Link】 【Pages】:1561-1572
【Authors】: Wei Feng ; Chao Zhang ; Wei Zhang ; Jiawei Han ; Jianyong Wang ; Charu Aggarwal ; Jianbin Huang
【Abstract】: What is happening around the world? When and where? Mining the geo-tagged Twitter stream makes it possible to answer these questions in real-time. Although a single tweet can be short and noisy, proper aggregations of tweets can provide meaningful results. In this paper, we focus on hierarchical spatio-temporal hashtag clustering techniques. Our system has the following features: (1) Exploring events (hashtag clusters) at different spatial granularities. Users can zoom in and out on maps to find out what is happening in a particular area. (2) Exploring events at different temporal granularities. Users can choose to see what is happening today or in the past week. (3) An efficient single-pass algorithm for event identification, which provides human-readable hashtag clusters. (4) Efficient event ranking which aims to find burst events and localized events given a particular region and time frame. To support aggregation at different spatial and temporal granularities, we propose a data structure called STREAMCUBE, which extends the data cube structure from the database community with spatial and temporal hierarchies. To achieve high scalability, we propose a divide-and-conquer method to construct the STREAMCUBE. To support flexible event ranking with different weights, we propose a top-k based index. Different efficient methods are used to speed up event similarity computations. Finally, we have conducted extensive experiments on real Twitter data. Experimental results show that our framework can provide meaningful results with high scalability.
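The cube-style aggregation can be miniaturized as bucketing tweets by a spatial cell at a chosen resolution and a time bucket, keeping hashtag counts per (cell, time) pair; the fixed grid, daily buckets, and sample tweets are illustrative and omit STREAMCUBE's clustering, ranking, and hierarchy construction.

```python
from collections import defaultdict, Counter

def cell_id(lat, lon, level):
    """Quadtree-style cell id: latitude and longitude discretized to 2^level bins."""
    x = int((lon + 180.0) / 360.0 * (1 << level))
    y = int((lat + 90.0) / 180.0 * (1 << level))
    return level, x, y

def build_cube(tweets, level, bucket_s=86400):
    """Aggregate hashtag counts per (spatial cell, time bucket)."""
    cube = defaultdict(Counter)
    for ts, lat, lon, hashtags in tweets:
        key = (cell_id(lat, lon, level), ts // bucket_s)
        cube[key].update(hashtags)
    return cube

if __name__ == "__main__":
    tweets = [
        (1000,  40.7,  -74.0, ["#nyc", "#marathon"]),
        (2000,  40.8,  -74.0, ["#marathon"]),
        (90000, 40.7,  -74.0, ["#nyc"]),           # next day, same area
        (1500,  34.0, -118.2, ["#la"]),
    ]
    for key, counts in sorted(build_cube(tweets, level=6).items()):
        print(key, dict(counts))
```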
【Keywords】: data mining; database management systems; divide and conquer methods; geography; pattern clustering; social networking (online); STREAMCUBE; Twitter stream; burst events; data cube structure; database community; divide-and-conquer method; event exploration; event identification; event ranking; event similarity computation; geo-tagged Twitter stream mining; hierarchical spatio-temporal hashtag clustering; human-readable hashtag clusters; localized events; single-pass algorithm; space granularity; spatial hierarchy; temporal hierarchy; time granularity; top-k based index; Clustering algorithms; Media; Noise measurement; Nominations and elections; Real-time systems; Semantics; Twitter
【Paper Link】 【Pages】:1573-1584
【Authors】: Siarhei Bykau ; Flip Korn ; Divesh Srivastava ; Yannis Velegrakis
【Abstract】: The advent of Web 2.0 gave birth to a new kind of application where content is generated through the collaborative contribution of many different users. This form of content generation is believed to produce data of higher quality, since the “wisdom of the crowds” makes its way into the data. However, a number of specific data quality issues appear within such collaboratively generated data. Apart from normal updates, there are cases of intentionally harmful changes known as vandalism, as well as naturally occurring disagreements on topics that lack an agreed-upon viewpoint, known as controversies. While much work has focused on identifying vandalism, there has been little prior work on detecting controversies, especially at a fine granularity. Knowing about controversies when processing user-generated content is essential to understanding the quality of the data and the trust that should be placed in it. Controversy detection is a challenging task, since in the highly dynamic context of user updates one needs to differentiate among normal updates, vandalism and actual controversies. We describe a novel technique that finds these controversial issues by analyzing the edits that have been performed on the data over time. We apply the developed technique to Wikipedia, the world's largest known collaboratively generated database, and we show that our approach has higher precision and recall than baseline approaches and is capable of finding previously unknown controversies.
【Keywords】: Internet; Web sites; Web 2.0; Wikipedia; collaborative contribution; content generation; data quality; fine grained controversy detection; user generated content; Context; Databases; Electronic publishing; Encyclopedias; History; Internet
【Paper Link】 【Pages】:1585-1596
【Authors】: Ahmed Eldawy ; Mohamed F. Mokbel ; Saif Al-Harthi ; Abdulhadi Alzaidy ; Kareem Tarek ; Sohaib Ghani
【Abstract】: Remote sensing data collected by satellites are now made publicly available by several space agencies. This data is very useful for scientists pursuing research in several applications including climate change, desertification, and land use change. The benefit of this data comes from its richness, as it provides an archived history of over 15 years of satellite observations of natural phenomena such as temperature and vegetation. Unfortunately, the use of such data is very limited due to the huge size of the archives (> 500TB) and the limited capabilities of traditional applications. This paper introduces SHAHED, a MapReduce-based system for querying, visualizing, and mining large-scale satellite data. SHAHED considers both the spatial and temporal aspects of the data to provide efficient query processing at large scale. The core of SHAHED is composed of four main components. The uncertainty component recovers missing data in the input caused by cloud coverage and satellite misalignment. The indexing component provides a novel multi-resolution quad-tree-based spatio-temporal index structure, which indexes satellite data efficiently with minimal space overhead. The querying component answers selection and aggregate queries in real-time using the constructed index. Finally, the visualization component uses MapReduce programs to generate heat map images and videos for user queries. A set of experiments on a live system deployed on a cluster of machines shows the efficiency of the proposed design. All the features supported by SHAHED are made accessible through an easy-to-use Web interface that hides the complexity of the system and provides a pleasant user experience.
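One common way to picture a multi-resolution quad-tree key, shown here as a generic sketch rather than SHAHED's actual index layout, is to interleave the bits of discretized latitude and longitude into a Morton code, so that truncating the code yields the enclosing cell at a coarser resolution; the temporal dimension is ignored here.

```python
def morton_code(lat, lon, levels=16):
    """Interleave `levels` bits of latitude and longitude into a quadtree key."""
    y = int((lat + 90.0) / 180.0 * (1 << levels))
    x = int((lon + 180.0) / 360.0 * (1 << levels))
    code = 0
    for i in reversed(range(levels)):
        code = (code << 2) | (((y >> i) & 1) << 1) | ((x >> i) & 1)
    return code

def coarsen(code, from_levels, to_levels):
    """Key of the enclosing cell at a coarser resolution (a prefix of the code)."""
    return code >> (2 * (from_levels - to_levels))

if __name__ == "__main__":
    fine = morton_code(24.5, 46.7, levels=16)      # a point, keyed at 16 quadtree levels
    print(fine, coarsen(fine, 16, 8))              # same point in an 8-level cell
    # nearby points share long prefixes, so range queries touch contiguous key ranges
    print(coarsen(morton_code(24.6, 46.8, 16), 16, 8) == coarsen(fine, 16, 8))
```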
【Keywords】: artificial satellites; data handling; data visualisation; database indexing; geographic information systems; parallel programming; quadtrees; query processing; remote sensing; spatiotemporal phenomena; user interfaces; MapReduce-based System; SHAHED; Web interface; climate change; cloud coverage; data archives; desertification; heat map image generation; heat map video generation; indexing component; land use change; large-scale satellite data mining; large-scale satellite data query; large-scale satellite data visualization; live system; minimal space overhead; missing data recovery; multiresolution quad-tree-based spatio-temporal index structure; natural phenomena; query aggregation; query processing; query selection; querying component; remote sensing data collection; satellite data indexes; satellite misalignment; satellite observations; space agencies; spatial aspects; spatio-temporal satellite data query; spatio-temporal satellite data visualization; temporal aspects; uncertainty component; user experience; user queries; Aggregates; Data visualization; Heating; Indexing; Satellites; Uncertainty
【Paper Link】 【Pages】:1597-1599
【Authors】: Rakesh Agrawal
【Abstract】: Without question, astounding advances in storage, communication, and computation have propelled us into a new world, the world of Big Data. The Big Data enthusiasts believe that it is a revolution that will transform how we live, work, and think, and it will underpin new waves of innovation and productivity growth.
【Keywords】: Big Data; Big Data; innovation growth; productivity growth; Awards activities; Big data; Business; Computer science; Databases; Laboratories; Technological innovation
【Paper Link】 【Pages】:1600-1602
【Authors】: Rakesh Agrawal
【Abstract】: Asia is in the midst of a historic transformation. Asia's per capita income is projected to rise sixfold and its share of global gross domestic product is expected to increase to 52 percent by 2050 [1]. Science and technology has been cited as one of the key pillars for the success of Asia's development [2].
【Keywords】: data mining; economic indicators; financial data processing; information technology; Asia development success; Asia per capita income; data engineering; global gross domestic product; historic transformation; Asia; Australia; Awards activities; Big data; Computer science; Data mining; Laboratories