29. ICDE 2013: Brisbane, Australia

29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013. IEEE Computer Society 【DBLP Link】

Paper Num: 142 || Session Num: 0

1. Hardware killed the software star.

【Paper Link】 【Pages】:1-4

【Authors】: Gustavo Alonso

【Abstract】: Until relatively recently, the development of data processing applications took place largely ignoring the underlying hardware. Only in niche applications (supercomputing, embedded systems) or in special software (operating systems, database internals, language runtimes) did (some) programmers have to pay attention to the actual hardware where the software would run. In most cases, working atop the abstractions provided by either the operating system or by system libraries was good enough. The constant improvements in processor speed did the rest. The new millennium has radically changed the picture. Driven by multiple needs - e.g., scale, physical constraints, energy limitations, virtualization, business models - hardware architectures are changing at a speed and in ways that current development practices for data processing cannot accommodate. From now on, software will have to be developed paying close attention to the underlying hardware and following strict performance engineering principles. In this paper, several aspects of the ongoing hardware revolution and its impact on data processing are analysed, pointing to the need for new strategies to tackle the challenges ahead.

【Keywords】: embedded systems; mainframes; parallel machines; software libraries; data processing applications; database internals; embedded systems; hardware revolution; language runtimes; operating systems; supercomputing; system libraries; Data processing; Engines; Hardware; Multicore processing; Parallel processing; Software

2. Recent progress towards an ecosystem of structured data on the Web.

【Paper Link】 【Pages】:5-8

【Authors】: Nitin Gupta ; Alon Y. Halevy ; Boulos Harb ; Heidi Lam ; Hongrae Lee ; Jayant Madhavan ; Fei Wu ; Cong Yu

【Abstract】: Google Fusion Tables aims to support an ecosystem of structured data on the Web by providing a tool for managing and visualizing data on the one hand, and for searching and exploring for data on the other. This paper describes a few recent developments in our efforts to further the ecosystem.

【Keywords】: Internet; data structures; data visualisation; search engines; Google fusion tables; World Wide Web; data management; data visualization; ecosystem; structured data; Data visualization; Ecosystems; Google; Government; HTML; Indexes; Semantics

3. Re-thinking the performance of information processing systems.

【Paper Link】 【Pages】:9-13

【Authors】: Vishal Sikka

【Abstract】: Recent advances in hardware and software technologies have enabled us to re-think how we architect databases to meet the demands of today's information systems. However, this makes existing performance evaluation metrics obsolete. In this paper, I describe SAP HANA, a novel, powerful database platform that leverages the availability of large main memory and massively parallel processors. Based on this, I propose a new, multi-dimensional performance metric that better reflects the value expected from today's complex information systems.

【Keywords】: database management systems; software metrics; software performance evaluation; SAP HANA; database platform; hardware technology; information processing system; multidimensional performance metric; parallel processor; performance evaluation metrics; software technology; Companies; Complexity theory; Databases; Information processing; Real-time systems

4. CPU and cache efficient management of memory-resident databases.

【Paper Link】 【Pages】:14-25

【Authors】: Holger Pirk ; Florian Funke ; Martin Grund ; Thomas Neumann ; Ulf Leser ; Stefan Manegold ; Alfons Kemper ; Martin L. Kersten

【Abstract】: Memory-Resident Database Management Systems (MRDBMS) have to be optimized for two resources: CPU cycles and memory bandwidth. To optimize for bandwidth in mixed OLTP/OLAP scenarios, the hybrid or Partially Decomposed Storage Model (PDSM) has been proposed. However, in current implementations, bandwidth savings achieved by partial decomposition come at increased CPU costs. To achieve the aspired bandwidth savings without sacrificing CPU efficiency, we combine partially decomposed storage with Just-in-Time (JiT) compilation of queries, thus eliminating CPU inefficient function calls. Since existing cost based optimization components are not designed for JiT-compiled query execution, we also develop a novel approach to cost modeling and subsequent storage layout optimization. Our evaluation shows that the JiT-based processor maintains the bandwidth savings of previously presented hybrid query processors but outperforms them by two orders of magnitude due to increased CPU efficiency.

【Keywords】: cache storage; data mining; data warehouses; database management systems; CPU cycle; JiT compilation; JiT-based processor; MRDBMS; OLAP; OLTP; PDSM; bandwidth savings; cache efficient management; cost based optimization component; hybrid decomposed storage model; hybrid query processor; just-in-time; memory bandwidth; memory-resident database management system; partially decomposed storage model; Bandwidth; Data models; Databases; Layout; Optimization; Prefetching; Volcanoes

5. Identifying hot and cold data in main-memory databases.

【Paper Link】 【Pages】:26-37

【Authors】: Justin J. Levandoski ; Per-Åke Larson ; Radu Stoica

【Abstract】: Main memories are becoming sufficiently large that most OLTP databases can be stored entirely in main memory, but this may not be the best solution. OLTP workloads typically exhibit skewed access patterns where some records are hot (frequently accessed) but many records are cold (infrequently or never accessed). It is more economical to store the coldest records on secondary storage such as flash. As a first step towards managing cold data in databases optimized for main memory we investigate how to efficiently identify hot and cold data. We propose to log record accesses - possibly only a sample to reduce overhead - and perform offline analysis to estimate record access frequencies. We present four estimation algorithms based on exponential smoothing and experimentally evaluate their efficiency and accuracy. We find that exponential smoothing produces very accurate estimates, leading to higher hit rates than the best caching techniques. Our most efficient algorithm is able to analyze a log of 1B accesses in sub-second time on a workstation-class machine.
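
【Sketch】: A minimal single-machine illustration of the offline analysis idea: replay a (possibly sampled) access log in time slices and keep an exponentially smoothed access count per record; the hottest records are the ones with the largest estimates. The function names, the slice size, and the smoothing constant are illustrative assumptions, not the paper's code.

    from collections import defaultdict

    ALPHA = 0.05  # smoothing constant (assumed value)

    def estimate_frequencies(access_log, slice_size=10000):
        """access_log: record ids in time order (possibly only a sample).
        Returns an exponentially smoothed per-slice access count for each record."""
        est = defaultdict(float)       # record id -> smoothed estimate
        counts = defaultdict(int)      # accesses seen in the current time slice
        for i, rid in enumerate(access_log, 1):
            counts[rid] += 1
            if i % slice_size == 0:    # close the current time slice
                for r in set(est) | set(counts):
                    est[r] = ALPHA * counts.get(r, 0) + (1 - ALPHA) * est[r]
                counts.clear()
        return est

    def hottest(est, k):
        """The k records with the highest estimated access frequency."""
        return sorted(est, key=est.get, reverse=True)[:k]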

【Keywords】: data handling; database management systems; estimation theory; OLTP databases; estimation algorithms; exponential smoothing; identifying cold data; identifying hot data; main memory databases; record access frequencies; secondary storage; skewed access patterns; Classification algorithms; Databases; Engines; Equations; Frequency estimation; Random access memory; Smoothing methods

6. The adaptive radix tree: ARTful indexing for main-memory databases.

【Paper Link】 【Pages】:38-49

【Authors】: Viktor Leis ; Alfons Kemper ; Thomas Neumann

【Abstract】: Main memory capacities have grown up to a point where most databases fit into RAM. For main-memory database systems, index structure performance is a critical bottleneck. Traditional in-memory data structures like balanced binary search trees are not efficient on modern hardware, because they do not optimally utilize on-CPU caches. Hash tables, also often used for main-memory indexes, are fast but only support point queries. To overcome these shortcomings, we present ART, an adaptive radix tree (trie) for efficient indexing in main memory. Its lookup performance surpasses highly tuned, read-only search trees, while supporting very efficient insertions and deletions as well. At the same time, ART is very space efficient and solves the problem of excessive worst-case space consumption, which plagues most radix trees, by adaptively choosing compact and efficient data structures for internal nodes. Even though ART's performance is comparable to hash tables, it maintains the data in sorted order, which enables additional operations like range scan and prefix lookup.
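
【Sketch】: The adaptive-node idea in miniature (names are mine; the real ART uses four node types, Node4/16/48/256, plus path compression and lazy leaf expansion): an inner node starts as small parallel key/child arrays and is promoted to a 256-slot array once it overflows.

    class SmallNode:
        """Inner node holding up to CAP children as parallel (key byte, child) arrays,
        roughly analogous to ART's Node4/Node16."""
        CAP = 16

        def __init__(self):
            self.keys, self.children = [], []

        def find(self, byte):
            for k, c in zip(self.keys, self.children):
                if k == byte:
                    return c
            return None

        def insert(self, byte, child):
            if len(self.keys) < self.CAP:
                self.keys.append(byte)
                self.children.append(child)
                return self
            big = FullNode()                    # overflow: grow to the larger layout
            for k, c in zip(self.keys, self.children):
                big.insert(k, c)
            big.insert(byte, child)
            return big                          # caller replaces its child pointer

    class FullNode:
        """Inner node with one slot per possible key byte, analogous to Node256."""
        def __init__(self):
            self.slots = [None] * 256

        def find(self, byte):
            return self.slots[byte]

        def insert(self, byte, child):
            self.slots[byte] = child
            return self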

【Keywords】: cache storage; database indexing; table lookup; tree data structures; tree searching; ART performance; ARTful indexing; RAM; adaptive radix tree; balanced binary search tree; deletion; hash table; in-memory data structure; index structure performance; insertion; internal node; lookup performance; main memory index; main-memory database system; main memory capacity; on-CPU cache utilization; point query; prefix lookup; range scan; read-only search tree; sorted order data; worst-case space consumption; Arrays; Indexing; Subspace constraints; Vegetation

7. Finding connected components in map-reduce in logarithmic rounds.

【Paper Link】 【Pages】:50-61

【Authors】: Vibhor Rastogi ; Ashwin Machanavajjhala ; Laukik Chitnis ; Anish Das Sarma

【Abstract】: Given a large graph G = (V, E) with millions of nodes and edges, how do we compute its connected components efficiently? Recent work addresses this problem in map-reduce, where a fundamental trade-off exists between the number of map-reduce rounds and the communication of each round. Denoting d the diameter of the graph, and n the number of nodes in the largest component, all prior techniques for map-reduce either require a linear, Θ(d), number of rounds, or a quadratic, Θ (n|V| + |E|), communication per round. We propose here two efficient map-reduce algorithms: (i) Hash-Greater-to-Min, which is a randomized algorithm based on PRAM techniques, requiring O(log n) rounds and O(|V | + |E|) communication per round, and (ii) Hash-to-Min, which is a novel algorithm, provably finishing in O(log n) iterations for path graphs. The proof technique used for Hash-to-Min is novel, but not tight, and it is actually faster than Hash-Greater-to-Min in practice. We conjecture that it requires 2 log d rounds and 3(|V| + |E|) communication per round, as demonstrated in our experiments. Using secondary sorting, a standard map-reduce feature, we scale Hash-to-Min to graphs with very large connected components. Our techniques for connected components can be applied to clustering as well. We propose a novel algorithm for agglomerative single linkage clustering in map-reduce. This is the first map-reduce algorithm for clustering in at most O(log n) rounds, where n is the size of the largest cluster. We show the effectiveness of all our algorithms through detailed experiments on large synthetic as well as real-world datasets.
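
【Sketch】: A local simulation of the Hash-to-Min data flow (the map and reduce phases are emulated with an in-memory dictionary; in the paper each round is one map-reduce job, and names here are mine). At the fixpoint, the minimum node of every component holds the whole component and every other node holds only that minimum.

    def hash_to_min_round(clusters):
        """One Hash-to-Min round. clusters: dict node -> current cluster (set of nodes)."""
        received = {v: set() for v in clusters}
        for v, c in clusters.items():          # "map": each node emits two kinds of messages
            m = min(c)
            received[m] |= c                   # send the whole cluster to its minimum node
            for u in c:
                received[u].add(m)             # send the minimum to every cluster member
        return received                        # "reduce": union of everything received

    def connected_components(edges):
        nodes = {n for e in edges for n in e}
        clusters = {v: {v} for v in nodes}
        for u, v in edges:                     # seed every cluster with direct neighbours
            clusters[u].add(v)
            clusters[v].add(u)
        while True:
            nxt = hash_to_min_round(clusters)
            if nxt == clusters:                # fixpoint reached
                return clusters
            clusters = nxt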

【Keywords】: computational complexity; file organisation; graph theory; pattern clustering; randomised algorithms; sorting; PRAM technique; agglomerative single linkage clustering; connected component; graph diameter; hash-greater-to-min; logarithmic rounds; map-reduce algorithm; map-reduce rounds; proof technique; randomized algorithm; real-world dataset; secondary sorting; Clustering algorithms; Complexity theory; Convergence; Couplings; Merging; Phase change random access memory; Vegetation

8. Enumerating subgraph instances using map-reduce.

【Paper Link】 【Pages】:62-73

【Authors】: Foto N. Afrati ; Dimitris Fotakis ; Jeffrey D. Ullman

【Abstract】: The theme of this paper is how to find all instances of a given “sample” graph in a larger “data graph,” using a single round of map-reduce. For the simplest sample graph, the triangle, we improve upon the best known such algorithm. We then examine the general case, considering both the communication cost between mappers and reducers and the total computation cost at the reducers. To minimize communication cost, we exploit the techniques of [1] for computing multiway joins (evaluating conjunctive queries) in a single map-reduce round. Several methods are shown for translating sample graphs into a union of conjunctive queries with as few queries as possible. We also address the matter of optimizing computation cost. Many serial algorithms are shown to be “convertible,” in the sense that it is possible to partition the data graph, explore each partition in a separate reducer, and have the total computation cost at the reducers be of the same order as the computation cost of the serial algorithm.

【Keywords】: graph theory; parallel algorithms; query processing; Map-Reduce; communication cost; computation cost; conjunctive queries; data graph; multiway joins; subgraph instance enumeration; Abstracts; Approximation algorithms; Complexity theory; Educational institutions; Partitioning algorithms; Probabilistic logic; Silicon

9. Scalable maximum clique computation using MapReduce.

【Paper Link】 【Pages】:74-85

【Authors】: Jingen Xiang ; Cong Guo ; Ashraf Aboulnaga

【Abstract】: We present a scalable and fault-tolerant solution for the maximum clique problem based on the MapReduce framework. The key contribution that enables us to effectively use MapReduce is a recursive partitioning method that partitions the graph into several subgraphs of similar size. After partitioning, the maximum cliques of the different partitions can be computed independently, and the computation is sped up using a branch and bound method. Our experiments show that our approach leads to good scalability, which is unachievable by other partitioning methods since they result in partitions of different sizes and hence lead to load imbalance. Our method is more scalable than an MPI algorithm, and is simpler and more fault tolerant.
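
【Sketch】: The serial branch-and-bound kernel that each reducer could run on the subgraph of its partition; a driver would then keep the largest clique found across partitions. This is an illustrative baseline, not the paper's optimized implementation.

    def max_clique(adj):
        """Exact maximum clique by branch and bound. adj: dict vertex -> set of neighbours."""
        best = []

        def expand(clique, candidates):
            nonlocal best
            if len(clique) > len(best):
                best = clique                              # new incumbent
            if len(clique) + len(candidates) <= len(best):
                return                                     # bound: cannot beat the incumbent
            for v in sorted(candidates):
                expand(clique + [v], candidates & adj[v])  # branch: extend the clique with v
                candidates = candidates - {v}              # v fully explored; prune it later

        expand([], set(adj))
        return best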

【Keywords】: fault tolerant computing; message passing; tree searching; MPI algorithm; MapReduce framework; branch and bound method; fault-tolerant solution; load imbalance; maximum clique problem; partitioning methods; recursive partitioning method; scalable maximum clique computation; Clustering algorithms; Color; Fault tolerance; Fault tolerant systems; Partitioning algorithms; Peer-to-peer computing; Scalability

10. Ficklebase: Looking into the future to erase the past.

【Paper Link】 【Pages】:86-97

【Authors】: Sumeet Bajaj ; Radu Sion

【Abstract】: It has become apparent that in the digital world data once stored is never truly deleted even when such an expunction is desired either as a normal system function or for regulatory compliance purposes. Forensic Analysis techniques on systems are often successful at recovering information said to have been deleted in the past. Efforts aimed at thwarting such forensic analysis of systems have either focused on (i) identifying the system components where deleted data lingers and performing a secure delete operation over these remnants, or (ii) designing history independent data structures that hide information about past operations which result in the current system state. Yet, new data is constantly derived by processing existing (input) data which makes it increasingly difficult to remove all traces of this existing data, i.e., for regulatory compliance purposes. Even after deletion, significant information can linger in and be recoverable from the side effects the deleted data records left on the currently available state. In this paper we address this aspect in the context of a relational database, such that when combined with (i) & (ii), complete erasure of data and its effects can be achieved (“un-traceable deletion”). We introduce Ficklebase - a relational database wherein once a tuple has been “expired” - any and all its side-effects are removed, thereby eliminating all its traces, rendering it unrecoverable, and also guaranteeing that the deletion itself is undetectable. We present the design and evaluation of Ficklebase, and then discuss several of the fundamental functional implications of un-traceable deletion.

【Keywords】: digital forensics; relational databases; Ficklebase; current system state; digital world data; forensic analysis techniques; history independent data structures; normal system function; regulatory compliance purposes; relational database; secure delete operation; system component identification; un-traceable deletion; Data privacy; Data structures; Forensics; History; Indexes; Relational databases

11. Time travel in a scientific array database.

【Paper Link】 【Pages】:98-109

【Authors】: Emad Soroush ; Magdalena Balazinska

【Abstract】: In this paper, we present TimeArr, a new storage manager for an array database. TimeArr supports the creation of a sequence of versions of each stored array and their exploration through two types of time travel operations: selection of a specific version of a (sub)-array and a more general extraction of a (sub)-array history, in the form of a series of (sub)-array versions. TimeArr contributes a combination of array-specific storage techniques to efficiently support these operations. To speed-up array exploration, TimeArr further introduces two additional techniques. The first is the notion of approximate time travel with two types of operations: approximate version selection and approximate history. For these operations, users can tune the degree of approximation tolerable and thus trade-off accuracy and performance in a principled manner. The second is to lazily create short connections, called skip links, between the same (sub)-arrays at different versions with similar data patterns to speed up the selection of a specific version. We implement TimeArr within the SciDB array processing engine and demonstrate its performance through experiments on two real datasets from the astronomy and earth sciences domains.

【Keywords】: approximation theory; database management systems; storage management; SciDB array processing engine; TimeArr; approximate history; approximate time travel; approximate version selection; array exploration; array-specific storage techniques; astronomy; data patterns; earth sciences; scientific array database; skip links; time travel operations; Approximation methods; Arrays; Encoding; History; Layout; Query processing

12. Time travel in column stores.

【Paper Link】 【Pages】:110-121

【Authors】: Martin Kaufmann ; Amin Amiri Manjili ; Stefan Hildenbrand ; Donald Kossmann ; Andreas Tonder

【Abstract】: Recent studies have shown that column stores can outperform row stores significantly. This paper explores alternative approaches to extend column stores with versioning, i.e., time travel queries and the maintenance of historic data. On the one hand, adding versioning can actually simplify the design of a column store because it provides a solution for the implementation of updates, traditionally a weak point in the design of column stores. On the other hand, implementing a versioned column store is challenging because it imposes a two dimensional clustering problem: should the data be clustered by row or by version? This paper devises the details of three memory layouts: clustering by row, clustering by version, and hybrid clustering. Performance experiments demonstrate that all three approaches outperform a (traditional) versioned row store. The efficiency of these three memory layouts depends on the query and update workload. Furthermore, the performance experiments analyze the time-space tradeoff that can be made in the implementation of versioned column stores.

【Keywords】: storage management; 2D clustering problem; column stores; historic data; hybrid clustering; memory layout; time travel queries; Arrays; Database systems; Dictionaries; Layout; Memory management; Portfolios

13. Top-k query processing in probabilistic databases with non-materialized views.

【Paper Link】 【Pages】:122-133

【Authors】: Maximilian Dylla ; Iris Miliaraki ; Martin Theobald

【Abstract】: We investigate a novel approach of computing confidence bounds for top-k ranking queries in probabilistic databases with non-materialized views. Unlike related approaches, we present an exact pruning algorithm for finding the top-ranked query answers according to their marginal probabilities without the need to first materialize all answer candidates via the views. Specifically, we consider conjunctive queries over multiple levels of select-project-join views, the latter of which are cast into Datalog rules which we ground in a top-down fashion directly at query processing time. To our knowledge, this work is the first to address integrated data and confidence computations for intensional query evaluations in the context of probabilistic databases by considering confidence bounds over first-order lineage formulas. We extend our query processing techniques by a tool-suite of scheduling strategies based on selectivity estimation and the expected impact on confidence bounds. Further extensions to our query processing strategies include improved top-k bounds in the case when sorted relations are available as input, as well as the consideration of recursive rules. Experiments with large datasets demonstrate significant runtime improvements of our approach compared to both exact and sampling-based top-k methods over probabilistic data.

【Keywords】: DATALOG; database management systems; probability; query processing; scheduling; Datalog rule; confidence bound; confidence computation; conjunctive query; first-order lineage formula; integrated data; intensional query evaluation; marginal probability; nonmaterialized view; probabilistic database; pruning algorithm; scheduling strategy; select-project-join views; selectivity estimation; top-k bound; top-k query processing; top-k ranking query; top-ranked query answer; Grounding; Motion pictures; Probabilistic logic; Query processing; Semantics; Upper bound

14. Cleaning uncertain data for top-k queries.

【Paper Link】 【Pages】:134-145

【Authors】: Luyi Mo ; Reynold Cheng ; Xiang Li ; David W. Cheung ; Xuan S. Yang

【Abstract】: The information managed in emerging applications, such as sensor networks, location-based services, and data integration, is inherently imprecise. To handle data uncertainty, probabilistic databases have been recently developed. In this paper, we study how to quantify the ambiguity of answers returned by a probabilistic top-k query. We develop efficient algorithms to compute the quality of this query under the possible world semantics. We further address the cleaning of a probabilistic database, in order to improve top-k query quality. Cleaning involves the reduction of ambiguity associated with the database entities. For example, the uncertainty of a temperature value acquired from a sensor can be reduced, or cleaned, by requesting its newest value from the sensor. While this “cleaning operation” may produce a better query result, it may involve a cost and fail. We investigate the problem of selecting entities to be cleaned under a limited budget. Particularly, we propose an optimal solution and several heuristics. Experiments show that the greedy algorithm is efficient and close to optimal.
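
【Sketch】: A hedged illustration of a budget-constrained greedy heuristic: clean the entities with the best expected quality gain per unit cost first. The gain values here are placeholders; in the paper, quality is defined over the possible-world semantics of the top-k query and cleaning operations may fail.

    def greedy_clean(candidates, budget):
        """candidates: list of (entity, cost, expected_quality_gain), with cost > 0.
        Pick entities to clean, best gain-per-cost first, without exceeding the budget."""
        chosen, spent = [], 0.0
        for entity, cost, gain in sorted(candidates, key=lambda c: c[2] / c[1], reverse=True):
            if spent + cost <= budget:
                chosen.append(entity)
                spent += cost
        return chosen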

【Keywords】: data integration; query processing; cleaning operation; data integration; data uncertainty handling; location-based services; optimal solution; probabilistic databases; probabilistic top-k query; sensor networks; temperature value uncertainty; top-k queries; top-k query quality; uncertain data cleaning; world semantics; Cleaning; Motion pictures; Probabilistic logic; Query processing; Semantics; Uncertainty

15. Top-K oracle: A new way to present top-k tuples for uncertain data.

【Paper Link】 【Pages】:146-157

【Authors】: Chunyao Song ; Zheng Li ; Tingjian Ge

【Abstract】: Managing noisy and uncertain data is needed in a great number of modern applications. A major difficulty in managing such data is the sheer number of query result tuples with diverse probabilities. In many cases, users have a preference over the tuples in a deterministic world, determined by a scoring function. Yet it has been a challenging problem to return top-k for uncertain data. Various semantics have been proposed, and they have been shown to give wildly different tuple rankings. In this paper, we propose a completely different approach. Instead of returning users k tuples, which are merely one point in the complex distribution of top-k tuple vectors, we provide a so-called top-k oracle and users can arbitrarily query it. Intuitively, an oracle is a black box that, whenever given an SQL query, returns its result. Any information we give is based on faithful, best-effort estimates of the ground-truth top-k tuples. This is especially critical in emergency response applications and in monitoring top-k applications. Furthermore, we are the first to provide the nested query capability with the uncertain top-k result being a subquery. We devise various query processing algorithms for top-k oracles, and verify their efficiency and accuracy through a systematic evaluation over real-world and synthetic datasets.

【Keywords】: SQL; data handling; query processing; SQL query; black box; ground-truth top-k tuples; query result tuples; scoring function; top-k oracle; top-k tuple vectors; tuple rankings; uncertain data; Algorithm design and analysis; Approximation algorithms; Approximation methods; Query processing; Semantics; Servers; Vectors

16. Voronoi-based nearest neighbor search for multi-dimensional uncertain databases.

【Paper Link】 【Pages】:158-169

【Authors】: Peiwu Zhang ; Reynold Cheng ; Nikos Mamoulis ; Matthias Renz ; Andreas Züfle ; Yu Tang ; Tobias Emrich

【Abstract】: In Voronoi-based nearest neighbor search, the Voronoi cell of every point p in a database can be used to check whether p is the closest to some query point q. We extend the notion of Voronoi cells to support uncertain objects, whose attribute values are inexact. Particularly, we propose the Possible Voronoi cell (or PV-cell). A PV-cell of a multi-dimensional uncertain object o is a region R, such that for any point p ∈ R, o may be the nearest neighbor of p. If the PV-cells of all objects in a database S are known, they can be used to identify objects that have a chance to be the nearest neighbor of q. However, there is no efficient algorithm for computing an exact PV-cell. We hence study how to derive an axis-parallel hyper-rectangle (called the Uncertain Bounding Rectangle, or UBR) that tightly contains a PV-cell. We further develop the PV-index, a structure that stores UBRs, to evaluate probabilistic nearest neighbor queries over uncertain data. An advantage of the PV-index is that upon updates on S, it can be incrementally updated. Extensive experiments on both synthetic and real datasets are carried out to validate the performance of the PV-index.

【Keywords】: computational geometry; database management systems; optimisation; query processing; PV-cell; PV-index; UBR; Voronoi-based nearest neighbor search; axis-parallel hyper-rectangle; multidimensional uncertain databases; possible Voronoi cell; probabilistic nearest neighbor queries; uncertain bounding rectangle; Approximation methods; Indexes; Nearest neighbor searches; Shape; Three-dimensional displays; Uncertainty

17. Interval reverse nearest neighbor queries on uncertain data with Markov correlations.

【Paper Link】 【Pages】:170-181

【Authors】: Chuanfei Xu ; Yu Gu ; Lei Chen ; Jianzhong Qiao ; Ge Yu

【Abstract】: Nowadays, many applications return to the user a set of results that take the query as their nearest neighbor, which are commonly expressed through reverse nearest neighbor (RNN) queries. When considering moving objects, users would like to find objects that appear in the RNN result set for a period of time in some real-world applications such as collaboration recommendation and anti-tracking. In this work, we formally define the problem of interval reverse nearest neighbor (IRNN) queries over moving objects, which return the objects that maintain nearest neighboring relations to the moving query objects for the longest time in the given interval. Location uncertainty of moving data objects and moving query objects is inherent in various domains, and we investigate objects that exhibit Markov correlations, that is, each object's location is correlated only with its own location at the previous timestamp while being independent of other objects. Answering IRNN queries on uncertain moving objects with Markov correlations is challenging because we have to retrieve not only all the possible locations of each object at the current time but also its historically possible locations. To speed up the query processing, we present a general framework for answering IRNN queries on uncertain moving objects with Markov correlations in two phases. In the first phase, we apply space pruning and probability pruning techniques, which reduce the search space significantly. In the second phase, we verify whether each unpruned object is an IRNN of the query object. During this phase, we propose an approach termed the Probability Decomposition Verification (PDV) algorithm, which avoids computing exactly the probability of any object being an RNN of the query object and thus improves the efficiency of verification. The performance of the proposed algorithm is demonstrated by extensive experiments on synthetic and real datasets, and the experimental results show that our algorithm is more efficient than the Monte-Carlo based approximate algorithm.

【Keywords】: Markov processes; Monte Carlo methods; approximation theory; data handling; query processing; recommender systems; IRNN queries; Markov correlations; Monte-Carlo based approximate algorithm; RNN queries; collaboration recommendation; data objects; extensive experiments; interval reverse nearest neighbor queries; nearest neighboring relations; probability pruning techniques; query objects; query processing; real datasets; search space; synthetic datasets; uncertain data; Correlation; Markov processes; Monitoring; Nearest neighbor searches; Query processing; Trajectory

18. Efficient tracking and querying for coordinated uncertain mobile objects.

【Paper Link】 【Pages】:182-193

【Authors】: Nicholas D. Larusso ; Ambuj K. Singh

【Abstract】: Accurately estimating the current positions of moving objects is a challenging task due to the various forms of data uncertainty (e.g. limited sensor precision, periodic updates from continuously moving objects). However, in many cases, groups of objects tend to exhibit similarities in their movement behavior. For example, vehicles in a convoy or animals in a herd both exhibit tightly coupled movement behavior within the group. While such statistical dependencies often increase the computational complexity necessary for capturing this additional structure, they also provide useful information which can be utilized to provide more accurate location estimates. In this paper, we propose a novel model for accurately tracking coordinated groups of mobile uncertain objects. We introduce an exact and more efficient approximate inference algorithm for updating the current location of each object upon the arrival of new (uncertain) location observations. Additionally, we derive probability bounds over the groups in order to process probabilistic threshold range queries more efficiently. Our experimental evaluation shows that our proposed model can provide 4X improvements in tracking accuracy over competing models which do not consider group behavior. We also show that our bounds enable us to prune up to 50% of the database, resulting in more efficient processing over a linear scan.

【Keywords】: approximation theory; inference mechanisms; mobile computing; object tracking; probability; query processing; approximate inference algorithm; arrival of new location observations; computational complexity; coordinated uncertain mobile object query; coordinated uncertain mobile object tracking; coupled movement behavior; data uncertainty; linear scan processing; mobile uncertain object; moving object position; probabilistic threshold range query; probability bounds; statistical dependencies; Approximation algorithms; Computational modeling; Covariance matrices; Hidden Markov models; Inference algorithms; Mobile communication; Uncertainty

19. Attribute extraction and scoring: A probabilistic approach.

【Paper Link】 【Pages】:194-205

【Authors】: Taesung Lee ; Zhongyuan Wang ; Haixun Wang ; Seung-won Hwang

【Abstract】: Knowledge bases, which consist of concepts, entities, attributes and relations, are increasingly important in a wide range of applications. We argue that knowledge about attributes (of concepts or entities) plays a critical role in inferencing. In this paper, we propose methods to derive attributes for millions of concepts and we quantify the typicality of the attributes with regard to their corresponding concepts. We employ multiple data sources such as web documents, search logs, and existing knowledge bases, and we derive typicality scores for attributes by aggregating different distributions derived from different sources using different methods. To the best of our knowledge, ours is the first approach to integrate concept- and instance-based patterns into probabilistic typicality scores that scale to broad concept space. We have conducted extensive experiments to show the effectiveness of our approach.
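
【Sketch】: A toy version of the score aggregation (names and the weighting scheme are assumptions): each source's raw attribute counts for a concept are normalized into a distribution, and the per-source distributions are combined into a single typicality score per attribute.

    from collections import defaultdict

    def typicality_scores(source_counts, source_weights):
        """source_counts: {source: {attribute: raw count}} for one concept.
        Returns a normalized typicality score per attribute."""
        scores = defaultdict(float)
        for src, counts in source_counts.items():
            total = sum(counts.values()) or 1
            for attr, cnt in counts.items():
                scores[attr] += source_weights.get(src, 1.0) * cnt / total
        norm = sum(scores.values()) or 1
        return {a: s / norm for a, s in scores.items()}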

【Keywords】: inference mechanisms; knowledge based systems; Web documents; attribute extraction; attribute scoring; concept-based patterns; instance-based patterns; knowledge bases; multiple data sources; probabilistic approach; probabilistic typicality scores; search logs; Companies; Data mining; Knowledge based systems; Probabilistic logic; Sociology; Statistics; Syntactics

20. TYPifier: Inferring the type semantics of structured data.

【Paper Link】 【Pages】:206-217

【Authors】: Yongtao Ma ; Thanh Tran ; Veli Bicer

【Abstract】: Structured data representing entity descriptions often lacks precise type information. That is, it is not known to which type an entity belongs, or the type is too general to be useful. In this work, we propose to deal with this novel problem of inferring the type semantics of structured data, called typification. We formulate it as a clustering problem and discuss the features needed to obtain several solutions based on existing clustering solutions. Because schema features perform best, but are not abundantly available, we propose an approach to automatically derive them from data. Optimized for the use of schema features, we present TYPifier, a novel clustering algorithm that, in experiments, yields better typification results than the baseline clustering solutions.

【Keywords】: data structures; pattern clustering; programming language semantics; type theory; TYPifier; clustering algorithm; clustering problem; clustering solution; entity description; schema feature; structured data; type information; type semantics inference; typification; Clustering algorithms; DVD; Feature extraction; Measurement; Media; Resource description framework; Semantics

21. SUSIE: Search Using Services and Information Extraction.

【Paper Link】 【Pages】:218-229

【Authors】: Nicoleta Preda ; Fabian M. Suchanek ; Wenjun Yuan ; Gerhard Weikum

【Abstract】: The API of a Web service restricts the types of queries that the service can answer. For example, a Web service might provide a method that returns the songs of a given singer, but it might not provide a method that returns the singers of a given song. If the user asks for the singer of some specific song, then the Web service cannot be called - even though the underlying database might have the desired piece of information. This asymmetry is particularly problematic if the service is used in a Web service orchestration system. In this paper, we propose to use on-the-fly information extraction to collect values that can be used as parameter bindings for the Web service. We show how this idea can be integrated into a Web service orchestration system. Our approach is fully implemented in a prototype called SUSIE. We present experiments with real-life data and services to demonstrate the practical viability and good performance of our approach.

【Keywords】: Web services; application program interfaces; information retrieval; API; SUSIE; Web service orchestration system; on-the-fly information extraction; parameter bindings; real life services; real-life data; search using services; Context; Databases; Educational institutions; Information retrieval; Knowledge based systems; Standards; Web services

22. Towards efficient search for activity trajectories.

【Paper Link】 【Pages】:230-241

【Authors】: Kai Zheng ; Shuo Shang ; Nicholas Jing Yuan ; Yi Yang

【Abstract】: The advances in location positioning and wireless communication technologies have led to a myriad of spatial trajectories representing the mobility of a variety of moving objects. While processing trajectory data with the focus of spatio-temporal features has been widely studied in the last decade, recent proliferation in location-based web applications (e.g., Foursquare, Facebook) has given rise to large amounts of trajectories associated with activity information, called activity trajectory. In this paper, we study the problem of efficient similarity search on activity trajectory database. Given a sequence of query locations, each associated with a set of desired activities, an activity trajectory similarity query (ATSQ) returns k trajectories that cover the query activities and yield the shortest minimum match distance. An order-sensitive activity trajectory similarity query (OATSQ) is also proposed to take into account the order of the query locations. To process the queries efficiently, we firstly develop a novel hybrid grid index, GAT, to organize the trajectory segments and activities hierarchically, which enables us to prune the search space by location proximity and activity containment simultaneously. In addition, we propose algorithms for efficient computation of the minimum match distance and minimum order-sensitive match distance, respectively. The results of our extensive empirical studies based on real online check-in datasets demonstrate that our proposed index and methods are capable of achieving superior performance and good scalability.

【Keywords】: Internet; mobile computing; object detection; query formulation; spatiotemporal phenomena; OATSQ; efficient search; location positioning; location-based Web applications; moving objects; order-sensitive activity trajectory similarity query; query locations; spatio-temporal features; wireless communication technologies; Educational institutions; Indexing; Search problems; Trajectory; Vocabulary

23. On discovery of gathering patterns from trajectories.

【Paper Link】 【Pages】:242-253

【Authors】: Kai Zheng ; Yu Zheng ; Nicholas Jing Yuan ; Shuo Shang

【Abstract】: The increasing pervasiveness of location-acquisition technologies has enabled collection of huge amounts of trajectories for almost any kind of moving object. Discovering useful patterns from their movement behaviours can convey valuable knowledge to a variety of critical applications. In this light, we propose a novel concept, called gathering, which is a trajectory pattern modelling various group incidents such as celebrations, parades, protests, traffic jams and so on. A key observation is that these incidents typically involve large congregations of individuals, which form durable and stable areas with high density. Since the process of discovering gathering patterns over large-scale trajectory databases can be quite lengthy, we further develop a set of well-thought-out techniques to improve the performance. These techniques, including effective indexing structures, fast pattern detection algorithms implemented with bit vectors, and incremental algorithms for handling new trajectory arrivals, collectively constitute an efficient solution for this challenging task. Finally, the effectiveness of the proposed concepts and the efficiency of the approaches are validated by extensive experiments based on a real taxicab trajectory dataset.

【Keywords】: database management systems; indexing; mobile computing; fast pattern detection algorithms; gathering patterns; group incidents; incremental algorithms; indexing structures; large-scale trajectory databases; location-acquisition technologies; movement behaviours; taxicab trajectory dataset; trajectory pattern; Clustering algorithms; Complexity theory; Databases; Educational institutions; Shape; Trajectory; Vectors

24. Destination prediction by sub-trajectory synthesis and privacy protection against such prediction.

【Paper Link】 【Pages】:254-265

【Authors】: Andy Yuan Xue ; Rui Zhang ; Yu Zheng ; Xing Xie ; Jin Huang ; Zhenghua Xu

【Abstract】: Destination prediction is an essential task for many emerging location based applications such as recommending sightseeing places and targeted advertising based on destination. A common approach to destination prediction is to derive the probability of a location being the destination based on historical trajectories. However, existing techniques using this approach suffer from the “data sparsity problem”, i.e., the available historical trajectories are far from being able to cover all possible trajectories. This problem considerably limits the number of query trajectories that can obtain predicted destinations. We propose a novel method, the Sub-Trajectory Synthesis (SubSyn) algorithm, to address the data sparsity problem. The SubSyn algorithm first decomposes historical trajectories into sub-trajectories comprising two neighbouring locations, and then connects the sub-trajectories into “synthesised” trajectories. The number of query trajectories that can have predicted destinations is exponentially increased by this means. Experiments based on real datasets show that the SubSyn algorithm can predict destinations for up to ten times more query trajectories than a baseline algorithm while the SubSyn prediction algorithm runs over two orders of magnitude faster than the baseline algorithm. In this paper, we also consider the privacy protection issue in case an adversary uses the SubSyn algorithm to derive sensitive location information of users. We propose an efficient algorithm to select a minimum number of locations a user has to hide on her trajectory in order to avoid a privacy leak. Experiments also validate the high efficiency of the privacy protection algorithm.
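
【Sketch】: A strongly simplified stand-in for the idea (not the paper's Bayesian formulation; names and the step bound are mine): historical trajectories are decomposed into two-cell sub-trajectories to train a first-order transition model over grid cells, and candidate destinations are then ranked by how reachable they are from the query's current cell, weighted by a destination prior.

    from collections import defaultdict

    def train_transitions(trajectories):
        """Decompose trajectories (lists of grid cells) into neighbouring-cell pairs and
        estimate first-order transition probabilities."""
        counts = defaultdict(lambda: defaultdict(int))
        for traj in trajectories:
            for a, b in zip(traj, traj[1:]):
                counts[a][b] += 1
        return {a: {b: c / sum(nbrs.values()) for b, c in nbrs.items()}
                for a, nbrs in counts.items()}

    def reach_scores(trans, start, steps):
        """Accumulated probability mass of reaching each cell within `steps` transitions
        from `start` (a crude stand-in for travel probabilities between cells)."""
        dist, reach = {start: 1.0}, defaultdict(float)
        for _ in range(steps):
            nxt = defaultdict(float)
            for cell, p in dist.items():
                for b, q in trans.get(cell, {}).items():
                    nxt[b] += p * q
            for cell, p in nxt.items():
                reach[cell] += p
            dist = nxt
        return reach

    def predict_destination(trans, current_cell, priors, steps=10):
        """Rank candidate destinations by reachability from the current cell times a prior."""
        reach = reach_scores(trans, current_cell, steps)
        scores = {d: reach.get(d, 0.0) * priors.get(d, 1.0) for d in priors}
        return max(scores, key=scores.get) if scores else None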

【Keywords】: query processing; SubSyn prediction algorithm; baseline algorithm; data sparsity problem; destination prediction; emerging location based application; historical trajectory; privacy protection algorithm; query trajectory; subtrajectory synthesis algorithm; synthesised trajectory; Bayes methods; Computational modeling; Markov processes; Prediction algorithms; Privacy; Roads; Trajectory

25. Scalable and parallelizable processing of influence maximization for large-scale social networks?

【Paper Link】 【Pages】:266-277

【Authors】: Jinha Kim ; Seung-Keol Kim ; Hwanjo Yu

【Abstract】: As social network services connect people across the world, influence maximization, i.e., finding the most influential nodes (or individuals) in the network, is being actively researched with applications to viral marketing. One crucial challenge in scalable influence maximization processing is evaluating influence, which is #P-hard and thus hard to solve in polynomial time. We propose a scalable influence approximation algorithm, the Independent Path Algorithm (IPA), for the Independent Cascade (IC) diffusion model. IPA efficiently approximates influence by considering an independent influence path as an influence evaluation unit. IPA is also easily parallelized by simply adding a few lines of OpenMP meta-programming expressions. Also, the overhead of maintaining influence paths in memory is relieved by safely throwing away insignificant influence paths. Extensive experiments conducted on large-scale real social networks show that IPA is an order of magnitude faster and uses less memory than state-of-the-art algorithms. Our experimental results also show that parallel versions of IPA speed up further as the number of CPU cores increases, and more speed-up is achieved for larger datasets. The algorithms have been implemented in our demo application for influence maximization (available at http://dm.postech.ac.kr/ipa demo), which efficiently finds the most influential nodes in a social network.

【Keywords】: approximation theory; computational complexity; message passing; parallel processing; social networking (online); #P-hard; IPA; OpenMP meta-programming expression; independent cascade diffusion model; independent path algorithm; influence evaluation unit; influence maximization; parallelizable processing; polynomial time; scalable influence approximation algorithm; social network service; viral marketing; Approximation algorithms; Approximation methods; Greedy algorithms; Integrated circuit modeling; Mathematical model; Social network services; Influence maximization; parallel processing; social networks

26. SociaLite: Datalog extensions for efficient social network analysis.

【Paper Link】 【Pages】:278-289

【Authors】: Jiwon Seo ; Stephen Guo ; Monica S. Lam

【Abstract】: With the rise of social networks, large-scale graph analysis becomes increasingly important. Because SQL lacks the expressiveness and performance needed for graph algorithms, lower-level, general-purpose languages are often used instead. For greater ease of use and efficiency, we propose SociaLite, a high-level graph query language based on Datalog. As a logic programming language, Datalog allows many graph algorithms to be expressed succinctly. However, its performance has not been competitive when compared to low-level languages. With SociaLite, users can provide high-level hints on the data layout and evaluation order; they can also define recursive aggregate functions which, as long as they are meet operations, can be evaluated incrementally and efficiently. We evaluated SociaLite by running eight graph algorithms (shortest paths, PageRank, hubs and authorities, mutual neighbors, connected components, triangles, clustering coefficients, and betweenness centrality) on two real-life social graphs, Live-Journal and Last.fm. The optimizations proposed in this paper speed up almost all the algorithms by 3 to 22 times. SociaLite even outperforms typical Java implementations by an average of 50% for the graph algorithms tested. When compared to highly optimized Java implementations, SociaLite programs are an order of magnitude more succinct and easier to write. Its performance is competitive, giving up only 16% for the largest benchmark. Most importantly, being a query language, SociaLite enables many more users who are not proficient in software engineering to make social network queries easily and efficiently.
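
【Sketch】: Why a meet-operation aggregate admits incremental evaluation: for the recursive rule dist(t) = min over edges (s, t, w) of dist(s) + w, newly derived facts can only lower a distance, so a worklist evaluation never retracts results. In SociaLite this program is a few lines of Datalog with a recursive min aggregate; the Python below (essentially Dijkstra) only illustrates the evaluation strategy.

    import heapq

    def shortest_paths(edges, source):
        """edges: iterable of (s, t, w). Incrementally evaluates the min-aggregate rule."""
        adj = {}
        for s, t, w in edges:
            adj.setdefault(s, []).append((t, w))
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            d, s = heapq.heappop(heap)
            if d > dist.get(s, float("inf")):
                continue                       # stale entry; a better distance exists
            for t, w in adj.get(s, []):
                nd = d + w
                if nd < dist.get(t, float("inf")):
                    dist[t] = nd               # the aggregate only ever improves
                    heapq.heappush(heap, (nd, t))
        return dist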

【Keywords】: DATALOG; Java; graph theory; pattern clustering; recursive functions; social networking (online); Datalog extension; Java; PageRank; SociaLite; betweenness centrality; clustering coefficient; data layout; evaluation order; graph algorithm; high-level graph query language; hubs; large-scale graph analysis; logic programming language; lower-level general-purpose language; real-life social graph; recursive aggregate function; shortest path; social network analysis; software engineering; Aggregates; Algorithm design and analysis; Arrays; Java; Optimization; Semantics; Social network services

27. LinkProbe: Probabilistic inference on large-scale social networks.

【Paper Link】 【Pages】:290-301

【Authors】: Haiquan Chen ; Wei-Shinn Ku ; Haixun Wang ; Liang Tang ; Min-Te Sun

【Abstract】: As one of the most important Semantic Web applications, social network analysis has attracted more and more interest from researchers due to the rapidly increasing availability of massive social network data. A desired solution for social network analysis should address the following issues. First, in many real world applications, inference rules are partially correct. An ideal solution should be able to handle partially correct rules. Second, applications in practice often involve large amounts of data. The inference mechanism should scale up towards large-scale data. Third, inference methods should take into account probabilistic evidence data because these are domains abounding with uncertainty. Various solutions for social network analysis have existed for quite a few years; however, none of them support all the aforementioned features. In this paper, we design and implement LinkProbe, a prototype to quantitatively predict the existence of links among nodes in large-scale social networks, which are empowered by Markov Logic Networks (MLNs). MLN has been proved to be an effective inference model which can handle complex dependencies and partially correct rules. More importantly, although MLN has shown acceptable performance in prior works, it is also reported as impractical in handling large-scale data due to its highly demanding nature in terms of inference time and memory consumption. In order to overcome these limitations, LinkProbe retrieves the k-backbone graphs and conducts the MLN inference on both the most globally influencing nodes and most locally related nodes. Our extensive experiments show that LinkProbe manages to provide a tunable balance between MLN inference accuracy and inference efficiency.

【Keywords】: Markov processes; graph theory; inference mechanisms; semantic Web; social networking (online); LinkProbe; MLN; Markov logic networks; Semantic Web applications; inference mechanism; inference methods; k-backbone graphs; large scale data; large scale social networks; memory consumption; probabilistic inference; social network analysis; social network data; Equations; Markov random fields; Mathematical model; Monte Carlo methods; Probabilistic logic; Social network services

28. The Bw-Tree: A B-tree for new hardware platforms.

【Paper Link】 【Pages】:302-313

【Authors】: Justin J. Levandoski ; David B. Lomet ; Sudipta Sengupta

【Abstract】: The emergence of new hardware and platforms has led to reconsideration of how data management systems are designed. However, certain basic functions such as key indexed access to records remain essential. While we exploit the common architectural layering of prior systems, we make radically new design decisions about each layer. Our new form of B-tree, called the Bw-tree achieves its very high performance via a latch-free approach that effectively exploits the processor caches of modern multi-core chips. Our storage manager uses a unique form of log structuring that blurs the distinction between a page and a record store and works well with flash storage. This paper describes the architecture and algorithms for the Bw-tree, focusing on the main memory aspects. The paper includes results of our experiments that demonstrate that this fresh approach produces outstanding performance.
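
【Sketch】: A toy model of the latch-free update path (the lock below only emulates an atomic compare-and-swap; a real implementation uses a hardware CAS instruction on the mapping-table slot, and the class and field names are mine): updates never modify a page in place, they prepend a delta record and install it by swapping the mapping-table entry.

    import threading

    class MappingTable:
        """Pages are reached through logical page ids; each slot points to the head of a
        delta chain (or the base page)."""
        def __init__(self):
            self._slots = {}
            self._lock = threading.Lock()      # stands in for the CAS primitive only

        def read(self, pid):
            return self._slots.get(pid)

        def compare_and_swap(self, pid, expected, new):
            with self._lock:
                if self._slots.get(pid) is expected:
                    self._slots[pid] = new
                    return True
                return False

    def prepend_delta(table, pid, delta):
        """Retry loop: link a delta to the current head and install it atomically."""
        while True:
            head = table.read(pid)
            delta["next"] = head               # delta record points at the prior page state
            if table.compare_and_swap(pid, head, delta):
                return                         # installed without taking any page latch
            # another thread won the race; re-read the head and retry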

【Keywords】: computer architecture; storage management; tree data structures; B-tree; Bw-Tree; data management system; hardware platform; latch-free approach; multicore chips; Ash; Hardware; Indexes; Instruction sets; Latches; Vectors

29. Secure and efficient range queries on outsourced databases using R̂-trees.

【Paper Link】 【Pages】:314-325

【Authors】: Peng Wang ; Chinya V. Ravishankar

【Abstract】: We show how to execute range queries securely and efficiently on encrypted databases in the cloud. Current methods provide either security or efficiency, but not both. Many schemes even reveal the ordering of encrypted tuples, which, as we show, allows adversaries to estimate plaintext values accurately. We present the R̂-trees, a hierarchical encrypted index that may be securely placed in the cloud, and searched efficiently. It is based on a mechanism we design for encrypted halfspace range queries in ℝd, using Asymmetric Scalar-product Preserving Encryption. Data owners can tune the R̂-trees parameters to achieve desired security-efficiency tradeoffs. We also present extensive experiments to evaluate R̂-trees performance. Our results show that R̂-trees queries are efficient on encrypted databases, and reveal far less information than competing methods.

【Keywords】: cloud computing; cryptography; database management systems; query processing; trees (mathematics); R̂-trees parameters; asymmetric scalar-product preserving encryption; cloud computing; encrypted halfspace range queries; hierarchical encrypted index; outsourced databases; plaintext values; security-efficiency tradeoffs; Computational modeling; Encryption; Face; Indexes

30. An efficient and compact indexing scheme for large-scale data store.

【Paper Link】 【Pages】:326-337

【Authors】: Peng Lu ; Sai Wu ; Lidan Shou ; Kian-Lee Tan

【Abstract】: The amount of data managed in today's Cloud systems has reached an unprecedented scale. In order to speed up query processing, an effective mechanism is to build indexes on attributes that are used in query predicates. However, conventional indexing schemes fail to provide a scalable service: as the size of these indexes is proportional to the data size, it is not space efficient to build many indexes. As such, it becomes more crucial to develop effective indexes to provide scalable database services in the Cloud. In this paper, we propose a compact bitmap indexing scheme for a large-scale data store. The bitmap indexing scheme combines state-of-the-art bitmap compression techniques, such as WAH encoding and bit-sliced encoding. To further reduce the index cost, a novel and query efficient partial indexing technique is adopted, which dynamically refreshes the index to handle updates and process queries. The intuition of our indexing approach is to maximize the number of indexed attributes, so that a wider range of queries, including range and join queries, can be efficiently supported. Our indexing scheme is light-weight and its creation can be seamlessly grafted onto the MapReduce processing engine without incurring significant running cost. Moreover, the compactness allows us to maintain the bitmap indexes in memory so that the performance overhead of index access is minimal. We implement our indexing scheme on top of the underlying Distributed File System (DFS) and evaluate its performance on an in-house cluster. We compare our index-based query processing with HadoopDB to show its superior performance. Our experimental results confirm the effectiveness, efficiency and scalability of the indexing scheme.
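
【Sketch】: The gist of word-aligned bitmap compression (an illustrative encoding, not the exact WAH word layout): the bit vector is cut into fixed-width groups; runs of all-zero or all-one groups collapse into a single fill entry, and mixed groups stay literal.

    def wah_compress(bits, w=31):
        """bits: list of 0/1 values. Returns a list of ('fill', bit, run) and
        ('literal', group) entries, one per w-bit group or run of identical groups."""
        bits = bits + [0] * (-len(bits) % w)            # pad to a multiple of w
        groups = [tuple(bits[i:i + w]) for i in range(0, len(bits), w)]
        out = []
        for g in groups:
            if all(b == g[0] for b in g):               # homogeneous group -> fill
                if out and out[-1][0] == "fill" and out[-1][1] == g[0]:
                    out[-1] = ("fill", g[0], out[-1][2] + 1)   # extend the previous run
                else:
                    out.append(("fill", g[0], 1))
            else:
                out.append(("literal", g))
        return out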

【Keywords】: cloud computing; cost reduction; distributed databases; indexing; information management; pattern clustering; performance evaluation; query processing; DFS; HadoopDB; MapReduce processing engine; bitmap compression techniques; cloud systems; compact bitmap indexing scheme; data management; distributed file system; in-house cluster; index cost reduction; index-based query processing; large-scale data store; performance evaluation; query efficient partial indexing technique; query processing; scalable database services; Encoding; Engines; Indexing; Query processing; Sorting

31. Recycling in pipelined query evaluation.

【Paper Link】 【Pages】:338-349

【Authors】: Fabian Nagel ; Peter A. Boncz ; Stratis Viglas

【Abstract】: Database systems typically execute queries in isolation. Sharing recurring intermediate and final results between successive query invocations is ignored or only exploited by caching final query results. The DBA is kept in the loop to make explicit sharing decisions by identifying and/or defining materialized views. Thus, decisions are made only after a long time and sharing opportunities may be missed. Recycling intermediate results has been proposed as a method to make database query engines profit from opportunities to reuse fine-grained partial query results; it is fully autonomous and able to continuously adapt to changes in the workload. The technique was recently revisited in the context of MonetDB, a system that by default materializes all intermediate results. Materializing intermediate results can consume significant system resources; therefore, most other database systems avoid this where possible, following a pipelined query architecture instead. The novelty of this paper is to show how recycling can successfully be applied in pipelined query executors, by tracking the benefit of materializing possible intermediate results and then choosing the ones making best use of a limited intermediate result cache. We present ways to maximize the potential of recycling by leveraging subsumption and proactive query rewriting. We have implemented our approach in the Vectorwise database engine and have experimentally evaluated its potential using both synthetic and real-world datasets. Our results show that intermediate result recycling significantly improves performance.

【Keywords】: cache storage; database management systems; query processing; MonetDB; Vectorwise database engine; database query engine; database system; explicit sharing decision; final query result caching; fine-grained partial query result; opportunity sharing; pipelined query architecture; pipelined query evaluation; proactive query rewriting; query execution; query invocation; recycling; system resource; Computer architecture; Context; Engines; Measurement; Query processing; Recycling

32. Efficient many-core query execution in main memory column-stores.

Paper Link】 【Pages】:350-361

【Authors】: Jonathan Dees ; Peter Sanders

【Abstract】: We use the full query set of the TPC-H Benchmark as a case study for the efficient implementation of decision support queries on main memory column-store databases. Instead of splitting a query into separate independent operators, we consider the query as a whole and translate the execution plan into a single function performing the query. This allows highly efficient CPU utilization, minimal materialization, and execution in a single pass over the data for most queries. The single pass is performed in parallel and scales near-linearly with the number of cores. The resulting query plans for most of the 22 queries are remarkably simple and are suited for automatic generation and fast compilation. Using a data-parallel, NUMA-aware many-core implementation with block summaries, inverted index data structures, and efficient aggregation algorithms, we achieve one to two orders of magnitude better performance than the current record holders of the TPC-H Benchmark.

【Keywords】: data structures; decision support systems; multiprocessing systems; query processing; storage management; CPU utilization; NUMA-aware many-core implementation; TPC-H benchmark; aggregation algorithms; automatic generation; block summary; compilation; data-parallel; decision support query; execution plan; inverted index data structures; main memory column-store databases; many-core query execution; memory column-stores; minimal materialization; query plans; query set; separate independent operators; single pass; Bandwidth; Benchmark testing; Data structures; Indexes; Instruction sets; Sockets

33. Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware.

Paper Link】 【Pages】:362-373

【Authors】: Cagri Balkesen ; Jens Teubner ; Gustavo Alonso ; M. Tamer Özsu

【Abstract】: The architectural changes introduced with multi-core CPUs have triggered a redesign of main-memory join algorithms. In the last few years, two diverging views have appeared. One approach advocates careful tailoring of the algorithm to the architectural parameters (cache sizes, TLB, and memory bandwidth). The other approach argues that modern hardware is good enough at hiding cache and TLB miss latencies and, consequently, the careful tailoring can be omitted without sacrificing performance. In this paper we demonstrate through experimental analysis of different algorithms and architectures that hardware still matters. Join algorithms that are hardware conscious perform better than hardware-oblivious approaches. The analysis and comparisons in the paper show that many of the claims regarding the behavior of join algorithms that have appeared in the literature are due to selection effects (relative table sizes, tuple sizes, the underlying architecture, using sorted data, etc.) and are not supported by experiments run under different parameter settings. Through the analysis, we shed light on how modern hardware affects the implementation of data operators and provide the fastest implementation of radix join to date, reaching close to 200 million tuples per second.

【Keywords】: cache storage; memory architecture; multiprocessing systems; performance evaluation; TLB miss latencies; architectural parameters; cache hiding; cache sizes; data operators; experimental analysis; hardware-oblivious approaches; main-memory hash join algorithms; memory bandwidth; multicore CPUs; radix join implementation; table sizes; tuple sizes; Algorithm design and analysis; Hardware; Instruction sets; Latches; Partitioning algorithms; Probes; Tuning
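
Hardware-conscious joins of the kind the authors defend revolve around radix partitioning, so that each per-partition hash table stays cache-resident. A minimal single-pass sketch, ignoring the TLB-aware multi-pass partitioning, NUMA placement, and threading that the real algorithm tunes, might look like this:

```python
from collections import defaultdict

def radix_hash_join(R, S, key=lambda t: t[0], radix_bits=4):
    """Toy single-pass radix-partitioned hash join (illustration only).

    Both inputs are split into 2**radix_bits partitions on the low bits
    of the join key's hash so that each per-partition build table is
    small; the probe then runs partition by partition.
    """
    fanout = 1 << radix_bits
    mask = fanout - 1
    parts_r, parts_s = defaultdict(list), defaultdict(list)
    for r in R:
        parts_r[hash(key(r)) & mask].append(r)
    for s in S:
        parts_s[hash(key(s)) & mask].append(s)

    out = []
    for p in range(fanout):
        table = defaultdict(list)           # build on the R partition
        for r in parts_r.get(p, []):
            table[key(r)].append(r)
        for s in parts_s.get(p, []):        # probe with the S partition
            for r in table.get(key(s), []):
                out.append((r, s))
    return out

# Example: radix_hash_join([(1, 'a'), (2, 'b')], [(2, 'x'), (3, 'y')])
# returns [((2, 'b'), (2, 'x'))].
```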

34. Coupled clustering ensemble: Incorporating coupling relationships both between base clusterings and objects.

Paper Link】 【Pages】:374-385

【Authors】: Can Wang ; Zhong She ; Longbing Cao

【Abstract】: Clustering ensemble is a powerful approach for improving the accuracy and stability of individual (base) clustering algorithms. Most of the existing clustering ensemble methods obtain the final solutions by assuming that base clusterings perform independently of one another and that all objects are independent as well. However, in real-world data sources, objects are more or less associated in terms of certain coupling relationships. Base clusterings trained on the source data are complementary to one another, since each of them may capture only a specific rather than the full picture of the data. In this paper, we discuss the problem of explicating the dependency between base clusterings and between objects in clustering ensembles, and propose a framework for coupled clustering ensembles (CCE). CCE not only considers but also integrates the coupling relationships between base clusterings and between objects. Specifically, we involve both the intra-coupling within one base clustering (i.e., cluster label frequency distribution) and the inter-coupling between different base clusterings (i.e., cluster label co-occurrence dependency). Furthermore, we engage both the intra-coupling between two objects in terms of the base clustering aggregation and the inter-coupling among other objects in terms of neighborhood relationship. This is the first work that explicitly addresses the dependency between base clusterings and between objects, verified by the application of such couplings in three types of consensus functions: clustering-based, object-based and cluster-based. Substantial experiments on synthetic and UCI data sets demonstrate that the CCE framework can effectively capture the interactions embedded in base clusterings and objects with higher clustering accuracy and stability compared to several state-of-the-art techniques, which is also supported by statistical analysis.

【Keywords】: pattern clustering; statistical analysis; CCE; base clusterings; cluster-based consensus functions; clustering-based consensus functions; coupled clustering ensembles; coupling relationships; neighborhood relationship; object-based consensus functions; statistical analysis; Accuracy; Clustering algorithms; Couplings; Equations; Mathematical model; Rocks; Stability analysis

35. Focused matrix factorization for audience selection in display advertising.

Paper Link】 【Pages】:386-397

【Authors】: Bhargav Kanagal ; Amr Ahmed ; Sandeep Pandey ; Vanja Josifovski ; Lluis Garcia Pueyo ; Jeffrey Yuan

【Abstract】: Audience selection is a key problem in display advertising systems, in which we need to select a list of users who are interested (i.e., most likely to buy) in an advertising campaign. The users' past feedback on this campaign can be leveraged to construct such a list using collaborative filtering techniques such as matrix factorization. However, the user-campaign interaction is typically extremely sparse, hence conventional matrix factorization does not perform well. Moreover, simply combining the users' feedback from all campaigns does not address this issue, since it dilutes the focus on the target campaign under consideration. To resolve these issues, we propose a novel focused matrix factorization model (FMF) which learns users' preferences towards the specific campaign products, while also exploiting the information about related products. We exploit the product taxonomy to discover related campaigns, and design models to discriminate between the users' interest towards campaign products and non-campaign products. We develop a parallel multi-core implementation of the FMF model and evaluate its performance over a real-world advertising dataset spanning more than a million products. Our experiments demonstrate the benefits of using our models over existing approaches.

【Keywords】: advertising; collaborative filtering; matrix decomposition; FMF; advertising campaign; audience selection; campaign products; collaborative filtering techniques; display advertising systems; focused matrix factorization model; parallel multicore implementation; product taxonomy; real-world advertising dataset; user-campaign interaction; users feedback; Advertising; Collaboration; Computational modeling; Mathematical model; Sparse matrices; Taxonomy; Vectors

36. Graph stream classification using labeled and unlabeled graphs.

Paper Link】 【Pages】:398-409

【Authors】: Shirui Pan ; Xingquan Zhu ; Chengqi Zhang ; Philip S. Yu

【Abstract】: Graph classification is becoming increasingly popular due to the rapidly rising applications involving data with structural dependency. The wide spread of graph applications and the inherently complex relationships between graph objects have made the labels of graph data expensive and/or difficult to obtain, especially for applications involving dynamically changing graph records. While labeled graphs are limited, copious amounts of unlabeled graphs are often easy to obtain with trivial effort. In this paper, we propose a framework to build a stream based graph classification model by combining both labeled and unlabeled graphs. Our method, called gSLU, employs an ensemble based framework to partition graph streams into a number of graph chunks each containing some labeled and unlabeled graphs. For each individual chunk, we propose a minimum-redundancy subgraph feature selection module to select a set of informative subgraph features to build a classifier. To tackle the concept drifting in graph streams, an instance level weighting mechanism is used to dynamically adjust the instance weight, through which the subgraph feature selection can emphasize difficult graph samples. The classifiers built from different graph chunks form an ensemble for graph stream classification. Experiments on real-world graph streams demonstrate clear benefits of using minimum-redundancy subgraph features to build accurate classifiers. By employing instance level weighting, our graph ensemble model can effectively adapt to the concept drifting in the graph stream for classification.

【Keywords】: graph theory; pattern classification; redundancy; complex graph object relationship; dynamic changing graph record; ensemble based framework; gSLU; graph chunk; graph stream classification; informative subgraph feature; instance level weighting mechanism; labeled graph; partition graph stream; redundancy subgraph feature selection module; structural dependency; unlabeled graph; Accuracy; Communities; Correlation; Feature extraction; Learning systems; Redundancy; Vectors

37. T-share: A large-scale dynamic taxi ridesharing service.

Paper Link】 【Pages】:410-421

【Authors】: Shuo Ma ; Yu Zheng ; Ouri Wolfson

【Abstract】: Taxi ridesharing can be of significant social and environmental benefit, e.g., by saving energy and satisfying people's commute needs. Despite the great potential, taxi ridesharing, especially with dynamic queries, is not well studied. In this paper, we formally define the dynamic ridesharing problem and propose a large-scale taxi ridesharing service. It efficiently serves real-time requests sent by taxi users and generates ridesharing schedules that reduce the total travel distance significantly. In our method, we first propose a taxi searching algorithm using a spatio-temporal index to quickly retrieve candidate taxis that are likely to satisfy a user query. A scheduling algorithm is then proposed. It checks each candidate taxi and inserts the query's trip into the schedule of the taxi which satisfies the query with the minimum additional incurred travel distance. To tackle the heavy computational load, a lazy shortest path calculation strategy is devised to speed up the scheduling algorithm. We evaluated our service using a GPS trajectory dataset generated by over 33,000 taxis during a period of 3 months. By learning the spatio-temporal distributions of real user queries from this dataset, we built an experimental platform that simulates users' real behaviour in taking a taxi. Tested on this platform with extensive experiments, our approach demonstrated its efficiency, effectiveness, and scalability. For example, our proposed service serves 25% additional taxi users while saving 13% travel distance compared with no-ridesharing (when the ratio of the number of queries to that of taxis is 6).

【Keywords】: Global Positioning System; graph theory; mobile computing; road traffic; traffic engineering computing; GPS trajectory dataset; T-share; energy consumption; large-scale dynamic taxi ridesharing service; lazy shortest path calculation; scheduling algorithm; spatio-temporal distribution; spatio-temporal index; taxi searching algorithm; travel distance; Dynamic scheduling; Energy consumption; Heuristic algorithms; Indexes; Real-time systems; Roads; Schedules
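
The scheduling step described above is, at its core, a cheapest-insertion search over a taxi's stop sequence. The following hedged sketch shows only that idea; capacity and time-window checks are omitted, and `dist` is an assumed shortest-path oracle, not the paper's lazy shortest-path machinery.

```python
def best_insertion(schedule, pickup, dropoff, dist):
    """Cheapest insertion of a new trip into one taxi's stop sequence
    (illustration only). `schedule` is the taxi's current list of stops,
    starting at its current position; `dist(a, b)` returns the network
    distance between two stops. Returns (additional distance, new schedule).
    """
    def length(stops):
        return sum(dist(stops[i], stops[i + 1]) for i in range(len(stops) - 1))

    base = length(schedule)
    best_extra, best_plan = float("inf"), None
    for i in range(1, len(schedule) + 1):              # pickup position
        with_pickup = schedule[:i] + [pickup] + schedule[i:]
        for j in range(i + 1, len(with_pickup) + 1):   # dropoff after pickup
            cand = with_pickup[:j] + [dropoff] + with_pickup[j:]
            extra = length(cand) - base
            if extra < best_extra:
                best_extra, best_plan = extra, cand
    return best_extra, best_plan
```

In a full system this routine would be run only against the candidate taxis returned by the spatio-temporal index, and the taxi with the smallest additional distance wins the query.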

38. Efficient notification of meeting points for moving groups via independent safe regions.

Paper Link】 【Pages】:422-433

【Authors】: Jing Li ; Man Lung Yiu ; Nikos Mamoulis

【Abstract】: In applications like social networking services and online games, multiple moving users form a group and wish to be continuously notified with the best meeting point from their locations. To reduce the communication frequency of the application server, a promising technique is to apply safe regions, which capture the validity of query results with respect to the users' locations. Unfortunately, the safe regions in our problem exhibit characteristics such as irregular shapes and dependency among multiple safe regions. These unique characteristics render existing safe region methods that focus on a single safe region inapplicable to our problem. To tackle these challenges, we first examine the shapes of safe regions in our problem context and propose feasible approximations for them. We design efficient algorithms for computing these safe regions, as well as develop compression techniques for representing safe regions in a compact manner. Experiments with both real and synthetic data demonstrate the efficiency of our proposal in terms of computation and communication costs.

【Keywords】: computer games; data compression; mobile computing; query processing; social networking (online); application server; communication frequency reduction; compression technique; independent safe region; irregular shape; meeting point notification; moving group; moving query processing; online game; safe region representation; social networking service; user location; Computer architecture; Games; Optimization; Roads; Servers; Shape; Social network services

39. Efficient distance-aware query evaluation on indoor moving objects.

Paper Link】 【Pages】:434-445

【Authors】: Xike Xie ; Hua Lu ; Torben Bach Pedersen

【Abstract】: Indoor spaces accommodate large parts of people's lives. The increasing availability of indoor positioning, driven by technologies like Wi-Fi, RFID, and Bluetooth, enables a variety of indoor location-based services (LBSs). Efficient indoor distance-aware queries on indoor moving objects play an important role in supporting and boosting such LBSs. However, distance-aware query evaluation on indoor moving objects is challenging because: (1) indoor spaces are characterized by many special entities and thus render distance calculation very complex; (2) the limitations of indoor positioning technologies create inherent uncertainties in indoor moving objects data. In this paper, we propose a complete set of techniques for efficient distance-aware queries on indoor moving objects. We define and categorize the indoor distances in relation to indoor uncertain objects, and derive different distance bounds that can facilitate query evaluation. Existing works often assume indoor floor plans are static, and require extensive pre-computation on indoor topologies. In contrast, we design a composite index scheme that integrates indoor geometries, indoor topologies, as well as indoor uncertain objects, and thus supports indoor distance-aware queries efficiently without time-consuming and volatile distance computation. We design algorithms for range query and k nearest neighbor query on indoor moving objects. The results of extensive experimental studies demonstrate that our proposals are efficient and scalable in evaluating distance-aware queries over indoor moving objects.

【Keywords】: Bluetooth; computational geometry; mobile computing; query processing; radiofrequency identification; wireless LAN; Bluetooth; LBS; RFID; Wi-Fi; distance-aware query evaluation; indoor geometries; indoor location-based services; indoor moving objects; indoor positioning; indoor spaces; indoor topologies; indoor topology precomputation; k nearest neighbor query; range query; Equations; Indexes; Probabilistic logic; Proposals; Query processing; Topology; Uncertainty

40. HANDS: A heuristically arranged non-backup in-line deduplication system.

Paper Link】 【Pages】:446-457

【Authors】: Avani Wildani ; Ethan L. Miller ; Ohad Rodeh

【Abstract】: Deduplicating in-line data on primary storage is hampered by the disk bottleneck problem, an issue which results from the need to keep an index mapping portions of data to hash values in memory in order to detect duplicate data without paying the performance penalty of disk paging. The index size is proportional to the volume of unique data, so placing the entire index into RAM is not cost effective with a deduplication ratio below 45%. HANDS reduces the amount of in-memory index storage required by up to 99% while still achieving between 30% and 90% of the deduplication a full memory-resident index provides, making primary deduplication cost effective in workloads with deduplication rates as low as 8%. HANDS is a framework that dynamically pre-fetches fingerprints from disk into memory cache according to working sets statistically derived from access patterns. We use a simple neighborhood grouping as our statistical technique to demonstrate the effectiveness of our approach. HANDS is modular and requires only spatio-temporal data, making it suitable for a wide range of storage systems without the need to modify host file systems.

【Keywords】: cache storage; paged storage; random-access storage; statistical analysis; storage management; HANDS; RAM; deduplication cost; deduplication ratio; disk bottleneck problem; disk paging; duplicate data detection; fingerprint prefetching; full memory-resident index; in-memory index storage; index mapping portion; memory cache; neighborhood grouping; nonbackup in-line deduplication system; performance penalty; primary storage; spatio-temporal data; statistical technique; storage system; Indexes; Measurement; Memory management; Organizations; Prediction algorithms; Random access memory; Scalability
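
The disk bottleneck the abstract describes comes from fingerprint lookups; keeping only a bounded working set of fingerprints in memory is the key idea. Below is a hedged sketch in which a plain LRU cache stands in for the paper's statistically derived working sets, and a dictionary stands in for the full on-disk index.

```python
import hashlib
from collections import OrderedDict

class DedupCache:
    """Toy in-line deduplication with a partial in-memory fingerprint
    index (illustration only; not the HANDS prefetching policy)."""

    def __init__(self, cache_slots):
        self.cache = OrderedDict()   # fingerprint -> block location (in RAM)
        self.on_disk_index = {}      # full fingerprint index (would live on disk)
        self.cache_slots = cache_slots

    def write_block(self, data, location):
        """data: bytes chunk; returns (stored_location, was_duplicate)."""
        fp = hashlib.sha1(data).hexdigest()
        if fp in self.cache:                       # cheap in-memory hit
            self.cache.move_to_end(fp)
            return self.cache[fp], True
        if fp in self.on_disk_index:               # expensive disk lookup
            self._admit(fp, self.on_disk_index[fp])
            return self.on_disk_index[fp], True
        self.on_disk_index[fp] = location          # new unique block
        self._admit(fp, location)
        return location, False

    def _admit(self, fp, location):
        self.cache[fp] = location
        self.cache.move_to_end(fp)
        if len(self.cache) > self.cache_slots:
            self.cache.popitem(last=False)          # evict the LRU fingerprint
```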

41. Holistic data cleaning: Putting violations into context.

Paper Link】 【Pages】:458-469

【Authors】: Xu Chu ; Ihab F. Ilyas ; Paolo Papotti

【Abstract】: Data cleaning is an important problem and data quality rules are the most promising way to tackle it with a declarative approach. Previous work has focused on specific formalisms, such as functional dependencies (FDs), conditional functional dependencies (CFDs), and matching dependencies (MDs), and those have always been studied in isolation. Moreover, such techniques are usually applied in a pipeline or interleaved. In this work we tackle the problem in a novel, unified framework. First, we let users specify quality rules using denial constraints with ad-hoc predicates. This language subsumes existing formalisms and can express rules involving numerical values, with predicates such as “greater than” and “less than”. More importantly, we exploit the interaction of the heterogeneous constraints by encoding them in a conflict hypergraph. Such a holistic view of the conflicts is the starting point for a novel definition of repair context, which allows us to automatically compute repairs of better quality than previous approaches in the literature. Experimental results on real datasets show that the holistic approach outperforms previous algorithms in terms of quality and efficiency of the repair.

【Keywords】: constraint handling; data handling; graph theory; pattern matching; CFD; MD; ad-hoc predicate; conditional functional dependencies; conflict hypergraph; data quality rule; data repair; declarative approach; denial constraint; heterogeneous constraint; holistic data cleaning; matching dependencies; numerical value; Cities and towns; Cleaning; Context; Databases; Maintenance engineering; Proposals; Remuneration

42. Inferring data currency and consistency for conflict resolution.

Paper Link】 【Pages】:470-481

【Authors】: Wenfei Fan ; Floris Geerts ; Nan Tang ; Wenyuan Yu

【Abstract】: This paper introduces a new approach for conflict resolution: given a set of tuples pertaining to the same entity, it is to identify a single tuple in which each attribute has the latest and consistent value in the set. This problem is important in data integration, data cleaning and query answering. It is, however, challenging since in practice, reliable timestamps are often absent, among other things. We propose a model for conflict resolution, by specifying data currency in terms of partial currency orders and currency constraints, and by enforcing data consistency with constant conditional functional dependencies. We show that identifying data currency orders helps us repair inconsistent data, and vice versa. We investigate a number of fundamental problems associated with conflict resolution, and establish their complexity. In addition, we introduce a framework and develop algorithms for conflict resolution, by integrating data currency and consistency inferences into a single process, and by interacting with users. We experimentally verify the accuracy and efficiency of our methods using real-life and synthetic data.

【Keywords】: data integration; query processing; Inferring data consistency; conditional functional dependencies; conflict resolution; consistent value; data cleaning; data integration; inferring data currency; integrating data currency; partial currency constraints; partial currency orders; query answering; real-life data; reliable timestamps; synthetic data; Cities and towns; Cognition; Complexity theory; Data models; Databases; Integrated circuits; Semantics

43. LSII: An indexing structure for exact real-time search on microblogs.

Paper Link】 【Pages】:482-493

【Authors】: Lingkun Wu ; Wenqing Lin ; Xiaokui Xiao ; Yabo Xu

【Abstract】: Indexing microblogs for real-time search is challenging given the efficiency issue caused by the tremendous speed at which new microblogs are created by users. Existing approaches address this efficiency issue at the cost of query accuracy, as they either (i) exclude a significant portion of microblogs from the index to reduce update cost or (ii) rank microblogs mostly by their timestamps (without sufficient consideration of their relevance to the queries) to enable append-only index insertion. As a consequence, the search results returned by the existing approaches do not satisfy the users who demand timely and high-quality search results. To remedy this deficiency, we propose the Log-Structured Inverted Indices (LSII), a structure for exact real-time search on microblogs. The core of LSII is a sequence of inverted indices with exponentially increasing sizes, such that new microblogs are (i) first inserted into the smallest index and (ii) later moved into the larger indices in a batch manner. The batch insertion mechanism leads to a small amortized update cost for each new microblog, without significantly degrading query performance. We present a comprehensive study on LSII, exploring various design options to strike a good balance between query and update performance. In addition, we propose extensions of LSII to support personalized search and to exploit multi-threading for performance improvement. Extensive experiments on real data demonstrate the efficiency of LSII.

【Keywords】: Web sites; indexing; LSII; append-only index insertion; batch insertion mechanism; exact real-time search; high-quality search results; indexing microblogs; indexing structure; log-structured inverted indices; microblog ranking; query performance; Corporate acquisitions; Indexing; Query processing; Real-time systems; Servers; Vectors

44. Utilizing users' tipping points in E-commerce Recommender systems.

Paper Link】 【Pages】:494-504

【Authors】: Kailun Hu ; Wynne Hsu ; Mong-Li Lee

【Abstract】: Existing recommendation algorithms assume that users make their purchase decisions solely based on individual preferences, without regard to the type of user or the product's maturity stage. Yet, extensive studies have shown that there are two types of users: innovators and imitators. Innovators tend to make purchase decisions based solely on their own preferences; whereas imitators' purchase decisions are often influenced by a product's stage of maturity. In this paper, we propose a framework that seamlessly incorporates the type of user and product maturity into existing recommendation algorithms. We apply the Bass model to classify each user as either an innovator or imitator according to his/her previous purchase behavior. In addition, we introduce the concept of the tipping point of a user. This tipping point refers to the point on the product maturity curve beyond which the user is likely to be more receptive to purchasing the product. We refine two widely-adopted recommendation algorithms to incorporate the effect of product maturity in relation to the user type. Experiment results on a real-world dataset obtained from an E-commerce website show that the proposed approach outperforms existing algorithms.

【Keywords】: Web sites; electronic commerce; purchasing; recommender systems; bass model; e-commerce website; imitators; innovators; product maturity curve; purchase decisions; real-world dataset; recommender systems; user tipping points; user type; Arrays; Collaboration; Computational modeling; Educational institutions; History; Predictive models; Recommender systems
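
The Bass model the authors apply has a standard closed form. The sketch below shows the cumulative-adoption curve and the time of peak adoption; reading that peak as a product-maturity "tipping point" is an assumption made here for illustration, not the paper's exact definition.

```python
import math

def bass_cumulative(t, p, q):
    """Bass diffusion curve F(t): fraction of the eventual market that has
    adopted by time t, with innovation coefficient p and imitation
    coefficient q (standard closed form, shown only as an illustration
    of the user-type idea in the abstract)."""
    e = math.exp(-(p + q) * t)
    return (1.0 - e) / (1.0 + (q / p) * e)

def peak_adoption_time(p, q):
    """Time at which the adoption rate peaks (requires q > p); treating
    this as a product-maturity 'tipping point' is an assumption of this
    sketch, not the paper's definition."""
    return math.log(q / p) / (p + q)

# Example: with p = 0.03 and q = 0.38, adoptions peak near t ≈ 6.2 periods,
# where bass_cumulative(6.2, 0.03, 0.38) ≈ 0.46 of the eventual market.
```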

45. Presenting diverse location views with real-time near-duplicate photo elimination.

Paper Link】 【Pages】:505-516

【Authors】: Jiajun Liu ; Zi Huang ; Hong Cheng ; Yueguo Chen ; Heng Tao Shen ; Yanchun Zhang

【Abstract】: Supported by the technical advances and the commercial success of GPS-enabled mobile devices, geo-tagged photos have drawn plenteous attention in the research community. The explosive growth of geo-tagged photos enables many large-scale applications, such as location-based photo browsing, landmark recognition, etc. Meanwhile, as the number of geo-tagged photos continues to climb, new challenges are brought to various applications. The existence of massive near-duplicate geo-tagged photos jeopardizes the effective presentation for the above applications. A new dimension in the search and presentation of geo-tagged photos is urgently demanded. In this paper, we devise a location visualization framework to efficiently retrieve and present diverse views captured within a local proximity. Novel photos, in terms of capture locations and visual content, are identified and returned in response to a query location for diverse visualization. For real-time response and good scalability, a new Hybrid Index structure which integrates R-tree and Geographic Grid is proposed to quickly identify the Maximal Near-duplicate Photo Groups (MNPG) in the query proximity. The most novel photos from different groups are then returned to generate diverse views on the location. Extensive experiments on synthetic and real-life photo datasets prove the novelty and efficiency of our methods.

【Keywords】: Global Positioning System; data visualisation; geographic information systems; mobile handsets; object recognition; query processing; realistic images; trees (mathematics); GPS-enabled mobile devices; MNPG; R-tree; diverse location views; diverse visualization; geographic grid; hybrid index structure; landmark recognition; location visualization framework; location-based photo browsing; maximal near-duplicate photo groups; near-duplicate geo-tagged photos; query location; query proximity; real-life photo datasets; real-time near-duplicate photo elimination; real-time response; research community; synthetic photo datasets; Communities; Data visualization; Educational institutions; Indexing; Multimedia communication; Visualization

46. Publicly verifiable grouped aggregation queries on outsourced data streams.

Paper Link】 【Pages】:517-528

【Authors】: Suman Nath ; Ramarathnam Venkatesan

【Abstract】: Outsourcing data streams and desired computations to a third party such as the cloud is a desirable option for many companies. However, data outsourcing and remote computations intrinsically raise issues of trust, making it crucial to verify results returned by third parties. In this context, we propose a novel solution to verify outsourced grouped aggregation queries (e.g., histogram or SQL Group-by queries) that are common in many business applications. We consider a setting where a data owner employs an untrusted remote server to run continuous grouped aggregation queries on a data stream it forwards to the server. Untrusted clients then query the server for results and efficiently verify correctness of the results by using a small and easy-to-compute signature provided by the data owner. Our work complements previous works on authenticating remote computation of selection and aggregation queries. The most important aspect of our solution is that it is publicly verifiable - unlike most prior works, we support untrusted clients (who can collude with other clients or with the server). Experimental results on real and synthetic data show that our solution is practical and efficient.

【Keywords】: business data processing; digital signatures; outsourcing; query processing; trusted computing; business applications; continuous grouped aggregation queries; data outsourcing; easy-to-compute signature; outsourced data streams; publicly verifiable grouped aggregation queries; remote computations; third party; untrusted clients; untrusted remote server; Aggregates; Cryptography; Histograms; Outsourcing; Protocols; Servers; Vectors

47. Trustworthy data from untrusted databases.

Paper Link】 【Pages】:529-540

【Authors】: Rohit Jain ; Sunil Prabhakar

【Abstract】: Ensuring the trustworthiness of data retrieved from a database is of utmost importance to users. The correctness of data stored in a database is defined by the faithful execution of only valid (authorized) transactions. In this paper we address the question of whether it is necessary to trust a database server in order to trust the data retrieved from it. The lack of trust arises naturally if the database server is owned by a third party, as in the case of cloud computing. It also arises if the server may have been compromised, or there is a malicious insider. In particular, we reduce the level of trust necessary in order to establish the authenticity and integrity of data at an untrusted server. Earlier work on this problem is limited to situations where there are no updates to the database, or all updates are authorized and vetted by a central trusted entity. This is an unreasonable assumption for a truly dynamic database, as would be expected in many business applications, where multiple clients can update data without having to check with a central server that approves of their changes. We identify the problem of ensuring trustworthiness of data at an untrusted server in the presence of transactional updates that run directly on the database, and develop the first solutions to this problem. Our solutions also provide indemnity for an honest server and assured provenance for all updates to the data. We implement our solution in a prototype system built on top of Oracle with no modifications to the database internals. We also provide an empirical evaluation of the proposed solutions and establish their feasibility.

【Keywords】: authorisation; business data processing; cloud computing; data integrity; database management systems; file servers; information retrieval; trusted computing; business applications; central server; cloud computing; data authenticity; data integrity; data retrieved trustworthiness; data storage; database server; malicious insider; prototype system; transactional updates; truly dynamic database; trust level; trustworthy data; unreasonable assumption; untrusted databases; untrusted server; Cloud computing; Databases; Digital signatures; Hardware; Protocols; Servers

48. On the relative trust between inconsistent data and inaccurate constraints.

Paper Link】 【Pages】:541-552

【Authors】: George Beskales ; Ihab F. Ilyas ; Lukasz Golab ; Artur Galiullin

【Abstract】: Functional dependencies (FDs) specify the intended data semantics while violations of FDs indicate deviation from these semantics. In this paper, we study a data cleaning problem in which the FDs may not be completely correct, e.g., due to data evolution or incomplete knowledge of the data semantics. We argue that the notion of relative trust is a crucial aspect of this problem: if the FDs are outdated, we should modify them to fit the data, but if we suspect that there are problems with the data, we should modify the data to fit the FDs. In practice, it is usually unclear how much to trust the data versus the FDs. To address this problem, we propose an algorithm for generating non-redundant solutions (i.e., simultaneous modifications of the data and the FDs) corresponding to various levels of relative trust. This can help users determine the best way to modify their data and/or FDs to achieve consistency.

【Keywords】: security of data; trusted computing; FD violations; data cleaning problem; data evolution; data semantics; functional dependencies; inaccurate constraints; inconsistent data; nonredundant solutions; relative trust; Approximation algorithms; Cleaning; Maintenance engineering; Measurement; Optimized production technology; Semantics; Vectors

49. Catch the Wind: Graph workload balancing on cloud.

Paper Link】 【Pages】:553-564

【Authors】: Zechao Shang ; Jeffrey Xu Yu

【Abstract】: Graph partitioning is a key issue in graph database processing systems for achieving high efficiency on the Cloud. However, balanced graph partitioning itself is difficult because it is known to be NP-complete. In addition, a static graph partitioning cannot keep all graph algorithms efficient in parallel on the Cloud for a long time, because the workload balance in different iterations of different graph algorithms can all be different. In this paper, we investigate graph behaviors by exploring the changes of the working window (we call it wind), where a working window is a set of active vertices that a graph algorithm really needs to access in parallel computing. We investigated nine classic graph algorithms using real datasets, and propose simple yet effective policies that can achieve both high graph workload balance and efficient partitioning on the Cloud.

【Keywords】: cloud computing; optimisation; NP-complete; balanced graph partitioning; cloud; graph algorithm; graph behaviors; graph database processing system; high graph workload balancing; parallel computing; static graph partitioning; Algorithm design and analysis; Computational modeling; Google; Minimization; Nickel; Partitioning algorithms; Synchronization

50. EAGRE: Towards scalable I/O efficient SPARQL query evaluation on the cloud.

Paper Link】 【Pages】:565-576

【Authors】: Xiaofei Zhang ; Lei Chen ; Yongxin Tong ; Min Wang

【Abstract】: To benefit from the Cloud platform's unlimited resources, managing and evaluating huge volumes of RDF data in a scalable manner has attracted intensive research efforts recently. Progress has been made on evaluating SPARQL queries with either high-level declarative programming languages, like Pig [1], or a sequence of sophisticatedly designed MapReduce jobs, both of which tend to answer the query with multiple join operations. However, due to the simplicity of Cloud storage and the coarse organization of RDF data in existing solutions, multiple join operations easily bring significant I/O and network traffic which can severely degrade the system performance. In this work, we first propose EAGRE, an Entity-Aware Graph compREssion technique to form a new representation of RDF data on Cloud platforms, based on which we propose an I/O-efficient strategy to evaluate SPARQL queries as quickly as possible, especially queries with specified solution sequence modifiers, e.g., PROJECTION, ORDER BY, etc. We implement a prototype system and conduct extensive experiments over both real and synthetic datasets on an in-house cluster. The experimental results show that our solution can achieve over an order of magnitude of time saving for SPARQL query evaluation compared to the state-of-the-art MapReduce-based solutions.

【Keywords】: cloud computing; data compression; data handling; query languages; query processing; EAGRE; Entity-Aware Graph compREssion technique; MapReduce job; MapReduce-based solution; ORDER BY; PROJECTION; Pig; RDF data representation; cloud computing; cloud platform; cloud storage; high-level declarative programming language; in-house cluster; multiple join operation; network traffic; query answering; scalable I/O efficient SPARQL query evaluation; scalable RDF data management; solution sequence modifier; system performance degradation; Data models; Layout; Nickel; Processor scheduling; Query processing; Resource description framework; Scheduling

51. C-Cube: Elastic continuous clustering in the cloud.

Paper Link】 【Pages】:577-588

【Authors】: Zhenjie Zhang ; Hu Shu ; Zhihong Chong ; Hua Lu ; Yin Yang

【Abstract】: Continuous clustering analysis over a data stream reports clustering results incrementally as updates arrive. Such analysis has a wide spectrum of applications, including traffic monitoring and topic discovery on microblogs. A common characteristic of streaming applications is that the amount of workload fluctuates, often in an unpredictable manner. On the other hand, most existing solutions for continuous clustering assume either a central server, or a distributed setting with a fixed number of dedicated servers. In other words, they are not ELASTIC, meaning that they cannot dynamically adapt the amount of computational resources to the fluctuating workload. Consequently, they incur considerable waste of resources, as the servers are under-utilized when the amount of workload is low. This paper proposes C-Cube, the first elastic approach to continuous streaming clustering. Similar to popular cloud-based paradigms such as MapReduce, C-Cube routes each new record to a processing unit, e.g., a virtual machine, based on its hash value. Each processing unit performs the required computations, and sends its results to a lightweight aggregator. This design enables dynamically adding/removing processing units, as well as replacing faulty ones and re-running their tasks. In addition to elasticity, C-Cube is also effective (in that it provides quality guarantees on the clustering results), efficient (it minimizes the computational workload at all times), and generally applicable to a large class of clustering criteria. We implemented C-Cube in a real system based on Twitter Storm, and evaluated it using real and synthetic datasets. Extensive experimental results confirm our performance claims.

【Keywords】: cloud computing; file organisation; pattern clustering; social networking (online); virtual machines; C-Cube; MapReduce; Twitter Storm; cloud computing; continuous clustering analysis; continuous streaming clustering; data stream; dedicated servers; elastic continuous clustering; hash value; lightweight aggregator; microblogs; virtual machine; Algorithm design and analysis; Approximation algorithms; Approximation methods; Clustering algorithms; Elasticity; Mathematical model; Measurement

52. Efficient search algorithm for SimRank.

Paper Link】 【Pages】:589-600

【Authors】: Yasuhiro Fujiwara ; Makoto Nakatsuji ; Hiroaki Shiokawa ; Makoto Onizuka

【Abstract】: Graphs are a fundamental data structure and have been employed to model objects as well as their relationships. The similarity of objects on the web (e.g., webpages, photos, music, micro-blogs, and social networking service users) is the key to identifying relevant objects in many recent applications. SimRank, proposed by Jeh and Widom, provides a good similarity score and has been successfully used in many applications such as web spam detection, collaborative tagging analysis, link prediction, and so on. SimRank computes similarities iteratively, and it needs O(N^4·T) time and O(N^2) space for similarity computation, where N and T are the number of nodes and iterations, respectively. Unfortunately, this iterative approach is computationally expensive. The goal of this work is to process top-k search and range search efficiently for a given node. Our solution, SimMat, is based on two ideas: (1) It computes the approximate similarity of a selected node pair efficiently in non-iterative style based on the Sylvester equation, and (2) It prunes unnecessary approximate similarity computations when searching for the high similarity nodes by exploiting estimations based on the Cauchy-Schwarz inequality. These two ideas reduce the time and space complexities of the proposed approach to O(N·n), where n is the target rank of the low-rank approximation (n ≪ N in practice). Our experiments show that our approach is much faster, by several orders of magnitude, than previous approaches in finding the high similarity nodes.

【Keywords】: approximation theory; computational complexity; graph theory; search problems; Cauchy-Schwarz inequality; SimMat; SimRank; Sylvester equation; Web spam detection; Webpage; collaborative tagging analysis; data structure; graph; link prediction; low-rank approximation; microblog; music; noniterative style; object similarity; photo; range search; search algorithm; similarity computation; similarity node; similarity score; social networking service user; space complexity; time complexity; top-k search; Approximation methods; Eigenvalues and eigenfunctions; Equations; Iterative methods; Mathematical model; Social network services; Vectors
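
The O(N^4·T)-time baseline the abstract starts from is the classical iterative SimRank, which in matrix form repeats S ← C·W^T S W while pinning the diagonal to 1. A small numpy sketch of that baseline (not the proposed SimMat method) is shown below.

```python
import numpy as np

def simrank(A, C=0.6, iters=10):
    """Naive iterative SimRank in matrix form (illustration of the
    expensive baseline only).

    A is the n x n adjacency matrix with A[i, j] = 1 for an edge i -> j;
    W column-normalizes A, and each iteration computes C * W^T S W with
    the diagonal pinned back to 1, since s(v, v) = 1 by definition.
    """
    n = A.shape[0]
    col_sums = A.sum(axis=0).astype(float)
    W = A / np.where(col_sums == 0, 1.0, col_sums)   # column-normalized
    S = np.eye(n)
    for _ in range(iters):
        S = C * (W.T @ S @ W)
        np.fill_diagonal(S, 1.0)
    return S
```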

53. Towards efficient SimRank computation on large networks.

Paper Link】 【Pages】:601-612

【Authors】: Weiren Yu ; Xuemin Lin ; Wenjie Zhang

【Abstract】: SimRank has been a powerful model for assessing the similarity of pairs of vertices in a graph. It is based on the concept that two vertices are similar if they are referenced by similar vertices. Due to its self-referentiality, fast SimRank computation on large graphs poses significant challenges. The state-of-the-art work [17] exploits partial sums memorization for computing SimRank in O(Kmn) time on a graph with n vertices and m edges, where K is the number of iterations. Partial sums memorizing can reduce repeated calculations by caching part of similarity summations for later reuse. However, we observe that computations among different partial sums may have duplicate redundancy. Besides, for a desired accuracy ϵ, the existing SimRank model requires K = ⌈log_C ϵ⌉ iterations [17], where C is a damping factor. Nevertheless, such a geometric rate of convergence is slow in practice if a high accuracy is desirable. In this paper, we address these gaps. (1) We propose an adaptive clustering strategy to eliminate partial sums redundancy (i.e., duplicate computations occurring in partial sums), and devise an efficient algorithm for speeding up the computation of SimRank to O(Kd'n^2) time, where d' is typically much smaller than the average in-degree of a graph. (2) We also present a new notion of SimRank that is based on a differential equation and can be represented as an exponential sum of transition matrices, as opposed to the geometric sum of the conventional counterpart. This leads to a further speedup in the convergence rate of SimRank iterations. (3) Using real and synthetic data, we empirically verify that our approach of partial sums sharing outperforms the best known algorithm by up to one order of magnitude, and that our revised notion of SimRank further achieves a 5X speedup on large graphs while also fairly preserving the relative order of original SimRank scores.

【Keywords】: computational complexity; graph theory; iterative methods; matrix algebra; pattern clustering; SimRank computation; SimRank iteration; adaptive clustering strategy; differential equation; duplicate redundancy; geometric rate; graph; partial sums memorization; partial sums redundancy; similarity assessment; similarity summation caching; transition matrix; Accuracy; Clustering algorithms; Computational modeling; Convergence; Damping; Optimization; Redundancy

54. RoundTripRank: Graph-based proximity with importance and specificity?

Paper Link】 【Pages】:613-624

【Authors】: Yuan Fang ; Kevin Chen-Chuan Chang ; Hady Wirawan Lauw

【Abstract】: Graph-based proximity has many applications with different ranking needs. However, most previous works only stress the sense of importance by finding “popular” results for a query. Oftentimes, important results are overly general without being well-tailored to the query, lacking a sense of specificity - which has only emerged recently. Even then, the two senses are treated independently, and only combined empirically. In this paper, we generalize the well-studied importance-based random walk into a round trip and develop RoundTripRank, seamlessly integrating specificity and importance in one coherent process. We also recognize the need for a flexible trade-off between the two senses, and further develop RoundTripRank+ based on a scheme of hybrid random surfers. For efficient computation, we start with a basic model that decomposes RoundTripRank into smaller units. For each unit, we apply a novel two-stage bounds updating framework, enabling an online top-K algorithm, 2SBound. Finally, our experiments show that RoundTripRank and RoundTripRank+ are robust over various ranking tasks, and that 2SBound enables scalable online processing.

【Keywords】: graph theory; network theory (graphs); query processing; 2SBound; RoundTripRank; coherent process; graph-based proximity; hybrid random surfers; importance-based random walk; online top-K algorithm; query processing; scalable online processing; two-stage bounds; Computational modeling; Databases; Educational institutions; Joining processes; MIMICs; Teleportation; Uniform resource locators

55. Finding distance-preserving subgraphs in large road networks.

Paper Link】 【Pages】:625-636

【Authors】: Da Yan ; James Cheng ; Wilfred Ng ; Steven Liu

【Abstract】: Given two sets of points, S and T, in a road network, G, a distance-preserving subgraph (DPS) query returns a subgraph of G that preserves the shortest path from any point in S to any point in T. DPS queries are important in many real-world applications, such as route recommendation systems, logistics planning, and all kinds of shortest-path-related applications that run on resource-limited mobile devices. In this paper, we study efficient algorithms for processing DPS queries in large road networks. Four algorithms are proposed with different tradeoffs in terms of DPS quality and query processing time, and the best one is a graph-partitioning based index, called RoadPart, that finds a high quality DPS with short response time. Extensive experiments on large road networks demonstrate the merits of our algorithms, and verify the efficiency of RoadPart for finding a high-quality DPS.

【Keywords】: graph theory; mobile computing; query processing; recommender systems; traffic information systems; DPS quality; DPS query processing; RoadPart; distance-preserving subgraph finding; graph-partitioning based index; large road network; logistics planning; query processing time; resource-limited mobile device; route recommendation system; shortest path preservation; shortest-path-related application; Bridges; Indexes; Labeling; Mobile handsets; Query processing; Roads; Vectors
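
A natural, if expensive, baseline for a distance-preserving subgraph is the union of one shortest path per source-target pair, obtained by running Dijkstra from every source. The sketch below shows that baseline under an assumed adjacency-list input; it is not the RoadPart index the paper proposes.

```python
import heapq

def dps_by_union(graph, S, T):
    """Brute-force distance-preserving subgraph: for each source in S run
    Dijkstra, then keep the edges of one shortest path to every target in
    T; the union of these edges preserves all S-to-T shortest distances.
    `graph` maps node -> [(neighbor, edge_length)]. Illustration only."""
    edges = set()
    for s in S:
        dist, parent = {s: 0.0}, {}
        heap = [(0.0, s)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue                       # stale queue entry
            for v, w in graph.get(u, []):
                nd = d + w
                if nd < dist.get(v, float("inf")):
                    dist[v], parent[v] = nd, u
                    heapq.heappush(heap, (nd, v))
        for t in T:                            # walk parent pointers back to s
            node = t
            while node in parent:
                edges.add((parent[node], node))
                node = parent[node]
    return edges
```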

56. Maximum visibility queries in spatial databases.

Paper Link】 【Pages】:637-648

【Authors】: Sarah Masud ; Farhana Murtaza Choudhury ; Mohammed Eunus Ali ; Sarana Nutanong

【Abstract】: Many real-world problems, such as placement of surveillance cameras and pricing of hotel rooms with a view, require the ability to determine the visibility of a given target object from different locations. Advances in large-scale 3D modeling (e.g., 3D virtual cities) provide us with data that can be used to solve these problems with high accuracy. In this paper, we investigate the problem of finding the location which provides the best view of a target object with visual obstacles in 2D or 3D space, for example, finding the location that provides the best view of fireworks in a city with tall buildings. To solve this problem, we first define the quality measure of a view (i.e., visibility measure) as the visible angular size of the target object. Then, we propose a new query type called the k-Maximum Visibility (kMV) query, which finds k locations from a set of locations that maximize the visibility of the target object. Our objective in this paper is to design a query solution which is capable of handling large-scale city models. This objective precludes the use of approaches that rely on constructing a visibility graph of the entire data space. As a result, we propose three approaches that incrementally consider relevant obstacles in order to determine the visibility of a target object from a given set of locations. These approaches differ in the order of obstacle retrieval, namely: query centric distance based, query centric visible region based, and target centric distance based approaches. We have conducted an extensive experimental study on real 2D and 3D datasets to demonstrate the efficiency and effectiveness of our solutions.

【Keywords】: graph theory; image retrieval; solid modelling; visual databases; k-maximum visibility query; kMV query type; large-scale 3D modeling; maximum visibility queries; obstacle retrieval; quality measure; query centric distance based approach; query centric visible region based approach; spatial databases; target centric distance based approach; target object; visibility graph; visible angular size; visual obstacles; Cameras; Cities and towns; Computational geometry; Face; Measurement; Spatial databases; Three-dimensional displays

57. Memory-efficient algorithms for spatial network queries.

Paper Link】 【Pages】:649-660

【Authors】: Sarana Nutanong ; Hanan Samet

【Abstract】: Incrementally finding the k nearest neighbors (kNN) in a spatial network is an important problem in location-based services. One method (INE) simply applies Dijkstra's algorithm. Another method (IER) computes the k nearest neighbors using Euclidean distance followed by computing their corresponding network distances, and then incrementally finds the next nearest neighbors in order of increasing Euclidean distance until finding one whose Euclidean distance is greater than the current k nearest neighbor in terms of network distance. The LBC method improves on INE by avoiding the visit of nodes that cannot possibly lead to the k nearest neighbors by using a Euclidean heuristic estimator, and on IER by avoiding the repeated visits to nodes in the spatial network that appear on the shortest paths to different members of the k nearest neighbors by performing multiple instances of heuristic search using a Euclidean heuristic estimator on candidate objects around the query point. LBC's drawback is that the maintenance of multiple instances of heuristic search (called wavefronts) requires k priority queues, and the queue operations required to maintain them incur a high in-memory processing cost. A method (SWH) is proposed that utilizes a novel heuristic function which considers objects surrounding the query point together as a single unit, instead of as one destination at a time as in LBC, thereby eliminating the need for multiple wavefronts and needing just one priority queue. This results in a significant reduction in the in-memory processing cost components while having the same reduced cost of access to the spatial network as LBC. SWH is also extended to support the incremental distance semi-join (IDSJ) query, which is a multiple query point generalization of the kNN query. In addition, SWH is shown to support landmark-based heuristic functions, thereby enabling it to be applied to non-spatial networks/graphs such as social networks. Comparisons of experiments on SWH for kNN queries with INE, the best single-wavefront method, show that SWH is 2.5 times faster, and with LBC, the best existing heuristic search method, show that SWH is 3.5 times faster. For IDSJ queries, SWH-IDSJ is 5 times faster than INE-IDSJ, and 4 times faster than LBC-IDSJ.

【Keywords】: directed graphs; mobile computing; network theory (graphs); query processing; queueing theory; search problems; Dijkstra algorithm; Euclidean distance; Euclidean heuristic estimation; IDSJ query; IER; INE; LBC method; SWH; heuristic search method; in-memory processing cost; incremental distance semi-join; k nearest neighbor; kNN query; landmark-based heuristic function; location-based service; memory efficient algorithm; multiple query point generalization; network distance; nonspatial graph; nonspatial network; priority queue; queue operation; spatial network query; Algorithm design and analysis; Artificial neural networks; Euclidean distance; Heuristic algorithms; Memory management; Reliability; Search problems
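
INE, the Dijkstra-based baseline mentioned above, fits in a few lines: expand the network from the query point and report objects in the order their distances become final. The sketch assumes an adjacency-list graph and a set of object-holding nodes; it is not the proposed SWH method.

```python
import heapq

def ine_knn(graph, objects, source, k):
    """Incremental network expansion (INE) for kNN, illustration only.

    `graph` maps node -> [(neighbor, edge_length)]; `objects` is a set of
    nodes holding a point of interest. Runs Dijkstra from `source` and
    reports objects in the order their network distance is settled,
    stopping after k of them."""
    heap = [(0.0, source)]
    best = {source: 0.0}
    settled = set()
    results = []
    while heap and len(results) < k:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue
        settled.add(u)
        if u in objects:
            results.append((u, d))        # the next nearest neighbor
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < best.get(v, float("inf")):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return results
```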

58. A unified model for stable and temporal topic detection from social media data.

Paper Link】 【Pages】:661-672

【Authors】: Hongzhi Yin ; Bin Cui ; Hua Lu ; Yuxin Huang ; Junjie Yao

【Abstract】: Web 2.0 users generate and spread huge amounts of messages in online social media. Such user-generated contents are a mixture of temporal topics (e.g., breaking events) and stable topics (e.g., user interests). Due to their different natures, it is important and useful to distinguish temporal topics from stable topics in social media. However, such a discrimination is very challenging because the user-generated texts in social media are very short in length and thus lack useful linguistic features for precise analysis using traditional approaches. In this paper, we propose a novel solution to detect both stable and temporal topics simultaneously from social media data. Specifically, a unified user-temporal mixture model is proposed to distinguish temporal topics from stable topics. To improve this model's performance, we design a regularization framework that exploits prior spatial information in a social network, as well as a burst-weighted smoothing scheme that exploits temporal prior information in the time dimension. We conduct extensive experiments to evaluate our proposal on two real data sets obtained from Del.icio.us and Twitter. The experimental results verify that our mixture model is able to distinguish temporal topics from stable topics in a single detection process. Our mixture model enhanced with the spatial regularization and the burst-weighted smoothing scheme significantly outperforms competitor approaches, in terms of topic detection accuracy and discrimination in stable and temporal topics.

【Keywords】: Internet; information retrieval; linguistics; social networking (online); text analysis; Del.icio.us; Twitter; UGC; Web 2.0; burst-weighted smoothing scheme; linguistic features; online social media; spatial regularization; stable topic detection; temporal topic detection; user-generated contents; user-temporal mixture model; Equations; Feature extraction; Hidden Markov models; Mathematical model; Media; Twitter

59. Crowdsourced enumeration queries.

Paper Link】 【Pages】:673-684

【Authors】: Beth Trushkowsky ; Tim Kraska ; Michael J. Franklin ; Purnamrita Sarkar

【Abstract】: Hybrid human/computer database systems promise to greatly expand the usefulness of query processing by incorporating the crowd for data gathering and other tasks. Such systems raise many implementation questions. Perhaps the most fundamental issue is that the closed-world assumption underlying relational query semantics does not hold in such systems. As a consequence, the meaning of even simple queries can be called into question. Furthermore, query progress monitoring becomes difficult due to non-uniformities in the arrival of crowdsourced data and peculiarities of how people work in crowdsourcing systems. To address these issues, we develop statistical tools that enable users and systems developers to reason about query completeness. These tools can also help drive query execution and crowdsourcing strategies. We evaluate our techniques using experiments on a popular crowdsourcing platform.

【Keywords】: data handling; query processing; relational databases; statistical analysis; crowdsourced enumeration query; crowdsourcing platform; crowdsourcing strategy; data gathering; hybrid human-computer database system; query completeness; query processing; query progress monitoring; relational query semantics; statistical tool; Computers; Crowdsourcing; Estimation; Query processing; Sociology
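
One standard statistical tool for reasoning about the completeness of an open-ended answer set is a species-richness estimator such as Chao1, shown below as a hedged illustration; the paper's own estimators account for crowd-specific non-uniformities that this toy version ignores.

```python
from collections import Counter

def chao1_estimate(answers):
    """Chao1 species-richness estimate of the total number of distinct
    answers, given a multiset of crowdsourced answers (a generic
    illustration of completeness reasoning, not necessarily the paper's
    exact estimator)."""
    freq = Counter(Counter(answers).values())   # frequency of frequencies
    s_obs = sum(freq.values())                  # distinct answers observed
    f1, f2 = freq.get(1, 0), freq.get(2, 0)     # singletons and doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2.0 * f2)
    return s_obs + f1 * (f1 - 1) / 2.0          # bias-corrected when f2 = 0

# Example: chao1_estimate(["NY", "LA", "NY", "SF", "LA", "Austin"])
# observes 4 distinct answers with f1 = 2, f2 = 2, and estimates 4 + 1 = 5.
```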

60. On incentive-based tagging.

Paper Link】 【Pages】:685-696

【Authors】: Xuan S. Yang ; Reynold Cheng ; Luyi Mo ; Ben Kao ; David W. Cheung

【Abstract】: A social tagging system, such as del.icio.us and Flickr, allows users to annotate resources (e.g., web pages and photos) with text descriptions called tags. Tags have proven to be invaluable information for searching, mining, and recommending resources. In practice, however, not all resources receive the same attention from users. As a result, while some highly-popular resources are over-tagged, most of the resources are under-tagged. Incomplete tagging on resources severely affects the effectiveness of all tag-based techniques and applications. We address an interesting question: if users are paid to tag specific resources, how can we allocate incentives to resources in a crowd-sourcing environment so as to maximize the tagging quality of resources? We address this question by observing that the tagging quality of a resource becomes stable after it has been tagged a sufficient number of times. We formalize the concepts of tagging quality (TQ) and tagging stability (TS) in measuring the quality of a resource's tag description. We propose a theoretically optimal algorithm given a fixed “budget” (i.e., the amount of money paid for tagging resources). This solution decides the amount of rewards that should be invested on each resource in order to maximize tagging stability. We further propose a few simple, practical, and efficient incentive allocation strategies. On a dataset from del.icio.us, our best strategy provides resources with a close-to-optimal gain in tagging stability.

【Keywords】: information retrieval; resource allocation; social networking (online); text analysis; close-to-optimal gain; crowd-sourcing environment; del.icio.us dataset; highly-popular resources; incentive allocation strategies; incentive-based tagging; incomplete tagging; over-tagged resources; resource annotation; resource tagging quality maximization; social tagging system; tagging stability maximization; text descriptions; under-tagged resources; Earth; Google; Radio spectrum management; Resource management; Stability analysis; Tagging; Uniform resource locators
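
Entry 60's budget-allocation question can be sketched with a generic greedy scheme that always spends the next unit of reward on the resource with the largest marginal gain in tagging stability. The concave gain function and the resource names below are placeholders; the paper's provably optimal allocation and its practical strategies are not reproduced here.

```python
import heapq
import math

def greedy_allocate(resources, budget, gain=lambda k: math.log1p(k)):
    """Greedily spend unit rewards on resources, always picking the one
    with the largest marginal gain in (assumed concave) tagging stability.

    `gain(k)` models stability after k paid tags; log1p is a placeholder.
    """
    alloc = {r: 0 for r in resources}
    # max-heap of (negative marginal gain, resource)
    heap = [(-(gain(1) - gain(0)), r) for r in resources]
    heapq.heapify(heap)
    for _ in range(budget):
        neg_delta, r = heapq.heappop(heap)
        alloc[r] += 1
        k = alloc[r]
        heapq.heappush(heap, (-(gain(k + 1) - gain(k)), r))
    return alloc

print(greedy_allocate(["page1", "page2", "photo3"], budget=5))
```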

61. Ontology-based subgraph querying.

Paper Link】 【Pages】:697-708

【Authors】: Yinghui Wu ; Shengqi Yang ; Xifeng Yan

【Abstract】: Subgraph querying has been applied in a variety of emerging applications. Traditional subgraph querying based on subgraph isomorphism requires identical label matching, which is often too restrictive to capture the matches that are semantically close to the query graphs. This paper extends subgraph querying to identify semantically related matches by leveraging ontology information. (1) We introduce the ontology-based subgraph querying, which revises subgraph isomorphism by mapping a query to semantically related subgraphs in terms of a given ontology graph. We introduce a metric to measure the similarity of the matches. Based on the metric, we introduce an optimization problem to find top K best matches. (2) We provide a filtering-and-verification framework to identify (top-K) matches for ontology-based subgraph queries. The framework efficiently extracts a small subgraph of the data graph from an ontology index, and further computes the matches by only accessing the extracted subgraph. (3) In addition, we show that the ontology index can be efficiently updated upon the changes to the data graphs, enabling the framework to cope with dynamic data graphs. (4) We experimentally verify the effectiveness and efficiency of our framework using both synthetic and real life graphs, comparing with traditional subgraph querying methods.

【Keywords】: formal verification; graph theory; ontologies (artificial intelligence); optimisation; pattern matching; query processing; dynamic data graphs; filtering-and-verification framework; identical label matching; match similarity; ontology index; ontology information; ontology-based subgraph querying; optimization problem; real life graphs; subgraph isomorphism-based subgraph querying; synthetic graphs; Data mining; Indexing; Measurement; Ontologies; Query processing; Semantics

62. Stratification driven placement of complex data: A framework for distributed data analytics.

Paper Link】 【Pages】:709-720

【Authors】: Ye Wang ; Srinivasan Parthasarathy ; P. Sadayappan

【Abstract】: With the increasing popularity of XML data stores, social networks and Web 2.0 and 3.0 applications, complex data formats, such as trees and graphs, are becoming ubiquitous. Managing and processing such large and complex data stores on modern computational ecosystems to realize actionable information efficiently is an important challenge. A critical element at the heart of this challenge relates to the placement, storage and access of such tera- and peta-scale data. In this work we develop a novel distributed framework to ease the burden on the programmer and propose an agile and intelligent placement service layer as a flexible yet unified means to address this challenge. Central to our framework is the notion of stratification which seeks to initially group structurally (or semantically) similar entities into strata. Subsequently, strata are partitioned within this ecosystem according to the needs of the application to maximize locality, balance load, or minimize data skew. Results on several real-world applications validate the efficacy and efficiency of our approach.

【Keywords】: Internet; XML; data analysis; ecology; resource allocation; social networking (online); Web 2.0 applications; Web 3.0 applications; XML data stores; balance load; complex data formats; complex data storage; computational ecosystems; critical element; data skew minimization; distributed data analytics; distributed framework; intelligent placement service layer; locality maximization; petascale data; real-world applications; social networks; stratification driven placement; terascale data; Clustering algorithms; Communities; Distributed databases; Social network services; Sorting; XML

63. Optimizing approximations of DNF query lineage in probabilistic XML.

Paper Link】 【Pages】:721-732

【Authors】: Asma Souihli ; Pierre Senellart

【Abstract】: Probabilistic XML is a probabilistic model for uncertain tree-structured data, with applications to data integration, information extraction, or uncertain version control. We explore in this work efficient algorithms for evaluating tree-pattern queries with joins over probabilistic XML or, more specifically, for listing the answers to a query along with their computed or approximated probability. The approach relies on, first, producing the lineage of the query by evaluating it over the probabilistic XML document, and, second, looking for an optimal strategy to compute the probability of the lineage formula. This latter part relies on a query-optimizer-like approach: exploring different evaluation plans for different parts of the formula and estimating the cost of each plan, using a cost model for the various evaluation algorithms. We demonstrate the efficiency of this approach on datasets used in previous research on probabilistic XML querying, as well as on synthetic data. We also compare the performance of our query engine with EvalDP [1], Trio [2], and MayBMS/SPROUT [3].

【Keywords】: XML; information retrieval; tree data structures; DNF query lineage; EvalDP; MayBMS/SPROUT; Trio; approximation optimization; cost model; data integration; information extraction; lineage query; optimal strategy; probabilistic XML document; probabilistic XML querying; probabilistic model; query engine; query-optimizer; tree-pattern queries; uncertain tree-structured data; uncertain version control; Additives; Approximation algorithms; Approximation methods; Computational modeling; Data models; Probabilistic logic; XML
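
For intuition on what "computing the probability of the lineage formula" involves, here is a naive Monte Carlo baseline for the probability of a DNF over independent events. It is deliberately simple; the paper studies exact and approximate evaluation algorithms chosen by a cost-based optimizer, which this sketch does not attempt to model.

```python
import random

def mc_dnf_probability(clauses, probs, trials=100_000, seed=0):
    """Naive Monte Carlo estimate of Pr[clause1 OR clause2 OR ...],
    where each clause is a set of independent event ids (a conjunction)
    and probs[e] is the probability that event e is true.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        world = {e: rng.random() < p for e, p in probs.items()}
        if any(all(world[e] for e in clause) for clause in clauses):
            hits += 1
    return hits / trials

# Lineage (e1 AND e2) OR e3 with independent event probabilities.
probs = {"e1": 0.9, "e2": 0.5, "e3": 0.2}
print(mc_dnf_probability([{"e1", "e2"}, {"e3"}], probs))
```

For this example lineage, the exact probability is 1 - (1 - 0.45)(1 - 0.2) = 0.56, which the estimate should approach.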

64. Secure nearest neighbor revisited.

Paper Link】 【Pages】:733-744

【Authors】: Bin Yao ; Feifei Li ; Xiaokui Xiao

【Abstract】: In this paper, we investigate the secure nearest neighbor (SNN) problem, in which a client issues an encrypted query point E(q) to a cloud service provider and asks for an encrypted data point in E(D) (the encrypted database) that is closest to the query point, without allowing the server to learn the plaintexts of the data or the query (and its result). We show that efficient attacks exist for existing SNN methods [21], [15], even though they were claimed to be secure in standard security models (such as indistinguishability under chosen plaintext or ciphertext attacks). We also establish a relationship between the SNN problem and the order-preserving encryption (OPE) problem from the cryptography field [6], [5], and we show that SNN is at least as hard as OPE. Since it is impossible to construct secure OPE schemes in standard security models [6], [5], our results imply that one cannot expect to find the exact (encrypted) nearest neighbor based on only E(q) and E(D). Given this hardness result, we design new SNN methods by asking the server, given only E(q) and E(D), to return a relevant (encrypted) partition E(G) from E(D) (i.e., G ⊆ D), such that E(G) is guaranteed to contain the answer for the SNN query. Our methods provide a customizable tradeoff between efficiency and communication cost, and they are as secure as the encryption scheme E used to encrypt the query and the database, where E can be any well-established encryption scheme.

【Keywords】: client-server systems; cloud computing; cryptography; learning (artificial intelligence); query processing; OPE problem; SNN problem; SNN query answer; ciphertext attack; cloud service provider; communication cost; cryptography; data plaintext learning; encrypted data point; encrypted database; encrypted nearest neighbor; encrypted query point; hardness result; indistinguishability; order-preserving encryption problem; plaintext attack; query encryption; secure nearest neighbor; standard security model; Databases; Encryption; Equations; Servers; Standards

65. Accurate and efficient private release of datacubes and contingency tables.

Paper Link】 【Pages】:745-756

【Authors】: Grigory Yaroslavtsev ; Graham Cormode ; Cecilia M. Procopiuc ; Divesh Srivastava

【Abstract】: A central problem in releasing aggregate information about sensitive data is to do so accurately while providing a privacy guarantee on the output. Recent work focuses on the class of linear queries, which include basic counting queries, data cubes, and contingency tables. The goal is to maximize the utility of their output, while giving a rigorous privacy guarantee. Most results follow a common template: pick a “strategy” set of linear queries to apply to the data, then use the noisy answers to these queries to reconstruct the queries of interest. This entails either picking a strategy set that is hoped to be good for the queries, or performing a costly search over the space of all possible strategies. In this paper, we propose a new approach that balances accuracy and efficiency: we show how to improve the accuracy of a given query set by answering some strategy queries more accurately than others. This leads to an efficient optimal noise allocation for many popular strategies, including wavelets, hierarchies, Fourier coefficients and more. For the important case of marginal queries we show that this strictly improves on previous methods, both analytically and empirically. Our results also extend to ensuring that the returned query answers are consistent with an (unknown) data set at minimal extra cost in terms of time and noise.

【Keywords】: data privacy; query processing; tree data structures; basic counting queries; contingency table; datacubes; linear queries; optimal noise allocation; private release; query answer; Data privacy; Databases; Noise; Noise measurement; Optimization; Privacy; Vectors
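
The idea in entry 65 of answering some strategy queries more accurately than others can be grounded in the standard Laplace mechanism: a query answered with a larger share of the privacy budget receives less noise. The sketch below shows only that basic mechanism with an arbitrary 80/20 budget split; it is not the paper's optimal noise-allocation algorithm.

```python
import numpy as np

def laplace_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """epsilon-differentially private release of a count via the Laplace
    mechanism: noise scale = sensitivity / epsilon."""
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(42)
total_eps = 1.0
# Spend the budget unevenly: the "important" strategy query gets 0.8*eps
# and is therefore answered more accurately than the other one.
print(laplace_count(1023, epsilon=0.8 * total_eps, rng=rng))
print(laplace_count(517,  epsilon=0.2 * total_eps, rng=rng))
```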

66. Differentially private grids for geospatial data.

Paper Link】 【Pages】:757-768

【Authors】: Wahbeh H. Qardaji ; Weining Yang ; Ninghui Li

【Abstract】: In this paper, we tackle the problem of constructing a differentially private synopsis for two-dimensional datasets such as geospatial datasets. The current state-of-the-art methods work by performing recursive binary partitioning of the data domains, and constructing a hierarchy of partitions. We show that the key challenge in partition-based synopsis methods lies in choosing the right partition granularity to balance the noise error and the non-uniformity error. We study the uniform-grid approach, which applies an equi-width grid of a certain size over the data domain and then issues independent count queries on the grid cells. This method has received no attention in the literature, probably due to the fact that no good method for choosing a grid size was known. Based on an analysis of the two kinds of errors, we propose a method for choosing the grid size. Experimental results validate our method, and show that this approach performs as well as, and often times better than, the state-of-the-art methods. We further introduce a novel adaptive-grid method. The adaptive grid method lays a coarse-grained grid over the dataset, and then further partitions each cell according to its noisy count. Both levels of partitions are then used in answering queries over the dataset. This method exploits the need to have finer granularity partitioning over dense regions and, at the same time, coarse partitioning over sparse regions. Through extensive experiments on real-world datasets, we show that this approach consistently and significantly outperforms the uniform-grid method and other state-of-the-art methods.

【Keywords】: Global Positioning System; data privacy; grid computing; query processing; adaptive-grid method; coarse partitioning; coarse-grained grid; data domains; differentially private grids; differentially private synopsis; equiwidth grid; fine granularity partitioning; geospatial datasets; grid cells; noise error; nonuniformity error; partition-based synopsis methods; query answering; recursive binary partitioning; sparse regions; state-of-the-art methods; two-dimensional datasets; uniform-grid approach; uniform-grid method; Data privacy; Guidelines; Noise; Noise measurement; Privacy; Standards; Wavelet transforms
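
A minimal version of the uniform-grid method discussed in entry 66: bucket 2D points into an equi-width grid, add Laplace noise to every cell count, and answer range queries from the noisy cells. The grid size m is left as a free parameter here; the paper's contribution of choosing m by balancing noise and non-uniformity error (and the adaptive second level) is not reproduced.

```python
import numpy as np

def private_grid(points, m, epsilon, domain=(0.0, 1.0), rng=None):
    """Uniform-grid synopsis: bucket 2D points into an m-by-m equi-width
    grid over domain x domain and add Laplace(1/epsilon) noise to every
    cell count (count queries have sensitivity 1)."""
    rng = rng or np.random.default_rng()
    lo, hi = domain
    hist, xedges, yedges = np.histogram2d(
        points[:, 0], points[:, 1], bins=m, range=[[lo, hi], [lo, hi]]
    )
    noisy = hist + rng.laplace(scale=1.0 / epsilon, size=hist.shape)
    return noisy, xedges, yedges

def range_count(noisy, xedges, yedges, x0, x1, y0, y1):
    """Approximate count of points in [x0,x1] x [y0,y1]: sum the noisy
    cells whose centers fall inside the query rectangle."""
    xc = (xedges[:-1] + xedges[1:]) / 2
    yc = (yedges[:-1] + yedges[1:]) / 2
    mask = np.outer((xc >= x0) & (xc <= x1), (yc >= y0) & (yc <= y1))
    return noisy[mask].sum()

rng = np.random.default_rng(7)
pts = rng.random((10_000, 2))
noisy, xe, ye = private_grid(pts, m=20, epsilon=1.0, rng=rng)
print(range_count(noisy, xe, ye, 0.2, 0.6, 0.1, 0.5))  # true count is ~1600
```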

67. Faster random walks by rewiring online social networks on-the-fly.

Paper Link】 【Pages】:769-780

【Authors】: Zhuojie Zhou ; Nan Zhang ; Zhiguo Gong ; Gautam Das

【Abstract】: Many online social networks feature restrictive web interfaces which only allow the query of a user's local neighborhood through the interface. To enable analytics over such an online social network through its restrictive web interface, many recent efforts reuse the existing Markov Chain Monte Carlo methods such as random walks to sample the social network and support analytics based on the samples. The problem with such an approach, however, is the large amount of queries often required (i.e., a long “mixing time”) for a random walk to reach a desired (stationary) sampling distribution. In this paper, we consider a novel problem of enabling a faster random walk over online social networks by “rewiring” the social network on-the-fly. Specifically, we develop Modified TOpology (MTO)-Sampler which, by using only information exposed by the restrictive web interface, constructs a “virtual” overlay topology of the social network while performing a random walk, and ensures that the random walk follows the modified overlay topology rather than the original one. We show that MTO-Sampler not only provably enhances the efficiency of sampling, but also achieves significant savings on query cost over real-world online social networks such as Google Plus, Epinion etc.

【Keywords】: Markov processes; Monte Carlo methods; query processing; social networking (online); Epinion; Google Plus; MTO sampler; Markov chain Monte Carlo method; Web interface; modified topology sampler; online social network; overlay topology; query cost savings; random walk; sampling distribution; support analytics; Aggregates; Educational institutions; Estimation; Knowledge engineering; Network topology; Social network services; Topology
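
As background for entry 67, the sketch below shows the baseline the paper improves on: a plain random walk over a neighbor-only interface, with the usual degree re-weighting because the walk's stationary distribution is proportional to degree. The toy graph and helper names are illustrative; the MTO-Sampler's on-the-fly rewiring is not implemented here.

```python
import random

def random_walk_sample(neighbors, start, steps, seed=0):
    """Baseline random walk over a graph exposed only through a
    neighbors(node) interface. The stationary distribution is proportional
    to node degree, so estimates of node averages must be re-weighted
    by 1/degree (Hansen-Hurwitz style)."""
    rng = random.Random(seed)
    node, samples = start, []
    for _ in range(steps):
        node = rng.choice(neighbors(node))
        samples.append(node)
    return samples

def degree_corrected_mean(samples, value, degree):
    """Estimate the population mean of value(v) from random-walk samples."""
    num = sum(value(v) / degree(v) for v in samples)
    den = sum(1.0 / degree(v) for v in samples)
    return num / den

# Toy graph as an adjacency dict standing in for a restrictive web API.
G = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}
s = random_walk_sample(lambda v: G[v], start=1, steps=5000)
print(degree_corrected_mean(s, value=lambda v: v, degree=lambda v: len(G[v])))
# should be close to the true node-id mean of 2.5
```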

68. Sampling node pairs over large graphs.

Paper Link】 【Pages】:781-792

【Authors】: Pinghui Wang ; Junzhou Zhao ; John C. S. Lui ; Don Towsley ; Xiaohong Guan

【Abstract】: Characterizing user pair relationships is important for applications such as friend recommendation and interest targeting in online social networks (OSNs). Due to the large scale nature of such networks, it is infeasible to enumerate all user pairs and so sampling is used. In this paper, we show that it is a great challenge even for OSN service providers to characterize user pair relationships even when they possess the complete graph topology. The reason is that when sampling techniques (i.e., uniform vertex sampling (UVS) and random walk (RW)) are naively applied, they can introduce large biases, in particular, for estimating similarity distribution of user pairs with constraints such as existence of mutual neighbors, which is important for applications such as identifying network homophily. Estimating statistics of user pairs is more challenging in the absence of the complete topology information, since an unbiased sampling technique such as UVS is usually not allowed, and exploring the OSN graph topology is expensive. To address these challenges, we present asymptotically unbiased sampling methods to characterize user pair properties based on UVS and RW techniques respectively. We carry out an evaluation of our methods to show their accuracy and efficiency. Finally, we apply our methods to two Chinese OSNs, Douban and Xiami, and discover that significant homophily is present in these two networks.

【Keywords】: graph theory; sampling methods; social networking (online); OSN graph topology; OSN service providers; complete graph topology; complete topology information; friend recommendation; interest targeting; large graphs; mutual neighbors; network homophily; online social networks; pair relationships; random walk; sampling node pairs; unbiased sampling technique; vertex sampling; Educational institutions; Equations; Markov processes; Network topology; Probability distribution; Sampling methods; Topology

69. Link prediction across networks by biased cross-network sampling.

Paper Link】 【Pages】:793-804

【Authors】: Guo-Jun Qi ; Charu C. Aggarwal ; Thomas S. Huang

【Abstract】: The problem of link inference has been widely studied in a variety of social networking scenarios. In this problem, we wish to predict future links in a growing network with the use of the existing network structure. However, most of the existing methods work well only if a significant number of links are already available in the network for the inference process. In many scenarios, the existing network may be too sparse, and may have too few links to enable meaningful learning mechanisms. This paucity of linkage information can be challenging for the link inference problem. However, in many cases, other (more densely linked) networks may be available which show similar linkage structure in terms of underlying attribute information in the nodes. The linkage information in the existing networks can be used in conjunction with the node attribute information in both networks in order to make meaningful link recommendations. Thus, this paper introduces the use of transfer learning methods for performing cross-network link inference. We present experimental results illustrating the effectiveness of the approach.

【Keywords】: computer networks; inference mechanisms; learning (artificial intelligence); telecommunication links; biased cross-network sampling; cross-network link inference; inference process; learning mechanisms; link inference problem; link prediction; link recommendations; network structure; social networking scenarios; transfer learning methods; Couplings; Knowledge engineering; Linear programming; Predictive models; Social network services; Training; Vectors

70. Interval indexing and querying on key-value cloud stores.

Paper Link】 【Pages】:805-816

【Authors】: George Sfakianakis ; Ioannis Patlakas ; Nikos Ntarmos ; Peter Triantafillou

【Abstract】: Cloud key-value stores are becoming increasingly more important. Challenging applications, requiring efficient and scalable access to massive data, arise every day. We focus on supporting interval queries (which are prevalent in several data intensive applications, such as temporal querying for temporal analytics), an efficient solution for which is lacking. We contribute a compound interval index structure, comprised of two tiers: (i) the MRSegmentTree (MRST), a key-value representation of the Segment Tree, and (ii) the Endpoints Index (EPI), a column family index that stores information for interval endpoints. In addition to the above, our contributions include: (i) algorithms for efficiently constructing and populating our indices using MapReduce jobs, (ii) techniques for efficient and scalable index maintenance, and (iii) algorithms for processing interval queries. We have implemented all algorithms using HBase and Hadoop, and conducted a detailed performance evaluation. We quantify the costs associated with the construction of the indices, and evaluate our query processing algorithms using queries on real data sets. We compare the performance of our approach to two alternatives: the native support for interval queries provided in HBase, and the execution of such queries using the Hive query execution tool. Our results show a significant speedup, far outperforming the state of the art.

【Keywords】: cloud computing; indexing; query processing; tree data structures; EPI; HBase; Hadoop; Hive query execution tool; MRST; MRSegmentTree; MapReduce job; cloud key-value store; column family index; data access; data intensive application; endpoints index; interval index structure; interval querying; key-value representation; query processing algorithm; real data set; scalable index maintenance; temporal analytics; temporal querying; Buildings; Crawlers; Indexing; Query processing; Vegetation; Web pages
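
The MRSegmentTree of entry 70 stores segment-tree nodes as key-value rows; the toy class below mimics that idea in memory for stabbing queries over an integer domain, so the node-to-interval mapping is explicit. It is a plain single-machine sketch, not the HBase/MapReduce construction described in the paper.

```python
class SegmentTreeKV:
    """Toy segment tree kept as a key-value map (node id -> interval ids),
    mimicking the idea of storing segment-tree nodes as rows in a
    key-value store. Supports stabbing queries over an integer domain [0, n)."""

    def __init__(self, n):
        self.n = n
        self.store = {}          # node id -> list of interval labels

    def insert(self, label, lo, hi, node=1, nlo=0, nhi=None):
        """Register interval [lo, hi) under O(log n) canonical nodes."""
        if nhi is None:
            nhi = self.n
        if hi <= nlo or nhi <= lo:
            return
        if lo <= nlo and nhi <= hi:
            self.store.setdefault(node, []).append(label)
            return
        mid = (nlo + nhi) // 2
        self.insert(label, lo, hi, 2 * node, nlo, mid)
        self.insert(label, lo, hi, 2 * node + 1, mid, nhi)

    def stab(self, point):
        """Return all intervals containing `point` by walking root-to-leaf."""
        out, node, nlo, nhi = [], 1, 0, self.n
        while nhi - nlo >= 1:
            out.extend(self.store.get(node, []))
            if nhi - nlo == 1:
                break
            mid = (nlo + nhi) // 2
            if point < mid:
                node, nhi = 2 * node, mid
            else:
                node, nlo = 2 * node + 1, mid
        return out

t = SegmentTreeKV(16)
t.insert("a", 2, 9)       # interval [2, 9)
t.insert("b", 5, 12)      # interval [5, 12)
print(sorted(t.stab(6)))  # -> ['a', 'b']
```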

71. Robust distributed stream processing.

Paper Link】 【Pages】:817-828

【Authors】: Chuan Lei ; Elke A. Rundensteiner ; Joshua D. Guttman

【Abstract】: Distributed stream processing systems must function efficiently for data streams that fluctuate in their arrival rates and data distributions. Yet repeated and prohibitively expensive load re-allocation across machines may make these systems ineffective, potentially resulting in data loss or even system failure. To overcome this problem, we instead propose a Robust Load Distribution (RLD) strategy that is resilient to data fluctuations. RLD provides ϵ-optimal query performance under load fluctuations without suffering from the performance penalty caused by load migration. RLD is based on three key strategies. First, we model robust distributed stream processing as a parametric query optimization problem. The notions of robust logical and robust physical plans are then overlays of this parameter space. Second, our Early-terminated Robust Partitioning (ERP) finds a set of robust logical plans, covering the parameter space, while minimizing the number of prohibitively expensive optimizer calls with a probabilistic bound on the space coverage. Third, our OptPrune algorithm maps the space-covering logical solution to a single robust physical plan tolerant to deviations in data statistics that maximizes the parameter space coverage at runtime. Our experimental study using stock market and sensor network streams demonstrates that our RLD methodology consistently outperforms state-of-the-art solutions in terms of efficiency and effectiveness in highly fluctuating data stream environments.

【Keywords】: distributed processing; probability; query processing; resource allocation; statistical analysis; ERP; OptPrune algorithm; RLD methodology; RLD strategy; data distributions; data fluctuations; data loss; data statistics; data stream environments; data streams; distributed stream processing systems; early-terminated robust partitioning; load distribution strategy; load fluctuations; load migration; load re-allocation across machines; optimal query performance; parameter space coverage; parametric query optimization problem; performance penalty; probabilistic bound; prohibitively expensive optimizer calls; robust distributed stream processing; robust logical plans; robust physical plans; sensor networks streams; space-covering logical solution; state-of-the-art solutions; stock market; system failure; Digital signal processing; Partitioning algorithms; Query processing; Robustness; Runtime; Silicon; Uncertainty

72. Learning to rank from distant supervision: Exploiting noisy redundancy for relational entity search.

Paper Link】 【Pages】:829-840

【Authors】: Mianwei Zhou ; Hongning Wang ; Kevin Chen-Chuan Chang

【Abstract】: In this paper, we study the task of relational entity search, which aims at automatically learning an entity ranking function for a desired relation. To rank entities, we exploit the redundancy that abounds in their snippets; however, such redundancy is noisy as not all the snippets represent information relevant to the desired relation. To explore useful information from such noisy redundancy, we abstract the task as a distantly supervised ranking problem - based on coarse entity-level annotations, deriving a relation-specific ranking function for the purpose of online searching. As the key challenge, without detailed snippet-level annotations, we have to learn an entity ranking function that can effectively filter noise; furthermore, the ranking function should also be online executable. We develop Pattern-based Filter Network (PFNet), a novel probabilistic graphical model, as our solution. To balance the accuracy and efficiency requirements, PFNet selects a limited number of indicative patterns to filter noisy snippets, and inverted indexes are utilized to retrieve required features. Experiments on the large scale ClueWeb09 data set for six different relations confirm the effectiveness of the proposed PFNet model, which outperforms five state-of-the-art relational entity ranking methods.

【Keywords】: feature extraction; graph theory; information retrieval; learning (artificial intelligence); probability; ClueWeb09 data set; PFNet; coarse entity-level annotation; distant supervision; distantly supervised ranking problem; entity ranking function; feature retrieval; indicative pattern; inverted index; noise filtering; noisy redundancy; noisy snippet filtering; online searching; pattern-based filter network; probabilistic graphical model; rank learning; relation-specific ranking function; relational entity search; snippet-level annotation; Accuracy; Indexes; Logic gates; Noise; Noise measurement; Redundancy; Training

73. AFFINITY: Efficiently querying statistical measures on time-series data.

Paper Link】 【Pages】:841-852

【Authors】: Saket Sathe ; Karl Aberer

【Abstract】: Computing statistical measures for large databases of time series is a fundamental primitive for querying and mining time-series data [1]-[6]. This primitive is gaining importance with the increasing number and rapid growth of time series databases. In this paper, we introduce a framework for efficient computation of statistical measures by exploiting the concept of affine relationships. Affine relationships can be used to infer statistical measures for time series, from other related time series, instead of computing them directly; thus, reducing the overall computational cost significantly. The resulting methods exhibit at least one order of magnitude improvement over the best known methods. To the best of our knowledge, this is the first work that presents an unified approach for computing and querying several statistical measures at once. Our approach exploits affine relationships using three key components. First, the AFCLST algorithm clusters the time-series data, such that high-quality affine relationships could be easily found. Second, the SYMEX algorithm uses the clustered time series and efficiently computes the desired affine relationships. Third, the SCAPE index structure produces a many-fold improvement in the performance of processing several statistical queries by seamlessly indexing the affine relationships. Finally, we establish the effectiveness of our approaches by performing comprehensive experimental evaluation on real datasets.

【Keywords】: data mining; database management systems; pattern clustering; query processing; time series; AFCLST algorithm; SCAPE index structure; SYMEX algorithm; affine relationship concept; data clustering; data mining; statistical measure query; time series database; time-series data; Clustering algorithms; Correlation; Covariance matrices; Indexes; Measurement; Time series analysis; Vectors
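
The core observation behind AFFINITY, that statistics of an affinely related series follow from the reference series, can be checked in a few lines. This snippet only verifies the identities mean(y) = a*mean(x) + b, var(y) = a^2*var(x) and cov(x, y) = a*var(x) on synthetic data; the clustering (AFCLST), relationship discovery (SYMEX) and indexing (SCAPE) components are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000).cumsum()         # a reference time series
a, b = 2.5, -1.0
y = a * x + b                              # an affinely related series

# If y = a*x + b, its statistics follow from x's without touching y:
print(np.isclose(y.mean(), a * x.mean() + b))                   # mean
print(np.isclose(y.var(),  a**2 * x.var()))                     # variance
print(np.isclose(np.cov(x, y, bias=True)[0, 1], a * x.var()))   # covariance
```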

74. Forecasting the data cube: A model configuration advisor for multi-dimensional data sets.

Paper Link】 【Pages】:853-864

【Authors】: Ulrike Fischer ; Christopher Schildt ; Claudio Hartmann ; Wolfgang Lehner

【Abstract】: Forecasting time series data is crucial in a number of domains such as supply chain management and display advertisement. In these areas, the time series data to forecast is typically organized along multiple dimensions leading to a high number of time series that need to be forecasted. Most current approaches focus only on selecting and optimizing a forecast model for a single time series. In this paper, we explore how we can utilize time series at different dimensions to increase forecast accuracy and, optionally, reduce model maintenance overhead. Solving this problem is challenging due to the large space of possibilities and potentially high model creation costs. We propose a model configuration advisor that automatically determines the best set of models, a model configuration, for a given multi-dimensional data set. Our approach is based on a general process that iteratively examines more and more models and simultaneously controls the search space depending on the data set, model type and available hardware. The final model configuration is integrated into F2DB, an extension of PostgreSQL, that processes forecast queries and maintains the configuration as new data arrives. We comprehensively evaluated our approach on real and synthetic data sets. The evaluation shows that our approach significantly increases forecast query accuracy while ensuring low model costs.

【Keywords】: SQL; data handling; query processing; relational databases; time series; PostgreSQL; Structured Query Language; data cube forecasting; display advertisement; forecast accuracy; model configuration advisor; multidimensional data set; query accuracy; supply chain management; time series data forecasting; time series forecast model; Accuracy; Cities and towns; Data models; Forecasting; Numerical models; Predictive models; Time series analysis

75. SubZero: A fine-grained lineage system for scientific databases.

Paper Link】 【Pages】:865-876

【Authors】: Eugene Wu ; Samuel Madden ; Michael Stonebraker

【Abstract】: Data lineage is a key component of provenance that helps scientists track and query relationships between input and output data. While current systems readily support lineage relationships at the file or data array level, finer-grained support at an array-cell level is impractical due to the lack of support for user defined operators and the high runtime and storage overhead to store such lineage. We interviewed scientists in several domains to identify a set of common semantics that can be leveraged to efficiently store fine-grained lineage. We use the insights to define lineage representations that efficiently capture common locality properties in the lineage data, and a set of APIs so operator developers can easily export lineage information from user defined operators. Finally, we introduce two benchmarks derived from astronomy and genomics, and show that our techniques can reduce lineage query costs by up to 10× while incurring substantially less impact on workflow runtime and storage.

【Keywords】: application program interfaces; cost reduction; database management systems; query processing; API; array-cell level; astronomy; fine-grained lineage system; genomics; lineage query cost reduction; scientific databases; user defined operators; Arrays; Benchmark testing; Bioinformatics; Genomics; Optimization; Payloads; Runtime

76. Logical provenance in data-oriented workflows?

Paper Link】 【Pages】:877-888

【Authors】: Robert Ikeda ; Akash Das Sarma ; Jennifer Widom

【Abstract】: We consider the problem of defining, generating, and tracing provenance in data-oriented workflows, in which input data sets are processed by a graph of transformations to produce output results. We first give a new general definition of provenance for general transformations, introducing the notions of correctness, precision, and minimality. We then determine when properties such as correctness and minimality carry over from the individual transformations' provenance to the workflow provenance. We describe a simple logical-provenance specification language consisting of attribute mappings and filters. We provide an algorithm for provenance tracing in workflows where logical provenance for each transformation is specified using our language. We consider logical provenance in the relational setting, observing that for a class of Select-Project-Join (SPJ) transformations, logical provenance specifications encode minimal provenance. We have built a prototype system supporting the features and algorithms presented in the paper, and we report a few preliminary experimental results.

【Keywords】: graph grammars; specification languages; attribute mappings; correctness notion; correctness properties; data-oriented workflows; filters; general transformations; logical provenance specification language; minimality notion; minimality properties; precision notion; provenance tracing; select-project-join transformations; transformation provenance; workflow provenance; Data mining; Debugging; Educational institutions; IP networks; Integrated circuits; Portable computers; Writing

77. Revision provenance in text documents of asynchronous collaboration.

Paper Link】 【Pages】:889-900

【Authors】: Jing Zhang ; H. V. Jagadish

【Abstract】: Many text documents today are collaboratively edited, often with multiple small changes. The problem we consider in this paper is how to find provenance for a specific part of interest in the document. A full revision history, represented as a version tree, can tell us about all updates made to the document, but most of these updates may apply to other parts of the document, and hence not be relevant to answer the provenance question at hand. In this paper, we propose the notion of a revision unit as a flexible unit to capture the necessary provenance. We demonstrate through experiments the capability of the revision units in keeping only relevant updates in the provenance representation and the flexibility of the revision units in adjusting to updates reflected in the version tree.

【Keywords】: collaborative filtering; text analysis; tree data structures; asynchronous collaboration; provenance representation; revision unit flexibility; text documents; version tree; Collaboration; History; Keyword search; Merging; Roads; Semantics; Standards

78. Inverted linear quadtree: Efficient top k spatial keyword search.

Paper Link】 【Pages】:901-912

【Authors】: Chengyuan Zhang ; Ying Zhang ; Wenjie Zhang ; Xuemin Lin

【Abstract】: With advances in geo-positioning technologies and geo-location services, there is a rapidly growing number of spatio-textual objects collected in many applications such as location based services and social networks, in which an object is described by its spatial location and a set of keywords (terms). Consequently, the study of spatial keyword search which explores both location and textual description of the objects has attracted great attention from commercial organizations and research communities. In this paper, we study the problem of top k spatial keyword search (TOPK-SK), which is fundamental in the spatial keyword queries. Given a set of spatio-textual objects, a query location and a set of query keywords, the top k spatial keyword search retrieves the closest k objects each of which contains all keywords in the query. Based on the inverted index and the linear quadtree, we propose a novel index structure, called inverted linear quadtree (IL-Quadtree), which is carefully designed to exploit both spatial and keyword based pruning techniques to effectively reduce the search space. An efficient algorithm is then developed to tackle top k spatial keyword search. In addition, we show that the IL-Quadtree technique can also be applied to improve the performance of other spatial keyword queries such as the direction-aware top k spatial keyword search and the spatio-textual ranking query. Comprehensive experiments on real and synthetic data clearly demonstrate the efficiency of our methods.

【Keywords】: indexing; quadtrees; query formulation; query processing; IL-Quadtree technique; TOPK-SK; direction-aware top k spatial keyword search; geo-location services; geo-positioning technologies; index structure; inverted index; inverted linear quadtree; keyword based pruning techniques; location based services; object location description; object textual description; query keywords; query location; search space; social networks; spatial based pruning techniques; spatial keyword queries; spatial location; spatio-textual objects; spatio-textual ranking query; Business; Equations; Indexing; Keyword search; Mathematical model; Search problems
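
A linear quadtree identifies each grid cell by its Z-order (Morton) key, obtained by interleaving the bits of the cell coordinates; an "inverted" variant then keeps, per keyword, a list of (key, object) pairs. The sketch below shows just that encoding and lookup on made-up data; the IL-Quadtree's pruning and top-k machinery are not reproduced.

```python
def morton_code(x, y, bits=16):
    """Interleave the bits of integer cell coordinates (x, y) to get the
    Z-order / linear-quadtree key of the cell."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

# Tiny "inverted linear quadtree": keyword -> sorted list of (z-key, object id).
index = {
    "pizza": [(morton_code(3, 5), "o1"), (morton_code(9, 2), "o2")],
    "sushi": [(morton_code(3, 5), "o3")],
}

def candidates(keyword):
    """Objects containing the keyword, keyed by Z-order cell. Spatially
    close cells get numerically close keys, which enables range pruning."""
    return index.get(keyword, [])

print(candidates("pizza"))
```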

79. Similarity query processing for probabilistic sets.

Paper Link】 【Pages】:913-924

【Authors】: Ming Gao ; Cheqing Jin ; Wei Wang ; Xuemin Lin ; Aoying Zhou

【Abstract】: Evaluating similarity between sets is a fundamental task in computer science. However, there are many applications in which elements in a set may be uncertain due to various reasons. Existing work on modeling such probabilistic sets and computing their similarities suffers from huge model sizes or significant similarity evaluation cost, and hence is only applicable to small probabilistic sets. In this paper, we propose a simple yet expressive model that supports many applications where one probabilistic set may have thousands of elements. We define two types of similarities between two probabilistic sets using the possible world semantics; they complement each other in capturing the similarity distributions in the cross product of possible worlds. We design efficient dynamic programming-based algorithms to calculate both types of similarities. Novel individual and batch pruning techniques based on upper bounding the similarity values are also proposed. To accommodate extremely large probabilistic sets, we also design sampling-based approximate query processing methods with strong probabilistic guarantees. We have conducted extensive experiments using both synthetic and real datasets, and demonstrated the effectiveness and efficiency of our proposed methods.

【Keywords】: data integration; dynamic programming; query processing; sampling methods; dynamic programming; probabilistic sets; pruning technique; sampling-based approximate query processing; similarity query processing; Approximation algorithms; Computational modeling; Heuristic algorithms; Probabilistic logic; Query processing; Semantics; Upper bound

80. Top-k string similarity search with edit-distance constraints.

Paper Link】 【Pages】:925-936

【Authors】: Dong Deng ; Guoliang Li ; Jianhua Feng ; Wen-Syan Li

【Abstract】: String similarity search is a fundamental operation in many areas, such as data cleaning, information retrieval, and bioinformatics. In this paper we study the problem of top-k string similarity search with edit-distance constraints, which, given a collection of strings and a query string, returns the top-k strings with the smallest edit distances to the query string. Existing methods usually try different edit-distance thresholds and select an appropriate threshold to find top-k answers. However it is rather expensive to select an appropriate threshold. To address this problem, we propose a progressive framework by improving the traditional dynamic-programming algorithm to compute edit distance. We prune unnecessary entries in the dynamic-programming matrix and only compute those pivotal entries. We extend our techniques to support top-k similarity search. We develop a range-based method by grouping the pivotal entries to avoid duplicated computations. Experimental results show that our method achieves high performance, and significantly outperforms state-of-the-art approaches on real-world datasets.

【Keywords】: dynamic programming; query processing; string matching; bioinformatics; data cleaning; dynamic programming algorithm; dynamic programming matrix; edit-distance constraint; edit-distance threshold; information retrieval; query string; range-based method; top-k answer; top-k string similarity search; Bioinformatics; Cleaning; Indexes; Search problems; Time complexity
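
To make the pruning idea in entry 80 concrete, here is a generic threshold-aware edit-distance routine that fills only the diagonal band of width 2*tau+1 and abandons the computation once every entry in the current row exceeds tau. It is a standard banded DP, not the paper's progressive, range-based top-k framework.

```python
def edit_distance_within(s, t, tau):
    """Edit-distance DP restricted to the diagonal band of width 2*tau+1.
    Returns the exact distance if it is <= tau, otherwise None
    (with early termination as soon as the whole band exceeds tau)."""
    n, m = len(s), len(t)
    if abs(n - m) > tau:
        return None
    INF = tau + 1
    prev = [j if j <= tau else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [INF] * (m + 1)
        lo, hi = max(1, i - tau), min(m, i + tau)
        if i - tau <= 0:
            curr[0] = i
        for j in range(lo, hi + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        if min(curr[lo:hi + 1] + [curr[0]]) > tau:
            return None               # the whole band already exceeds tau
        prev = curr
    return prev[m] if prev[m] <= tau else None

print(edit_distance_within("kitten", "sitting", tau=3))   # -> 3
print(edit_distance_within("kitten", "sitting", tau=2))   # -> None
```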

81. On shortest unique substring queries.

Paper Link】 【Pages】:937-948

【Authors】: Jian Pei ; Wush Chi-Hsuan Wu ; Mi-Yen Yeh

【Abstract】: In this paper, we tackle a novel type of interesting queries - shortest unique substring queries. Given a (long) string S and a query point q in the string, can we find a shortest substring containing q that is unique in S? We illustrate that shortest unique substring queries have many potential applications, such as information retrieval, bioinformatics, and event context analysis. We develop efficient algorithms for online query answering. First, we present an algorithm to answer a shortest unique substring query in O(n) time using a suffix tree index, where n is the length of string S. Second, we show that, using O(n·h) time and O(n) space, we can compute a shortest unique substring for every position in a given string, where h is variable theoretically in O(n) but on real data sets often much smaller than n and can be treated as a constant. Once the shortest unique substrings are pre-computed, shortest unique substring queries can be answered online in constant time. In addition to the solid algorithmic results, we empirically demonstrate the effectiveness and efficiency of shortest unique substring queries on real data sets.

【Keywords】: query processing; bioinformatics; event context analysis; information retrieval; online query answering; shortest unique substring queries; suffix tree index; Algorithm design and analysis; Bioinformatics; Context; Indexes; Organisms; Upper bound
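
A brute-force reference for the shortest unique substring problem in entry 81 helps pin down the semantics: try lengths in increasing order and return the first substring that covers the query position and occurs exactly once. This is only usable on tiny strings; the paper's suffix-tree-based algorithms compute the same answers in (near-)linear time.

```python
def shortest_unique_substring(s, q):
    """Brute-force baseline: among all substrings s[i:i+length] that cover
    position q and occur exactly once in s, return a shortest one
    (or None if no substring of s is unique)."""
    n = len(s)
    for length in range(1, n + 1):
        # i must satisfy i <= q <= i + length - 1 and i + length <= n
        for i in range(max(0, q - length + 1), min(q, n - length) + 1):
            cand = s[i:i + length]
            if s.count(cand) == 1:          # unique in s
                return cand
    return None

print(shortest_unique_substring("abcbabc", 2))   # -> 'cb'
```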

82. Engineering Generalized Shortest Path queries.

Paper Link】 【Pages】:949-960

【Authors】: Michael N. Rice ; Vassilis J. Tsotras

【Abstract】: Generalized Shortest Path (GSP) queries represent a variant of constrained shortest path queries in which a solution path of minimum total cost must visit at least one location from each of a set of specified location categories (e.g., gas stations, grocery stores) in a specified order. This problem type has many practical applications in logistics and personalized location-based services, and is closely related to the NP-hard Generalized Traveling Salesman Path Problem (GTSPP). In this work, we present a new dynamic programming formulation to highlight the structure of this problem. Using this formulation as our foundation, we progressively engineer a fast and scalable GSP query algorithm for use on large, real-world road networks. Our approach incorporates concepts from Contraction Hierarchies, a well-known graph indexing technique for static shortest path queries. To demonstrate the practicality of our algorithm we experimented on the North American road network (with over 50 million edges) where we achieved up to several orders of magnitude speed improvements over the previous-best algorithm, depending on the relative sizes of the location categories.

【Keywords】: dynamic programming; graph theory; logistics; mobile computing; query processing; travelling salesman problems; GSP queries; GTSPP; NP-hard generalized traveling salesman path problem; dynamic programming; generalized shortest path queries; graph indexing technique; location-based services; logistics; specified location categories; Algorithm design and analysis; Approximation algorithms; Approximation methods; Dynamic programming; Heuristic algorithms; Roads; Traveling salesman problems
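
One simple way to realize the dynamic-programming view of GSP queries from entry 82 is a layered multi-source Dijkstra: compute distances from the source, seed the next stage with the nodes of the current category, and repeat. The sketch below does exactly that on a made-up toy network; it ignores the Contraction Hierarchies machinery the paper uses to make this fast on continental road networks.

```python
import heapq

def dijkstra_multi(adj, seeds):
    """Multi-source Dijkstra over an adjacency dict; `seeds` maps node -> initial cost."""
    dist = dict(seeds)
    pq = [(c, v) for v, c in seeds.items()]
    heapq.heapify(pq)
    while pq:
        d, v = heapq.heappop(pq)
        if d > dist.get(v, float("inf")):
            continue
        for w, c in adj.get(v, []):
            nd = d + c
            if nd < dist.get(w, float("inf")):
                dist[w] = nd
                heapq.heappush(pq, (nd, w))
    return dist

def gsp(adj, source, target, categories):
    """Cheapest path source -> target visiting at least one node of each
    category, in the given order: one multi-source Dijkstra per stage."""
    dist = dijkstra_multi(adj, {source: 0.0})
    for cat in categories:
        seeds = {v: dist[v] for v in cat if v in dist}
        if not seeds:
            return float("inf")
        dist = dijkstra_multi(adj, seeds)
    return dist.get(target, float("inf"))

# Toy road network: visit a gas station {B, D}, then a grocery store {E}.
adj = {"A": [("B", 2), ("C", 1)], "C": [("D", 1)], "B": [("E", 5)],
       "D": [("E", 2)], "E": [("T", 1)]}
print(gsp(adj, "A", "T", [{"B", "D"}, {"E"}]))   # -> 5.0 via A-C-D-E-T
```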

83. Efficient direct search on compressed genomic data.

Paper Link】 【Pages】:961-972

【Authors】: Xiaochun Yang ; Bin Wang ; Chen Li ; Jiaying Wang ; Xiaohui Xie

【Abstract】: The explosive growth in the amount of data produced by next-generation sequencing poses significant computational challenges on how to store, transmit and query these data, efficiently and accurately. A unique characteristic of the genomic sequence data is that many of them can be highly similar to each other, which has motivated the idea of compressing sequence data by storing only their differences to a reference sequence, thereby drastically cutting the storage cost. However, an unresolved question in this area is whether it is possible to perform search directly on the compressed data, and if so, how. Here we show that directly querying compressed genomic sequence data is possible and can be done efficiently. We describe a set of novel index structures and algorithms for this purpose, and present several optimization techniques to reduce the space requirement and query response time. We demonstrate the advantage of our method and compare it against existing ones through a thorough experimental study on real genomic data.

【Keywords】: bioinformatics; data compression; genomics; indexing; query processing; compressed genomic data; data querying; data storage; data transmission; direct search; genomic sequence data; index structure; next-generation sequencing; optimization technique; query response time; sequence data compression; space requirement reduction; Bioinformatics; Genomics; Indexes; Niobium; Pattern matching; Sequential analysis; Silicon

84. On answering why-not questions in reverse skyline queries.

Paper Link】 【Pages】:973-984

【Authors】: Md. Saiful Islam ; Rui Zhou ; Chengfei Liu

【Abstract】: This paper aims at answering the so called why-not questions in reverse skyline queries. A reverse skyline query retrieves all data points whose dynamic skylines contain the query point. We outline the benefit and the semantics of answering why-not questions in reverse skyline queries. In connection with this, we show how to modify the why-not point and the query point to include the why-not point in the reverse skyline of the query point. We then show, how a query point can be positioned safely anywhere within a region (i.e., called safe region) without losing any of the existing reverse skyline points. We also show how to answer why-not questions considering the safe region of the query point. Our approach efficiently combines both query point and data point modification techniques to produce meaningful answers. Experimental results also demonstrate that our approach can produce high quality explanations for why-not questions in reverse skyline queries.

【Keywords】: query processing; question answering (information retrieval); data point modification techniques; dynamic skylines; query point; reverse skyline queries; why-not point; why-not questions; Companies; Complexity theory; Database systems; Educational institutions; Heuristic algorithms; Semantics

85. Layered processing of skyline-window-join (SWJ) queries using iteration-fabric.

Paper Link】 【Pages】:985-996

【Authors】: Mithila Nagendra ; K. Selçuk Candan

【Abstract】: The problem of finding interesting tuples in a data set, more commonly known as the skyline problem, has been extensively studied in scenarios where the data is static. More recently, skyline research has moved towards data streaming environments, where tuples arrive/expire in a continuous manner. Several algorithms have been developed to track skyline changes over sliding windows; however, existing methods focus on skyline analysis in which all required skyline attributes belong to a single incoming data stream. This constraint renders current algorithms unsuitable for applications that require a real-time “join” operation to be carried out between multiple incoming data streams, arriving from different sources, before the skyline query can be answered. Based on this motivation, in this paper, we address the problem of computing skyline-window-join (SWJ) queries over pairs of data streams, considering sliding windows that take into account only the most recent tuples. In particular, we propose a Layered Skyline-window-Join (LSJ) operator that (a) partitions the overall process into processing layers and (b) maintains skyline-join results in an incremental manner by continuously monitoring the changes in all layers of the process. We combine the advantages of existing skyline methods (including those that efficiently maintain skyline results over a single stream, and those that compute the skyline of pairs of static data sets) to develop a novel iteration-fabric skyline-window-join processing structure. Using the iteration-fabric, LSJ eliminates redundant work across consecutive windows by leveraging shared data across all iteration layers of the windowed skyline-join processing. To the best of our knowledge, this is the first paper that addresses join-based skyline queries over sliding windows. Extensive experimental evaluations over real and simulated data show that LSJ provides large gains over naive extensions of existing schemes which are not designed to eliminate redundant work across multiple processing layers.

【Keywords】: iterative methods; query processing; set theory; LSJ operator; data streaming environments; iteration-fabric skyline-window-join processing structure; join-based skyline queries; layered SWJ query processing; layered skyline-window-join operator; layered skyline-window-join query processing; naive extensions; real-time join operation; skyline analysis; skyline attributes; skyline change tracking; skyline problem; skyline research; sliding windows; tuple finding; windowed skyline-join processing; Algorithm design and analysis; Computational modeling; Context; Distributed databases; Monitoring; Sensors; Stock markets

86. Efficient snapshot retrieval over historical graph data.

Paper Link】 【Pages】:997-1008

【Authors】: Udayan Khurana ; Amol Deshpande

【Abstract】: We present a distributed graph database system to manage historical data for large evolving information networks, with the goal to enable temporal and evolutionary queries and analysis. The cornerstone of our system is a novel, user-extensible, highly tunable, and distributed hierarchical index structure called DeltaGraph, that enables compact recording of the historical network information, and that supports efficient retrieval of historical graph snapshots for single-site or parallel processing. Our system exposes a general programmatic API to process and analyze the retrieved snapshots. Along with the original graph data, DeltaGraph can also maintain and index auxiliary information; this functionality can be used to extend the structure to efficiently execute queries like subgraph pattern matching over historical data. We develop analytical models for both the storage space needed and the snapshot retrieval times to aid in choosing the right construction parameters for a specific scenario. We also present an in-memory graph data structure called GraphPool that can maintain hundreds of historical graph instances in main memory in a non-redundant manner. We present a comprehensive experimental evaluation that illustrates the effectiveness of our proposed techniques at managing historical graph information.

【Keywords】: application program interfaces; data structures; distributed databases; information networks; query processing; DeltaGraph; GraphPool; analytical models; comprehensive experimental evaluation; distributed graph database system; distributed hierarchical index structure; evolutionary analysis; evolutionary queries; general programmatic API; historical data management; historical graph data; historical graph information management; historical graph snapshots; historical network information; in-memory graph data structure; information networks; parallel processing; single-site processing; snapshot retrieval; storage space; subgraph pattern matching; temporal queries; Analytical models; Data models; Data structures; Indexes; Memory management; Pattern matching
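
The basic snapshot-reconstruction idea behind entry 86 can be shown with an event-sourced toy: keep one base snapshot plus timestamped edge deltas and replay them up to the requested time. DeltaGraph's hierarchical, distributed index and GraphPool's non-redundant in-memory pooling go far beyond this sketch.

```python
class DeltaLog:
    """Toy event-sourced history of a graph: one base snapshot plus
    timestamped edge deltas; any historical snapshot is rebuilt by
    replaying the deltas up to the requested time."""

    def __init__(self, base_edges):
        self.base = set(base_edges)
        self.deltas = []                 # (time, op, edge) with op in {+, -}

    def add_edge(self, t, edge):
        self.deltas.append((t, "+", edge))

    def remove_edge(self, t, edge):
        self.deltas.append((t, "-", edge))

    def snapshot(self, t):
        edges = set(self.base)
        for ts, op, edge in sorted(self.deltas):
            if ts > t:
                break
            if op == "+":
                edges.add(edge)
            else:
                edges.discard(edge)
        return edges

h = DeltaLog({("a", "b")})
h.add_edge(5, ("b", "c"))
h.remove_edge(9, ("a", "b"))
print(h.snapshot(4))    # {('a', 'b')}
print(h.snapshot(7))    # {('a', 'b'), ('b', 'c')}
print(h.snapshot(10))   # {('b', 'c')}
```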

87. FERRARI: Flexible and efficient reachability range assignment for graph indexing.

Paper Link】 【Pages】:1009-1020

【Authors】: Stephan Seufert ; Avishek Anand ; Srikanta J. Bedathur ; Gerhard Weikum

【Abstract】: In this paper, we propose a scalable and highly efficient index structure for the reachability problem over graphs. We build on the well-known node interval labeling scheme where the set of vertices reachable from a particular node is compactly encoded as a collection of node identifier ranges. We impose an explicit bound on the size of the index and flexibly assign approximate reachability ranges to nodes of the graph such that the number of index probes to answer a query is minimized. The resulting tunable index structure generates a better range labeling if the space budget is increased, thus providing a direct control over the trade off between index size and the query processing performance. By using a fast recursive querying method in conjunction with our index structure, we show that, in practice, reachability queries can be answered in the order of microseconds on an off-the-shelf computer - even for the case of massive-scale real world graphs. Our claims are supported by an extensive set of experimental results using a multitude of benchmark and real-world web-scale graph datasets.

【Keywords】: indexing; query processing; reachability analysis; FERRARI; Web-scale graph dataset; efficient index structure; efficient reachability range assignment; flexible reachability range assignment; graph indexing; node interval labeling scheme; off-the-shelf computer; query answering; query processing performance; recursive querying method; tunable index structure; Approximation algorithms; Equations; Indexing; Labeling; Query processing; Vegetation
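
Node-interval labeling, the starting point of FERRARI, is easiest to see on a tree: a post-order traversal gives each node an interval that contains exactly its descendants' intervals, so reachability reduces to interval containment. The sketch handles only the tree case; FERRARI's contribution is flexibly assigning exact and approximate interval sets under a size budget for general graphs.

```python
def interval_labels(tree, root):
    """Post-order interval labeling of a rooted tree: each node gets
    (low, high) such that u reaches v (u is an ancestor of v or u == v)
    iff low(u) <= low(v) and high(v) <= high(u)."""
    labels, counter = {}, [0]

    def dfs(u):
        start = counter[0]
        for child in tree.get(u, []):
            dfs(child)
        labels[u] = (start, counter[0])   # high = this node's post-order id
        counter[0] += 1

    dfs(root)
    return labels

def reaches(labels, u, v):
    lu, hu = labels[u]
    lv, hv = labels[v]
    return lu <= lv and hv <= hu

tree = {"r": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
L = interval_labels(tree, "r")
print(reaches(L, "a", "d"), reaches(L, "b", "d"))   # True False
```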

88. gIceberg: Towards iceberg analysis in large graphs.

Paper Link】 【Pages】:1021-1032

【Authors】: Nan Li ; Ziyu Guan ; Lijie Ren ; Jian Wu ; Jiawei Han ; Xifeng Yan

【Abstract】: Traditional multi-dimensional data analysis techniques such as iceberg cube cannot be directly applied to graphs for finding interesting or anomalous vertices due to the lack of dimensionality in graphs. In this paper, we introduce the concept of graph icebergs that refer to vertices for which the concentration (aggregation) of an attribute in their vicinities is abnormally high. Intuitively, these vertices shall be “close” to the attribute of interest in the graph space. Based on this intuition, we propose a novel framework, called gIceberg, which performs aggregation using random walks, rather than traditional SUM and AVG aggregate functions. This proposed framework scores vertices by their different levels of interestingness and finds important vertices that meet a user-specified threshold. To improve scalability, two aggregation strategies, forward and backward aggregation, are proposed with corresponding optimization techniques and bounds. Experiments on both real-world and synthetic large graphs demonstrate that gIceberg is effective and scalable.

【Keywords】: data analysis; graph theory; optimisation; attribute aggregation; attribute concentration; backward aggregation strategy; forward aggregation strategy; gIceberg framework; graph dimensionality; graph icebergs concept; graph vertex; iceberg analysis; iceberg cube; multidimensional data analysis technique; optimization technique; random walk; user-specified threshold; Accuracy; Aggregates; Approximation algorithms; Approximation methods; Educational institutions; Equations; Vectors

89. Top-k graph pattern matching over large graphs.

Paper Link】 【Pages】:1033-1044

【Authors】: Jiefeng Cheng ; Xianggang Zeng ; Jeffrey Xu Yu

【Abstract】: There exist many graph-based applications including bioinformatics, social science, link analysis, citation analysis, and collaborative work. All need to deal with a large data graph. Given a large data graph, in this paper, we study finding top-k answers for a graph pattern query (kGPM), and in particular, we focus on top-k cyclic graph queries where a graph query is cyclic and can be complex. The capability of supporting kGPM provides much more flexibility for a user to search graphs. And the problem itself is challenging. In this paper, we propose a new framework of processing kGPM with on-the-fly ranked lists based on spanning trees of the cyclic graph query. We observe a multidimensional representation for using multiple ranked lists to answer a given kGPM query. Under this representation, we propose a cost model to estimate the least number of tree answers to be consumed in each ranked list for a given kGPM query. This leads to a query optimization approach for kGPM processing, and a top-k algorithm to process kGPM with the optimal query plan. We conducted extensive performance studies using a synthetic dataset and a real dataset, and we confirm the efficiency of our proposed approach.

【Keywords】: pattern matching; query processing; tree searching; trees (mathematics); bioinformatics; citation analysis; collaborative work; cost model; data graph; graph pattern query; graph-based application; link analysis; multidimensional representation; multiple ranked list; optimal query plan; query optimization; search graph; social science; spanning trees; top-k answer finding; top-k cyclic graph query; top-k graph pattern matching; tree answer; Bioinformatics; Medical services; Optimization; Pattern matching; Query processing; Social network services

90. Breaking the top-k barrier of hidden web databases?

Paper Link】 【Pages】:1045-1056

【Authors】: Saravanan Thirumuruganathan ; Nan Zhang ; Gautam Das

【Abstract】: A large number of web databases are only accessible through proprietary form-like interfaces which require users to query the system by entering desired values for a few attributes. A key restriction enforced by such an interface is the top-k output constraint - i.e., when there are a large number of matching tuples, only a few (top-k) of them are preferentially selected and returned by the website, often according to a proprietary ranking function. Since most web database owners set k to be a small value, the top-k output constraint prevents many interesting third-party (e.g., mashup) services from being developed over real-world web databases. In this paper we consider the novel problem of “digging deeper” into such web databases. Our main contribution is the meta-algorithm GetNext that can retrieve the next ranked tuple from the hidden web database using only the restrictive interface of a web database without any prior knowledge of its ranking function. This algorithm can then be called iteratively to retrieve as many top ranked tuples as necessary. We develop principled and efficient algorithms that are based on generating and executing multiple reformulated queries and inferring the next ranked tuple from their returned results. We provide theoretical analysis of our algorithms, as well as extensive experimental results over synthetic and real-world databases that illustrate the effectiveness of our techniques.

【Keywords】: Web sites; database management systems; query processing; Web site; digging deeper problem; hidden Web database interface; mashup service; matching tuple; meta-algorithm GetNext; next ranked tuple retrieval; proprietary ranking function; real-world Web database; real-world database; reformulated query; synthetic database; third-party service; top-k barrier; top-k output constraint; Algorithm design and analysis; Knowledge engineering; Mashups; Query processing; Silicon; Testing

91. Automatic extraction of top-k lists from the web.

Paper Link】 【Pages】:1057-1068

【Authors】: Zhixian Zhang ; Kenny Q. Zhu ; Haixun Wang ; Hongsong Li

【Abstract】: This paper is concerned with information extraction from top-k web pages, which are web pages that describe the top k instances of a topic of general interest. Examples include “the 10 tallest buildings in the world”, “the 50 hits of 2010 you don't want to miss”, etc. Compared to other structured information on the web (including web tables), information in top-k lists is larger and richer, of higher quality, and generally more interesting. Therefore top-k lists are highly valuable. For example, they can help enrich open-domain knowledge bases (to support applications such as search or fact answering). In this paper, we present an efficient method that extracts top-k lists from web pages with high performance. Specifically, we extract more than 1.7 million top-k lists from a web corpus of 1.6 billion pages with 92.0% precision and 72.3% recall.

【Keywords】: Web sites; information analysis; knowledge based systems; Web corpus; automatic top-k lists extraction; information extraction; open-domain knowledge bases; top-k Web pages; Companies; Context; Data mining; Digital audio broadcasting; Feature extraction; Knowledge based systems; Web pages; Web information extraction; list extraction; top-k lists; web mining

92. Finding interesting correlations with conditional heavy hitters.

Paper Link】 【Pages】:1069-1080

【Authors】: Katsiaryna Mirylenka ; Themis Palpanas ; Graham Cormode ; Divesh Srivastava

【Abstract】: The notion of heavy hitters (items that make up a large fraction of the population) has been successfully used in a variety of applications across sensor and RFID monitoring, network data analysis, event mining, and more. Yet this notion often fails to capture the semantics we desire when we observe data in the form of correlated pairs. Here, we are interested in items that are conditionally frequent: when a particular item is frequent within the context of its parent item. In this work, we introduce and formalize the notion of Conditional Heavy Hitters to identify such items, with applications in network monitoring and Markov chain modeling. We introduce several streaming algorithms that allow us to find conditional heavy hitters efficiently, and provide analytical results. Different algorithms are successful for different input characteristics. We perform experimental evaluations to demonstrate the efficacy of our methods, and to study which algorithms are most suited for different types of data.

【Keywords】: Markov processes; data handling; Markov chain modeling; RFID monitoring; conditional heavy hitters; event mining; finding interesting correlations; network data analysis; network monitoring; observe data; sensor monitoring; streaming algorithms; Correlation; Data models; Data structures; Frequency estimation; Itemsets; Markov processes; Monitoring
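
One hedged way to picture conditional heavy hitters is to keep, for every parent item, a tiny SpaceSaving-style summary of its children and to report child values whose estimated conditional frequency is high. The sketch below illustrates that general idea only; it is not one of the paper's specific algorithms, and the class name, counter budget, and example stream are invented.

```python
from collections import defaultdict

class ConditionalHH:
    """Per-parent SpaceSaving summaries for (parent, child) pairs in a stream."""
    def __init__(self, counters_per_parent=4):
        self.k = counters_per_parent
        self.children = defaultdict(dict)      # parent -> {child: estimated count}
        self.parent_total = defaultdict(int)

    def update(self, parent, child):
        self.parent_total[parent] += 1
        counts = self.children[parent]
        if child in counts or len(counts) < self.k:
            counts[child] = counts.get(child, 0) + 1
        else:
            # SpaceSaving step: evict the minimum counter and inherit its count.
            victim = min(counts, key=counts.get)
            counts[child] = counts.pop(victim) + 1

    def conditional_heavy_hitters(self, phi=0.5):
        """Children whose estimated frequency within their parent is at least phi."""
        out = []
        for parent, counts in self.children.items():
            total = self.parent_total[parent]
            out.extend((parent, c) for c, n in counts.items() if n / total >= phi)
        return out

chh = ConditionalHH()
stream = [("A", "x")] * 8 + [("A", "y")] * 2 + [("B", "z")] * 5 + [("B", "w")]
for p, c in stream:
    chh.update(p, c)
print(chh.conditional_heavy_hitters(phi=0.6))   # -> [('A', 'x'), ('B', 'z')]
```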

93. Predicting query execution time: Are optimizer cost models really unusable?

Paper Link】 【Pages】:1081-1092

【Authors】: Wentao Wu ; Yun Chi ; Shenghuo Zhu ; Jun'ichi Tatemura ; Hakan Hacigümüs ; Jeffrey F. Naughton

【Abstract】: Predicting query execution time is useful in many database management issues including admission control, query scheduling, progress monitoring, and system sizing. Recently the research community has been exploring the use of statistical machine learning approaches to build predictive models for this task. An implicit assumption behind this work is that the cost models used by query optimizers are insufficient for query execution time prediction. In this paper we challenge this assumption and show that while the simple approach of scaling the optimizer's estimated cost indeed fails, a properly calibrated optimizer cost model is surprisingly effective. However, even a well-tuned optimizer cost model will fail in the presence of errors in cardinality estimates. Accordingly, we investigate the novel idea of spending extra resources to refine estimates for the query plan after it has been chosen by the optimizer but before execution. In our experiments we find that a well-calibrated query optimizer model along with cardinality estimation refinement provides a low-overhead way to obtain estimates that are always competitive and often much better than the best reported numbers from the machine learning approaches.

【Keywords】: database management systems; query processing; admission control; cardinality estimates; cardinality estimation refinement; database management; predictive model; progress monitoring; query execution time prediction; query optimizer cost model; query plan; query scheduling; research community; statistical machine learning; system sizing; Calibration; Equations; Estimation; Hardware; Indexes; Mathematical model
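
The calibration idea can be illustrated with ordinary least squares: treat the optimizer's per-operator cost counts (for example sequential page reads, random page reads, and tuples processed) as features, and fit the per-unit time constants from observed runtimes of a few calibration queries. The numbers below are made up and the procedure is a generic sketch, not the calibration method used in the paper.

```python
import numpy as np

# Rows: calibration queries; columns: optimizer cost counts per query
# (sequential page reads, random page reads, tuples processed by the CPU).
cost_counts = np.array([
    [10_000,    50, 200_000],
    [   500, 2_000,  30_000],
    [40_000,    10, 900_000],
    [ 8_000,   500, 100_000],
], dtype=float)

observed_runtime = np.array([2.0, 2.2, 8.6, 1.8])   # seconds, invented measurements

# Least-squares fit of the time cost of one unit of each operation.
unit_costs, *_ = np.linalg.lstsq(cost_counts, observed_runtime, rcond=None)

def predict_runtime(seq_pages, rand_pages, cpu_tuples):
    """Predicted execution time = calibrated unit costs dotted with estimated counts."""
    return float(np.dot(unit_costs, [seq_pages, rand_pages, cpu_tuples]))

print(unit_costs, predict_runtime(20_000, 100, 400_000))
```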

94. Query optimization for differentially private data management systems.

Paper Link】 【Pages】:1093-1104

【Authors】: Shangfu Peng ; Yin Yang ; Zhenjie Zhang ; Marianne Winslett ; Yong Yu

【Abstract】: Differential privacy (DP) enables publishing statistical query results over sensitive data, with rigorous privacy guarantees, and very conservative assumptions about the adversary's background knowledge. This paper focuses on the interactive DP framework, which processes incoming queries on the fly, each of which consumes a portion of the user-specified privacy budget. Existing systems process each query independently, which often leads to considerable privacy budget waste. Motivated by this, we propose Pioneer, a query optimizer for an interactive, DP-compliant DBMS. For each new query, Pioneer creates an execution plan that combines past query results and new results from the underlying data. When a query has multiple semantically equivalent plans, Pioneer automatically selects one with minimal privacy budget consumption. Extensive experiments confirm that Pioneer achieves significant savings of the privacy budget, and can answer many more queries than existing systems for a fixed total budget, with comparable result accuracy.

【Keywords】: data privacy; database management systems; query processing; DP-compliant DBMS; Pioneer query optimizer; database management system; differential privacy; differentially private data management system; interactive DP framework; privacy budget; query optimization; query processing; statistical query result; user-specified privacy budget; Accuracy; Databases; Equations; Noise; Noise measurement; Privacy; Sensitivity
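
The budget-aware plan choice can be caricatured as follows: answer a new count query either from scratch with fresh Laplace noise, or for free by combining previously released noisy answers that cover the same predicate. The sketch below only illustrates that trade-off under an invented query cache and data set; it is not the Pioneer optimizer, and it glosses over the formal sensitivity and composition arguments.

```python
import random

class DPFrontend:
    """Answers range-count queries, reusing past noisy answers when possible."""
    def __init__(self, data, total_budget):
        self.data = sorted(data)
        self.budget = total_budget
        self.cache = {}                  # (lo, hi) -> noisy count already released

    def _fresh(self, lo, hi, eps):
        true = sum(lo <= x < hi for x in self.data)
        # Difference of two exponentials = Laplace noise with scale 1/eps.
        noisy = true + random.expovariate(eps) - random.expovariate(eps)
        self.budget -= eps
        self.cache[(lo, hi)] = noisy
        return noisy

    def count(self, lo, hi, eps=0.1):
        if (lo, hi) in self.cache:                       # plan 1: reuse, costs no budget
            return self.cache[(lo, hi)]
        splits = [(a, b) for (a, b) in self.cache        # plan 2: stitch two cached halves
                  if a == lo and (b, hi) in self.cache]
        if splits:
            a, b = splits[0]
            return self.cache[(a, b)] + self.cache[(b, hi)]
        if self.budget < eps:
            raise RuntimeError("privacy budget exhausted")
        return self._fresh(lo, hi, eps)                  # plan 3: spend fresh budget

db = DPFrontend(data=[random.randint(0, 100) for _ in range(1000)], total_budget=1.0)
print(db.count(0, 50), db.count(50, 100), db.count(0, 100))   # third answer is free
```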

95. Top down plan generation: From theory to practice.

Paper Link】 【Pages】:1105-1116

【Authors】: Pit Fender ; Guido Moerkotte

【Abstract】: Finding the optimal execution order of join operations is a crucial task of today's cost-based query optimizers. There are two approaches to identify the best plan: bottom-up and top-down join enumeration. But only the top-down approach allows for branch-and-bound pruning, which can improve compile time by several orders of magnitude while still preserving optimality. For both optimization strategies, efficient enumeration algorithms have been published. However, there are two severe limitations for the top-down approach: The published algorithms can handle only (1) simple (binary) join predicates and (2) inner joins. Since real queries may contain complex join predicates involving more than two relations, and outer joins as well as other non-inner joins, efficient top-down join enumeration cannot be used in practice yet. We develop a novel top-down join enumeration algorithm that overcomes these two limitations. Furthermore, we show that our new algorithm is competitive when compared with the state of the art in bottom-up processing even without playing out its advantage by making use of its branch-and-bound pruning capabilities.

【Keywords】: query processing; tree searching; bottom-up join enumeration; bottom-up processing; branch-and-bound pruning capabilities; cost-based query optimizers; enumeration algorithms; optimization strategies; top down plan generation; top-down approach; top-down join enumeration algorithm; Abstracts; Dynamic programming; Generators; Heuristic algorithms; Joining processes; Optimization; Partitioning algorithms
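
Top-down join enumeration with branch-and-bound can be sketched as a memoized recursion that partitions a relation set into two halves and abandons a partition as soon as its left side already rules out an improvement. The toy below ignores join predicates and uses a made-up cost model (sum of fake intermediate-result sizes), so it is a sketch of the general search strategy, not the algorithm proposed in the paper.

```python
import math
from itertools import combinations

# Invented cardinalities of base relations.
CARD = {"R": 1000, "S": 100, "T": 10, "U": 5000}

def size(rels):
    """Crude intermediate-result size estimate (ignores real join selectivities)."""
    return math.prod(CARD[r] for r in rels) ** 0.5

def best_plan(rels, memo=None):
    """Top-down join enumeration: split the relation set into two parts, recurse
    on each part, and use branch-and-bound to skip partitions whose left side
    already rules out beating the best plan found so far for this set."""
    memo = {} if memo is None else memo
    rels = frozenset(rels)
    if rels in memo:
        return memo[rels]
    if len(rels) == 1:
        return 0.0, next(iter(rels))
    out_size = size(rels)
    best_cost, best_tree = float("inf"), None
    tried = set()
    for k in range(1, len(rels)):
        for left in map(frozenset, combinations(sorted(rels), k)):
            right = rels - left
            if right in tried:
                continue                      # symmetric partition already considered
            tried.add(left)
            lcost, ltree = best_plan(left, memo)
            if lcost + out_size >= best_cost:
                continue                      # prune: cannot improve on current best
            rcost, rtree = best_plan(right, memo)
            if lcost + rcost + out_size < best_cost:
                best_cost, best_tree = lcost + rcost + out_size, (ltree, rtree)
    memo[rels] = (best_cost, best_tree)
    return memo[rels]

print(best_plan({"R", "S", "T", "U"}))
```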

96. TBF: A memory-efficient replacement policy for flash-based caches.

Paper Link】 【Pages】:1117-1128

【Authors】: Cristian Ungureanu ; Biplob Debnath ; Stephen Rago ; Akshat Aranya

【Abstract】: The performance and capacity characteristics of flash storage make it attractive to use as a cache. Recency-based cache replacement policies rely on an in-memory full index, typically a B-tree or a hash table, that maps each object to its recency information. Even though the recency information itself may take very little space, the full index for a cache holding N keys requires at least log N bits per key. This metadata overhead is undesirably high when used for very large flash-based caches, such as key-value stores with billions of objects. To solve this problem, we propose a new RAM-frugal cache replacement policy that approximates the least-recently-used (LRU) policy. It uses two in-memory Bloom sub-filters (TBF) for maintaining the recency information and leverages an on-flash key-value store to cache objects. TBF requires only one byte of RAM per cached object, making it suitable for implementing very large flash-based caches. We evaluate TBF through simulation on traces from several block stores and key-value stores, as well as evaluate it using the Yahoo! Cloud Serving Benchmark in a real system implementation. Evaluation results show that TBF achieves cache hit rate and operations per second comparable to those of LRU in spite of its much smaller memory requirements.

【Keywords】: cache storage; cloud computing; flash memories; meta data; tree data structures; B-tree; LRU policy; RAM-frugal cache replacement policy; TBF; Yahoo! Cloud Serving Benchmark; cache object; capacity characteristics; flash storage; flash-based cache; hash table; in-memory Bloom subfilter; in-memory full index; least-recently-used policy; memory requirement; memory-efficient replacement policy; metadata overhead; on-flash key-value store; performance characteristics; recency information; recency-based cache replacement policy; Ash; Benchmark testing; Clocks; Data structures; Indexes; Memory management; Random access memory
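
A rough way to picture the two-Bloom-filter idea: keep a "current" and a "previous" sub-filter, record every access in the current one, and when it fills up discard the previous filter and swap the roles; an object found in neither filter is treated as least recently used and becomes an eviction candidate. The sketch below is a toy re-creation of that mechanism under invented sizes and a tiny workload, not the paper's data structure.

```python
import hashlib

class Bloom:
    def __init__(self, bits=256, hashes=3):
        self.bits, self.hashes, self.array, self.inserts = bits, hashes, 0, 0
    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.bits
    def add(self, key):
        for p in self._positions(key):
            self.array |= 1 << p
        self.inserts += 1
    def __contains__(self, key):
        return all(self.array >> p & 1 for p in self._positions(key))

class TBFCache:
    """Flash-cache directory: recency tracked by two rotating Bloom sub-filters."""
    def __init__(self, capacity, filter_capacity=None):
        self.capacity = capacity
        self.filter_capacity = filter_capacity or capacity
        self.objects = {}                          # key -> value ("on flash")
        self.current, self.previous = Bloom(), Bloom()

    def _touch(self, key):
        self.current.add(key)
        if self.current.inserts >= self.filter_capacity:   # rotate sub-filters
            self.previous, self.current = self.current, Bloom()

    def get(self, key):
        if key in self.objects:
            self._touch(key)
            return self.objects[key]
        return None

    def put(self, key, value):
        if key not in self.objects and len(self.objects) >= self.capacity:
            # Evict something recorded in neither filter (approximately LRU);
            # fall back to an arbitrary key if every object looks recent.
            victim = next((k for k in self.objects
                           if k not in self.current and k not in self.previous),
                          next(iter(self.objects)))
            del self.objects[victim]
        self.objects[key] = value
        self._touch(key)

cache = TBFCache(capacity=3, filter_capacity=4)
for k in ["a", "b", "c"] + ["a"] * 7 + ["d"]:      # "d" forces an eviction
    if cache.get(k) is None:
        cache.put(k, k.upper())
print(sorted(cache.objects))    # "b" is gone: it was seen in neither sub-filter
```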

97. Fast peak-to-peak behavior with SSD buffer pool.

Paper Link】 【Pages】:1129-1140

【Authors】: Jaeyoung Do ; Donghui Zhang ; Jignesh M. Patel ; David J. DeWitt

【Abstract】: A promising use of flash SSDs in a DBMS is to extend the main memory buffer pool by caching selected pages that have been evicted from the buffer pool. Such a use has been shown to produce significant gains in the steady state performance of the DBMS. One strategy for using the SSD buffer pool is to throw away the data in the SSD when the system is restarted (either when recovering from a crash or restarting after a shutdown), and consequently a long “ramp-up” period to regain peak performance is needed. One approach to eliminate this limitation is to use a memory-mapped file to store the SSD buffer table in order to be able to restore its contents on restart. However, this design can result in lower sustained performance, because every update to the SSD buffer table may incur an I/O operation to the memory-mapped file. In this paper we propose two new alternative designs. One design reconstructs the SSD buffer table using transactional logs. The other design asynchronously flushes the SSD buffer table, and upon restart, lazily verifies the integrity of the data cached in the SSD buffer pool. We have implemented these three designs in SQL Server 2012. For each design, both the write-through and write-back SSD caching policies were implemented. Using two OLTP benchmarks (TPC-C and TPC-E), our experimental results show that our designs produce up to 3.8X speedup on the interval between peak-to-peak performance, with negligible performance loss; in contrast, the previous approach has a similar speedup but up to 54% performance loss.

【Keywords】: SQL; cache storage; flash memories; system monitoring; transaction processing; DBMS; OLTP benchmarks; SQL Server 2012; SSD buffer pool; SSD buffer table; TPC-C benchmarks; TPC-E benchmarks; data cache integrity verification; flash SSD; lazy verification restart; log-based restart; main memory buffer pool; memory-mapped file; peak-to-peak behavior; transactional logs; write-back SSD caching policies; write-through SSD caching policies; Algorithm design and analysis; Buffer storage; Computer crashes; Data structures; Databases; Servers; Writing

98. SELECT triggers for data auditing.

Paper Link】 【Pages】:1141-1152

【Authors】: Daniel Fabbri ; Ravishankar Ramamurthy ; Raghav Kaushik

【Abstract】: Auditing is a key part of the security infrastructure in a database system. While commercial database systems provide mechanisms such as triggers that can be used to track and log any changes made to “sensitive” data using UPDATE queries, they are not useful for tracking accesses to sensitive data using complex SQL queries, which is important for many applications given recent laws such as HIPAA. In this paper, we propose the notion of SELECT triggers that extends triggers to work for SELECT queries in order to facilitate data auditing. We discuss the challenges in integrating SELECT triggers in a database system including specification, semantics as well as efficient implementation techniques. We have prototyped our framework in a commercial database system and present an experimental evaluation of our framework using the TPC-H benchmark.

【Keywords】: SQL; data analysis; database management systems; query processing; security of data; HIPAA; SELECT query; SELECT triggers; SQL query; TPC-H benchmark; UPDATE query; commercial database systems; data auditing; experimental evaluation; implementation techniques; security infrastructure; sensitive data; tracking accesses; Cancer; Database systems; Diseases; Security; Semantics

Paper Link】 【Pages】:1153-1164

【Authors】: Tao Cheng ; Kaushik Chakrabarti ; Surajit Chaudhuri ; Vivek R. Narasayya ; Manoj Syamala

【Abstract】: Retail is increasingly moving online. There are only a few big e-tailers but there is a long tail of small-sized e-tailers. The big e-tailers are able to collect significant data on user activities at their websites. They use these assets to derive insights about their products and to provide superior experiences for their users. On the other hand, small e-tailers do not possess such user data and hence cannot match the rich user experiences offered by big e-tailers. Our key insight is that web search engines possess significant data on user behaviors that can be used to help smaller e-tailers mine the same signals that big e-tailers derive from their proprietary user data assets. These signals can be exposed as data services in the cloud; e-tailers can leverage them to enable similar user experiences as the big e-tailers. We present three such data services in the paper: entity synonym data service, query-to-entity data service and entity tagging data service. The entity synonym service is an in-production data service that is currently available while the other two are data services currently in development at Microsoft. Our experiments on product datasets show (i) these data services have high quality and (ii) they have significant impact on user experiences on e-tailer websites. To the best of our knowledge, this is the first paper to explore the potential of using search engine data assets for e-tailers.

【Keywords】: cloud computing; retail data processing; search engines; user interfaces; Web search engine assets; Web site; cloud computing; entity synonym data service; entity tagging data service; query-to-entity data service; retailing; small-sized e-tailer; user activity; user behavior; user experience; Catalogs; Context; Digital cameras; Earth Observing System; Search engines; Web search

100. SAP HANA distributed in-memory database system: Transaction, session, and metadata management.

Paper Link】 【Pages】:1165-1173

【Authors】: Juchang Lee ; Yong Sik Kwon ; Franz Färber ; Michael Muehle ; Chulwon Lee ; Christian Bensberg ; Joo-Yeon Lee ; Arthur H. Lee ; Wolfgang Lehner

【Abstract】: One of the core principles of the SAP HANA database system is the comprehensive support of a distributed query facility. Supporting scale-out scenarios was one of the major design principles of the system from the very beginning. Within this paper, we first give an overview of the overall functionality with respect to data allocation, metadata caching and query routing. We then dive into some level of detail for specific topics and explain features and methods not common in traditional disk-based database systems. In summary, the paper provides a comprehensive overview of distributed query processing in the SAP HANA database to achieve scalability to handle large databases and heterogeneous types of workloads.

【Keywords】: distributed databases; meta data; query processing; SAP HANA database system; data allocation; disk based database system; distributed query processing; in-memory database system; metadata caching; metadata management; query facility; query routing; Business; Distributed databases; Query processing; Routing; Servers

101. HFMS: Managing the lifecycle and complexity of hybrid analytic data flows.

Paper Link】 【Pages】:1174-1185

【Authors】: Alkis Simitsis ; Kevin Wilkinson ; Umeshwar Dayal ; Meichun Hsu

【Abstract】: To remain competitive, enterprises are evolving their business intelligence systems to provide dynamic, near realtime views of business activities. To enable this, they deploy complex workflows of analytic data flows that access multiple storage repositories and execution engines and that span the enterprise and even outside the enterprise. We call these multi-engine flows hybrid flows. Designing and optimizing hybrid flows is a challenging task. Managing a workload of hybrid flows is even more challenging since their execution engines are likely under different administrative domains and there is no single point of control. To address these needs, we present a Hybrid Flow Management System (HFMS). It is an independent software layer over a number of independent execution engines and storage repositories. It simplifies the design of analytic data flows and includes optimization and executor modules to produce optimized executable flows that can run across multiple execution engines. HFMS dispatches flows for execution and monitors their progress. To meet service level objectives for a workload, it may dynamically change a flow's execution plan to avoid processing bottlenecks in the computing infrastructure. We present the architecture of HFMS and describe its components. To demonstrate its potential benefit, we describe performance results for running sample batch workloads with and without HFMS. The ability to monitor multiple execution engines and to dynamically adjust plans enables HFMS to provide better service guarantees and better system utilization.

【Keywords】: competitive intelligence; data analysis; storage management; HFMS; business intelligence system; complexity management; executor module; hybrid analytic data flow; hybrid flow management system; independent execution engine; independent software layer; lifecycle management; multiengine flows hybrid flow; optimization module; storage repository; system utilization; workload management; Business; Connectors; Databases; Engines; Fault tolerance; Monitoring; Optimization

102. KuaFu: Closing the parallelism gap in database replication.

Paper Link】 【Pages】:1186-1195

【Authors】: Chuntao Hong ; Dong Zhou ; Mao Yang ; Carbo Kuo ; Lintao Zhang ; Lidong Zhou

【Abstract】: Database systems are nowadays increasingly deployed on multi-core commodity servers, with replication to guard against failures. A database engine is best designed to scale with the number of cores to offer a high degree of parallelism on a modern multi-core architecture. On the other hand, replication traditionally resorts to a certain form of serialization for data consistency among replicas. In the widely used primary/backup replication with log shipping, concurrent execution on the primary and serialized log replay on a backup create a serious parallelism gap. Our experiment on MySQL with a 16-core configuration shows that the serial replay of a backup can sustain less than one third of the throughput achievable on the primary under an OLTP workload. This paper proposes KuaFu to close the parallelism gap in replicated database systems by enabling concurrent replay of transactions on a backup. KuaFu maintains write consistency on backups by tracking transaction dependencies. Concurrent replay on a backup does introduce read inconsistency between the primary and backups. KuaFu further leverages multi-version concurrency control to produce snapshots in order to restore the consistency semantics. We have implemented KuaFu on MySQL; our evaluations show that KuaFu allows a backup to keep up with the primary while preserving replication consistency.

【Keywords】: SQL; concurrency control; data mining; multiprocessing systems; replicated databases; KuaFu; MySQL; OLTP workload; backup replication; concurrent executions; concurrent replay; data consistency; database engine; database replication; log shipping; modern multicore architecture; multicore commodity servers; multiversion concurrency control; parallelism gap; preserving replication consistency; primary replication; replicated database systems; serialized log replay; Database systems; Engines; Parallel processing; Semantics; Servers; Throughput
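
The dependency tracking behind concurrent replay can be pictured as: scan the primary's log in commit order, make each transaction depend on the earlier transactions that wrote any of the same keys, and then replay transactions concurrently once their dependencies have been applied. The log format and threading below are invented for illustration; the real system is far more involved.

```python
import threading
from collections import defaultdict

# Invented log format: (txn_id, {key: new_value}) in commit order on the primary.
log = [
    (1, {"x": 10}),
    (2, {"y": 20}),
    (3, {"x": 11, "z": 5}),   # depends on txn 1 (both write x)
    (4, {"y": 21}),           # depends on txn 2
]

def build_dependencies(log):
    """txn -> set of earlier txns it must wait for (write-write conflicts)."""
    last_writer, deps = {}, defaultdict(set)
    for tid, writes in log:
        for key in writes:
            if key in last_writer:
                deps[tid].add(last_writer[key])
            last_writer[key] = tid
    return deps

def concurrent_replay(log, deps):
    backup, lock = {}, threading.Lock()
    done = {tid: threading.Event() for tid, _ in log}

    def replay(tid, writes):
        for dep in deps[tid]:
            done[dep].wait()              # block until every dependency is applied
        with lock:
            backup.update(writes)         # apply this transaction's writes
        done[tid].set()

    threads = [threading.Thread(target=replay, args=entry) for entry in log]
    for t in threads: t.start()
    for t in threads: t.join()
    return backup

deps = build_dependencies(log)
print(dict(deps), concurrent_replay(log, deps))
```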

103. Materialization strategies in the Vertica analytic database: Lessons learned.

Paper Link】 【Pages】:1196-1207

【Authors】: Lakshmikant Shrinivas ; Sreenath Bodagala ; Ramakrishna Varadarajan ; Ariel Cary ; Vivek Bharathan ; Chuck Bear

【Abstract】: Column store databases allow for various tuple reconstruction strategies (also called materialization strategies). Early materialization is easy to implement but generally performs worse than late materialization. Late materialization is more complex to implement, and usually performs much better than early materialization, although there are situations where it is worse. We identify these situations, which essentially revolve around joins where neither input fits in memory (also called spilling joins). Sideways information passing techniques provide a viable solution to get the best of both worlds. We demonstrate how early materialization combined with sideways information passing allows us to get the benefits of late materialization, without the bookkeeping complexity or worse performance for spilling joins. It also provides some other benefits to query processing in Vertica due to positive interaction with compression and sort orders of the data. In this paper, we report our experiences with late and early materialization, highlight their strengths and weaknesses, and present the details of our sideways information passing implementation. We show experimental results of comparing these materialization strategies, which highlight the significant performance improvements provided by our implementation of sideways information passing (up to 72% on some TPC-H queries).

【Keywords】: data compression; database management systems; query processing; storage management; TPC-H query; Vertica analytic database; bookkeeping complexity; column store database; data compression; late materialization; materialization strategy; memory; query processing; sideways information passing technique; spilling joins; tuple reconstruction strategy; Complexity theory; Containers; Context; Data models; Engines; Query processing
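
Sideways information passing combined with early materialization can be pictured as: build the hash table on the smaller join input first, derive a key filter from it (a plain set here, standing in for a Bloom filter), and push that filter into the scan of the larger input so non-joining rows are dropped before they are materialized or spilled. The sketch below is a generic hash-join illustration over made-up tables, not Vertica's execution engine.

```python
def hash_join_with_sip(small, large, small_key, large_key):
    """Early-materialized hash join; a key filter from the build side is passed
    sideways into the probe-side scan to drop non-joining rows early."""
    build = {}
    for row in small:                        # build side: fully materialized
        build.setdefault(row[small_key], []).append(row)

    sip_filter = set(build)                  # stand-in for a Bloom filter on join keys

    for row in large:                        # probe-side scan
        if row[large_key] not in sip_filter:
            continue                         # filtered before any further work
        for match in build[row[large_key]]:
            yield {**match, **row}

dims = [{"d_id": 1, "region": "EU"}, {"d_id": 2, "region": "US"}]
facts = [{"d_id": 1, "amount": 10}, {"d_id": 3, "amount": 7}, {"d_id": 2, "amount": 4}]
print(list(hash_join_with_sip(dims, facts, "d_id", "d_id")))
```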

104. Pipe failure prediction: A data mining method.

Paper Link】 【Pages】:1208-1218

【Authors】: Rui Wang ; Weishan Dong ; Yu Wang ; Ke Tang ; Xin Yao

【Abstract】: Pipe breaks in an urban water distribution network lead to significant economic and social costs, putting the service quality as well as the profit of water utilities at risk. To cope with such a situation, scheduled preventive maintenance is desired, which aims to predict and fix potential break pipes proactively. Physical models developed for understanding and predicting the failure of pipes are usually expensive, and thus can only be used on a limited number of trunk pipes. As an alternative, statistical models that try to predict pipe breaks based on historical data are far less expensive, and have therefore attracted a lot of interest from water utilities recently. In this paper, we report a novel data mining prediction system that has been built for a water utility in a big Chinese city. Various aspects of how to build such a system are described, including problem formulation, data cleaning, model construction, as well as evaluating the importance of attributes according to the requirements of end users in water utilities. Satisfactory results have been achieved by our prediction system. For example, with the system trained on the available dataset at the end of 2010, the water utility would avoid 50% of pipe breaks in 2011 by examining only 6.98% of its pipes in advance. During the construction of the system, we found that the extremely skewed distribution of break and non-break pipes, interestingly, is not an obstacle. This lesson could serve as a practical reference both for academic studies on imbalanced learning and for future explorations of pipe failure prediction problems.

【Keywords】: data handling; data mining; failure (mechanical); failure analysis; learning (artificial intelligence); mechanical engineering computing; pipes; preventive maintenance; statistical distributions; water supply; Chinese city; data cleaning; data mining prediction system; economical cost; extremely skew distribution; historical data; imbalanced learning; model construction; physical model; pipe break; pipe failure prediction problem; problem formulation; scheduled preventive maintenance; service quality; social cost; statistical model; trunk pipe; urban water distribution network; water utility; Data models; Materials; Prediction algorithms; Predictive models; Preventive maintenance; Testing; Training
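
The workflow described above essentially ranks pipes by a predicted break probability and inspects only the riskiest few percent. A minimal sketch in that spirit is shown below; the features, model choice, and all numbers are invented, the evaluation is on the training data purely for illustration, and the paper's actual system differs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic pipe records: [age in years, diameter in mm, number of past breaks].
n = 5000
X = np.column_stack([rng.uniform(0, 80, n),
                     rng.choice([100, 200, 300], n),
                     rng.poisson(0.2, n)])
# Invented ground truth: break probability grows with age and past breaks.
p_break = 1 / (1 + np.exp(-(0.05 * X[:, 0] + 1.5 * X[:, 2] - 6)))
y = rng.random(n) < p_break                  # highly imbalanced labels

# class_weight="balanced" is one simple way to cope with the skewed classes.
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Rank all pipes by predicted risk and "inspect" only the top 7%.
risk = model.predict_proba(X)[:, 1]
budget = int(0.07 * n)
inspected = np.argsort(risk)[::-1][:budget]
print(f"breaks covered by inspecting 7% of pipes: "
      f"{y[inspected].sum() / max(y.sum(), 1):.0%}")
```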

105. SASH: Enabling continuous incremental analytic workflows on Hadoop.

Paper Link】 【Pages】:1219-1230

【Authors】: Manish Sethi ; Narendran Sachindran ; Sriram Raghavan

【Abstract】: There is an emerging class of enterprise applications in areas such as log data analysis, information discovery, and social media marketing that involve analytics over large volumes of unstructured and semi-structured data. These applications are leveraging new analytics platforms based on the MapReduce framework and its open source Hadoop implementation. While this trend has engendered work on high-level data analysis languages, NoSQL data stores, workflow engines etc., there has been very little attention to the challenges of deploying analytic workflows into production for continuous operation. In this paper, we argue that an essential platform component for enabling continuous production analytic workflows is an analytics store. We highlight five key requirements that impact the design of such a store: (i) efficient incremental operations, (ii) flexible storage model for hierarchical data, (iii) snapshot support (iv) object-level incremental updates, and (v) support for handling change sets. We describe the design of SASH, a scalable analytics store that we have developed on top of HBase to address these requirements. Using the workload from a production workflow that powers search within IBM's intranet and extranet, we demonstrate orders of magnitude improvement in IO performance using SASH.

【Keywords】: SQL; data analysis; data mining; data structures; high level languages; intranets; object-oriented programming; public domain software; social networking (online); storage management; workflow management software; HBase; IBM intranet and extranet; IO performance; MapReduce framework; NoSQL data stores; SASH; continuous incremental analytic workflows; continuous operation; continuous production analytic workflows; enterprise applications; flexible storage model; handling change sets support; hierarchical data; high-level data analysis languages; incremental operations; information discovery; log data analysis; object-level incremental updates; open source Hadoop implementation; platform component; production workflow; scalable analytics store; semistructured data; snapshot support; social media marketing; unstructured data; workflow engines; Computer architecture; Data models; Indexing; Libraries; Media; Production

106. Automating pattern discovery for rule based data standardization systems.

Paper Link】 【Pages】:1231-1241

【Authors】: Snigdha Chaturvedi ; K. Hima Prasad ; Tanveer A. Faruquie ; Bhupesh Chawda ; L. Venkata Subramaniam ; Raghu Krishnapuram

【Abstract】: Data quality is a perennial problem for many enterprise data assets. To improve data quality, businesses often employ rule based data standardization systems in which domain experts code rules for handling important and prevalent patterns. Finding these patterns is laborious and time consuming, particularly for noisy or highly specialized data sets, and the outcome depends subjectively on the persons determining the patterns. In this paper we present a tool to automatically mine patterns that can help improve the efficiency and effectiveness of these data standardization systems. The automatically extracted patterns are used by the domain and knowledge experts for rule writing. We use a greedy algorithm to extract patterns that result in a maximal coverage of the data. We further group the extracted patterns such that each group represents patterns that capture similar domain knowledge. We propose a similarity measure that uses input pattern semantics to group these patterns. We demonstrate the effectiveness of our method for standardization tasks on three real world datasets.

【Keywords】: business data processing; data mining; greedy algorithms; knowledge based systems; standardisation; automatically pattern mining; data quality; domain experts; domain knowledge; enterprise data assets; greedy algorithm; knowledge experts; pattern discovery automation; pattern extraction; pattern handling; perennial problem; real world datasets; rule based data standardization systems; rule writing; similarity measure; Buildings; Data mining; Noise measurement; Semantics; Standards; Writing
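
The greedy coverage step can be read as greedy set cover over token-type patterns: repeatedly pick the pattern that matches the most still-uncovered records until a budget is reached. The token types, records, and budget below are invented, and the paper's pattern language and grouping step are richer than this sketch.

```python
import re
from collections import Counter

def tokenize(record):
    """Map each token to a coarse type, e.g. '12 Main Street' -> 'NUM WORD WORD'."""
    types = []
    for tok in record.split():
        if tok.isdigit():
            types.append("NUM")
        elif re.fullmatch(r"[A-Za-z]+", tok):
            types.append("WORD")
        else:
            types.append("ALNUM")
    return " ".join(types)

def greedy_patterns(records, max_patterns=3):
    """Greedy set cover: pick the pattern covering the most uncovered records."""
    uncovered = set(records)
    chosen = []
    while uncovered and len(chosen) < max_patterns:
        counts = Counter(tokenize(r) for r in uncovered)
        pattern, covered = counts.most_common(1)[0]
        chosen.append((pattern, covered))
        uncovered = {r for r in uncovered if tokenize(r) != pattern}
    return chosen

records = ["12 Main Street", "98 Oak Avenue", "Flat 4B Hill Road",
           "7 Elm Street", "Flat 9C Lake Road"]
print(greedy_patterns(records))
```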

107. Machine learning on Big Data.

Paper Link】 【Pages】:1242-1244

【Authors】: Tyson Condie ; Paul Mineiro ; Neoklis Polyzotis ; Markus Weimer

【Abstract】: Statistical Machine Learning has undergone a phase transition from a pure academic endeavor to being one of the main drivers of modern commerce and science. Even more so, recent results such as those on tera-scale learning [1] and on very large neural networks [2] suggest that scale is an important ingredient in quality modeling. This tutorial introduces current applications, techniques and systems with the aim of cross-fertilizing research between the database and machine learning communities. The tutorial covers current large scale applications of Machine Learning, their computational model and the workflow behind building those. Based on this foundation, we present the current state-of-the-art in systems support in the bulk of the tutorial. We also identify critical gaps in the state-of-the-art. This leads to the closing of the seminar, where we introduce two sets of open research questions: Better systems support for the already established use cases of Machine Learning and support for recent advances in Machine Learning research.

【Keywords】: data analysis; learning (artificial intelligence); big data; cross fertilizing research; database; neural networks; phase transition; pure academic endeavor; quality modeling; statistical machine learning; tera scale learning; tutorial; Big data; Communities; Computational modeling; Databases; Machine learning algorithms; Seminars; Tutorials

108. Big data integration.

Paper Link】 【Pages】:1245-1248

【Authors】: Xin Luna Dong ; Divesh Srivastava

【Abstract】: The Big Data era is upon us: data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Since the value of data explodes when it can be linked and fused with other data, addressing the big data integration (BDI) challenge is critical to realizing the promise of Big Data. BDI differs from traditional data integration in many dimensions: (i) the number of data sources, even for a single domain, has grown to be in the tens of thousands, (ii) many of the data sources are very dynamic, as a huge amount of newly collected data are continuously made available, (iii) the data sources are extremely heterogeneous in their structure, with considerable variety even for substantially similar entities, and (iv) the data sources are of widely differing qualities, with significant differences in the coverage, accuracy and timeliness of data provided. This seminar explores the progress that has been made by the data integration community on the topics of schema mapping, record linkage and data fusion in addressing these novel challenges faced by big data integration, and identifies a range of open problems for the community.

【Keywords】: data integration; decision making; sensor fusion; BDI; big data integration; data driven decision making; data fusion; data integration community; data sources; record linkage; schema mapping; unprecedented scale; Big data; Couplings; Data integration; Databases; Educational institutions; Joining processes; Seminars

109. Workload management for Big Data analytics.

Paper Link】 【Pages】:1249

【Authors】: Ashraf Aboulnaga ; Shivnath Babu

【Abstract】: Parallel database systems and MapReduce systems are essential components of today's infrastructure for Big Data analytics. These systems process multiple concurrent workloads consisting of complex user requests, where each request is associated with an (explicit or implicit) service level objective.

【Keywords】: data analysis; parallel databases; MapReduce systems; big data analytics; complex user requests; parallel database systems; service level objective; workload management; Admission control; Big data; Databases; Educational institutions; Processor scheduling; Resource management; Tutorials

110. Knowledge harvesting from text and Web sources.

Paper Link】 【Pages】:1250-1253

【Authors】: Fabian M. Suchanek ; Gerhard Weikum

【Abstract】: The proliferation of knowledge-sharing communities such as Wikipedia and the progress in scalable information extraction from Web and text sources has enabled the automatic construction of very large knowledge bases. Recent endeavors of this kind include academic research projects such as DBpedia, KnowItAll, Probase, ReadTheWeb, and YAGO, as well as industrial ones such as Freebase and Trueknowledge. These projects provide automatically constructed knowledge bases of facts about named entities, their semantic classes, and their mutual relationships. Such world knowledge in turn enables cognitive applications and knowledge-centric services like disambiguating natural-language text, deep question answering, and semantic search for entities and relations in Web and enterprise data. Prominent examples of how knowledge bases can be harnessed include the Google Knowledge Graph and the IBM Watson question answering system. This tutorial presents state-of-the-art methods, recent advances, research opportunities, and open challenges along this avenue of knowledge harvesting and its applications.

【Keywords】: Internet; knowledge based systems; natural language processing; text analysis; DBpedia; Freebase; Google knowledge graph; IBM Watson question answering system; KnowItAll; Probase; ReadTheWeb; Web data; Web sources; Wikipedia; YAGO; automatic construction; cognitive applications; deep question answering; enterprise data; knowledge bases; knowledge centric services; knowledge harvesting; knowledge sharing communities; mutual relationships; natural-language text; scalable information extraction; semantic classes; semantic search; text sources; Electronic publishing; Encyclopedias; Information retrieval; Internet; Knowledge based systems; Semantics

111. Sorting in Space: Multidimensional, spatial, and metric data structures for applications in spatial databases, geographic information systems (GIS), and location-based services.

Paper Link】 【Pages】:1254-1257

【Authors】: Hanan Samet

【Abstract】: Techniques for representing multidimensional, spatial, and metric data for applications in spatial databases, geographic information systems (GIS), and location-based services are reviewed. This includes both geometric and textual representations of spatial data.

【Keywords】: data structures; geographic information systems; mobile computing; visual databases; GIS; geographic information system; geometric representation; location-based service; metric data structure; multidimensional data structure; spatial data structure; spatial database; textual representation; Geographic information systems; Indexes; Measurement; Octrees; Sorting; Spatial databases

112. Triples in the clouds.

Paper Link】 【Pages】:1258-1261

【Authors】: Zoi Kaoudi ; Ioana Manolescu

【Abstract】: The W3C's Resource Description Framework (or RDF, in short) is a promising candidate which may deliver many of the original semi-structured data promises: flexible structure, optional schema, and rich, flexible URIs as a basis for information sharing. Moreover, RDF is uniquely positioned to benefit from the efforts of scientific communities studying databases, knowledge representation, and Web technologies. Many RDF data collections are being published, going from scientific data to general-purpose ontologies to open government data, in particular in the Linked Data movement. Managing such large volumes of RDF data is challenging, due to the sheer size, the heterogeneity, and the further complexity brought by RDF reasoning. To tackle the size challenge, distributed storage architectures are required. Cloud computing is an emerging paradigm massively adopted in many applications for the scalability, fault-tolerance and elasticity features it provides. This tutorial discusses the problems involved in efficiently handling massive amounts of RDF data in a cloud environment. We provide the necessary background, analyze and classify existing solutions, and discuss open problems and perspectives.

【Keywords】: cloud computing; database management systems; fault tolerant computing; inference mechanisms; memory architecture; ontologies (artificial intelligence); RDF data collection; RDF data management; RDF reasoning; W3C; Web technology; cloud computing; cloud environment; database; distributed storage architecture; fault-tolerance; flexible URI; flexible structure; general-purpose ontology; information sharing; knowledge representation; linked data movement; open government data; optional schema; resource description framework; scientific community; scientific data; semistructured data; Cognition; Conferences; Query processing; Resource description framework; Tutorials

113. Querying encrypted data.

Paper Link】 【Pages】:1262-1263

【Authors】: Arvind Arasu ; Ken Eguro ; Raghav Kaushik ; Ravi Ramamurthy

【Abstract】: Data security is a serious concern when we migrate data to a cloud DBMS. Database encryption, where sensitive columns are encrypted before they are stored in the cloud, has been proposed as a mechanism to address such data security concerns. The intuitive expectation is that an adversary cannot “learn” anything about the encrypted columns, since she does not have access to the encryption key. However, query processing becomes a challenge since it needs to “look inside” the data. This tutorial explores the space of designs studied in prior work on processing queries over encrypted data. We cover approaches based on both classic client-server and involving the use of a trusted hardware module where data can be securely decrypted. We discuss the privacy challenges that arise in both approaches and how they may be addressed. Briefly, supporting the full complexity of a modern DBMS including complex queries, transactions and stored procedures leads to significant challenges that we survey.

【Keywords】: client-server systems; cloud computing; cryptography; database management systems; query processing; trusted computing; client-server; cloud DBMS; complex queries; data security; database encryption; encrypted data querying; encryption key; trusted hardware module; Encryption; Hardware; Query processing; Tutorials

114. Shallow Information Extraction for the knowledge Web.

Paper Link】 【Pages】:1264-1267

【Authors】: Denilson Barbosa ; Haixun Wang ; Cong Yu

【Abstract】: A new breed of Information Extraction tools has become popular and shown to be very effective in building massive-scale knowledge bases that fuel applications such as question answering and semantic search. These approaches rely on Web-scale probabilistic models populated through shallow language processing of the text, pre-existing knowledge, and structured data already on the Web. This tutorial provides an introduction to these techniques, starting from the foundations of information extraction, and covering some of its key applications.

【Keywords】: Internet; knowledge based systems; natural language processing; question answering (information retrieval); text analysis; Web-scale probabilistic models; information extraction tools; knowledge Web; massive-scale knowledge bases; pre-existing knowledge; question answering; semantic search; shallow information extraction; shallow language processing; structured data; text processing; Data mining; Electronic publishing; Encyclopedias; Information retrieval; Knowledge based systems; Semantics

115. Secure and privacy-preserving database services in the cloud.

Paper Link】 【Pages】:1268-1271

【Authors】: Divyakant Agrawal ; Amr El Abbadi ; Shiyuan Wang

【Abstract】: Cloud computing has become a very successful paradigm for data computing and storage. However, increasing concerns about data security and privacy in the cloud have arisen. Ensuring security and privacy for data management and query processing in the cloud is critical for better and broader uses of the cloud. This tutorial covers recent research on cloud security and privacy, focusing on work that protects data confidentiality and query access privacy for sensitive data being stored and queried in the cloud. We provide a comprehensive study of state-of-the-art schemes and techniques for protecting data confidentiality and access privacy, and explain their tradeoffs in security, privacy, functionality and performance.

【Keywords】: cloud computing; data privacy; database management systems; security of data; storage management; cloud computing; data computing; data storage; privacy-preserving database services; secure database services; Data privacy; Encryption; Query processing; Tutorials

116. Twitter+: Build personalized newspaper for Twitter.

Paper Link】 【Pages】:1272-1275

【Authors】: Chen Liu ; Anthony K. H. Tung

【Abstract】: Nowadays, microblogging services such as Twitter play important roles in people's everyday lives. They enable users to publish and read text-based posts, known as “tweets”, and to interact with each other through re-tweeting or commenting. In the literature, many efforts have been devoted to exploiting the social property of Twitter. However, beyond the social component, Twitter itself has become an indispensable source for users to acquire useful information. To maximize its value, more attention should be paid to the media property of Twitter. To be good media, the first requirement is an effective presentation of its news so that users can read it easily. Currently, all tweets from followings are presented to users, usually organized by their published timelines or coming sources. However, presenting tweets along so few dimensions hinders users from conveniently finding the information they are interested in. In this demo, we present “Twitter+”, which aims to enrich users' reading experiences in Twitter by providing multiple ways for them to explore tweets, such as keyword presentation and topic finding. It offers users an alternative interface to browse tweets more effectively.

【Keywords】: social aspects of automation; social networking (online); Twitter+; keyword presentation; media property; microblogging services; personalized newspaper; published timelines; retweeting; social property; text-based posts; topic finding; user reading experiences; Electronic publishing; Encyclopedias; Internet; Kernel; Media; Twitter

117. A generic database benchmarking service.

Paper Link】 【Pages】:1276-1279

【Authors】: Martin Kaufmann ; Peter M. Fischer ; Donald Kossmann ; Norman May

【Abstract】: Benchmarks are widely applied for the development and optimization of database systems. Standard benchmarks such as TPC-C and TPC-H provide a way of comparing the performance of different systems. In addition, micro benchmarks can be exploited to test a specific behavior of a system. Yet, despite all the benefits that can be derived from benchmark results, the effort of implementing and executing benchmarks remains prohibitive: Database systems need to be set up, a large number of artifacts such as data generators and queries need to be managed and complex, time-consuming operations have to be orchestrated. In this demo, we introduce a generic benchmarking service that combines a rich meta model, low marginal cost and ease of use, which drastically reduces the time and cost to define, adapt and run a benchmark.

【Keywords】: benchmark testing; database management systems; software performance evaluation; TPC-C; TPC-H; data generator; database system development; database system optimization; generic database benchmarking service; meta model; microbenchmark; Benchmark testing; Data models; Databases; Generators; Servers; Standards; Tuning

118. Aeolus: An optimizer for distributed intra-node-parallel streaming systems.

Paper Link】 【Pages】:1280-1283

【Authors】: Matthias J. Sax ; Malú Castellanos ; Qiming Chen ; Meichun Hsu

【Abstract】: Aeolus is a prototype implementation of a topology optimizer on top of the distributed streaming system Storm. Aeolus extends Storm with a batching layer which can increase a topology's throughput by more than one order of magnitude. Furthermore, Aeolus implements an optimization algorithm that automatically computes the optimal batch size and degree of parallelism for each node in the topology. Even though Aeolus is built on top of Storm, the developed concepts are not limited to Storm and can be applied to any distributed intra-node-parallel streaming system. We propose to demo Aeolus using an interactive Web UI. One part of the Web UI is a topology builder allowing the user to interact with the system. Topologies can be created from scratch and their structure and/or parameters can be modified. Furthermore, the user is able to observe the impact of the changes on the optimization decisions and runtime behavior. Additionally, the Web UI gives deep insight into the optimization process by visualizing it. The user can interactively step through the optimization process while the UI shows the optimizer's state, computations, and decisions. The Web UI is also able to monitor the execution of a non-optimized and an optimized topology simultaneously, showing the advantage of using Aeolus.

【Keywords】: Internet; optimisation; parallel processing; topology; user interfaces; Aeolus; Storm; batching layer; distributed intra-node-parallel streaming systems; distributed streaming system; interactive Web UI; optimal batch size; optimization algorithm; optimization decisions; optimization process; optimized topology; prototype implementation; runtime behavior; topology builder; topology optimizer; topology throughput; Monitoring; Network topology; Optimization; Parallel processing; Storms; Throughput; Topology

119. Crowd-answering system via microblogging.

Paper Link】 【Pages】:1284-1287

【Authors】: Xianke Zhou ; Ke Chen ; Sai Wu ; Bingbing Zhang

【Abstract】: Most crowdsourcing systems leverage public platforms, such as Amazon Mechanical Turk (AMT), to publish their jobs and collect the results. They are charged for using the platform's service and they are also required to pay the workers for each successful job. Although the average wage of an online human worker is not high, for a 24×7 running service the crowdsourcing system becomes very expensive to maintain. We observe that there are, in fact, many sources that can provide free online human volunteers, and microblogging systems are among the most promising ones. In this paper, we present our CrowdAnswer system, which is built on top of Weibo, the largest microblogging system in China. CrowdAnswer is a question-answering system that distributes various questions to different groups of microblogging users adaptively. The answers are then collected from those users' tweets and visualized for the question originator. CrowdAnswer maintains a virtual credit system: users need credits to publish questions and they can gain credits by answering questions. A novel algorithm is proposed to route questions to the interested users, trying to maximize the probability of a question being answered successfully.

【Keywords】: question answering (information retrieval); social networking (online); AMT; Amazon mechanical turk; China; CrowdAnswer system; Weibo; crowd-answering system; crowdsourcing systems; microblogging system; online human volunteers; public platforms; question originator; question-answering system; user tweets; Broadcasting; Crowdsourcing; Histograms; Predictive models; Radio frequency; Real-time systems; Visualization

120. With a little help from my friends.

Paper Link】 【Pages】:1288-1291

【Authors】: Arnab Nandi ; Stelios Paparizos ; John C. Shafer ; Rakesh Agrawal

【Abstract】: A typical person has numerous online friends that, according to studies, the person often consults for opinions and advice. However, public broadcasting a question to all friends risks social capital when repeated too often, is not tolerant to topic sensitivity, and can result in no response, as the message is lost in a myriad of status updates. Direct messaging is more personal and avoids these pitfalls, but requires manual selection of friends to contact, which can be time consuming and challenging. A user may have difficulty guessing which of their numerous online friends can provide a high quality and timely response. We demonstrate a working system that addresses these issues by returning an ordered subset of friends predicting (a) near-term availability, (b) willingness to respond and (c) topical knowledge, given a query. The combination of these three aspects are unique to our solution, and all are critical to the problem of obtaining timely and relevant responses. Our system acts as a decision aid - we give insight into why each friend was recommended and let the user decide whom to contact.

【Keywords】: query processing; social networking (online); direct messaging; near-term availability; online friends; public broadcasting; query; social capital; status update; topic sensitivity; topical knowledge; Availability; Bars; Electronic mail; Indexes; Search engines; Twitter
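
The ranking described above combines three per-friend signals. A minimal scoring sketch is shown below; the particular combination (a simple product) and the availability, willingness, and topic numbers are all assumptions made for illustration, whereas the actual system learns these signals from data.

```python
def rank_friends(friends, query_topics, top_n=2):
    """Order friends by the product of availability, willingness and topic match."""
    def score(f):
        topical = len(set(f["topics"]) & set(query_topics)) / max(len(query_topics), 1)
        return f["availability"] * f["willingness"] * topical
    ranked = sorted(friends, key=score, reverse=True)
    return [(f["name"], round(score(f), 3)) for f in ranked[:top_n]]

friends = [
    {"name": "Ada",   "availability": 0.9, "willingness": 0.6, "topics": ["databases", "ml"]},
    {"name": "Grace", "availability": 0.4, "willingness": 0.9, "topics": ["compilers"]},
    {"name": "Alan",  "availability": 0.7, "willingness": 0.8, "topics": ["ml", "crypto"]},
]
print(rank_friends(friends, query_topics=["ml"]))
```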

121. Peeking into the optimization of data flow programs with MapReduce-style UDFs.

Paper Link】 【Pages】:1292-1295

【Authors】: Fabian Hueske ; Mathias Peters ; Aljoscha Krettek ; Matthias Ringwald ; Kostas Tzoumas ; Volker Markl ; Johann-Christoph Freytag

【Abstract】: Data flows are a popular abstraction to define dataintensive processing tasks. In order to support a wide range of use cases, many data processing systems feature MapReduce-style user-defined functions (UDFs). In contrast to UDFs as known from relational DBMS, MapReduce-style UDFs have less strict templates. These templates do not alone provide all the information needed to decide whether they can be reordered with relational operators and other UDFs. However, it is well-known that reordering operators such as filters, joins, and aggregations can yield runtime improvements by orders of magnitude. We demonstrate an optimizer for data flows that is able to reorder operators with MapReduce-style UDFs written in an imperative language. Our approach leverages static code analysis to extract information from UDFs which is used to reason about the reorderbility of UDF operators. This information is sufficient to enumerate a large fraction of the search space covered by conventional RDBMS optimizers including filter and aggregation push-down, bushy join orders, and choice of physical execution strategies based on interesting properties. We demonstrate our optimizer and a job submission client that allows users to peek step-by-step into each phase of the optimization process: the static code analysis of UDFs, the enumeration of reordered candidate data flows, the generation of physical execution plans, and their parallel execution. For the demonstration, we provide a selection of relational and nonrelational data flow programs which highlight the salient features of our approach.

【Keywords】: data flow computing; optimisation; relational databases; user interfaces; MapReduce; UDF; data flow programs; data processing systems; data-intensive processing; imperative language; optimization; relational DBMS; user-defined functions; Data mining; Data processing; Data visualization; Optimization; Programming; Query processing; Runtime

122. Very fast estimation for result and accuracy of big data analytics: The EARL system.

Paper Link】 【Pages】:1296-1299

【Authors】: Nikolay Laptev ; Kai Zeng ; Carlo Zaniolo

【Abstract】: Approximate results based on samples often provide the only way in which advanced analytical applications on very massive data sets (a.k.a. `big data') can satisfy their time and resource constraints. Unfortunately, methods and tools for the computation of accurate early results are currently not supported in big data systems (e.g., Hadoop). Therefore, we propose a nonparametric accuracy estimation method and system to speedup big data analytics. Our framework is called EARL (Early Accurate Result Library) and it works by predicting the learning curve and choosing the appropriate sample size for achieving the desired error bound specified by the user. The error estimates are based on a technique called bootstrapping that has been widely used and validated by statisticians, and can be applied to arbitrary functions and data distributions. Therefore, this demo will elucidate (a) the functionality of EARL and its intuitive GUI interface whereby first-time users can appreciate the accuracy obtainable from increasing sample sizes by simply viewing the learning curve displayed by EARL, (b) the usability of EARL, whereby conference participants can interact with the system to quickly estimate the sample sizes needed to obtain the desired accuracies or response times, and then compare them against the accuracies and response times obtained in the actual computations.

【Keywords】: data analysis; graphical user interfaces; statistical analysis; EARL system; advanced analytical applications; arbitrary functions; big data analytics; bootstrapping; data distributions; early accurate result library; intuitive GUI interface; massive data sets; statisticians; Accuracy; Big data; Computational modeling; Data mining; Error analysis; Estimation; Time factors
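
As a rough illustration of the bootstrapping idea, the sketch below (Python, standard library) resamples a sample to estimate the standard error of an aggregate and grows the sample until a user-specified error target is met. The function names and the doubling schedule are illustrative assumptions, not EARL's actual learning-curve predictor.

# Bootstrap-based accuracy estimation sketch: resample the current sample to
# estimate the standard error of an aggregate, and grow the sample until the
# error falls below the user's target. Function names and the doubling schedule
# are illustrative, not EARL's learning-curve predictor.
import random
import statistics

def bootstrap_error(sample, agg=statistics.mean, resamples=200):
    """Estimate the standard error of `agg` on `sample` by resampling with replacement."""
    estimates = [agg(random.choices(sample, k=len(sample))) for _ in range(resamples)]
    return statistics.stdev(estimates)

def smallest_sufficient_sample(data, target_error, start=100, growth=2.0):
    """Grow the sample until the bootstrapped error meets the target."""
    n = start
    while n <= len(data):
        sample = random.sample(data, n)
        err = bootstrap_error(sample)
        if err <= target_error:
            return n, err
        n = int(n * growth)
    return len(data), bootstrap_error(data)

data = [random.gauss(50, 10) for _ in range(100000)]
print(smallest_sufficient_sample(data, target_error=0.5))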

123. Road network mix-zones for anonymous location based services.

Paper Link】 【Pages】:1300-1303

【Authors】: Balaji Palanisamy ; Sindhuja Ravichandran ; Ling Liu ; Binh Han ; Kisung Lee ; Calton Pu

【Abstract】: We present MobiMix, a road network based mix-zone framework to protect the location privacy of mobile users traveling on road networks. An alternative and complementary approach to spatial cloaking based location privacy protection is to break the continuity of location exposure by introducing techniques, such as mix-zones, where no applications can trace user movements. However, existing mix-zone proposals fail to provide effective mix-zone construction and placement algorithms that are resilient to timing and transition attacks. In MobiMix, mix-zones are constructed and placed by carefully taking into consideration multiple factors, such as the geometry of the zones, the statistical behavior of the user population, the spatial constraints on movement patterns of the users, and the temporal and spatial resolution of the location exposure. In this demonstration, we first introduce a visualization of the location privacy risks of mobile users traveling on road networks and show how mix-zone based anonymization breaks the continuity of location exposure to protect user location privacy. We demonstrate a suite of road network mix-zone construction and placement methods that provide a higher level of resilience to timing and transition attacks on road networks. We show the effectiveness of the MobiMix approach through detailed visualization using traces produced by GTMobiSim on different scales of geographic maps.

【Keywords】: data privacy; data visualisation; geography; mobile computing; statistical analysis; GTMobiSim; MobiMix; anonymous location based service; geographic map; location privacy risk visualization; mix-zone construction; mixzone based anonymization; placement algorithm; road network mix-zone; spatial cloaking; spatial constraint; spatial resolution; statistical behavior; temporal resolution; timing attack; transition attack; user location privacy protection; user population; Entropy; Junctions; Mobile communication; Privacy; Roads; Timing; Visualization

124. Query time scaling of attribute values in interval timestamped databases.

Paper Link】 【Pages】:1304-1307

【Authors】: Anton Dignös ; Michael H. Böhlen ; Johann Gamper

【Abstract】: In valid-time databases with interval timestamping each tuple is associated with a time interval over which the recorded fact is true in the modeled reality. The adjustment of these intervals is an essential part of processing interval timestamped data. Some attribute values remain valid if the associated interval changes, whereas others have to be scaled along with the time interval. For example, attributes that record total (cumulative) quantities over time, such as project budgets, total sales or total costs, often must be scaled if the timestamp is adjusted. The goal of this demo is to show how to support the scaling of attribute values in SQL at query time.

【Keywords】: SQL; database management systems; query processing; SQL; attribute value query time scaling; interval timestamped data processing; interval timestamped databases; project budgets; total costs; total sales; Aggregates; Algebra; Artificial intelligence; Databases; Integral equations; Market research; Semantics
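
The scaling of cumulative attributes can be pictured with a small hypothetical helper (Python): a total quantity is scaled by the fraction of its valid-time interval that overlaps the adjusted interval. The demonstration itself shows how to express the same computation in SQL at query time.

# Illustrative scaling of a cumulative attribute when its valid-time interval is
# restricted to an overlapping query window. Hypothetical Python helper; the
# demonstrated system expresses the same computation in SQL at query time.
from datetime import date

def scale_to_window(value, valid_from, valid_to, win_from, win_to):
    """Scale a total quantity to the part of its interval that overlaps the window."""
    overlap = (min(valid_to, win_to) - max(valid_from, win_from)).days
    if overlap <= 0:
        return 0.0
    return value * overlap / (valid_to - valid_from).days

# A 1200-unit budget valid throughout 2012, restricted to the first quarter:
print(scale_to_window(1200, date(2012, 1, 1), date(2013, 1, 1),
                      date(2012, 1, 1), date(2012, 4, 1)))   # about 298.4 (91 of 366 days)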

Paper Link】 【Pages】:1308-1311

【Authors】: Craig P. Sayers ; Meichun Hsu

【Abstract】: To enable the interactive exploration of large social media datasets, we exploit the temporal distributions of word n-grams within the message stream to discover “interesting” concepts, determine “relatedness” between concepts, and find representative examples for display. We present a new algorithm for context-dependent “interestingness” using the coefficient of variation of the temporal distribution, apply the well-known Pearson's correlation to tweets using equi-height histogramming, and employ an asymmetric variant for computing “relatedness” to encourage exploration. We further introduce techniques using interestingness, correlation, and relatedness to automatically discover concepts and select preferred word n-grams for display. These techniques are demonstrated on an 800,000-tweet dataset from the Academy Awards.

【Keywords】: Internet; information analysis; information retrieval; social networking (online); Pearson correlation; coefficient of variation; context dependent concepts; equiheight histogramming; interactive exploration; interesting concepts extraction; social media datasets; social media streams; temporal distribution; temporal distributions; word n-grams; Awards activities; Context; Correlation; Histograms; Media; Twitter; Visualization
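
A minimal sketch of the two measures named above, assuming an n-gram's per-interval counts are already available (Python, standard library; equi-height binning and the asymmetric relatedness variant are omitted):

# Sketch of the two measures: "interestingness" as the coefficient of variation
# of an n-gram's temporal distribution, and relatedness via Pearson's correlation
# of per-interval counts. Equi-height binning and the asymmetric variant are omitted.
import statistics

def interestingness(counts):
    """Coefficient of variation of an n-gram's counts over time buckets."""
    mean = statistics.mean(counts)
    return statistics.pstdev(counts) / mean if mean else 0.0

def pearson(x, y):
    """Pearson correlation between two equally long count series."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denom = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return cov / denom if denom else 0.0

bursty = [0, 1, 0, 40, 35, 2, 0]       # counts spike during the ceremony
flat = [5, 6, 5, 6, 5, 6, 5]           # steady background chatter
print(interestingness(bursty) > interestingness(flat))   # True
print(pearson(bursty, [0, 2, 1, 38, 30, 3, 1]))          # close to 1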

126. VERDICT: Privacy-preserving authentication of range queries in location-based services.

Paper Link】 【Pages】:1312-1315

【Authors】: Haibo Hu ; Qian Chen ; Jianliang Xu

【Abstract】: We demonstrate VERDICT, a location-based range query service featuring the privacy-preserving authentication capability. VERDICT adopts the common data-as-a-service (DaaS) model, which consists of the data owner (a location registry or a mobile operator) who provides the querying data, the service provider who executes the query, and the querying users. The system features a privacy-preserving query authentication module that enables the user to verify the correctness of results while still protecting the data privacy. This feature is crucial in many location-based services where the querying data are user locations. To achieve this, VERDICT employs an MR-tree based privacy-preserving authentication scheme proposed in our earlier work [3]. The use case study shows that VERDICT provides efficient and smooth user experience for authenticating location-based range queries.

【Keywords】: data privacy; mobile computing; query processing; security of data; trees (mathematics); DaaS model; MR-tree based privacy-preserving authentication scheme; VERDICT; correctness verification; data owner; data privacy protection; data-as-a-service model; location registry; location-based range query service; location-based service; mobile operator; privacy-preserving authentication capability; privacy-preserving query authentication module; query execution; querying data; querying user; service provider; user location; Authentication; Business; Indexes; Mobile radio mobility management; Privacy; Servers; Spatial databases

127. ExpFinder: Finding experts by graph pattern matching.

Paper Link】 【Pages】:1316-1319

【Authors】: Wenfei Fan ; Xin Wang ; Yinghui Wu

【Abstract】: We present ExpFinder, a system for finding experts in social networks based on graph pattern matching. We demonstrate (1) how ExpFinder identifies top-K experts in a social network by supporting bounded simulation of graph patterns, and by ranking the matches based on a metric for social impact; (2) how it copes with the sheer size of real-life social graphs by supporting incremental query evaluation and query preserving graph compression, and (3) how the GUI of ExpFinder interacts with users to help them construct queries and inspect matches.

【Keywords】: data compression; graph theory; graphical user interfaces; network theory (graphs); pattern matching; query processing; social networking (online); ExpFinder; GUI; bounded graph pattern simulation; expert finding system; graph pattern matching; incremental query evaluation; match ranking; query preserving graph compression; real-life social graphs; social impact; social networks; Barium; Collaboration; Engines; Graphical user interfaces; Pattern matching; Query processing; Social network services
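
The bounded-simulation idea can be sketched as a fixpoint refinement of candidate matches: a data node remains a candidate for a pattern node only if, for every pattern edge with hop bound k, it can reach some candidate of the edge's target within k hops. The toy Python below assumes a simple labelled-graph model with made-up job titles and omits the social-impact ranking, incremental evaluation and graph compression.

# Toy bounded simulation for graph pattern matching (illustrative model only).
from collections import deque

def within_hops(edges, src, limit):
    """Nodes reachable from src via 1..limit edges (BFS on an adjacency dict)."""
    reached, visited, frontier = set(), {src}, deque([(src, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == limit:
            continue
        for nxt in edges.get(node, []):
            reached.add(nxt)
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, depth + 1))
    return reached

def bounded_simulation(data_labels, data_edges, pattern_labels, pattern_edges):
    """pattern_edges: (u, v, k) means a match of u must reach a match of v
    within k hops. Returns the refined candidate sets per pattern node."""
    cand = {u: {d for d, lbl in data_labels.items() if lbl == want}
            for u, want in pattern_labels.items()}
    changed = True
    while changed:
        changed = False
        for u, v, k in pattern_edges:
            for d in list(cand[u]):
                if not within_hops(data_edges, d, k) & cand[v]:
                    cand[u].discard(d)
                    changed = True
    return cand

data_labels = {"ann": "manager", "bob": "analyst", "carl": "developer", "dee": "manager"}
data_edges = {"ann": ["bob"], "bob": ["carl"]}
pattern = bounded_simulation(data_labels, data_edges,
                             {"PM": "manager", "DEV": "developer"},
                             [("PM", "DEV", 2)])
print(pattern)   # {'PM': {'ann'}, 'DEV': {'carl'}}; dee cannot reach a developer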

128. Tajo: A distributed data warehouse system on large clusters.

Paper Link】 【Pages】:1320-1323

【Authors】: Hyunsik Choi ; Jihoon Son ; Haemi Yang ; Hyoseok Ryu ; Byungnam Lim ; Soohyung Kim ; Yon Dohn Chung

【Abstract】: The increasing volume of relational data calls for alternatives to cope with it. Recently, several hybrid approaches (e.g., HadoopDB and Hive) between parallel databases and Hadoop have been introduced to the database community. Although these hybrid approaches have gained wide popularity, they cannot avoid the choice of suboptimal execution strategies. We believe that this problem is caused by the inherent limits of their architectures. In this demo, we present Tajo, a relational, distributed data warehouse system on shared-nothing clusters. It uses the Hadoop Distributed File System (HDFS) as the storage layer and has its own query execution engine that we have developed instead of the MapReduce framework. A Tajo cluster consists of one master node and a number of workers across cluster nodes. The master is mainly responsible for query planning and acts as the coordinator for workers. The master divides a query into small tasks and disseminates them to workers. Each worker has a local query engine that executes a directed acyclic graph (DAG) of physical operators. A DAG of operators can take two or more input sources and be pipelined within the local query engine. In addition, Tajo can control distributed data flow more flexibly than MapReduce and supports indexing techniques. By combining these features, Tajo can employ more optimized and efficient query processing, including existing methods that have been studied in traditional database research. To give a deep understanding of the Tajo architecture and its behavior during query processing, the demonstration will allow users to submit TPC-H queries to 32 Tajo cluster nodes. The web-based user interface will show (1) how the submitted queries are planned, (2) how the queries are distributed across nodes, (3) the cluster and node status, and (4) the details of relations and their physical information. Also, we provide a performance evaluation of Tajo compared with Hive.

【Keywords】: Internet; data warehouses; directed graphs; distributed control; information dissemination; parallel databases; pattern clustering; pipeline processing; query processing; relational databases; storage management; user interfaces; DAG; HDFS; Hadoop distributed file system; Hive; MapReduce framework; TPC-H queries; Tajo architecture; Tajo cluster; Tajo cluster nodes; Web-based user interface; control distributed data flow; database community; directed acyclic graph; large clusters; local query engine; parallel databases; physical operators; query execution engine; query planning; query processing; relational data volumes; relational distributed data warehouse system; shared-nothing clusters; storage layer; suboptimal execution strategies; worker coordinator; Catalogs; Engines; Fault tolerance; Fault tolerant systems; Planning; Query processing

129. Πgora: An Integration System for Probabilistic Data.

Paper Link】 【Pages】:1324-1327

【Authors】: Dan Olteanu ; Lampros Papageorgiou ; Sebastiaan J. van Schaik

【Abstract】: Πgora is an integration system for probabilistic data modelled using different formalisms such as pc-tables, Bayesian networks, and stochastic automata. User queries are expressed over a global relational layer and are evaluated by Πgora using a range of strategies, including data conversion into one probabilistic formalism followed by evaluation using a formalism-specific engine, and hybrid plans, where subqueries are evaluated using engines for different formalisms. This demonstration allows users to experience Πgora on real-world heterogeneous data sources from the medical domain.

【Keywords】: data integration; distributed databases; medical administrative data processing; probability; query processing; Πgora; data conversion; formalism-specific engine; global relational layer; hybrid plans; medical domain; probabilistic data integration system; probabilistic formalism; real-world heterogeneous data sources; subqueries; user queries; Bayes methods; Breast cancer; Diabetes; Engines; Pregnancy; Probabilistic logic; Random variables

130. Complex pattern matching in complex structures: The XSeq approach.

Paper Link】 【Pages】:1328-1331

【Authors】: Kai Zeng ; Mohan Yang ; Barzan Mozafari ; Carlo Zaniolo

【Abstract】: There is much current interest in applications of complex event processing over data streams and of complex pattern matching over stored sequences. While some applications use streams of flat records, XML and various semi-structured information formats are preferred by many others, in particular applications that deal with domain science, social networks, RSS feeds, and finance. XSeq and its system improve complex pattern matching technology significantly, both in terms of expressive power and efficient implementation. XSeq achieves higher expressiveness through an extension of XPath based on Kleene-* pattern constructs, and achieves very efficient execution, on both stored and streaming data, using Visibly Pushdown Automata (VPA). In our demo, we will (i) show examples of XSeq in different application domains, (ii) explain its compilation and query optimization techniques and show the speed-ups they deliver, and (iii) demonstrate how powerful and efficient application-specific languages were implemented by superimposing simple `skins' on XSeq and its system.

【Keywords】: XML; optimisation; pattern matching; pushdown automata; query processing; Kleene-* pattern constructs; RSS feeds; VPA; XML; XPath; XSeq approach; compilation-query optimization techniques; complex event processing; complex pattern matching; complex structures; domain science; finance; flat records; semistructured information formats; social networks; visibly pushdown automata; Data analysis; Databases; Genomics; Optimization; Proteins; Remuneration; XML

131. T-Music: A melody composer based on frequent pattern mining.

Paper Link】 【Pages】:1332-1335

【Authors】: Cheng Long ; Raymond Chi-Wing Wong ; Raymond Ka Wai Sze

【Abstract】: There is a large body of work proposing algorithms for composing the melody of a song automatically, which is known as algorithmic composition. To the best of our knowledge, none of these studies took the lyrics into consideration for melody composition. However, according to some recent studies, there usually exists a certain extent of correlation between a song's melody and its lyrics. In this demonstration, we propose to utilize this type of correlation information for melody composition. Based on this idea, we design a new melody composition algorithm and develop a melody composer called T-Music which employs this composition algorithm.

【Keywords】: data mining; music; T-Music; algorithmic composition; correlation information; frequent pattern mining; melody composer; melody composition algorithm; Algorithm design and analysis; Automata; Buildings; Correlation; Databases; Probabilistic logic; Stress; frequent pattern mining; melody composition; probabilistic automaton

132. SHARE: Secure information sharing framework for emergency management.

Paper Link】 【Pages】:1336-1339

【Authors】: Barbara Carminati ; Elena Ferrari ; Michele Guglielmi

【Abstract】: 9/11, Katrina, Fukushima and other recent emergencies demonstrate the need for effective information sharing across government agencies as well as non-governmental and private organizations to assess emergency situations, and generate proper response plans. In this demo, we present a system to enforce timely and controlled information sharing in emergency situations. The framework is able to detect emergencies, enforce temporary access control policies and obligations to be activated during emergencies, simulate emergency situations for demonstrational purposes and show statistical results related to emergency activation/deactivation and consequent access control policies triggering.

【Keywords】: authorisation; emergency management; local government; SHARE; access control policies triggering; emergency activation; emergency deactivation; emergency management; emergency situations; government agency; nongovernmental organizations; private organizations; proper response plans; secure information sharing framework; temporary access control policy; Access control; Computational modeling; Emergency services; Government; Information management; Servers

Paper Link】 【Pages】:1340-1343

【Authors】: Xin Cao ; Lisi Chen ; Gao Cong ; Jihong Guan ; Nhan-Tue Phan ; Xiaokui Xiao

【Abstract】: We present the Keyword-aware Optimal Route Search System (KORS), which efficiently answers KOR queries. A KOR query finds a route such that it covers a set of user-specified keywords, a specified budget constraint is satisfied, and an objective score of the route is optimized. Consider a tourist who wants to spend a day exploring a city. The user may issue the following KOR query: “find the most popular route such that it passes by a shopping mall, a restaurant, and a pub, and the travel time to and from her hotel is within 4 hours.” KORS provides browser-based interfaces for desktop and laptop computers as well as a client application for mobile devices. The interfaces and the client enable users to formulate queries and view the query results on a map. Queries are then sent to the server for processing via an HTTP POST operation. Since answering a KOR query is NP-hard, we devise two approximation algorithms with provable performance bounds and one greedy algorithm to process KOR queries in our KORS prototype. We use two real-world datasets to demonstrate the functionality and performance of this system.

【Keywords】: hypermedia; online front-ends; optimisation; query processing; traffic engineering computing; transport protocols; user interfaces; vehicle routing; HTTP post operation; KOR queries; KORS; NP-hard problem; browser-based interfaces; keyword-aware optimal route search system; travel time; user-specified keywords; Browsers; Computers; Google; Greedy algorithms; Mobile handsets; Roads; Servers
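
A greedy heuristic in the spirit of the one mentioned above might look as follows (Python sketch; the place names, symmetric pairwise travel times and the specific detour rule are illustrative assumptions, not the demonstrated algorithm): repeatedly detour to the cheapest place that covers a still-missing keyword while keeping the return trip within budget.

# Greedy sketch: starting from the user's location, repeatedly detour to the
# cheapest place covering a still-missing keyword, keeping the return trip
# within the travel-time budget. Hypothetical toy model, not the demonstrated algorithm.

def greedy_route(start, pois, cost, required, budget):
    """pois: {place: set of keywords}; cost: {(a, b): travel time};
    returns (route, total_cost) or None if the budget cannot be met."""
    route, spent, missing, here = [start], 0.0, set(required), start
    while missing:
        candidates = [(cost[(here, p)], p) for p, kws in pois.items()
                      if p not in route and kws & missing]
        if not candidates:
            return None
        step, nxt = min(candidates)
        if spent + step + cost[(nxt, start)] > budget:   # keep the return trip feasible
            return None
        route.append(nxt)
        spent += step
        missing -= pois[nxt]
        here = nxt
    spent += cost[(here, start)]                         # travel back to the start
    return route + [start], spent

# Made-up travel times in minutes between a hotel and three places.
pois = {"mall": {"shopping"}, "bistro": {"restaurant"}, "pub": {"pub"}}
times = {("hotel", "mall"): 30, ("mall", "bistro"): 20, ("bistro", "pub"): 15,
         ("hotel", "bistro"): 40, ("hotel", "pub"): 50, ("mall", "pub"): 35}
cost = {}
for (a, b), t in times.items():
    cost[(a, b)] = cost[(b, a)] = t
print(greedy_route("hotel", pois, cost, {"shopping", "restaurant", "pub"}, budget=240))
# (['hotel', 'mall', 'bistro', 'pub', 'hotel'], 115.0)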

134. CrowdPlanr: Planning made easy with crowd.

Paper Link】 【Pages】:1344-1347

【Authors】: Ilia Lotosh ; Tova Milo ; Slava Novgorodov

【Abstract】: Recent research has shown that crowd sourcing can be used effectively to solve problems that are difficult for computers, e.g., optical character recognition and identification of the structural configuration of natural proteins [1]. In this demo we propose to use the power of the crowd to address yet another difficult problem that frequently occurs in daily life: planning a sequence of actions when the goal is hard to formalize. For example, planning the sequence of places/attractions to visit in the course of a vacation, where the goal is to enjoy the resulting vacation the most, or planning the sequence of courses to take in an academic schedule, where the goal is to obtain solid knowledge of a given subject domain. Such goals may be easily understandable by humans, but hard or even impossible to formalize for a computer. We present a novel algorithm for efficiently harnessing the crowd to assist in solving such planning problems. The algorithm builds the desired plans incrementally, optimally choosing at each step the `best' questions so that the overall number of questions that need to be asked is minimized. We demonstrate the effectiveness of our solution in CrowdPlanr, a system for vacation travel planning. Given a destination, dates, preferred activities and other constraints, CrowdPlanr employs the crowd to build a vacation plan (a sequence of places to visit) that is expected to maximize the “enjoyment” of the vacation.

【Keywords】: planning; travel industry; CrowdPlanr; academic schedule planning; action sequence; crowd sourcing; daily life-planning; destination; planning problem; sequence planning; vacation enjoyment; vacation travel planning; Abstracts; Cities and towns; Computers; Databases; Games; Schedules

135. ASVTDECTOR: A practical near duplicate video retrieval system.

Paper Link】 【Pages】:1348-1351

【Authors】: Xiangmin Zhou ; Lei Chen

【Abstract】: In this paper, we present a system, named ASVT-DECTOR, to retrieve near-duplicate videos with large variations based on a 3D structure tensor model, named the ASVT series, over the local descriptors of video segments. Different from traditional global feature-based video detection systems that incur severe information loss, the ASVT model is built over the local descriptor set of each video segment, keeping the robustness of local descriptors. Meanwhile, unlike traditional local feature-based methods that suffer from the high cost of pair-wise descriptor comparison, the ASVT model describes a video segment as a 3D structure tensor, which is actually a 3×3 matrix, obtaining high retrieval efficiency. In this demonstration, we show that, given a clip, our ASVT-DECTOR system can effectively find near-duplicates with large variations from a large collection in real time.

【Keywords】: matrix algebra; tensors; video retrieval; 3D structure tensor model; ASVT-DECTOR; adaptive structure video tensor series; local descriptor set; near duplicate video retrieval system; Feature extraction; Multimedia communication; Solid modeling; Streaming media; TV broadcasting; Tensile stress; Three-dimensional displays
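
For intuition, a generic 3D structure tensor of a video segment can be obtained by aggregating the outer products of spatio-temporal gradients into a single 3×3 matrix. The sketch below (Python with NumPy) shows that generic computation on a synthetic clip; it is not the exact ASVT model, which is built over local descriptors.

# Generic 3D structure tensor of a video segment: aggregate the outer products
# of spatio-temporal gradients (Ix, Iy, It) into a single 3x3 matrix.
# Illustrative computation only; the ASVT model is built over local descriptors.
import numpy as np

def segment_structure_tensor(segment):
    """segment: array of shape (frames, height, width) with grayscale intensities."""
    it, iy, ix = np.gradient(segment.astype(float))       # gradients along t, y, x
    g = np.stack([ix.ravel(), iy.ravel(), it.ravel()])    # 3 x N gradient matrix
    return g @ g.T                                        # 3x3 structure tensor

rng = np.random.default_rng(0)
clip = rng.random((16, 32, 32))                           # tiny synthetic 16-frame segment
tensor = segment_structure_tensor(clip)
print(tensor.shape)                                       # (3, 3)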

Paper Link】 【Pages】:1352-1355

【Authors】: Eduard Constantin Dragut ; Brian P. Beirne ; Ali Neyestani ; Badr Atassi ; Clement T. Yu ; Bhaskar DasGupta ; Weiyi Meng

【Abstract】: We present YumiInt, a deep Web integration system for local search engines over geo-referenced objects. YumiInt consists of two systems: YumiDev and YumiMeta. YumiDev is an off-line integration system that builds the key components (e.g., query translation and entity resolution) of YumiMeta. YumiMeta is the Web application to which users post queries. It translates queries to multiple sources and gets back aggregated lists of results. We present the two systems in this paper.

【Keywords】: Web services; geographic information systems; query processing; search engines; YumiDev; YumiInt; YumiMeta; deep Web integration system; georeferenced object; offline integration system; query processing; search engine; Business; Couplings; Data mining; Educational institutions; Engines; Search engines; Standards

137. A demonstration of the G∗ graph database system.

Paper Link】 【Pages】:1356-1359

【Authors】: Sean R. Spillane ; Jeremy Birnbaum ; Daniel Bokser ; Daniel Kemp ; Alan G. Labouseur ; Paul W. Olsen ; Jayadevan Vijayan ; Jeong-Hyon Hwang ; Jun-Weon Yoon

【Abstract】: The world is full of evolving networks, many of which can be represented by a series of large graphs. Neither the current graph processing systems nor database systems can efficiently store and query these graphs due to their lack of support for managing multiple graphs and lack of essential graph querying capabilities. We propose to demonstrate our system, G*, that meets the new challenges of managing multiple graphs and supporting fundamental graph querying capabilities. G* can store graphs on a large number of servers while compressing these graphs based on their commonalities. G* also allows users to easily express queries on graphs and efficiently executes these queries by sharing computations across graphs. During our demonstrations, conference attendees will run various analytic queries on large, practical data sets. These demonstrations will highlight the convenience and performance benefits of G* over existing database and graph processing systems, the effectiveness of sharing in graph data storage and processing, as well as G*'s scalability.

【Keywords】: database management systems; graph theory; current graph processing system; essential graph querying capability; fundamental graph querying capability; graph data storage; graph database system; Educational institutions; Engines; History; Indexes; Servers; Social network services

138. RECODS: Replica consistency-on-demand store.

Paper Link】 【Pages】:1360-1363

【Authors】: Yuqing Zhu ; Philip S. Yu ; Jianmin Wang

【Abstract】: Replication is critical to the scalability, availability and reliability of large-scale systems. The trade-off of replica consistency vs. response latency has been widely understood for large-scale stores with replication. The weak consistency guaranteed by existing large-scale stores complicates application development, while strong consistency hurts application performance. It is desirable that the best consistency be guaranteed for a tolerable response latency, but none of the existing large-scale stores supports maximized replica consistency within a given latency constraint. In this demonstration, we showcase RECODS (REplica Consistency-On-Demand Store), a NoSQL store implementation that can finely control the trade-off on a per-operation basis and thus facilitate application development with on-demand replica consistency. With RECODS, developers can specify the tolerable latency for each read/write operation. Within the specified latency constraint, a response will be returned and the replica consistency maximized. The RECODS implementation is based on Cassandra, an open source NoSQL store, but with a different operation execution process, replication process and in-memory storage hierarchy.

【Keywords】: SQL; cloud computing; digital storage; public domain software; reliability; storage management; Cassandra; NoSQL store implementation; RECODS; application development; in-memory storage hierarchy; large-scale system availability; large-scale system reliability; large-scale system scalability; on-demand replica consistency; open source NoSQL store; operation execution process; read-write operation; replica consistency; replica consistency-on-demand store; replication process; response latency; tolerable response latency; Availability; Compaction; Computers; Medals; Process control; Web pages
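
The per-operation trade-off can be pictured from the client side as a latency-bounded read: query every replica in parallel, wait no longer than the specified latency, and answer from whichever replicas responded, preferring the freshest value. The Python sketch below is a hypothetical client-side model; RECODS itself enforces the constraint inside the store.

# Hypothetical client-side sketch of a latency-bounded read: ask every replica,
# wait at most `max_latency` seconds, and answer from the replicas that made it,
# preferring the freshest (highest-timestamp) value seen.
import concurrent.futures
import time

def read_with_deadline(replicas, key, max_latency):
    """replicas: callables replica(key) -> (timestamp, value)."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(replica, key) for replica in replicas]
    done, _ = concurrent.futures.wait(futures, timeout=max_latency)
    pool.shutdown(wait=False)                        # do not block on stragglers
    freshest = None
    for f in done:
        ts, value = f.result()
        if freshest is None or ts > freshest[0]:
            freshest = (ts, value)
    return freshest                                  # None if no replica answered in time

def fast_replica(key):
    return (10, "v10")                               # up-to-date copy

def slow_replica(key):
    time.sleep(1.0)                                  # straggler holding a stale copy
    return (9, "v9")

print(read_with_deadline([fast_replica, fast_replica, slow_replica], "k", 0.2))
# (10, 'v10'): the answer reflects only the replicas that met the 0.2 s budget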

139. SODIT: An innovative system for outlier detection using multiple localized thresholding and interactive feedback.

Paper Link】 【Pages】:1364-1367

【Authors】: Ji Zhang ; Hua Wang ; Xiaohui Tao ; Lili Sun

【Abstract】: Outlier detection is an important long-standing research problem in data mining and has found applications in a wide range of domains, including business, engineering, biology and security. However, traditional outlier detection methods inevitably need different parameters for detection, such as those specifying the distance or density cutoff for distinguishing outliers from normal data points. Because these parameters are typically tuned by trial and error, traditional outlier detection methods are rather tedious to use. In this demo proposal, we introduce an innovative outlier detection system, called SODIT, that uses localized thresholding to assist in specifying threshold values that closely reflect the local data distribution. In addition, easy-to-use user feedback is employed to further facilitate the determination of optimal parameter values. SODIT makes outlier detection much easier to operate and produces more accurate, intuitive and informative results than before.

【Keywords】: data mining; SODIT; data mining; innovative system; interactive feedback; local data distribution; localized thresholding; outlier detection; parameter tuning; value specification; Clustering algorithms; Data mining; Detection algorithms; Distributed databases; Feature extraction; Labeling; Standards
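
A minimal sketch of the localized-thresholding idea (Python with NumPy): each point is scored by its mean k-NN distance and compared against a threshold derived from its own neighbourhood rather than a single global cutoff. The k-NN formulation and the parameter values are illustrative assumptions, not the SODIT method.

# Localized thresholding sketch: score each point by its mean k-NN distance and
# compare against a threshold derived from its own neighbourhood, not a global cutoff.
# Generic illustration with made-up parameters, not the SODIT method.
import numpy as np

def local_outliers(points, k=5, sigmas=3.0):
    pts = np.asarray(points, dtype=float)
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    knn = np.sort(dists, axis=1)[:, 1:k + 1]            # skip the zero self-distance
    score = knn.mean(axis=1)                             # each point's k-NN distance score
    flags = []
    for i in range(len(pts)):
        neigh = np.argsort(dists[i])[1:k + 1]            # the point's own neighbourhood
        threshold = score[neigh].mean() + sigmas * score[neigh].std()
        flags.append(score[i] > threshold)
    return np.array(flags)

cluster = np.random.default_rng(1).normal(0, 1, size=(60, 2))
data = np.vstack([cluster, [[8.0, 8.0]]])                # append one far-away point
print(np.where(local_outliers(data))[0])                 # includes index 60, the far-away point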

140. COLA: A cloud-based system for online aggregation.

Paper Link】 【Pages】:1368-1371

【Authors】: Yantao Gan ; Xiaofeng Meng ; Yingjie Shi

【Abstract】: Online aggregation is a promising solution to achieving fast early responses for interactive ad-hoc queries that compute aggregates on massive data. To process large datasets on large-scale computing clusters, MapReduce has been introduced as a popular paradigm into many data analysis applications. However, typical MapReduce implementations are not well suited to analytic tasks, since they are geared towards batch processing. With the increasing popularity of ad-hoc analytic query processing over enormous datasets, processing aggregate queries using MapReduce in an online fashion is therefore an important emerging application need. We present a MapReduce-based online aggregation system called COLA, which provides progressive approximate aggregate answers for both single tables and multiple joined tables. COLA provides an online aggregation execution engine with novel sampling techniques to support incremental and continuous computation of aggregates, and to minimize the waiting time before an acceptably precise estimate is available. In addition, user-friendly SQL queries are supported in COLA. Furthermore, COLA can implicitly convert non-OLA jobs into an online version so that users do not have to write any special-purpose code to make estimates.

【Keywords】: SQL; batch processing (computers); cloud computing; data handling; query processing; COLA; MapReduce implementations; MapReduce-based online aggregation system; batch processing; cloud based system for online aggregation; data analysis applications; interactive adhoc queries; large dataset process; large scale computing clusters; novel sampling techniques; online aggregation execution engine; user-friendly SQL queries; Aggregates; Electronic publishing; Encyclopedias; Engines; Internet; Query processing
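
The progressive-estimate behaviour can be illustrated for a single-table average (Python, standard library): process the data in blocks, keep running sums, and report a CLT-style confidence interval after each block so the user can stop once the bound is tight enough. The block size, the stopping rule and the normal-approximation interval are illustrative assumptions.

# Online aggregation sketch for a single-table average: consume the data in
# blocks, keep running sums, and report a CLT-style confidence interval after
# each block. Block size and stopping rule are made-up; COLA runs on MapReduce
# and also handles joins.
import math
import random

def online_avg(stream, block_size=1000, z=1.96):
    n = total = total_sq = 0.0
    block = []
    for value in stream:
        block.append(value)
        if len(block) == block_size:
            n += len(block)
            total += sum(block)
            total_sq += sum(v * v for v in block)
            mean = total / n
            variance = max(total_sq / n - mean * mean, 0.0)
            yield mean, z * math.sqrt(variance / n)      # progressive estimate and half-width
            block.clear()

data = (random.gauss(100, 15) for _ in range(50000))
for i, (mean, half_width) in enumerate(online_avg(data), 1):
    print(f"after {i * 1000:>6} rows: {mean:7.2f} +/- {half_width:.2f}")
    if half_width < 0.5:                                 # acceptably precise, stop early
        break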

141. RoadAlarm: A spatial alarm system on road networks.

Paper Link】 【Pages】:1372-1375

【Authors】: Kisung Lee ; Emre Yigitoglu ; Ling Liu ; Binh Han ; Balaji Palanisamy ; Calton Pu

【Abstract】: Spatial alarms are one of the fundamental functionalities for many LBSs. We argue that spatial alarms should be road network aware, as mobile objects travel on spatially constrained road networks or walk paths. In this software system demonstration, we will present the first prototype of ROADALARM, a spatial alarm processing system for moving objects on road networks. The demonstration focuses on three unique features of the ROADALARM system design. First, we will show that a road network spatial alarm is best modeled using road network distances, such as segment-length-based and travel-time-based distances; thus, a road network spatial alarm is a star-like subgraph centered at the alarm target. Second, we will show the suite of ROADALARM optimization techniques that scale spatial alarm processing by taking into account spatial constraints on road networks and the mobility patterns of mobile subscribers. Third, we will show that, by equipping the ROADALARM system with an activity monitoring-based control panel, we enable the system administrator and end users to visualize road network-based spatial alarms and mobility traces of moving objects, and to dynamically select or customize the ROADALARM techniques for spatial alarm processing through a graphical user interface. We show that the ROADALARM system provides both the general system architecture and the essential building blocks for location-based advertisements and location-based reminders.

【Keywords】: alarm systems; graphical user interfaces; mobile computing; traffic engineering computing; LBS; ROADALARM system; ROADALARM system design; ROADALARM techniques; activity monitoring-based control panel; alarm target; fundamental functionalities; general system architecture; graphical user interface; location-based advertisements; location-based reminders; mobile objects; mobile subscribers; mobility patterns; mobility traces; moving objects; road network distance; road network spatial alarm processing system; road network-based spatial alarm visualization; software system demonstration; spatially constrained road networks; star-like subgraph; suite processing; system administrator; walk paths; Accuracy; Barium; Engines; Euclidean distance; Mobile communication; Mobile computing; Roads
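
As a rough illustration of a network-distance alarm region, i.e., the star-like subgraph around the alarm target, the sketch below (Python) runs a radius-bounded Dijkstra search from the target and fires the alarm when a subscriber's position enters that region. Node names, edge weights and the mobility trace are made-up examples.

# Network-distance alarm region: the set of road nodes within `radius` (e.g.
# travel time) of the alarm target, computed with a radius-bounded Dijkstra
# search. Node names, weights and the trace are made-up examples.
import heapq

def alarm_region(graph, target, radius):
    """graph: {node: [(neighbor, edge_cost), ...]}; returns nodes whose
    shortest network distance to the target is at most `radius`."""
    dist, heap = {target: 0.0}, [(0.0, target)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue
        for nxt, w in graph.get(node, []):
            nd = d + w
            if nd <= radius and nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return set(dist)

road = {"a": [("b", 2), ("c", 5)], "b": [("a", 2), ("d", 4)],
        "c": [("a", 5)], "d": [("b", 4)]}
region = alarm_region(road, target="a", radius=5)        # {'a', 'b', 'c'}
for position in ["d", "b", "a"]:                         # a subscriber's mobility trace
    if position in region:
        print(f"alarm fires at {position}")              # fires at 'b'
        break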

142. A real-time abnormality detection system for intensive care management.

Paper Link】 【Pages】:1376-1379

【Authors】: Guangyan Huang ; Jing He ; Jie Cao ; Zhi Qiao ; Michael Steyn ; Kersi Taraporewalla

【Abstract】: Detecting abnormalities from multiple correlated time series is valuable to applications where a credible real-time event prediction system can minimize economic losses (e.g. stock market crashes) and save lives (e.g. medical surveillance in the operating theatre). For example, in an intensive care scenario, anesthetists perform a vital role in monitoring the patient and adjusting the flow and type of anesthetics during an operation. An early awareness of possible complications is vital for an anesthetist to react correctly to a given situation. In this demonstration, we provide a comprehensive medical surveillance system that effectively detects abnormalities from multiple physiological data streams to assist online intensive care management. In particular, a novel online support vector regression (OSVR) algorithm is developed to approach the problem of discovering abnormalities from multiple correlated time series with accuracy and real-time efficiency. We also utilize historical data streams to optimize the precision of the OSVR algorithm. Moreover, the system provides a friendly user interface that integrates multiple physiological data streams and visualizes abnormality alarms.

【Keywords】: Internet; data visualisation; graphical user interfaces; human computer interaction; patient care; patient monitoring; real-time systems; regression analysis; support vector machines; surveillance; time series; OSVR algorithm; abnormality alarm visualization; anesthetics; economic loss minimization; medical surveillance system; multiple correlated time series; online intensive care management; online support vector regression algorithm; patient monitoring; physiological data streams; real-time abnormality detection system; real-time efficiency; real-time event prediction system; user friendly interface; Biomedical monitoring; Market research; Monitoring; Real-time systems; Support vector machines; Time series analysis; Vectors
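
The detection loop can be sketched with a plain sliding-window least-squares predictor standing in for the online SVR (Python with NumPy): predict the next reading from the recent window and raise an alarm when the residual is unusually large. The window size, the threshold rule and the synthetic heart-rate trace are illustrative assumptions, not the OSVR algorithm.

# Abnormality detection sketch with a sliding-window least-squares predictor
# standing in for the online SVR: predict the next reading from the recent
# window and raise an alarm when the residual is unusually large.
# Synthetic data and thresholds are made-up examples.
import numpy as np

def detect_abnormal(signal, window=30, sigmas=5.0):
    """Return indices whose value deviates strongly from the windowed trend."""
    alarms = []
    x = np.arange(window)
    for t in range(window, len(signal)):
        recent = np.asarray(signal[t - window:t], dtype=float)
        slope, intercept = np.polyfit(x, recent, 1)      # local linear trend
        predicted = slope * window + intercept           # extrapolate one step ahead
        residual = abs(signal[t] - predicted)
        if residual > sigmas * (recent.std() + 1e-9):
            alarms.append(t)
    return alarms

rng = np.random.default_rng(0)
heart_rate = 70 + rng.normal(0, 1, 300)
heart_rate[200] = 120                                    # inject a sudden spike
print(detect_abnormal(heart_rate.tolist(), window=30))   # [200]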