Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany. IEEE Computer Society 【DBLP Link】
【Paper Link】 【Pages】:1
【Authors】: Anastasia Ailamaki
【Abstract】: Summary form only given. Database systems have long optimized for parallel execution; the research community has pursued parallel database machines since the early '80s, and several key ideas from that era underlie the design and success of commercial database engines today. Computer microarchitecture, however, has shifted drastically during the intervening decades. Until the end of the 20th century Moore's Law translated into single-processor performance gains; today's constraints in semiconductor technology cause transistor integration to increase the number of processors per chip roughly every two years. Chip multiprocessor, or multicore, platforms are commodity; harvesting scalable performance from the available raw parallelism, however, is increasingly challenging for conventional database servers running business intelligence and transaction processing workloads. A careful analysis of database performance scaling trends on future chip multiprocessors [7] demonstrates that current parallelism methods are of bounded utility as the number of processors per chip increases exponentially. Common sense is often contradicted; for instance, increasing on-chip cache size or aggressively sharing data among processors is often detrimental to performance [6]. When designing high performance database systems, there are tradeoffs between single-thread performance and scalability; as the number of hardware contexts grows, favoring scalability wins [5]. In order to transform a database storage manager from a single-threaded Atlas into a multi-threaded Lernaean Hydra which scales to infinity, substantial rethinking of fundamental constructs at all levels of the system is in order. Primitives such as the mechanism to access critical sections become crucial: spinning wastes cycles, while blocking incurs high overhead [3]. At the database processing level, converting concurrency into parallelism proves to be a challenging task, even for transactional workloads that are inherently concurrent. Typical obstacles are by-definition centralized operations, such as locking; we need to ensure consistency by decoupling transaction data access from process assignment, while adapting lessons from the first parallel database machines on the multicore platforms of the future [1][4]. Often, parallelism needs to be extracted from seemingly serial operations such as logging; extensive research in distributed systems proves to be very useful in this context [2]. At the query processing level, service-oriented architectures provide an excellent framework to exploit available parallelism. In this talk, I present lessons learned when trying to scale database workloads on chip multiprocessors. I discuss the tradeoffs and trends and outline the above research directions using examples from the StagedDB/CMP and ShoreMT projects at EPFL.
【Keywords】: cache storage; competitive intelligence; microprocessor chips; multi-threading; multiprocessing systems; parallel databases; peer-to-peer computing; query processing; service-oriented architecture; transaction processing; CMP project; Moore Law; ShoreMT project; StagedDB project; business intelligence; chip multiprocessor; commercial database engine; computer microarchitecture; conventional database server; data sharing; database storage manager; multithreaded Lernaean Hydra; on-chip cache size; parallel database machine; query processing level; scalable database system; semiconductor technology; service oriented architecture; single processor performance; single threaded Atlas; transaction data access; transaction processing; transistor integration; waste cycle; Awards activities; Computers; Database systems; Parallel processing; Program processors; Scalability
【Paper Link】 【Pages】:2-13
【Authors】: Georg Gottlob ; Giorgio Orsi ; Andreas Pieris
【Abstract】: Ontological queries are evaluated against an enterprise ontology rather than directly on a database. The evaluation and optimization of such queries is an intriguing new problem for database research. In this paper we discuss two important aspects of this problem: query rewriting and query optimization. Query rewriting consists of the compilation of an ontological query into an equivalent query against the underlying relational database. The focus here is on soundness and completeness. We review previous results and present a new rewriting algorithm for rather general types of ontological constraints (description logics). In particular, we show how a conjunctive query (CQ) against an enterprise ontology can be compiled into a union of conjunctive queries (UCQ) against the underlying database. Ontological query optimization, in this context, attempts to improve this process so to produce possibly small and cost-effective output UCQ. We review existing optimization methods, and propose an effective new method that works for Linear Datalog±, a description logic that encompasses well-known description logics of the DL-Lite family.
【Keywords】: formal logic; ontologies (artificial intelligence); query processing; relational databases; DL-Lite family; conjunctive query; description logic; enterprise ontology; linear datalog±; ontological queries; query optimization; query rewriting; relational database; rewriting algorithm; Cognition; Companies; OWL; Ontologies; Optimization; Relational databases
【Paper Link】 【Pages】:14
【Authors】: Johannes Gehrke
【Abstract】: Scalability is a fundamental problem in the development of computer games and massively multiplayer online games (MMOs). Players always demand more - more polygons, more physics particles, more interesting AI behavior, more monsters, more simultaneous players and interactions, and larger virtual worlds. Scalability has long been a focus of the database research community. However, the games community has done little to exploit database research. Most game developers think of databases as persistence solutions, designed to store and read game state. But the advantages of database systems go way beyond persistence. Database research has dealt with the beautiful question of efficiently processing declarative languages, and the resulting research questions have touched many areas of computer science. In this talk, I will show how the idea of declarative processing from databases can be applied to computer games and simulations. I will describe our journey from declarative to imperative scripting languages for computer games and simulations, introducing the state-effect pattern, a design pattern that enables developers to design games and simulations that can be programmed imperatively, but processed declaratively. I will then show how we can use database techniques to run scalable games and simulations cheaply in the cloud. I will then discuss how we can use ideas from optimistic concurrency control to scale interactions between players in MMOs, and I will describe how we can use these techniques to build elastic transaction processing infrastructure for the cloud.
【Keywords】: authoring languages; computer games; concurrency control; database languages; transaction processing; computer games; concurrency control; database research community; declarative languages; game developers; games community; multiplayer online games; scalability; scripting languages; transaction processing; Awards activities; Computational modeling; Computers; Database systems; Games; Scalability
【Paper Link】 【Pages】:15-26
【Authors】: Senjuti Basu Roy ; Gautam Das ; Sihem Amer-Yahia ; Cong Yu
【Abstract】: Planning an itinerary when traveling to a city involves substantial effort in choosing Points-of-Interest (POIs), deciding in which order to visit them, and accounting for the time it takes to visit each POI and transit between them. Several online services address different aspects of itinerary planning but none of them provides an interactive interface where users give feedback and iteratively construct their itineraries based on personal interests and time budget. In this paper, we formalize interactive itinerary planning as an iterative process where, at each step: (1) the user provides feedback on POIs selected by the system, (2) the system recommends the best itineraries based on all feedback so far, and (3) the system further selects a new set of POIs, with optimal utility, to solicit feedback for, at the next step. This iterative process stops when the user is satisfied with the recommended itinerary. We show that computing an itinerary is NP-complete even for simple itinerary scoring functions, and that POI selection is NP-complete. We develop heuristics and optimizations for a specific case where the score of an itinerary is proportional to the number of desired POIs it contains. Our extensive experiments show that our algorithms are efficient and return high quality itineraries.
【Keywords】: computational complexity; feedback; transportation; NP-complete problem; POI; feedback; interactive itinerary planning; iterative process; itinerary scoring functions; online services; points of interest; Algorithm design and analysis; Approximation algorithms; Computational modeling; Planning; Probabilistic logic; Semantics
【Paper Link】 【Pages】:27-38
【Authors】: Bin Bi ; Sau Dan Lee ; Ben Kao ; Reynold Cheng
【Abstract】: In a social tagging system, resources (such as photos, video and web pages) are associated with tags. These tags allow the resources to be effectively searched through tag-based keyword matching using traditional IR techniques. We note that in many such systems, tags of a resource are often assigned by a diverse audience of casual users (taggers). This leads to two issues that gravely affect the effectiveness of resource retrieval: (1) Noise: tags are picked from an uncontrolled vocabulary and are assigned by untrained taggers. The tags are thus noisy features in resource retrieval. (2) A multitude of aspects: different taggers focus on different aspects of a resource. Representing a resource using a flattened bag of tags ignores this important diversity of taggers. To improve the effectiveness of resource retrieval in social tagging systems, we propose CubeLSI - a technique that extends traditional LSI to include taggers as another dimension of feature space of resources. We compare CubeLSI against a number of other tag-based retrieval models and show that CubeLSI significantly outperforms the other models in terms of retrieval accuracy. We also prove two interesting theorems that allow CubeLSI to be very efficiently computed despite the much enlarged feature space it employs.
【Keywords】: identification technology; information retrieval; social networking (online); string matching; vocabulary; CubeLSI; IR technique; resource retrieval; social tagging system; tag based keyword matching; tag based retrieval model; uncontrolled vocabulary; Large scale integration; Matrix decomposition; Portable computers; Semantics; Tagging; Tensile stress; Vocabulary
【Paper Link】 【Pages】:39-50
【Authors】: Wenfei Fan ; Jianzhong Li ; Shuai Ma ; Nan Tang ; Yinghui Wu
【Abstract】: It is increasingly common to find graphs in which edges bear different types, indicating a variety of relationships. For such graphs we propose a class of reachability queries and a class of graph patterns, in which an edge is specified with a regular expression of a certain form, expressing the connectivity in a data graph via edges of various types. In addition, we define graph pattern matching based on a revised notion of graph simulation. On graphs in emerging applications such as social networks, we show that these queries are capable of finding more sensible information than their traditional counterparts. Better still, their increased expressive power does not come with extra complexity. Indeed, (1) we investigate their containment and minimization problems, and show that these fundamental problems are in quadratic time for reachability queries and are in cubic time for pattern queries. (2) We develop an algorithm for answering reachability queries, in quadratic time as for their traditional counterpart. (3) We provide two cubic-time algorithms for evaluating graph pattern queries based on extended graph simulation, as opposed to the NP-completeness of graph pattern matching via subgraph isomorphism. (4) The effectiveness, efficiency and scalability of these algorithms are experimentally verified using real-life data and synthetic data.
【Keywords】: computational complexity; pattern matching; query processing; question answering (information retrieval); reachability analysis; NP-completeness; cubic-time algorithms; extended graph simulation; graph pattern matching; graph reachability pattern; pattern queries; reachability query answering; regular expressions; subgraph isomorphism; Biology; Cloning; Complexity theory; Image color analysis; Minimization; Pattern matching; Semantics
【Paper Link】 【Pages】:51-62
【Authors】: James Cheng ; Yiping Ke ; Shumo Chu ; M. Tamer Özsu
【Abstract】: The k-core of a graph is the largest subgraph in which every vertex is connected to at least k other vertices within the subgraph. Core decomposition finds the k-core of the graph for every possible k. Past studies have shown important applications of core decomposition such as in the study of the properties of large networks (e.g., sustainability, connectivity, centrality, etc.), for solving NP-hard problems efficiently in real networks (e.g., maximum clique finding, densest subgraph approximation, etc.), and for large-scale network fingerprinting and visualization. The k-core is a well-accepted concept partly because there exists a simple and efficient algorithm for core decomposition, by recursively removing the lowest degree vertices and their incident edges. However, this algorithm requires random access to the graph and hence assumes the entire graph can be kept in main memory. Nevertheless, real-world networks such as online social networks have become exceedingly large in recent years and still keep growing at a steady rate. In this paper, we propose the first external-memory algorithm for core decomposition in massive graphs. When the memory is large enough to hold the graph, our algorithm achieves comparable performance to the in-memory algorithm. When the graph is too large to be kept in the memory, our algorithm requires only O(k_max) scans of the graph, where k_max is the largest core number of the graph. We demonstrate the efficiency of our algorithm on real networks with up to 52.9 million vertices and 1.65 billion edges.
【Keywords】: approximation theory; computational complexity; graph theory; network theory (graphs); optimisation; NP-hard problem; core decomposition; external memory algorithm; inmemory algorithm; large-scale network fingerprinting; large-scale network visualization; massive graph; online social networks; real-world network; subgraph approximation; Algorithm design and analysis; Approximation algorithms; Clustering algorithms; Estimation; Memory management; Partitioning algorithms; Upper bound
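【Code Sketch】: The abstract above recalls the classic in-memory core decomposition: repeatedly remove a vertex of minimum degree and record the highest degree threshold seen so far as its core number. A minimal Python sketch of that baseline follows, assuming the graph fits in memory as an adjacency dictionary; the paper's external-memory variant is not reproduced here.

```python
from collections import defaultdict

def core_decomposition(adj):
    """In-memory peeling: repeatedly remove a vertex of minimum degree.
    adj maps each vertex to the set of its neighbours; returns vertex -> core number."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    buckets = defaultdict(set)              # current degree -> vertices with that degree
    for v, d in degree.items():
        buckets[d].add(v)
    core, removed, k = {}, set(), 0
    for _ in range(len(adj)):
        d = min(b for b in buckets if buckets[b])   # smallest non-empty bucket
        k = max(k, d)                               # core numbers never decrease during peeling
        v = buckets[d].pop()
        core[v] = k
        removed.add(v)
        for u in adj[v]:                            # removing v lowers its neighbours' degrees
            if u not in removed:
                buckets[degree[u]].discard(u)
                degree[u] -= 1
                buckets[degree[u]].add(u)
    return core

# The k-core for a given k is then {v for v, c in core_decomposition(adj).items() if c >= k}.
```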
【Paper Link】 【Pages】:63-74
【Authors】: Xian Li ; Weiyi Meng ; Clement T. Yu
【Abstract】: The Web has become the most popular place for people to acquire information. Unfortunately, it is widely recognized that the Web contains a significant amount of untruthful information. As a result, good tools are needed to help Web users determine the truthfulness of certain information. In this paper, we propose a two-step method that aims to determine whether a given statement is truthful, and if it is not, find out the truthful statement most related to the given statement. In the first step, we try to find a small number of alternative statements of the same topic as the given statement and make sure that one of these statements is truthful. In the second step, we identify the truthful statement from the given statement and the alternative statements. Both steps heavily rely on analysing various features extracted from the search results returned by a popular search engine for appropriate queries. Our experimental results show the best variation of the proposed method can achieve a precision of about 90%.
【Keywords】: Internet; query processing; T-verifier; Web; feature extraction; truthful statement; truthfulness verification; Computer science; Correlation; Genetic algorithms; Matched filters; Merging; Search engines; Web pages
【Paper Link】 【Pages】:75-86
【Authors】: Konstantinos Tsakalozos ; Herald Kllapi ; Eva Sitaridi ; Mema Roussopoulos ; Dimitris Paparas ; Alex Delis
【Abstract】: Modern frameworks, such as Hadoop, combined with an abundance of computing resources from the cloud, offer a significant opportunity to address long-standing challenges in distributed processing. Infrastructure-as-a-Service clouds reduce the investment cost of renting a large data center while distributed processing frameworks are capable of efficiently harvesting the rented physical resources. Yet, the performance users get out of these resources varies greatly because the cloud hardware is shared by all users. The value for money that cloud consumers achieve makes resource-sharing policies a key factor in both cloud performance and user satisfaction. In this paper, we employ microeconomics to direct the allotment of cloud resources for consumption in highly scalable master-worker virtual infrastructures. Our approach is developed on two premises: the cloud-consumer always has a budget and cloud physical resources are limited. Using our approach, the cloud administration is able to maximize per-user financial profit. We show that there is an equilibrium point at which our method achieves resource sharing proportional to each user's budget. Ultimately, this approach allows us to answer the question of how many resources a consumer should request from the seemingly endless pool provided by the cloud.
【Keywords】: cloud computing; microeconomics; resource allocation; Hadoop; cloud administration; cloud hardware; cloud physical resources; cloud resources; cloud-consumer; distributed processing frameworks; infrastructure-as-a-service clouds; master-worker virtual infrastructures; microeconomics; price discrimination; profit maximization; resource sharing policies; Biological system modeling; Cloud computing; Quality of service; Resource management; Software; Time factors; Virtual machining
【Paper Link】 【Pages】:87-98
【Authors】: PengCheng Xiong ; Yun Chi ; Shenghuo Zhu ; Hyun Jin Moon ; Calton Pu ; Hakan Hacigümüs
【Abstract】: In a cloud computing environment, resources are shared among different clients. Intelligently managing and allocating resources among various clients is important for system providers, whose business model relies on managing the infrastructure resources in a cost-effective manner while satisfying the client service level agreements (SLAs). In this paper, we address the issue of how to intelligently manage the resources in a shared cloud database system and present SmartSLA, a cost-aware resource management system. SmartSLA consists of two main components: the system modeling module and the resource allocation decision module. The system modeling module uses machine learning techniques to learn a model that describes the potential profit margins for each client under different resource allocations. Based on the learned model, the resource allocation decision module dynamically adjusts the resource allocations in order to achieve the optimum profits. We evaluate SmartSLA by using the TPC-W benchmark with workload characteristics derived from real-life systems. The performance results indicate that SmartSLA can successfully compute predictive models under different hardware resource allocations, such as CPU and memory, as well as database specific resources, such as the number of replicas in the database systems. The experimental results also show that SmartSLA can provide intelligent service differentiation according to factors such as variable workloads, SLA levels, and resource costs, and can deliver improved profit margins.
【Keywords】: cloud computing; database management systems; learning (artificial intelligence); resource allocation; virtualisation; SmartSLA; TPC-W benchmark; cloud computing environment; database systems; intelligent management; machine learning techniques; resource allocation decision module; resource management system; service level agreements; shared cloud database system; system modeling module; virtualized resources; Databases; Linear regression; Machine learning; Regression tree analysis; Resource management; System performance; Time factors; cloud computing; database systems; multitenant databases; virtualization
【Paper Link】 【Pages】:99-110
【Authors】: Stefan Aulbach ; Michael Seibold ; Dean Jacobs ; Alfons Kemper
【Abstract】: Software-as-a-Service applications commonly consolidate multiple businesses into the same database to reduce costs. This practice makes it harder to implement several essential features of enterprise applications. The first is support for master data, which should be shared rather than replicated for each tenant. The second is application modification and extension, which applies both to the database schema and master data it contains. The third is evolution of the schema and master data, which occurs as the application and its extensions are upgraded. These features cannot be easily implemented in a traditional DBMS and, to the extent that they are currently offered at all, they are generally implemented within the application layer. This approach reduces the DBMS to a "dumb data repository" that only stores data rather than managing it. In addition, it complicates development of the application since many DBMS features have to be re-implemented. Instead, a next-generation multi-tenant DBMS should provide explicit support for Extensibility, Data Sharing and Evolution. As these three features are strongly related, they cannot be implemented independently from each other. Therefore, we propose FLEXSCHEME which captures all three aspects in one integrated model. In this paper, we focus on efficient storage mechanisms for this model and present a novel versioning mechanism, called XOR Delta, which is based on XOR encoding and is optimized for main-memory DBMSs.
【Keywords】: business data processing; database management systems; encoding; peer-to-peer computing; service-oriented architecture; FlexScheme; XOR delta; XOR encoding; data sharing; database schema; dumb data repository; enterprise application; mainmemory DBMS; next generation multitenant DBMS; software-as-a-service application; storage mechanism; versioning mechanism; Biological system modeling; Business; Context; Data models; Data structures; Databases; Object oriented modeling
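【Code Sketch】: A minimal illustration of the XOR encoding idea that XOR Delta builds on: instead of storing a second full copy of a modified master-data record, store the bytewise XOR of the two versions; since XOR is its own inverse, either version can be reconstructed from the other plus the delta. This sketches only the encoding principle, not FLEXSCHEME's actual storage layout.

```python
def xor_delta(old: bytes, new: bytes) -> bytes:
    """Bytewise XOR of two versions, zero-padded to equal length."""
    length = max(len(old), len(new))
    old, new = old.ljust(length, b"\x00"), new.ljust(length, b"\x00")
    return bytes(a ^ b for a, b in zip(old, new))

def apply_delta(base: bytes, delta: bytes) -> bytes:
    """XOR is self-inverse: applying the delta to one version yields the other."""
    return xor_delta(base, delta)

v1 = b"shared master record, version 1"
v2 = b"shared master record, version 2 (tenant extension)"
delta = xor_delta(v1, v2)                        # stored instead of a full copy of v2
assert apply_delta(v1, delta).rstrip(b"\x00") == v2
```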
【Paper Link】 【Pages】:111-122
【Authors】: Luping Ding ; Karen Works ; Elke A. Rundensteiner
【Abstract】: Data stream management systems (DSMS) processing long-running queries over large volumes of stream data must typically deliver time-critical responses. We propose the first semantic query optimization (SQO) approach that utilizes dynamic substream metadata at runtime to find a more efficient query plan than the one selected at compilation time. We identify four SQO techniques guaranteed to result in performance gains. Based on classic satisfiability theory we then design a lightweight query optimization algorithm that efficiently detects SQO opportunities at runtime. At the logical level, our algorithm instantiates multiple concurrent SQO plans, each processing different partially overlapping substreams. Our novel execution paradigm employs multi-modal operators to support the execution of these concurrent SQO logical plans in a single physical plan. This highly agile execution strategy reduces resource utilization while supporting lightweight adaptivity. Our extensive experimental study in the CAPE stream processing system using both synthetic and real data confirms that our optimization techniques significantly reduce query execution times, up to 60%, compared to the traditional approach.
【Keywords】: Algorithm design and analysis; Cognition; Complexity theory; Optimization; Query processing; Runtime; Semantics
【Paper Link】 【Pages】:123-134
【Authors】: Mo Liu ; Elke A. Rundensteiner ; Daniel J. Dougherty ; Chetan Gupta ; Song Wang ; Ismail Ari ; Abhay Mehta
【Abstract】: Complex event processing (CEP) over event streams has become increasingly important for real-time applications ranging from health care and supply chain management to business intelligence. These monitoring applications submit complex queries to track sequences of events that match a given pattern. As these systems mature, the need for increasingly complex nested sequence query support arises, while the state-of-the-art CEP systems mostly support the execution of flat sequence queries only. To assure real-time responsiveness and scalability for pattern detection even on huge volume high-speed streams, efficient processing techniques must be designed. In this paper, we first analyze the prevailing nested pattern query processing strategy and identify several serious shortcomings. Not only are substantial subsequences first constructed just to be subsequently discarded, but also opportunities for shared execution of nested subexpressions are overlooked. As a foundation, we introduce NEEL, a CEP query language for expressing nested CEP pattern queries composed of sequence, negation, AND and OR operators. To overcome deficiencies, we design rewriting rules for pushing negation into inner subexpressions. Next, we devise a normalization procedure that employs these rules for flattening a nested complex event expression. To conserve CPU and memory consumption, we propose several strategies for efficient shared processing of groups of normalized NEEL subexpressions. These strategies include prefix caching, suffix clustering and customized “bit-marking” execution strategies. We design an optimizer to partition the set of all CEP subexpressions in a NEEL normal form into groups, each of which can then be mapped to one of our shared execution operators. Lastly, we evaluate our technologies by conducting a performance study to assess the CPU processing time using real-world stock trade data. Our results confirm that our NEEL execution in many cases performs 100-fold faster than the traditional iterative nested execution strategy for real stock market query workloads.
【Keywords】: query languages; query processing; AND operator; NEEL CEP query language; OR operator; bit-marking execution strategy; complex event processing; event streams; flat sequence query; negation operator; nested CEP query processing; nested pattern query processing strategy; nested sequence query support; nested subexpressions; normalization procedure; normalized NEEL subexpressions; prefix caching; rewriting rules; sequence operator; suffix clustering; Reactive power; Recycling; Surgery; Synthetic aperture sonar; Variable speed drives
【Paper Link】 【Pages】:135-146
【Authors】: Maria Kontaki ; Anastasios Gounaris ; Apostolos N. Papadopoulos ; Kostas Tsichlas ; Yannis Manolopoulos
【Abstract】: Anomaly detection is considered an important data mining task, aiming at the discovery of elements (also known as outliers) that show significant deviation from the expected case. More specifically, given a set of objects, the problem is to return the suspicious objects that deviate significantly from the typical behavior. As in the case of clustering, the application of different criteria leads to different definitions for an outlier. In this work, we focus on distance-based outliers: an object x is an outlier if there are fewer than k objects lying at distance at most R from x. The problem offers significant challenges when a stream-based environment is considered, where data arrive continuously and outliers must be detected on-the-fly. There are a few research works studying the problem of continuous outlier detection. However, none of these proposals meets the requirements of modern stream-based applications for the following reasons: (i) they demand a significant storage overhead, (ii) their efficiency is limited, and (iii) they lack flexibility. In this work, we propose new algorithms for continuous outlier monitoring in data streams, based on sliding windows. Our techniques are able to reduce the required storage overhead, run faster than previously proposed techniques and offer significant flexibility. Experiments performed on real-life as well as synthetic data sets verify our theoretical study.
【Keywords】: data mining; pattern clustering; security of data; anomaly detection; continuous distance-based outlier monitoring; data mining task; data streams; sliding windows; synthetic data sets; Algorithm design and analysis; Data mining; Data structures; Detection algorithms; Heuristic algorithms; Measurement; Monitoring
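【Code Sketch】: The outlier definition used above translates directly into code; a naive per-window re-evaluation is shown below as a reference point (quadratic in the window size), whereas the paper's algorithms maintain this test incrementally over sliding windows with far less storage and work.

```python
import math

def is_outlier(x, window, k, R):
    """x is a distance-based outlier if fewer than k objects of the
    current window lie within distance at most R of x."""
    neighbours = 0
    for y in window:
        if y is x:
            continue
        if math.dist(x, y) <= R:
            neighbours += 1
            if neighbours >= k:     # early exit: x already has enough neighbours
                return False
    return True

def outliers_in_window(window, k, R):
    """Naive baseline: re-check every object whenever the window slides."""
    return [x for x in window if is_outlier(x, window, k, R)]
```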
【Paper Link】 【Pages】:147-158
【Authors】: Lixing Wang ; Yin Yang ; Xin Miao ; Dimitris Papadias ; Yunhao Liu
【Abstract】: In a wireless sensor network (WSN), each sensor monitors environmental parameters, and reports its readings to a base station, possibly through other nodes. A sensor works in cycles, in each of which it stays active for a fixed duration, and then sleeps until the next cycle. The frequency of such cycles determines the portion of time that a sensor is active, and is the dominant factor in its battery life. The majority of existing work assumes a globally synchronized WSN where all sensors have the same frequency. This leads to waste of battery power for applications that entail different accuracy of measurements, or environments where sensor readings have large variability. To overcome this problem, we propose LS, a query processing framework for locally synchronized WSN. We consider that each sensor n_i has a distinct sampling frequency f_i, which is determined by the application or environment requirements. The complication of LS is that n_i has to wake up with a network frequency F_i ≥ f_i, in order to forward messages of other sensors. Our goal is to minimize the sum of F_i without delaying packet transmissions. Specifically, given a routing tree, we first present a dynamic programming algorithm that computes the optimal network frequency of each sensor; then, we develop a heuristic for finding the best tree topology, if this is not fixed in advance.
【Keywords】: dynamic programming; query processing; trees (mathematics); wireless sensor networks; LS; dynamic programming algorithm; local sensor synchronization algorithm; optimal network frequency; query processing framework; routing tree; sensor readings; tree topology; wireless sensor network; Base stations; Frequency measurement; Frequency synchronization; Routing; Synchronization; Topology; Wireless sensor networks
【Paper Link】 【Pages】:159-170
【Authors】: Mihaela A. Bornea ; Antonios Deligiannakis ; Yannis Kotidis ; Vasilis Vassalos
【Abstract】: Active data warehouses have emerged as a new business intelligence paradigm where data in the integrated repository is refreshed in near real-time. This shift of practices achieves higher consistency between the stored information and the latest updates, which in turn crucially influences the output of decision making processes. In this paper we focus on the changes required in the implementation of Extract Transform Load (ETL) operations which now need to be executed in an online fashion. In particular, the ETL transformations frequently include the join between an incoming stream of updates and a disk-resident table of historical data or metadata. In this context we propose a novel Semi-Streaming Index Join (SSIJ) algorithm that maximizes the throughput of the join by buffering stream tuples and then judiciously selecting how to best amortize expensive disk seeks for blocks of the stored relation among a large number of stream tuples. The relation blocks required for joining with the stream are loaded from disk based on an optimal plan. In order to maximize the utilization of the available memory space for performing the join, our technique incorporates a simple but effective cache replacement policy for managing the retrieved blocks of the relation. Moreover, SSIJ is able to adapt to changing characteristics of the stream (e.g., arrival rate, data distribution) by dynamically adjusting the allocated memory between the cached relation blocks and the stream. Our experiments with a variety of synthetic and real data sets demonstrate that SSIJ consistently outperforms the state-of-the-art algorithm in terms of the maximum sustainable throughput of the join while also being able to accommodate deadlines on stream tuple processing.
【Keywords】: competitive intelligence; data warehouses; decision making; information storage; meta data; storage management; ETL transformation; SSIJ; active data warehouse; business intelligence paradigm; cache replacement policy; decision making process; disk resident table; extract transform load operation; integrated data repository; memory allocation; memory space; metadata; near real time execution; semistreamed index join algorithm; stream tuple buffering; Buffer storage; Data warehouses; Heuristic algorithms; Indexes; Memory management; Radiation detectors; Throughput
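【Code Sketch】: A rough rendering of the amortization idea behind SSIJ, under assumed helpers: buffered stream tuples are grouped by the relation block their join key falls into, and blocks are fetched in order of how many buffered tuples they serve, so that one disk seek is shared by many tuples. block_of, load_block and the join_key attribute are hypothetical; the optimal block-loading plan, cache replacement policy and adaptivity of the actual algorithm are not modelled here.

```python
from collections import defaultdict

def flush_buffer(buffer, block_of, load_block):
    """Join buffered stream tuples with a disk-resident relation, one block fetch at a time."""
    pending = defaultdict(list)                  # block id -> buffered tuples waiting on it
    for t in buffer:
        pending[block_of(t.join_key)].append(t)
    results = []
    # Serve the most "popular" blocks first so each seek is amortized over many stream tuples.
    for block_id, tuples in sorted(pending.items(), key=lambda kv: -len(kv[1])):
        block_index = {r.join_key: r for r in load_block(block_id)}   # one disk seek
        for t in tuples:
            r = block_index.get(t.join_key)
            if r is not None:
                results.append((t, r))
    return results
```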
【Paper Link】 【Pages】:171-182
【Authors】: Eftychia Baikousi ; Georgios Rogkakos ; Panos Vassiliadis
【Abstract】: How similar are two data-cubes? In other words, the question under consideration is: given two sets of points in a multidimensional hierarchical space, what is the distance value between them? In this paper we explore various distance functions that can be used over multidimensional hierarchical spaces. We organize the discussed functions with respect to the properties of the dimension hierarchies, levels and values. In order to discover which distance functions are more suitable and meaningful to the users, we conducted two user studies. The first user study concerns the most preferred distance function between two values of a dimension. The findings of this study indicate that the functions that seem to fit the user needs best are characterized by the tendency to consider as closest to a point in a multidimensional space those points with the smallest shortest path with respect to the same dimension hierarchy. The second user study aimed at discovering which distance function between two data cubes is most preferred by users. The two functions that drew the attention of users were (a) the summation of distances between every cell of a cube and the most similar cell of another cube and (b) the Hausdorff distance function. Overall, the former function was preferred by users over the latter; however, the individual scores of the tests indicate that this advantage is rather narrow.
【Keywords】: data analysis; data mining; Hausdorff distance function; data-cubes; multidimensional data; multidimensional hierarchical space; similarity measures; Cities and towns; Computer science; Europe; Lattices; Taxonomy; USA Councils; Weight measurement
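【Code Sketch】: The two cube distance functions singled out by the second user study translate directly into code, given any distance d between two cells; a straightforward sketch (quadratic in the number of cells):

```python
def sum_of_closest(A, B, d):
    """(a) Sum, over every cell of cube A, of the distance to the most similar cell of B."""
    return sum(min(d(a, b) for b in B) for a in A)

def hausdorff(A, B, d):
    """(b) Hausdorff distance: the worst case, over both cubes, of the
    distance from a cell to the closest cell of the other cube."""
    directed = lambda X, Y: max(min(d(x, y) for y in Y) for x in X)
    return max(directed(A, B), directed(B, A))
```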
【Paper Link】 【Pages】:183-194
【Authors】: Arnab Nandi ; Cong Yu ; Philip Bohannon ; Raghu Ramakrishnan
【Abstract】: Cube computation over massive datasets is critical for many important analyses done in the real world. Unlike commonly studied algebraic measures such as SUM that are amenable to parallel computation, efficient cube computation of holistic measures such as TOP-K is non-trivial and often impossible with current methods. In this paper we detail real-world challenges in cube materialization tasks on Web-scale datasets. Specifically, we identify an important subset of holistic measures and introduce MR-Cube, a MapReduce based framework for efficient cube computation on these measures. We provide extensive experimental analyses over both real and synthetic data. We demonstrate that, unlike existing techniques which cannot scale to the 100 million tuple mark for our datasets, MR-Cube successfully and efficiently computes cubes with holistic measures over billion-tuple datasets.
【Keywords】: Internet; data analysis; MR-Cube; MapReduce based framework; TOP-K; Web-scale datasets; cube computation; distributed cube materialization; holistic measures; Algorithm design and analysis; Cities and towns; Current measurement; Distributed databases; Lattices; Marketing and sales; USA Councils
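【Code Sketch】: Why holistic measures are hard to cube shows up even in a toy, single-machine simulation of the naive MapReduce plan: the mapper emits every tuple once per group-by of the cube lattice, and the reducer for a holistic measure such as TOP-K must see the entire group at once (it cannot be assembled from partial aggregates the way SUM can). MR-Cube's value partitioning and batching of the lattice, which make this scale, are not shown.

```python
from collections import defaultdict
from itertools import combinations

def map_tuple(t, dims, measure):
    """Emit (group-by key, measure value) once for every subset of the dimensions."""
    for r in range(len(dims) + 1):
        for group in combinations(dims, r):
            yield tuple((d, t[d]) for d in group), t[measure]

def reduce_topk(values, k=3):
    """Holistic measure: needs all values of the group, not partial aggregates."""
    return sorted(values, reverse=True)[:k]

def toy_cube(tuples, dims, measure, k=3):
    """Single-machine stand-in for the MapReduce shuffle and reduce phases."""
    groups = defaultdict(list)
    for t in tuples:
        for key, v in map_tuple(t, dims, measure):
            groups[key].append(v)
    return {key: reduce_topk(vals, k) for key, vals in groups.items()}
```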
【Paper Link】 【Pages】:195-206
【Authors】: Alfons Kemper ; Thomas Neumann
【Abstract】: The two areas of online transaction processing (OLTP) and online analytical processing (OLAP) present different challenges for database architectures. Currently, customers with high rates of mission-critical transactions have split their data into two separate systems, one database for OLTP and one so-called data warehouse for OLAP. While allowing for decent transaction rates, this separation has many disadvantages including data freshness issues due to the delay caused by only periodically initiating the Extract-Transform-Load data staging and excessive resource consumption due to maintaining two separate information systems. We present an efficient hybrid system, called HyPer, that can handle both OLTP and OLAP simultaneously by using hardware-assisted replication mechanisms to maintain consistent snapshots of the transactional data. HyPer is a main-memory database system that guarantees the ACID properties of OLTP transactions and executes OLAP query sessions (multiple queries) on the same, arbitrarily current and consistent snapshot. The utilization of the processor-inherent support for virtual memory management (address translation, caching, copy on update) yields both at the same time: unprecedentedly high transaction rates of up to 100,000 per second and very fast OLAP query response times on a single system executing both workloads in parallel. The performance analysis is based on a combined TPC-C and TPC-H benchmark.
【Keywords】: data mining; data warehouses; database management systems; transaction processing; virtual storage; ACID properties; HyPer; OLAP query sessions; data freshness issues; data warehouse; database architectures; excessive resource consumption; extract transform load-data staging; hardware-assisted replication mechanisms; information systems; main-memory database system; mission-critical transactions; online analytical processing; online transaction processing; virtual memory snapshots; Business; Database systems; Memory management; Servers; Shadow mapping
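【Code Sketch】: The snapshot mechanism described above relies on the processor's virtual memory support; the same copy-on-update effect can be illustrated on any POSIX system with fork(): the child process keeps seeing the database exactly as it was at fork time while the parent continues applying OLTP updates, and the OS only duplicates the pages that actually change. This is an illustration of the principle only, not HyPer's implementation.

```python
import os

state = {"balance": 100}                 # stand-in for the in-memory transactional data

def run_olap_on_snapshot(query):
    pid = os.fork()                      # POSIX only: child gets a copy-on-write snapshot
    if pid == 0:                         # child: a consistent, frozen view for the OLAP session
        print("OLAP result on snapshot:", query(state))
        os._exit(0)
    state["balance"] += 1                # parent: OLTP keeps running on the live data
    os.waitpid(pid, 0)

run_olap_on_snapshot(lambda s: sum(s.values()))
```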
【Paper Link】 【Pages】:207-218
【Authors】: Ning Jin ; Wei Wang
【Abstract】: Discriminative subgraphs can be used to characterize complex graphs, construct graph classifiers and generate graph indices. The search space for discriminative subgraphs is usually prohibitively large. Most measurements of interestingness of discriminative subgraphs are neither monotonic nor antimonotonic with respect to subgraph frequencies. Therefore, branch-and-bound algorithms are unable to mine discriminative subgraphs efficiently. We discover that search history of discriminative subgraph mining is very useful in computing empirical upper-bounds of discrimination scores of subgraphs. We propose a novel discriminative subgraph mining method, LTS (Learning To Search), which begins with a greedy algorithm that first samples the search space through subgraph probing and then explores the search space in a branch and bound fashion leveraging the search history of these samples. Extensive experiments have been performed to analyze the gain in performance by taking into account search history and to demonstrate that LTS can significantly improve performance compared with the state-of-the-art discriminative subgraph mining algorithms.
【Keywords】: data mining; graph theory; greedy algorithms; learning (artificial intelligence); pattern classification; branch and bound algorithm; discriminative subgraph mining method; graph classifier; graph indices; greedy algorithm; learning to search; Accuracy; Algorithm design and analysis; Chemical compounds; Classification algorithms; Frequency estimation; History; Kernel
【Paper Link】 【Pages】:219-230
【Authors】: Tantan Liu ; Gagan Agrawal
【Abstract】: In recent years, one mode of data dissemination, the deep web, has become extremely popular. A key characteristic of deep web data sources is that data can only be accessed through the limited query interface they support. This paper develops a methodology for mining the deep web. Because these data sources cannot be accessed directly, data mining must be performed based on sampling of the datasets. The samples, in turn, can only be obtained by querying the deep web databases with specific inputs. Unlike in existing sampling-based methods, which are typically applied to relational databases or streaming data, sampling costs, rather than computation or memory costs, are the dominant consideration in designing the algorithm. In this paper, we specifically target the frequent itemset mining problem, and develop a method based on the theory of active learning. We focus on effectively obtaining a sample that can achieve a good estimation for the support values of 1-itemsets comprising an output attribute. In our method, a Bayesian network is utilized to describe the relationship between the input and the output attributes. We have evaluated our method using one synthetic and two real datasets. Our comparison shows significant gains in estimation accuracy from both the novel aspects of our work, i.e., the use of active learning and modeling a deep web source with a Bayesian network. On all three datasets, by sampling less than 10% of all data records, we could achieve more than 95% accuracy in estimating the support of frequent itemsets.
【Keywords】: Bayes methods; data mining; information dissemination; learning (artificial intelligence); query processing; relational databases; sampling methods; Bayesian network; active learning; data access; data dissemination; data mining; data streaming; dataset sampling; deep web databases; frequent itemset mining; query interface; relational databases; synthetic datasets; Accuracy; Bayesian methods; Data mining; Estimation; Itemsets; Sampling methods
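【Code Sketch】: The quantity being estimated is simple; the difficulty is that every record costs a query against the deep web interface. The sketch below estimates 1-itemset supports of an output attribute from whatever sample has been retrieved, with a uniform choice of input queries as a placeholder for the paper's active-learning selection (query_inputs and run_query are hypothetical).

```python
import random

def estimate_supports(sample, output_attribute):
    """Estimated support of each 1-itemset of an output attribute, from a sample of records."""
    counts = {}
    for record in sample:
        value = record[output_attribute]
        counts[value] = counts.get(value, 0) + 1
    n = len(sample)
    return {value: c / n for value, c in counts.items()}

def uniform_query_sample(query_inputs, run_query, budget):
    """Placeholder sampler: issue `budget` randomly chosen input queries and pool the answers."""
    records = []
    for q in random.sample(query_inputs, budget):
        records.extend(run_query(q))
    return records
```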
【Paper Link】 【Pages】:231-242
【Authors】: Amol Ghoting ; Rajasekar Krishnamurthy ; Edwin P. D. Pednault ; Berthold Reinwald ; Vikas Sindhwani ; Shirish Tatikonda ; Yuanyuan Tian ; Shivakumar Vaithyanathan
【Abstract】: MapReduce is emerging as a generic parallel programming paradigm for large clusters of machines. This trend combined with the growing need to run machine learning (ML) algorithms on massive datasets has led to an increased interest in implementing ML algorithms on MapReduce. However, the cost of implementing a large class of ML algorithms as low-level MapReduce jobs on varying data and machine cluster sizes can be prohibitive. In this paper, we propose SystemML in which ML algorithms are expressed in a higher-level language and are compiled and executed in a MapReduce environment. This higher-level language exposes several constructs including linear algebra primitives that constitute key building blocks for a broad class of supervised and unsupervised ML algorithms. The algorithms expressed in SystemML are compiled and optimized into a set of MapReduce jobs that can run on a cluster of machines. We describe and empirically evaluate a number of optimization strategies for efficiently executing these algorithms on Hadoop, an open-source MapReduce implementation. We report an extensive performance evaluation on three ML algorithms on varying data and cluster sizes.
【Keywords】: data analysis; high level languages; learning (artificial intelligence); linear algebra; optimisation; parallel programming; SystemML; data cluster; declarative machine learning; higher level language; linear algebra; machine cluster; open source MapReduce; optimization strategy; parallel programming; supervised ML algorithm; unsupervised ML algorithm; Clustering algorithms; Computer architecture; Machine learning; Machine learning algorithms; Optimization; Runtime; Semantics
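【Code Sketch】: To make the idea of a higher-level ML language concrete: a learning step written purely with linear algebra primitives (matrix products, transposes) is exactly the kind of program a SystemML-like compiler can turn into a plan of MapReduce jobs over a distributed matrix. The NumPy sketch below shows the style of specification only; it is not SystemML's actual DML syntax.

```python
import numpy as np

def linear_regression_gd(X, y, steps=100, lr=1e-3):
    """Batch gradient descent expressed as matrix primitives; each X-sized
    operation is what a MapReduce backend would parallelize."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        gradient = X.T @ (X @ w - y)     # the only data-sized operations
        w -= lr * gradient
    return w
```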
【Paper Link】 【Pages】:243-254
【Authors】: U. Kang ; Duen Horng Chau ; Christos Faloutsos
【Abstract】: How do we find patterns and anomalies on graphs with billions of nodes and edges, which do not fit in memory? How do we use parallelism for such terabyte-scale graphs? In this work, we focus on inference, which often corresponds, intuitively, to “guilt by association” scenarios. For example, if a person is a drug abuser, their friends probably are, too; if a node in a social network is of male gender, his dates are probably females. We show how to do inference on such huge graphs through our proposed HADOOP Line graph Fixed Point (HA-LFP), an efficient parallel algorithm for sparse billion-scale graphs, using the HADOOP platform. Our contributions include (a) the design of HA-LFP, observing that it corresponds to a fixed point on a line graph induced from the original graph; (b) scalability analysis, showing that our algorithm scales up well with the number of edges, as well as with the number of machines; and (c) experimental results on two private, as well as two of the largest publicly available graphs - the Web Graphs from Yahoo! (6.6 billion edges and 0.24 terabytes), and the Twitter graph (3.7 billion edges and 0.13 terabytes). We evaluated our algorithm using M45, one of the top 50 fastest supercomputers in the world, and we report patterns and anomalies discovered by our algorithm, which would be invisible otherwise.
【Keywords】: data mining; graphs; inference mechanisms; parallel algorithms; social networking (online); Hadoop Line graph Fixed Point; Twitter graph; Web Graphs; Yahoo!; graph mining; inference; parallel algorithm; scalability analysis; sparse billion-scale graphs; supercomputers; Algorithm design and analysis; Belief propagation; Convergence; Data mining; Equations; Inference algorithms; Scalability; Belief Propagation; Graph Mining; HA-LFP; Hadoop
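【Code Sketch】: "Guilt by association" inference can be illustrated, in miniature, by a simple score-propagation loop in which every node drifts toward the average score of its neighbours while staying anchored to its prior. HA-LFP itself computes a belief-propagation fixed point on the induced line graph, in parallel on HADOOP; the sketch below only conveys the single-machine intuition.

```python
def propagate(adj, prior, iterations=10, damping=0.5):
    """adj: node -> list of neighbours; prior: node -> prior score (e.g. 1.0 = known positive).
    Returns a score per node after a few rounds of neighbourhood averaging."""
    score = {v: prior.get(v, 0.5) for v in adj}        # 0.5 = no prior knowledge
    for _ in range(iterations):
        new_score = {}
        for v, nbrs in adj.items():
            avg = sum(score[u] for u in nbrs) / len(nbrs) if nbrs else 0.5
            new_score[v] = damping * prior.get(v, 0.5) + (1 - damping) * avg
        score = new_score
    return score
```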
【Paper Link】 【Pages】:255-266
【Authors】: Badrish Chandramouli ; Jonathan Goldstein ; Roger S. Barga ; Mirek Riedewald ; Ivo Santos
【Abstract】: A distributed event processing system consists of one or more nodes (machines), and can execute a directed acyclic graph (DAG) of operators called a dataflow (or query), over long-running high-event-rate data sources. An important component of such a system is cost estimation, which predicts or estimates the “goodness” of a given input, i.e., operator graph and/or assignment of individual operators to nodes. Cost estimation is the foundation for solving many problems: optimization (plan selection and distributed operator placement), provisioning, admission control, and user reporting of system misbehavior. Latency is a significant user metric in many commercial real-time applications. Users are usually interested in quantiles of latency, such as worst-case or 99th percentile. However, existing cost estimation techniques for event-based dataflows use metrics that, while they may have the side-effect of being correlated with latency, do not directly or provably estimate latency. In this paper, we propose a new cost estimation technique using a metric called Mace (Maximum cumulative excess). Mace is provably equivalent to maximum system latency in a (potentially complex, multi-node) distributed event-based system. The close relationship to latency makes Mace ideal for addressing the problems described earlier. Experiments with real-world datasets on Microsoft StreamInsight deployed over 1-13 nodes in a data center validate our ability to closely estimate latency (within 4%), and the use of Mace for plan selection and distributed operator placement.
【Keywords】: costing; data flow graphs; directed graphs; distributed processing; optimisation; query processing; Mace; Microsoft StreamInsight; admission control; commercial real time application; cost estimation; directed acyclic graph; distributed event processing system; event based dataflow; high event rate data source; latency estimation; maximum cumulative excess; maximum system latency; operator graph; Estimation; Measurement; Nickel; Optimal scheduling; Real time systems; Runtime; Silicon
【Paper Link】 【Pages】:267-278
【Authors】: Albert Yu ; Pankaj K. Agarwal ; Jun Yang
【Abstract】: We study the problem of assigning subscribers to brokers in a wide-area content-based publish/subscribe system. A good assignment should consider both subscriber interests in the event space and subscriber locations in the network space, and balance multiple performance criteria including bandwidth, delay, and load balance. The resulting optimization problem is NP-complete, so systems have turned to heuristics and/or simpler algorithms that ignore some performance criteria. Evaluating these approaches has been challenging because optimal solutions remain elusive for realistic problem sizes. To enable proper evaluation, we develop a Monte Carlo approximation algorithm with good theoretical properties and robustness to workload variations. To make it computationally feasible, we combine the ideas of linear programming, randomized rounding, coreset, and iterative reweighted sampling. We demonstrate how to use this algorithm as a yardstick to evaluate other algorithms, and why it is better than other choices of yardsticks. With its help, we show that a simple greedy algorithm works well for a number of workloads, including one generated from publicly available statistics on Google Groups. We hope that our algorithms are not only useful in their own right, but our principled approach toward evaluation will also be useful in future evaluation of solutions to similar problems in content-based publish/subscribe.
【Keywords】: Monte Carlo methods; approximation theory; linear programming; message passing; middleware; publishing; sampling methods; wide area networks; Google Groups; Monte Carlo approximation algorithm; NP-complete problem; iterative reweighted sampling; linear programming; network space; optimization; randomized rounding; subscriber assignment; subscriber locations; wide-area content-based publish/subscribe; yardstick; Approximation algorithms; Bandwidth; Bismuth; Complexity theory; Filtering algorithms; Filtering theory; Subscriptions
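【Code Sketch】: For concreteness, a hypothetical greedy baseline of the kind such a Monte Carlo yardstick would be compared against: each subscriber is assigned to the cheapest broker that still has capacity, under some combined cost of interest mismatch and network delay. The cost functions, capacity model and ordering are assumptions for illustration; this is not the specific greedy algorithm evaluated in the paper.

```python
def greedy_assign(subscribers, brokers, interest_cost, delay_cost, capacity):
    """Assign each subscriber to the currently cheapest broker with spare capacity."""
    load = {b: 0 for b in brokers}
    assignment = {}
    for s in subscribers:
        candidates = [b for b in brokers if load[b] < capacity[b]]
        best = min(candidates, key=lambda b: interest_cost(s, b) + delay_cost(s, b))
        assignment[s] = best
        load[best] += 1
    return assignment
```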
【Paper Link】 【Pages】:279-290
【Authors】: Qijun Zhu ; Dik Lun Lee ; Wang-Chien Lee
【Abstract】: We propose a novel collaborative caching framework to support spatial query processing in Mobile Peer-to-Peer Networks (MP2PNs). To maximize cache sharing among clients, each client caches not only data objects but also parts of the index structure built on the spatial objects. Thus, we call the proposed method structure-embedded collaborative caching (SECC). By introducing a novel index structure called Signature Augment Tree (SAT), we address two crucial issues in SECC. First, we propose a cost-efficient collaborative query processing method in MP2PNs, including peer selection and result merge from multiple peers. Second, we develop a novel collaborative cache replacement policy which maximizes cache effectiveness by considering not only the peer itself but also its neighbors. We implement two SECC schemes, namely, the periodical and adaptive SAT-based schemes, with different SAT maintenance policies. Simulation results show that our SECC schemes significantly outperform other collaborative caching methods which are based on existing spatial caching schemes in a number of metrics, including traffic volume, query latency and power consumption.
【Keywords】: cache storage; client-server systems; groupware; mobile computing; peer-to-peer computing; query processing; trees (mathematics); SAT maintenance policies; SAT-based schemes; collaborative query processing; data objects; mobile P2P networks; mobile peer-to-peer networks; peer selection; signature augment tree; spatial queries; spatial query processing; structure-embedded collaborative caching; Collaboration; Indexes; Mobile communication; Peer to peer computing; Query processing; Servers; Spatial databases
【Paper Link】 【Pages】:291-302
【Authors】: Yu Cao ; Chun Chen ; Fei Guo ; Dawei Jiang ; Yuting Lin ; Beng Chin Ooi ; Hoang Tam Vo ; Sai Wu ; Quanqing Xu
【Abstract】: Cloud computing represents a paradigm shift driven by the increasing demand of Web based applications for elastic, scalable and efficient system architectures that can efficiently support their ever-growing data volume and large-scale data analysis. A typical data management system has to deal with real-time updates by individual users, as well as periodical large-scale analytical processing, indexing, and data extraction. While such operations may take place in the same domain, the design and development of the systems have somehow evolved independently for transactional and periodical analytical processing. Such a system-level separation has resulted in problems such as data freshness as well as serious data storage redundancy. Ideally, it would be more efficient to apply ad-hoc analytical processing on the same data directly. However, to the best of our knowledge, such an approach has not been adopted in real implementations. Intrigued by such an observation, we have designed and implemented epiC, an elastic power-aware data-intensive Cloud platform for supporting both data intensive analytical operations (referred to as OLAP) and online transactions (referred to as OLTP). In this paper, we present ES2 - the elastic data storage system of epiC, which is designed to support both functionalities within the same storage. We present the system architecture and the functions of each system component, and experimental results which demonstrate the efficiency of the system.
【Keywords】: cloud computing; data mining; transaction processing; ES2; OLAP; OLTP; ad-hoc analytical processing; cloud computing; cloud data storage system; data intensive analytical operations; data management system; data storage redundancy; epiC; online analytical processing; online transaction processing; periodical analytical processing; power-aware data-intensive cloud platform; transactional analytical processing; Access control; Catalogs; Data models; Distributed databases; Indexing
【Paper Link】 【Pages】:303-314
【Authors】: Julia Stoyanovich ; Susan B. Davidson ; Tova Milo ; Val Tannen
【Abstract】: Many real-world applications deal with uncertain or missing data, prompting a surge of activity in the area of probabilistic databases. A shortcoming of prior work is the assumption that an appropriate probabilistic model, along with the necessary probability distributions, is given. We address this shortcoming by presenting a framework for learning a set of inference ensembles, termed meta-rule semi-lattices, or MRSL, from the complete portion of the data. We use the MRSL to infer probability distributions for missing data, and demonstrate experimentally that high accuracy is achieved when a single attribute value is missing per tuple. We next propose an inference algorithm based on Gibbs sampling that accurately predicts the probability distribution for multiple missing values. We also develop an optimization that greatly improves performance of multi-attribute inference for collections of tuples, while maintaining high accuracy. Finally, we develop an experimental framework to evaluate the efficiency and accuracy of our approach.
【Keywords】: inference mechanisms; statistical databases; statistical distributions; Gibbs sampling; MRSL; inference ensemble algorithm; meta-rule semilattices; missing data; probabilistic database model; probability distributions; tuple collection; Accuracy; Association rules; Computational modeling; Itemsets; Probabilistic logic; Probability distribution
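【Code Sketch】: A generic Gibbs sampling loop for several missing attributes of one tuple: each missing attribute is repeatedly resampled conditioned on the current values of all the others, and the post-burn-in draws approximate the marginal distributions. conditional_sampler stands in for whatever conditional model is available (in the paper, one derived from the learned MRSL), and the record is assumed to start with an initial guess for every missing attribute.

```python
def gibbs_missing(record, missing_attrs, conditional_sampler, burn_in=100, draws=500):
    """Approximate the marginal distribution of each missing attribute of `record`."""
    counts = {a: {} for a in missing_attrs}
    for it in range(burn_in + draws):
        for a in missing_attrs:
            record[a] = conditional_sampler(a, record)   # sample a | all other current values
        if it >= burn_in:
            for a in missing_attrs:
                counts[a][record[a]] = counts[a].get(record[a], 0) + 1
    return {a: {v: n / draws for v, n in c.items()} for a, c in counts.items()}
```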
【Paper Link】 【Pages】:315-326
【Authors】: Robert Fink ; Dan Olteanu ; Swaroop Rath
【Abstract】: Extensive work has recently been done on the evaluation of positive queries on probabilistic databases. The case of queries with negation has notoriously been left out, since it raises serious additional challenges to efficient query evaluation. This paper provides a complete framework for the evaluation of full relational algebra queries in probabilistic databases. In particular, it proposes exact and approximate evaluation techniques for relational algebra queries on representation systems that can accommodate any finite probability space over relational databases. Key ingredients to these techniques are (1) the manipulation of nested propositional expressions used for probability computation without unfolding them into disjunctive normal form, and (2) efficient computation of lower and upper probability bounds of such expressions by deriving coarser expressions in tractable theories such as one occurrence form. We complement our evaluation techniques with a tractability map for relational algebra queries without repeating relation symbols and for quantified queries such as set inclusion, equality, incomparability, and relational division, which are expressible in relational algebra using nested negation and repeating relation symbols. Based on this tractability study, we syntactically define a practical class of tractable relational algebra queries. We incorporated this framework in the SPROUT engine and show its efficiency experimentally in TPC-H and RFID scenarios.
【Keywords】: probability; query processing; relational algebra; relational databases; SPROUT engine; approximate evaluation techniques; exact evaluation techniques; finite probability space; full relational algebra queries; positive query evaluation; probabilistic databases; relational databases; representation systems; Algebra; Approximation methods; Cost accounting; Probabilistic logic; Query processing; Random variables
【Paper Link】 【Pages】:327-338
【Authors】: Saket Sathe ; Hoyoung Jeung ; Karl Aberer
【Abstract】: Although efficient processing of probabilistic databases is a well-established field, a wide range of applications are still unable to benefit from these techniques due to the lack of means for creating probabilistic databases. In fact, it is a challenging problem to associate concrete probability values with given time-series data for forming a probabilistic database, since the probability distributions used for deriving such probability values vary over time. In this paper, we propose a novel approach to create tuple-level probabilistic databases from (imprecise) time-series data. To the best of our knowledge, this is the first work that introduces a generic solution for creating probabilistic databases from arbitrary time series, which can work in online as well as offline fashion. Our approach consists of two key components. First, dynamic density metrics infer time-dependent probability distributions for time series based on various mathematical models; our main metric, called the GARCH metric, robustly captures such evolving probability distributions even in the presence of erroneous values in the given time series. Second, the Ω-View builder creates probabilistic databases from the probability distributions inferred by the dynamic density metrics. For efficient processing, we introduce the σ-cache that reuses the information derived from probability values generated at previous times. Extensive experiments over real datasets demonstrate the effectiveness of our approach.
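The GARCH metric mentioned above builds on the standard GARCH(1,1) variance recursion. A minimal, generic sketch of that recursion follows (illustrative parameter values; this is not the paper's full dynamic density metric or the Ω-View builder):

```python
def garch_11_variances(residuals, omega=1e-6, alpha=0.1, beta=0.85):
    """Classic GARCH(1,1) recursion: var_t = omega + alpha*e_{t-1}^2 + beta*var_{t-1}.

    `residuals` are the observed errors e_t of the time series; the output is one
    variance estimate per time step, usable as the spread of a distribution that
    describes the uncertain value at that step. Parameters must satisfy
    alpha + beta < 1 for the unconditional variance below to exist.
    """
    if not residuals:
        return []
    var = omega / (1.0 - alpha - beta)     # start from the unconditional variance
    variances = [var]
    for e in residuals[:-1]:
        var = omega + alpha * e * e + beta * var
        variances.append(var)
    return variances
```

Each resulting variance can then be paired with the observed value to define a time-dependent distribution from which tuple-level probabilities are derived, which is roughly the role the Ω-View builder plays.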
【Keywords】: database management systems; statistical distributions; time series; Ω-View builder; σ-cache; GARCH metric; dynamic density metrics; imprecise time-series data; time-dependent probability distributions; tuple-level probabilistic database creation; Databases; Mathematical model; Measurement; Probabilistic logic; Probability density function; Probability distribution; Time series analysis
【Paper Link】 【Pages】:339-350
【Authors】: Thomas Bernecker ; Tobias Emrich ; Hans-Peter Kriegel ; Nikos Mamoulis ; Matthias Renz ; Andreas Züfle
【Abstract】: In this paper, we propose a novel, effective and efficient probabilistic pruning criterion for probabilistic similarity queries on uncertain data. Our approach supports a general uncertainty model using continuous probabilistic density functions to describe the (possibly correlated) uncertain attributes of objects. In a nutshell, the problem to be solved is to compute the PDF of the random variable denoted by the probabilistic domination count: Given an uncertain database object B, an uncertain reference object R and a set D of uncertain database objects in a multi-dimensional space, the probabilistic domination count denotes the number of uncertain objects in D that are closer to R than B. This domination count can be used to answer a wide range of probabilistic similarity queries. Specifically, we propose a novel geometric pruning filter and introduce an iterative filter-refinement strategy for conservatively and progressively estimating the probabilistic domination count in an efficient way while keeping correctness according to the possible world semantics. In an experimental evaluation, we show that our proposed technique acquires tight probability bounds for the probabilistic domination count quickly, even for large uncertain databases.
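The quantity bounded by the proposed filters is the distribution of the probabilistic domination count. As a point of reference, the brute-force Monte Carlo estimate that the geometric pruning is designed to avoid can be sketched as follows; `sample_B`, `sample_R` and the entries of `sample_D` are hypothetical callables drawing one instance from each object's density.

```python
import math
from collections import Counter

def domination_count_pdf(sample_B, sample_R, sample_D, n_samples=10000):
    """Monte Carlo estimate of P(domination count = k).

    sample_B, sample_R -- callables returning one random instance (a point tuple)
    sample_D           -- list of callables, one per uncertain database object in D
    """
    counts = Counter()
    for _ in range(n_samples):
        b, r = sample_B(), sample_R()
        dist_br = math.dist(b, r)
        # Count objects in D whose sampled instance is closer to R than B is.
        k = sum(1 for sample_o in sample_D if math.dist(sample_o(), r) < dist_br)
        counts[k] += 1
    return {k: c / n_samples for k, c in sorted(counts.items())}
```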
【Keywords】: iterative methods; probability; query processing; uncertainty handling; very large databases; iterative filter-refinement strategy; large uncertain databases; probabilistic density functions; probabilistic domination count; probabilistic pruning approach; random variable; semantics; similarity queries; Approximation methods; Databases; Handheld computers; Nearest neighbor searches; Probabilistic logic; Random variables; Uncertainty
【Paper Link】 【Pages】:351-362
【Authors】: Ju Fan ; Guoliang Li ; Lizhu Zhou
【Abstract】: SQL is a classical and powerful tool for querying relational databases. However, it is rather hard for inexperienced users to pose SQL queries, as they are required to be proficient in SQL syntax and have a thorough understanding of the underlying schema. To address this problem, we propose SQLSUGG, an effective and user-friendly keyword-based method to help various users formulate SQL queries. SQLSUGG suggests SQL queries as users type in keywords, and can save users' typing efforts and help users avoid tedious SQL debugging. To achieve high suggestion effectiveness, we propose queryable templates to model the structures of SQL queries. We propose a template ranking model to suggest templates relevant to query keywords. We generate SQL queries from each suggested template based on the degree of matching between keywords and attributes. For efficiency, we propose a progressive algorithm to compute top-k templates, and devise an efficient method to generate SQL queries from templates. We have implemented our methods on two real data sets, and the experimental results show that our method achieves high effectiveness and efficiency.
【Keywords】: SQL; query processing; relational databases; user interfaces; SQL query suggestion; SQLSUGG method; Structured Query Language; query keywords; queryable templates; relational databases; top-k templates; user-friendly database; Aggregates; Databases; Estimation; Frequency estimation; Generators; Keyword search; Syntactics
【Paper Link】 【Pages】:363-374
【Authors】: Lu Qin ; Jeffrey Xu Yu ; Lijun Chang
【Abstract】: Keyword search in RDBs has been extensively studied in recent years. The existing studies focused on finding all or top-k interconnected tuple-structures that contain keywords. In reality, the number of such interconnected tuple-structures for a keyword query can be large. It becomes very difficult for users to obtain any valuable information beyond the individual interconnected tuple-structures. It also becomes challenging to provide a mechanism similar to group-&-aggregate for those interconnected tuple-structures. In this paper, we study computing structural statistics keyword queries by extending the group-&-aggregate framework. We consider an RDB as a large directed graph where nodes represent tuples, and edges represent the links among tuples. Instead of grouping individual tuples, we group rooted subgraphs. Such a rooted subgraph represents an interconnected tuple-structure among tuples, some of which contain keywords. The dimensions of the rooted subgraphs are determined by dimensional-keywords in a data driven fashion. Two rooted subgraphs are grouped into the same group if they are isomorphic with respect to the dimensions, in other words the dimensional-keywords. The scores of the rooted subgraphs are computed by a user-given score function if the rooted subgraphs contain some of the general keywords. Here, the general keywords are used to compute scores rather than to determine dimensions. The aggregates are computed using an SQL aggregate function for every group based on the scores computed. We give our motivation using a real dataset. We propose new approaches to compute structural statistics keyword queries, perform extensive performance studies using two large real datasets and a large synthetic dataset, and confirm the effectiveness and efficiency of our approach.
【Keywords】: SQL; directed graphs; query processing; relational databases; statistical analysis; SQL aggregate function; directed graph; group-&-aggregate framework; keyword query; keyword search; relational databases; rooted subgraphs; structural statistics; top-k interconnected tuple-structures; user-given score function; Aggregates; Cities and towns; Computers; Keyword search; Monitoring; Relational databases
【Paper Link】 【Pages】:375-386
【Authors】: Mahendra Chavan ; Ravindra Guravannavar ; Karthik Ramachandra ; S. Sudarshan
【Abstract】: Synchronous execution of queries or Web service requests forces the calling application to block until the query/request is satisfied. The performance of applications can be significantly improved by asynchronous submission of queries, which allows the application to perform other processing instead of blocking while the query is executed, and to concurrently issue multiple queries. Concurrent submission of multiple queries can allow the query execution engine to better utilize multiple processors and disks, and to reorder disk IO requests to minimize seeks. Concurrent submission also reduces the impact of network round-trip latency and delays at the database, when processing multiple queries. However, manually writing applications to exploit asynchronous query submission is tedious. In this paper we address the issue of automatically transforming a program written assuming synchronous query submission, to one that exploits asynchronous query submission. Our program transformation method is based on dataflow analysis and is framed as a set of transformation rules. Our rules can handle query executions within loops, unlike some of the earlier work in this area. We have built a tool that implements our transformation techniques on Java code that uses JDBC calls; our tool can be extended to handle Web service calls. We have carried out a detailed experimental study on several real-life applications rewritten using our transformation techniques. The experimental study shows the effectiveness of the proposed rewrite techniques, both in terms of their applicability and performance gains achieved.
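The paper's transformation targets Java programs with JDBC calls; purely as an illustration of the rewrite's effect, the sketch below contrasts a blocking loop with an asynchronous-submission variant in Python, where `connect` is a hypothetical DB-API connection factory and the SQL text, table and parameter names are made up.

```python
from concurrent.futures import ThreadPoolExecutor

def run_query(connect, sql, params):
    """Hypothetical blocking call: open a connection, run one query, return its rows."""
    conn = connect()                 # one connection per call/worker thread
    try:
        cur = conn.cursor()
        cur.execute(sql, params)     # DB-API "qmark" parameter style assumed
        return cur.fetchall()
    finally:
        conn.close()

# Synchronous version: each loop iteration blocks on a database round trip.
def totals_sync(connect, customer_ids):
    return [run_query(connect, "SELECT SUM(amount) FROM orders WHERE cid = ?", (cid,))
            for cid in customer_ids]

# Transformed version: queries are submitted up front and their results consumed
# later, overlapping round-trip latency and letting the server reorder its work.
def totals_async(connect, customer_ids, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_query, connect,
                               "SELECT SUM(amount) FROM orders WHERE cid = ?", (cid,))
                   for cid in customer_ids]
        return [f.result() for f in futures]
```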
【Keywords】: Java; Web services; data flow analysis; program compilers; query processing; rewriting systems; JDBC calls; Java code; Web service calls; Web service requests; asynchronous query submission; dataflow analysis; multiple processors; network delays; network round-trip latency; program transformations; query execution engine; rewrite techniques; Computational modeling; Databases; Engines; Observers; Prefetching; Servers
【Paper Link】 【Pages】:387-398
【Authors】: Atish Das Sarma ; Ashwin Lall ; Danupon Nanongkai ; Richard J. Lipton ; Jun (Jim) Xu
【Abstract】: The study of skylines and their variants has received considerable attention in recent years. Skylines are essentially sets of most interesting (undominated) tuples in a database. However, since the skyline is often very large, much research effort has been devoted to identifying a smaller subset of (say k) “representative skyline” points. Several different definitions of representative skylines have been considered. Most of these formulations are intuitive in that they try to achieve some kind of clustering “spread” over the entire skyline, with k points. In this work, we take a more principled approach in defining the representative skyline objective. One of our main contributions is to formulate the problem of displaying k representative skyline points such that the probability that a random user would click on one of them is maximized. Two major research questions arise naturally from this formulation. First, how does one mathematically model the likelihood with which a user is interested in and will "click" on a certain tuple? Second, how does one negotiate the absence of the knowledge of an explicit set of target users; in particular what do we mean by "a random user"? To answer the first question, we model users based on a novel formulation of threshold preferences which we will motivate further in the paper. To answer the second question, we assume a probability distribution of users instead of a fixed set of users. While this makes the problem harder, it lends more mathematical structures that can be exploited as well, as one can now work with probabilities of thresholds and handle cumulative density functions. On the theoretical front, our objective is NP-hard. For the case of a finite set of users with known thresholds, we present a simple greedy algorithm that attains an approximation ratio of (1 - 1/e) of the optimal. For the case of user distributions, we show that a careful yet similar greedy algorithm achieves the same approximation ratio. Unfortunately, it turns out that this algorithm is rather involved and computationally expensive. So we present a threshold sampling based algorithm that is more computationally affordable and, for any fixed ε > 0, has an approximation ratio of (1 - 1/e - ε). We perform experiments on both real and synthetic data to show that our algorithm significantly outperforms previously proposed approaches.
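For a finite set of users with known thresholds, the (1 - 1/e) guarantee comes from greedily maximizing a coverage-style objective. A minimal sketch under that reading is given below; `clicks(user, point)` is a hypothetical predicate encoding the threshold preference model, users are assumed hashable (e.g., user ids), and none of the paper's distributional or sampling machinery is included.

```python
def greedy_representatives(skyline_points, users, clicks, k):
    """Greedily pick k skyline points to maximize the number of users who would
    click on at least one of them (standard greedy max-coverage, which attains a
    (1 - 1/e) approximation of the optimum for this objective)."""
    chosen, covered = [], set()
    for _ in range(k):
        best_point, best_gain = None, -1
        for p in skyline_points:
            if p in chosen:
                continue
            # Marginal gain: users not yet covered who would click on p.
            gain = sum(1 for u in users if u not in covered and clicks(u, p))
            if gain > best_gain:
                best_point, best_gain = p, gain
        if best_point is None:        # fewer candidate points than k
            break
        chosen.append(best_point)
        covered.update(u for u in users if clicks(u, best_point))
    return chosen
```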
【Keywords】: approximation theory; greedy algorithms; query processing; statistical distributions; approximation ratio; greedy algorithm; mathematical structures; mathematically model; probability distribution; random user; representative skylines; threshold-based preference distributions; Approximation algorithms; Approximation methods; Databases; Density functional theory; Greedy algorithms; Optimized production technology; Probability density function
【Paper Link】 【Pages】:399-409
【Authors】: Charu C. Aggarwal ; Yuchen Zhao ; Philip S. Yu
【Abstract】: A number of applications in social networks, telecommunications, and mobile computing create massive streams of graphs. In many such applications, it is useful to detect structural abnormalities which are different from the “typical” behavior of the underlying network. In this paper, we provide first results on the problem of structural outlier detection in massive network streams. Such problems are inherently challenging because of the high volume of the underlying network stream, and the streaming scenario further increases the computational challenges of outlier detection. We use a structural connectivity model in order to define outliers in graph streams. To handle the sparsity problem of massive networks, we dynamically partition the network in order to construct statistically robust models of the connectivity behavior. We design a reservoir sampling method to maintain structural summaries of the underlying network. These structural summaries are designed to serve as robust, dynamic and efficient models for outlier detection in graph streams. We present experimental results illustrating the effectiveness and efficiency of our approach.
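Reservoir sampling is the building block used to maintain bounded structural summaries over the edge stream. A generic edge reservoir (without the paper's partition-aware summaries) can be sketched as:

```python
import random

class EdgeReservoir:
    """Maintain a uniform random sample of at most `capacity` edges from an
    unbounded graph stream."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.edges = []
        self.seen = 0

    def add(self, edge):
        self.seen += 1
        if len(self.edges) < self.capacity:
            self.edges.append(edge)
        else:
            # Replace an existing edge with probability capacity / seen, which
            # keeps every edge seen so far equally likely to be in the sample.
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.edges[j] = edge
```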
【Keywords】: media streaming; mobile computing; network theory (graphs); sampling methods; social networking (online); graph streams; massive network streams; mobile computing; reservoir sampling method; social networks; sparsity problem; structural connectivity model; structural outlier detection; telecommunications; Estimation; Image edge detection; Probability; Reservoirs; Robustness; Sampling methods; Social network services
【Paper Link】 【Pages】:410-421
【Authors】: Ye Wang ; Srinivasan Parthasarathy ; Shirish Tatikonda
【Abstract】: Outlier detection is fundamental to a variety of database and analytic tasks. Recently, distance-based outlier detection has emerged as a viable and scalable alternative to traditional statistical and geometric approaches. In this article we explore the role of ranking for the efficient discovery of distance-based outliers from large high dimensional data sets. Specifically, we develop a light-weight ranking scheme that is powered by locality sensitive hashing, which reorders the database points according to their likelihood of being an outlier. We provide theoretical arguments to justify the rationale for the approach and subsequently conduct an extensive empirical study highlighting the effectiveness of our approach over extant solutions. We show that our ranking scheme improves the efficiency of the distance-based outlier discovery process by up to 5-fold. Furthermore, we find that using our approach the top outliers can often be isolated very quickly, typically by scanning less than 3% of the data set.
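One way to realize an LSH-driven ranking, sketched below with standard p-stable (random projection) hashing, is to examine first the points that share buckets with few other points across several hash tables; this is an illustrative approximation, not the paper's exact scheme or parameterization.

```python
import random
from collections import Counter

def lsh_outlier_ranking(points, n_tables=10, width=4.0, seed=0):
    """Rank point indices by an LSH-based outlierness score: the smaller the
    population of the buckets a point falls into across hash tables, the more
    isolated it likely is, and the earlier it is examined."""
    rng = random.Random(seed)
    dim = len(points[0])
    scores = [0] * len(points)
    for _ in range(n_tables):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]      # random projection vector
        b = rng.uniform(0.0, width)
        keys = [int((sum(x * y for x, y in zip(p, a)) + b) // width) for p in points]
        bucket_sizes = Counter(keys)
        for i, key in enumerate(keys):
            scores[i] += bucket_sizes[key]                  # co-bucketed population
    # Smaller accumulated bucket population => ranked earlier as likely outlier.
    return sorted(range(len(points)), key=lambda i: scores[i])
```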
【Keywords】: data handling; file organisation; distance-based outlier detection; distance-based outlier discovery process; light-weight ranking scheme; locality sensitive hashing; locality sensitive outlier detection; ranking driven approach; Algorithm design and analysis; Approximation algorithms; Artificial neural networks; Clustering algorithms; Databases; Nearest neighbor searches; Optimization
【Paper Link】 【Pages】:422-433
【Authors】: Bin Jiang ; Jian Pei
【Abstract】: This paper studies the problem of outlier detection on uncertain data. We start with a comprehensive model considering both uncertain objects and their instances. An uncertain object has some inherent attributes and consists of a set of instances which are modeled by a probability density distribution. We detect outliers at both the instance level and the object level. To detect outlier instances, it is a prerequisite to know normal instances. By assuming that uncertain objects with similar properties tend to have similar instances, we learn the normal instances for each uncertain object using the instances of objects with similar properties. Consequently, outlier instances can be detected by comparing against normal ones. Furthermore, we can detect outlier objects most of whose instances are outliers. Technically, we use a Bayesian inference algorithm to solve the problem, and develop an approximation algorithm and a filtering algorithm to speed up the computation. An extensive empirical study on both real data and synthetic data verifies the effectiveness and efficiency of our algorithms.
【Keywords】: Bayes methods; data handling; inference mechanisms; statistical distributions; Bayesian inference algorithm; approximation algorithm; filtering algorithm; instance level; object level; outlier detection; probability density distribution; uncertain data; uncertain objects; Approximation algorithms; Approximation methods; Equations; Inference algorithms; Kernel; Mathematical model; Temperature measurement
【Paper Link】 【Pages】:434-445
【Authors】: Emmanuel Müller ; Matthias Schiffer ; Thomas Seidl
【Abstract】: Outlier mining is an important data analysis task to distinguish exceptional outliers from regular objects. For outlier mining in the full data space, there are well established methods which are successful in measuring the degree of deviation for outlier ranking. However, in recent applications traditional outlier mining approaches miss outliers that are hidden in subspace projections. In particular, outlier ranking approaches that measure deviation on all available attributes miss outliers deviating from their local neighborhood only in subsets of the attributes. In this work, we propose a novel outlier ranking based on an object's deviation in a statistically selected set of relevant subspace projections. This ensures that objects deviating in multiple relevant subspaces are found, while irrelevant projections showing no clear contrast between outliers and the residual objects are excluded. Thus, we tackle the general challenges of detecting outliers hidden in subspaces of the data. We provide a selection of subspaces with high contrast and propose a novel ranking based on an adaptive degree of deviation in arbitrary subspaces. In thorough experiments on real and synthetic data we show that our approach outperforms competing outlier ranking approaches by detecting outliers in arbitrary subspace projections.
【Keywords】: data analysis; outlier ranking; statistical selection; subspace projection; synthetic data; Variable speed drives
【Paper Link】 【Pages】:446-457
【Authors】: Fei Chiang ; Renée J. Miller
【Abstract】: Integrity constraints play an important role in data design. However, in an operational database, they may not be enforced for many reasons. Hence, over time, data may become inconsistent with respect to the constraints. To manage this, several approaches have proposed techniques to repair the data, by finding minimal or lowest cost changes to the data that make it consistent with the constraints. Such techniques are appropriate for the old world where data changes, but schemas and their constraints remain fixed. In many modern applications however, constraints may evolve over time as application or business rules change, as data is integrated with new data sources, or as the underlying semantics of the data evolves. In such settings, when an inconsistency occurs, it is no longer clear if there is an error in the data (and the data should be repaired), or if the constraints have evolved (and the constraints should be repaired). In this work, we present a novel unified cost model that allows data and constraint repairs to be compared on an equal footing. We consider repairs over a database that is inconsistent with respect to a set of rules, modeled as functional dependencies (FDs). FDs are the most common type of constraint, and are known to play an important role in maintaining data quality. We evaluate the quality and scalability of our repair algorithms over synthetic data and present a qualitative case study using a well-known real dataset. The results show that our repair algorithms not only scale well for large datasets, but are able to accurately capture and correct inconsistencies, and accurately decide when a data repair versus a constraint repair is best.
【Keywords】: data integrity; constraint repair; data design; data quality; data repair; data sources; functional dependencies; integrity constraints; unified cost model; Cities and towns; Computational modeling; Data models; Databases; Maintenance engineering; Redundancy; Semantics
【Paper Link】 【Pages】:458-469
【Authors】: Jiannan Wang ; Guoliang Li ; Jianhua Feng
【Abstract】: String similarity join, which finds similar string pairs between two string sets, is an essential operation in many applications, and has attracted significant attention recently in the database community. A significant challenge in similarity join is to implement an effective fuzzy match operation to find all similar string pairs which may not match exactly. In this paper, we propose a new similarity metric, called “fuzzy token matching based similarity”, which extends token-based similarity functions (e.g., Jaccard similarity and Cosine similarity) by allowing fuzzy match between two tokens. We study the problem of similarity join using this new similarity metric and present a signature-based method to address this problem. We propose new signature schemes and develop effective pruning techniques to improve the performance. Experimental results show that our approach achieves high efficiency and result quality, and significantly outperforms state-of-the-art methods.
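A simplified version of fuzzy-token Jaccard can be sketched as follows; here tokens are paired greedily when an edit-style similarity exceeds a threshold `delta`, whereas the paper's definition uses an optimal token matching together with signature-based filtering, both omitted here.

```python
import difflib

def token_sim(t1, t2):
    """Character-level similarity in [0, 1] between two tokens (illustrative choice)."""
    return difflib.SequenceMatcher(None, t1, t2).ratio()

def fuzzy_jaccard(tokens1, tokens2, delta=0.8):
    """Greedy fuzzy-token Jaccard: two tokens may be paired when their similarity
    is at least delta, and each pair contributes its similarity to the
    'intersection' weight."""
    remaining = list(tokens2)
    inter = 0.0
    for t1 in tokens1:
        best_j, best_s = -1, 0.0
        for j, t2 in enumerate(remaining):
            s = token_sim(t1, t2)
            if s >= delta and s > best_s:
                best_j, best_s = j, s
        if best_j >= 0:
            inter += best_s
            remaining.pop(best_j)       # each token can be matched at most once
    union = len(tokens1) + len(tokens2) - inter
    return inter / union if union else 1.0
```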
【Keywords】: fuzzy set theory; string matching; very large databases; database community; fast-join; fuzzy token matching; signature-based method; similarity metrics; string similarity join; Cleaning; Collaboration; Filtering; Iron; Measurement; Transforms; Upper bound
【Paper Link】 【Pages】:470-481
【Authors】: Shaoxu Song ; Lei Chen ; Philip S. Yu
【Abstract】: To study data dependencies over heterogeneous data in dataspaces, we define a general dependency form, namely comparable dependencies (CDs), which specifies constraints on comparable attributes. It covers the semantics of a broad class of dependencies in databases, including functional dependencies (FDs), metric functional dependencies (MFDs), and matching dependencies (MDs). As we illustrate, comparable dependencies are useful in the practice of dataspaces, e.g., for semantic query optimization. Due to the heterogeneous data in dataspaces, the first question, known as the validation problem, is to determine whether a dependency (almost) holds in a data instance. Unfortunately, as we prove, the validation problem with a certain error or confidence guarantee is generally hard. In fact, the confidence validation problem is also NP-hard to approximate to within any constant factor. Nevertheless, we develop several approaches for efficient approximation computation, including greedy and randomized approaches with an approximation bound on the maximum number of violations that an object may introduce. Finally, through an extensive experimental evaluation on real data, we verify the superiority of our methods.
【Keywords】: computational complexity; data handling; NP-hard problem; comparable dependencies; confidence validation problem; data dependencies; dataspaces; greedy approach; matching dependencies; metric functional dependencies; randomized approach; Approximation methods; Image color analysis; Measurement uncertainty; Query processing; Semantics
【Paper Link】 【Pages】:482-493
【Authors】: Stephen Revilak ; Patrick E. O'Neil ; Elizabeth J. O'Neil
【Abstract】: Many popular database management systems provide snapshot isolation (SI) for concurrency control, either in addition to or in place of full serializability based on locking. Snapshot isolation was introduced in 1995, with noted anomalies that can lead to serializability violations. Full serializability was provided in 2008 and improved in 2009 by aborting transactions in dangerous structures, which had been shown in 2005 to be precursors to potential SI anomalies. This approach resulted in a runtime environment guaranteeing a serializable form of snapshot isolation (which we call SSI or ESSI) for arbitrary applications. But transactions in a dangerous structure frequently do not cause true anomalies, so, as the authors point out, their method is conservative: it can cause unnecessary aborts. In this paper, we present our PSSI algorithm, which detects cycles in a snapshot isolation dependency graph and aborts transactions to break the cycles. This algorithm provides a much more precise criterion for performing aborts. We have implemented our algorithm in an open source production database system (MySQL/InnoDB), and our performance study shows that PSSI throughput improves on ESSI, with significantly fewer aborts.
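At the core of PSSI is detecting cycles in the transaction dependency graph and aborting a transaction to break each cycle. Ignoring the MySQL/InnoDB integration and the incremental maintenance the paper performs, a generic DFS-based cycle check over an adjacency-list dependency graph could look like this:

```python
def find_cycle(dep_graph):
    """Return one cycle (list of transaction ids) in a dependency graph, or None.

    dep_graph: dict txn_id -> set of txn_ids it has a conflict dependency on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in dep_graph}
    stack = []

    def dfs(t):
        color[t] = GRAY
        stack.append(t)
        for nxt in dep_graph.get(t, ()):
            if color.get(nxt, WHITE) == GRAY:          # back edge => cycle found
                return stack[stack.index(nxt):] + [nxt]
            if color.get(nxt, WHITE) == WHITE:
                cycle = dfs(nxt)
                if cycle:
                    return cycle
        stack.pop()
        color[t] = BLACK
        return None

    for t in dep_graph:
        if color[t] == WHITE:
            cycle = dfs(t)
            if cycle:
                return cycle       # abort any transaction on this cycle to break it
    return None
```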
【Keywords】: SQL; concurrency control; database management systems; graph theory; InnoDB; MySQL; concurrency control; database management systems; open source production database system; precisely serializable snapshot isolation; serializability violations; snapshot isolation dependency graph; Concurrency control; History; Medical services; Phantoms; Runtime; Silicon; Testing
【Paper Link】 【Pages】:494-505
【Authors】: Balaji Palanisamy ; Ling Liu
【Abstract】: This paper presents MobiMix, a road network based mix-zone framework to protect the location privacy of mobile users traveling on road networks. In contrast to spatial cloaking based location privacy protection, the approach in MobiMix is to break the continuity of location exposure by using mix-zones, where no applications can trace user movement. This paper makes two original contributions. First, we provide a formal analysis of the vulnerabilities of directly applying theoretical rectangle mix-zones to road networks in terms of anonymization effectiveness and attack resilience. We argue that effective mix-zones should be constructed and placed by carefully taking into consideration multiple factors, such as the geometry of the zones, the statistical behavior of the user population, the spatial constraints on movement patterns of the users, and the temporal and spatial resolution of the location exposure. Second, we develop a suite of road network mix-zone construction methods that provide a higher level of attack resilience and yield a specified lower bound on the level of anonymity. We evaluate the MobiMix approach through extensive experiments conducted on traces produced by GTMobiSim on different scales of geographic maps. Our experiments show that MobiMix offers a high level of anonymity and attack resilience compared to existing mix-zone approaches.
【Keywords】: data privacy; mobile computing; mobile handsets; MobiMix; geographic map; location privacy protection; mobile user; road network; spatial constraint; Entropy; Mobile communication; Privacy; Roads; Size measurement; Timing; Trajectory
【Paper Link】 【Pages】:506-516
【Authors】: Keith B. Frikken ; Kyle Kauffman ; Aaron Steele
【Abstract】: Sensor networks have the potential to allow users to “query the physical world” by querying the sensor nodes for an aggregate result. However, a concern with such aggregation is that a few corrupt nodes in the network may manipulate the results that the querier sees. There has been a substantial amount of work on providing integrity for sensor network computations. However, to the best of our knowledge, this prior work suffers from one or both of the following limitations: i) the methods only work for a specific aggregation function, or ii) the methods do not consider adversaries whose goal is to prevent the base station and the querier from receiving the result. In this paper we present the first scheme that provides general aggregation for sensor networks in the presence of malicious adversaries. The generality of the scheme results from the ability to securely evaluate any algorithm in the streaming model of computation. The main idea of this paper is to convert the commonly used aggregation tree into an aggregation list, and a process for doing this conversion is also presented. This result is an interesting and important first step towards the realization of general secure querying of a sensor network.
【Keywords】: telecommunication security; wireless sensor networks; aggregation function; aggregation list; aggregation tree; malicious adversary; malicious nodes; sensor aggregation security; sensor networks; Aggregates; Base stations; Computational modeling; Monitoring; Protocols; Security; Vegetation; Aggregation; Integrity; Sensor Networks
【Paper Link】 【Pages】:517-528
【Authors】: Stavros Papadopoulos ; Aggelos Kiayias ; Dimitris Papadias
【Abstract】: In-network aggregation is a popular methodology adopted in wireless sensor networks, which reduces the energy expenditure in processing aggregate queries (such as SUM, MAX, etc.) over the sensor readings. Recently, research has focused on secure in-network aggregation, motivated (i) by the fact that the sensors are usually deployed in open and unsafe environments, and (ii) by new trends such as outsourcing, where the aggregation process is delegated to an untrustworthy service. This new paradigm necessitates the following key security properties: data confidentiality, integrity, authentication, and freshness. The majority of the existing work on the topic is either unsuitable for large-scale sensor networks, or provides only approximate answers for SUM queries (as well as their derivatives, e.g., COUNT, AVG, etc). Moreover, there is currently no approach offering both confidentiality and integrity at the same time. Towards this end, we propose a novel and efficient scheme called SIES. SIES is the first solution that supports Secure In-network processing of Exact SUM queries, satisfying all security properties. It achieves this goal through a combination of homomorphic encryption and secret sharing. Furthermore, SIES is lightweight (it relies on inexpensive hash operations and modular additions/multiplications), and features a very small bandwidth consumption (in the order of a few bytes). Consequently, SIES constitutes an ideal method for resource-constrained sensors.
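To give a flavor of why additively homomorphic ciphertexts make in-network SUM aggregation easy, the toy sketch below masks each reading with a one-time additive pad modulo M. It illustrates only the additive-homomorphism idea; it is not the SIES construction and provides none of its integrity, authentication or freshness guarantees.

```python
import random

M = 2 ** 32          # modulus large enough to hold any legitimate SUM

def encrypt(reading, pad):
    return (reading + pad) % M              # additively masked reading

def aggregate(ciphertexts):
    total = 0
    for c in ciphertexts:                   # in-network nodes only add ciphertexts
        total = (total + c) % M
    return total

def decrypt_sum(aggregate_ct, pads):
    # The sink, which shares the pads with the sensors, removes them to get the SUM.
    return (aggregate_ct - sum(pads)) % M

# Example: three sensors with readings 10, 20, 12.
readings = [10, 20, 12]
pads = [random.randrange(M) for _ in readings]
ct = aggregate(encrypt(r, p) for r, p in zip(readings, pads))
assert decrypt_sum(ct, pads) == sum(readings)
```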
【Keywords】: authorisation; computer network security; cryptography; data integrity; data privacy; query processing; wireless sensor networks; SIES; aggregate query processing; authentication; bandwidth consumption; data confidentiality; data integrity; energy expenditure; exact SUM queries; homomorphic encryption; large-scale sensor networks; secret sharing; secure in-network aggregation; security properties; untrustworthy service; wireless sensor networks; Aggregates; Data models; Encryption; Generators; Seals
【Paper Link】 【Pages】:529-540
【Authors】: Yeye He ; Siddharth Barman ; Jeffrey F. Naughton
【Abstract】: In comparison to the extensive body of existing work considering publish-once, static anonymization, dynamic anonymization is less well studied. Previous work, most notably m-invariance, has made considerable progress in devising a scheme that attempts to prevent individual records from being associated with too few sensitive values. We show, however, that in the presence of updates, even an m-invariant table can be exploited by a new type of attack we call the “equivalence-attack.” To deal with the equivalence attack, we propose a graph-based anonymization algorithm that leverages solutions to the classic “min-cut/max-flow” problem, and demonstrate with experiments that our algorithm is efficient and effective in preventing equivalence attacks.
【Keywords】: data privacy; publishing; table lookup; anonymized data; dynamic anonymization; equivalence attacks; graph-based anonymization algorithm; m-invariant table; static anonymization; Cancer; Diseases; Heuristic algorithms; Joining processes; Partitioning algorithms; Privacy; Publishing
【Paper Link】 【Pages】:541-552
【Authors】: Dingming Wu ; Man Lung Yiu ; Christian S. Jensen ; Gao Cong
【Abstract】: Web users and content are increasingly being geo-positioned. This development gives prominence to spatial keyword queries, which involve both the locations and textual descriptions of content. We study the efficient processing of continuously moving top-k spatial keyword (MkSK) queries over spatial keyword data. State-of-the-art solutions for moving queries employ safe zones that guarantee the validity of reported results as long as the user remains within a zone. However, existing safe zone methods focus solely on spatial locations and ignore text relevancy. We propose two algorithms for computing safe zones that guarantee correct results at any time and that aim to optimize the computation on the server as well as the communication between the server and the client. We exploit tight and conservative approximations of safe zones and aggressive computational space pruning. Empirical studies with real data suggest that our proposals are efficient.
【Keywords】: Internet; client-server systems; query processing; Web content; Web users; aggressive computational space pruning; approximations; client-server communication; continuously moving top- k spatial keyword queries; text relevancy; Equations; Euclidean distance; Indexes; Mobile communication; Proposals; Servers; Spatial databases
【Paper Link】 【Pages】:553-564
【Authors】: Alex X. Liu ; Ke Shen ; Eric Torng
【Abstract】: Hamming distance has been widely used in many application domains, such as near-duplicate detection and pattern recognition. We study Hamming distance range query problems, where the goal is to find all strings in a database that are within a Hamming distance bound k from a query string. If k is fixed, we have a static Hamming distance range query problem. If k is part of the input, we have a dynamic Hamming distance range query problem. For the static problem, the prior art uses a large amount of memory due to its aggressive replication of the database. For the dynamic range query problem, as far as we know, there is no space and time efficient solution for arbitrary databases. In this paper, we first propose a static Hamming distance range query algorithm called HEngines, which addresses the space issue in prior art by dynamically expanding the query on the fly. We then propose a dynamic Hamming distance range query algorithm called HEngined, which addresses the limitation in prior art using a divide-and-conquer strategy. We implemented our algorithms and conducted side-by-side comparisons on large real-world and synthetic datasets. In our experiments, HEngines uses 4.65 times less space and processes queries 16% faster than the prior art, and HEngined processes queries 46 times faster than linear scan while using only 1.7 times more space.
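A standard intuition behind such filter-and-verify designs is the pigeonhole principle: if two equal-length strings are within Hamming distance k, then after splitting both into k+1 segments at least one segment matches exactly. A minimal sketch along those lines (fixed-length strings, k fixed at index-build time; not the actual HEngine data structures) is shown below.

```python
def hamming(a, b):
    return sum(ch1 != ch2 for ch1, ch2 in zip(a, b))

def split(s, parts):
    n = len(s)
    bounds = [round(i * n / parts) for i in range(parts + 1)]
    return [(bounds[i], s[bounds[i]:bounds[i + 1]]) for i in range(parts)]

def build_index(db, k):
    """Index every database string under each of its k+1 segments."""
    index = {}
    for s in db:
        for offset, seg in split(s, k + 1):
            index.setdefault((offset, seg), []).append(s)
    return index

def range_query(index, q, k):
    """Return all indexed strings within Hamming distance k of q."""
    candidates = set()
    for offset, seg in split(q, k + 1):        # pigeonhole: one segment must match
        candidates.update(index.get((offset, seg), ()))
    return [s for s in candidates if hamming(s, q) <= k]   # verification step
```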
【Keywords】: divide and conquer methods; query processing; HEngined; HEngines; divide-and-conquer strategy; dynamic Hamming distance range query problem; large scale Hamming distance query processing; static Hamming distance range query problem; Approximation algorithms; Art; Hamming distance; Heuristic algorithms; High definition video; Query processing
【Paper Link】 【Pages】:565-576
【Authors】: Man Lung Yiu ; Eric Lo ; Duncan Yung
【Abstract】: A moving kNN query continuously reports the k nearest neighbors of a moving query point. In addition to the query result, a service provider that evaluates moving queries often returns mobile clients a safe region that bounds the validity of query results to minimize the communication cost between the two parties. However, when a service provider is not trustworthy, it may send inaccurate query results or incorrect safe regions to clients. In this paper, we present a framework and algorithms to authenticate results and safe regions of moving kNN queries. Extensive experiments on both real and synthetic datasets show that our methods are efficient in terms of both computation time and communication costs.
【Keywords】: mobile computing; query processing; security of data; k nearest neighbor; mobile clients; moving kNN query; query authentication; service provider; Authentication; Generators; Mobile communication; Nearest neighbor searches; Query processing; Servers; Spatial databases
【Paper Link】 【Pages】:577-588
【Authors】: Muhammad Aamir Cheema ; Xuemin Lin ; Wenjie Zhang ; Ying Zhang
【Abstract】: Given a set of objects and a query q, a point p is called the reverse k nearest neighbor (RkNN) of q if q is one of the k closest objects of p. In this paper, we introduce the concept of the influence zone, which is the area such that every point inside it is an RkNN of q and every point outside it is not. The influence zone has several applications in location based services, marketing and decision support systems. It can also be used to efficiently process RkNN queries. First, we present an efficient algorithm to compute the influence zone. Then, based on the influence zone, we present efficient algorithms to process RkNN queries that significantly outperform the existing best known techniques for both snapshot and continuous RkNN queries. We also present a detailed theoretical analysis of the area of the influence zone and the IO costs of our RkNN processing algorithms. Our experiments demonstrate the accuracy of our theoretical analysis.
【Keywords】: learning (artificial intelligence); pattern recognition; query processing; RkNN queries; decision support system; location based services; marketing; reverse k nearest neighbor; Accuracy; Algorithm design and analysis; Decision support systems; Driver circuits; Monitoring; Nearest neighbor searches; Recurrent neural networks
【Paper Link】 【Pages】:589-600
【Authors】: Jorge-Arnulfo Quiané-Ruiz ; Christoph Pinkel ; Jörg Schad ; Jens Dittrich
【Abstract】: MapReduce is a computing paradigm that has gained a lot of popularity as it allows non-expert users to easily run complex analytical tasks at very large-scale. At such scale, task and node failures are no longer an exception but rather a characteristic of large-scale systems. This makes fault-tolerance a critical issue for the efficient operation of any application. MapReduce automatically reschedules failed tasks to available nodes, which in turn recompute such tasks from scratch. However, this policy can significantly decrease performance of applications. In this paper, we propose a family of Recovery Algorithms for Fast-Tracking (RAFT) MapReduce. As ease-of-use is a major feature of MapReduce, RAFT focuses on simplicity and also non-intrusiveness, in order to be implementation-independent. To efficiently recover from task failures, RAFT exploits the fact that MapReduce produces and persists intermediate results at several points in time. RAFT piggy-backs checkpoints on the task progress computation. To deal with multiple node failures, we propose query metadata checkpointing. We keep track of the mapping between input key-value pairs and intermediate data for all reduce tasks. Thereby, RAFT does not need to re-execute completed map tasks entirely. Instead RAFT only recomputes intermediate data that were processed for local reduce tasks and hence not shipped to another node for processing. We also introduce a scheduling strategy taking full advantage of these recovery algorithms. We implemented RAFT on top of Hadoop and evaluated it on a 45-node cluster using three common analytical tasks. Overall, our experimental results demonstrate that RAFT outperforms Hadoop runtimes by 23% on average under task and node failures. The results also show that RAFT has negligible runtime overhead.
【Keywords】: checkpointing; meta data; scheduling; software fault tolerance; Hadoop; RAFTing MapReduce; fault-tolerance; input key-value pairs; multiple node failures; query metadata checkpointing; recovery algorithms for fast-tracking; scheduling strategy; task progress computation; Checkpointing; Delay; Fault tolerance; Fault tolerant systems; File systems; Resumes; Runtime
【Paper Link】 【Pages】:601-612
【Authors】: Haibo Hu ; Jianliang Xu ; Chushi Ren ; Byron Choi
【Abstract】: Query processing that preserves both the data privacy of the owner and the query privacy of the client is a new research problem. It shows increasing importance as cloud computing drives more businesses to outsource their data and querying services. However, most existing studies, including those on data outsourcing, address the data privacy and query privacy separately and cannot be applied to this problem. In this paper, we propose a holistic and efficient solution that comprises a secure traversal framework and an encryption scheme based on privacy homomorphism. The framework is scalable to large datasets by leveraging an index-based approach. Based on this framework, we devise secure protocols for processing typical queries such as k-nearest-neighbor queries (kNN) on R-tree index. Moreover, several optimization techniques are presented to improve the efficiency of the query processing protocols. Our solution is verified by both theoretical analysis and performance study.
【Keywords】: cloud computing; cryptography; data privacy; query processing; R-tree index; cloud computing; data privacy; encryption scheme; index-based approach; k-nearest-neighbor queries; optimization techniques; privacy homomorphism; private query processing; untrusted data cloud; Data privacy; Encryption; Indexes; Privacy; Protocols; Query processing
【Paper Link】 【Pages】:613-624
【Authors】: Kamal Zellag ; Bettina Kemme
【Abstract】: While online transaction processing applications heavily rely on the transactional properties provided by the underlying infrastructure, they often choose to not use the highest isolation level, i.e., serializability, because of the potential performance implications of costly strict two-phase locking concurrency control. Instead, modern transaction systems, consisting of an application server tier and a database tier, offer several levels of isolation providing a trade-off between performance and consistency. While it is fairly well known how to identify the anomalies that are possible under a certain level of isolation, it is much more difficult to quantify the amount of anomalies that occur during run-time of a given application. In this paper, we address this issue and present a new approach to detect, in real time, consistency anomalies for arbitrary multi-tier applications. As the application is running, our tool detects anomalies online, indicating exactly the transactions and data items involved. Furthermore, we classify the detected anomalies into patterns showing the business methods involved as well as their occurrence frequency. We use the RUBiS benchmark to show how the introduction of a new transaction type can have a dramatic effect on the number of anomalies for certain isolation levels, and how our tool can quickly detect such problem transactions. Therefore, our system can help designers to either choose an isolation level where the anomalies do not occur or to change the transaction design to avoid the anomalies.
【Keywords】: business process re-engineering; concurrency control; data mining; pattern classification; transaction processing; RUBiS benchmark; business methods; consistency anomalies; data items; database tier; isolation level; multitier architectures; online transaction processing; pattern classification; real-time quantification; serializability; server tier; transactional properties; two-phase locking concurrency control; Business; Concurrency control; Database systems; Schedules; Servers; Silicon
【Paper Link】 【Pages】:625-636
【Authors】: Mihaela A. Bornea ; Orion Hodson ; Sameh Elnikety ; Alan Fekete
【Abstract】: This paper presents a method that allows a replicated database system to provide a global isolation level stronger than the isolation level provided on each individual database replica. We propose a new multi-version concurrency control algorithm called, serializable generalized snapshot isolation (SGSI), that targets middleware replicated database systems. Each replica runs snapshot isolation locally and the replication middleware guarantees global one-copy serializability. We introduce novel techniques to provide a stronger global isolation level, namely readset extraction and enhanced certification that prevents read-write and write-write conflicts in a replicated setting. We prove the correctness of the proposed algorithm, and build a prototype replicated database system to evaluate SGSI performance experimentally. Extensive experiments with an 8 replica database system under the TPC-W workload mixes demonstrate the practicality and low overhead of the algorithm.
【Keywords】: certification; concurrency control; middleware; queueing theory; SGSI performance; database system; enhanced certification; middleware; multiversion concurrency; serializable generalized snapshot isolation; server system; Concurrency control; Database systems; Engines; History; Middleware; Silicon
【Paper Link】 【Pages】:637-648
【Authors】: Fajar Ardian ; Sourav S. Bhowmick
【Abstract】: In many real-world applications, it is important to create a local archive containing versions of structured results of continuous queries (queries that are evaluated periodically) submitted to autonomous database-driven Web sites (e.g., the deep Web). Such a history of digital information is a potential gold mine for all kinds of scientific, media and business analysts. An important task in this context is to maintain the set of common keys of the underlying archived results, as they play a pivotal role in data modeling and analysis, query processing, and entity tracking. A set of attributes in structured data is a common key iff it is a key for all versions of the data in the archive. Due to the data-driven nature of key discovery from the archive, unlike traditional keys, the common keys are not temporally invariant. That is, keys identified in one version may be different from those in another version. Hence, in this paper, we propose a novel technique to maintain common keys in an archive containing a sequence of versions of evolutionary continuous query results. Given the current common key set of existing versions and a new snapshot, we propose an algorithm called COKE (COmmon KEy maintenancE) which incrementally maintains the common key set without undertaking expensive minimal key computation from the new snapshot. Furthermore, it exploits certain interesting evolutionary features of real-world data to further reduce the computation cost. Our exhaustive empirical study demonstrates that COKE has excellent performance and is orders of magnitude faster than a baseline approach for the maintenance of common keys.
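The basic test that such maintenance repeatedly needs is whether a candidate key still holds on a new snapshot, i.e., whether the projection on its attributes is duplicate-free. The baseline step below re-checks each current common key against the new snapshot; COKE's contribution is to avoid doing even this from scratch by exploiting evolutionary features of the data, which are not modeled here.

```python
def is_key(tuples, attrs):
    """True iff the projection of `tuples` (list of dicts) on `attrs` has no duplicates."""
    seen = set()
    for t in tuples:
        proj = tuple(t.get(a) for a in attrs)
        if proj in seen:
            return False
        seen.add(proj)
    return True

def update_common_keys(common_keys, new_snapshot):
    """Baseline maintenance step: keep only the keys that still hold in the new snapshot.

    (A key that fails may need to be replaced by supersets that still hold; COKE
    handles that without full minimal-key computation, which is omitted here.)
    """
    return [attrs for attrs in common_keys if is_key(new_snapshot, attrs)]
```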
【Keywords】: Web sites; query processing; COKE algorithm; autonomous database-driven Web sites; common key maintenance; continuous query results; data analysis; data modeling; deep Websites; entity tracking; local archive; query processing; Accuracy; Maintenance engineering; Monitoring; Partitioning algorithms; Query processing; Web sites
【Paper Link】 【Pages】:649-660
【Authors】: Arash Termehchy ; Marianne Winslett ; Yodsawalai Chodpathumwan
【Abstract】: Real-world databases often have extremely complex schemas. With thousands of entity types and relationships, each with a hundred or so attributes, it is extremely difficult for new users to explore the data and formulate queries. Schema free query interfaces (SFQIs) address this problem by allowing users with no knowledge of the schema to submit queries. We postulate that SFQIs should deliver the same answers when given alternative but equivalent schemas for the same underlying information. In this paper, we introduce and formally define design independence, which captures this property for SFQIs. We establish a theoretical framework to measure the amount of design independence provided by an SFQI. We show that most current SFQIs provide a very limited degree of design independence. We also show that SFQIs based on the statistical properties of data can provide design independence when the changes in the schema do not introduce or remove redundancy in the data. We propose a novel XML SFQI called Duplication Aware Coherency Ranking (DA-CR) based on information-theoretic relationships among the data items in the database, and prove that DA-CR is design independent. Our extensive empirical study using three real-world data sets shows that the average case design independence of current SFQIs is considerably lower than that of DA-CR. We also show that the ranking quality of DA-CR is better than or equal to that of current SFQI methods.
【Keywords】: database management systems; query processing; DA-CR; XML SFQI; design independence; duplication aware coherency ranking; information-theoretic relationships; query formulation; real-world databases; schema free query interfaces; schema independent; Algorithm design and analysis; Data models; Database languages; Databases; Design methodology; Redundancy; XML
【Paper Link】 【Pages】:661-672
【Authors】: Yifei Lu ; Wei Wang ; Jianxin Li ; Chengfei Liu
【Abstract】: An important facility to aid keyword search on XML data is suggesting alternative queries when user queries contain typographical errors. Query suggestion thus can improve users' search experience by avoiding returning empty result or results of poor qualities. In this paper, we study the problem of effectively and efficiently providing quality query suggestions for keyword queries on an XML document. We illustrate certain biases in previous work and propose a principled and general framework, XClean, based on the state-of-the-art language model. Compared with previous methods, XClean can accommodate different error models and XML keyword query semantics without losing rigor. Algorithms have been developed that compute the top-k suggestions efficiently. We performed an extensive experiment study using two large-scale real datasets. The experiment results demonstrate the effectiveness and efficiency of the proposed methods.
【Keywords】: XML; query processing; XClean; XML document; XML keyword query semantics; keyword search; quality query suggestions; top-k suggestions; valid spelling suggestion; Algorithm design and analysis; Cleaning; Databases; Insurance; Probabilistic logic; Vocabulary; XML
【Paper Link】 【Pages】:673-684
【Authors】: Jianxin Li ; Chengfei Liu ; Rui Zhou ; Wei Wang
【Abstract】: Despite the proliferation of work on XML keyword query, it remains open to support keyword query over probabilistic XML data. Compared with traditional keyword search, it is far more expensive to answer a keyword query over probabilistic XML data due to the consideration of possible world semantics. In this paper, we first define the new problem of top-k keyword search over probabilistic XML data, which is to retrieve the k SLCA results with the k highest probabilities of existence. We then propose two efficient algorithms. The first algorithm, PrStack, can find the k SLCA results with the k highest probabilities by scanning the relevant keyword nodes only once. To further improve efficiency, we propose a second algorithm, EagerTopK, based on a set of pruning properties which can quickly prune unsatisfied SLCA candidates. Finally, we implement the two algorithms and compare their performance through an analysis of extensive experimental results.
【Keywords】: XML; probability; query processing; EagerTopK algorithm; PrStack algorithm; XML keyword query; k SLCA results; probabilistic XML data; top-k keyword search; Encoding; Equations; Keyword search; Mathematical model; Probabilistic logic; Semantics; XML
【Paper Link】 【Pages】:685-696
【Authors】: Daisy Zhe Wang ; Long Wei ; Yunyao Li ; Frederick Reiss ; Shivakumar Vaithyanathan
【Abstract】: Recently, there has been increasing interest in extending relational query processing to efficiently support extraction operators, such as dictionaries and regular expressions, over text data. Many text processing queries are sophisticated in that they involve multiple extraction and join operators, resulting in many possible query plans. However, there has been little research on selectivity or cost estimation for these extraction operators, which is crucial for an optimizer to pick a good query plan. In this paper, we define the problem of selectivity estimation for dictionaries and regular expressions, and propose to develop document synopses over a text corpus, from which the selectivity can be estimated. We first adapt the language models in the Natural Language Processing literature to form the top-k n-gram synopsis as the baseline document synopsis. Then we develop two classes of novel document synopses: the stratified bloom filter synopsis and the roll-up synopsis. We also develop techniques to decompose a complicated regular expression into subparts to achieve more effective and accurate estimation. We conduct experiments over the Enron email corpus using both real-world and synthetic workloads to compare the accuracy of the selectivity estimation over different classes and variations of synopses. The results show that the top-k stratified bloom filter synopsis and the roll-up synopsis are the most accurate for dictionary and regular expression selectivity estimation, respectively.
【Keywords】: data structures; dictionaries; natural language processing; query processing; text analysis; Enron email corpus; database management; dictionaries; document synopses; extraction operators; join operators; natural language processing; regular expressions; relational query processing; roll-up synopsis; selectivity estimation; text data; text processing queries; top-k stratified bloom filter synopsis; Accuracy; Adaptation model; Arrays; Blogs; Data mining; Dictionaries; Estimation
【Paper Link】 【Pages】:697-708
【Authors】: Tingjian Ge
【Abstract】: Uncertain data is quite common nowadays in a variety of modern database applications. At the same time, the join operation is one of the most important but expensive operations in SQL. However, join queries on uncertain data have not been adequately addressed thus far. In this paper, we study the SQL join operation on uncertain attributes. We observe and formalize two kinds of join operations on such data, namely v-join and d-join. They are each useful for different applications. Using probability theory, we then devise efficient query processing algorithms for these join operations. Specifically, we use probability bounds that are based on the moments of random variables to either early accept or early reject a candidate v-join result tuple. We also devise an indexing mechanism and an algorithm called Two-End Zigzag Join to further save I/O costs. For d-join, we first observe that it can be reduced to a special form of similarity join in a multidimensional space. We then design an efficient algorithm called condensed d-join and an optimal condensation scheme based on dynamic programming. Finally, we perform a comprehensive empirical study using both real datasets and synthetic datasets.
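The abstract's early-accept/early-reject idea relies on probability bounds computed from moments of the uncertain attributes. As one concrete (but not necessarily the paper's) instance, a Chebyshev-style bound on the difference of two independent uncertain values can settle a candidate v-join pair without computing the exact probability:

```python
def vjoin_filter(mu_x, var_x, mu_y, var_y, eps, tau):
    """Early-accept / early-reject a candidate pair for the predicate
    P(|X - Y| <= eps) >= tau, using only means and variances and assuming
    X and Y are independent.

    Returns 'accept', 'reject', or 'unknown' (exact computation still needed).
    """
    mu = abs(mu_x - mu_y)          # |E[X - Y]|
    var = var_x + var_y            # Var(X - Y) under independence
    if mu > eps:
        # {|X - Y| <= eps} implies a deviation of at least mu - eps from the mean,
        # so Chebyshev gives P(|X - Y| <= eps) <= var / (mu - eps)^2.
        if var / (mu - eps) ** 2 < tau:
            return 'reject'
    elif mu < eps:
        # A deviation smaller than eps - mu guarantees |X - Y| <= eps,
        # so P(|X - Y| <= eps) >= 1 - var / (eps - mu)^2.
        if 1.0 - var / (eps - mu) ** 2 >= tau:
            return 'accept'
    return 'unknown'
```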
【Keywords】: SQL; data handling; database management systems; dynamic programming; probability; query processing; SQL join operation; d-join data; dynamic programming; indexing mechanism; join queries; optimal condensation scheme; probability theory; query processing algorithms; two-end zigzag; uncertain data handling; v-join data; Algorithm design and analysis; Indexing; Probabilistic logic; Random variables; Reactive power; Uncertainty
【Paper Link】 【Pages】:709-720
【Authors】: Kaushik Chakrabarti ; Surajit Chaudhuri ; Venkatesh Ganti
【Abstract】: Optimizing execution of top-k queries over record-id-ordered, compressed lists is challenging. The threshold family of algorithms cannot be effectively used in such cases. Yet, improving execution of such queries is of great value. For example, top-k keyword search in information retrieval (IR) engines represents an important scenario where such optimization can be directly beneficial. In this paper, we develop novel algorithms to improve execution of such queries over state-of-the-art techniques. Our main insights are pruning based on fine-granularity bounds and traversing the lists based on judiciously chosen “intervals” rather than individual records. We formally study the optimality characteristics of the proposed algorithms. Our algorithms require minimal changes and can be easily integrated into IR engines. Our experiments on real-life datasets show that our algorithms outperform state-of-the-art techniques by a factor of 3-6 in terms of query execution times.
【Keywords】: query processing; compressed lists; information retrieval engines; interval-based pruning; query execution times; top-k keyword search; top-k processing; top-k queries; Engines; Indexes; Keyword search; Organizations; Partitioning algorithms; Query processing; Upper bound
【Paper Link】 【Pages】:721-732
【Authors】: Xuemin Lin ; Ying Zhang ; Wenjie Zhang ; Muhammad Aamir Cheema
【Abstract】: In many applications involving multiple-criteria optimal decision making, users may often want to make a personal trade-off among all optimal solutions. As a key feature, the skyline in a multi-dimensional space provides the minimum set of candidates for such purposes by removing all points not preferred by any (monotonic) utility/scoring functions; that is, the skyline removes all objects not preferred by any user no matter how their preferences vary. Driven by many applications with uncertain data, the probabilistic skyline model is proposed to retrieve uncertain objects based on skyline probabilities. Nevertheless, skyline probabilities cannot capture the preferences of monotonic utility functions. Motivated by this, in this paper we propose a novel skyline operator, namely stochastic skyline. In the light of the expected utility principle, stochastic skyline guarantees to provide the minimum set of candidates for the optimal solutions over all possible monotonic multiplicative utility functions. In contrast to the conventional skyline or the probabilistic skyline computation, we show that the problem of stochastic skyline is NP-complete with respect to the dimensionality. Novel and efficient algorithms are developed to compute the stochastic skyline over multi-dimensional uncertain data; they run in polynomial time if the dimensionality is fixed. We also show, by theoretical analysis and experiments, that the size of stochastic skyline is quite similar to that of conventional skyline over certain data. Comprehensive experiments demonstrate that our techniques are efficient and scalable regarding both CPU and IO costs.
【Keywords】: computational complexity; decision making; NP-complete; decision making; polynomial time; probabilistic skyline model; stochastic skyline operator; uncertain data; Data models; Games; Polynomials; Probabilistic logic; Probability; Stochastic processes; Testing
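For reference, the conventional skyline that the stochastic skyline generalizes can be computed with a simple dominance filter; the block-nested-loop sketch below illustrates the dominance definition only and is not the paper's algorithm for uncertain data.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every dimension and strictly better
    in at least one (smaller values preferred here)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """Block-nested-loop skyline: keep the points not dominated by any other."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

hotels = [(50, 2.0), (80, 0.5), (60, 1.0), (90, 3.0)]   # (price, distance)
print(skyline(hotels))   # [(50, 2.0), (80, 0.5), (60, 1.0)]
```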
【Paper Link】 【Pages】:733-744
【Authors】: Laure Berti-Equille ; Tamraparni Dasu ; Divesh Srivastava
【Abstract】: Quantitative Data Cleaning (QDC) is the use of statistical and other analytical techniques to detect, quantify, and correct data quality problems (or glitches). Current QDC approaches focus on addressing each category of data glitch individually. However, in real-world data, different types of data glitches co-occur in complex patterns. These patterns and interactions between glitches offer valuable clues for developing effective domain-specific quantitative cleaning strategies. In this paper, we address the shortcomings of the extant QDC methods by proposing a novel framework, the DEC (Detect-Explore-Clean) framework. It is a comprehensive approach for the definition, detection and cleaning of complex, multi-type data glitches. We exploit the distributions and interactions of different types of glitches to develop data-driven cleaning strategies that may offer significant advantages over blind strategies. The DEC framework is a statistically rigorous methodology for evaluating and scoring glitches and selecting the quantitative cleaning strategies that result in cleaned data sets that are statistically proximal to user specifications. We demonstrate the efficacy and scalability of the DEC framework on very large real-world and synthetic data sets.
【Keywords】: data handling; statistical analysis; DEC framework; QDC methods; analytical techniques; complex glitch pattern discovery; data quality problems; data sets; detect-explore-clean framework; quantitative data cleaning; statistical techniques; user specifications; Aggregates; Cleaning; Data mining; Data structures; Joining processes; Joints; Scalability
【Paper Link】 【Pages】:745-756
【Authors】: Guimei Liu ; Mengling Feng ; Yue Wang ; Limsoon Wong ; See-Kiong Ng ; Tzia Liang Mah ; Edmund Jon Deoon Lee
【Abstract】: Hypothesis testing is a well-established tool for scientific discovery. Conventional hypothesis testing is carried out in a hypothesis-driven manner. A scientist must first formulate a hypothesis based on his/her knowledge and experience, and then devise a variety of experiments to test it. Given the rapid growth of data, it has become virtually impossible for a person to manually inspect all the data to find all the interesting hypotheses for testing. In this paper, we propose and develop a data-driven system for automatic hypothesis testing and analysis. We define a hypothesis as a comparison between two or more sub-populations. We find sub-populations for comparison using frequent pattern mining techniques and then pair them up for statistical testing. We also generate additional information for further analysis of the hypotheses that are deemed significant. We conducted a set of experiments to show the efficiency of the proposed algorithms, and the usefulness of the generated hypotheses. The results show that our system can help users (1) identify significant hypotheses; (2) isolate the reasons behind significant hypotheses; and (3) find confounding factors that form Simpson's Paradoxes with discovered significant hypotheses.
【Keywords】: data mining; statistical testing; Simpson paradox; data-driven system; exploratory hypothesis analysis; exploratory hypothesis testing; frequent pattern mining techniques; scientific discovery; statistical testing; Data mining; Error analysis; Load modeling; Medical services; Probability; Space exploration
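The paper pairs sub-populations found by frequent pattern mining and tests each pair statistically; the sketch below shows one plausible instantiation of such a test, a two-proportion z-test on an outcome rate in two sub-populations, which is an illustrative choice rather than necessarily the test the system uses.

```python
from math import sqrt, erf

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Compare an outcome rate between two sub-populations A and B.
    Returns (z statistic, two-sided p-value) under the usual normal
    approximation with a pooled proportion."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # 2 * P(Z > |z|)
    return z, p_value

# e.g. recovery rate of patients in one sub-population vs. another
z, p = two_proportion_z_test(success_a=180, n_a=400, success_b=120, n_b=350)
print(f"z = {z:.2f}, p = {p:.4f}")   # a small p-value marks a significant hypothesis
```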
【Paper Link】 【Pages】:757-768
【Authors】: Hetal Thakkar ; Nikolay Laptev ; Hamid Mousavi ; Barzan Mozafari ; Vincenzo Russo ; Carlo Zaniolo
【Abstract】: The problem of supporting data mining applications proved to be difficult for database management systems and it is now proving to be very challenging for data stream management systems (DSMSs), where the limitations of SQL are made even more severe by the requirements of continuous queries. The major technical advances achieved separately on DSMSs and on data stream mining algorithms have failed to converge and produce powerful data stream mining systems. Such systems, however, are essential since the traditional pull-based approach of cache mining is no longer applicable, and the push-based computing mode of data streams and their bursty traffic complicate application development. For instance, to write mining applications with quality of service (QoS) levels approaching those of DSMSs, a mining analyst would have to contend with many arduous tasks, such as support for data buffering, complex storage and retrieval methods, scheduling, fault-tolerance, synopsis-management, load shedding, and query optimization. Our Stream Mill Miner (SMM) system solves these problems by providing a data stream mining workbench that combines the ease of specifying high-level mining tasks, as in Weka, with the performance and QoS guarantees of a DSMS. This is accomplished in three main steps. The first is an open and extensible DSMS architecture where KDD queries can be easily expressed as user-defined aggregates (UDAs); our system combines that with the efficiency of synoptic data structures and mining-aware load shedding and optimizations. The second key component of SMM is its integrated library of fast mining algorithms that are light enough to be effective on data streams. The third advanced feature of SMM is a Mining Model Definition Language (MMDL) that allows users to define the flow of mining tasks, integrated with a simple box&arrow GUI, to shield the mining analyst from the complexities of lower-level queries. SMM is the first DSMS capable of online mining, and this paper describes its architecture, design, and performance on mining queries.
【Keywords】: SQL; data mining; database management systems; query languages; query processing; DSMS architecture; KDD query; SMM; SQL; cache mining; data buffering; data mining application; data stream management system; database management system; fault tolerance; integrated library; knowledge discovery; mining aware load shedding; mining model definition language; pull based approach; push based computing mode; quality of service level; query optimization; retrieval method; stream mill miner system; synopsis management; synoptic data structure; user defined aggregate; Aggregates; Data mining; Graphical user interfaces; Humidity; Libraries; Quality of service; Training
【Paper Link】 【Pages】:769-779
【Authors】: Brian Quanz ; Jun Huan ; Meenakshi Mishra
【Abstract】: Effectively utilizing readily available auxiliary data to improve predictive performance on new modeling tasks is a key problem in data mining. In this research the goal is to transfer knowledge between sources of data, particularly when ground truth information for the new modeling task is scarce or expensive to collect, so that leveraging any auxiliary sources of data becomes a necessity. Towards seamless knowledge transfer among tasks, effective representation of the data is a critical yet not fully explored research area for the data engineer and data miner. Here we present a technique based on the idea of sparse coding, which essentially attempts to find an embedding for the data by assigning feature values based on subspace cluster membership. We modify the idea of sparse coding by focusing on the identification of shared clusters between datasets when source and target data may have different distributions. In our paper, we point out cases where a direct application of sparse coding will lead to a failure of knowledge transfer. We then present the details of our extension to sparse coding, by incorporating distribution distance estimates for the embedded data, and show that the proposed algorithm can overcome the shortcomings of the sparse coding algorithm on synthetic data and achieve improved predictive performance on a real world chemical toxicity transfer learning task.
【Keywords】: data mining; data structures; encoding; feature extraction; knowledge management; learning (artificial intelligence); pattern clustering; chemical toxicity transfer learning task; data cluster; data engineer; data mining; data representation; data source; embedded data; feature extraction; knowledge transfer; low quality data; sparse coding; subspace cluster membership; synthetic data; Encoding; Equations; Estimation; Feature extraction; Kernel; Knowledge transfer; Optimization
【Paper Link】 【Pages】:780-791
【Authors】: Véronique Benzaken ; Jean-Daniel Fekete ; Pierre-Luc Hemery ; Wael Khemiri ; Ioana Manolescu
【Abstract】: Visual analytics aims at combining interactive data visualization with data analysis tasks. Given the explosion in volume and complexity of scientific data, e.g., associated with biological or physical processes or social networks, visual analytics is called to play an important role in scientific data management. Most visual analytics platforms, however, are memory-based, and are therefore limited in the volume of data handled. Moreover, each new algorithm (e.g., for clustering) must be integrated into the platform by hand. Finally, they lack the capability to define and deploy well-structured processes where users with different roles interact in a coordinated way sharing the same data and possibly the same visualizations. We have designed and implemented EdiFlow, a workflow platform for visual analytics applications. EdiFlow uses a simple structured process model, and is backed by a persistent database, storing both process information and process instance data. EdiFlow processes provide the usual process features (roles, structured control) and may integrate visual analytics tasks as activities. We present its architecture, deployment on a sample application, and main technical challenges involved.
【Keywords】: data analysis; workflow management software; data analysis; data intensive interactive workflow; data visualization; persistent database; scientific data management; visual analytic; Data models; Data visualization; Databases; Electronic publishing; Encyclopedias; Visual analytics
【Paper Link】 【Pages】:792-803
【Authors】: Nirmesh Malviya ; Samuel Madden ; Arnab Bhattacharya
【Abstract】: In this paper, we address the problem of answering continuous route planning queries over a road network, in the presence of updates to the delay (cost) estimates of links. A simple approach to this problem would be to recompute the best path for all queries on arrival of every delay update. However, such a naive approach scales poorly when there are many users who have requested routes in the system. Instead, we propose two new classes of approximate techniques, K-paths and proximity measures, to substantially speed up processing of the set of designated routes specified by continuous route planning queries in the face of incoming traffic delay updates. Our techniques work through a combination of pre-computation of likely good paths and by avoiding complete recalculations on every delay update, instead only sending the user new routes when delays change significantly. Based on an experimental evaluation with 7,000 drives from real taxi cabs, we found that the routes delivered by our techniques are within 5% of the best shortest path and have run times an order of magnitude or less compared to a naive approach.
【Keywords】: question answering (information retrieval); road traffic; traffic engineering computing; K-path technique; continuous route planning query answering system; dynamic route planning; proximity measures; road network; traffic delay updates; Delay; Heuristic algorithms; Monitoring; Planning; Real time systems; Roads; Routing
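A simplified sketch of the "only send a new route when delays change significantly" idea: given a set of precomputed candidate paths (standing in for the paper's K-paths, whose construction and proximity measures are not reproduced), a delay update triggers only a re-scoring of the candidates, and the user's route changes only when the current one has degraded beyond a tolerance. The 5% tolerance is an arbitrary illustrative value.

```python
def pick_route(k_paths, edge_delay, current, tolerance=0.05):
    """k_paths: precomputed candidate routes (lists of nodes); edge_delay:
    {(u, v): current delay}. On a delay update, re-score only the candidates
    and switch routes only when the improvement is significant."""
    def cost(path):
        return sum(edge_delay[(u, v)] for u, v in zip(path, path[1:]))
    best = min(k_paths, key=cost)
    if cost(current) > (1 + tolerance) * cost(best):
        return best          # significant degradation: notify the user
    return current           # keep the old route, avoid churn

k_paths = [["a", "b", "d"], ["a", "c", "d"], ["a", "c", "b", "d"]]
delays = {("a", "b"): 4, ("b", "d"): 5, ("a", "c"): 2, ("c", "d"): 8, ("c", "b"): 1}
route = pick_route(k_paths, delays, current=["a", "c", "b", "d"])   # cost 8: keep it
delays[("b", "d")] = 20                                             # a delay update arrives
print(pick_route(k_paths, delays, current=route))                   # ['a', 'c', 'd']
```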
【Paper Link】 【Pages】:804-815
【Authors】: Xiaokui Xiao ; Bin Yao ; Feifei Li
【Abstract】: Optimal location (OL) queries are a type of spatial queries particularly useful for the strategic planning of resources. Given a set of existing facilities and a set of clients, an OL query asks for a location to build a new facility that optimizes a certain cost metric (defined based on the distances between the clients and the facilities). Several techniques have been proposed to address OL queries, assuming that all clients and facilities reside in an Lp space. In practice, however, movements between spatial locations are usually confined by the underlying road network, and hence, the actual distance between two locations can differ significantly from their Lp distance. Motivated by the deficiency of the existing techniques, this paper presents the first study on OL queries in road networks. We propose a unified framework that addresses three variants of OL queries that find important applications in practice, and we instantiate the framework with several novel query processing algorithms. We demonstrate the efficiency of our solutions through extensive experiments with real data.
【Keywords】: query processing; road traffic; traffic engineering computing; visual databases; cost metric; optimal location query; query processing algorithm; road network database; spatial query; Algorithm design and analysis; Artificial neural networks; Cities and towns; Equations; Measurement; Query processing; Roads
【Paper Link】 【Pages】:816-827
【Authors】: Hua Lu ; Bin Yang ; Christian S. Jensen
【Abstract】: To facilitate a variety of applications, positioning systems are deployed in indoor settings. For example, Bluetooth and RFID positioning are deployed in airports to support real-time monitoring of delays as well as off-line flow and space usage analyses. Such deployments generate large collections of tracking data. Like in other data management applications, joins are indispensable in this setting. However, joins on indoor tracking data call for novel techniques that take into account the limited capabilities of the positioning systems as well as the specifics of indoor spaces. This paper proposes and studies probabilistic, spatio-temporal joins on historical indoor tracking data. Two meaningful types of join are defined. They return object pairs that satisfy spatial join predicates either at a time point or during a time interval. The predicates considered include “same X,” where X is a semantic region such as a room or hallway. Based on an analysis on the uncertainty inherent to indoor tracking data, effective join probabilities are formalized and evaluated for object pairs. Efficient two-phase hash-based algorithms are proposed for the point and interval joins. In a filter-and-refine framework, an R-tree variant is proposed that facilitates the retrieval of join candidates, and pruning rules are supplied that eliminate candidate pairs that do not qualify. An empirical study on both synthetic and real data shows that the proposed techniques are efficient and scalable.
【Keywords】: Bluetooth; indoor communication; probability; radiofrequency identification; trees (mathematics); Bluetooth positioning; R-tree variant; RFID positioning; data management; filter-and-refine framework; historical indoor tracking data; off-line flow analyses; probabilistic spatio-temporal joins; space usage analyses; symbolic indoor tracking data; two-phase hash-based algorithms; Bluetooth; Global Positioning System; Probabilistic logic; Radiofrequency identification; Semantics; Trajectory; Uncertainty
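The sketch below illustrates only the flavor of a probabilistic "same semantic region at a time point" join: it assumes each object's uncertain position at that time is given as a discrete distribution over regions and that objects are independent, so the join probability is a sum of products. These assumptions, and the threshold-based result definition, are illustrative; the paper's formalization, hashing, and R-tree-based pruning are not reproduced.

```python
def same_region_probability(dist_a, dist_b):
    """dist_a, dist_b: {region_id: probability} for two tracked objects at the
    same time point. Assuming independence, P(same region) is the sum over
    regions of the product of the two membership probabilities."""
    return sum(p * dist_b.get(region, 0.0) for region, p in dist_a.items())

def probabilistic_point_join(objects, threshold):
    """Return object pairs whose 'same region' probability reaches the threshold.
    objects: {object_id: {region_id: probability}}."""
    ids, results = sorted(objects), []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            p = same_region_probability(objects[a], objects[b])
            if p >= threshold:
                results.append((a, b, p))
    return results

objects = {
    "o1": {"room12": 0.7, "hallway3": 0.3},
    "o2": {"room12": 0.6, "room13": 0.4},
    "o3": {"hallway3": 0.9, "room13": 0.1},
}
print(probabilistic_point_join(objects, threshold=0.4))   # o1 and o2: p about 0.42
```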
【Paper Link】 【Pages】:828-839
【Authors】: Zenan Zhou ; Wei Wu ; Xiaohui Li ; Mong-Li Lee ; Wynne Hsu
【Abstract】: The MaxBRNN problem finds a region such that setting up a new service site within this region would guarantee the maximum number of customers by proximity. This problem assumes that each customer only uses the service provided by his/her nearest service site. However, in reality, a customer tends to go to his/her k nearest service sites. To handle this, MaxBRNN can be extended to the MaxBRkNN problem which finds an optimal region such that setting up a service site in this region guarantees the maximum number of customers who would consider the site as one of their k nearest service locations. We further generalize the MaxBRkNN problem to reflect the real world scenario where customers may have different preferences for different service sites, and at the same time, service sites may have preferred targeted customers. In this paper, we present an efficient solution called MaxFirst to solve this generalized MaxBRkNN problem. The algorithm works by partitioning the space into quadrants and searches only in those quadrants that potentially contain an optimal region. During the space partitioning, we compute the upper and lower bounds of the size of a quadrant's BRkNN, and use these bounds to prune the unpromising quadrants. Experiment results show that MaxFirst can be two to three orders of magnitude faster than the state-of-the-art algorithm.
【Keywords】: customer services; query processing; MaxBRNN problem; MaxFirst; k nearest service location; space partitioning; Computational modeling; Conferences; Indexes; Nearest neighbor searches; Partitioning algorithms; Recurrent neural networks; Upper bound
【Paper Link】 【Pages】:840-851
【Authors】: Evangelia Kalyvianaki ; Wolfram Wiesemann ; Quang Hieu Vu ; Daniel Kuhn ; Peter Pietzuch
【Abstract】: When users submit new queries to a distributed stream processing system (DSPS), a query planner must allocate physical resources, such as CPU cores, memory and network bandwidth, from a set of hosts to queries. Allocation decisions must provide the correct mix of resources required by queries, while achieving an efficient overall allocation to scale in the number of admitted queries. By exploiting overlap between queries and reusing partial results, a query planner can conserve resources but has to carry out more complex planning decisions. In this paper, we describe SQPR, a query planner that targets DSPSs in data centre environments with heterogeneous resources. SQPR models query admission, allocation and reuse as a single constrained optimisation problem and solves an approximate version to achieve scalability. It prevents individual resources from becoming bottlenecks by re-planning past allocation decisions and supports different allocation objectives. As our experimental evaluation in comparison with a state-of-the-art planner shows, SQPR makes efficient resource allocation decisions, even with a high utilisation of resources, with acceptable overheads.
【Keywords】: distributed processing; query processing; resource allocation; SQPR; SQPR models query admission; centre environments; constrained optimisation problem; distributed stream processing system; query planner; resource allocation decisions; stream query planning; Bandwidth; Digital signal processing; Load modeling; Optimization; Planning; Relays; Resource management
【Paper Link】 【Pages】:852-863
【Authors】: K. V. M. Naidu ; Rajeev Rastogi ; Scott Satkin ; Anand Srinivasan
【Abstract】: In this paper, we study the problem of efficiently computing multiple aggregation queries over a data stream. In order to share computation, prior proposals have suggested instantiating certain intermediate aggregates which are then used to generate the final answers for input queries. In this work, we make a number of important contributions aimed at improving the execution and generation of query plans containing intermediate aggregates. These include: (1) a different hashing model, which has low eviction rates, and also allows us to accurately estimate the number of evictions, (2) a comprehensive query execution cost model based on these estimates, (3) an efficient greedy heuristic for constructing good low-cost query plans, (4) provably near-optimal and optimal algorithms for allocating the available memory to aggregates in the query plan when the input data distribution is Zipf-like and Uniform, respectively, and (5) a detailed performance study with real-life IP flow data sets, which show that our multiple aggregates computation techniques consistently outperform the best-known approach.
【Keywords】: data handling; query processing; storage management; IP flow data sets; Zipf-like; aggregation queries; data streams; greedy heuristic; hashing model; input data distribution; memory-constrained aggregate computation; query execution cost model; query plan execution; query plan generation; Aggregates; Analytical models; Computational modeling; Data models; IP networks; Memory management; Resource management
【Paper Link】 【Pages】:864-875
【Authors】: Pit Fender ; Guido Moerkotte
【Abstract】: Finding an optimal execution order of join operations is a crucial task in every cost-based query optimizer. Since there are many possible join trees for a given query, the overhead of the join (tree) enumeration algorithm per valid join tree should be minimal. In the case of a clique-shaped query graph, the best known top-down algorithm has a complexity of Θ(n^2) per join tree, where n is the number of relations. In this paper, we present an algorithm that achieves O(1) complexity per join tree in this case. We show experimentally that this seemingly theoretical result indeed has a high impact on performance in other, non-clique settings. This is especially true for cyclic query graphs. Further, we evaluate the performance of our new algorithm and compare it with the best top-down and bottom-up algorithms described in the literature.
【Keywords】: computational complexity; graph theory; query processing; tree data structures; trees (mathematics); bottom-up algorithm; clique shaped query graph; computational complexity; cost based query optimizer; cyclic query graph; join tree; optimal execution order; top-down join enumeration algorithm; Algorithm design and analysis; Complexity theory; Data structures; Heuristic algorithms; Niobium; Optimization; Partitioning algorithms
【Paper Link】 【Pages】:876-887
【Authors】: Sang-Won Lee ; Bongki Moon
【Abstract】: Recently, a new buffer and storage management strategy called In-Page Logging (IPL) has been proposed for database systems based on flash memory. Its main objective is to overcome the limitations of flash memory such as erase-before-write and asymmetric read/write speeds by storing changes made to a data page in a form of log records without overwriting the data page itself. Since it maintains a series of changes made to a data page separately from the original data page until they are merged, the IPL scheme provides unique opportunities to design light-weight transactional support for database systems. In this paper, we propose the transactional IPL (TIPL) scheme that takes advantage of the IPL log records to support multiversion read consistency and light-weight database recovery. Due to the dual use of IPL log records, namely, for snapshot isolation and fast recovery as well as flash-aware write optimization, TIPL achieves transactional support for flash memory database systems that minimizes the space and time overhead during normal database processing and shortens the database recovery time.
【Keywords】: buffer storage; database management systems; flash memories; IPL scheme; asymmetric read-write speeds; buffer management strategy; database recovery; erase-before-write speeds; flash memory database systems; flash-aware write optimization; multiversion read consistency; multiversion read recovery; snapshot isolation; storage management strategy; transactional in-page logging; Ash; Buffer storage; Database systems; Memory management; Optimization; Servers
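A toy model of the in-page logging idea that TIPL builds on: rather than overwriting a flash data page, changes are appended as log records kept alongside the page and folded in on read or merge. The class below is a minimal sketch of that storage behavior only; the transactional machinery (snapshot isolation, multiversion reads, recovery) is beyond it.

```python
class IPLPage:
    """Toy in-page-logging page: the base image is written once; subsequent
    updates are appended as (key, value) log records instead of overwriting
    the page. Readers apply the log on the fly; when the log area is full,
    the page is merged (rewritten) and the log cleared."""

    def __init__(self, base_records, log_capacity=4):
        self.base = dict(base_records)   # last merged image of the page
        self.log = []                    # change records since the last merge
        self.log_capacity = log_capacity

    def update(self, key, value):
        self.log.append((key, value))    # append-only: no in-place overwrite
        if len(self.log) >= self.log_capacity:
            self.merge()

    def read(self, key):
        for k, v in reversed(self.log):  # the newest log record wins
            if k == key:
                return v
        return self.base.get(key)

    def merge(self):
        """Erase-and-rewrite: fold the log into a fresh base image."""
        self.base.update(dict(self.log))
        self.log.clear()

page = IPLPage({"r1": "alice", "r2": "bob"}, log_capacity=3)
page.update("r2", "carol")
print(page.read("r2"), len(page.log))    # carol 1  (no page overwrite yet)
```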
【Paper Link】 【Pages】:888-899
【Authors】: Alexander Behm ; Chen Li ; Michael J. Carey
【Abstract】: An approximate string query is to find from a collection of strings those that are similar to a given query string. Answering such queries is important in many applications such as data cleaning and record linkage, where errors could occur in queries as well as the data. Many existing algorithms have focused on in-memory indexes. In this paper we investigate how to efficiently answer such queries in a disk-based setting, by systematically studying the effects of storing data and indexes on disk. We devise a novel physical layout for an inverted index to answer queries and we study how to construct it with limited buffer space. To answer queries, we develop a cost-based, adaptive algorithm that balances the I/O costs of retrieving candidate matches and accessing inverted lists. Experiments on large, real datasets verify that simply adapting existing algorithms to a disk-based setting does not work well and that our new techniques answer queries efficiently. Further, our solutions significantly outperform a recent tree-based index, BED-tree.
【Keywords】: query processing; question answering (information retrieval); trees (mathematics); BED-tree; adaptive algorithm; approximate string query answering; data cleaning; disk-based setting; external memory; in-memory indexes; large data sets; record linkage; Adaptive algorithms; Approximation algorithms; Indexes; Layout; Maintenance engineering; Organizations; Standards organizations
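For context, the sketch below shows the standard gram-based filter-and-verify scheme this line of work builds on: strings within edit distance k of the query must share a minimum number of q-grams, so candidates are collected by merging inverted lists and counting, then verified with exact edit distance. The paper's contributions, the disk-resident index layout and the cost-based adaptive list access, are not shown here.

```python
from collections import Counter, defaultdict

def grams(s, q=2):
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def edit_distance(a, b):
    """Plain dynamic-programming edit distance, used for verification."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def approximate_search(data, query, k, q=2):
    """Gram count filter + verification: a string within edit distance k of
    `query` must share at least (|query| - q + 1) - k*q q-grams with it."""
    index = defaultdict(list)                    # inverted lists: gram -> string ids
    for sid, s in enumerate(data):
        for g in grams(s, q):
            index[g].append(sid)
    qgrams = grams(query, q)
    threshold = (len(query) - q + 1) - k * q
    shared = Counter()
    for g, cnt in qgrams.items():                # merge the query's inverted lists
        for sid in index.get(g, []):
            shared[sid] += min(cnt, grams(data[sid], q)[g])
    candidates = (range(len(data)) if threshold <= 0
                  else [sid for sid, c in shared.items() if c >= threshold])
    return [data[sid] for sid in candidates if edit_distance(data[sid], query) <= k]

data = ["kaushik", "kausik", "chakrabarti", "chaudhuri"]
print(approximate_search(data, "kaushik", k=1))   # ['kaushik', 'kausik']
```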
【Paper Link】 【Pages】:900-911
【Authors】: Zaiben Chen ; Heng Tao Shen ; Xiaofang Zhou
【Abstract】: The booming industry of location-based services has accumulated a huge collection of users' location trajectories of driving, cycling, hiking, etc. In this work, we investigate the problem of discovering the Most Popular Route (MPR) between two locations by observing the traveling behaviors of many previous users. This new query is beneficial to travelers who are asking directions or planning a trip in an unfamiliar city/area, as historical traveling experiences can reveal how people usually choose routes between locations. To achieve this goal, we firstly develop a Coherence Expanding algorithm to retrieve a transfer network from raw trajectories, for indicating all the possible movements between locations. After that, the Absorbing Markov Chain model is applied to derive a reasonable transfer probability for each transfer node in the network, which is subsequently used as the popularity indicator in the search phase. Finally, we propose a Maximum Probability Product algorithm to discover the MPR from a transfer network based on the popularity indicators in a breadth-first manner, and we illustrate the results and performance of the algorithm by extensive experiments.
【Keywords】: Markov processes; mobile computing; probability; query processing; traffic information systems; Markov Chain model; coherence expanding algorithm; location-based services; maximum probability product algorithm; most popular route; transfer network; transfer probability; traveling behaviors; user location trajectories; Clustering algorithms; Coherence; Joining processes; Markov processes; Planning; Roads; Trajectory
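One useful observation behind "most probable route" search is that maximizing a product of transfer probabilities is equivalent to minimizing the sum of their negative logarithms, so a shortest-path search applies directly. The sketch below assumes the per-edge transfer probabilities are already given (the paper derives them from an absorbing Markov chain) and uses Dijkstra rather than the paper's breadth-first Maximum Probability Product algorithm.

```python
import heapq
from math import log, exp

def most_popular_route(transfer_prob, src, dst):
    """transfer_prob: {node: {neighbor: p}} with p in (0, 1]. Maximizing the
    product of probabilities equals minimizing the sum of -log(p), so a
    standard Dijkstra search finds the most probable route."""
    pq, done = [(0.0, src, [src])], set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == dst:
            return path, exp(-cost)        # route and its probability product
        if node in done:
            continue
        done.add(node)
        for nbr, p in transfer_prob.get(node, {}).items():
            if nbr not in done:
                heapq.heappush(pq, (cost - log(p), nbr, path + [nbr]))
    return None, 0.0

# toy transfer network between locations A..D
net = {"A": {"B": 0.8, "C": 0.2}, "B": {"D": 0.5, "C": 0.5}, "C": {"D": 0.9}}
print(most_popular_route(net, "A", "D"))   # (['A', 'B', 'D'], ~0.4)
```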
【Paper Link】 【Pages】:912-923
【Authors】: Xiaowei Ying ; Xintao Wu ; Daniel Barbará
【Abstract】: Social networks are vulnerable to various attacks such as spam emails, viral marketing and the like. In this paper we develop a spectrum-based detection framework to discover the perpetrators of these attacks. In particular, we focus on Random Link Attacks (RLAs), in which the malicious user creates multiple false identities and interactions among those identities and later uses them to attack the regular members of the network. We show that RLA attackers can be filtered by using their spectral coordinate characteristics, which are hard to hide even when attackers try to resemble the rest of the network as much as possible. Experimental results show that our technique is very effective in detecting those attackers and outperforms previously published techniques.
【Keywords】: fraud; social networking (online); random link attacks; social networks; spam emails; spectral coordinate characteristics; spectrum based fraud detection framework; viral marketing; Approximation methods; Blogs; Collaboration; Eigenvalues and eigenfunctions; Electronic mail; Social network services; Topology
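As background, a node's spectral coordinates are its entries in the leading eigenvectors of the graph's adjacency matrix; the sketch below computes them with numpy and leaves out the detection rules the paper layers on top. The graph, the number of eigenvectors, and the node numbering are illustrative.

```python
import numpy as np

def spectral_coordinates(adj, k=2):
    """Return each node's coordinates in the k leading eigenvectors of the
    (symmetric) adjacency matrix. Row i is node i's spectral coordinate."""
    eigvals, eigvecs = np.linalg.eigh(adj)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]         # take the k largest
    return eigvecs[:, order]

# toy undirected graph: two triangles joined by a single edge
adj = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

coords = spectral_coordinates(adj, k=2)
for node, (x, y) in enumerate(coords):
    print(node, round(x, 3), round(y, 3))
```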
【Paper Link】 【Pages】:924-935
【Authors】: Francesco Bonchi ; Aristides Gionis ; Tamir Tassa
【Abstract】: Analyzing the structure of social networks is of interest in a wide range of disciplines, but such activity is limited by the fact that these data represent sensitive information and cannot be published in their raw form. One of the approaches to sanitize network data is to randomly add or remove edges from the graph. Recent studies have quantified the level of anonymity that is obtained by random perturbation by means of a-posteriori belief probabilities and, by conducting experiments on small datasets, arrived at the conclusion that random perturbation cannot achieve meaningful levels of anonymity without deteriorating the graph features. We offer a new information-theoretic perspective on this issue. We make an essential distinction between image and preimage anonymity and propose a more accurate quantification, based on entropy, of the anonymity level that is provided by the perturbed network. We explain why the entropy-based quantification, which is global, is more adequate than the previously used local quantification based on a-posteriori belief. We also prove that the anonymity level quantified by means of entropy is always greater than or equal to the one based on a-posteriori belief probabilities. In addition, we introduce and explore the method of random sparsification, which randomly removes edges, without adding new ones. Extensive experimentation on several very large datasets shows that randomization techniques for identity obfuscation are back in the game, as they may achieve meaningful levels of anonymity while still preserving features of the original graph.
【Keywords】: entropy; graph theory; graphs; social networking (online); entropy-based quantification; identity obfuscation; information theory; random perturbation; social network; Entropy; Gallium; Privacy; Probability distribution; Random variables; Social network services; Uncertainty
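To illustrate the contrast between the two quantifications, suppose the perturbed network induces a probability distribution over the candidate nodes for a target identity. A belief-based level that looks only at the maximum probability can be written as -log2(max p), while the entropy of the whole distribution gives a global level that is never smaller, since H(p) >= -log2(max p). These particular formulas are one common way to express the comparison and may differ in detail from the paper's definitions; the candidate distribution below is made up.

```python
from math import log2

def anonymity_levels(candidate_probs):
    """candidate_probs: for each node in the perturbed graph, the probability of
    being the target individual's image (must sum to 1). Returns the belief-based
    level, -log2(max p), and the entropy-based level, H(p); the latter is never
    smaller, since H(p) >= -log2(max p)."""
    belief_level = -log2(max(candidate_probs))
    entropy_level = -sum(p * log2(p) for p in candidate_probs if p > 0)
    return belief_level, entropy_level

# a target hides among four candidates, but not uniformly
print(anonymity_levels([0.5, 0.3, 0.1, 0.1]))   # (1.0, ~1.685) bits
```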
【Paper Link】 【Pages】:936-947
【Authors】: Tingjian Ge ; David Grabiner ; Stanley B. Zdonik
【Abstract】: Array database systems are architected for scientific and engineering applications. In these applications, the value of a cell is often imprecise and uncertain. There are at least two reasons that a Monte Carlo query processing algorithm is usually required for such uncertain data. Firstly, a probabilistic graphical model must often be used to model correlation, which requires a Monte Carlo inference algorithm for the operations in our database. Secondly, mathematical operators required by science and engineering domains are much more complex than those of SQL. State-of-the-art query processing uses Monte Carlo approximation. We give an example of using Markov Random Fields combined with an array's chunking or tiling mechanism to model correlated data. We then propose solutions for two of the most challenging problems in this framework, namely the expensive array join operation, and the determination and optimization of stopping conditions of Monte Carlo query processing. Finally, we perform an extensive empirical study on a real world application.
【Keywords】: Markov processes; Monte Carlo methods; approximation theory; database management systems; mathematical operators; probability; query processing; Markov random fields; Monte Carlo approximation; Monte Carlo inference algorithm; Monte Carlo query processing algorithm; array chunking mechanism; array database systems; array tiling mechanism; mathematical operators; probabilistic graphical model; uncertain multidimensional array data; Arrays; Monte Carlo methods; Query processing; Semantics; Temperature distribution; Temperature sensors
【Paper Link】 【Pages】:948-959
【Authors】: Roger Moussalli ; Mariam Salloum ; Walid A. Najjar ; Vassilis J. Tsotras
【Abstract】: In recent years, XML-based Publish-Subscribe Systems have become popular due to the increased demand for timely event notification. Users (or subscribers) pose complex profiles on the structure and content of the published messages. If a profile matches the message, the message is forwarded to the interested subscriber. As the amount of published content continues to grow, current software-based systems will not scale. We thus propose a novel architecture to exploit parallelism of twig matching on FPGAs. This approach yields up to three orders of magnitude higher throughput when compared to conventional approaches bound by the sequential aspect of software computing. This paper presents a novel method for performing unordered holistic twig matching on FPGAs that produces no false positives and whose throughput is independent of the complexity of the user queries or the characteristics of the input XML stream. Furthermore, we present an experimental comparison of different granularities of twig matching, namely path-based (root-to-leaf) and pair-based (parent-child or ancestor-descendant). We provide comprehensive experiments that compare the throughput, area utilization and the accuracy of matching (percent of false positives) of our holistic, path-based and pair-based FPGA approaches.
【Keywords】: XML; dynamic programming; field programmable gate arrays; software engineering; FPGA; XML based publish subscribe system; dynamic programming; event notification; software computing; twig matching; Database languages; Field programmable gate arrays; Hardware; Logic gates; Software; Throughput; XML
【Paper Link】 【Pages】:960-971
【Authors】: Yun Peng ; Byron Choi ; Jianliang Xu
【Abstract】: Recent applications including the Semantic Web, Web ontology and XML have sparked a renewed interest on graph-structured databases. Among others, twig queries have been a popular tool for retrieving subgraphs from graph-structured databases. To optimize twig queries, selectivity estimation has been a crucial and classical step. However, the majority of existing works on selectivity estimation focuses on relational and tree data. In this paper, we investigate selectivity estimation of twig queries on possibly cyclic graph data. To facilitate selectivity estimation on cyclic graphs, we propose a matrix representation of graphs derived from prime labeling - a scheme for reachability queries on directed acyclic graphs. With this representation, we exploit the consecutive ones property (C1P) of matrices. As a consequence, a node is mapped to a point in a two-dimensional space whereas a query is mapped to multiple points. We adopt histograms for scalable selectivity estimation. We perform an extensive experimental evaluation on the proposed technique and show that our technique controls the estimation error under 1.3% on XMARK and DBLP, which is more accurate than previous techniques. On TREEBANK, we produce RMSE and NRMSE 6.8 times smaller than previous techniques.
【Keywords】: XML; directed graphs; matrix algebra; ontologies (artificial intelligence); query processing; reachability analysis; semantic Web; DBLP; TREEBANK; Web ontology; XMARK; XML; directed acyclic graphs; graph-structured databases; matrix consecutive ones property; prime labeling; reachability queries; selectivity estimation; semantic Web; twig query optimisation
【Paper Link】 【Pages】:972-983
【Authors】: Ioana Manolescu ; Konstantinos Karanasos ; Vasilis Vassalos ; Spyros Zoupanos
【Abstract】: We consider the problem of rewriting XQuery queries using multiple materialized XQuery views. The XQuery dialect we use to express views and queries corresponds to tree patterns (returning data from several nodes, at different granularities, ranging from node identifiers to full XML subtrees) with value joins. We provide correct and complete algorithms for finding minimal rewritings, in which no view is redundant. Our work extends the state of the art by considering more flexible views than the mostly XPath 1.0 dialects previously considered, and more powerful rewritings. We implemented our algorithms and assess their performance through a set of experiments.
【Keywords】: XML; query processing; rewriting systems; tree data structures; XML subtrees; XPath 1.0; XQuery dialect; XQuery queries rewriting; multiple materialized XQuery views; tree patterns; Algebra; Argon; Books; Navigation; Semantics; Special issues and sections; XML
【Paper Link】 【Pages】:984-994
【Authors】: Thomas Neumann ; Guido Moerkotte
【Abstract】: Accurate cardinality estimates are essential for a successful query optimization. This is not only true for relational DBMSs but also for RDF stores. An RDF database consists of a set of triples and, hence, can be seen as a relational database with a single table with three attributes. This makes RDF rather special in that queries typically contain many self joins. We show that relational DBMSs are not well-prepared to perform cardinality estimation in this context. Further, there are hardly any special cardinality estimation methods for RDF databases. To overcome this lack of appropriate cardinality estimation methods, we introduce characteristic sets together with new cardinality estimation methods based upon them. We then show experimentally that the new methods are, in the RDF context, highly superior to the estimation methods employed by commercial DBMSs and by the open-source RDF store RDF-3X.
【Keywords】: data models; optimisation; public domain software; query processing; relational databases; RDF database; RDF query; RDF-3X; cardinality estimation; characteristic set; open source RDF; query optimization; relational DBMS; relational database; Accuracy; Books; Correlation; Estimation; Histograms; Resource description framework; Semantics
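The sketch below captures the core of the characteristic-set idea as the abstract describes it: group subjects by the exact set of predicates they occur with, then estimate the number of subjects matching a star-shaped query by summing the counts of the characteristic sets that contain all queried predicates. Multi-valued predicates and the paper's further refinements are ignored, and the toy triples are illustrative.

```python
from collections import defaultdict, Counter

def characteristic_sets(triples):
    """triples: iterable of (subject, predicate, object). Returns a Counter mapping
    each characteristic set (frozenset of predicates a subject occurs with) to the
    number of subjects having exactly that set."""
    preds_per_subject = defaultdict(set)
    for s, p, o in triples:
        preds_per_subject[s].add(p)
    return Counter(frozenset(ps) for ps in preds_per_subject.values())

def estimate_star_cardinality(char_sets, query_predicates):
    """Estimated number of subjects matching a star query requiring all of
    `query_predicates`: sum the counts of the characteristic sets that are
    supersets of the queried predicates."""
    q = set(query_predicates)
    return sum(count for cs, count in char_sets.items() if q <= cs)

triples = [
    ("p1", "name", "A"), ("p1", "age", 30), ("p1", "email", "a@x"),
    ("p2", "name", "B"), ("p2", "age", 25),
    ("p3", "name", "C"), ("p3", "email", "c@x"),
]
cs = characteristic_sets(triples)
print(estimate_star_cardinality(cs, ["name", "age"]))   # 2
```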
【Paper Link】 【Pages】:995-1006
【Authors】: Mohamed E. Khalefa ; Mohamed F. Mokbel ; Justin J. Levandoski
【Abstract】: Preference queries are essential to a wide spectrum of applications including multi-criteria decision-making tools and personalized databases. Unfortunately, most of the evaluation techniques for preference queries assume that the set of preferred attributes is stored in only one relation, leaving out a wide set of queries that include preference computations over multiple relations. This paper presents PrefJoin, an efficient preference-aware join query operator, designed specifically to deal with preference queries over multiple relations. PrefJoin consists of four main phases, Local Pruning, Data Preparation, Joining, and Refining, which respectively filter out, from each input relation, those tuples that are guaranteed not to be in the final preference set; associate meta data with each non-filtered tuple to be used to optimize the execution of the next phases; produce the subset of the join result that is relevant for the given preference function; and refine these tuples. An interesting characteristic of PrefJoin is that it tightly integrates preference computation with the join, so tuples that are guaranteed not to be answers can be pruned early, saving significant unnecessary computation cost. PrefJoin supports a variety of preference functions including skyline, multi-objective and k-dominance preference queries. We show the correctness of PrefJoin. Experimental evaluation based on a real system implementation inside PostgreSQL shows that PrefJoin consistently achieves from one to three orders of magnitude performance gain over its competitors in various scenarios.
【Keywords】: SQL; meta data; query processing; PostgreSQL; PrefJoin; data preparation; k-dominance preference queries; local pruning; meta data; multicriteria decision-making tools; nonfiltered tuple; personalized databases; preference queries; preference-aware join query operator; skyline; Algorithm design and analysis; Data structures; Decision making; Engines; Indexing
【Paper Link】 【Pages】:1007-1018
【Authors】: Yangjun Chen ; Yibin Chen
【Abstract】: Let G(V, E) be a digraph (directed graph) with n nodes and e edges. Digraph G* = (V, E*) is the reflexive, transitive closure of G if (v, u) ∈ E* iff there is a path from v to u in G. Efficient storage of G* is important for supporting reachability queries, which are not only common on graph databases, but also serve as fundamental operations used in many graph algorithms. Many strategies have been suggested based on graph labeling, by which each node is assigned certain labels such that the reachability of any two nodes through a path can be determined by their labels. Among them are interval labelling, chain decomposition, and 2-hop labeling. However, due to the very large size of many real-world graphs, the computational cost and the size of the labels produced by existing methods would prove too expensive to be practical. In this paper, we propose a new approach that decomposes a graph into a series of spanning trees, which may share common edges, to transform a reachability query over a graph into a set of queries over trees. We demonstrate both analytically and empirically the efficiency and effectiveness of our method.
【Keywords】: directed graphs; trees (mathematics); 2-hop labeling; chain decomposition; directed acyclic graph; graph database; graph decomposition; interval labelling; reachability query; spanning trees; transitive closure compression; Aerospace electronics; Algorithm design and analysis; Gold; Indexes; Labeling; Matrix decomposition; Vegetation; Directed graphs; reachability queries; spanning trees; transitive closure compression
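The building block such a decomposition relies on is cheap reachability within a single tree: a pre/post-order interval label per node answers tree reachability with one containment test, so a graph query can be broken into a few interval tests over the spanning trees plus bookkeeping for their shared edges (not shown here). The sketch below covers only the single-tree interval labeling.

```python
def interval_label(tree, root):
    """tree: {node: [children]}. Assigns each node an interval [start, end)
    from a DFS so that u reaches v in the tree iff u's interval contains v's."""
    labels, counter = {}, [0]
    def dfs(node):
        start = counter[0]
        counter[0] += 1
        for child in tree.get(node, []):
            dfs(child)
        labels[node] = (start, counter[0])
    dfs(root)
    return labels

def reaches(labels, u, v):
    su, eu = labels[u]
    sv, ev = labels[v]
    return su <= sv and ev <= eu

tree = {"a": ["b", "c"], "b": ["d"], "c": ["e"]}
labels = interval_label(tree, "a")
print(reaches(labels, "a", "e"), reaches(labels, "b", "e"))   # True False
```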
【Paper Link】 【Pages】:1019-1030
【Authors】: Xi Zhang ; Jan Chomicki
【Abstract】: We propose a “logic + SQL” framework for set preferences. Candidate best sets are represented using profiles consisting of scalar features. This reduces set preferences to tuple preferences over set profiles. We propose two optimization techniques: superpreference and M-relation. Superpreference targets dominated profiles. It reduces the input size by filtering out tuples not belonging to any best k-subset. M-relation targets repeated profiles. It consolidates tuples that are exchangeable with regard to the given set preference, and therefore avoids redundant computation of the same profile. We show the results of an experimental study that demonstrates the efficacy of the optimizations.
【Keywords】: SQL; formal logic; query processing; set theory; M-relation optimization technique; k-subset; logic + SQL framework; preference queries; set preferences; set profiles; superpreference optimization technique; Additives; Aggregates; Algebra; Context; Databases; Generators; Optimization
【Paper Link】 【Pages】:1031-1042
【Authors】: Muhammad Aamir Cheema ; Xuemin Lin ; Haixun Wang ; Jianmin Wang ; Wenjie Zhang
【Abstract】: Top-k pairs queries have many real applications. k closest pairs queries, k furthest pairs queries and their bichromatic variants are some examples of top-k pairs queries that rank pairs on distance functions. While these queries have received significant research attention, there does not exist a unified approach that can efficiently answer all these queries. Moreover, there is no existing work that supports top-k pairs queries based on generic scoring functions. In this paper, we present a unified approach that supports a broad class of top-k pairs queries including the queries mentioned above. Our proposed approach allows the users to define a local scoring function for each attribute involved in the query and a global scoring function that computes the final score of each pair by combining its scores on different attributes. We propose efficient internal and external memory algorithms, and our theoretical analysis shows that the expected performance of the algorithms is optimal when two or fewer attributes are involved. Our approach does not require any pre-built indexes, is easy to implement and has a low memory requirement. We conduct extensive experiments to demonstrate the efficiency of our proposed approach.
【Keywords】: query processing; set theory; visual databases; bichromatic variants; distance functions; external memory algorithms; generic scoring functions; multidimensional space; top-k pair queries; unified approach; Algorithm design and analysis; Color; Complexity theory; Image color analysis; Indexes; Memory management; Spatial databases
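The query model in miniature: a local score per attribute for each pair, combined by a global function, with the k best pairs returned. The brute-force sketch below enumerates all pairs through a heap and is only a statement of the problem, not the paper's internal/external memory algorithms; the point data and scoring functions are made up.

```python
import heapq
from itertools import combinations

def top_k_pairs(objects, local_scores, global_score, k):
    """objects: list of records (dicts); local_scores: {attr: f(a_val, b_val)};
    global_score: combines the per-attribute scores into one number (smaller is
    better here). Returns the k best-scoring pairs."""
    def score(a, b):
        return global_score({attr: f(a[attr], b[attr])
                             for attr, f in local_scores.items()})
    return heapq.nsmallest(k, combinations(objects, 2),
                           key=lambda pair: score(*pair))

points = [{"id": 1, "x": 0.0, "y": 0.0}, {"id": 2, "x": 1.0, "y": 1.0},
          {"id": 3, "x": 0.1, "y": 0.2}, {"id": 4, "x": 5.0, "y": 5.0}]
local = {"x": lambda a, b: abs(a - b), "y": lambda a, b: abs(a - b)}
best = top_k_pairs(points, local, global_score=lambda s: s["x"] + s["y"], k=1)
print([(a["id"], b["id"]) for a, b in best])   # [(1, 3)]: the closest pair
```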
【Paper Link】 【Pages】:1043-1054
【Authors】: David Lo ; Bolin Ding ; Lucia ; Jiawei Han
【Abstract】: We are interested in scalable mining of a non-redundant set of significant recurrent rules from a sequence database. Recurrent rules have the form “whenever a series of precedent events occurs, eventually a series of consequent events occurs”. They are intuitive and characterize behaviors in many domains. An example is the domain of software specification, in which the rules capture a family of properties beneficial to program verification and bug detection. We enhance a past work on mining recurrent rules by Lo, Khoo, and Liu to perform mining more scalably. We propose a new set of pruning properties embedded in a new mining algorithm. Performance and case studies on benchmark synthetic and real datasets show that our approach is much more efficient and outperforms the state-of-the-art approach in mining recurrent rules by up to two orders of magnitude.
【Keywords】: data mining; formal specification; bidirectional mining; bug detection; nonredundant recurrent rule mining; program verification; pruning property; sequence database; software specification domain; Association rules; Complexity theory; Databases; Redundancy; Semantics; Software
【Paper Link】 【Pages】:1055-1066
【Authors】: Qian Wan ; Raymond Chi-Wing Wong ; Yu Peng
【Abstract】: The importance of dominance and skyline analysis has been well recognized in multi-criteria decision making applications. Most previous studies focus on how to help customers find a set of “best” possible products from a pool of given products. In this paper, we identify an interesting problem, finding top-k profitable products, which has not been studied before. Given a set of products in the existing market, we want to find a set of k “best” possible products such that these new products are not dominated by the products in the existing market. In this problem, we need to set the prices of these products such that the total profit is maximized. We refer to such products as top-k profitable products. A straightforward solution is to enumerate all possible subsets of size k and find the subset which gives the greatest profit. However, there are an exponential number of possible subsets. In this paper, we propose solutions to find the top-k profitable products efficiently. An extensive performance study using both synthetic and real datasets is reported to verify their effectiveness and efficiency.
【Keywords】: decision making; industrial economics; profitability; set theory; dominance analysis; market; multicriteria decision making; real datasets; skyline analysis; subset; top-k profitable products; Companies; Complexity theory; Dynamic programming; Greedy algorithms; Heuristic algorithms; Indexes; Portable computers
【Paper Link】 【Pages】:1067-1078
【Authors】: Liangcai Shu ; Aiyou Chen ; Ming Xiong ; Weiyi Meng
【Abstract】: In many telecom and web applications, there is a need to identify whether data objects in the same source or different sources represent the same entity in the real world. This problem arises for subscribers in multiple services, customers in supply chain management, and users in social networks when there is no unique identifier across multiple data sources to represent a real-world entity. Entity resolution is to identify and discover objects in the data sets that refer to the same entity in the real world. We investigate the entity resolution problem for large data sets where efficient and scalable solutions are needed. We propose a novel unsupervised blocking algorithm, namely SPectrAl Neighborhood (SPAN), which constructs a fast bipartition tree for the records based on spectral clustering such that real entities can be identified accurately by neighborhood records in the tree. There are two major novel aspects in our approach: 1) We develop a fast algorithm that performs spectral clustering without computing pairwise similarities explicitly, which dramatically improves the scalability of the standard spectral clustering algorithm; 2) We utilize a stopping criterion specified by Newman-Girvan modularity in the bipartition process. Our experimental results with both synthetic and real-world data demonstrate that SPAN is robust and outperforms other blocking algorithms in terms of accuracy while being efficient and scalable in dealing with large data sets.
【Keywords】: pattern clustering; social networking (online); supply chain management; telecommunication computing; telecommunication services; trees (mathematics); Newman-Girvan modularity; Web application; bipartition tree; entity resolution; multiple service subscriber; social networks users; spectral clustering; spectral neighborhood blocking algorithm; stopping criterion; supply chain management; telecom application; Algorithm design and analysis; Clustering algorithms; Communities; Complexity theory; Object recognition; Partitioning algorithms; Sparse matrices
【Paper Link】 【Pages】:1079-1090
【Authors】: Dijun Luo ; Chris H. Q. Ding ; Heng Huang ; Feiping Nie
【Abstract】: This paper addresses the scalability issue in spectral analysis, which has been widely used in data management applications. Spectral analysis techniques enjoy powerful clustering capability while suffering from high computational complexity. In most previous research, the bottleneck of the computational complexity of spectral analysis stems from the construction of the pairwise similarity matrix among objects, which costs at least O(n^2), where n is the number of data points. In this paper, we propose a novel estimator of the similarity matrix using a K-means accumulative consensus matrix, which is intrinsically sparse. The computational cost of the accumulative consensus matrix is O(n log n). We further develop a Non-negative Matrix Factorization approach to derive the clustering assignment. The overall complexity of our approach remains O(n log n). In order to validate our method, we (1) theoretically show the local preserving and convergent property of the similarity estimator, (2) validate it on a large number of real-world datasets and compare the results to other state-of-the-art spectral analysis methods, and (3) apply it to large-scale data clustering problems. Results show that our approach uses much less computational time than other state-of-the-art clustering methods, while providing comparable clustering quality. We also successfully apply our approach to a dataset of 5 million points on a single machine in reasonable time. Our techniques open a new direction for high-quality large-scale data analysis.
【Keywords】: computational complexity; data analysis; matrix decomposition; pattern clustering; K-means accumulative consensus matrix; clustering capability; computational complexity; consensus spectral clustering; convergent property; data clustering problems; data management; data points; large-scale data analysis; near-linear time; nonnegative matrix factorization; similarity matrix; spectral analysis techniques; Clustering algorithms; Computational complexity; Laplace equations; Manifolds; Matrix decomposition; Sparse matrices; Spectral analysis
【Paper Link】 【Pages】:1091-1102
【Authors】: Charu C. Aggarwal ; Haixun Wang
【Abstract】: In this paper, we will examine the problem of dimensionality reduction of massive disk-resident data sets. Graph mining has become important in recent years because of its numerous applications in community detection, social networking, and web mining. Many graph data sets are defined on massive node domains in which the number of nodes in the underlying domain is very large. As a result, it is often difficult to store and hold the information necessary in order to retrieve and index the data. Most known methods for dimensionality reduction are effective only for data sets defined on modest domains. Furthermore, while the problem of dimensionality reduction is most relevant to the problem of massive data sets, these algorithms are inherently not designed for the case of disk-resident data in terms of the order in which the data is accessed on disk. This is a serious limitation which restricts the applicability of current dimensionality reduction methods. Furthermore, since dimensionality reduction methods are typically designed for database applications such as indexing, it is important to design the underlying data reduction method, so that it can be effectively used for such applications. In this paper, we will examine the difficult problem of dimensionality reduction of graph data in the difficult case in which the underlying number of nodes are very large and the data set is disk-resident. We will propose an effective sampling algorithm for dimensionality reduction and show how to perform the dimensionality reduction in a limited number of passes on disk. We will also design the technique to be highly interpretable and friendly for indexing applications. We will illustrate the effectiveness and efficiency of the approach on a number of real data sets.
【Keywords】: data mining; graph theory; data reduction method; dimensionality reduction method; disk-resident data sets; graph indexing; graph mining; graph retrieval; Algorithm design and analysis; Bridges; Computational complexity; Data mining; Indexing; Partitioning algorithms; Social network services
【Paper Link】 【Pages】:1103-1114
【Authors】: Dongxiang Zhang ; Divyakant Agrawal ; Gang Chen ; Anthony K. H. Tung
【Abstract】: Nearest neighbor (NN) search in high dimensional space is an essential query in many multimedia retrieval applications. Due to the curse of dimensionality, existing index structures might perform even worse than a simple sequential scan of data when answering exact NN queries. To improve the efficiency of NN search, locality sensitive hashing (LSH) and its variants have been proposed to find approximate NN. They adopt hash functions that can preserve the Euclidean distance so that similar objects have a high probability of colliding in the same bucket. Given a query object, candidates for the query result are obtained by accessing the points that are located in the same bucket. To improve the precision, each hash table is associated with m hash functions to recursively hash the data points into smaller buckets and remove the false positives. On the other hand, multiple hash tables are required to guarantee a high retrieval recall. Thus, tuning a good tradeoff between precision and recall becomes the main challenge for LSH. Recently, the locality sensitive B-tree (LSB-tree) has been proposed to ensure both quality and efficiency. However, the index uses random I/O access. When the multimedia database is large, it requires considerable disk I/O cost to obtain an approximation ratio that works in practice. In this paper, we propose a novel index structure, named HashFile, for efficient retrieval of multimedia objects. It combines the advantages of random projection and linear scan. Unlike the LSH family in which each bucket is associated with a concatenation of m hash values, we only recursively partition the dense buckets and organize them as a tree structure. Given a query point q, the search algorithm explores the buckets near the query object in a top-down manner. The candidate buckets in each node are stored sequentially in increasing order of the hash value and can be efficiently loaded into memory for linear scan. HashFile can support both exact and approximate NN queries. Experimental results show that HashFile performs better than existing indexes in answering both types of NN queries.
【Keywords】: file organisation; multimedia systems; query processing; trees (mathematics); HashFile index structure; hash functions; hash tables; locality sensitive B-tree; locality sensitive hashing; multimedia data; multimedia retrieval applications; nearest neighbor search; query object; query point; Lead
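The random-projection bucketing that the HashFile abstract builds on is easy to illustrate in isolation: hash each point with h(v) = floor((a·v + b) / w), group points by hash value, and answer a query by linearly scanning the buckets whose hash values lie near the query's. The Python sketch below shows only this generic idea; it is not the HashFile index itself (there is no recursive partitioning of dense buckets, no tree organization, and no on-disk layout), and every name and parameter in it (RandomProjectionHash, w, spread) is illustrative.

    import random
    from collections import defaultdict

    class RandomProjectionHash:
        """Minimal E2LSH-style hash: h(v) = floor((a . v + b) / w)."""
        def __init__(self, dim, w=4.0, seed=0):
            rnd = random.Random(seed)
            self.a = [rnd.gauss(0.0, 1.0) for _ in range(dim)]
            self.b = rnd.uniform(0.0, w)
            self.w = w

        def __call__(self, v):
            proj = sum(ai * vi for ai, vi in zip(self.a, v))
            return int((proj + self.b) // self.w)

    def build_buckets(points, h):
        buckets = defaultdict(list)
        for pid, v in enumerate(points):
            buckets[h(v)].append(pid)
        return buckets

    def approximate_nn(q, points, h, buckets, spread=1):
        # Linearly scan the query's bucket and the buckets with nearby hash
        # values, mimicking the "explore buckets near q" idea.
        key = h(q)
        candidates = [pid for k in range(key - spread, key + spread + 1)
                      for pid in buckets.get(k, [])]
        def dist2(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return min(candidates, key=lambda pid: dist2(q, points[pid]), default=None)

    if __name__ == "__main__":
        pts = [[random.random() for _ in range(8)] for _ in range(1000)]
        h = RandomProjectionHash(dim=8)
        buckets = build_buckets(pts, h)
        print(approximate_nn(pts[42], pts, h, buckets))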
【Paper Link】 【Pages】:1115-1126
【Authors】: Karsten Klein ; Nils Kriege ; Petra Mutzel
【Abstract】: Efficient subgraph queries in large databases are a time-critical task in many application areas such as biology or chemistry, where biological networks or chemical compounds are modeled as graphs. The NP-completeness of the underlying subgraph isomorphism problem renders an exact subgraph test for each database graph infeasible. Therefore, efficient methods have to be found that avoid most of these tests but still allow all graphs containing the query pattern to be identified. We propose a new approach based on the filter-verification paradigm, using a new hash-key fingerprint technique with a combination of tree and cycle features for filtering and a new subgraph isomorphism test for verification. Our approach is able to cope with edge and vertex labels and also allows the use of wildcard patterns in the search. We present an experimental comparison of our approach with state-of-the-art methods on a benchmark set of both real-world and generated graph instances, which shows its practicability. Our approach is implemented as part of the Scaffold Hunter software, a tool for the visual analysis of chemical compound databases.
【Keywords】: data visualisation; database indexing; fingerprint identification; graph theory; optimisation; query processing; trees (mathematics); very large databases; CT-Index; NP-completeness; Scaffold Hunter software; cycle features; edge labels; filter-verification paradigm; fingerprint-based graph indexing; hash-key fingerprint technique; large databases; subgraph isomorphism problem; subgraph queries; time-critical task; trees; vertex labels; visual analysis; wild card patterns; Biology; Chemical compounds; Encoding; Feature extraction; Indexing
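The filter-verification paradigm above can be pictured with a toy fingerprint: hash structural features of every graph into a fixed-width bit vector, discard database graphs whose fingerprint does not cover all bits of the query fingerprint, and run a subgraph isomorphism test only on the survivors. The sketch below uses short label paths as features and networkx's GraphMatcher (which tests induced subgraph isomorphism) for verification; both are stand-ins for the paper's tree/cycle features and its own verification algorithm, so this shows the paradigm, not CT-Index.

    # pip install networkx
    import networkx as nx
    from networkx.algorithms import isomorphism

    NUM_BITS = 256

    def path_features(g, max_len=3):
        """Enumerate short label paths as features (a stand-in for the
        tree/cycle features a real fingerprint would use)."""
        feats = set()
        for start in g.nodes:
            stack = [(start, (g.nodes[start]["label"],))]
            while stack:
                node, path = stack.pop()
                feats.add(path)
                if len(path) < max_len:
                    for nbr in g.neighbors(node):
                        stack.append((nbr, path + (g.nodes[nbr]["label"],)))
        return feats

    def fingerprint(g):
        bits = 0
        for f in path_features(g):
            bits |= 1 << (hash(f) % NUM_BITS)
        return bits

    def candidates(query, db):
        qfp = fingerprint(query)
        # Filter: a graph can only contain the query if its fingerprint
        # covers every bit set in the query fingerprint.
        return [g for g in db if fingerprint(g) & qfp == qfp]

    def verify(query, graph):
        gm = isomorphism.GraphMatcher(
            graph, query,
            node_match=lambda a, b: a["label"] == b["label"])
        return gm.subgraph_is_isomorphic()

    def search(query, db):
        return [g for g in candidates(query, db) if verify(query, g)]

    g = nx.Graph()
    g.add_nodes_from([(0, {"label": "C"}), (1, {"label": "O"}), (2, {"label": "C"})])
    g.add_edges_from([(0, 1), (1, 2)])
    q = nx.Graph()
    q.add_nodes_from([(0, {"label": "C"}), (1, {"label": "O"})])
    q.add_edge(0, 1)
    print(len(search(q, [g])))   # 1: g passes both the filter and the verification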
【Paper Link】 【Pages】:1127-1138
【Authors】: Eugene Wu ; Samuel Madden
【Abstract】: Many data-intensive websites use databases that grow much faster than the rate at which users access the data. Such growing datasets lead to ever-increasing space and performance overheads for maintaining and accessing indexes. Furthermore, there is often considerable skew, with popular users and recent data accessed much more frequently. These observations led us to design Shinobi, a system which uses horizontal partitioning to cluster the physical data for improved query performance, and which improves insert performance by indexing only the data that is frequently accessed. We present database design algorithms that optimally partition tables, drop indexes from partitions that are infrequently queried, and maintain these partitions as workloads change. We show a 60× performance improvement over traditionally indexed tables using a real-world query workload derived from a traffic monitoring application.
【Keywords】: database indexing; pattern clustering; query processing; Shinobi system; data cluster; data-intensive Web sites; database design algorithm; fine-grained indexing; horizontal partitioning mechanism; insert performance; partition tables; query performance; query workload; Data models; Indexing; Optimization; Partitioning algorithms; Random access memory
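The indexing decision Shinobi automates can be caricatured as a per-partition cost comparison: keep an index only where the expected savings on queries outweigh the maintenance overhead on inserts. The sketch below is exactly that caricature; the cost constants, the workload numbers, and the function name are hypothetical, and the real system additionally chooses the partitioning itself and adapts it as the workload shifts.

    def choose_indexed_partitions(partitions, query_benefit, maintenance_cost):
        """Keep an index only on partitions where expected query savings
        outweigh the insert/maintenance overhead.

        partitions: dict partition_id -> {"queries": per_day, "inserts": per_day}
        query_benefit: estimated cost saved per query hitting an indexed partition
        maintenance_cost: estimated cost added per insert into an indexed partition
        """
        indexed = set()
        for pid, stats in partitions.items():
            savings = stats["queries"] * query_benefit
            overhead = stats["inserts"] * maintenance_cost
            if savings > overhead:
                indexed.add(pid)
        return indexed

    # Hypothetical workload: a hot recent partition and colder historical ones.
    workload = {
        "2011_w01": {"queries": 5000, "inserts": 200},
        "2010_w52": {"queries": 40,   "inserts": 100000},
        "2010_w51": {"queries": 2,    "inserts": 0},
    }
    print(choose_indexed_partitions(workload, query_benefit=1.0, maintenance_cost=0.05))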
【Paper Link】 【Pages】:1139-1150
【Authors】: Suprio Ray ; Bogdan Simion ; Angela Demke Brown
【Abstract】: The volume of spatial data generated and consumed is rising exponentially and new applications are emerging as the costs of storage, processing power and network bandwidth continue to decline. Database support for spatial operations is fast becoming a necessity rather than a niche feature provided by a few products. However, the spatial functionality offered by current commercial and open-source relational databases differs significantly in terms of available features, true geodetic support, spatial functions and indexing. Benchmarks play a crucial role in evaluating the functionality and performance of a particular database, both for application users and developers, and for the database developers themselves. In contrast to transaction processing, however, there is no standard, widely used benchmark for spatial database operations. In this paper, we present a spatial database benchmark called Jackpine. Our benchmark is portable (it can support any database with a JDBC driver implementation) and includes both micro benchmarks and macro workload scenarios. The micro benchmark component tests basic spatial operations in isolation; it consists of queries based on the Dimensionally Extended 9-intersection model of topological relations and queries based on spatial analysis functions. Each macro workload includes a series of queries that are based on a common spatial data application. These macro scenarios include map search and browsing, geocoding, reverse geocoding, flood risk analysis, land information management and toxic spill analysis. We use Jackpine to evaluate the spatial features in 2 open source databases and 1 commercial offering.
【Keywords】: relational databases; visual databases; JDBC driver implementation; Jackpine; database developer; database support; information management; network bandwidth; open source relational database; spatial analysis function; spatial data generation; spatial database; Benchmark testing; Geospatial analysis; Indexes; Information management; Spatial databases
【Paper Link】 【Pages】:1151-1162
【Authors】: Vinayak R. Borkar ; Michael J. Carey ; Raman Grover ; Nicola Onose ; Rares Vernica
【Abstract】: Hyracks is a new partitioned-parallel software platform designed to run data-intensive computations on large shared-nothing clusters of computers. Hyracks allows users to express a computation as a DAG of data operators and connectors. Operators operate on partitions of input data and produce partitions of output data, while connectors repartition operators' outputs to make the newly produced partitions available at the consuming operators. We describe the Hyracks end user model, for authors of dataflow jobs, and the extension model for users who wish to augment Hyracks' built-in library with new operator and/or connector types. We also describe our initial Hyracks implementation. Since Hyracks is in roughly the same space as the open source Hadoop platform, we compare Hyracks with Hadoop experimentally for several different kinds of use cases. The initial results demonstrate that Hyracks has significant promise as a next-generation platform for data-intensive applications.
【Keywords】: data handling; parallel programming; DAG; Hyracks end user model; data connectors; data operators; data-intensive computing; open source Hadoop platform; partitioned-parallel software platform; Computational modeling; Computer architecture
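To make the operator/connector vocabulary concrete, the sketch below wires a miniature dataflow by hand: a scan produces partitions of records, a hash connector repartitions them so that equal keys end up in the same partition, and a grouping operator aggregates each partition independently. This is only the partitioned-dataflow idea in plain Python; it is not the Hyracks API, job specification format, or runtime.

    from collections import defaultdict

    def scan(partitions):
        """Source operator: each partition is a list of (key, value) records."""
        return partitions

    def hash_repartition(partitions, num_out, key):
        """Connector: reshuffle producer partitions so equal keys meet."""
        out = [[] for _ in range(num_out)]
        for part in partitions:
            for rec in part:
                out[hash(key(rec)) % num_out].append(rec)
        return out

    def group_count(partitions, key):
        """Consumer operator: per-partition aggregation."""
        result = []
        for part in partitions:
            counts = defaultdict(int)
            for rec in part:
                counts[key(rec)] += 1
            result.append(dict(counts))
        return result

    # A two-operator "job": scan -> hash connector -> grouping operator.
    inputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
    shuffled = hash_repartition(scan(inputs), num_out=2, key=lambda r: r[0])
    print(group_count(shuffled, key=lambda r: r[0]))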
【Paper Link】 【Pages】:1163-1174
【Authors】: Marcos R. Vieira ; Humberto Luiz Razente ; Maria Camila Nardini Barioni ; Marios Hadjieleftheriou ; Divesh Srivastava ; Caetano Traina Jr. ; Vassilis J. Tsotras
【Abstract】: In this paper we describe a general framework for evaluation and optimization of methods for diversifying query results. In these methods, an initial ranking candidate set produced by a query is used to construct a result set, where elements are ranked with respect to relevance and diversity features, i.e., the retrieved elements should be as relevant as possible to the query, and, at the same time, the result set should be as diverse as possible. While addressing relevance is relatively simple and has been heavily studied, diversity is a harder problem to solve. One major contribution of this paper is that, using the above framework, we adapt, implement and evaluate several existing methods for diversifying query results. We also propose two new approaches, namely the Greedy with Marginal Contribution (GMC) and the Greedy Randomized with Neighborhood Expansion (GNE) methods. Another major contribution of this paper is that we present the first thorough experimental evaluation of the various diversification techniques implemented in a common framework. We examine the methods' performance with respect to precision, running time and quality of the result. Our experimental results show that while the proposed methods have higher running times, they achieve precision very close to the optimal, while also providing the best result quality. While GMC is deterministic, the randomized approach (GNE) can achieve better result quality if the user is willing to trade off running time.
【Keywords】: query processing; greedy randomized with neighborhood expansion method; greedy with marginal contribution method; query result diversification; query result evaluation; query result optimization; Approximation algorithms; Approximation methods; Clustering algorithms; Dispersion; Force measurement; Optimization; Taxonomy
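Greedy diversification of the kind evaluated in this framework is easiest to see as a marginal-gain loop: repeatedly add the candidate with the best combination of relevance and dissimilarity to the items already chosen. The sketch below is a generic relevance/diversity greedy heuristic with a hypothetical trade-off weight lam and caller-supplied relevance and similarity functions; it is not the paper's GMC or GNE method and makes no claim about result quality.

    def greedy_diversify(candidates, relevance, similarity, k, lam=0.5):
        """Pick k items, each time adding the item with the best marginal
        combination of relevance and dissimilarity to what was already chosen."""
        selected = []
        remaining = list(candidates)
        while remaining and len(selected) < k:
            def marginal(item):
                div_penalty = max((similarity(item, s) for s in selected), default=0.0)
                return lam * relevance(item) - (1.0 - lam) * div_penalty
            best = max(remaining, key=marginal)
            selected.append(best)
            remaining.remove(best)
        return selected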
【Paper Link】 【Pages】:1175-1186
【Authors】: Shetal Shah ; S. Sudarshan ; Suhas Kajbaje ; Sandeep Patidar ; Bhanu Pratap Gupta ; Devang Vira
【Abstract】: Complex SQL queries are widely used today, but it is rather difficult to check whether a complex query has been written correctly. Formal verification based on comparing a specification with an implementation is not applicable, since SQL queries are essentially a specification without any implementation. Queries are usually checked by running them on sample datasets and checking that the correct result is returned; there is no guarantee that all possible errors are detected. In this paper, we address the problem of test data generation for checking the correctness of SQL queries, based on the query mutation approach for modeling errors. Our presentation focuses in particular on a class of join/outer-join mutations, comparison operator mutations, and aggregation operation mutations, which are a common cause of error. To minimize human effort in testing, our techniques generate a test suite containing small and intuitive test datasets. The number of datasets generated is linear in the size of the query, although the number of mutations in the class we consider is exponential. Under certain assumptions on constraints and query constructs, the test suite we generate is complete for a subclass of mutations that we define, i.e., it kills all non-equivalent mutations in this subclass.
【Keywords】: SQL; formal verification; query processing; SQL mutants; SQL queries; aggregation operation mutations; comparison operator mutations; constraint-based approach; correctness checking; error modelling; formal verification; outer-join mutations; query mutation approach; test data generation; Aggregates; Algebra; Databases; Humans; Polynomials; Programming; Testing
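The mutation-testing idea can be exercised end to end at toy scale: generate single-operator mutants of a query and check whether a small dataset distinguishes them from the original, i.e., "kills" them. The sketch below mutates the query text with naive string matching and runs both versions against an in-memory SQLite database; the table, data, and mutation table are made up, and a real tool of the kind described mutates the parsed query and, crucially, generates the killing datasets automatically.

    import sqlite3

    COMPARISON_MUTANTS = {"<": ["<=", ">"], "<=": ["<"], ">": [">=", "<"],
                          ">=": [">"], "=": ["<>"]}

    def comparison_mutants(query):
        """Yield naive single-operator mutants (string-level; a real tool works
        on the parsed query, not on its text)."""
        for op, replacements in COMPARISON_MUTANTS.items():
            idx = query.find(" " + op + " ")
            if idx >= 0:
                for repl in replacements:
                    yield query[:idx + 1] + repl + query[idx + 1 + len(op):]

    if __name__ == "__main__":
        conn = sqlite3.connect(":memory:")
        conn.executescript(
            "CREATE TABLE emp(id INT, salary INT);"
            "INSERT INTO emp VALUES (1, 50);"
            "INSERT INTO emp VALUES (2, 70);"
            "INSERT INTO emp VALUES (3, 60);")
        query = "SELECT id FROM emp WHERE salary > 60"
        original = sorted(conn.execute(query).fetchall())
        for mutant in comparison_mutants(query):
            mutated = sorted(conn.execute(mutant).fetchall())
            verdict = "killed" if mutated != original else "survives"
            print(verdict + ": " + mutant)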
【Paper Link】 【Pages】:1187-1198
【Authors】: Morten Middelfart ; Torben Bach Pedersen
【Abstract】: This paper describes the implementation of so-called sentinels in the TARGIT BI Suite. Sentinels are a novel type of rules that can warn a user if one or more measure changes in a multi-dimensional data cube are expected to cause a change to another measure critical to the user. Sentinels notify users based on previous observations, e.g., that revenue might drop within two months if an increase in customer problems combined with a decrease in website traffic is observed. In this paper we show how users, without any prior technical knowledge, can mine and use sentinels in the TARGIT BI Suite. We present in detail how sentinels are mined from data, and how sentinels are scored. We describe in detail how the sentinel mining algorithm is implemented in the TARGIT BI Suite, and show that our implementation is able to discover strong and useful sentinels that could not be found when using sequential pattern mining or correlation techniques. We demonstrate, through extensive experiments, that mining and usage of sentinels is feasible with good performance for the typical users on a real, operational data warehouse.
【Keywords】: competitive intelligence; data mining; data warehouses; TARGIT BI suite; business intelligence; correlation techniques; data mining; operational data warehouse; pattern mining; sentinel mining algorithm; website traffic; Bidirectional control; Bismuth; Business; Context; Data mining; Data warehouses; Time measurement
【Paper Link】 【Pages】:1199-1208
【Authors】: Yongqiang He ; Rubao Lee ; Yin Huai ; Zheng Shao ; Namit Jain ; Xiaodong Zhang ; Zhiwei Xu
【Abstract】: MapReduce-based data warehouse systems play an important role in supporting big data analytics, helping typical Web service providers and social network sites (e.g., Facebook) quickly understand the dynamics of user behavior trends and needs. In such a system, the data placement structure is a critical factor that can affect warehouse performance in a fundamental way. Based on our observations and analysis of Facebook production systems, we have characterized four requirements for the data placement structure: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to highly dynamic workload patterns. We have examined three commonly accepted data placement structures in conventional databases, namely row-stores, column-stores, and hybrid-stores, in the context of large data analysis using MapReduce. We show that they are not very suitable for big data processing in distributed systems. In this paper, we present a big data placement structure called RCFile (Record Columnar File) and its implementation in the Hadoop system. With intensive experiments, we show the effectiveness of RCFile in satisfying the four requirements. RCFile has been chosen as the default option in the Facebook data warehouse system. It has also been adopted by Hive and Pig, the two most widely used data analysis systems developed at Facebook and Yahoo!
【Keywords】: Web services; data analysis; data structures; data warehouses; query processing; social networking (online); Facebook production systems; Hadoop system; MapReduce-based warehouse systems; RCFile; Web service providers; Yahoo!; data analytics; data placement structure; data processing; distributed systems; fast data loading; fast query processing; large data analysis; record columnar file; social network sites; storage space utilization; user behavior trends; Data compression; Data handling; Data storage systems; Data warehouses; Facebook; Information management; Query processing
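The layout argued for above, horizontal partitioning into row groups combined with column-wise storage inside each group, can be mocked up in a few lines: each column of a row group is serialized and compressed on its own, so a projection touches only the needed columns. The sketch below uses JSON plus zlib purely as stand-ins for RCFile's real serialization and column compression, and ignores everything HDFS-related (block placement, metadata headers, lazy decompression).

    import json, zlib

    def to_rcfile_like(rows, row_group_size=3):
        """Partition rows horizontally into row groups, then store each group
        column-wise (each column serialized and compressed on its own)."""
        row_groups = []
        for start in range(0, len(rows), row_group_size):
            group = rows[start:start + row_group_size]
            columns = list(zip(*group))            # vertical partitioning inside the group
            row_groups.append([zlib.compress(json.dumps(col).encode()) for col in columns])
        return row_groups

    def read_column(row_groups, col_idx):
        """Column projection touches only the needed column of each row group."""
        values = []
        for group in row_groups:
            values.extend(json.loads(zlib.decompress(group[col_idx])))
        return values

    rows = [(1, "alice", 10), (2, "bob", 20), (3, "carol", 30), (4, "dave", 40)]
    rg = to_rcfile_like(rows, row_group_size=2)
    print(read_column(rg, 1))   # ['alice', 'bob', 'carol', 'dave']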
【Paper Link】 【Pages】:1209-1220
【Authors】: Pankaj Gulhane ; Amit Madaan ; Rupesh R. Mehta ; Jeyashankher Ramamirtham ; Rajeev Rastogi ; Sandeepkumar Satpal ; Srinivasan H. Sengamedu ; Ashwin Tengli ; Charu Tiwari
【Abstract】: Vertex is a Wrapper Induction system developed at Yahoo! for extracting structured records from template-based Web pages. To operate at Web scale, Vertex employs a host of novel algorithms for (1) Grouping similar structured pages in a Web site, (2) Picking the appropriate sample pages for wrapper inference, (3) Learning XPath-based extraction rules that are robust to variations in site structure, (4) Detecting site changes by monitoring sample pages, and (5) Optimizing editorial costs by reusing rules, etc. The system is deployed in production and currently extracts more than 250 million records from more than 200 Web sites. To the best of our knowledge, Vertex is the first system to do high-precision information extraction at Web scale.
【Keywords】: Internet; information retrieval; Web-scale information extraction; XPath-based extraction rules; structured record extraction; template-based Web pages; vertex wrapper induction system; wrapper inference; Clustering algorithms; Data mining; Humans; Monitoring; Noise measurement; Web pages
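Once extraction rules have been learned, applying them to a page reduces to evaluating XPath expressions, which is the only step the sketch below shows. The two rules, the field names, and the sample page are entirely hypothetical, and the parts that make Vertex interesting (clustering template pages, inducing rules that are robust to noise, detecting site changes) are not modeled; lxml is used only as a convenient XPath evaluator.

    # pip install lxml
    from lxml import html

    # Hypothetical learned rules for a product-page template; a real system
    # would induce these from sample pages and keep them robust to noise nodes.
    RULES = {
        "title": "//h1[@class='product-title']/text()",
        "price": "//span[@id='price']/text()",
    }

    def extract_record(page_source, rules=RULES):
        tree = html.fromstring(page_source)
        record = {}
        for field, xpath in rules.items():
            hits = tree.xpath(xpath)
            record[field] = hits[0].strip() if hits else None
        return record

    page = """<html><body>
      <h1 class="product-title"> Acme Anvil </h1>
      <span id="price">$19.99</span>
    </body></html>"""
    print(extract_record(page))  # {'title': 'Acme Anvil', 'price': '$19.99'}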
【Paper Link】 【Pages】:1221-1231
【Authors】: Ru Fang ; Hui-I Hsiao ; Bin He ; C. Mohan ; Yun Wang
【Abstract】: Storage class memory (SCM), a new generation of memory technology, offers non-volatility, high speed, and byte-addressability, combining the best properties of current hard disk drives (HDD) and main memory. With these extraordinary features, current systems and software stacks need to be redesigned to obtain significantly improved performance by eliminating disk input/output (I/O) barriers, and simpler system designs by avoiding complicated data format transformations. In current DBMSs, logging and recovery are the most important components for enforcing the atomicity and durability of a database. Traditionally, database systems rely on disks for logging transaction actions, and log records are forced to disk when a transaction commits. Because of the slow disk I/O speed, logging becomes one of the major bottlenecks for a DBMS. Exploiting SCM as a persistent memory for transaction logging can significantly reduce this overhead. In this paper, we present the detailed design of an SCM-based approach to DBMS logging, which achieves high performance through a simplified system design and better concurrency support. We also discuss solutions to several major issues arising during system recovery, including hole detection, partial write detection, and any-point failure recovery. This new logging approach replaces the traditional disk-based logging approach in DBMSs. To analyze the performance characteristics of our SCM-based logging approach, we implement a prototype on IBM SolidDB. In common circumstances, our experimental results show that the new SCM-based logging approach provides up to a 7× throughput improvement over disk-based logging in the Telecommunication Application Transaction Processing (TATP) benchmark.
【Keywords】: database management systems; disc drives; hard discs; transaction processing; DBMS logging; IBM SolidDB; SCM-based approach; any-point failure recovery; hard disk drives; high performance database logging; hole detection; log records; logging transaction actions; main memory; partial write detection; storage class memory; telecommunication application transaction processing benchmark; Computer crashes; Data structures; Databases; Hardware; Phase change materials; Random access memory; Writing
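The performance argument rests on replacing "force the log to disk at commit" with appends to persistent memory. The toy log below appends length-prefixed records into a memory-mapped file, which merely stands in for an SCM region, and can replay them in order. It deliberately omits what the paper is actually about (cache-line flush/ordering, hole detection, partial-write detection, the recovery protocol), so it should be read only as an illustration of the commit data path; the class name and record format are invented.

    import mmap, os, struct

    LOG_SIZE = 1 << 20  # 1 MiB region standing in for a persistent-memory buffer

    class ToyScmLog:
        """Append-only commit log in a memory-mapped region (illustration only)."""
        def __init__(self, path):
            self.fd = os.open(path, os.O_CREAT | os.O_RDWR)
            os.ftruncate(self.fd, LOG_SIZE)
            self.mm = mmap.mmap(self.fd, LOG_SIZE)
            self.tail = 0

        def append(self, payload):
            record = struct.pack("<I", len(payload)) + payload
            self.mm[self.tail:self.tail + len(record)] = record
            self.tail += len(record)   # no per-commit fsync on this path

        def replay(self):
            pos = 0
            while pos + 4 <= self.tail:
                (length,) = struct.unpack_from("<I", self.mm, pos)
                yield bytes(self.mm[pos + 4:pos + 4 + length])
                pos += 4 + length

    log = ToyScmLog("toy_scm_log.bin")
    log.append(b"T1 UPDATE accounts SET balance = balance - 10 WHERE id = 7")
    log.append(b"T1 COMMIT")
    print(list(log.replay()))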
【Paper Link】 【Pages】:1232-1241
【Authors】: Sivaramakrishnan Narayanan ; Florian Waas
【Abstract】: Enterprise database systems handle a variety of diverse query workloads that are of different importance to the business. For example, periodic reporting queries are usually mission critical whereas ad-hoc queries by analysts tend to be less crucial. It is desirable to enable database administrators to express (and modify) the importance of queries at a simple and intuitive level. The mechanism used to enforce these priorities must be robust, adaptive and efficient. In this paper, we present a mechanism that continuously determines and re-computes the ideal target velocity of concurrent database processes based on their run-time statistics to achieve this prioritization. In this scheme, every process autonomously adjusts its resource consumption using basic control theory principles. The self-regulating and decentralized design of the system enables effective prioritization even in the presence of exceptional situations, including software defects or unexpected/unplanned query termination with no measurable overhead. We have implemented this approach in Greenplum Parallel Database and demonstrate its effectiveness and general applicability in a series of experiments.
【Keywords】: business data processing; parallel databases; query processing; Greenplum parallel database; concurrent database processes; control theory principles; database queries; dynamic prioritization; enterprise database systems; resource consumption; run-time statistics; Computer architecture; Equations; Mathematical model; Monitoring; Query processing
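The decentralized scheme described above can be approximated by a per-process control loop: each worker derives a target rate from its priority, measures its achieved rate, and throttles itself with a small sleep when it runs too fast. The sketch below uses a crude proportional adjustment with an arbitrary gain and a clamp that exists only to keep the demo short; the function name, constants, and the process_fn callback are hypothetical, and the real mechanism also smooths measurements and reacts to priority changes and exceptional situations.

    import time

    def run_prioritized(work_items, process_fn, priority, total_priority,
                        system_target_rate, batch=50):
        """Self-throttling worker: aim for the share of the system-wide target
        rate that corresponds to this process's priority."""
        my_target = system_target_rate * priority / total_priority   # items/sec
        delay, done, start = 0.0, 0, time.time()
        for i, item in enumerate(work_items, 1):
            process_fn(item)
            done += 1
            if delay:
                time.sleep(delay)
            if i % batch == 0:
                rate = done / max(time.time() - start, 1e-6)
                # Proportional control: running too fast -> increase the delay,
                # too slow -> shrink it. The 0.01 s cap just keeps the demo snappy.
                delay = min(0.01, max(0.0, delay + 0.0005 * (rate - my_target)))

    # Hypothetical usage: a high-priority worker claiming 8/10 of the target rate.
    run_prioritized(range(200), lambda x: x * x, priority=8, total_priority=10,
                    system_target_rate=10000)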
【Paper Link】 【Pages】:1242-1253
【Authors】: Mohamed H. Ali ; Badrish Chandramouli ; Jonathan Goldstein ; Roman Schindlauer
【Abstract】: Microsoft StreamInsight (StreamInsight, for brevity) is a platform for developing and deploying streaming applications, which need to run continuous queries over high-data-rate streams of input events. StreamInsight leverages a well-defined temporal stream model and operator algebra, as the underlying basis for processing long-running continuous queries over event streams. This allows StreamInsight to handle imperfections in event delivery and to provide correctness guarantees on the generated output. StreamInsight natively supports a diverse range of off-the-shelf streaming operators. In order to cater to a much broader range of customer scenarios and applications, StreamInsight has recently introduced a new extensibility infrastructure. With this infrastructure, StreamInsight enables developers to integrate their domain expertise within the query pipeline in the form of user defined modules (functions, operators, and aggregates). This paper describes the extensibility framework in StreamInsight; an ongoing effort at Microsoft SQL Server to support the integration of user-defined modules in a stream processing system. More specifically, the paper addresses the extensibility problem from three perspectives: the query writer's perspective, the user defined module writer's perspective, and the system's internal perspective. The paper introduces and addresses a range of new and subtle challenges that arise when we try to add extensibility to a streaming system, in a manner that is easy to use, powerful, and practical. We summarize our experience and provide future directions for supporting stream-oriented workloads in different business domains.
【Keywords】: SQL; file servers; media streaming; query processing; Microsoft SQL Server; Microsoft Streamlnsight; continuous queries; data-rate streams; domain expertise; extensibility framework; operator algebra; query pipeline; stream processing system; temporal stream model; Aggregates; Algebra; Business; Monitoring; Payloads; Query processing; Semantics
【Paper Link】 【Pages】:1254
【Authors】: Maximilian Ahrens ; Gustavo Alonso
【Abstract】: Existing relational databases are facing significant challenges as the hardware infrastructure and the underlying platform change from single CPUs to virtualized multicore machines arranged in large clusters. The problems are both technical and related to the licensing models currently in place. In this short abstract we briefly outline the challenges faced by organizations trying to virtualize and bring existing relational databases into the cloud.
【Keywords】: multiprocessing systems; relational databases; virtualisation; hardware infrastructure; licensing model; relational database; virtualized multicore machine; Cloud computing; Computational modeling; Engines; Licenses; Multicore processing; Relational databases
【Paper Link】 【Pages】:1255-1263
【Authors】: Philip A. Bernstein ; Istvan Cseri ; Nishant Dani ; Nigel Ellis ; Ajay Kalhan ; Gopal Kakivaya ; David B. Lomet ; Ramesh Manne ; Lev Novik ; Tomas Talius
【Abstract】: Cloud SQL Server is a relational database system designed to scale-out to cloud computing workloads. It uses Microsoft SQL Server as its core. To scale out, it uses a partitioned database on a shared-nothing system architecture. Transactions are constrained to execute on one partition, to avoid the need for two-phase commit. The database is replicated for high availability using a custom primary-copy replication scheme. It currently serves as the storage engine for Microsoft's Exchange Hosted Archive and SQL Azure.
【Keywords】: SQL; cloud computing; relational databases; replicated databases; software architecture; Microsoft Exchange Hosted Archive; Microsoft SQL server; SQL Azure; cloud SQL server; cloud computing; custom primary-copy replication scheme; relational database system; shared-nothing system architecture; storage engine; Availability; Data models; Fabrics; Protocols; Relational databases; Servers
【Paper Link】 【Pages】:1264-1275
【Authors】: Jan Schaffner ; Benjamin Eckart ; Dean Jacobs ; Christian Schwarz ; Hasso Plattner ; Alexander Zeier
【Abstract】: In Software-as-a-Service, multiple tenants are typically consolidated into the same database instance to reduce costs. For analytics-as-a-service, in-memory column databases are especially suitable because they offer very short response times. This paper studies the automation of operational tasks in multi-tenant in-memory column database clusters. As a prerequisite, we develop a model for predicting whether the assignment of a particular tenant to a server in the cluster will lead to violations of response time goals. This model is then extended to capture drops in capacity incurred by migrating tenants between servers. We present an algorithm for moving tenants around the cluster to ensure that response time goals are met. In so doing, the number of servers in the cluster may be dynamically increased or decreased. The model is also extended to manage multiple copies of a tenant's data for scalability and availability. We validated the model with an implementation of a multi-tenant clustering framework for SAP's in-memory column database TREX.
【Keywords】: cloud computing; database management systems; pattern clustering; storage management; TREX; analytics-as-a-service; automating cluster management tasks; in-memory column database performance prediction; multitenant in-memory column database clusters; software-as-a-service; Benchmark testing; Databases; Equations; Mathematical model; Rocks; Servers; Time factors
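A heavily reduced version of the placement problem: if some response-time model predicts a load number per tenant and a per-server capacity beyond which goals are violated, tenants can be assigned greedily, opening a new server only when no existing one fits. The first-fit-decreasing heuristic and the numbers below are illustrative; the paper's model additionally handles migrations (and the capacity they temporarily cost), replicas, and shrinking the cluster.

    def place_tenants(tenants, capacity):
        """Greedy first-fit-decreasing placement: assign each tenant to the first
        server whose predicted load stays under capacity, opening new servers as
        needed. 'load' is whatever the response-time model predicts (here a
        single number standing in for requests/sec)."""
        servers = []   # list of {"load": float, "tenants": [...]}
        for name, load in sorted(tenants.items(), key=lambda kv: -kv[1]):
            for srv in servers:
                if srv["load"] + load <= capacity:
                    srv["load"] += load
                    srv["tenants"].append(name)
                    break
            else:
                servers.append({"load": load, "tenants": [name]})
        return servers

    print(place_tenants({"t1": 40, "t2": 35, "t3": 30, "t4": 20, "t5": 10}, capacity=60))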
【Paper Link】 【Pages】:1276-1279
【Authors】: Salvatore Raunich ; Erhard Rahm
【Abstract】: The proliferation of ontologies and taxonomies in many domains increasingly demands the integration of multiple such ontologies to provide a unified view on them. We demonstrate a new automatic approach to merge large taxonomies such as product catalogs or web directories. Our approach is based on an equivalence matching between a source and target taxonomy to merge them. It is target-driven, i.e. it preserves the structure of the target taxonomy as much as possible. Further, we show how the approach can utilize additional relationships between source and target concepts to semantically improve the merge result.
【Keywords】: ontologies (artificial intelligence); ATOM; Web directories; automatic target-driven ontology merging; equivalence matching; product catalogs; target taxonomy; Automobiles; Catalogs; Merging; Ontologies; Redundancy; Semantics; Taxonomy
【Paper Link】 【Pages】:1280-1283
【Authors】: Xiang Li ; Christoph Quix ; David Kensche ; Sandra Geisler ; Lisong Guo
【Abstract】: Mediated schemas lie at the center of the well-recognized data integration architecture. Classical data integration systems rely on a mediated schema created by human experts through an intensive design process. Automatic generation of mediated schemas is still a goal to be achieved. We generate mediated schemas by merging multiple source schemas interrelated by tuple-generating dependencies (tgds). Schema merging is the process of consolidating multiple schemas into a unified view. The task becomes particularly challenging when the schemas are highly heterogeneous and autonomous. Existing approaches fall short in various respects: restricted expressiveness of input mappings, lack of data-level interpretation, output mappings that are not in a logical language (or are not given at all), and confinement to binary merging. We present here a novel system which is able to perform native n-ary schema merging using P2P style tgds as input. Suited to the scenario of generating mediated schemas for data integration, the system opts for a minimal schema signature retaining all certain answers of conjunctive queries. Logical output mappings are generated to support the mediated schemas, which enable query answering and, in some cases, query rewriting.
【Keywords】: data handling; peer-to-peer computing; query processing; question answering (information retrieval); P2P style; conjunctive queries; data dependencies; data integration architecture; input mapping restricted expressiveness; lacking data level interpretation; logical output mappings; mediated schema automatic generation; multiple source schema merging; n-ary schema merging; query answering; query rewriting; tuple-generating dependencies; Cognition; Engines; Joints; Merging; Query processing; Semantics
【Paper Link】 【Pages】:1284-1287
【Authors】: Mahendra Chavan ; Ravindra Guravannavar ; Karthik Ramachandra ; S. Sudarshan
【Abstract】: We present DBridge, a novel static analysis and program transformation tool to optimize database access. Traditionally, rewrites of queries and programs are performed independently, by the database query optimizer and the language compiler respectively, leaving out many optimization opportunities. Our tool aims to bridge this gap by performing holistic transformations, which include both program and query rewrites. Many applications invoke database queries multiple times with different parameter values. Such query invocations made from imperative loops are often the cause of poor performance due to random I/O and round-trip delays. In practice, such performance issues are addressed by manually rewriting the application to make it set oriented. Such manual rewriting of programs is often time consuming and error prone. Guravannavar et al. propose program analysis and transformation techniques for automatically rewriting an application to make it set oriented. DBridge implements these program transformation techniques for Java programs that use JDBC to access databases. In this demonstration, we showcase the holistic program/query transformations that DBridge can perform, over a variety of scenarios taken from real-world applications. We then walk through the design of DBridge, which uses the SOOT optimization framework for static analysis. Finally, we demonstrate the performance gains achieved through the transformations.
【Keywords】: Java; program compilers; program diagnostics; query processing; DBridge; Java programs; SOOT optimization framework; database access optimization; database query optimizer; imperative loops; language compiler; program analysis; program rewrite tool; program transformation tool; set-oriented query execution; static analysis; Context; Database systems; Decorrelation; Java; Performance gain; Transforms
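The loop-to-set-oriented rewrite that DBridge performs automatically for Java/JDBC programs looks, in spirit, like the before/after pair below, written in Python with SQLite only so that the example is self-contained and runnable. The point is the shape of the transformation (one query per loop iteration versus a single set-oriented query), not the static analysis that decides when the rewrite is safe; the schema and data are made up.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE orders(customer_id INT, total INT);"
        "INSERT INTO orders VALUES (1, 10);"
        "INSERT INTO orders VALUES (2, 20);"
        "INSERT INTO orders VALUES (1, 5);")

    customer_ids = [1, 2, 3]

    # Before: one round trip per loop iteration (the pattern the tool detects).
    totals_loop = {}
    for cid in customer_ids:
        row = conn.execute(
            "SELECT SUM(total) FROM orders WHERE customer_id = ?", (cid,)).fetchone()
        totals_loop[cid] = row[0]

    # After: a single set-oriented query over all parameter values.
    placeholders = ",".join("?" * len(customer_ids))
    totals_set = dict(conn.execute(
        "SELECT customer_id, SUM(total) FROM orders "
        "WHERE customer_id IN (" + placeholders + ") GROUP BY customer_id",
        customer_ids).fetchall())

    # Customer 3 has no orders: the loop reports None, the set query omits the key.
    print(totals_loop, totals_set)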
【Paper Link】 【Pages】:1288-1291
【Authors】: Costandinos Costa ; Christos Laoudias ; Demetrios Zeinalipour-Yazti ; Dimitrios Gunopulos
【Abstract】: In this demonstration paper, we present a powerful distributed framework for finding similar trajectories in a smartphone network, without disclosing the traces of participating users. Our framework exploits opportunistic and participatory sensing in order to quickly answer queries of the form: “Report objects (i.e., trajectories) that follow a similar spatio-temporal motion to Q, where Q is some query trajectory.” SmartTrace relies on an in-situ data storage model, where geo-location data is recorded locally on smartphones for both performance and privacy reasons. SmartTrace then deploys an efficient top-K query processing algorithm that exploits distributed trajectory similarity measures, resilient to spatial and temporal noise, in order to derive the most relevant answers to Q quickly and efficiently. Our demonstration shows how the SmartTrace algorithmics are ported on a network of Android-based smartphone devices with impressive query response times. To demonstrate the capabilities of SmartTrace during the conference, we will allow the attendees to query local smartphone networks in the following two modes: (i) Interactive Mode, where devices will be handed out to participants aiming to identify who is moving similarly to the querying node; and (ii) Trace-driven Mode, where a large-scale deployment can be launched in order to show how the K most similar trajectories can be identified quickly and efficiently. The conference attendees will be able to appreciate how interesting spatio-temporal search applications can be implemented efficiently (for performance reasons) and without disclosing the complete user traces to the query processor (for privacy reasons). For instance, an attendee might be able to determine other attendees that have participated in common sessions, in order to initiate new discussions and collaborations, without knowing their trajectory or revealing his/her own trajectory either.
【Keywords】: query processing; data storage model; geolocation data; query processing algorithm; query trajectory; smartphone network; spatio-temporal motion; Google; IEEE 802.11 Standards; Indoor environments; Mobile communication; Privacy; Sensors; Trajectory
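A noise-tolerant trajectory similarity of the general kind used for such queries can be illustrated with an LCSS-style measure: two points match if they are within a spatial tolerance, and the score is the normalized length of the longest matching subsequence. The sketch below, including the idea that each phone scores its own locally stored traces and reports only scores, is a simplification with hypothetical parameter values; it is not the SmartTrace algorithm and ignores temporal alignment and the distributed top-K protocol.

    def lcss(traj_a, traj_b, eps=0.001):
        """Longest-common-subsequence similarity between two trajectories
        (lists of (x, y) points), tolerant to spatial noise within eps."""
        n, m = len(traj_a), len(traj_b)
        dp = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                ax, ay = traj_a[i - 1]
                bx, by = traj_b[j - 1]
                if abs(ax - bx) <= eps and abs(ay - by) <= eps:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[n][m] / max(min(n, m), 1)

    def top_k_similar(query, local_trajectories, k=3):
        """Each device would score its own traces locally and report only the
        scores, keeping the raw coordinates on the phone."""
        scored = [(lcss(query, t), owner) for owner, t in local_trajectories.items()]
        return sorted(scored, reverse=True)[:k]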
【Paper Link】 【Pages】:1292-1295
【Authors】: Louis Woods ; Jens Teubner ; Gustavo Alonso
【Abstract】: We demonstrate a hardware implementation of a complex event processor, built on top of field-programmable gate arrays (FPGAs). Compared to CPU-based commodity systems, our solution shows distinctive advantages for stream monitoring tasks, e.g., wire-speed processing and predictable performance. The demonstration is based on a query-to-hardware compiler for complex event patterns that we presented at VLDB 2010 [1]. By example of a click stream monitoring application, we illustrate the inner workings of our compiler and indicate how FPGAs can act as efficient and reliable processors for event streams.
【Keywords】: field programmable gate arrays; file servers; microprocessor chips; pattern matching; program processors; CPU-based commodity systems; FPGA; VLDB 2010; click stream monitoring application; complex event patterns; complex event processor; field-programmable gate arrays; predictable performance; query-to-hardware compiler; real-time pattern matching; wire-speed processing; Automata; Decoding; Event detection; Field programmable gate arrays; Hardware; Monitoring; Pattern matching
【Paper Link】 【Pages】:1296-1299
【Authors】: Lei Dou ; Daniel Zinn ; Timothy M. McPhillips ; Sven Köhler ; Sean Riddle ; Shawn Bowers ; Bertram Ludäscher
【Abstract】: Scientific workflow systems are used to integrate existing software components (actors) into larger analysis pipelines to perform in silico experiments. Current approaches for handling data in nested-collection structures, as required in many scientific domains, lead to many record-management actors (shims) that make the workflow structure overly complex, and as a consequence hard to construct, evolve and maintain. By constructing and executing workflows from bioinformatics and geosciences in the Kepler system, we will demonstrate how COMAD (Collection-Oriented Modeling and Design), an extension of conventional workflow design, addresses these shortcomings. In particular, COMAD provides a hierarchical data stream model (as in XML) and a novel declarative configuration language for actors that functions as a middleware layer between the workflow's data model (streaming nested collections) and the actor's data model (base data and lists thereof). Our approach allows actor developers to focus on the internal actor processing logic oblivious to the workflow structure. Actors can then be re-used in various workflows simply by adapting actor configurations. Due to streaming nested collections and declarative configurations, COMAD workflows can usually be realized as linear data processing pipelines, which often reflect the scientific data analysis intention better than conventional designs. This linear structure not only simplifies actor insertions and deletions (workflow evolution), but also decreases the overall complexity of the workflow, reducing future effort in maintenance.
【Keywords】: data analysis; middleware; records management; COMAD; Collection-Oriented Modeling and Design; XML; data collection; hierarchical data stream model; internal actor processing; middleware; record management; scientific workflow design; software component; Assembly; Bioinformatics; Data models; Humidity; Phylogeny; XML
【Paper Link】 【Pages】:1300-1303
【Authors】: Emilien Antoine ; Alban Galland ; Kristian Lyngbaek ; Amélie Marian ; Neoklis Polyzotis
【Abstract】: The demonstration presents the WebdamExchange system, a distributed knowledge base management system with access rights, localization and provenance. This system is based on the exchange of logical statements that describe documents, collections, access rights, keys and localization information and updates of this data. We illustrate how the model can be used in a social-network context to help users keep control on their data on the web. In particular, we show how users within very different schemes of data-distribution (centralized, dht, unstructured P2P, etc.) can still transparently collaborate while keeping a good control over their own data.
【Keywords】: deductive databases; distributed databases; electronic data interchange; social networking (online); WebdamExchange system; access rights; data-distribution; distributed knowledge base management system; localization; provenance; social networking; Access control; Cryptography; Data models; Facebook; Permission
【Paper Link】 【Pages】:1304-1307
【Authors】: Eric Peukert ; Julian Eberius ; Erhard Rahm
【Abstract】: We present the Auto Mapping Core (AMC), a new framework that supports fast construction and tuning of schema matching approaches for specific domains such as ontology alignment, model matching or database-schema matching. Distinctive features of our framework are new visualisation techniques for modelling matching processes, stepwise tuning of parameters, intermediate result analysis and performance-oriented rewrites. Furthermore, existing matchers can be plugged into the framework to comparatively evaluate them in a common environment. This allows deeper analysis of behaviour and shortcomings in existing complex matching systems.
【Keywords】: data visualisation; ontologies (artificial intelligence); pattern matching; AMC; auto mapping core; database-schema matching; matching processes; matching systems; model matching; ontology alignment; schema matching; visualisation techniques; Computational modeling; Libraries; Matched filters; Noise; Ontologies; Tuning; Visualization
【Paper Link】 【Pages】:1308-1311
【Authors】: Daniel Deutch ; Ohad Greenshpan ; Boris Kostenko ; Tova Milo
【Abstract】: We introduce in this demonstration a system called Trivia Masster that generates a very large database of facts in a variety of topics and uses it for question answering. The facts are collected from human users (the “crowd”); the system motivates users to contribute to the database through a trivia game, in which users gain points based on their contribution. A key challenge here is to provide a suitable data cleaning mechanism that can identify which of the facts (answers to trivia questions) submitted by users are indeed correct/reliable, and consequently how many points to grant users, how to answer questions based on the collected data, and which questions to present to the trivia players in order to improve the data quality. As no existing single data cleaning technique provides a satisfactory solution to this challenge, we propose a novel approach based on a declarative framework for defining recursive and probabilistic data cleaning rules. Our solution employs an algorithm that is based on Markov Chain Monte Carlo algorithms.
【Keywords】: Markov processes; Monte Carlo methods; computer games; data analysis; probability; question answering (information retrieval); very large databases; Markov chain Monte Carlo; data cleaning mechanism; data quality improvement; probabilistic data cleaning rule; question answering; trivia game; trivia masster; very large database; Cleaning; Databases; Games; Markov processes; Monte Carlo methods; Probabilistic logic; Reliability
【Paper Link】 【Pages】:1312-1315
【Authors】: Rubi Boim ; Tova Milo ; Slava Novgorodov
【Abstract】: In this demo we present DiRec, a plug-in that allows Collaborative Filtering (CF) recommender systems to diversify the recommendations that they present to users. DiRec estimates item diversity by comparing the rankings that different users gave to the items, thereby enabling diversification even in common scenarios where no semantic information on the items is available. Items are clustered based on a novel notion of priority-medoids that provides a natural balance between the need to present highly ranked items vs. highly diverse ones. We demonstrate the operation of DiRec in the context of a movie recommendation system. We show the advantage of recommendation diversification and its feasibility even in the absence of semantic information.
【Keywords】: groupware; information filtering; pattern clustering; recommender systems; semantic Web; DiRec; collaborative filtering recommender system; movie recommendation system; priority medoid; recommendation diversification; semantic information; semanticless collaborative filtering; Approximation methods; Collaboration; Context; Correlation; Motion pictures; Recommender systems; Semantics
【Paper Link】 【Pages】:1316-1319
【Authors】: Yu-Ling Hsueh ; Roger Zimmermann ; Wei-Shinn Ku ; Yifan Jin
【Abstract】: Skyline query processing has become an important feature in multi-dimensional, data-intensive applications. Such computations are especially challenging under dynamic conditions, when either snapshot queries need to be answered with short user response times or when continuous skyline queries need to be maintained efficiently over a set of objects that are frequently updated. To achieve high performance, we have recently designed the ESC algorithm, an Efficient update approach for Skyline Computations. ESC creates a pre-computed candidate skyline set behind the first skyline (a “second line of defense,” so to speak) that facilitates an incremental, two-stage skyline update strategy which results in a quicker query response time for the user. Our demonstration presents the two-threaded SkyEngine system that builds upon and extends the base-features of the ESC algorithm with innovative, user-oriented functionalities that are termed SkyAlert and AutoAdjust. These functions enable a data or service provider to be informed about and gain the opportunity of automatically promoting its data records to remain part of the skyline, if so desired. The SkyEngine demonstration includes both a server and a web browser based client. Finally, the SkyEngine system also provides visualizations that reveal its internal performance statistics.
【Keywords】: online front-ends; query formulation; query processing; search engines; AutoAdjust; ESC algorithm; SkyAlert; SkyEngine; Skyline query processing; Skyline search engine; Web browser; continuous Skyline computations; data records; data-intensive applications; multidimensional applications; server; user response times; user-oriented functionalities; Algorithm design and analysis; Computer science; Heuristic algorithms; Query processing; Time factors; User interfaces
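The "second line of defense" can be made concrete with a two-layer structure: the skyline itself plus the skyline of the remaining points, kept as promotion candidates. The brute-force sketch below (assuming smaller-is-better in every dimension) shows how deleting a skyline point can be absorbed by the candidate layer; the real ESC algorithm maintains these layers incrementally and handles general updates, none of which is modeled here.

    def dominates(a, b):
        """a dominates b if it is at least as good in every dimension and
        strictly better in at least one (smaller is better here)."""
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    def skyline(points):
        return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

    def build_layers(points):
        """First skyline plus a pre-computed candidate layer behind it."""
        first = skyline(points)
        rest = [p for p in points if p not in first]
        return first, skyline(rest)

    def delete_point(points, first, candidates, victim):
        """Two-stage update: absorb the deletion using the candidate layer,
        then rebuild the candidate layer behind the new skyline."""
        points = [p for p in points if p != victim]
        if victim not in first:
            return points, first, [c for c in candidates if c != victim]
        survivors = [p for p in first if p != victim]
        promoted = [c for c in candidates
                    if not any(dominates(s, c) for s in survivors)]
        new_first = skyline(survivors + promoted)
        rest = [p for p in points if p not in new_first]
        return points, new_first, skyline(rest)

    pts = [(1, 9), (3, 3), (9, 1), (4, 8), (8, 4), (7, 7)]
    first, cand = build_layers(pts)
    print(first, cand)
    pts, first, cand = delete_point(pts, first, cand, victim=(3, 3))
    print(first, cand)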
【Paper Link】 【Pages】:1320-1323
【Authors】: Federico Cavalieri ; Giovanna Guerrini ; Marco Mesiti
【Abstract】: Data on the Web are mostly in XML format, and the need often arises to update their structure, which is commonly described by an XML Schema. When a schema is modified, the effects of the modification on documents need to be handled. XSUpdate is a language that makes it easy to identify parts of an XML Schema, apply a modification primitive to them, and finally define an adaptation for the associated documents, while Eχup is the corresponding engine for processing schema modification and document adaptation statements. The purpose of this demonstration is to provide an overview of the facilities of the XSUpdate language and of the Eχup system.
【Keywords】: XML; EXup system; Web; XML schemas; XSUpdate language; associated documents; document adaptation statements; processing schema modification; Conferences; Engines; Graphical user interfaces; Java; Organizations; Semantics; XML
【Paper Link】 【Pages】:1324-1327
【Authors】: Vadim Eisenberg ; Yaron Kanza
【Abstract】: The impedance mismatch problem that occurs when relational data is being processed by object-oriented (OO) programs, also occurs when OO programs process RDF data, on the Semantic Web. The impedance mismatch problem stems from the inherent differences between RDF and the data model of OO languages. In this paper, we illustrate a solution to this problem. Essentially, we modify an OO language so that RDF individuals become first-class citizens in the language, and objects of the language become first-class citizens in RDF. Three important benefits that follow from this modification are: (1) it becomes natural to use the language as a persistent programming language, (2) the language supports implicit integration of data from multiple data sources, and (3) SPARQL queries and inference can be applied to objects during the run of a program. This demo presents such a modified programming language, namely Ruby on Semantic Web, which is an extension of the Ruby programming language. The demo includes a system, where users can run applications, written in Ruby on Semantic Web, over multiple data sources. In the demo we run code examples. The effects of the execution on the data sources and on the state of the objects in memory are presented visually, in real time.
【Keywords】: object-oriented languages; object-oriented programming; relational databases; semantic Web; Ruby programming language; SPARQL inference; SPARQL queries; impedance mismatch problem; multiple data sources; object-oriented languages; object-oriented programs; relational data; semantic web; Companies; Data models; OWL; Relational databases; Resource description framework
【Paper Link】 【Pages】:1328-1331
【Authors】: Paolo Biondi ; Matteo Golfarelli ; Stefano Rizzi
【Abstract】: In this demonstration we present MYOLAP, a Java-based tool that allows OLAP analyses to be personalized and enhanced by expressing “soft” query constraints in the form of user preferences. MYOLAP is based on a novel preference algebra and a preference evaluation algorithm specifically devised for the OLAP domain. Preferences are formulated either visually or through an extension of the MDX language, and user interaction with the results is mediated by a visual graph-like structure that shows better-than relationships between different sets of data. The demonstration will show how analysis sessions can benefit from coupling ad-hoc preference constructors with the classical OLAP operators, and in particular how MYOLAP supports users in expressing preference queries, analyzing their results, and navigating datacubes.
【Keywords】: data analysis; data mining; query processing; Java-based tool; MDX language; MYOLAP; preference algebra; preference evaluation algorithm; preference queries; preference-based datacube analysis; soft query constraints; visual graph-like structure; Algebra; Data structures; Data visualization; Databases; Economics; Navigation; Visualization
【Paper Link】 【Pages】:1332-1335
【Authors】: Xiuli Ma ; Hongmei Xiao ; Shuiyuan Xie ; Qiong Li ; Qiong Luo ; Chunhua Tian
【Abstract】: Clean drinking water and safe water supply is vital to our life. Recent advances in technologies have made it possible to deploy smart sensor networks in large water distribution networks to monitor and identify the water quality online. In such a large-scale real-time monitoring application, large amounts of data stream out of multiple concurrent sensors continuously. In this paper, we present a system to monitor and analyze the sensor data streams online, find and summarize the spatio-temporal distribution patterns and correlations in co-evolving data, detect contamination events rapidly and facilitate corrective actions or notification. The system consists of an online data mining engine and a GUI providing the user with the current patterns discovered in the network, and an alerter notifying the user if there is anomalous water quality in the network.
【Keywords】: computerised monitoring; contamination; data mining; graphical user interfaces; intelligent sensors; water quality; water supply; GUI; clean drinking water; contamination events; large water distribution networks; large-scale real-time monitoring; multiple concurrent sensors; online data mining engine; online monitoring; safe water supply; sensor data streams; smart sensor networks; spatio-temporal distribution patterns; Correlation; Graphical user interfaces; Intelligent sensors; Monitoring; Real time systems; Water pollution
【Paper Link】 【Pages】:1336-1339
【Authors】: Mu-Woong Lee ; Seung-won Hwang ; Sunghun Kim
【Abstract】: To support rapid and efficient software development, we propose to demonstrate our tool, which integrates code search into the software development process. For example, a developer, right while writing a module, can find a code piece sharing the same syntactic structure from a large code corpus representing the wisdom of other developers in the same team (or in the universe of open-source code). While there exist commercial code search engines over the code universe, they treat software as text (and are thus oblivious to syntactic structure) and fail at finding semantically related code. Meanwhile, existing tools that search for syntactic clones do not focus on efficiency, targeting a “post-mortem” usage scenario of detecting clones “after” the code development is completed. In clear contrast, we focus on optimizing efficiency for syntactic code search and making this search “interactive” for large-scale corpora, to complement the existing two lines of research. In our demonstration, we will show how such interactive search supports rapid software development, as similarly claimed lately in the SE and HCI communities. As an enabling technology, we design efficient index building and traversal techniques, optimized for code corpora and code search workloads. Our tool can identify relevant code in a corpus of 1.7 million code pieces in sub-second response time, without compromising any accuracy obtained by a state-of-the-art tool, as shown by our extensive evaluation results.
【Keywords】: program compilers; search engines; software engineering; HCI community; SE community; code corpus; code piece sharing; code search integration; commercial code search engine; interactive search; large scale corpus; post mortem; software development; syntactic clone; syntactic structure; Buildings; Cloning; Indexes; Java; Programming; Search engines; Syntactics
【Paper Link】 【Pages】:1340-1343
【Authors】: Karen Works ; Elke A. Rundensteiner
【Abstract】: Given the nature of high-volume streaming environments, not all tuples can be processed within the required response time. In such instances, it is crucial to dedicate resources to producing the most important results. We will demonstrate the Proactive Promotion Engine (PP), which employs a new preferential resource allocation methodology for priority processing of stream tuples. Our key contributions include: 1) our promotion continuous query language allows the specification of priorities within a query, 2) our promotion query algebra supports proactive promotion query processing, 3) our promotion query optimization locates an optimized PP query plan, and 4) our adaptive promotion control adapts online which subset of tuples is given priority within a single physical query plan. Our “Portland Home Arrest” demonstration facilitates the capture of in-flight criminals using data generated by the Virginia Tech Network Dynamics and Simulation Science Laboratory via simulation-based modeling techniques.
【Keywords】: query formulation; query languages; query processing; resource allocation; Portland home arrest; Virginia tech network dynamics; adaptive promotion control; inflight criminal; physical query plan; proactive promotion engine; proactive promotion query processing; promotion continuous query language; promotion query algebra; promotion query optimization; resource allocation; simulation science laboratory; simulation-based modeling; stream tuple processing; volume streaming; Algebra; Monitoring; Query processing; Resource management; Runtime; Semantics; Time factors
【Paper Link】 【Pages】:1344-1347
【Authors】: Serena Sorrentino ; Sonia Bergamaschi ; Maciej Gawinecki
【Abstract】: Schema matching is the problem of finding relationships among concepts across heterogeneous data sources (heterogeneous in format and structure). Schema matching systems usually exploit lexical and semantic information provided by lexical databases/thesauri to discover intra- and inter-schema semantic relationships among schema elements. However, most of them perform poorly in real-world scenarios due to the significant presence of “non-dictionary words”. Non-dictionary words include compound nouns, abbreviations and acronyms. In this paper, we present NORMS (NORMalizer of Schemata), a tool performing schema label normalization to increase the number of comparable labels extracted from schemata.
【Keywords】: data handling; pattern matching; NORMS; automatic tool; heterogeneous data sources; lexical databases; schema label normalization; schema matching system; schemata normalizer; thesauri; Accuracy; Companies; Graphical user interfaces; Humans; Manuals; Semantics; Thesauri
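NORMS's normalization pipeline (which relies on lexical databases) is not reproduced here; as a minimal sketch under our own assumptions, label normalization can be pictured as splitting compound or camelCase labels into tokens and expanding abbreviations against a dictionary, so that labels from different schemata become comparable. The ABBREVIATIONS table and the example labels are hypothetical.

    import re

    # Hypothetical abbreviation dictionary; a real tool would consult a lexical database.
    ABBREVIATIONS = {"qty": "quantity", "amt": "amount", "cust": "customer", "no": "number"}

    def normalize_label(label: str) -> str:
        """Split a schema label into words and expand known abbreviations."""
        words = []
        for token in re.split(r"[_\-]+", label):   # split on underscores and hyphens
            # split camelCase / ALLCAPS / digit runs into separate words
            words.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", token))
        return " ".join(ABBREVIATIONS.get(w.lower(), w.lower()) for w in words)

    print(normalize_label("custNo"))      # -> customer number
    print(normalize_label("ORDER_QTY"))   # -> order quantity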
【Paper Link】 【Pages】:1348-1351
【Authors】: Sergej Zerr ; Kerstin Bischoff ; Sergey Chernov
【Abstract】: Web 2.0 applications are a rich source of multimedia resources that describe sights, events, weather conditions, traffic situations and other relevant objects along a user's route. Compared to static sight descriptions, Web 2.0 resources can provide up-to-date visual information that other users have found important or interesting. Several algorithms have recently been suggested for finding landmarks from photos. Still, if users want related videos or background information about a particular place of interest, they need to contact different social platforms or general search engines. In this paper we present GuideMe!, a mobile application that automatically identifies landmark tags from Flickr groups and gathers relevant sightseeing resources from various Web 2.0 social platforms.
【Keywords】: Internet; mobile computing; multimedia systems; search engines; social networking (online); Flickr groups; GuideMe!; Web 2.0 applications; general search engines; landmark finding problem; mobile application; multimedia resources; sightseeing resources; social platforms; traffic situations; up-to-date visual information; Cities and towns; Mobile communication; Mobile handsets; Videos; Web services; YouTube
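GuideMe!'s actual pipeline over Flickr and other Web 2.0 platforms is not shown in the abstract; the sketch below only illustrates, under our own assumptions, one way landmark tags could be picked out of a group's photos: score each tag by its frequency in the group against its document frequency in a background collection (a tf-idf-like weight). The tag lists are invented.

    import math
    from collections import Counter

    def landmark_tags(group_photo_tags, background_photo_tags, top_n=3):
        """Rank group tags by frequency, discounted by how common they are overall."""
        group_counts = Counter(t for tags in group_photo_tags for t in tags)
        background_df = Counter(t for tags in background_photo_tags for t in set(tags))
        n_background = len(background_photo_tags)
        scores = {
            tag: count * math.log((1 + n_background) / (1 + background_df[tag]))
            for tag, count in group_counts.items()
        }
        return sorted(scores, key=lambda t: -scores[t])[:top_n]

    group = [["eiffel", "paris", "tower"], ["eiffel", "night"], ["eiffel", "paris"]]
    background = [["cat"], ["paris", "food"], ["holiday", "beach"], ["night", "city"]]
    print(landmark_tags(group, background))   # 'eiffel' ranks highest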
【Paper Link】 【Pages】:1352-1355
【Authors】: Min Xie ; Laks V. S. Lakshmanan ; Peter T. Wood
【Abstract】: Classical recommender systems provide users with a list of recommendations where each recommendation consists of a single item, e.g., a book or a DVD. However, applications such as travel planning can benefit from a system capable of recommending packages of items, under a user-specified budget and in the form of sets or sequences. In this context, there is a need for a system that can recommend top-k packages for the user to choose from. In this paper, we propose a novel system, CompRec-Trip, which can automatically generate composite recommendations for travel planning. The system leverages rating information from underlying recommender systems, allows flexible package configuration and incorporates users' cost budgets on both time and money. Furthermore, the proposed CompRec-Trip system has a rich graphical user interface which allows users to customize the returned composite recommendations and take into account external local information.
【Keywords】: graphical user interfaces; mobile computing; recommender systems; travel industry; CompRec-Trip; composite recommendation system; graphical user interface; recommender systems; recommending packages; top-k packages; travel planning; user-specified budget; Approximation algorithms; Approximation methods; Databases; Heuristic algorithms; Planning; Recommender systems; Upper bound
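The system's package-generation algorithms are not detailed in the abstract; the following is only a rough greedy sketch, under our own assumptions, of the underlying problem: choose a set of items maximizing predicted rating subject to a user-specified cost budget (a knapsack-style formulation). The point-of-interest list and the rating-per-cost heuristic are illustrative, not the system's method.

    def greedy_package(items, budget):
        """items: (name, predicted_rating, cost) triples; greedy by rating per unit cost."""
        chosen, total_cost, total_rating = [], 0.0, 0.0
        for name, rating, cost in sorted(items, key=lambda x: x[1] / x[2], reverse=True):
            if total_cost + cost <= budget:
                chosen.append(name)
                total_cost += cost
                total_rating += rating
        return chosen, total_cost, total_rating

    pois = [("Louvre", 4.8, 17.0), ("Eiffel Tower", 4.7, 26.0),
            ("Seine cruise", 4.2, 15.0), ("Versailles day trip", 4.9, 90.0)]
    print(greedy_package(pois, budget=60.0))
    # -> (['Louvre', 'Seine cruise', 'Eiffel Tower'], 58.0, ~13.7)

A real package recommender would also handle sequencing and time budgets; the greedy ratio rule is just one simple heuristic for the cost-constrained selection step.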
【Paper Link】 【Pages】:1356-1359
【Authors】: Ioannis K. Paparrizos ; Hoyoung Jeung ; Karl Aberer
【Abstract】: As sensors continue to proliferate, the capability of effectively querying not only sensor data but also its metadata becomes important in a wide range of applications. This paper demonstrates a search system that utilizes various techniques and tools for querying sensor metadata and visualizing the results. Our system provides an easy-to-use query interface, built upon semantic technologies, through which users can freely store and query their metadata. Going beyond basic keyword search, the system provides a variety of advanced functionalities tailored for sensor metadata search: ordering search results according to a ranking mechanism based on the PageRank algorithm, recommending pages that contain metadata relevant to the given search conditions, presenting search results using various visualization tools, and offering dynamic hypergraphs and tag clouds of metadata. The system has been running as a real application, and its effectiveness has been demonstrated by a number of users.
【Keywords】: data visualisation; meta data; sensors; dynamic hypergraph; keyword search; query interface; querying sensor metadata; semantic technology; Data visualization; Joining processes; Resource description framework; Semantics; Tag clouds; Visualization
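The paper only states that its ranking mechanism is based on PageRank; for concreteness, here is a plain power-iteration PageRank over a small, invented link graph between metadata pages. The graph, damping factor, and iteration count are illustrative, and dangling nodes are ignored for brevity.

    def pagerank(links, damping=0.85, iterations=50):
        """links: node -> list of outgoing neighbors. Returns node -> score."""
        nodes = set(links) | {m for targets in links.values() for m in targets}
        n = len(nodes)
        rank = {node: 1.0 / n for node in nodes}
        for _ in range(iterations):
            new_rank = {node: (1.0 - damping) / n for node in nodes}
            for node, targets in links.items():
                if not targets:
                    continue                      # dangling node: skipped in this sketch
                for t in targets:
                    new_rank[t] += damping * rank[node] / len(targets)
            rank = new_rank
        return rank

    metadata_pages = {
        "station-42": ["sensor-temp", "sensor-wind"],
        "sensor-temp": ["station-42"],
        "sensor-wind": ["station-42"],
    }
    print(sorted(pagerank(metadata_pages).items(), key=lambda kv: -kv[1]))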
【Paper Link】 【Pages】:1360
【Authors】: M. Tamer Özsu ; Patrick Valduriez ; Serge Abiteboul ; Bettina Kemme ; Ricardo Jiménez-Peris ; Beng Chin Ooi
【Abstract】: Work on distributed data management commenced shortly after the introduction of the relational model in the mid-1970s. The 1970s and 1980s were very active periods for the development of distributed relational database technology, and claims were made that within the following ten years centralized databases would be an “antique curiosity” and most organizations would move toward distributed database managers [1]. That prediction has certainly come true, and all commercial DBMSs today are distributed.
【Keywords】: Books; Client-server systems; Database systems; Distributed databases; Hardware; Social network services
【Paper Link】 【Pages】:1361
【Authors】: Goetz Graefe
【Abstract】: In the context of data management, robustness is usually associated with resilience against failure, recovery, redundancy, disaster preparedness, etc. Robust query processing, on the other hand, is about robustness of performance and of scalability. It is more than progress reporting or predictability. A system that fails predictably or obviously performs poorly may be better than an unpredictable one, but it is not robust.
【Keywords】: Indexes; Measurement; Query processing; Robustness; Seminars; Tuning
【Paper Link】 【Pages】:1362-1365
【Authors】: Benjamin Bustos ; Tomás Skopal
【Abstract】: This tutorial surveys domains employing non-metric functions for effective similarity search, and methods for efficient non-metric similarity search in very large collections.
【Keywords】: query processing; nonmetric similarity search problems; similarity search; very large collections; Extraterrestrial measurements; Indexing; Proteins; Search problems; USA Councils
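To make concrete what a non-metric similarity function looks like, here is a small, self-contained example of a fractional Lp dissimilarity (p < 1), a standard case that violates the triangle inequality and therefore defeats ordinary metric-space indexing; the vectors are arbitrary.

    def fractional_lp(x, y, p=0.5):
        """Fractional Lp dissimilarity; for p < 1 the triangle inequality can fail."""
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

    a, b, c = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
    # d(a, c) = 4.0 exceeds d(a, b) + d(b, c) = 2.0, so this is not a metric.
    print(fractional_lp(a, c), fractional_lp(a, b) + fractional_lp(b, c))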
【Paper Link】 【Pages】:1366-1369
【Authors】: Sarah Cohen Boulakia ; Ulf Leser
【Abstract】: Ever since the advent of high-throughput biology (e.g., the Human Genome Project), integrating the large number of diverse biological data sets has been considered one of the most important tasks for advancement in the biological sciences. Whereas the early days of research in this area were dominated by virtual integration systems (such as multi-/federated databases), the currently predominant architecture relies on materialization, with systems built using ad-hoc techniques and a large amount of scripting. However, recent years have seen a shift in the understanding of what a “data integration system” actually should do, revitalizing research in this direction. In this tutorial, we review the past and current state of data integration for the life sciences and discuss recent trends in detail, all of which pose challenges for the database community.
【Keywords】: biology computing; data handling; ad-hoc techniques; biological data sets; biological sciences; data integration; life sciences; virtual integration systems; Bioinformatics; Biology; Data models; Distributed databases; Ontologies; Semantics
【Paper Link】 【Pages】:1370-1373
【Authors】: Goetz Graefe ; Harumi A. Kuno
【Abstract】: In summary, the core design of B-trees has remained unchanged for 40 years: balanced trees, pages or other units of I/O as nodes, efficient root-to-leaf search, splitting and merging of nodes, etc. On the other hand, an enormous amount of research and development has improved every aspect of B-trees, including data contents such as multi-dimensional data, access algorithms such as multi-dimensional queries, data organization within each node such as compression and cache optimization, concurrency control such as the separation of latching and locking, and recovery such as multi-level recovery. Gray and Reuter believed in 1993 that “B-trees are by far the most important access path structure in database and file systems.” It seems that this statement remains true today. B-tree indexes are likely to gain new importance in relational databases due to the advent of flash storage. Fast access latencies permit many more random I/O operations than traditional disk storage, thus shifting the break-even point between a full-bandwidth scan and a B-tree index search, even if the scan has the benefit of columnar database storage. We hope that this tutorial on B-tree techniques will stimulate research and development of modern B-tree indexing techniques for future data management systems.
【Keywords】: concurrency control; relational databases; tree data structures; B-tree index search; cache optimization; concurrency control; data content; data organization; file system; flash storage; relational databases; Bandwidth; Data structures; Encoding; Indexes; Maintenance engineering; Tutorials
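As a companion to the summary above, a minimal sketch of the unchanged core operation, root-to-leaf search, over an in-memory node layout with sorted keys; the node layout and names are our own illustration, and real engines add latching, logging, prefix compression, and on-disk page formats.

    import bisect

    class BTreeNode:
        def __init__(self, keys, children=None, values=None):
            self.keys = keys                  # sorted separator keys (interior) or data keys (leaf)
            self.children = children or []    # non-empty only for interior nodes
            self.values = values or []        # payloads, leaf nodes only

    def btree_search(root, key):
        """Classic root-to-leaf descent with a binary search inside each node."""
        node = root
        while node.children:                              # interior node: pick the child subtree
            node = node.children[bisect.bisect_right(node.keys, key)]
        i = bisect.bisect_left(node.keys, key)            # leaf node: exact-match lookup
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]
        return None

    leaf1 = BTreeNode([5, 9], values=["a", "b"])
    leaf2 = BTreeNode([12, 20], values=["c", "d"])
    root = BTreeNode([10], children=[leaf1, leaf2])
    print(btree_search(root, 12))   # -> c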
【Paper Link】 【Pages】:1374-1377
【Authors】: Jayant R. Haritsa
【Abstract】: The automated optimization of declarative SQL queries is a classical problem that has been diligently addressed by the database community over several decades. However, due to its inherent complexities and challenges, the topic has largely remained a "black art", and the quality of the query optimizer continues to be a key differentiator between competing database products, with large technical teams involved in their design and implementation. Over the past few years, a fresh perspective on the behavior of modern query optimizers has arisen through the introduction and development of the "plan diagram" concept. A plan diagram is a visual representation of the plan choices made by the optimizer over a space of input parameters, such as relational selectivities. In this tutorial, we provide a detailed walk-through of plan diagrams, their processing, and their applications. We begin by showcasing a variety of plan diagrams that provide intriguing insights into current query optimizer implementations. A suite of techniques for efficiently producing plan diagrams are then outlined. Subsequently, we present a suite of post-processing algorithms that take optimizer plan diagrams as input, and output new diagrams with demonstrably superior query processing characteristics, such as robustness to estimation errors. Following up, we explain how these offline characteristics can be internalized in the query optimizer, resulting in an intrinsically improved optimizer that directly produces high quality plan diagrams. Finally, we enumerate a variety of open technical problems, and promising future research directions. All the plan diagrams in the tutorial are sourced from popular industrial-strength query optimizers operating on benchmark decision-support environments, and will be graphically displayed on the Picasso visualization platform.
【Keywords】: SQL; data structures; data visualisation; optimisation; query processing; visual databases; Picasso visualization platform; automated optimization; benchmark decision-support environments; database community; database products; declarative SQL queries; query optimizer plan diagrams; query processing characteristics; visual representation; Complexity theory; Optimization; Query processing; Robustness; Tutorials
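Actual plan diagrams are produced from industrial optimizers via the Picasso platform; purely for intuition, the sketch below uses a made-up cost model for three hypothetical join plans, evaluates it over a grid of two selectivities, and prints the resulting diagram of plan choices. Nothing here corresponds to a real optimizer's cost formulas.

    def choose_plan(sel1, sel2):
        """Toy cost model over three hypothetical join plans; returns the cheapest."""
        costs = {
            "NL":   1000 * sel1 * sel2,               # nested loops: cheap only for tiny results
            "INL":  100 * sel1 + 200 * sel1 * sel2,   # index nested loops: scales with the outer input
            "HASH": 30 + 20 * (sel1 + sel2),          # hash join: mostly flat setup cost
        }
        return min(costs, key=costs.get)

    steps = [i / 10 for i in range(1, 11)]
    for sel2 in reversed(steps):                      # print a 10x10 plan diagram
        print(" ".join(f"{choose_plan(sel1, sel2):>4}" for sel1 in steps))

Each contiguous region of identical labels in the printout is one optimality region of such a diagram; the tutorial's techniques analyze and simplify exactly these regions, but over real optimizers and benchmark queries.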
【Paper Link】 【Pages】:1378-1379
【Authors】: Dario Colazzo ; Giorgio Ghelli ; Carlo Sartiani
【Abstract】: Schemas have always played a crucial role in database management. For traditional relational and object databases, schemas have a relatively simple structure, and this eases their use for optimizing and typechecking queries. In the context of XML databases, things change. Several different schema languages have been defined, tailored for different application classes. Moreover, XML schema languages are inherently more complex, as they host mechanisms for describing highly irregular and flexible structures. In this tutorial we will describe the theoretical models behind these languages, their formal properties, and will also present the complexity of the basic decision problems. We will explore some theoretical and practical applications of schemas for query processing; finally, we will discuss how decision problems can be efficiently solved, at the price of some restrictions on the expressible types.
【Keywords】: XML; query processing; relational databases; XML databases; XML processing; XML schema languages; database management; object databases; query optimization; query processing; query typechecking; relational databases; Complexity theory; Query processing; Tutorials; Web sites; XML
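The tutorial's formal machinery (regular tree grammars and their decision problems) is not reproduced here; as a small sketch under our own assumptions, the membership question for a simple DTD-style content model can be decided by compiling the model into an ordinary regular expression over the sequence of child element names. The content model and element sequences below are invented.

    import re

    # DTD-style content model for <book>: (title, author+, edition?, chapter*)
    # Encoded over space-terminated element names so each name is matched whole.
    CONTENT_MODEL = re.compile(r"^title (?:author )+(?:edition )?(?:chapter )*$")

    def validates(child_elements):
        """Check a sequence of child element names against the content model."""
        return bool(CONTENT_MODEL.match(" ".join(child_elements) + " "))

    print(validates(["title", "author", "author", "chapter"]))   # True
    print(validates(["author", "title"]))                        # False

Real XML schema languages add features such as attributes, numeric occurrence bounds, interleaving, and types that make the corresponding decision problems substantially harder, which is where the complexity results covered in the tutorial come in.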
【Paper Link】 【Pages】:1380-1383
【Authors】: Yi Chen ; Wei Wang ; Ziyang Liu
【Abstract】: Empowering users to access databases using simple keywords can relieve them of the steep learning curve of mastering a structured query language and understanding complex and possibly fast-evolving data schemas. In this tutorial, we give an overview of the state-of-the-art techniques for supporting keyword-based search and exploration on databases. Several topics will be discussed, including query result definition, ranking functions, result generation and top-k query processing, snippet generation, result clustering, result comparison, query cleaning and suggestion, performance optimization, and search quality evaluation. Various data models will be discussed, including relational data, XML data, graph-structured data, data streams, and workflows. Finally, we identify the challenges and opportunities for future research to advance the field.
【Keywords】: XML; query processing; relational databases; XML data; keyword based search; query processing; relational database; state-of-the-art technique; structured query language; Keyword search; Relational databases; Spatial databases; Tutorials; XML
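The tutorial covers many result and ranking models; as one deliberately simplified sketch (our own, not any particular system), keyword search over a single relational table can be served from an inverted index mapping terms to rows, ranking rows by how many distinct query keywords they contain. The sample rows are invented.

    from collections import defaultdict

    def build_index(rows):
        """rows: dict row_id -> text. Returns an inverted index term -> set of row ids."""
        index = defaultdict(set)
        for row_id, text in rows.items():
            for term in text.lower().split():
                index[term].add(row_id)
        return index

    def keyword_search(index, query, top_k=3):
        """Rank rows by the number of distinct query keywords they contain."""
        hits = defaultdict(int)
        for term in set(query.lower().split()):
            for row_id in index.get(term, ()):
                hits[row_id] += 1
        return sorted(hits, key=lambda r: -hits[r])[:top_k]

    rows = {1: "Jim Gray transaction processing",
            2: "keyword search on relational databases",
            3: "XML keyword search and ranking"}
    print(keyword_search(build_index(rows), "keyword search databases"))   # -> [2, 3]

Full keyword search over databases must additionally join rows across tables into connected results and define result semantics, which is precisely the ground the tutorial surveys.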