IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia), 1-5 April, 2012. IEEE Computer Society 【DBLP Link】
【Paper Link】 【Pages】:1-4
【Authors】: Serge Abiteboul ; Emilien Antoine ; Julia Stoyanovich
【Abstract】: This paper addresses the challenges faced by everyday Web users, who interact with inherently heterogeneous and distributed information. Managing such data is currently beyond the skills of casual users. We describe ongoing work that has as its goal the development of foundations for declarative distributed data management. In this approach, we see the Web as a knowledge base consisting of distributed logical facts and rules. Our objective is to enable automated reasoning over this knowledge base, ultimately improving the quality of service and of data. For this, we use Webdamlog, a Datalog-style language with rule delegation. We outline ongoing efforts on the WebdamExchange platform that combines Webdamlog evaluation with communication and security protocols.
【Keywords】: DATALOG; Internet; cryptographic protocols; inference mechanisms; knowledge based systems; Datalog-style language; WebdamExchange platform; Web viewing; Webdamlog evaluation; automated reasoning; communication protocols; declarative distributed data management; distributed knowledge base; distributed logical facts; distributed logical rules; quality of service; rule delegation; security protocols; Access control; Cognition; Distributed databases; Knowledge based systems; Protocols; Rocks
【Paper Link】 【Pages】:5
【Authors】: Surajit Chaudhuri
【Abstract】: One buzzword that has been popular in the last couple of years is Big Data. In simplest terms, Big Data symbolizes the aspiration to build platforms and tools to ingest, store and analyze data that can be voluminous, diverse, and possibly fast changing. In this talk, I will try to reflect on a few of the technical problems presented by the exploration of Big Data. Some of these challenges in data analytics have been addressed by our community in the past in a more traditional relational database context but only with mixed results. I will review these quests and study some of the key lessons learned. At the same time, significant developments such as the emergence of cloud infrastructure and availability of data rich web services hold the potential for transforming our industry. I will discuss the unique opportunities they present for Big Data Analytics.
【Keywords】: Web services; cloud computing; data analysis; relational databases; big data analytics; cloud infrastructure; data rich Web services; relational database; Abstracts; Awards activities; Conferences; Data engineering; Data handling; Data storage systems; Information management
【Paper Link】 【Pages】:6
【Authors】: Peter Druschel
【Abstract】: Summary form only given. Cooperation and trust play an increasingly important role in today's information systems. For instance, peer-to-peer systems like BitTorrent, Sopcast and Skype are powered by resource contributions from participating users; federated systems like the Internet have to respect the interests, policies and laws of participating organizations and countries; in the Cloud, users entrust their data and computation to third-party infrastructure. In this talk, we consider accountability as a way to facilitate transparency and trust in cooperative systems. We look at practical techniques to account for the integrity of distributed, cooperative computations, and look at some of the difficulties and open problems in accountability.
【Keywords】: cloud computing; information systems; peer-to-peer computing; trusted computing; BitTorrent; Internet; Skype; Sopcast; accountability; cloud; cooperative computations; cooperative information systems; peer-to-peer systems; trust; Awards activities; Computer science; Conferences; Data engineering; Educational institutions; Software systems
【Paper Link】 【Pages】:7-8
【Authors】: Michael Stonebraker ; Anastasia Ailamaki ; Jeremy Kepner ; Alexander S. Szalay
【Abstract】: For many decades, users in scientific fields (domain scientists) have resorted to either home-grown tools or legacy software for the management of their data. Technological advancements nowadays necessitate many of the properties, such as data independence, scalability, and functionality, found in the roadmap of DBMS technology. DBMS products, however, are not yet ready to address scientific application and user needs. Recent efforts toward building a science DBMS indicate that there is a long way ahead of us, paved by a research agenda that is rich in interesting and challenging problems.
【Keywords】: database management systems; scientific information systems; DBMS products; DBMS technology; data management; domain scientists; home-grown tools; legacy software; science DBMS; scientific application; scientific data bases; scientific fields; Arrays; Large Hadron Collider; Linear algebra; Servers; Software; Standards; USA Councils
【Paper Link】 【Pages】:9-19
【Authors】: Cuneyt Gurcan Akcora ; Barbara Carminati ; Elena Ferrari
【Abstract】: Several efforts have been made toward more privacy-aware Online Social Networks (OSNs) to protect personal data against various privacy threats. However, despite the relevance of these proposals, we believe there is still a lack of a conceptual model on top of which privacy tools have to be designed. Central to this model should be the concept of risk. Therefore, in this paper, we propose a risk measure for OSNs. The aim is to associate a risk level with social network users in order to provide other users with a measure of how risky it might be, in terms of disclosure of private information, to have interactions with them. We compute risk levels based on similarity and benefit measures, also taking into account the users' risk attitudes. In particular, we adopt an active learning approach for risk estimation, where user risk attitude is learned from a few required user interactions. The risk estimation process discussed in this paper has been developed into a Facebook application and tested on real data. The experiments show the effectiveness of our proposal.
【Keywords】: data privacy; graph theory; learning (artificial intelligence); risk analysis; security of data; social networking (online); user interfaces; Facebook application; OSN; active learning; benefit measures; personal data protection; privacy aware online social networks; privacy threats; privacy tools; private information disclosure; risk estimation; risk estimation process; risk level; similarity measures; social graph; user interactions; user risk attitudes; Accuracy; Clustering algorithms; Estimation; Facebook; Labeling; Privacy
【Paper Link】 【Pages】:20-31
【Authors】: Graham Cormode ; Cecilia M. Procopiuc ; Divesh Srivastava ; Entong Shen ; Ting Yu
【Abstract】: Differential privacy has recently emerged as the de facto standard for private data release. This makes it possible to provide strong theoretical guarantees on the privacy and utility of released data. While it is well-understood how to release data based on counts and simple functions under this guarantee, it remains challenging to provide general-purpose techniques to release data that is useful for a variety of queries. In this paper, we focus on spatial data such as locations and, more generally, any multi-dimensional data that can be indexed by a tree structure. Directly applying existing differential privacy methods to this type of data simply generates noise. We propose instead the class of "private spatial decompositions": these adapt standard spatial indexing methods such as quad trees and kd-trees to provide a private description of the data distribution. Equipping such structures with differential privacy requires several steps to ensure that they provide meaningful privacy guarantees. Various basic steps, such as choosing splitting points and describing the distribution of points within a region, must be done privately, and the guarantees of the different building blocks must be composed to provide an overall guarantee. Consequently, we expose the design space for private spatial decompositions, and analyze some key examples. A major contribution of our work is to provide new techniques for parameter setting and post-processing the output to improve the accuracy of query answers. Our experimental study demonstrates that it is possible to build such decompositions efficiently, and use them to answer a variety of queries privately with high accuracy.
【Keywords】: data privacy; indexing; quadtrees; data distribution; differential privacy; differentially private spatial decompositions; kd-trees; multi-dimensional data; parameter setting; quad trees; query answers; standard spatial indexing methods; tree structure; Accuracy; Data privacy; Noise; Noise measurement; Privacy; Spatial databases; Vegetation
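To make the noisy-decomposition idea concrete, here is a minimal sketch (not the authors' implementation, and omitting their parameter-setting and post-processing contributions) of a fixed-depth quadtree whose per-node counts are perturbed with Laplace noise; all function and parameter names are invented for illustration.

```python
import math
import random

def laplace(scale):
    # Inverse-CDF sampling of a Laplace(0, scale) random variable.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_quadtree(points, box, depth, eps_per_level):
    """Recursively split `box` = (x0, y0, x1, y1) and attach a noisy count to each node."""
    x0, y0, x1, y1 = box
    count = len(points) + laplace(1.0 / eps_per_level)   # sensitivity of a count is 1
    node = {"box": box, "noisy_count": count, "children": []}
    if depth > 0:
        xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        quadrants = [(x0, y0, xm, ym), (xm, y0, x1, ym), (x0, ym, xm, y1), (xm, ym, x1, y1)]
        for qx0, qy0, qx1, qy1 in quadrants:
            sub = [(x, y) for x, y in points if qx0 <= x < qx1 and qy0 <= y < qy1]
            node["children"].append(noisy_quadtree(sub, (qx0, qy0, qx1, qy1), depth - 1, eps_per_level))
    return node

# Example: a depth-2 decomposition of 1000 random points, total epsilon = 1.0 split over 3 levels.
pts = [(random.random(), random.random()) for _ in range(1000)]
tree = noisy_quadtree(pts, (0.0, 0.0, 1.0, 1.0), depth=2, eps_per_level=1.0 / 3)
print(round(tree["noisy_count"]))
```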
【Paper Link】 【Pages】:32-43
【Authors】: Jia Xu ; Zhenjie Zhang ; Xiaokui Xiao ; Yin Yang ; Ge Yu
【Abstract】: Differential privacy (DP) is a promising scheme for releasing the results of statistical queries on sensitive data, with strong privacy guarantees against adversaries with arbitrary background knowledge. Existing studies on DP mostly focus on simple aggregations such as counts. This paper investigates the publication of DP-compliant histograms, which is an important analytical tool for showing the distribution of a random variable, e.g., hospital bill size for certain patients. Compared to simple aggregations whose results are purely numerical, a histogram query is inherently more complex, since it must also determine its structure, i.e., the ranges of the bins. As we demonstrate in the paper, a DP-compliant histogram with finer bins may actually lead to significantly lower accuracy than a coarser one, since the former requires stronger perturbations in order to satisfy DP. Moreover, the histogram structure itself may reveal sensitive information, which further complicates the problem. Motivated by this, we propose two novel algorithms, namely NoiseFirst and StructureFirst, for computing DP-compliant histograms. Their main difference lies in the relative order of the noise injection and the histogram structure computation steps. NoiseFirst has the additional benefit that it can improve the accuracy of an already published DP-compliant histogram computed using a naive method. Going one step further, we extend both solutions to answer arbitrary range queries. Extensive experiments, using several real data sets, confirm that the proposed methods output highly accurate query answers, and consistently outperform existing competitors.
【Keywords】: data privacy; query processing; statistical databases; DP-compliant histograms; arbitrary background knowledge; differential privacy; differentially private histogram publication; histogram query; histogram structure computation steps; noise first algorithm; noise injection; random variable distribution; sensitive data; statistical queries; structure first algorithm; Databases; Heuristic algorithms; Histograms; Noise; Noise measurement; Privacy; Sensitivity
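As a rough illustration of the NoiseFirst ordering (noise injection before structure computation), the sketch below perturbs unit-bin counts with Laplace noise and then coarsens the histogram by merging adjacent bins and averaging their noisy counts; the greedy threshold stands in for the paper's dynamic-programming structure step and is purely hypothetical.

```python
import numpy as np

def noise_first_histogram(counts, eps, merge_threshold):
    # Step 1: perturb the finest-grained counts (sensitivity 1) with Laplace noise.
    noisy = np.array(counts, dtype=float) + np.random.laplace(0.0, 1.0 / eps, size=len(counts))
    # Step 2: coarsen the structure; averaging within a merged bin suppresses noise.
    bins = []                      # each bin is (start_index, end_index_exclusive, estimate)
    start = 0
    for i in range(1, len(noisy) + 1):
        # close the current bin when the next noisy count deviates too much from its mean
        if i == len(noisy) or abs(noisy[i] - np.mean(noisy[start:i])) > merge_threshold:
            bins.append((start, i, float(np.mean(noisy[start:i]))))
            start = i
    return bins

true_counts = [5, 6, 5, 40, 42, 41, 7, 6]          # hospital-bill-style distribution
print(noise_first_histogram(true_counts, eps=0.5, merge_threshold=10.0))
```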
【Paper Link】 【Pages】:44-53
【Authors】: Russell Paulet ; Md. Golam Kaosar ; Xun Yi ; Elisa Bertino
【Abstract】: In this paper we present a solution to one of the location-based query problems. This problem is defined as follows: (i) a user wants to query a database of location data, known as Points Of Interest (POI), and does not want to reveal his/her location to the server due to privacy concerns, (ii) the owner of the location data, that is, the location server, does not want to simply distribute its data to all users. The location server desires to have some control over its data, since the data is its asset. Previous solutions have used a trusted anonymiser to address privacy, but introduced the impracticality of trusting a third party. More recent solutions have used homomorphic encryption to remove this weakness. Briefly, the user submits his/her encrypted coordinates to the server and the server would determine the user's location homomorphically, and then the user would acquire the corresponding record using Private Information Retrieval techniques. We propose a major enhancement upon this result by introducing a similar two stage approach, where the homomorphic comparison step is replaced with Oblivious Transfer to achieve a more secure solution for both parties. The solution we present is efficient and practical in many scenarios. We also include the results of a working prototype to illustrate the efficiency of our protocol.
【Keywords】: cryptography; data privacy; query processing; POI; content-protecting location based queries; database querying; homomorphic encryption; location data; location server; location-based query problems; oblivious transfer; points of interest; privacy-preserving location based queries; private information retrieval techniques; Databases; Encryption; Mobile communication; Privacy; Protocols; Servers
【Paper Link】 【Pages】:54-65
【Authors】: Jie Bao ; Mohamed F. Mokbel ; Chi-Yin Chow
【Abstract】: This paper presents the GeoFeed system, a location-aware news feed system that provides a new platform for its users to get spatially related message updates from either their friends or favorite news sources. GeoFeed distinguishes itself from all existing news feed systems in that it takes into account the spatial extents of messages and user locations when deciding upon the selected news feed. GeoFeed is equipped with three different approaches for delivering the news feed to its users, namely, spatial pull, spatial push, and shared push. The main challenge for GeoFeed is then to decide when to use each of these three approaches and for which users. GeoFeed is equipped with a smart decision model that decides about using these approaches in a way that: (a) minimizes the system overhead for delivering the location-aware news feed, and (b) guarantees a certain response time for each user to obtain the requested location-aware news feed. Experimental results, based on real and synthetic data, show that GeoFeed outperforms existing news feed systems in terms of response time and maintenance cost.
【Keywords】: mobile computing; social networking (online); GeoFeed system; location aware news feed system; location-aware newsfeed; smart decision model; Data structures; Feeds; Indexes; Maintenance engineering; Query processing; Social network services; Time factors
【Paper Link】 【Pages】:66-77
【Authors】: Stefan Endrullis ; Andreas Thor ; Erhard Rahm
【Abstract】: Programmatic data integration approaches such as mashups have become a viable approach to dynamically integrate web data at runtime. Key data sources for mashups include entity search engines and hidden databases that need to be queried via source-specific search interfaces or web forms. Current mashups are typically restricted to simple query approaches such as using keyword search. Such approaches may need a high number of queries if many objects have to be found. Furthermore, the effectiveness of the queries may be limited, i.e., they may miss relevant results. We therefore propose more advanced search strategies that aim at finding a set of entities with high efficiency and high effectiveness. Our strategies use different kinds of queries that are determined by source-specific query generators. Furthermore, the queries are selected based on the characteristics of input entities. We introduce a flexible model for entity search strategies that includes a ranking of candidate queries determined by different query generators. We describe different query generators and outline their use within four entity search strategies. These strategies apply different query ranking and selection approaches to optimize efficiency and effectiveness. We evaluate our search strategies in detail for two domains: product search and publication search. The comparison with a standard keyword search shows that the proposed search strategies provide significant improvements in both domains.
【Keywords】: Internet; data integration; optimisation; query formulation; search engines; Web forms; candidate queries; dynamical Web data integration; entity search engine; entity search strategies; flexible model; hidden database; key data sources; mashup applications; product search; programmatic data integration approaches; publication search; query ranking; query selection approach; source-specific query generators; source-specific search interfaces; standard keyword search; Books; Complexity theory; Conferences; Data engineering; Generators; Search engines; Search problems
【Paper Link】 【Pages】:78-89
【Authors】: Xiaohui Yu ; Huxia Shi
【Abstract】: Keyword search over databases, popularized by keyword search on the WWW, allows ordinary users to access database information without knowledge of structured query languages and database schemas. Most of the previous studies in this area use IR-style ranking, which fails to consider the importance of the query answers. In this paper, we propose CI-RANK, a new approach for keyword search in databases, which considers the importance of individual nodes in a query answer and the cohesiveness of the result structure in a balanced way. CI-RANK is built upon a carefully designed model called Random Walk with Message Passing that helps capture the relationships between different nodes in the query answer. We develop a branch and bound algorithm to support the efficient generation of top-k query answers. Indexing methods are also introduced to further speed up the run-time processing of queries. Extensive experiments conducted on two real data sets with a real user query log confirm the effectiveness and efficiency of CI-RANK.
【Keywords】: database management systems; indexing; query formulation; question answering (information retrieval); random processes; tree searching; CI-Rank; IR-style ranking; branch-and-bound algorithm; collective importance; database information; database schema; indexing method; keyword search; message passing; random walk; real user query log; structured query language; top-k query answer; Algorithm design and analysis; Computational modeling; Databases; Educational institutions; Keyword search; Message passing; Motion pictures
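The Random Walk with Message Passing model is specific to the paper; the sketch below only illustrates the generic ingredient it builds on, scoring the nodes of a candidate answer subgraph by a damped random walk (power iteration). All names and the toy graph are assumptions made for illustration.

```python
def random_walk_scores(adj, damping=0.85, iters=50):
    """adj: dict node -> list of neighbor nodes (an undirected answer subgraph)."""
    nodes = list(adj)
    score = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / len(nodes) for v in nodes}
        for v in nodes:
            share = damping * score[v] / max(len(adj[v]), 1)   # spread mass to neighbors
            for u in adj[v]:
                nxt[u] += share
        score = nxt
    return score

# A toy joined-tuple graph: papers p1, p2 connected through author a1.
answer = {"p1": ["a1"], "p2": ["a1"], "a1": ["p1", "p2"]}
print(random_walk_scores(answer))
```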
【Paper Link】 【Pages】:90-101
【Authors】: Badrish Chandramouli ; Jonathan Goldstein ; Songyun Duan
【Abstract】: "Big Data" in map-reduce (M-R) clusters is often fundamentally temporal in nature, as are many analytics tasks over such data. For instance, display advertising uses Behavioral Targeting (BT) to select ads for users based on prior searches, page views, etc. Previous work on BT has focused on techniques that scale well for offline data using M-R. However, this approach has limitations for BT-style applications that deal with temporal data: (1) many queries are temporal and not easily expressible in M-R, and moreover, the set-oriented nature of M-R front-ends such as SCOPE is not suitable for temporal processing, (2) as commercial systems mature, they may need to also directly analyze and react to real-time data feeds since a high turnaround time can result in missed opportunities, but it is difficult for current solutions to naturally also operate over real-time streams. Our contributions are twofold. First, we propose a novel framework called TiMR (pronounced timer), that combines a time-oriented data processing system with a M-R framework. Users write and submit analysis algorithms as temporal queries - these queries are succinct, scale-out-agnostic, and easy to write. They scale well on large-scale offline data using TiMR, and can work unmodified over real-time streams. We also propose new cost-based query fragmentation and temporal partitioning schemes for improving efficiency with TiMR. Second, we show the feasibility of this approach for BT, with new temporal algorithms that exploit new targeting opportunities. Experiments using real data from a commercial ad platform show that TiMR is very efficient and incurs orders-of-magnitude lower development effort. Our BT solution is easy and succinct, and performs up to several times better than current schemes in terms of memory, learning time, and click-through-rate/coverage.
【Keywords】: Internet; advertising; data analysis; query processing; BT-style applications; M-R front-ends; SCOPE; TiMR; Web advertising; analytics tasks; behavioral targeting; big data; click-through-rate; cost-based query fragmentation; display advertising; large-scale offline data; map-reduce clusters; page views; prior searches; real-time data feeds; temporal analytics; temporal partitioning schemes; temporal queries; Advertising; Distributed databases; Feeds; Information management; Monitoring; Real time systems; Semantics
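TiMR itself couples a commercial temporal engine with a map-reduce framework; as a stand-in, the toy reducer below shows the style of temporal query it targets, counting one user's events per hopping window so that the same logic could be applied to an offline partition or a live feed. Window parameters and field names are hypothetical.

```python
from collections import Counter

def reduce_user_events(user_id, timestamped_events, window_size=3600, hop=1800):
    """timestamped_events: iterable of (unix_seconds, event_type) for a single key."""
    counts = Counter()
    for ts, event in timestamped_events:
        # every event falls into window_size / hop overlapping hopping windows
        w = (ts // hop) * hop                 # latest window start not after the event
        while w > ts - window_size:
            counts[(w, event)] += 1
            w -= hop
    return [(user_id, window_start, event, c) for (window_start, event), c in sorted(counts.items())]

log = [(1000, "ad_click"), (2500, "search"), (4200, "ad_click")]
for row in reduce_user_events("user42", log):
    print(row)
```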
【Paper Link】 【Pages】:102-113
【Authors】: Aubrey Tatarowicz ; Carlo Curino ; Evan P. C. Jones ; Sam Madden
【Abstract】: The standard way to get linear scaling in a distributed OLTP DBMS is to horizontally partition data across several nodes. Ideally, this partitioning will result in each query being executed at just one node, to avoid the overheads of distributed transactions and allow nodes to be added without increasing the amount of required coordination. For some applications, simple strategies, such as hashing on primary key, provide this property. Unfortunately, for many applications, including social networking and order-fulfillment, many-to-many relationships cause simple strategies to result in a large fraction of distributed queries. Instead, what is needed is a fine-grained partitioning, where related individual tuples (e.g., cliques of friends) are co-located together in the same partition. Maintaining such a fine-grained partitioning requires the database to store a large amount of metadata about which partition each tuple resides in. We call such metadata a lookup table, and present the design of a data distribution layer that efficiently stores these tables and maintains them in the presence of inserts, deletes, and updates. We show that such tables can provide scalability for several difficult to partition database workloads, including Wikipedia, Twitter, and TPC-E. Our implementation provides 40% to 300% better performance on these workloads than either simple range or hash partitioning and shows greater potential for further scale-out.
【Keywords】: data mining; distributed databases; meta data; table lookup; data distribution layer; distributed OLTP DBMS; distributed databases; distributed queries; distributed transactions; fine-grained partitioning; lookup tables; metadata; order-fulfillment; social networking; Distributed databases; Indexes; Routing protocols; Social network services; Throughput
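A minimal sketch of the lookup-table idea described above: a routing tier keeps a fine-grained key-to-partition map (e.g., co-locating cliques of friends) and falls back to hash partitioning for unknown keys. The paper's data distribution layer additionally compresses these tables and maintains them under inserts, deletes, and updates, none of which is shown; class and method names are invented.

```python
import zlib

class LookupTableRouter:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.table = {}                                # key -> partition id (fine-grained metadata)

    def assign(self, key, partition):
        self.table[key] = partition                    # e.g., produced offline by a graph partitioner

    def route(self, key):
        # Fine-grained placement wins; otherwise fall back to deterministic hash partitioning.
        return self.table.get(key, zlib.crc32(str(key).encode()) % self.num_partitions)

    def touched_partitions(self, keys):
        """Partitions a multi-key query must visit; one partition means no distributed transaction."""
        return {self.route(k) for k in keys}

router = LookupTableRouter(num_partitions=4)
for user in ("alice", "bob", "carol"):                 # a clique of friends co-located on partition 2
    router.assign(user, 2)
print(router.touched_partitions(["alice", "bob"]))     # {2}: stays single-partition
print(router.touched_partitions(["alice", "dave"]))    # unknown key falls back to hashing
```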
【Paper Link】 【Pages】:114-125
【Authors】: Richard T. Snodgrass ; Dengfeng Gao ; Rui Zhang ; Stephen W. Thomas
【Abstract】: We show how to extend temporal support of SQL to the Turing-complete portion of SQL, that of persistent stored modules (PSM). Our approach requires minor new syntax beyond that already in SQL/Temporal to define and to invoke PSM routines, thereby extending the current, sequenced, and non-sequenced semantics of queries to PSM routines. Temporal upward compatibility (existing applications work as before when one or more tables are rendered temporal) is ensured. We provide a transformation that converts Temporal SQL/PSM to conventional SQL/PSM. To support sequenced evaluation of PSM routines, we define two different slicing approaches, maximal slicing and per-statement slicing. We compare these approaches empirically using a comprehensive benchmark and provide a heuristic for choosing between them.
【Keywords】: SQL; Turing machines; database management systems; program slicing; programming language semantics; DBMS; current semantics; database management system; maximal slicing; nonsequenced semantics; per-statement slicing; persistent stored modules; sequenced semantics; temporal SQL-PSM; temporal query languages; temporal support; temporal upward compatibility; turing-complete portion; Context; Database languages; Databases; Semantics; Standards; Syntactics; Transforms
【Paper Link】 【Pages】:126-137
【Authors】: Norifumi Nishikawa ; Miyuki Nakano ; Masaru Kitsuregawa
【Abstract】: Power, especially that consumed for storing data, and cooling costs for data centers have increased rapidly. The main applications running at data centers are data intensive applications such as large file servers or database systems. Recently, power management of data intensive applications has received increasing attention in the literature. Such reports discuss the importance of power savings, but they lack power management models that exploit the I/O behaviors of data intensive applications. This paper proposes a novel energy efficient storage management system that monitors both application- and device-level I/O patterns at run time, and uses not only the device-level I/O pattern but also application-level patterns. First, the design of the proposed model combined with such large data intensive applications will be shown. The key features of the model are i) classifying application-level I/O into four patterns using run-time access behaviors such as the length of idle time and read/write frequency, and ii) adopting an appropriate power-saving method based on these application-level I/O patterns. Next, the proposed method is quantitatively evaluated with typical data intensive applications such as file servers, OLTP, and DSS. It is shown that energy efficient storage management is effective in achieving large power savings compared with traditional approaches while an application is running.
【Keywords】: computer centres; energy conservation; storage management; I/O behaviors; application-level I/O patterns; data centers; device-level I/O patterns; energy efficient storage management system; large data intensive applications; power management; power-saving method; run-time access behaviors; Cache storage; Decision support systems; Delay; Energy storage; Monitoring; Power demand
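A toy version of the application-level classification step, assuming four patterns driven by idle-time length and read/write mix; the thresholds, pattern names, and power actions below are invented for illustration and are not the paper's monitored values.

```python
def classify_io_pattern(idle_seconds, reads, writes):
    """Bucket an application's recent I/O trace into one of four coarse patterns."""
    read_ratio = reads / max(reads + writes, 1)
    if idle_seconds > 60:
        return "long-idle"          # safe to spin down the disks
    if read_ratio > 0.9:
        return "read-mostly"        # cache aggressively, keep disks in low-power idle
    if read_ratio < 0.1:
        return "write-mostly"       # buffer writes and batch the flushes
    return "mixed"                  # keep disks active

POWER_ACTION = {
    "long-idle": "spin_down",
    "read-mostly": "low_power_idle",
    "write-mostly": "write_buffering",
    "mixed": "active",
}

sample = {"idle_seconds": 5, "reads": 980, "writes": 20}
pattern = classify_io_pattern(**sample)
print(pattern, "->", POWER_ACTION[pattern])
```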
【Paper Link】 【Pages】:138-149
【Authors】: Eric R. Schendel ; Ye Jin ; Neil Shah ; Jackie Chen ; Choong-Seock Chang ; Seung-Hoe Ku ; Stéphane Ethier ; Scott Klasky ; Robert Latham ; Robert B. Ross ; Nagiza F. Samatova
【Abstract】: Efficient handling of large volumes of data is a necessity for exascale scientific applications and database systems. To address the growing imbalance between the amount of available storage and the amount of data being produced by high speed (FLOPS) processors on the system, data must be compressed to reduce the total amount of data placed on the file systems. General-purpose lossless compression frameworks, such as zlib and bzlib2, are commonly used on datasets requiring lossless compression. Quite often, however, many scientific data sets compress poorly, referred to as hard-to-compress datasets, due to the negative impact of highly entropic content represented within the data. An important problem in better lossless data compression is to identify the hard-to-compress information and subsequently optimize the compression techniques at the byte-level. To address this challenge, we introduce the In-Situ Orthogonal Byte Aggregate Reduction Compression (ISOBAR-compress) methodology as a preconditioner of lossless compression to identify and optimize the compression efficiency and throughput of hard-to-compress datasets.
【Keywords】: data compression; data handling; entropy; natural sciences computing; FLOPS; ISOBAR preconditioner; bzlib2; data handling; database system; entropic content; exascale scientific application; file system; general-purpose lossless compression framework; hard-to-compress dataset; high-throughput lossless data compression; in-situ orthogonal byte aggregate reduction compression methodology; processor; scientific data set compression; Arrays; Data models; Equations; Noise; Probability distribution; Standards; Throughput
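A rough sketch of byte-level preconditioning in the spirit of ISOBAR (not the actual analysis): split an array of doubles into byte columns, test each column's compressibility, and pass only the compressible columns to zlib while storing the hard ones verbatim. The threshold and layout are assumptions.

```python
import math
import struct
import zlib

def isobar_like_compress(values, ratio_threshold=0.9):
    raw = struct.pack("<%dd" % len(values), *values)
    width = 8                                             # bytes per IEEE-754 double
    compressible, hard, payload = [], [], []
    for i in range(width):
        column = raw[i::width]                            # byte i of every value
        packed = zlib.compress(column)
        if len(packed) < ratio_threshold * len(column):
            compressible.append(i)
            payload.append(packed)
        else:
            hard.append(i)                                # left uncompressed: no wasted CPU
            payload.append(column)
    return compressible, hard, payload

# Simulation-style data: a smooth signal whose low-order mantissa bytes look random.
data = [math.sin(i / 100.0) for i in range(10000)]
c, h, _ = isobar_like_compress(data)
print("compressible byte columns:", c, "hard byte columns:", h)
```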
【Paper Link】 【Pages】:150-161
【Authors】: Badrish Chandramouli ; David Maier ; Jonathan Goldstein
【Abstract】: A facility for merging equivalent data streams can support multiple capabilities in a data stream management system (DSMS), such as query-plan switching and high availability. One can logically view a data stream as a temporal table of events, each associated with a lifetime (time interval) over which the event contributes to output. In many applications, the "same" logical stream may present itself physically in multiple physical forms, for example, due to disorder arising in transmission or from combining multiple sources, and modifications of earlier events. Merging such streams correctly is challenging when the streams may differ physically in timing, order, and composition. This paper introduces a new stream operator called Logical Merge (LMerge) that takes multiple logically consistent streams as input and outputs a single stream that is compatible with all of them. LMerge can handle the dynamic attachment and detachment of input streams. We present a range of algorithms for LMerge that can exploit compile-time stream properties for efficiency. Experiments with StreamInsight, a commercial DSMS, show that LMerge is sometimes orders of magnitude more efficient than enforcing determinism on inputs, and that there is benefit to using specialized algorithms when stream variability is limited. We also show that LMerge and its extensions can provide performance benefits in several real-world applications.
【Keywords】: data handling; database management systems; query processing; LMerge; Logical Merge; data stream management system; equivalent data streams; multiple physical forms; query-plan switching; Availability; Finite element methods; Hafnium; Merging; Payloads; Real time systems; Throughput
【Paper Link】 【Pages】:162-173
【Authors】: Srikanta Tirthapura ; David P. Woodruff
【Abstract】: On a stream of two-dimensional data items (x,y) where x is an item identifier and y is a numerical attribute, a correlated aggregate query requires us to first apply a selection predicate along the second (y) dimension, followed by an aggregation along the first (x) dimension. For selection predicates of the form (y < c) or (y > c), where parameter c is provided at query time, we present new streaming algorithms and lower bounds for estimating statistics of the resulting substream of elements that satisfy the predicate. We provide the first sublinear-space algorithms for a large family of statistics in this model, including frequency moments. We experimentally validate our algorithms, showing that their memory requirements are significantly smaller than existing linear storage schemes for large datasets, while simultaneously achieving fast per-record processing time. We also study the problem when the items have weights. Allowing negative weights allows for analyzing values which occur in the symmetric difference of two datasets. We give a strong space lower bound which holds even if the algorithm is allowed up to a logarithmic number of passes over the data (before the query is presented). We complement this with a small-space algorithm which uses a logarithmic number of passes.
【Keywords】: data handling; query processing; statistics; correlated aggregate estimation; correlated aggregate query; frequency moments; general method; item identifier; linear storage schemes; negative weights; numerical attribute; pass logarithmic number; selection predicate; statistics estimation; streaming algorithms; sublinear space algorithms; two dimensional data item stream; Aggregates; Complexity theory; Data structures; Estimation; Frequency estimation; Silicon; USA Councils
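The paper's contribution is sublinear-space streaming sketches; the function below merely spells out the exact semantics of the query being approximated, a correlated second frequency moment under the predicate y < c, as a reference point.

```python
from collections import Counter

def correlated_f2(stream, c):
    """stream: iterable of (x, y); returns F2 of the x's whose y satisfies y < c."""
    freq = Counter(x for x, y in stream if y < c)     # selection along y, then aggregate along x
    return sum(f * f for f in freq.values())

stream = [("a", 3), ("b", 10), ("a", 1), ("c", 2), ("a", 7)]
print(correlated_f2(stream, c=5))   # items a, a, c survive -> 2^2 + 1^2 = 5
```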
【Paper Link】 【Pages】:174-185
【Authors】: Tingjian Ge ; Fujun Liu
【Abstract】: Previous work has introduced probability distributions as first-class components in uncertain stream database systems. A missing element, however, is an account of how accurate these probability distributions are. This indeed has a profound impact on the accuracy of query results presented to end users. While there is some previous work that studies unreliable intermediate query results in the tuple uncertainty model, to the best of our knowledge, we are the first to consider an uncertain stream database in which accuracy is taken into consideration all the way from the distributions learned from raw data samples to the query results. We perform an initial study of various components in an accuracy-aware uncertain stream database system, including the representation of accuracy information and how to obtain the accuracy of query results. In addition, we propose novel predicates based on hypothesis testing for decision-making using data with limited accuracy. We augment our study with a comprehensive set of experimental evaluations.
【Keywords】: database management systems; decision making; query processing; statistical distributions; accuracy information representation; accuracy-aware uncertain stream database system; decision-making; end user; first-class component; hypothesis testing; intermediate query; probability distribution; tuple uncertainty model; Accuracy; Databases; Histograms; Probability distribution; Random variables; Roads; Uncertainty
【Paper Link】 【Pages】:186-197
【Authors】: Lu An Tang ; Yu Zheng ; Jing Yuan ; Jiawei Han ; Alice Leung ; Chih-Chieh Hung ; Wen-Chih Peng
【Abstract】: The advance of object tracking technologies leads to huge volumes of spatio-temporal data collected in the form of trajectory data streams. In this study, we investigate the problem of discovering object groups that travel together (i.e., traveling companions) from trajectory streams. Such techniques have broad applications in the areas of scientific study, transportation management, and military surveillance. To discover traveling companions, the monitoring system should cluster the objects of each snapshot and intersect the clustering results to retrieve moving-together objects. Since both the clustering and intersection steps involve high computational overhead, the key issue of companion discovery is to improve the algorithm's efficiency. We propose the models of closed companion candidates and smart intersection to accelerate data processing. A new data structure termed the traveling buddy is designed to facilitate scalable and flexible companion discovery on trajectory streams. The traveling buddies are micro-groups of objects that are tightly bound together. By only storing the object relationships rather than their spatial coordinates, the buddies can be dynamically maintained along the trajectory stream with low cost. Based on traveling buddies, the system can discover companions without accessing the object details. The proposed methods are evaluated with extensive experiments on both real and synthetic datasets. The buddy-based method is an order of magnitude faster than existing methods. It also outperforms other competitors with higher precision and recall in companion discovery.
【Keywords】: data mining; data structures; buddy-based method; data processing; data structure; monitoring system; moving-together object retrieval; object group discovery; object tracking technologies; real datasets; spatio-temporal data; streaming trajectories; synthetic datasets; trajectory data stream; trajectory stream; traveling companions discovery; Animals; Clustering algorithms; Complexity theory; Monitoring; Spatial indexes; Trajectory; Transportation
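A bare-bones rendering of the cluster-and-intersect baseline that the paper starts from (its traveling-buddy structure then avoids redoing this work per snapshot); the thresholds and the toy snapshots are made up.

```python
def traveling_companions(snapshots, min_size=2, min_duration=3):
    """snapshots: list (one per timestamp) of lists of object-id sets (the per-snapshot clusters)."""
    candidates = [(frozenset(c), 1) for c in snapshots[0]]     # (members still together, duration)
    results = set()
    for clusters in snapshots[1:]:
        nxt = []
        for members, length in candidates:
            for cluster in clusters:
                common = members & cluster                     # intersect with the new clustering
                if len(common) >= min_size:
                    nxt.append((frozenset(common), length + 1))
                    if length + 1 >= min_duration:
                        results.add(frozenset(common))
        candidates = nxt
    return results

snaps = [
    [{1, 2, 3}, {4, 5}],
    [{1, 2, 3, 4}, {5}],
    [{1, 2}, {3, 4, 5}],
]
print(traveling_companions(snaps))   # {frozenset({1, 2})}: objects 1 and 2 travel together
```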
【Paper Link】 【Pages】:198-209
【Authors】: Dayu Yuan ; Prasenjit Mitra ; Huiwen Yu ; C. Lee Giles
【Abstract】: Subgraph search is a popular query scenario on graph databases. Given a query graph q, the subgraph search algorithm returns all database graphs having q as a subgraph. To efficiently implement a subgraph search, subgraph features are mined in order to index the graph database. Many subgraph feature mining approaches have been proposed. They are all "mine-at-once" algorithms in which the whole feature set is mined in one run before building a stable graph index. However, due to the change of environments (such as an update of the graph database and the increase of available memory), the index needs to be updated to accommodate such changes. Most of the "mine-at-once" algorithms involve frequent subgraph or subtree mining over the whole graph database. Also, constructing and deploying a new index involves an expensive disk operation such that it is inefficient to re-mine the features and rebuild the index from scratch. We observe that, under most cases, it is sufficient to update a small part of the graph index. Here we propose an "iterative subgraph mining" algorithm which iteratively finds one feature to insert into (or remove from) the index. Since the majority of indexing features and the index structure are not changed, the algorithm can be frequently invoked. We define an objective function that guides the feature mining. Next, we propose a basic branch and bound algorithm to mine the features. Finally, we design an advanced search algorithm, which quickly finds a near-optimum subgraph feature and reduces the search space. Experiments show that our feature mining algorithm is 5 times faster than the popular graph indexing algorithm gIndex, and that features mined by our iterative algorithm have a better filtering rate for the subgraph search problem.
【Keywords】: data mining; database indexing; iterative methods; query processing; tree searching; trees (mathematics); branch and bound algorithm; disk operation; feature remining; feature set; filtering rate; graph database indexing; index rebuilding; index structure; indexing feature insertion; indexing feature removal; iterative graph feature mining; mine-at-once algorithm; near-optimum subgraph feature; objective function; query graph; search space reduction; subgraph feature mining; subgraph search algorithm; subtree mining
【Paper Link】 【Pages】:210-221
【Authors】: Xiaoli Wang ; Xiaofeng Ding ; Anthony K. H. Tung ; Shanshan Ying ; Hai Jin
【Abstract】: Graphs are popular models for representing complex structured data, and similarity search for graphs has become a fundamental research problem. Many techniques have been proposed to support similarity search based on the graph edit distance. However, they all suffer from certain drawbacks: high computational complexity, poor scalability in terms of database size, or not taking full advantage of indexes. To address these problems, in this paper, we propose SEGOS, an indexing and query processing framework for graph similarity search. First, an effective two-level index is constructed off-line based on sub-unit decomposition of graphs. Then, a novel search strategy based on the index is proposed. Two algorithms adapted from the TA and CA methods are seamlessly integrated into the proposed strategy to enhance graph search. More specifically, the proposed framework can easily be pipelined to support continuous graph pruning. Extensive experiments are conducted on two real datasets to evaluate the effectiveness and scalability of our approaches.
【Keywords】: computational complexity; data structures; database indexing; graph theory; query formulation; query processing; SEGOS; complex structure data; computational complexity; continuous graph pruning; database size; graph edit distance; graph indexing method; graph search; graph similarity search; indexing processing framework; query processing framework; search strategy; sub-unit graph decomposition; two-level index; Approximation algorithms; Indexing; Query processing; Scalability; Search problems
【Paper Link】 【Pages】:222-233
【Authors】: Changjiu Jin ; Sourav S. Bhowmick ; Byron Choi ; Shuigeng Zhou
【Abstract】: In a previous paper, we laid out the vision of a novel graph query processing paradigm where, instead of processing a visual query graph after its construction, it interleaves visual query formulation and processing by exploiting the latency offered by the GUI to filter irrelevant matches and prefetch partial query results [8]. Our first attempt at implementing this vision, called GBLENDER [8], shows significant improvement in system response time (SRT) for subgraph containment queries. However, GBLENDER suffers from two key drawbacks, namely the inability to handle visual subgraph similarity queries and inefficient support for visual query modification, limiting its usage in practical environments. In this paper, we propose a novel algorithm called PRAGUE (Practical visuAl Graph QUery Blender) that addresses these limitations by exploiting a novel data structure called spindle-shaped graphs (SPIG). A SPIG succinctly records various information related to the set of supergraphs of a newly added edge in the visual query fragment. Specifically, PRAGUE realizes a unified visual framework to support SPIG-based processing of modification-efficient subgraph containment and similarity queries. Extensive experiments on real-world and synthetic datasets demonstrate the effectiveness of PRAGUE.
【Keywords】: data visualisation; graph theory; query formulation; query processing; GBLENDER; PRAGUE; SPIG; SPIG-based processing; SRT; data structure; modification-efficient subgraph containment query; practical visual graph query blender; practical visual subgraph query formulation; query processing; spindle-shaped graphs; system response time; visual query fragment; visual query modification inefficient support; visual subgraph similarity queries handling inability; Educational institutions; Graphical user interfaces; Indexes; Query processing; Search problems; Visualization
【Paper Link】 【Pages】:234-245
【Authors】: Walaa Eldin Moustafa ; Amol Deshpande ; Lise Getoor
【Abstract】: There is increasing interest in analyzing networks of all types including social, biological, sensor, computer, and transportation networks. Broadly speaking, we may be interested in global network-wide analysis (e.g., centrality analysis, community detection) where the properties of the entire network are of interest, or local ego-centric analysis where the focus is on studying the properties of nodes (egos) by analyzing their neighborhood subgraphs. In this paper we propose and study ego-centric pattern census queries, a new type of graph analysis query, where a given structural pattern is searched for in every node's neighborhood and the counts are reported or used in further analysis. This kind of analysis is useful in many domains in social network analysis including opinion leader identification, node classification, link prediction, and role identification. We propose an SQL-based declarative language to support this class of queries, and develop a series of efficient query evaluation algorithms for it. We evaluate our algorithms on a variety of synthetically generated graphs. We also show an application of our language in a real-world scenario for predicting future collaborations from DBLP data.
【Keywords】: SQL; graph theory; query processing; DBLP data; SQL-based declarative language; biological network; centrality analysis; community detection; computer network; ego-centric graph pattern census; ego-centric pattern census queries; global network-wide analysis; graph analysis query; link prediction; local ego-centric analysis; neighborhood subgraph; node classification; node properties; opinion leader identification; query evaluation algorithm; role identification; sensor network; social network analysis; structural pattern; transportation network; Algorithm design and analysis; Indexing; Organizations; Pattern matching; Query processing
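An ego-centric pattern census in its simplest form, without the paper's SQL front end or optimized evaluation: for every node, count one fixed pattern (a triangle) inside its neighborhood subgraph. The graph and helper names are illustrative.

```python
from itertools import combinations

def triangle_census(adj):
    """adj: dict node -> set of neighbors (undirected graph); returns triangles per ego."""
    census = {}
    for ego, neighbors in adj.items():
        # a triangle at the ego is a pair of its neighbors that are themselves connected
        census[ego] = sum(1 for u, v in combinations(sorted(neighbors), 2) if v in adj[u])
    return census

graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}
print(triangle_census(graph))   # node "a" participates in two triangles: (a,b,c) and (a,c,d)
```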
【Paper Link】 【Pages】:246-257
【Authors】: Katrin Haegler ; Frank Fiedler ; Christian Böhm
【Abstract】: Efficient similarity search in uncertain data is a central problem in many modern applications such as biometric identification, stock market analysis, sensor networks, medical imaging, etc. In such applications, the feature vector of an object is not exactly known but is rather defined by a probability density function like a Gaussian Mixture Model (GMM). Previous work is limited to axis-parallel Gaussian distributions; hence, correlations between different features are not considered in the similarity search. In this paper, we propose a novel, efficient similarity search technique for general GMMs without an independence assumption for the attributes, named SUDN, which approximates the actual components of a GMM in a conservative but tight way. A filter-refinement architecture guarantees no false dismissals, due to conservativity, as well as good filter selectivity, due to the tightness of our approximations. An extensive experimental evaluation of SUDN demonstrates a considerable speed-up of similarity queries on general GMMs and an increase in accuracy compared to existing approaches.
【Keywords】: Gaussian processes; data handling; query formulation; vectors; SUDN; feature vector; filter-refinement architecture; non-axis parallel Gaussian mixture models; probability density function; similarity search; uncertain data searching; Approximation methods; Covariance matrix; Databases; Gaussian distribution; Probability density function; Uncertainty; Vectors; MLIQ; gaussian mixture model; non-axis parallel GMM; similarity search; uncertain data
【Paper Link】 【Pages】:258-269
【Authors】: Graham Cormode ; Divesh Srivastava ; Entong Shen ; Ting Yu
【Abstract】: Uncertainties in data can arise for a number of reasons: when data is incomplete, contains conflicting information, or has been deliberately perturbed or coarsened to remove sensitive details. An important case which arises in many real applications is when the data describes a set of possibilities, but with cardinality constraints. These constraints represent correlations between tuples, encoding, e.g., that at most two possible records are correct, or that there is an (unknown) one-to-one mapping between a set of tuples and attribute values. Although there has been much effort to handle uncertain data, current systems are not equipped to handle such correlations, beyond simple mutual exclusion and co-existence constraints. Vitally, they have little support for efficiently handling aggregate queries on such data. In this paper, we aim to address some of these deficiencies by introducing LICM (Linear Integer Constraint Model), which can succinctly represent many types of tuple correlations, particularly a class of cardinality constraints. We motivate and explain the model with examples from data cleaning and masking sensitive data, to show that it enables modeling and querying such data, which was not previously possible. We develop an efficient strategy to answer conjunctive and aggregate queries on possibilistic data by describing how to implement relational operators over data in the model. LICM compactly integrates the encoding of correlations, query answering and lineage recording. In combination with off-the-shelf linear integer programming solvers, our approach provides exact bounds for aggregate queries. Our prototype implementation demonstrates that query answering with LICM can be effective and scalable.
【Keywords】: data handling; integer programming; linear programming; query processing; LICM; aggregate query answering; attribute values; cardinality constraints; co-existence constraints; conjunctive query answering; correlation encoding; data cleaning; lineage recording; linear integer constraint model; mutual exclusion; off-the-shelf linear integer programming solvers; possibilistic data; sensitive data masking; uncertain data handling; Aggregates; Correlation; Data models; Encoding; Query processing; Semantics
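LICM hands its constraints to an integer-programming solver; the brute-force sketch below only illustrates the semantics being captured, exact lower and upper bounds on an aggregate under a cardinality constraint saying at most k candidate records are correct. Enumeration is exponential and purely didactic.

```python
from itertools import combinations

def sum_bounds_at_most_k(values, k):
    """Bounds on SUM(value) over all possible worlds in which at most k of the tuples exist."""
    totals = [sum(world) for r in range(k + 1) for world in combinations(values, r)]
    return min(totals), max(totals)

salaries = [70, 55, 90, 40]                   # four conflicting candidate records
print(sum_bounds_at_most_k(salaries, k=2))    # (0, 160): empty world vs. picking 90 + 70
```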
【Paper Link】 【Pages】:270-281
【Authors】: Yongxin Tong ; Lei Chen ; Bolin Ding
【Abstract】: In recent years, many new applications, such as sensor network monitoring and moving object search, have shown the growing importance of uncertain data management and mining. In this paper, we study the problem of discovering threshold-based frequent closed itemsets over probabilistic data. Frequent itemset mining over probabilistic databases has attracted much attention recently. However, existing solutions may lead to an exponential number of results due to the downward closure property over probabilistic data. Moreover, it is hard to directly extend the successful experiences from mining exact data to a probabilistic environment due to the inherent uncertainty of data. Thus, in order to obtain a reasonable result set of small size, we study discovering frequent closed itemsets over probabilistic data. We prove that even a sub-problem of this problem, computing the frequent closed probability of an itemset, is #P-hard. Therefore, we develop an efficient mining algorithm based on a depth-first search strategy to obtain all probabilistic frequent closed itemsets. To reduce the search space and avoid redundant computation, we further design several probabilistic pruning and bounding techniques. Finally, we verify the effectiveness and efficiency of the proposed methods through extensive experiments.
【Keywords】: data mining; probability; #P-hard; bounding technique; data management; data mining; depth-first search strategy; moving object search; probabilistic data; probabilistic frequent closed item sets; probabilistic pruning technique; sensor network monitoring; threshold-based frequent closed itemsets; Algorithm design and analysis; Data mining; Itemsets; Probabilistic logic; Search problems
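The core primitive underlying probabilistic frequent (closed) itemsets is the frequentness probability of one itemset; the exact Poisson-binomial dynamic program below computes it from per-transaction containment probabilities. The paper's closed-itemset search, pruning, and bounding are not shown, and the example probabilities are made up.

```python
def frequentness_probability(contain_probs, min_support):
    """P(support >= min_support) when transaction i contains the itemset w.p. contain_probs[i]."""
    dp = [1.0] + [0.0] * len(contain_probs)     # dp[j] = P(exactly j of the transactions seen so far)
    for p in contain_probs:
        for j in range(len(dp) - 1, 0, -1):     # update in place, highest j first
            dp[j] = dp[j] * (1.0 - p) + dp[j - 1] * p
        dp[0] *= (1.0 - p)
    return sum(dp[min_support:])

probs = [0.9, 0.8, 0.5, 0.4]                    # e.g., from an uncertain transaction database
print(round(frequentness_probability(probs, min_support=2), 4))
```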
【Paper Link】 【Pages】:282-293
【Authors】: Dan Olteanu ; Hongkai Wen
【Abstract】: In many applications of probabilistic databases, the probabilities are mere degrees of uncertainty in the data and are not otherwise meaningful to the user. Often, users care only about the ranking of answers in decreasing order of their probabilities or about a few most likely answers. In this paper, we investigate the problem of ranking query answers in probabilistic databases. We give a dichotomy for ranking in case of conjunctive queries without repeating relation symbols: it is either in polynomial time or NP-hard. Surprisingly, our syntactic characterisation of tractable queries is not the same as for probability computation. The key observation is that there are queries for which probability computation is #P-hard, yet ranking can be computed in polynomial time. This is possible whenever probability computation for distinct answers has a common factor that is hard to compute but irrelevant for ranking. We complement this tractability analysis with an effective ranking technique for conjunctive queries. Given a query, we construct a share plan, which exposes subqueries whose probability computation can be shared or ignored across query answers. Our technique combines share plans with incremental approximate probability computation of subqueries. We implemented our technique in the SPROUT query engine and report on performance gains of orders of magnitude over Monte Carlo simulation using FPRAS and exact probability computation based on knowledge compilation.
【Keywords】: computational complexity; database management systems; probability; query processing; FPRAS; Monte Carlo simulation; NP-hard; SPROUT query engine; conjunctive queries; incremental approximate probability computation; knowledge compilation; polynomial time; probabilistic databases; query answers; ranking technique; Approximation methods; Monte Carlo methods; Polynomials; Probabilistic logic; Query processing; Random variables
【Paper Link】 【Pages】:294-305
【Authors】: Steven Euijong Whang ; Hector Garcia-Molina
【Abstract】: Entity resolution (ER) is the problem of identifying which records in a database represent the same entity. Often, records of different types are involved (e.g., authors, publications, institutions, venues), and resolving records of one type can impact the resolution of other types of records. In this paper we propose a flexible, modular resolution framework where existing ER algorithms developed for a given record type can be plugged in and used in concert with other ER algorithms. Our approach also makes it possible to run ER on subsets of similar records at a time, important when the full data is too large to resolve together. We study the scheduling and coordination of the individual ER algorithms in order to resolve the full data set. We then evaluate our joint ER techniques on synthetic and real data and show the scalability of our approach.
【Keywords】: database management systems; file organisation; ER algorithm coordination; ER algorithms scheduling; joint entity resolution; Databases; Erbium; Joints; Organizations; Partitioning algorithms; Runtime; Scalability
【Paper Link】 【Pages】:306-317
【Authors】: Eric Peukert ; Julian Eberius ; Erhard Rahm
【Abstract】: Mapping complex metadata structures is crucial in a number of domains such as data integration, ontology alignment or model management. To speed up the generation of such mappings, automatic matching systems were developed to compute mapping suggestions that can be corrected by a user. However, constructing and tuning match strategies still requires a high manual effort by matching experts as well as correct mappings to evaluate generated mappings. We therefore propose a self-configuring schema matching system that is able to automatically adapt to the given mapping problem at hand. Our approach is based on analyzing the input schemas as well as intermediate matching results. A variety of matching rules use the analysis results to automatically construct and adapt an underlying matching process for a given match task. We comprehensively evaluate our approach on different mapping problems from the schema, ontology and model management domains. The evaluation shows that our system is able to robustly return good quality mappings across different mapping problems and domains.
【Keywords】: data integration; data structures; meta data; ontologies (artificial intelligence); pattern matching; automatic matching system; complex metadata structure mapping; data integration; generated mapping evaluation; mapping generation; model management domain; ontology alignment; self-configuring schema matching system; Adaptive systems; Libraries; Noise; Ontologies; Pattern matching; Robustness; Tuning
【Paper Link】 【Pages】:318-329
【Authors】: Wenfei Fan ; Jianzhong Li ; Nan Tang ; Wenyuan Yu
【Abstract】: This paper investigates the problem of incremental detection of errors in distributed data. Given a distributed database D, a set Σ of conditional functional dependencies (CFDs), the set V of violations of the CFDs in D, and updates ΔD to D, the problem is to find, with minimum data shipment, the changes ΔV to V in response to ΔD. The need for this study is evident since real-life data is often dirty, distributed, and frequently updated. It is often prohibitively expensive to recompute the entire set of violations when D is updated. We show that the incremental detection problem is NP-complete for D partitioned either vertically or horizontally, even when Σ and D are fixed. Nevertheless, we show that it is bounded and, better still, actually optimal: there exist algorithms to detect errors such that their computational cost and data shipment are both linear in the size of ΔD and ΔV, independent of the size of the database D. We provide such incremental algorithms for vertically partitioned data, and show that the algorithms are optimal. We further propose optimization techniques for the incremental algorithm over vertical partitions to reduce data shipment. We verify experimentally, using real-life data on Amazon Elastic Compute Cloud (EC2), that our algorithms substantially outperform their batch counterparts even when ΔV is reasonably large.
【Keywords】: computational complexity; data handling; distributed databases; error detection; CFD; NP-complete problem; conditional functional dependencies; distributed data inconsistency; distributed database; incremental error detection; minimum data shipment; real-life data; Cities and towns; Computational fluid dynamics; Distributed databases; Marine vehicles; Optimization; Partitioning algorithms
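A single-site toy of the incremental idea (the paper's real challenge is minimizing data shipment when D is partitioned across sites): for a CFD such as "for tuples with country = 'UK', zip determines city", keep an index of values already seen so that a batch of insertions touches only the index and the new tuples, never the whole database. All relation and attribute names are hypothetical.

```python
from collections import defaultdict

class IncrementalCFDChecker:
    def __init__(self):
        self.seen = defaultdict(set)             # zip -> set of cities seen among UK tuples

    def insert(self, tuples):
        """Return the new violations introduced by this batch of inserted tuples."""
        new_violations = []
        for t in tuples:
            if t["country"] != "UK":
                continue                         # the pattern tuple of the CFD does not match
            cities = self.seen[t["zip"]]
            if cities and t["city"] not in cities:
                new_violations.append(t)         # same zip, conflicting city
            cities.add(t["city"])
        return new_violations

chk = IncrementalCFDChecker()
chk.insert([{"country": "UK", "zip": "EH8", "city": "Edinburgh"}])
print(chk.insert([{"country": "UK", "zip": "EH8", "city": "London"}]))   # one new violation
```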
【Paper Link】 【Pages】:330-341
【Authors】: Todd J. Green ; Zachary G. Ives
【Abstract】: A major challenge faced by today's information systems is that of evolution as data usage evolves or new data resources become available. Modern organizations sometimes exchange data with one another via declarative mappings among their databases, as in data exchange and collaborative data sharing systems. Such mappings are frequently revised and refined as new data becomes available, new cross-reference tables are created, and corrections are made. A fundamental question is how to handle changes to these mapping definitions, when the organizations each materialize the results of applying the mappings to the available data. We consider how to incrementally recompute these database instances in this setting, reusing (if possible) previously computed instances to speed up computation. We develop a principled solution that performs cost-based exploration of recomputation versus reuse, and simultaneously handles updates to source data and mapping definitions through a single, unified mechanism. Our solution also takes advantage of provenance information, when present, to speed up computation even further. We present an implementation that takes advantage of an off-the-shelf DBMS's query processing system, and we show experimentally that our approach provides substantial performance benefits.
【Keywords】: database management systems; electronic data interchange; groupware; query processing; DBMS query processing system; collaborative data sharing systems; cross-reference tables; data change; data exchange; data resources; data usage; database instances; declarative mappings; information systems; mapping change; materialized instance recomputation; provenance information; recomputation cost-based exploration; Collaboration; Data models; Databases; Organisms; Semantics; Silicon; Standards
【Paper Link】 【Pages】:342-353
【Authors】: Manish Singh ; Qiang Zhu ; H. V. Jagadish
【Abstract】: Numerous applications such as wireless communication and telematics need to keep track of the evolution of spatio-temporal data for a limited past. Limited retention may even be required by regulations. In general, each data entry can have its own user-specified lifetime. It is desired that expired entries are automatically removed by the system through some garbage collection mechanism. This kind of limited retention can be achieved by using a sliding window semantics similar to that from stream data processing. However, due to the large volume and relatively long lifetime of data in the aforementioned applications (in contrast to the real-time transient streaming data), the sliding window here needs to be maintained for data on disk rather than in memory. It is a new challenge to provide fast access to the information from the recent past and, at the same time, facilitate efficient deletion of the expired entries. In this paper, we propose a disk-based, two-layered sliding window indexing scheme for discretely moving spatio-temporal data. Our index can support efficient processing of standard time slice and interval queries and delete expired entries with almost no overhead. In existing historical spatio-temporal indexing techniques, deletion is either infeasible or very inefficient. Our sliding window based processing model can support both current and past entries, while many existing historical spatio-temporal indexing techniques cannot keep these two types of data together in the same index. Our experimental comparison with the best known historical index (i.e., the MV3R-tree) for discretely moving spatio-temporal data shows that our index is about five times faster in terms of insertion time and comparable in terms of search performance. The MV3R-tree follows a partial persistency model, whereas our index can support very efficient deletion and update.
【Keywords】: database indexing; query processing; storage management; temporal databases; MV3R; SWST; data entry; discretely moving spatio-temporal data; disk based indexing scheme; garbage collection; historical spatio-temporal indexing techniques; information access; interval queries; limited retention; real-time transient streaming data; sliding window based processing model; sliding window indexing scheme; sliding window semantics; sliding window spatio-temporal data; stream data processing; two-layered indexing scheme; Data models; History; Indexing; Maintenance engineering; Trajectory
【Paper Link】 【Pages】:354-365
【Authors】: Tobias Emrich ; Hans-Peter Kriegel ; Nikos Mamoulis ; Matthias Renz ; Andreas Züfle
【Abstract】: The problem of modeling and managing uncertain data has received a great deal of interest, due to its manifold applications in spatial, temporal, multimedia and sensor databases. There exists a wide range of work covering spatial uncertainty in the static (snapshot) case, where only one point of time is considered. In contrast, the problem of modeling and querying uncertain spatio-temporal data has only been treated as a simple extension of the spatial case, disregarding time dependencies between consecutive timestamps. In this work, we present a framework for efficiently modeling and querying uncertain spatio-temporal data. The key idea of our approach is to model possible object trajectories by stochastic processes. This approach has three major advantages over previous work. First, it allows answering queries in accordance with the possible worlds model. Second, dependencies between object locations at consecutive points in time are taken into account. Third, it is possible to reduce all queries on this model to simple matrix multiplications. Based on these concepts we propose efficient solutions for different probabilistic spatio-temporal queries. In an experimental evaluation we show that our approaches are several orders of magnitude faster than state-of-the-art competitors.
【Keywords】: matrix multiplication; multimedia databases; probability; query processing; stochastic processes; temporal databases; visual databases; consecutive timestamps; matrix multiplications; multimedia databases; probabilistic spatio-temporal queries; query answering; sensor databases; spatial databases; stochastic processes; temporal databases; time dependencies; uncertain data management; uncertain data modeling; uncertain spatio-temporal data querying; Data models; Probabilistic logic; Query processing; Stochastic processes; Trajectory; Uncertainty
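【Example】: A toy sketch of the Markov-chain view described above: with a row-stochastic transition matrix over a discretized space, the probability that an uncertain object lies in a query region at time t reduces to a matrix-vector product. The states, matrix, and region below are invented purely for illustration.
    import numpy as np

    T = np.array([[0.7, 0.3, 0.0],      # transition probabilities over three
                  [0.2, 0.5, 0.3],      # discrete locations s0, s1, s2
                  [0.0, 0.4, 0.6]])
    p0 = np.array([1.0, 0.0, 0.0])      # object observed at s0 at time 0

    def prob_in_region(p0, T, region, t):
        """P(object is inside `region` at time t) under the Markov model."""
        pt = p0 @ np.linalg.matrix_power(T, t)
        return pt[sorted(region)].sum()

    print(prob_in_region(p0, T, region={1, 2}, t=4))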
【Paper Link】 【Pages】:366-377
【Authors】: Jianzhong Qi ; Rui Zhang ; Lars Kulik ; Dan Lin ; Yuan Xue
【Abstract】: We propose and study a new type of location optimization problem: given a set of clients and a set of existing facilities, we select a location from a given set of potential locations for establishing a new facility so that the average distance between a client and her nearest facility is minimized. We call this problem the min-dist location selection problem, which has a wide range of applications in urban development simulation, massively multiplayer online games, and decision support systems. We explore two common approaches to location optimization problems and propose methods based on those approaches for solving this new problem. However, those methods either need to maintain an extra index or fall short in efficiency. To address their drawbacks, we propose a novel method (named MND), which has very close performance to the fastest method but does not need an extra index. We provide a detailed comparative cost analysis on the various algorithms. We also perform extensive experiments to evaluate their empirical performance and validate the efficiency of the MND method.
【Keywords】: decision support systems; query processing; MND method; comparative cost analysis; decision support systems; location optimization problem; massively multiplayer online games; min-dist location selection problem; min-dist location selection query; urban development simulation; Algorithm design and analysis; Games; Indexes; Optimization; Search problems; Software; Software algorithms
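【Example】: A brute-force baseline for the min-dist location selection problem, given as a hedged sketch: every candidate location is tried and the resulting average client-to-nearest-facility distance is measured. The paper's methods (including MND) are precisely about avoiding this exhaustive computation; the coordinates are illustrative.
    import math

    def min_dist_selection(clients, facilities, candidates):
        best = None
        for cand in candidates:
            total = sum(min(math.dist(c, f) for f in facilities + [cand]) for c in clients)
            avg = total / len(clients)
            if best is None or avg < best[1]:
                best = (cand, avg)
        return best                     # (chosen location, resulting average distance)

    clients = [(0, 0), (4, 4), (10, 1)]
    facilities = [(5, 5)]
    candidates = [(0, 1), (9, 0), (2, 2)]
    print(min_dist_selection(clients, facilities, candidates))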
【Paper Link】 【Pages】:378-389
【Authors】: Jia Pan ; Dinesh Manocha
【Abstract】: We present a new Bi-level LSH algorithm to perform approximate k-nearest neighbor search in high dimensional spaces. Our formulation is based on a two-level scheme. In the first level, we use an RP-tree that divides the dataset into sub-groups with bounded aspect ratios and is used to distinguish well-separated clusters. During the second level, we compute a single LSH hash table for each sub-group along with a hierarchical structure based on space-filling curves. Given a query, we first determine the sub-group that it belongs to and perform k-nearest neighbor search within the suitable buckets in the LSH hash table corresponding to the sub-group. Our algorithm also maps well to current GPU architectures and can improve the quality of approximate KNN queries as compared to prior LSH-based algorithms. We highlight its performance on two large, high-dimensional image datasets. Given a runtime budget, Bi-level LSH can provide better accuracy in terms of recall or error ratio. Moreover, our formulation reduces the variation in runtime cost and in the quality of results.
【Keywords】: graphics processing units; query processing; trees (mathematics); GPU architecture; RP-tree; approximate k-nearest neighbor search; bilevel locality sensitive hashing; hierarchical structure; high dimensional spaces; high-dimensional image datasets; k-nearest neighbor computation; space-filling curves; two-level scheme; Approximation algorithms; Complexity theory; Lattices; Partitioning algorithms; Probes; Runtime; Shape
【Paper Link】 【Pages】:390-401
【Authors】: Mert Akdere ; Ugur Çetintemel ; Matteo Riondato ; Eli Upfal ; Stanley B. Zdonik
【Abstract】: Accurate query performance prediction (QPP) is central to effective resource management, query optimization and query scheduling. Analytical cost models, used in the current generation of query optimizers, have been successful in comparing the costs of alternative query plans, but they are poor predictors of execution latency. As a more promising approach to QPP, this paper studies the practicality and utility of sophisticated learning-based models, which have recently been applied to a variety of predictive tasks with great success, in both static (i.e., fixed) and dynamic query workloads. We propose and evaluate predictive modeling techniques that learn query execution behavior at different granularities, ranging from coarse-grained plan-level models to fine-grained operator-level models. We demonstrate that these two extremes offer a tradeoff between high accuracy for static workload queries and generality to unforeseen queries in dynamic workloads, respectively, and introduce a hybrid approach that combines their respective strengths by selectively composing them in the process of QPP. We discuss how we can use a training workload to (i) pre-build and materialize such models offline, so that they are readily available for future predictions, and (ii) build new models online as new predictions are needed. All prediction models are built using only static features (available prior to query execution) and the performance values obtained from the offline execution of the training workload. We fully implemented all these techniques and extensions on top of PostgreSQL and evaluated them experimentally by quantifying their effectiveness over analytical workloads, represented by well-established TPC-H data and queries. The results provide quantitative evidence that learning-based modeling for QPP is both feasible and effective for both static and dynamic workload scenarios.
【Keywords】: learning (artificial intelligence); optimisation; query processing; scheduling; PostgreSQL; analytical cost models; coarse-grained plan-level models; dynamic query workloads; fine-grained operator-level models; learning-based modeling; learning-based query performance modeling; predictive modeling techniques; query execution behavior; query optimization; query optimizers; query scheduling; resource management; static query workloads; Accuracy; Analytical models; Buildings; Data models; Predictive models; Training; Training data
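【Example】: A hedged sketch of the plan-level flavor of learning-based QPP: fit a linear model from static plan features to observed latencies of a training workload, then predict the latency of an unseen plan. The feature set and numbers are invented; the operator-level and hybrid models studied in the paper are not shown.
    import numpy as np

    # one row per training query: [estimated rows, #joins, #scans]; observed latency in seconds
    X = np.array([[1e4, 1, 2], [5e5, 2, 3], [2e6, 3, 4], [8e4, 1, 1]], dtype=float)
    y = np.array([0.4, 3.1, 11.8, 0.9])

    Xb = np.hstack([X, np.ones((len(X), 1))])       # append an intercept column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)      # least-squares fit

    new_plan = np.array([3e5, 2, 2, 1.0])           # static features of an unseen query (+ intercept)
    print(float(new_plan @ w))                      # predicted execution latency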
【Paper Link】 【Pages】:402-413
【Authors】: Günes Aluç ; David DeHaan ; Ivan T. Bowman
【Abstract】: Query plan caching eliminates the need for repeated query optimization, hence, it has strong practical implications for relational database management systems (RDBMSs). Unfortunately, existing approaches consider only the query plan generated at the expected values of parameters that characterize the query, data and the current state of the system, while these parameters may take different values during the lifetime of a cached plan. A better alternative is to harvest the optimizer's plan choice for different parameter values, populate the cache with promising query plans, and select a cached plan based upon current parameter values. To address this challenge, we propose a parametric plan caching (PPC) framework that uses an online plan space clustering algorithm. The clustering algorithm is density-based, and it exploits locality-sensitive hashing as a pre-processing step so that clusters in the plan spaces can be efficiently stored in database histograms and queried in constant time. We experimentally validate that our approach is precise, efficient in space-and-time and adaptive, requiring no eager exploration of the plan spaces of the optimizer.
【Keywords】: cache storage; pattern clustering; query processing; relational databases; PPC; RDBMS; database histogram; density-based clustering; locality sensitive hashing; online plan space clustering algorithm; parametric plan caching; query plan caching; relational database management system; repeated query optimization; Clustering algorithms; Computational modeling; Couplings; History; Optimization; Prediction algorithms; Query processing
【Paper Link】 【Pages】:414-425
【Authors】: Pit Fender ; Guido Moerkotte ; Thomas Neumann ; Viktor Leis
【Abstract】: Finding the optimal execution order of join operations is a crucial task of today's cost-based query optimizers. There are two approaches to identify the best plan: bottom-up and top-down join enumeration. For both optimization strategies efficient algorithms have been published. However, only the top-down approach allows for branch-and-bound pruning. Two pruning techniques can be found in the literature. We add six new ones. Combined, they improve performance roughly by an average factor of 2 - 5. Even more important, our techniques improve the worst case by two orders of magnitude. Additionally, we introduce a new, very efficient, and easy to implement top-down join enumeration algorithm. This algorithm, together with our improved pruning techniques, yields a performance which is by an average factor of 6 - 9 higher than the performance of the original top-down enumeration algorithm with the original pruning methods.
【Keywords】: database management systems; optimisation; query processing; tree searching; DBMS; bottom-up join enumeration; branch-and-bound pruning; cost-based query optimizers; optimal execution order; optimization strategies; robust pruning; top-down join enumeration algorithms; Complexity theory; Heuristic algorithms; Optimization; Partitioning algorithms; Prediction algorithms; Robustness; Upper bound
【Paper Link】 【Pages】:426-437
【Authors】: Anastasios Arvanitis ; Georgia Koutrika
【Abstract】: In implementing preference-aware query processing, a straightforward option is to build a plug-in on top of the database engine. However, treating the DBMS as a black box affects both the expressivity and performance of queries with preferences. In this paper, we argue that preference-aware query processing needs to be pushed closer to the DBMS. We present a preference-aware relational data model that extends database tuples with preferences and an extended algebra that captures the essence of processing queries with preferences. A key novelty of our preference model is that it defines a preference along three dimensions: the tuples affected, their preference scores, and the credibility of the preference. Our query processing strategies push preference evaluation inside the query plan and leverage its algebraic properties for finer-grained query optimization. We experimentally evaluate the proposed strategies. Finally, we compare our framework to a pure plug-in implementation and we show its feasibility and advantages.
【Keywords】: algebra; query processing; relational databases; DBMS; algebraic property; database tuples; finer-grained query optimization; preference-aware query processing; preference-aware relational data model; preference-aware relational databases; Aggregates; Algebra; Data models; Engines; Motion pictures; Query processing
【Paper Link】 【Pages】:438-449
【Authors】: Hua Lu ; Xin Cao ; Christian S. Jensen
【Abstract】: Indoor spaces accommodate large numbers of spatial objects, e.g., points of interest (POIs), and moving populations. A variety of services, e.g., location-based services and security control, are relevant to indoor spaces. Such services can be improved substantially if they are capable of utilizing indoor distances. However, existing indoor space models do not account well for indoor distances. To address this shortcoming, we propose a data management infrastructure that captures indoor distance and facilitates distance-aware query processing. In particular, we propose a distance-aware indoor space model that integrates indoor distance seamlessly. To enable the use of the model as a foundation for query processing, we develop accompanying, efficient algorithms that compute indoor distances for different indoor entities like doors as well as locations. We also propose an indexing framework that accommodates indoor distances that are pre-computed using the proposed algorithms. On top of this foundation, we develop efficient algorithms for typical indoor, distance-aware queries. The results of an extensive experimental evaluation demonstrate the efficacy of the proposals.
【Keywords】: indexing; query processing; POI; data management infrastructure; distance-aware indoor space model; indexing framework; indoor distance-aware query processing; location-based services; moving populations; points of interest; security control; spatial objects; Buildings; Computational modeling; Legged locomotion; Partitioning algorithms; Query processing; Solid modeling; Topology
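【Example】: A hedged sketch of the core distance computation: model doors as graph nodes whose edge weights are intra-partition walking distances, and answer door-to-door distance queries with Dijkstra's algorithm. The floor plan is made up; the paper's indoor model, index, and pre-computation are not reproduced.
    import heapq

    door_graph = {                      # door -> [(neighboring door, distance in meters)]
        "d1": [("d2", 5.0), ("d3", 9.0)],
        "d2": [("d1", 5.0), ("d3", 3.5)],
        "d3": [("d1", 9.0), ("d2", 3.5)],
    }

    def indoor_distance(graph, source, target):
        dist = {source: 0.0}
        pq = [(0.0, source)]
        while pq:
            d, u = heapq.heappop(pq)
            if u == target:
                return d
            if d > dist.get(u, float("inf")):
                continue
            for v, w in graph[u]:
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(pq, (d + w, v))
        return float("inf")

    print(indoor_distance(door_graph, "d1", "d3"))   # 8.5, routed via d2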
【Paper Link】 【Pages】:450-461
【Authors】: Justin J. Levandoski ; Mohamed Sarwat ; Ahmed Eldawy ; Mohamed F. Mokbel
【Abstract】: This paper proposes LARS, a location-aware recommender system that uses location-based ratings to produce recommendations. Traditional recommender systems do not consider the spatial properties of users or items; LARS, on the other hand, supports a taxonomy of three novel classes of location-based ratings, namely, spatial ratings for non-spatial items, non-spatial ratings for spatial items, and spatial ratings for spatial items. LARS exploits user rating locations through user partitioning, a technique that influences recommendations with ratings spatially close to querying users in a manner that maximizes system scalability while not sacrificing recommendation quality. LARS exploits item locations using travel penalty, a technique that favors recommendation candidates closer in travel distance to querying users in a way that avoids exhaustive access to all spatial items. LARS can apply these techniques separately, or in concert, depending on the type of location-based rating available. Experimental evidence using large-scale real-world data from both the Foursquare location-based social network and the MovieLens movie recommendation system reveals that LARS is efficient, scalable, and capable of producing recommendations twice as accurate as those of existing recommendation approaches.
【Keywords】: query processing; recommender systems; social networking (online); LARS; MovieLens movie recommendation system; foursquare location-based social network; large-scale real-world data; location-aware recommender system; location-based ratings; nonspatial ratings for spatial items; spatial ratings for nonspatial items; spatial ratings for spatial items; system scalability maximization; taxonomy; travel distance; travel penalty; user partitioning; user query; Collaboration; Computational modeling; Maintenance engineering; Merging; Motion pictures; Recommender systems; Scalability
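【Example】: A hedged sketch of the travel-penalty idea only: rank spatial items by a rating-derived score discounted by the travel distance to the querying user. The penalty constant, scores, and locations are invented; LARS's user partitioning and incremental early-termination machinery are omitted.
    import math

    def rank_items(user_loc, items, alpha=0.1, k=2):
        def penalized_score(item):
            d = math.dist(user_loc, item["loc"])
            return item["rating_score"] - alpha * d     # recommendation score minus travel penalty
        return sorted(items, key=penalized_score, reverse=True)[:k]

    items = [
        {"id": "venue_a", "rating_score": 4.5, "loc": (2, 1)},
        {"id": "venue_b", "rating_score": 4.8, "loc": (40, 35)},
        {"id": "venue_c", "rating_score": 3.9, "loc": (1, 2)},
    ]
    print([i["id"] for i in rank_items(user_loc=(0, 0), items=items, k=2)])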
【Paper Link】 【Pages】:462-473
【Authors】: Miao Qiao ; Hong Cheng ; Lijun Chang ; Jeffrey Xu Yu
【Abstract】: Shortest distance query between two nodes is a fundamental operation in large-scale networks. Most existing methods in the literature take a landmark embedding approach, which selects a set of graph nodes as landmarks and computes the shortest distances from each landmark to all nodes as an embedding. To handle a shortest distance query between two nodes, the precomputed distances from the landmarks to the query nodes are used to compute an approximate shortest distance based on the triangle inequality. In this paper, we analyze the factors that affect the accuracy of the distance estimation in the landmark embedding approach. In particular we find that a globally selected, query-independent landmark set plus the triangulation based distance estimation introduces a large relative error, especially for nearby query nodes. To address this issue, we propose a query-dependent local landmark scheme, which identifies a local landmark close to the specific query nodes and provides a more accurate distance estimation than the traditional global landmark approach. Specifically, a local landmark is defined as the least common ancestor of the two query nodes in the shortest path tree rooted at a global landmark. We propose efficient local landmark indexing and retrieval techniques, which are crucial to achieve low offline indexing complexity and online query complexity. Two optimization techniques on graph compression and graph online search are also proposed, with the goal to further reduce index size and improve query accuracy. Our experimental results on large-scale social networks and road networks demonstrate that the local landmark scheme reduces the shortest distance estimation error significantly when compared with global landmark embedding.
【Keywords】: approximation theory; data compression; embedded systems; query processing; social networking (online); trees (mathematics); approximate shortest distance computing; global landmark approach; graph compression; graph nodes; graph online search; index size; landmark embedding approach; large relative error; large-scale networks; large-scale social networks; least common ancestor; local landmark close; local landmark indexing techniques; local landmark retrieval techniques; offline indexing complexity; online query complexity; optimization techniques; precomputed distances; query accuracy; query nodes; query-dependent local landmark scheme; query-independent landmark set; road networks; shortest distance estimation error; shortest distance query; shortest path tree; triangle inequality; triangulation based distance estimation; Accuracy; Arrays; Complexity theory; Estimation; Indexing; Roads
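【Example】: A hedged sketch of the query-dependent local landmark idea on an unweighted graph: build the shortest-path tree rooted at a global landmark (BFS here), take the least common ancestor of the two query nodes in that tree as the local landmark, and estimate d(s,t) as d(s,lca) + d(lca,t). The graph is illustrative, and the paper's indexing and compression techniques are not shown.
    from collections import deque

    graph = {1: [2, 3], 2: [1, 4], 3: [1, 5], 4: [2, 6], 5: [3], 6: [4]}

    def shortest_path_tree(graph, root):
        parent, depth = {root: None}, {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in parent:
                    parent[v], depth[v] = u, depth[u] + 1
                    queue.append(v)
        return parent, depth

    def local_landmark(parent, depth, s, t):
        while depth[s] > depth[t]:
            s = parent[s]
        while depth[t] > depth[s]:
            t = parent[t]
        while s != t:
            s, t = parent[s], parent[t]
        return s                                       # least common ancestor in the SPT

    parent, depth = shortest_path_tree(graph, root=1)  # node 1 plays the global landmark
    s, t = 6, 5
    lca = local_landmark(parent, depth, s, t)
    estimate = (depth[s] - depth[lca]) + (depth[t] - depth[lca])
    print(lca, estimate)                               # 1 5 (exact for this toy graph)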
【Paper Link】 【Pages】:474-485
【Authors】: Guoliang Li ; Jianhua Feng ; Jing Xu
【Abstract】: Location-based services (LBS) have been widely accepted by mobile users. Many LBS users have a direction-aware search requirement, in which answers must lie in the search direction. However, to the best of our knowledge, no existing research investigates direction-aware search. A straightforward method first finds candidates without considering the direction constraint, and then generates the answers by pruning those candidates that violate the direction constraint. However, this method is rather expensive, as it involves much wasted computation on unnecessary directions. To address this problem, we propose a direction-aware spatial keyword search method which inherently supports direction-aware search. We devise novel direction-aware indexing structures to prune unnecessary directions. We develop effective pruning techniques and search algorithms to efficiently answer a direction-aware query. As users may dynamically change their search directions, we propose to answer a query incrementally. Experimental results on real datasets show that our method achieves high performance and outperforms existing methods significantly.
【Keywords】: mobile computing; tree searching; DESKS; direction constraint; direction-aware indexing structure; direction-aware query; direction-aware search requirement; direction-aware spatial keyword search; location-based services; mobile users; pruning technique; search algorithm; Complexity theory; Equations; Indexing; Keyword search; Mobile radio mobility management
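【Example】: A hedged sketch of the straightforward (index-free) baseline described above: keep only objects whose bearing from the query point falls inside the requested angular sector and whose text contains the keyword, then rank by distance. The data is invented; the paper's contribution is an index that avoids scanning all objects this way.
    import math

    def direction_aware_search(q, direction_deg, width_deg, keyword, objects):
        def bearing(o):
            return math.degrees(math.atan2(o["loc"][1] - q[1], o["loc"][0] - q[0])) % 360
        def in_sector(o):
            diff = abs((bearing(o) - direction_deg + 180) % 360 - 180)
            return diff <= width_deg / 2
        hits = [o for o in objects if keyword in o["text"] and in_sector(o)]
        return sorted(hits, key=lambda o: math.dist(q, o["loc"]))

    objects = [
        {"id": 1, "loc": (5, 1),  "text": "coffee shop"},
        {"id": 2, "loc": (-4, 0), "text": "coffee roastery"},
        {"id": 3, "loc": (6, 5),  "text": "book store"},
    ]
    print(direction_aware_search((0, 0), direction_deg=0, width_deg=90,
                                 keyword="coffee", objects=objects))   # only object 1 qualifies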
【Paper Link】 【Pages】:486-497
【Authors】: Raman Grover ; Michael J. Carey
【Abstract】: In this paper we address the problem of using MapReduce to sample a massive data set in order to produce a fixed-size sample whose contents satisfy a given predicate. While it is simple to express this computation using MapReduce, its default Hadoop execution is dependent on the input size and is wasteful of cluster resources. This is unfortunate, as sampling queries are fairly common (e.g., for exploratory data analysis at Facebook), and the resulting waste can significantly impact the performance of a shared cluster. To address such use cases, we present the design, implementation and evaluation of a Hadoop execution model extension that supports incremental job expansion. Under this model, a job consumes input as required and can dynamically govern its resource consumption while producing the required results. The proposed mechanism is able to support a variety of policies regarding job growth rates as they relate to cluster capacity and current load. We have implemented the mechanism in Hadoop, and we present results from an experimental performance study of different job growth policies under both single- and multi-user workloads.
【Keywords】: data handling; Hadoop execution model extension; MapReduce; cluster capacity; cluster resource; fixed-size sample; incremental job expansion; job growth policy; massive data sampling; multiuser workload; predicate-based sampling; resource consumption; single-user workload; Availability; Delay; Facebook; Indexes; Load modeling; Runtime; Time factors
【Paper Link】 【Pages】:498-509
【Authors】: Foto N. Afrati ; Anish Das Sarma ; David Menestrina ; Aditya G. Parameswaran ; Jeffrey D. Ullman
【Abstract】: Fuzzy/similarity joins have been widely studied in the research community and extensively used in real-world applications. This paper proposes and evaluates several algorithms for finding all pairs of elements from an input set that meet a similarity threshold. The computation model is a single MapReduce job. Because we allow only one MapReduce round, the Reduce function must be designed so that a given output pair is produced by only one task; for many algorithms, satisfying this condition is one of the biggest challenges. We break the cost of an algorithm into three components: the execution cost of the mappers, the execution cost of the reducers, and the communication cost from the mappers to reducers. The algorithms are presented first in terms of Hamming distance, but extensions to edit distance and Jaccard distance are shown as well. We find that there are many different approaches to the similarity-join problem using MapReduce, and none dominates the others when both communication and reducer costs are considered. Our cost analyses enable applications to pick the optimal algorithm based on their communication, memory, and cluster requirements.
【Keywords】: data analysis; distributed processing; fuzzy set theory; Hamming distance; Jaccard distance; MapReduce job; data analytics; edit distance; fuzzy joins; optimal algorithm; similarity joins; similarity-join problem; Algorithm design and analysis; Approximation algorithms; Clustering algorithms; Computational modeling; Educational institutions; Hamming distance; Program processors
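【Example】: A hedged, cluster-free simulation of one simple single-round strategy for a Hamming-distance self-join: split every string into d+1 segments, group strings by (segment index, segment value) in the "map" phase, and, to guarantee that each qualifying pair is emitted by exactly one "reduce" group, output a pair only at the smallest segment index on which the two strings agree. This is one member of the general family the paper analyzes, not a reproduction of its cost model; the strings are illustrative.
    from collections import defaultdict
    from itertools import combinations

    def hamming(s, t):
        return sum(a != b for a, b in zip(s, t))        # assumes equal-length strings

    def segments(s, d):
        n = len(s)
        bounds = [round(i * n / (d + 1)) for i in range(d + 2)]
        return [(i, s[bounds[i]:bounds[i + 1]], bounds[i]) for i in range(d + 1)]

    def similarity_join(strings, d):
        groups = defaultdict(list)                      # "map": route strings to segment groups
        for s in strings:
            for i, seg, _ in segments(s, d):
                groups[(i, seg)].append(s)
        result = set()                                  # "reduce": verify pairs, avoid duplicates
        for (i, _), bucket in groups.items():
            for s, t in combinations(bucket, 2):
                if hamming(s, t) <= d:
                    first_agree = min(j for j, seg, start in segments(s, d)
                                      if t[start:start + len(seg)] == seg)
                    if first_agree == i:                # emitted by exactly one group
                        result.add(tuple(sorted((s, t))))
        return result

    print(similarity_join(["10101010", "10111010", "00000000", "10101011"], d=1))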
【Paper Link】 【Pages】:510-521
【Authors】: Younghoon Kim ; Kyuseok Shim
【Abstract】: There is a wide range of applications that require finding the top-k most similar pairs of records in a given database. However, computing such top-k similarity joins is a challenging problem today, as there is an increasing trend of applications that expect to deal with vast amounts of data. For such data-intensive applications, parallel executions of programs on a large cluster of commodity machines using the MapReduce paradigm have recently received a lot of attention. In this paper, we investigate how the top-k similarity join algorithms can get benefits from the popular MapReduce framework. We first develop the divide-and-conquer and branch-and-bound algorithms. We next propose the all pair partitioning and essential pair partitioning methods to minimize the amount of data transfers between map and reduce functions. We finally perform the experiments with not only synthetic but also real-life data sets. Our performance study confirms the effectiveness and scalability of our MapReduce algorithms.
【Keywords】: database management systems; divide and conquer methods; parallel databases; tree searching; MapReduce; MapReduce framework; branch-and-bound algorithms; commodity machines; data transfers; data-intensive applications; database; divide-and-conquer algorithms; pair partitioning methods; parallel program executions; parallel top-k similarity join algorithms; top-k most similar pairs; Approximation algorithms; Arrays; Clustering algorithms; Complexity theory; Euclidean distance; Indexes; Partitioning algorithms
【Paper Link】 【Pages】:522-533
【Authors】: Benjamin Gufler ; Nikolaus Augsten ; Angelika Reiser ; Alfons Kemper
【Abstract】: MapReduce has emerged as a popular tool for distributed and scalable processing of massive data sets and is being used increasingly in e-science applications. Unfortunately, the performance of MapReduce systems strongly depends on an even data distribution while scientific data sets are often highly skewed. The resulting load imbalance, which raises the processing time, is even amplified by high runtime complexity of the reducer tasks. An adaptive load balancing strategy is required for appropriate skew handling. In this paper, we address the problem of estimating the cost of the tasks that are distributed to the reducers based on a given cost model. An accurate cost estimation is the basis for adaptive load balancing algorithms and requires to gather statistics from the mappers. This is challenging: (a) Since the statistics from all mappers must be integrated, the mapper statistics must be small. (b) Although each mapper sees only a small fraction of the data, the integrated statistics must capture the global data distribution. (c) The mappers terminate after sending the statistics to the controller, and no second round is possible. Our solution to these challenges consists of two components. First, a monitoring component executed on every mapper captures the local data distribution and identifies its most relevant subset for cost estimation. Second, an integration component aggregates these subsets approximating the global data distribution.
【Keywords】: cloud computing; computational complexity; distributed databases; natural sciences computing; resource allocation; statistical distributions; MapReduce; adaptive load balancing strategy; batch-style job processing; cloud computing environments; cost estimation; distributed databases; e-science applications; global data distribution; integration component; load imbalance; local data distribution; mapper statistics; massive data set distributed processing; massive data set scalable processing; processing time; runtime complexity; scalable cardinality estimates; scientific data sets; skew handling; Approximation methods; Clustering algorithms; Estimation; Histograms; Load management; Monitoring; Partitioning algorithms
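【Example】: A hedged sketch of the two components under simplifying assumptions of my own (a small top-k synopsis per mapper rather than the paper's monitoring scheme): each "mapper" summarizes its local key distribution with its heaviest keys plus an aggregate remainder, and the controller integrates the synopses into an approximate global distribution that can drive reducer assignment.
    from collections import Counter

    def local_synopsis(keys, k=2):
        counts = Counter(keys)
        top = dict(counts.most_common(k))               # heaviest local keys, kept exactly
        rest = sum(counts.values()) - sum(top.values()) # everything else, in aggregate
        return {"top": top, "rest": rest}

    def integrate(synopses):
        global_top, global_rest = Counter(), 0
        for s in synopses:
            global_top.update(s["top"])
            global_rest += s["rest"]
        return global_top, global_rest                  # approximate global data distribution

    mapper1 = local_synopsis(["a", "a", "a", "b", "c"])
    mapper2 = local_synopsis(["a", "b", "b", "b", "d"])
    print(integrate([mapper1, mapper2]))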
【Paper Link】 【Pages】:534-545
【Authors】: Guo-Jun Qi ; Charu C. Aggarwal ; Thomas S. Huang
【Abstract】: The problem of community detection in social media has been widely studied in the social networking community in the context of the structure of the underlying graphs. Most community detection algorithms use the links between the nodes in order to determine the dense regions in the graph. These dense regions are the communities of social media in the graph. Such methods are typically based purely on the linkage structure of the underlying social media network. However, in many recent applications, edge content is available in order to provide better supervision to the community detection process. Many natural representations of edges in social interactions such as shared images and videos, user tags and comments are naturally associated with content on the edges. While some work has been done on utilizing node content for community detection, the presence of edge content presents unprecedented opportunities and flexibility for the community detection process. We will show that such edge content can be leveraged in order to greatly improve the effectiveness of the community detection process in social media networks. We present experimental results illustrating the effectiveness of our approach.
【Keywords】: graph theory; social networking (online); community detection algorithms; edge content; edge natural representations; graph dense regions; social media networks; social networking community; underlying graph structure; Clustering algorithms; Communities; Couplings; Image edge detection; Media; Social network services; Vectors
【Paper Link】 【Pages】:546-557
【Authors】: Chen Liu ; Sai Wu ; Shouxu Jiang ; Anthony K. H. Tung
【Abstract】: The abundance of Web 2.0 resources in various media formats calls for better resource integration to enrich user experience. This naturally leads to a new cross-modal resource search requirement, in which a query is a resource in one modality and the results are closely related resources in other modalities. With cross-modal search, we can better exploit existing resources. Tags associated with Web 2.0 resources are an intuitive medium for linking resources of different modalities together. However, tagging is by nature an ad hoc activity: tags often contain noise and are affected by the subjective inclination of the tagger. Consequently, linking resources simply by tags is not reliable. In this paper, we propose an approach for linking tagged resources to concepts extracted from Wikipedia, which has become a fairly reliable reference over the last few years. Compared to the tags, the concepts are therefore of higher quality. We develop effective methods for cross-modal search based on the concepts associated with resources. Extensive experiments were conducted, and the results show that our solution achieves good performance.
【Keywords】: Internet; query processing; social networking (online); Web 2.0 resource; Wikipedia; ad hoc activity; concept extraction; cross domain search; linking tagged resource; media format; query processing; resource integration; resource search requirement; user experience; Electronic publishing; Encyclopedias; Internet; Semantics; Silicon; Vectors
【Paper Link】 【Pages】:558-569
【Authors】: Junjie Yao ; Bin Cui ; Zijun Xue ; Qingyun Liu
【Abstract】: Recently, many micro-blog message sharing applications have emerged on the Web. Users can publish short messages freely and get notified via their subscriptions instantly. Prominent examples include Twitter, Facebook's statuses, and Sina Weibo in China. Micro-blog platforms have become a useful service for real-time information creation and propagation. However, the short length and dynamic character of these messages pose great challenges for effective content understanding. Additionally, the noise and fragmentation make it difficult to discover the temporal propagation trail and explore the development of micro-blog messages. In this paper, we propose a provenance model to capture connections between micro-blog messages. Provenance refers to data origin identification and transformation logging, and has demonstrated great value in recent database and workflow systems. To cope with the real-time micro-message deluge, we utilize a novel message grouping approach to encode and maintain the provenance information. Furthermore, we adopt a summary index and several adaptive pruning strategies to implement efficient provenance updating. Based on the index, our provenance solution can support rich query retrieval and intuitive message tracking for effective message organization. Experiments conducted on a real dataset verify the effectiveness and efficiency of our approach.
【Keywords】: Internet; electronic messaging; indexing; query processing; social networking (online); China; Facebook statuses; Sina Weibo; Twitter; World Wide Web; adaptive pruning strategies; data origin identification; database; message grouping approach; message organization; micro-blog message sharing applications; micro-blog messages; micro-blog platforms; provenance-based indexing support; real time information creation; real time information propagation; real time micro-message deluge; rich query retrieval; short messages; support provenance-based message discovery; temporal propagation trail; transformation logging; transformation monitoring; workflow systems; Blogs; Context; Indexing; Media; Noise; Twitter
【Paper Link】 【Pages】:570-581
【Authors】: Luke Dickens ; Ian Molloy ; Jorge Lobo ; Pau-Chen Cheng ; Alessandra Russo
【Abstract】: An understanding of information flow has many applications, including for maximizing marketing impact on social media, limiting malware propagation, and managing undesired disclosure of sensitive information. This paper presents scalable methods for both learning models of information flow in networks from data, based on the Independent Cascade Model, and predicting probabilities of unseen flow from these models. Our approach is based on a principled probabilistic construction and results compare favourably with existing methods in terms of accuracy of prediction and scalable evaluation, with the addition that we are able to evaluate a broader range of queries than previously shown, including probability of joint and/or conditional flow, as well as reflecting model uncertainty. Exact evaluation of flow probabilities is exponential in the number of edges and naive sampling can also be expensive, so we propose sampling in an efficient Markov-Chain Monte-Carlo fashion using the Metropolis-Hastings algorithm -- details described in the paper. We identify two types of data, those where the paths of past flows are known -- attributed data, and those where only the endpoints are known -- unattributed data. Both data types are addressed in this paper, including training methods, example real world data sets, and experimental evaluation. In particular, we investigate flow data from the Twitter microblogging service, exploring the flow of messages through retweets (tweet forwards) for the attributed case, and the propagation of hash tags (metadata tags) and urls for the unattributed case.
【Keywords】: Markov processes; Monte Carlo methods; data flow analysis; directed graphs; learning (artificial intelligence); meta data; probability; sampling methods; security of data; social networking (online); Markov-Chain Monte-Carlo method; Metropolis-Hastings algorithm; Twitter microblogging service; URL; conditional flow probability; data type identification; directed graph; hash tag propagation; independent cascade model; information flow; joint flow probability; malware propagation; marketing impact maximization; message flow; metadata tags; model uncertainty; naive sampling; principled probabilistic construction; retweet; sensitive information undesired disclosure management; social media; stochastic model learning; tweet forward; unattributed data; unseen flow probability; Accuracy; Data models; Equations; Joints; Predictive models; Twitter; Uncertainty
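【Example】: A hedged sketch of the Independent Cascade Model that the paper builds on, with a plain Monte-Carlo estimate of an unseen-flow probability; the paper's Metropolis-Hastings machinery and the learning of edge probabilities are not shown, and the graph and probabilities are invented.
    import random

    edges = {"u": [("v", 0.4), ("w", 0.2)], "v": [("w", 0.5)], "w": []}

    def cascade(edges, seeds, rng):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v, p in edges[u]:
                    if v not in active and rng.random() < p:   # one activation attempt per edge
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        return active

    def flow_probability(edges, seeds, target, runs=10000, seed=0):
        rng = random.Random(seed)
        return sum(target in cascade(edges, seeds, rng) for _ in range(runs)) / runs

    print(flow_probability(edges, seeds={"u"}, target="w"))    # close to 1 - 0.8*0.8 = 0.36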
【Paper Link】 【Pages】:582-593
【Authors】: Gang Chen ; Tianlei Hu ; Dawei Jiang ; Peng Lu ; Kian-Lee Tan ; Hoang Tam Vo ; Sai Wu
【Abstract】: The corporate network is often used for sharing information among the participating companies and facilitating collaboration in a certain industry sector where companies share a common interest. It can effectively help the companies to reduce their operational costs and increase the revenues. However, inter-company data sharing and processing poses unique challenges to such a data management system, including scalability, performance, throughput, and security. In this paper, we present BestPeer++, a system which delivers elastic data sharing services for corporate network applications in the cloud based on BestPeer -- a peer-to-peer (P2P) based data management platform. By integrating cloud computing, database, and P2P technologies into one system, BestPeer++ provides an economical, flexible and scalable platform for corporate network applications and delivers data sharing services to participants based on the widely accepted pay-as-you-go business model. We evaluate BestPeer++ on the Amazon EC2 Cloud platform. The benchmarking results show that BestPeer++ outperforms HadoopDB, a recently proposed large-scale data processing system, in performance when both systems are employed to handle typical corporate network workloads. The benchmarking results also demonstrate that BestPeer++ achieves near linear scalability in throughput with respect to the number of peer nodes.
【Keywords】: business data processing; cloud computing; database management systems; groupware; peer-to-peer computing; security of data; BestPeer++; P2P technologies; cloud computing; collaboration; corporate network; data management system; database; information sharing; inter-company data sharing; large-scale data processing platform; pay-as-you-go business model; peer-to-peer network; scalability; security; Companies; Indexes; Peer to peer computing; Query processing; Servers
【Paper Link】 【Pages】:594-605
【Authors】: Minqi Zhou ; Heng Tao Shen ; Xiaofang Zhou ; Weining Qian ; Aoying Zhou
【Abstract】: Estimating the global data distribution in Peer-to-Peer (P2P) networks is an important issue and has yet to be well addressed. It can benefit many P2P applications, such as load balancing analysis, query processing, and data mining. Inspired by the inversion method for random variate generation, in this paper we present a novel model named distribution-free data density estimation for dynamic ring-based P2P networks to achieve high estimation accuracy with low estimation cost regardless of distribution models of the underlying data. It generates random samples for any arbitrary distribution by sampling the global cumulative distribution function and is free from sampling bias. In P2P networks, the key idea for distribution-free estimation is to sample a small subset of peers for estimating the global data distribution over the data domain. Algorithms on computing and sampling the global cumulative distribution function based on which global data distribution is estimated are introduced with detailed theoretical analysis. Our extensive performance study confirms the effectiveness and efficiency of our methods in ring-based P2P networks.
【Keywords】: data mining; peer-to-peer computing; query processing; resource allocation; data mining; distribution-free data density estimation; dynamic ring-based P2P networks; global cumulative distribution function; global data distribution estimation; load balancing analysis; peer-to-peer networks; query processing; random variate generation; ring-based P2P networks; Distribution functions; Estimation; Histograms; Indexes; Peer to peer computing; Probability density function; Probability distribution
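【Example】: A hedged sketch of the inversion idea only: build an empirical CDF from a sampled subset of values and generate new variates by inverting uniform draws on it. How the global CDF is computed and sampled across peers on the ring, which is the paper's contribution, is not modeled; the data is synthetic.
    import random

    def empirical_inverse_cdf(sample):
        xs = sorted(sample)                           # empirical CDF as a sorted sample
        def draw(rng):
            u = rng.random()                          # uniform in [0, 1)
            return xs[min(int(u * len(xs)), len(xs) - 1)]
        return draw

    rng = random.Random(42)
    observed = [rng.gauss(10, 2) for _ in range(1000)]   # stand-in for values gathered from peers
    draw = empirical_inverse_cdf(observed)
    print([round(draw(rng), 2) for _ in range(5)])       # variates following the observed distribution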
【Paper Link】 【Pages】:606-617
【Authors】: Christos Doulkeridis ; Akrivi Vlachou ; Kjetil Nørvåg ; Yannis Kotidis ; Neoklis Polyzotis
【Abstract】: In this paper, we study efficient processing of rank joins in highly distributed systems, where servers store fragments of relations in an autonomous manner. Existing rank-join algorithms exhibit poor performance in this setting due to excessive communication costs or high latency. We propose a novel distributed rank-join framework that employs data statistics, maintained as histograms, to determine the subset of each relational fragment that needs to be fetched to generate the top-k join results. At the heart of our framework lies a distributed score bound estimation algorithm that produces sufficient score bounds for each relation, that guarantee the correctness of the rank-join result set, when the histograms are accurate. Furthermore, we propose a generalization of our framework that supports approximate statistics, in the case that the exact statistical information is not available. An extensive experimental study validates the efficiency of our framework and demonstrates its advantages over existing methods.
【Keywords】: distributed processing; query processing; statistics; approximate statistics; data statistics; distributed rank-join framework; distributed score bound estimation algorithm; highly distributed systems; rank join processing; rank-join result set; relational fragment subset; top-k join result generation; Approximation algorithms; Distributed databases; Estimation; Histograms; Marketing and sales; Query processing; Servers
【Paper Link】 【Pages】:618-629
【Authors】: Lars Kolb ; Andreas Thor ; Erhard Rahm
【Abstract】: The effectiveness and scalability of MapReduce-based implementations of complex data-intensive tasks depend on an even redistribution of data between map and reduce tasks. In the presence of skewed data, sophisticated redistribution approaches thus become necessary to achieve load balancing among all reduce tasks to be executed in parallel. For the complex problem of entity resolution, we propose and evaluate two approaches for such skew handling and load balancing. The approaches support blocking techniques to reduce the search space of entity resolution, utilize a preprocessing MapReduce job to analyze the data distribution, and distribute the entities of large blocks among multiple reduce tasks. The evaluation on a real cloud infrastructure shows the value and effectiveness of the proposed load balancing approaches.
【Keywords】: cloud computing; data integration; MapReduce; blocking technique; complex data-intensive task; data redistribution; entity resolution; load balancing; real cloud infrastructure; search space; skew handling; skewed data; Computational modeling; Erbium; Image color analysis; Indexes; Load management; Memory management; Scalability
【Paper Link】 【Pages】:630-641
【Authors】: Liang Jeff Chen ; Philip A. Bernstein ; Peter Carlin ; Dimitrije Filipovic ; Michael Rys ; Nikita Shamgunov ; James F. Terwilliger ; Milos Todic ; Sasa Tomasevic ; Dragan Tomic
【Abstract】: XML is commonly supported by SQL database systems. However, existing mappings of XML to tables can only deliver satisfactory query performance for limited use cases. In this paper, we propose a novel mapping of XML data into one wide table whose columns are sparsely populated. This mapping provides good performance for document types and queries that are observed in enterprise applications but are not supported efficiently by existing work. XML queries are evaluated by translating them into SQL queries over the wide sparsely-populated table. We show how to translate full XPath 1.0 into SQL. Based on the characteristics of the new mapping, we present rewriting optimizations that minimize the number of joins. Experiments demonstrate that query evaluation over the new mapping delivers considerable improvements over existing techniques for the target use cases.
【Keywords】: SQL; XML; document handling; optimisation; query processing; rewriting systems; table lookup; SQL database systems; SQL query; XML mapping; XML query; XPath 1.0; document query; document types; query evaluation; rewriting optimizations; satisfactory query performance; sparsely populated columns; sparsely-populated table; target use cases; wide sparse table; Algebra; Encoding; Indexes; Optimization; Query processing; XML
【Paper Link】 【Pages】:642-653
【Authors】: Curtis E. Dyreson ; Sourav S. Bhowmick
【Abstract】: A limitation of XQuery is that a programmer has to be familiar with the shape of the data to query it effectively. And if that shape changes, or if the shape is other than what the programmer expects, the query may fail. One way to avoid this limitation is to transform the data into a desired shape. A data transformation is a rearrangement of data into a new shape. In this paper, we present the semantics and implementation of XMorph 2.0, a shape-polymorphic data transformation language for XML. An XMorph program can act as a query guard. The guard both transforms data to the shape needed by the query and determines whether and how the transformation potentially loses information; a transformation that loses information may lead to a query yielding an inaccurate result. This paper describes how to use XMorph as a query guard, gives a formal semantics for shape-to-shape transformations, documents how XMorph determines how a transformation potentially loses information, and describes the XMorph implementation.
【Keywords】: XML; formal specification; query processing; XML data querying; XMorph 2.0; XQuery; data transformation; formal semantics; query guard; shape-polymorphic data transformation language; shape-to-shape transformations; Conferences; Data engineering
【Paper Link】 【Pages】:654-665
【Authors】: Yanghua Xiao ; Ji Hong ; Wanyun Cui ; Zhenying He ; Wei Wang ; Guodong Feng
【Abstract】: Labeling schemes lie at the core of query processing for many kinds of tree-structured data, such as the XML data that is flooding the web. A labeling scheme that can simultaneously and efficiently support various relationship queries on trees (such as parent/children, descendant/ancestor, etc.), computation of lowest common ancestors (LCA), and updates of trees is desired for effective and efficient management of tree-structured data. Although a variety of labeling schemes, such as prefix-based labeling, interval-based labeling and prime-based labeling as well as their variants, have been proposed for encoding static and dynamic trees, these labeling schemes usually show weakness in one aspect or another. In this paper, we propose an integer-based labeling scheme, branch code, as well as its compressed version as our major solution to simultaneously support efficient query processing on both static and dynamic ordered trees with affordable storage cost. The proposed branch code can answer common queries on ordered trees in constant time, which comes at the cost of O(N log N) storage. To reduce the storage cost to O(N), a compressed branch code is further developed. We also give a relationship determination algorithm that uses only the compressed branch code and has a very low probability of producing false positives, as verified by experimental results. With the support of splay trees, branch code can also handle dynamic trees, so that updates and queries can be implemented with O(log N) amortized cost. All the results above are either theoretically proved or verified by experimental studies.
【Keywords】: XML; computational complexity; query processing; storage management; tree data structures; LCA; World Wide Web; XML data; amortized cost; compressed branch code; dynamic ordered trees; encoding; integer-based labeling scheme; interval-based labeling; lowest common ancestor; prefix-based labeling; prime-based labeling; query answering; query processing; relationship determination algorithm; relationship queries; splay trees; static ordered trees; storage cost; tree-structured data management; Encoding; Equations; Labeling; Mathematical model; Query processing; Vegetation; XML
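【Example】: A hedged sketch of the interval-based baseline mentioned above (not the paper's branch code): label each node with its pre-order and post-order positions so that an ancestor test becomes two integer comparisons. The tree is illustrative.
    def label(tree, root):
        pre, post, clock = {}, {}, [0]
        def visit(u):
            pre[u] = clock[0]; clock[0] += 1
            for child in tree.get(u, []):
                visit(child)
            post[u] = clock[0]; clock[0] += 1
        visit(root)
        return pre, post

    def is_ancestor(pre, post, u, v):
        """True iff u is a proper ancestor of v."""
        return pre[u] < pre[v] and post[v] < post[u]

    tree = {"a": ["b", "c"], "b": ["d", "e"], "c": []}
    pre, post = label(tree, "a")
    print(is_ancestor(pre, post, "a", "e"), is_ancestor(pre, post, "b", "c"))   # True False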
【Paper Link】 【Pages】:666-677
【Authors】: Wangchao Le ; Anastasios Kementsietsidis ; Songyun Duan ; Feifei Li
【Abstract】: This paper revisits the classical problem of multi-query optimization in the context of RDF/SPARQL. We show that the techniques developed for relational and semi-structured data/query languages are hard, if not impossible, to extend to account for the RDF data model and the graph query patterns expressed in SPARQL. In light of the NP-hardness of multi-query optimization for SPARQL, we propose heuristic algorithms that partition the input batch of queries into groups such that each group of queries can be optimized together. An essential component of the optimization incorporates an efficient algorithm to discover the common sub-structures of multiple SPARQL queries and an effective cost model to compare candidate execution plans. Since our optimization techniques do not make any assumption about the underlying SPARQL query engine, they have the advantage of being portable across different RDF stores. The extensive experimental studies, performed on three popular RDF stores, show that the proposed techniques are effective, efficient and scalable.
【Keywords】: computational complexity; data models; query languages; query processing; relational databases; NP-hardness; RDF data model; RDF/SPARQL; SPARQL query engine; graph query patterns; relational data/query languages; scalable multiquery optimization; semistructured data/query languages; Buildings; Context; Optimization; Partitioning algorithms; Pattern matching; Resource description framework; World Wide Web
【Paper Link】 【Pages】:678-689
【Authors】: Jiexing Li ; Rimma V. Nehme ; Jeffrey F. Naughton
【Abstract】: Progress indicators for SQL queries were first published in 2004 with the simultaneous and independent proposals from Chaudhuri et al. and Luo et al. In this paper, we implement both progress indicators in the same commercial RDBMS to investigate their performance. We summarize common cases in which they are both accurate and cases in which they fail to provide reliable estimates. Although there are differences in their performance, much more striking is the similarity in the errors they make due to a common simplifying uniform future speed assumption. While the developers of these progress indicators were aware that this assumption could cause errors, they neither explored how large the errors might be nor did they investigate the feasibility of removing the assumption. To rectify this we propose a new query progress indicator, similar to these early progress indicators but without the uniform speed assumption. Experiments show that on the TPC-H benchmark, on queries for which the original progress indicators have errors up to 30X the query running time, the new progress indicator is accurate to within 10 percent. We also discuss the sources of the errors that still remain and shed some light on what would need to be done to eliminate them.
【Keywords】: SQL; costing; query processing; relational databases; GSLPI; RDBMS; SQL queries; TPC-H benchmark; cost-based query progress indicator; error source; query running time; Benchmark testing; Estimation error; Indexes; Pipelines; Servers
【Paper Link】 【Pages】:690-701
【Authors】: Rui Zhang ; Richard T. Snodgrass ; Saumya Debray
【Abstract】: Relational database management systems are general in the sense that they can handle arbitrary schemas, queries, and modifications; this generality is implemented using runtime metadata lookups and tests that ensure that control is channelled to the appropriate code in all cases. Unfortunately, these lookups and tests are carried out even when information is available that renders some of these operations superfluous, leading to unnecessary runtime overheads. This paper introduces micro-specialization, an approach that uses relation- and query-specific information to specialize the DBMS code at runtime and thereby eliminate some of these overheads. We develop a taxonomy of approaches and specialization times and propose a general architecture that isolates most of the creation and execution of the specialized code sequences in a separate DBMS-independent module. Through three illustrative types of micro-specialization applied to PostgreSQL, we show that this approach requires minimal changes to a DBMS and can improve performance simultaneously across a wide range of queries, modifications, and bulk-loading, in terms of storage, CPU usage, and I/O time on the TPC-H and TPC-C benchmarks.
【Keywords】: SQL; query processing; relational databases; DBMS; PostgreSQL; TPC-C benchmark; TPC-H benchmark; arbitrary scheme handling; microspecialization; modification handling; queries handling; query-specific information; relation-specific information; relational database management system; runtime metadata lookups; specialized code sequence; Arrays; Benchmark testing; Catalogs; Query processing; Runtime; Taxonomy
【Paper Link】 【Pages】:702-713
【Authors】: Willis Lang ; Srinath Shankar ; Jignesh M. Patel ; Ajay Kalhan
【Abstract】: As traditional and mission-critical relational database workloads migrate to the cloud in the form of Database-as-a-Service (DaaS), there is an increasing motivation to provide performance goals in Service Level Objectives (SLOs). Providing such performance goals is challenging for DaaS providers as they must balance the performance that they can deliver to tenants and the data center's operating costs. In general, aggressively aggregating tenants on each server reduces the operating costs but degrades performance for the tenants, and vice versa. In this paper, we present a framework that takes as input the tenant workloads, their performance SLOs, and the server hardware that is available to the DaaS provider, and outputs a cost-effective recipe that specifies how much hardware to provision and how to schedule the tenants on each hardware resource. We evaluate our method and show that it produces effective solutions that can reduce the costs for the DaaS provider while meeting performance goals.
【Keywords】: cloud computing; relational databases; security of data; DaaS; Database-as-a-Service; Service Level Objectives; cloud; data center operating costs; mission-critical relational database; multi-tenant performance SLO; tenant workloads; Benchmark testing; Databases; Hardware; Measurement; Optimization; Servers; Synthetic aperture sonar
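The provisioning problem described above can be approximated, very crudely, as packing tenants onto servers under a capacity constraint. The sketch below is not the paper's framework; the per-tenant `demand` and the `server_capacity` are hypothetical stand-ins for the SLO-derived resource requirements a real system would compute.

    # First-fit-decreasing packing of tenant workloads onto identical servers.

    def pack_tenants(demands, server_capacity):
        servers = []   # each server holds a list of (tenant_id, demand)
        loads = []
        for tid, d in sorted(enumerate(demands), key=lambda x: -x[1]):
            for i, load in enumerate(loads):
                if load + d <= server_capacity:   # fits without exceeding capacity
                    servers[i].append((tid, d))
                    loads[i] += d
                    break
            else:                                  # otherwise provision a new server
                servers.append([(tid, d)])
                loads.append(d)
        return servers

    demands = [0.6, 0.3, 0.5, 0.2, 0.4]            # hypothetical per-tenant load
    print(pack_tenants(demands, server_capacity=1.0))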
【Paper Link】 【Pages】:714-725
【Authors】: David B. Lomet ; Alan Fekete ; Rui Wang ; Peter Ward
【Abstract】: A database supporting multiple versions of records may use the versions to support queries of the past or to increase concurrency by enabling reads and writes to be concurrent. We introduce a new concurrency control approach that enables all SQL isolation levels including serializability to utilize multiple versions to increase concurrency while also supporting transaction time database functionality. The key insight is to manage a range of possible timestamps for each transaction that captures the impact of conflicts that have occurred. Using these ranges as constraints often permits concurrent access where lock based concurrency control would block. This can also allow blocking instead of some aborts that are common in earlier multi-version concurrency techniques. Also, timestamp ranges can be used to conservatively find deadlocks without graph based cycle detection. Thus, our multi-version support can enhance performance of current time data access via improved concurrency, while supporting transaction time functionality.
【Keywords】: SQL; concurrency control; graph theory; query processing; transaction processing; SQL isolation level; concurrency control approach; deadlock; graph based cycle detection; lock based concurrency control; multiversion concurrency; query processing; serializability; time data access; timestamp range conflict management; transaction time database functionality; Concurrency control; Concurrent computing; Database systems; Laser mode locking; Proposals; System recovery
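The timestamp-range idea in the abstract above can be illustrated minimally: each transaction carries a range of candidate commit timestamps, conflicts shrink the ranges instead of blocking, and an abort is forced only when a range becomes empty. This is a simplification for illustration, not the paper's full protocol.

    class Txn:
        def __init__(self, lo=0, hi=float("inf")):
            self.lo, self.hi = lo, hi          # possible commit timestamps [lo, hi)
        def empty(self):
            return self.lo >= self.hi

    def order_before(a, b, ts):
        """Record that `a` must commit before time `ts` and `b` at or after it
        (e.g., `a` read a version that `b` wants to overwrite at timestamp `ts`)."""
        a.hi = min(a.hi, ts)
        b.lo = max(b.lo, ts)
        return not (a.empty() or b.empty())    # False means some transaction must abort

    r, w = Txn(), Txn()
    ok = order_before(r, w, ts=100)            # reader ordered before writer: both survive
    print(ok, (r.lo, r.hi), (w.lo, w.hi))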
【Paper Link】 【Pages】:726-737
【Authors】: Nora Derouiche ; Bogdan Cautis ; Talel Abdessalem
【Abstract】: We present in this paper a novel approach for extracting structured data from the Web, whose goal is to harvest real-world items from template-based HTML pages (the structured Web). It illustrates a two-phase querying of the Web, in which an intentional description of the targeted data is first provided, in a flexible and widely applicable manner. The extraction process then leverages both the input description and the source structure. Our approach is domain-independent, in the sense that it applies to any relation, either flat or nested, describing real-world items. Extensive experiments on five different domains and a comparison with the main state-of-the-art extraction systems from the literature illustrate its flexibility and precision. We advocate via our technique that automatic extraction and integration of complex structured data can be done quickly and effectively when the redundancy of the Web is combined with knowledge about the to-be-extracted data.
【Keywords】: Internet; query processing; automatic extraction; domain knowledge; input description; intentional description; source structure; structured Web data; template-based HTML pages; two-phase querying; Data mining; Feature extraction; HTML; Semantics; Silicon; Web pages; Wrapping
【Paper Link】 【Pages】:738-749
【Authors】: Lukasz Golab ; Howard J. Karloff ; Flip Korn ; Barna Saha ; Divesh Srivastava
【Abstract】: Many applications process data in which there exists a "conservation law" between related quantities. For example, in traffic monitoring, every incoming event, such as a packet's entering a router or a car's entering an intersection, should ideally have an immediate outgoing counterpart. We propose a new class of constraints -- Conservation Rules -- that express the semantics and characterize the data quality of such applications. We give confidence metrics that quantify how strongly a conservation rule holds and present approximation algorithms (with error guarantees) for the problem of discovering a concise summary of subsets of the data that satisfy a given conservation rule. Using real data, we demonstrate the utility of conservation rules and we show order-of-magnitude performance improvements of our discovery algorithms over naive approaches.
【Keywords】: approximation theory; data handling; approximation algorithm; confidence metrics; conservation law; conservation rules; data quality; discovery algorithm; error guarantees; naive approach; Approximation algorithms; Delay; Electricity; IP networks; Monitoring; Roads
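One simple, illustrative way to quantify how strongly a conservation law holds is the fraction of incoming events that can be matched to a later outgoing event within a delay bound; the paper's confidence metrics and discovery algorithms are more general than this. The event timestamps and the delay bound below are hypothetical.

    def conservation_confidence(in_times, out_times, max_delay):
        in_times, out_times = sorted(in_times), sorted(out_times)
        matched, j = 0, 0
        for t_in in in_times:
            # skip outgoing events that happened before this incoming event
            while j < len(out_times) and out_times[j] < t_in:
                j += 1
            if j < len(out_times) and out_times[j] - t_in <= max_delay:
                matched += 1
                j += 1                        # each outgoing event matches at most once
        return matched / len(in_times) if in_times else 1.0

    # 3 of 4 incoming events have an outgoing counterpart within 0.5 time units
    print(conservation_confidence([1, 2, 3, 10], [1.5, 2.2, 3.1], max_delay=0.5))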
【Paper Link】 【Pages】:750-761
【Abstract】: After decades of effort working on database performance, the quality and the usability of database systems have received more attention in recent years. In particular, the feature of explaining missing tuples in a query result, or the so-called "why-not" questions, has recently become an active topic. In this paper, we study the problem of answering why-not questions on top-k queries. Our motivation is that many users favor top-k queries when making multi-criteria decisions. However, they often feel frustrated when asked to quantify their preferences as a set of numeric weightings, and even more frustrated when the query results do not include their expected answers. In this paper, we use the query-refinement method to approach the problem. Given as inputs the original top-k query and a set of missing tuples, our algorithm returns to the user a refined top-k query that includes the missing tuples. A case study and experimental results show that our approach returns high-quality explanations to users efficiently.
【Keywords】: human computer interaction; query processing; software performance evaluation; database systems usability; multicriteria decision making; query-refinement method; top-k queries; why-not questions; Approximation algorithms; Database systems; Equations; Quadratic programming; Usability; Vectors
【Paper Link】 【Pages】:762-773
【Authors】: Dong Deng ; Guoliang Li ; Jianhua Feng
【Abstract】: Dictionary-based entity extraction has attracted much attention from the database community recently; it locates substrings of a document that match predefined entities (e.g., person names or locations). To improve extraction recall, a recent trend is to provide approximate matching between substrings of the document and entities by tolerating minor errors. In this paper we study dictionary-based approximate entity extraction with edit-distance constraints. Existing methods have several limitations. First, they need to tune many parameters to achieve high performance. Second, they are inefficient for large edit-distance thresholds. We propose a trie-based method to address these problems. We first partition each entity into a set of segments, and then use a trie structure to index the segments. To extract similar entities, we search for segments in the document, and extend the matching segments in both entities and the document to find similar pairs. We develop an extension-based method to efficiently find similar string pairs by extending the matching segments. We optimize our partition scheme and select the best partition strategy to improve the extraction performance. Experimental results show that our method achieves much higher performance compared with state-of-the-art studies.
【Keywords】: approximation theory; dictionaries; indexing; information retrieval; search problems; string matching; tree data structures; approximate matching; database community; dictionary-based approximate entity extraction; document substring; edit-distance constraints; edit-distance thresholds; entity partitioning; entity substring; error tolerance; extension-based method; extraction performance improvement; partition scheme optimization; segment indexing; segment matching; segment search; trie-based method; Complexity theory; Dictionaries; Heuristic algorithms; Indexes; Partitioning algorithms; Search problems; Strontium
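The pigeonhole idea behind segment-based filtering can be sketched as follows: if ed(entity, s) <= tau, then at least one of tau+1 disjoint segments of the entity occurs in s verbatim. The sketch below (assuming each entity is longer than tau) uses a plain dictionary instead of the paper's trie index and a naive verification, so it only illustrates the principle, not the paper's algorithm.

    def edit_distance(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def partition(entity, tau):
        """Split the entity into tau+1 near-even, disjoint segments."""
        n, k = len(entity), tau + 1
        base, extra, segs, start = n // k, n % k, [], 0
        for i in range(k):
            length = base + (1 if i < extra else 0)
            segs.append((entity[start:start + length], start))
            start += length
        return segs

    def extract(document, entities, tau):
        index = {}                       # segment text -> [(entity, segment offset in entity)]
        for e in entities:
            for seg, off in partition(e, tau):
                index.setdefault(seg, []).append((e, off))
        results = set()
        for i in range(len(document)):   # naive scan: try every segment at every position
            for seg, cands in index.items():
                if not document.startswith(seg, i):
                    continue
                for e, off in cands:
                    # any match with ed <= tau must start within tau of i - off
                    for start in range(max(0, i - off - tau), min(i - off + tau, len(document)) + 1):
                        for end in range(start + len(e) - tau, start + len(e) + tau + 1):
                            sub = document[start:end]
                            if sub and edit_distance(e, sub) <= tau:
                                results.add((e, sub))
        return results

    print(extract("the vldb conferrence in istanbul", ["vldb conference"], tau=1))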
【Paper Link】 【Pages】:774-785
【Authors】: Pei Lee ; Laks V. S. Lakshmanan ; Jeffrey Xu Yu
【Abstract】: Search for objects similar to a given query object in a network has numerous applications, including web search and collaborative filtering. We use the notion of structural similarity to capture the commonality of two objects in a network, e.g., if two nodes are referenced by the same node, they may be similar. Meeting-based methods, including SimRank and P-Rank, capture structural similarity very well. Deriving inspiration from PageRank, SimRank has gained popularity owing to its natural intuition and domain independence. Since it is computationally expensive, subsequent work has focused on optimizing and approximating the computation of SimRank. In this paper, we approach SimRank from a top-k querying perspective where, given a query node v, we are interested in finding the top-k nodes that have the highest SimRank score w.r.t. v. The only known approaches for answering such queries are either a naive algorithm that computes the similarity matrix for all node pairs, or computing the similarity vector by comparing the query node v with each other node independently and then picking the top-k. Neither approach can handle top-k structural similarity search efficiently by scaling to very large graphs consisting of millions of nodes. We propose an algorithmic framework called TopSim, based on transforming the top-k SimRank problem on a graph G into one of finding the top-k nodes with highest authority on the product graph of G with itself. We further accelerate TopSim by merging similarity paths and develop a more efficient algorithm called TopSim-SM. Two heuristic algorithms, Trun-TopSim-SM and Prio-TopSim-SM, are also proposed to approximate TopSim-SM on scale-free graphs, trading accuracy for speed based on truncated random walks and prioritized propagation, respectively. We analyze the accuracy and performance of the TopSim family of algorithms and report the results of a detailed experimental study.
【Keywords】: Internet; collaborative filtering; graph theory; query processing; P-Rank; PageRank; Prio-TopSim-SM; TopSim; TopSim-SM; Trun-TopSim-SM; Web search; collaborative filtering; meeting-based methods; naive algorithm; query object; scale-free graphs; similarity matrix; top-k SimRank problem; top-k nodes; top-k querying perspective; top-k structural similarity search; truncated random walk; Accuracy; Approximation algorithms; Approximation methods; Collaboration; Context; Couplings; Upper bound
【Paper Link】 【Pages】:786-797
【Authors】: Mohammad Sadoghi ; Hans-Arno Jacobsen
【Abstract】: The efficient processing of large collections of Boolean expressions plays a central role in major data intensive applications ranging from user-centric processing and personalization to real-time data analysis. Emerging applications such as computational advertising and selective information dissemination demand determining and presenting to an end-user only the most relevant content that is both user-consumable and suitable for the limited screen real estate of target devices. To retrieve the most relevant content, we present BE*-Tree, a novel indexing data structure designed for effective hierarchical top-k pattern matching, which as its by-product also reduces the operational cost of processing millions of patterns. To further reduce processing cost, BE*-Tree employs an adaptive and non-rigid space-cutting technique designed to efficiently index Boolean expressions over a high-dimensional continuous space. At the core of BE*-Tree lie two innovative ideas: (1) a bi-directional tree expansion, built through top-down (data and space clustering) and bottom-up (space clustering) growth, which together enable indexing of only non-empty continuous sub-spaces, and (2) an overlap-free splitting strategy. Finally, the performance of BE*-Tree is proven through a comprehensive experimental comparison against state-of-the-art index structures for matching Boolean expressions.
【Keywords】: Boolean algebra; indexing; message passing; middleware; pattern matching; tree data structures; BE*-tree; Boolean expression collection processing; Boolean expression matching; bidirectional tree expansion; bottom-up growths; data intensive applications; hierarchical top-k pattern matching; indexing data structure; less-top-k matching capitalization; nonempty continuous subspaces; overlap-free splitting strategy; personalization; publish-subscribe; real-time data analysis; relevance matters; space-cutting technique; user-centric processing; Computational modeling; Data models; Heuristic algorithms; Indexing; Subscriptions
【Paper Link】 【Pages】:798-809
【Authors】: Zhitao Shen ; Muhammad Aamir Cheema ; Xuemin Lin ; Wenjie Zhang ; Haixun Wang
【Abstract】: Top-k pairs queries have received significant attention from the research community. k-closest pairs queries, k-furthest pairs queries and their variants are among the most well studied special cases of top-k pairs queries. In this paper, we present the first approach to answer a broad class of top-k pairs queries over sliding windows. Our framework handles multiple top-k pairs queries, and each query is allowed to use a different scoring function, a different value of k and a different size of the sliding window. Although the number of possible pairs in the sliding window is quadratic in the number of objects N in the sliding window, we efficiently answer the top-k pairs query by maintaining a small subset of pairs called the K-skyband, which is expected to consist of O(K log(N/K)) pairs. For all the queries that use the same scoring function, we need to maintain only one K-skyband. We present efficient techniques for K-skyband maintenance and query answering. We conduct a detailed complexity analysis and show that the expected cost of our approach is reasonably close to the lower bound cost. We experimentally verify this by comparing our approach with a specially designed supreme algorithm that assumes the existence of an oracle and meets the lower bound cost.
【Keywords】: computational complexity; query processing; K-skyband maintenance; complexity analysis; k-closest pairs queries; k-furthest pairs queries; oracle; query answering; scoring function; sliding windows; supreme algorithm; top-k pairs monitoring; top-k pairs queries; Algorithm design and analysis; Complexity theory; Conferences; Educational institutions; Maintenance engineering; Monitoring; Query processing
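The K-skyband pruning idea can be illustrated with a brute-force sketch: a pair may be discarded once at least K other pairs have both a better score and a later expiry, since it can then never re-enter the top-k before it expires. The incremental maintenance and the complexity guarantees are the paper's contribution and are not reproduced here; the scores and expiry times below are hypothetical.

    def k_skyband(pairs, K):
        """pairs: list of (score, expiry); higher score is better."""
        kept = []
        for s, e in pairs:
            dominators = sum(1 for s2, e2 in pairs
                             if (s2 > s and e2 >= e) or (s2 >= s and e2 > e))
            if dominators < K:
                kept.append((s, e))
        return kept

    def top_k(pairs, K, now):
        live = [(s, e) for s, e in k_skyband(pairs, K) if e > now]
        return sorted(live, reverse=True)[:K]

    pairs = [(0.9, 5), (0.8, 9), (0.7, 2), (0.6, 8), (0.5, 1)]
    print(k_skyband(pairs, K=2))       # (0.7, 2) and (0.5, 1) are pruned
    print(top_k(pairs, K=2, now=0))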
【Paper Link】 【Pages】:810-821
【Authors】: Albert Yu ; Pankaj K. Agarwal ; Jun Yang
【Abstract】: We consider how to support a large number of users over a wide-area network whose interests are characterised by range top-k continuous queries. Given an object update, we need to notify users whose top-k results are affected. Simple solutions include using a content-driven network to notify all users whose interest ranges contain the update (ignoring top-k), or using a server to compute only the affected queries and notifying them individually. The former solution generates too much network traffic, while the latter overwhelms the server. We present a geometric framework for the problem that allows us to describe the set of affected queries succinctly with messages that can be efficiently disseminated using content-driven networks. We give fast algorithms to reformulate each update into a set of messages whose number is provably optimal, with or without knowing all user interests. We also present extensions to our solution, including an approximate algorithm that trades off between the cost of server-side reformulation and that of user-side post-processing, as well as efficient techniques for batch updates.
【Keywords】: client-server systems; content management; query processing; affected query; approximate algorithm; batch updates; content-driven network; geometric framework; network traffic; object update; range top-k continuous query; range top-k subscriptions processing; server-side reformulation; user interests; user notification; user-side post-processing; wide-area network; Approximation algorithms; Indexes; Servers; Standards; Subscriptions; Tiles
【Paper Link】 【Pages】:822-833
【Authors】: Jongik Kim ; Hongrae Lee
【Abstract】: Similarity searches are essential in many applications including data cleaning and near duplicate detection. Many similarity search algorithms first generate candidate records, and then identify true matches among them. A major focus of those algorithms has been on how to reduce the number of candidate records in the early stage of similarity query processing. One of the most commonly used techniques to reduce the candidate size is the prefix filtering principle, which exploits the document frequency ordering of tokens. In this paper, we propose a novel partitioning technique that considers multiple token orderings based on token co-occurrence statistics. Experimental results show that the proposed technique is effective in reducing the number of candidate records and as a result improves the performance of existing algorithms significantly.
【Keywords】: document handling; query processing; data cleaning; document frequency ordering; efficient exact similarity searches; multiple token orderings; near duplicate detection; prefix filtering principle; similarity query processing; similarity search algorithm; token cooccurrence statistics; Cleaning; Dictionaries; Indexes; Merging; Partitioning algorithms; Query processing; Search problems
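For context, the classic prefix-filtering principle that the paper builds on can be sketched as follows; the paper's contribution of choosing multiple token orderings from co-occurrence statistics is not reproduced. The records, threshold, and tokenization below are hypothetical.

    # Two records can reach Jaccard similarity t only if their prefixes of length
    # |r| - ceil(t * |r|) + 1 (under a global token ordering, rarest tokens first)
    # share at least one token.
    import math
    from itertools import combinations

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    def prefix(record, order, t):
        toks = sorted(set(record), key=lambda w: order[w])        # rarest tokens first
        return set(toks[: len(toks) - math.ceil(t * len(toks)) + 1])

    def candidate_pairs(records, t):
        df = {}
        for r in records:
            for w in set(r):
                df[w] = df.get(w, 0) + 1
        order = {w: (df[w], w) for w in df}                        # document frequency, ties lexical
        prefs = [prefix(r, order, t) for r in records]
        return [(i, j) for i, j in combinations(range(len(records)), 2)
                if prefs[i] & prefs[j]]

    records = [["data", "stream", "join"], ["data", "stream", "mining"],
               ["graph", "pattern", "mining"]]
    cands = candidate_pairs(records, t=0.5)
    print(cands)                                                   # surviving candidate pairs
    print([(i, j) for i, j in cands if jaccard(records[i], records[j]) >= 0.5])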
【Paper Link】 【Pages】:834-845
【Authors】: Xiang Zhao ; Chuan Xiao ; Xuemin Lin ; Wei Wang
【Abstract】: Graphs are widely used to model complicated data semantics in many applications in bioinformatics, chemistry, social networks, pattern recognition, etc. A recent trend is to tolerate noise arising from various sources, such as erroneous data entry, and find similarity matches. In this paper, we study the graph similarity join problem that returns pairs of graphs such that their edit distances are no larger than a threshold. Inspired by the q-gram idea for the string similarity problem, our solution extracts paths from graphs as features for indexing. We establish a lower bound on the number of common features required to generate candidates. An efficient algorithm is proposed to exploit both matching and mismatching features to improve the filtering and verification of candidates. We demonstrate through extensive experiments on publicly available datasets that the proposed algorithm significantly outperforms existing approaches.
【Keywords】: data analysis; graph theory; indexing; information filtering; query processing; data semantics; edit distance constraints; filtering techniques; graph similarity join problem; indexing; q-gram idea; string similarity problem; verification techniques; Approximation algorithms; Australia; Complexity theory; Educational institutions; Filtering; Greedy algorithms; Indexes
【Paper Link】 【Pages】:846-857
【Authors】: Shaoxu Song ; Lei Chen ; Hong Cheng
【Abstract】: The importance of introducing distance constraints to data dependencies, such as differential dependencies (DDs) [28], has recently been recognized. Metric distance constraints are tolerant to small variations, which enables them to apply to a wide range of data quality checking applications, such as detecting data violations. However, the determination of distance thresholds for metric distance constraints is non-trivial. It often relies on a truth data instance which embeds the distance constraints. Several statistical measures, e.g., support, confidence and dependent quality, serve as guidelines for finding useful distance threshold patterns from data. Unfortunately, given a data instance, users might not have any knowledge about the data distribution; thus, it is very challenging to set the right parameters. In this paper, we study the determination of distance thresholds for metric distance constraints in a parameter-free style. Specifically, we compute an expected utility based on the statistical measures from the data. According to our analysis as well as experimental verification, distance threshold patterns with higher expected utility offer better usage in real applications, such as violation detection. We then develop efficient algorithms to determine the distance thresholds having the maximum expected utility. Finally, our extensive experimental evaluation demonstrates the effectiveness and efficiency of the proposed methods.
【Keywords】: security of data; statistical analysis; data dependencies; data distribution; data violation detection; differential dependencies; distance thresholds parameter-free determination; metric distance constraints; Algorithm design and analysis; Association rules; Cleaning; Educational institutions; Lakes; Measurement; Pattern matching
【Paper Link】 【Pages】:858-869
【Authors】: Wush Chi-Hsuan Wu ; Mi-Yen Yeh ; Jian Pei
【Abstract】: Errors in measurement can be categorized into two types: systematic errors that are predictable, and random errors that are inherently unpredictable and have null expected value. Random error is always present in a measurement. More often than not, readings in time series may contain inherent random errors due to causes like dynamic error, drift, noise, hysteresis, digitalization error and limited sampling frequency. Random errors may affect the quality of time series analysis substantially. Unfortunately, most of the existing time series mining and analysis methods, such as similarity search, clustering, and classification tasks, do not address random errors, possibly because random error in a time series, which can be modeled as a random variable of unknown distribution, is hard to handle. In this paper, we tackle this challenging problem. Taking similarity search as an example, which is an essential task in time series analysis, we develop MISQ, a statistical approach for random error reduction in time series analysis. The major intuition in our method is to use only the readings at different time instants in a time series to reduce random errors. We achieve a highly desirable property in MISQ: it can ensure that the recall is above a user-specified threshold. An extensive empirical study on 20 benchmark real data sets clearly shows that our method can lead to better performance than the baseline method without random error reduction in real applications such as classification. Moreover, MISQ achieves good quality in similarity search.
【Keywords】: data mining; pattern classification; pattern clustering; time series; MISQ; classification task; clustering task; digitalization error; drift; dynamic error; hysteresis; limited sampling frequency; noise; random error reduction; similarity search; statistical approach; systematic error; time series analysis; time series mining; user-specified threshold; Error analysis; Measurement uncertainty; Random variables; Reactive power; Time measurement; Time series analysis
【Paper Link】 【Pages】:870-881
【Authors】: Fei Chen ; Xixuan Feng ; Christopher Re ; Min Wang
【Abstract】: Statistical information extraction (IE) programs are increasingly used to build real-world IE systems such as Alibaba, CiteSeer, Kylin, and YAGO. Current statistical IE approaches consider the text corpora underlying the extraction program to be static. However, many real-world text corpora are dynamic (documents are inserted, modified, and removed). As the corpus evolves, IE programs must be applied repeatedly to consecutive corpus snapshots to keep the extracted information up to date. Applying IE from scratch to each snapshot may be inefficient: a pair of consecutive snapshots may change very little, but unaware of this, the program must run again from scratch. In this paper, we present CRFlex, a system that efficiently executes such repeated statistical IE by recycling previous IE results to enable incremental update. As a first step, CRFlex focuses on statistical IE programs which use a leading statistical model, Conditional Random Fields (CRFs). We show how to model properties of the CRF inference algorithms for incremental update and how to exploit them to correctly recycle previous inference results. Then we show how to efficiently capture and store intermediate results of IE programs for subsequent recycling. We find that there is a tradeoff between the I/O cost spent on reading and writing intermediate results and the CPU cost we can save by recycling them. Therefore we present a cost-based solution to determine the most efficient recycling approach for any given CRF-based IE program and evolving corpus. We conduct extensive experiments with CRF-based IE programs for 3 IE tasks over a real-world data set to demonstrate the utility of our approach.
【Keywords】: information retrieval; input-output programs; statistical analysis; text analysis; CRFlex; I/O cost; IE programs; conditional random fields; evolving text; real-world IE systems; statistical information extraction programs; text corpora; Context; Data mining; Feature extraction; Labeling; Recycling; Vectors; Viterbi algorithm
【Paper Link】 【Pages】:882-893
【Authors】: Chong Sun ; Jeffrey F. Naughton ; Siddharth Barman
【Abstract】: We consider the approximate string membership checking (ASMC) problem of extracting all the strings or substrings in a document that approximately match some string in a given dictionary. To solve this problem, the current state-of-the-art approach involves first applying an approximate, fast filter, then applying a more expensive exact verification algorithm to the strings that pass the filter. Correspondingly, many string filters have been proposed. We note that different filters are good at eliminating different strings, depending on the characteristics of the strings in both the documents and the dictionary. We suspect that no single filter will dominate all other filters everywhere. Given an ASMC problem instance and a set of string filters, we need to select the optimal filter to maximize performance. Furthermore, in our experiments we found that in some cases a sequence of filters dominates any of the filters of the sequence in isolation, and that the best set of filters and their ordering depend upon the specific problem instance encountered. Accordingly, we propose that the approximate match problem be viewed as an optimization problem, and we evaluate a number of techniques for solving this optimization problem.
【Keywords】: document handling; formal verification; information filters; string matching; ASMC; approximate fast filter; approximate string membership checking problem; document substring; exact verification algorithm; multiple filter optimization-based approach; optimal filter; optimization problem; string filters; string matching; Approximation algorithms; Approximation methods; Dictionaries; Estimation; Matched filters; Optimization; Pipelines
【Paper Link】 【Pages】:894-904
【Authors】: Charu C. Aggarwal ; Yuchen Zhao ; Philip S. Yu
【Abstract】: Text clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data which is available in various forms in online forums such as the web, social networks, and other information networks. In most cases, the data is not purely available in text form. A lot of side-information is available along with the text documents. Such side-information may be of different kinds, such as the links in the document, user-access behavior from web logs, or other non-textual attributes which are embedded into the text document. Such attributes may contain a tremendous amount of information for clustering purposes. However, the relative importance of this side-information may be difficult to estimate, especially when some of the information is noisy. In such cases, it can be risky to incorporate side-information into the clustering process, because it can either improve the quality of the representation for clustering, or can add noise to the process. Therefore, we need a principled way to perform the clustering process, so as to maximize the advantages from using this side information. In this paper, we design an algorithm which combines classical partitioning algorithms with probabilistic models in order to create an effective clustering approach. We present experimental results on a number of real data sets in order to illustrate the advantages of using such an approach.
【Keywords】: Internet; pattern clustering; probability; social networking (online); text analysis; Web logs; classical partitioning algorithms; information networks; nontextual attributes; online forums; probabilistic models; side information; social networks; text clustering; text documents; unstructured data; user-access behavior; Approximation methods; Clustering algorithms; Coherence; Context; Noise measurement; Partitioning algorithms; Probabilistic logic
【Paper Link】 【Pages】:905-916
【Authors】: Junfeng Zhou ; Zhifeng Bao ; Wei Wang ; Tok Wang Ling ; Ziyang Chen ; Xudong Lin ; Jingfeng Guo
【Abstract】: In this paper, we focus on efficient keyword query processing for XML data based on the SLCA and ELCA semantics. We propose a novel form of inverted lists for keywords which include IDs of nodes that directly or indirectly contain a given keyword. We propose a family of efficient algorithms that are based on the set intersection operation for both semantics. We show that the problem of SLCA/ELCA computation becomes finding a set of nodes that appear in all involved inverted lists and satisfy certain conditions. We also propose several optimization techniques to further improve the query processing performance. We have conducted extensive experiments with many alternative methods. The results demonstrate that our proposed methods outperform previous methods by up to two orders of magnitude in many cases.
【Keywords】: XML; programming language semantics; query processing; set theory; ELCA computation; ELCA semantics; SLCA computation; SLCA semantics; XML data; XML keyword queries; exclusive lowest common ancestor; keyword query processing; optimization techniques; set intersection operation; smallest lowest common ancestor; Arrays; Complexity theory; Educational institutions; Probes; Query processing; Semantics; XML
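The SLCA semantics over the paper's inverted lists (which contain every node that directly or indirectly contains a keyword) can be illustrated by a brute-force intersection: a node is an SLCA iff it lies in the intersection of all keyword lists and none of its children does. The tiny tree below is hypothetical, and the paper's algorithms avoid this exhaustive scan.

    def slca(children, inverted, keywords):
        common = set.intersection(*(inverted[k] for k in keywords))
        return [n for n in common
                if not any(c in common for c in children.get(n, []))]

    #        1
    #      /   \
    #     2     5
    #    / \     \
    #   3   4     6        node 3: "xml", node 4: "query", node 6: "xml query"
    children = {1: [2, 5], 2: [3, 4], 5: [6]}
    inverted = {"xml":   {3, 2, 1, 6, 5},   # nodes containing "xml", directly or via a descendant
                "query": {4, 2, 1, 6, 5}}
    print(sorted(slca(children, inverted, ["xml", "query"])))   # -> [2, 6]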
【Paper Link】 【Pages】:917-928
【Authors】: Nikolay Laptev ; Carlo Zaniolo
【Abstract】: Complex pattern queries play a critical role in many applications that must efficiently search databases and data streams. Current techniques support the search for multiple patterns using deterministic or non-deterministic automata. In practice, however, the static pattern representation does not fully utilize available system resources and consequently suffers from poor performance. Therefore, a low-overhead, auto-reconfigurable automaton is needed that optimizes pattern matching performance. In this paper, we propose a dynamic system that entails the efficient and reliable evaluation of a very large number of pattern queries on a resource-constrained system under changing stress-load. Our system prototype, Morpheus, pre-computes several query pattern representations, named templates, which are then morphed into a required form during run-time. Morpheus uses templates to speed up dynamic automaton reconfiguration. Results from empirical studies confirm the benefits of our approach, with three orders of magnitude improvement achieved in overall pattern matching performance with the help of dynamic reconfiguration. This is accomplished with only a modest increase in amortized memory usage.
【Keywords】: data structures; deterministic automata; pattern matching; query processing; autoreconfigurable automaton; data stream; database searching; deterministic automata; dynamic automaton reconfiguration; dynamic configuration morphing; massive pattern query optimization; nondeterministic automata; pattern matching; query pattern representation; resource constrained system; static pattern representation; Automata; Doped fiber amplifiers; Dynamic scheduling; Engines; Equations; Optimization; Pattern matching
【Paper Link】 【Pages】:929-940
【Authors】: Shenoda Guirguis ; Mohamed A. Sharaf ; Panos K. Chrysanthis ; Alexandros Labrinidis
【Abstract】: Aggregate Continuous Queries (ACQs) are a very popular class of Continuous Queries (CQs) and have a potentially high execution cost. As such, optimizing the processing of ACQs is imperative for Data Stream Management Systems (DSMSs) to reach their full potential in supporting (critical) monitoring applications. For multiple ACQs that vary in window specifications and pre-aggregation filters, existing multiple-ACQ optimization schemes assume a processing model where each ACQ is computed as a final-aggregation of a sub-aggregation. In this paper, we propose a novel processing model for ACQs, called TriOps, with the goal of minimizing the repetition of operator execution at the sub-aggregation level. We also propose TriWeave, a TriOps-aware multi-query optimizer. We analytically and experimentally demonstrate the performance gains of our proposed schemes, showing their superiority over alternative schemes. Finally, we generalize TriWeave to incorporate classical subsumption-based multi-query optimization techniques.
【Keywords】: data analysis; database management systems; query processing; ACQ processing optimization; DSMS; TriOps-aware multiquery optimizer; TriWeave; critical monitoring application; data stream management system; execution cost; multiple aggregate continuous queries; operator execution repetition minimization; preaggregation filter; processing model; subsumption-based multiquery optimization technique; three-level processing; window specification; Aggregates; Computational modeling; Equations; Mathematical model; Monitoring; Optimization; Weaving
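The final-aggregation-over-sub-aggregation processing model mentioned above can be sketched with panes: tuples are pre-aggregated into fixed-size panes, and each window answer combines the pane aggregates it covers. This is only the baseline model that TriOps/TriWeave optimize, not the paper's technique; the stream values and window settings below are hypothetical.

    def pane_sums(stream, pane_size):
        """Sub-aggregation: one partial SUM per pane of `pane_size` tuples."""
        return [sum(stream[i:i + pane_size])
                for i in range(0, len(stream), pane_size)]

    def window_sums(panes, panes_per_window, panes_per_slide):
        """Final aggregation: combine pane aggregates for each window instance."""
        out = []
        for start in range(0, len(panes) - panes_per_window + 1, panes_per_slide):
            out.append(sum(panes[start:start + panes_per_window]))
        return out

    stream = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
    panes = pane_sums(stream, pane_size=2)             # shared by both queries below
    print(window_sums(panes, panes_per_window=3, panes_per_slide=1))  # window 6 tuples, slide 2
    print(window_sums(panes, panes_per_window=2, panes_per_slide=2))  # window 4 tuples, slide 4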
【Paper Link】 【Pages】:941-952
【Authors】: Farhan Tauheed ; Laurynas Biveinis ; Thomas Heinis ; Felix Schürmann ; Henry Markram ; Anastasia Ailamaki
【Abstract】: Neuroscientists increasingly use computational tools in building and simulating models of the brain. The amounts of data involved in these simulations are immense and efficiently managing this data is key. One particular problem in analyzing this data is the scalable execution of range queries on spatial models of the brain. Known indexing approaches do not perform well even on today's small models, which represent a small fraction of the brain and contain only a few million densely packed spatial elements. The problem with current approaches is that as the level of detail in the models increases, so does the overlap in the tree structure, ultimately slowing down query execution. The neuroscientists' need to work with bigger and more detailed (denser) models thus motivates us to develop a new indexing approach. To this end we develop FLAT, a scalable indexing approach for dense data sets. We base the development of FLAT on the key observation that current approaches suffer from overlap in the case of dense data sets. We hence design FLAT as an approach with two phases, each independent of density. In the first phase it uses a traditional spatial index to retrieve an initial object efficiently. In the second phase it traverses the initial object's neighborhood to retrieve the remaining query result. Our experimental results show that FLAT not only outperforms R-Tree variants by a factor of two to eight, but also achieves independence from data set size and density.
【Keywords】: brain models; data analysis; indexing; neurophysiology; query processing; tree data structures; FLAT; I-O overhead; R-tree variants; brain simulations; computational tools; data analysis; data set size; dense data sets; densely packed spatial elements; indexing approaches; neuroscientists; query execution; range queries acceleration; scalable indexing approach; spatial brain models; tree structure; Brain models; Computational modeling; Data structures; Indexing; Neurons
【Paper Link】 【Pages】:953-964
【Authors】: Junjie Yao ; Bin Cui ; Liansheng Hua ; Yuxin Huang
【Abstract】: Textual web pages dominate web search engines nowadays. However, there is also a striking increase in structured data on the web. Efficient keyword query processing on structured data has attracted considerable attention, but effective query understanding has yet to be investigated. In this paper, we focus on the problem of keyword query reformulation in the structured data scenario. These reformulated queries provide alternative descriptions of the original input. They can better capture users' information needs and guide users to explore related items in the target structured data. We propose an automatic keyword query reformulation approach that exploits structural semantics in the underlying structured data sources. The reformulation solution is decomposed into two stages, i.e., offline term relation extraction and online query generation. We first utilize a heterogeneous graph to model the words and items in structured data, and design an enhanced Random Walk approach to extract relevant terms from the graph context. In the online query reformulation stage, we introduce an efficient probabilistic generation module to suggest substitutable reformulated queries. Extensive experiments are conducted on a real-life data set, and our approach yields promising results.
【Keywords】: Internet; data structures; graph theory; query processing; Web search engines; automatic keyword query reformulation approach; heterogenous graph; keyword query processing; structural semantics; structured data; textual Web pages; Context; Data mining; Data models; Probabilistic logic; Semantics; Vectors; Web search
【Paper Link】 【Pages】:965-976
【Authors】: Po-Yuen Wong ; Tak-Ming Chan ; Man Hon Wong ; Kwong-Sak Leung
【Abstract】: The studies of protein-DNA bindings between transcription factors (TFs) and transcription factor binding sites (TFBSs) are important bioinformatics topics. High-resolution (length<;10) TF-TFBS binding cores are discovered by expensive and time-consuming 3D structure experiments. Recent association rule mining approaches on low-resolution binding sequences (TF length>;490) are shown promising in identifying accurate binding cores without using any 3D structures. While the current association rule mining method on this problem addresses exact sequences only, the most recent ad hoc method for approximation does not establish any formal model and is limited by experimentally known patterns. As biological mutations are common, it is desirable to formally extend the exact model into an approximate one. In this paper, we formalize the problem of mining approximate protein-DNA association rules from sequence data and propose a novel efficient algorithm to predict protein-DNA binding cores. Our two-phase algorithm first constructs two compact intermediate structures called frequent sequence tree (FS-Tree) and frequent sequence class tree (FSCTree). Approximate association rules are efficiently generated from the structures and bioinformatics concepts (position weight matrix and information content) are further employed to prune meaningless rules. Experimental results on real data show the performance and applicability of the proposed algorithm.
【Keywords】: DNA; bioinformatics; data mining; matrix algebra; proteins; trees (mathematics); 3D structure experiment; FS-Tree; FSCTree; TFBS; approximate protein-DNA binding cores; association rule mining; bioinformatics; biological mutation; compact intermediate structure; frequent sequence class tree; frequent sequence tree; high-resolution binding cores; information content; low-resolution binding sequences; position weight matrix; protein-DNA association rules; transcription factor binding sites; Approximation methods; Association rules; Databases; Proteins; Pulse width modulation
【Paper Link】 【Pages】:977-988
【Authors】: Hua Lu ; Christian S. Jensen
【Abstract】: The skyline of a multidimensional point set consists of the points that are not dominated by other points. In a scenario where product features are represented by multidimensional points, the skyline points may be viewed as representing competitive products. A product provider may wish to upgrade uncompetitive products to become competitive, but wants to take into account the upgrading cost. We study the top-k product upgrading problem. Given a set P of competitor products, a set T of products that are candidates for upgrade, and an upgrading cost function f that applies to T, the problem is to return the k products in T that can be upgraded to not be dominated by any product in P at the lowest cost. This problem is non-trivial due not only to the large data set sizes, but also to the many possibilities for upgrading a product. We identify and provide solutions for the different options for upgrading an uncompetitive product, and combine them into a single solution. We also propose a spatial join-based solution that assumes P and T are indexed by an R-tree. Given a set of products in the same R-tree node, we derive three lower bounds on their upgrading costs. These bounds are employed by the join approach to prune upgrade candidates with uncompetitive upgrade costs. Empirical studies with synthetic and real data show that the join approach is efficient and scalable.
【Keywords】: product quality; trees (mathematics); R-tree; lower bound; multidimensional point set; product features; skyline points; top-k product upgrading problem; uncompetitive products; upgrading cost; Cameras; Cellular phones; Computer science; Cost function; Educational institutions; Manufacturing; Sorting
【Paper Link】 【Pages】:989-1000
【Authors】: Yu Peng ; Raymond Chi-Wing Wong ; Liangliang Ye ; Philip S. Yu
【Abstract】: Sequence analysis is very important in our daily life. Typically, each sequence is associated with an ordered list of elements. For example, in a movie rental application, a customer's movie rental record containing an ordered list of movies is a sequence example. Most studies about sequence analysis focus on subsequence matching, which finds all sequences stored in the database such that a given query sequence is a subsequence of each of these sequences. In many applications, elements are associated with properties or attributes. For example, each movie is associated with attributes like "Director" and "Actors". Unfortunately, to the best of our knowledge, all existing studies about sequence analysis do not consider the attributes of elements. In this paper, we propose two problems. The first problem is: given a query sequence and a set of sequences, considering the attributes of elements, find all sequences which are matched by this query sequence. This problem is called attribute-based subsequence matching (ASM). All existing applications for the traditional subsequence matching problem can also be applied to our new problem, provided that we are given the attributes of elements. We propose an efficient algorithm for problem ASM. The key to the efficiency of this algorithm is to compress each whole sequence, with its potentially many associated attributes, into just a triplet of numbers. By dealing with these very compressed representations, we greatly speed up attribute-based subsequence matching. The second problem is to find all frequent attribute-based subsequences. We adapt an existing efficient algorithm for this second problem to show that we can reuse the algorithm developed for the first problem. Empirical studies show that our algorithms are scalable on large datasets. In particular, our algorithms run at least an order of magnitude faster than a straightforward method in most cases. This work can benefit a number of existing data mining problems which are fundamentally based on subsequence matching, such as sequence classification, frequent sequence mining, motif detection and sequence matching in bioinformatics.
【Keywords】: data analysis; data mining; query processing; sequences; attribute-based subsequence matching; attribute-based subsequence mining; bioinformatics; data mining problems; frequent attribute-based subsequence; frequent sequence mining; motif detection; movie rental application; movie rental record; query sequence; sequence analysis; sequence classification; sequence matching; Approximation algorithms; Bioinformatics; Biology; Communities; Data mining; Equations; Motion pictures
【Paper Link】 【Pages】:1001-1012
【Authors】: Dhaval Patel ; Wynne Hsu ; Mong-Li Lee
【Abstract】: Many frequent pattern mining algorithms have been developed for categorical, numerical, time series, or interval data. However, little attention has been given to integrating these algorithms so as to mine frequent patterns from multiple-domain datasets for classification. In this paper, we introduce the notion of a heterogeneous pattern to capture the associations among different kinds of data. We propose a unified framework for mining multiple-domain datasets and design an iterative algorithm called HTMiner. HTMiner discovers essential heterogeneous patterns for classification and performs instance elimination. This instance elimination step reduces the problem size progressively by removing training instances which are correctly covered by the discovered essential heterogeneous patterns. Experiments on two real-world datasets show that HTMiner is efficient and can significantly improve the classification accuracy.
【Keywords】: data mining; iterative methods; pattern classification; HTMiner; categorical data; classification; frequent pattern mining integration; heterogenous pattern discovery; instance elimination step; interval data; iterative algorithm; multiple domain dataset mining; numerical data; time series data; Accuracy; Algorithm design and analysis; Blood pressure; Data mining; Indexes; Itemsets; Time series analysis
【Paper Link】 【Pages】:1013-1024
【Authors】: Adam Seering ; Philippe Cudré-Mauroux ; Samuel Madden ; Michael Stonebraker
【Abstract】: In this paper, we describe a versioned database storage manager we are developing for the SciDB scientific database. The system is designed to efficiently store and retrieve array-oriented data, exposing a "no-overwrite" storage model in which each update creates a new "version" of an array. This makes it possible to perform comparisons of versions produced at different times or by different algorithms, and to create complex chains and trees of versions. We present algorithms to efficiently encode these versions, minimizing storage space while still providing efficient access to the data. Additionally, we present an optimal algorithm that, given a long sequence of versions, determines which versions to encode in terms of each other (using delta compression) to minimize total storage space or query execution cost. We compare the performance of these algorithms on real world data sets from the National Oceanic and Atmospheric Administration (NOAA), Open Street Maps, and several other sources. We show that our algorithms provide better performance than existing version control systems not optimized for array data, both in terms of storage size and access time, and that our delta-compression algorithms are able to substantially reduce the total storage space when versions exist with a high degree of similarity.
【Keywords】: configuration management; data compression; database management systems; query processing; scientific information systems; storage management; NOAA; National Oceanic and Atmospheric Administration; SciDB scientific database; access time; array-oriented data; data access; delta compression; delta-compression algorithm; no-overwrite storage model; open street maps; optimal algorithm; query execution cost; scientific array database; storage size; storage space; version control system; versioned database storage manager; Arrays; Data models; Databases; Encoding; Image coding; Layout; Prototypes
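One classic way to decide which versions to store in full and which to delta-encode is a minimum spanning tree over a graph with a virtual root (edges to the root cost a full materialization; edges between versions cost the delta). This is a hedged simplification, not the paper's optimal algorithm (which also handles version sequences and query cost); all sizes below are hypothetical and deltas are assumed symmetric.

    import heapq

    def cheapest_encoding(full_sizes, delta_sizes):
        """full_sizes[v]: bytes to store v outright.
        delta_sizes[(u, v)]: bytes to store v as a delta of u (assumed symmetric)."""
        versions = list(full_sizes)
        adj = {v: [("ROOT", full_sizes[v])] for v in versions}
        for (u, v), d in delta_sizes.items():
            adj[u].append((v, d))
            adj[v].append((u, d))
        adj["ROOT"] = [(v, full_sizes[v]) for v in versions]
        # Prim's algorithm from the virtual root.
        parent, seen = {}, {"ROOT"}
        heap = [(w, "ROOT", v) for v, w in adj["ROOT"]]
        heapq.heapify(heap)
        while heap and len(seen) <= len(versions):
            w, u, v = heapq.heappop(heap)
            if v in seen:
                continue
            seen.add(v)
            parent[v] = u                      # u == "ROOT" means: materialize v in full
            for x, wx in adj[v]:
                if x not in seen:
                    heapq.heappush(heap, (wx, v, x))
        return parent

    full = {"v1": 100, "v2": 100, "v3": 95}
    deltas = {("v1", "v2"): 8, ("v2", "v3"): 60, ("v1", "v3"): 70}
    print(cheapest_encoding(full, deltas))     # v3 stored in full, v2 as a delta of v3, v1 as a delta of v2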
【Paper Link】 【Pages】:1025-1036
【Authors】: Lu An Tang ; Xiao Yu ; Sangkyum Kim ; Jiawei Han ; Wen-Chih Peng ; Yizhou Sun ; Hector Gonzalez ; Sebastian Seith
【Abstract】: A Cyber-Physical System (CPS) integrates physical devices (e.g., sensors, cameras) with cyber (or informational) components to form a situation-integrated analytical system that may respond intelligently to dynamic changes in real-world situations. CPS has many promising applications, such as traffic observation, battlefield surveillance and sensor-network based monitoring. One important research topic in CPS is atypical event analysis, i.e., retrieving events from large amounts of data and analyzing them with spatial, temporal and other multi-dimensional information. Many traditional approaches are not feasible for such analysis since they use numeric measures and cannot describe complex atypical events. In this study, we propose a new model of atypical clusters to effectively represent those events and efficiently retrieve them from massive data. The micro-cluster is designed to summarize individual events, and the macro-cluster is used to integrate information from multiple events. To facilitate scalable, flexible and online analysis, the concept of a significant cluster is defined and a guided clustering algorithm is proposed to retrieve significant clusters in an efficient manner. We conduct experiments on real datasets of more than 50 GB; the results show that the proposed method provides more accurate information at only 15% to 20% of the time cost of the baselines.
【Keywords】: data analysis; pattern clustering; traffic information systems; CPS; atypical cluster; atypical events; battlefield surveillance; clustering algorithm; cyber-physical data; cyber-physical system; information integration; macrocluster; microcluster; multidimensional analysis; numeric measures; sensor-network based monitoring; situation-integrated analytical system; traffic observation; Clustering algorithms; Complexity theory; Indexes; Monitoring; Query processing; Roads
【Paper Link】 【Pages】:1037-1048
【Authors】: Fabian Keller ; Emmanuel Müller ; Klemens Böhm
【Abstract】: Outlier mining is a major task in data analysis. Outliers are objects that deviate strongly from regular objects in their local neighborhood. Density-based outlier ranking methods score each object based on its degree of deviation. In many applications, these ranking methods degenerate to random listings due to low contrast between outliers and regular objects. Outliers do not show up in the scattered full space; they are hidden in multiple high-contrast subspace projections of the data. Measuring the contrast of such subspaces for outlier rankings is an open research challenge. In this work, we propose a novel subspace search method that selects high-contrast subspaces for density-based outlier ranking. It is designed as a pre-processing step to outlier ranking algorithms. It searches for high-contrast subspaces with a significant amount of conditional dependence among the subspace dimensions. With our approach, we propose a first measure for the contrast of subspaces. Thus, we enhance the quality of traditional outlier rankings by computing outlier scores in high-contrast projections only. The evaluation on real and synthetic data shows that our approach outperforms traditional dimensionality reduction techniques, naive random projections, as well as state-of-the-art subspace search techniques, and provides enhanced quality for outlier ranking.
【Keywords】: data analysis; data mining; conditional dependence; data analysis; density-based outlier ranking; high contrast subspace projection; outlier mining; outlier ranking algorithm; quality enhancement; subspace dimension; subspace search method; Atmospheric measurements; Correlation; Data mining; Density measurement; Joints; Noise level; Probability density function
【Paper Link】 【Pages】:1049-1060
【Authors】: Yang Zhang ; Srinivasan Parthasarathy
【Abstract】: Cliques are topological structures that usually provide important information for understanding the structure of a graph or network. However, detecting and extracting cliques efficiently is known to be very hard. In this paper, we define and introduce the notion of a Triangle K-Core, a simpler topological structure and one that is more tractable and can moreover be used as a proxy for extracting clique-like structure from large graphs. Based on this definition we first develop a localized algorithm for extracting Triangle K-Cores from large graphs. Subsequently we extend the simple algorithm to accommodate dynamic graphs (where edges can be dynamically added and deleted). Finally, we extend the basic definition to support various template pattern cliques with applications to network visualization and event detection on graphs and networks. Our empirical results reveal the efficiency and efficacy of the proposed methods on many real world datasets.
【Keywords】: computational complexity; graph theory; Triangle K-Core motifs analysis; Triangle K-Core motifs extraction; Triangle K-Core motifs visualization; clique detection; clique extraction; clique-like structure; dynamic graphs; event detection; graph structure; network structure; network visualization; proxy; template pattern cliques; topological structures; Communities; Complexity theory; Gold; Heuristic algorithms; Image edge detection; Upper bound; Visualization
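The Triangle K-Core itself can be computed by a simple peeling procedure: repeatedly drop edges whose triangle support in the remaining subgraph falls below K. The sketch below does this by global recomputation and is only an illustration; the paper's localized and dynamic algorithms are far more efficient. The example graph is hypothetical.

    def triangle_k_core(edges, K):
        edges = {frozenset(e) for e in edges}
        changed = True
        while changed:
            changed = False
            adj = {}
            for e in edges:
                u, v = tuple(e)
                adj.setdefault(u, set()).add(v)
                adj.setdefault(v, set()).add(u)
            for e in list(edges):
                u, v = tuple(e)
                support = len(adj[u] & adj[v])   # common neighbours = triangles containing (u, v)
                if support < K:
                    edges.discard(e)
                    changed = True
        return edges

    g = [(1, 2), (2, 3), (1, 3), (1, 4), (2, 4), (3, 4), (4, 5)]
    # The K4 on {1,2,3,4} survives for K=2; the pendant edge (4,5) is peeled away.
    print(sorted(tuple(sorted(e)) for e in triangle_k_core(g, K=2)))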
【Paper Link】 【Pages】:1061-1072
【Authors】: Min Soo Kim ; Kyu-Young Whang ; Yang-Sae Moon
【Abstract】: Dimensionality reduction is essential in text mining since the dimensionality of text documents could easily reach several tens of thousands. Most recent efforts on dimensionality reduction, however, are not adequate for large document databases due to lack of scalability. We hence propose a new type of simple but effective dimensionality reduction, called horizontal (dimensionality) reduction, for large document databases. Horizontal reduction converts each text document to a few bitmap vectors and provides tight lower bounds on inter-document distances using those bitmap vectors. Bitmap representation is very simple and extremely fast, and its instance-based nature makes it suitable for large and dynamic document databases. Using the proposed horizontal reduction, we develop an efficient k-nearest neighbor (k-NN) search algorithm for text mining tasks such as classification and clustering, and we formally prove its correctness. The proposed algorithm decreases I/O and CPU overheads simultaneously since horizontal reduction (1) significantly reduces the number of accesses to documents by exploiting the bitmap-based lower bounds to filter dissimilar documents at an early stage, and accordingly, (2) decreases the number of CPU-intensive computations needed to obtain a real distance between high-dimensional document vectors. Extensive experimental results show that horizontal reduction improves the performance of the reduction (preprocessing) process by one to two orders of magnitude compared with existing reduction techniques, and our k-NN search algorithm significantly outperforms existing ones by one to three orders of magnitude.
【Keywords】: data mining; database management systems; pattern clustering; text analysis; Bitmap representation; bitmap vectors; horizontal reduction; instance-level dimensionality reduction; inter-document distances; k-NN search algorithm; k-nearest neighbor search algorithm; large document databases; similarity search; text documents; text mining; Complexity theory; Discrete Fourier transforms; Euclidean distance; Gold; Large scale integration; Text mining; Vectors; fuzzy system; query interface; user-centric application
【Paper Link】 【Pages】:1073-1083
【Authors】: Uwe Draisbach ; Felix Naumann ; Sascha Szott ; Oliver Wonneberg
【Abstract】: Duplicate detection is the task of identifying all groups of records within a data set that represent the same real-world entity. This task is difficult, because (i) representations might differ slightly, so some similarity measure must be defined to compare pairs of records, and (ii) data sets might have a high volume, making a pair-wise comparison of all records infeasible. To tackle the second problem, many algorithms have been suggested that partition the data set and compare all record pairs only within each partition. One well-known such approach is the Sorted Neighborhood Method (SNM), which sorts the data according to some key and then advances a window over the data, comparing only records that appear within the same window. We propose the Duplicate Count Strategy (DCS), a variation of SNM that uses a varying window size. It is based on the intuition that there might be regions of high similarity suggesting a larger window size and regions of lower similarity suggesting a smaller window size. Besides the basic variant of DCS, we also propose and thoroughly evaluate a variant called DCS++, which is provably better than the original SNM in terms of efficiency (same results with fewer comparisons).
【Keywords】: records management; sorting; DCS++; SNM; adaptive windows; data sets; duplicate count strategy; duplicate detection; pairwise comparison; real-world entity; sorted neighborhood method; Conferences; Couplings; Data engineering; Object recognition; Sorting; Standards; Volume measurement
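【Illustrative Sketch】: A minimal Python sketch of a sorted-neighborhood scan with an adaptively sized window in the spirit of DCS; the sort key, similarity predicate, and window-growth rule are placeholders and do not reproduce the exact DCS++ rules or its efficiency guarantee.

def adaptive_sorted_neighborhood(records, key, similar, w_min=3, w_max=20,
                                 growth_threshold=0.3):
    """Slide a window over the sorted records; grow the window while the
    duplicate density inside it stays high (a region of similar records)."""
    records = sorted(records, key=key)
    duplicates = set()
    for i in range(len(records)):
        window, found, j = w_min, 0, i + 1
        while j < len(records) and j <= i + window:
            if similar(records[i], records[j]):
                duplicates.add((i, j))
                found += 1
                if found / (j - i) >= growth_threshold and window < w_max:
                    window += 1  # enlarge the window in a high-similarity region
            j += 1
    return duplicates

# Example call with toy records and a naive similarity predicate.
people = [{"name": "Jon Doe"}, {"name": "John Doe"}, {"name": "Ann Lee"}]
print(adaptive_sorted_neighborhood(people, key=lambda r: r["name"],
                                   similar=lambda a, b: a["name"][0] == b["name"][0]))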
【Paper Link】 【Pages】:1084-1095
【Authors】: Jongwuk Lee ; Hyunsouk Cho ; Seung-won Hwang
【Abstract】: Top-k queries have gained considerable attention as an effective means for narrowing down the overwhelming amount of data. This paper studies the problem of constructing an indexing structure that efficiently supports top-k queries for varying scoring functions and retrieval sizes. The existing work can be categorized into three classes: list-, layer-, and view-based approaches. This paper focuses on the layer-based approach, which pre-materializes tuples into multiple consecutive layers. The layer-based index enables us to return top-k answers efficiently by restricting access to tuples in the k layers. However, we observe that the number of tuples accessed in each layer can be reduced further. For this purpose, we propose a dual-resolution layer structure. Specifically, we iteratively build coarse-level layers using skylines, and divide each coarse-level layer into fine-level sublayers using convex skylines. The dual-resolution layer is able to leverage not only the dominance relationship between coarse-level layers, named for-all-dominance, but also a relaxed dominance relationship between fine-level sublayers, named exists-dominance. Our extensive evaluation results demonstrate that our proposed method significantly reduces the number of tuples accessed compared with state-of-the-art methods.
【Keywords】: indexing; query processing; convex skylines; efficient dual-resolution layer indexing; exists-dominance; fine-level sublayers; indexing structure; layer-based approach; relaxed dominance relationship; retrieval sizes; scoring functions; top-k queries; Indexing; Query processing; TV; Vectors
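【Illustrative Sketch】: A small Python sketch of the coarse-level layering step only: peeling successive skyline layers from a set of numeric tuples (here larger attribute values are better). The convex-skyline sublayers and the for-all/exists dominance tests of the paper are not reproduced; assuming a linear scoring function with non-negative weights, the top-k answers are contained in the first k layers.

def dominates(a, b):
    """a dominates b if a is at least as good in every attribute and strictly
    better in at least one (maximization convention)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points)]

def skyline_layers(points):
    """Iteratively peel skylines off the remaining tuples."""
    layers, remaining = [], list(points)
    while remaining:
        layer = skyline(remaining)
        layers.append(layer)
        remaining = [p for p in remaining if p not in layer]
    return layers

print(skyline_layers([(1, 5), (4, 4), (5, 1), (2, 2), (3, 3)]))
# [[(1, 5), (4, 4), (5, 1)], [(3, 3)], [(2, 2)]]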
【Paper Link】 【Pages】:1096-1107
【Authors】: Reynold Cheng ; Jian Gong ; David W. Cheung ; Jiefeng Cheng
【Abstract】: A matching between two database schemas, generated by machine learning techniques (e.g., COMA++), is often uncertain. Handling the uncertainty of schema matching has recently raised a lot of research interest, because the quality of applications relies on the matching result. We study query evaluation over an inexact schema matching, which is represented as a set of "possible mappings", as well as the probabilities that they are correct. Since the number of possible mappings can be large, evaluating queries through these mappings can be expensive. By observing that the possible mappings between two schemas often exhibit a high degree of overlap, we develop two efficient solutions. We also present a fast algorithm to compute answers with the k highest probabilities. An extensive evaluation on real schemas shows that our approaches improve query performance by almost an order of magnitude.
【Keywords】: database management systems; learning (artificial intelligence); probability; query processing; uncertainty handling; database schema matching; machine learning techniques; magnitude order; possible mappings; probabilistic query evaluation; query performance; uncertain matching; uncertainty handling; Aggregates; Algorithm design and analysis; Partitioning algorithms; Probabilistic logic; Query processing; Uncertainty
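【Illustrative Sketch】: A naive Python baseline that only illustrates the query semantics: evaluate the query under every possible mapping and weight each answer tuple by the total probability of the mappings that produce it. The paper's contribution is precisely to avoid this enumeration by exploiting the overlap among mappings; evaluate is a hypothetical callback.

from collections import defaultdict

def probabilistic_answers(query, possible_mappings, evaluate, k=5):
    """possible_mappings: list of (mapping, probability) pairs summing to 1;
    evaluate(query, mapping) returns the set of answer tuples under that mapping."""
    answer_prob = defaultdict(float)
    for mapping, prob in possible_mappings:
        for tup in evaluate(query, mapping):
            answer_prob[tup] += prob
    # return the k answers with the highest probabilities
    return sorted(answer_prob.items(), key=lambda kv: -kv[1])[:k]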
【Paper Link】 【Pages】:1108-1119
【Authors】: Sabbas Burdakis ; Antonios Deligiannakis
【Abstract】: The topic of outlier detection in sensor networks has received significant attention in recent years. Detecting when the measurements of a node become "abnormal" is interesting, because this event may help detect either a malfunctioning node, or a node that starts observing an interesting local phenomenon (e.g., a fire). In this paper we present a new algorithm for detecting outliers in sensor networks, based on the geometric approach. Unlike prior work, our algorithms perform a distributed monitoring of outlier readings, exhibit 100% accuracy in their monitoring (assuming no message losses), and require the transmission of messages only at a fraction of the epochs, thus allowing nodes to safely refrain from transmitting in many epochs. Our approach is based on transforming common similarity metrics in a way that admits the application of the recently proposed geometric approach. We then propose a general framework and suggest multiple modes of operation, which allow each sensor node to accurately monitor its similarity to other nodes. Our experiments demonstrate that our algorithms can accurately detect outliers at a fraction of the communication cost that a centralized approach would require (even in the case where the central node lies just one hop away from all sensor nodes). Moreover, we demonstrate that these bandwidth savings become even larger as we incorporate further optimizations in our proposed modes of operation.
【Keywords】: geometry; wireless sensor networks; bandwidth saving; distributed monitoring; geometric approach; malfunctioning node detection; message transmission; outlier detection; outlier reading; sensor network; similarity metrics; Accuracy; Bandwidth; Cleaning; Color; Monitoring; Silicon; Vectors
【Paper Link】 【Pages】:1120-1131
【Authors】: Mingwang Tang ; Feifei Li ; Jeff M. Phillips ; Jeffrey Jestes
【Abstract】: In distributed data management, a primary concern is monitoring the distributed data and generating an alarm when a user-specified constraint is violated. A particularly useful instance is the threshold-based constraint, which is commonly known as the distributed threshold monitoring problem [4], [16], [19], [29]. This work extends this useful and fundamental study to distributed probabilistic data that emerge in many applications, where uncertainty naturally exists when massive amounts of data are produced at multiple sources in distributed, networked locations. Examples include distributed observing stations, large sensor fields, geographically separate scientific institutes/units, and many more. When dealing with probabilistic data, there are two thresholds involved, the score and the probability thresholds. Both must be monitored simultaneously; as such, techniques developed for deterministic data are no longer directly applicable. This work presents a comprehensive study of this problem. Our algorithms significantly outperform the baseline method in terms of both the communication cost (number of messages and bytes) and the running time, as shown by an extensive experimental evaluation using several real, large datasets.
【Keywords】: data integration; data mining; probability; deterministic data; distributed data management; distributed probabilistic data; distributed threshold monitoring problem; probability threshold; score threshold; threshold based constraint; user specified constraint; Distributed databases; Marine vehicles; Markov processes; Monitoring; Poles and towers; Probabilistic logic; Uncertainty
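【Illustrative Sketch】: A sketch of the monitored condition itself rather than the paper's communication-efficient protocols: with probabilistic readings at each site, an alarm should fire when the probability that the aggregate score exceeds the score threshold is above the probability threshold. A naive coordinator can estimate this by Monte Carlo sampling; the Gaussian site model below is a placeholder.

import random

def violates(site_samplers, score_threshold, prob_threshold, n_trials=10_000):
    """site_samplers: zero-argument callables, each drawing one sample from the
    corresponding site's reading distribution (hypothetical interface)."""
    hits = sum(1 for _ in range(n_trials)
               if sum(draw() for draw in site_samplers) > score_threshold)
    return hits / n_trials > prob_threshold

# Example: three sites with Gaussian readings centered at 10.
sites = [lambda: random.gauss(10, 2) for _ in range(3)]
print(violates(sites, score_threshold=35, prob_threshold=0.05))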
【Paper Link】 【Pages】:1132-1143
【Authors】: Dhaval Patel ; Chang Sheng ; Wynne Hsu ; Mong-Li Lee
【Abstract】: Trajectory classification has many useful applications. Existing works on trajectory classification do not consider the duration information of trajectories. In this paper, we extract duration-aware features from trajectories to build a classifier. Our method utilizes information theory to obtain regions where the trajectories have similar speeds and directions. Further, trajectories are summarized into a network based on the MDL principle that takes into account the duration difference among trajectories of different classes. A graph traversal is performed on this trajectory network to obtain the top-k covering path rules for each trajectory. Based on the discovered regions and top-k path rules, we build a classifier to predict the class labels of new trajectories. Experimental results on real-world datasets show that the proposed duration-aware classifier can obtain higher classification accuracy than the state-of-the-art trajectory classifier.
【Keywords】: feature extraction; information theory; network theory (graphs); pattern classification; MDL principle; duration aware feature extraction; duration information; duration-aware classifier; graph traversal; information theory; top-k covering path rule; trajectory classification; trajectory network; Accuracy; Databases; Feature extraction; Gain measurement; Hurricanes; Merging; Trajectory
【Paper Link】 【Pages】:1144-1155
【Authors】: Kai Zheng ; Yu Zheng ; Xing Xie ; Xiaofang Zhou
【Abstract】: The increasing availability of GPS-embedded mobile devices has given rise to a new spectrum of location-based services, which have accumulated a huge collection of location trajectories. In practice, a large portion of these trajectories have a low sampling rate. For instance, the time interval between consecutive GPS points of some trajectories can be several minutes or even hours. With such a low sampling rate, most details of their movement are lost, which makes them difficult to process effectively. In this work, we investigate how to reduce the uncertainty in such trajectories. Specifically, given a low-sampling-rate trajectory, we aim to infer its possible routes. The methodology adopted in our work is to take full advantage of the rich information extracted from the historical trajectories. We propose a systematic solution, the History-based Route Inference System (HRIS), which covers a series of novel algorithms that can derive the travel pattern from historical data and incorporate it into the route inference process. To validate the effectiveness of the system, we apply our solution to the map-matching problem, which is an important application scenario of this work, and conduct extensive experiments on a real taxi trajectory dataset. The experimental results demonstrate that HRIS can achieve higher accuracy than the existing map-matching algorithms for low-sampling-rate trajectories.
【Keywords】: Global Positioning System; inference mechanisms; information retrieval; mobile computing; sampling methods; traffic engineering computing; trees (mathematics); uncertainty handling; GPS-embedded mobile devices; HRIS; consecutive GPS points; historical data; historical trajectories; history-based route inference system; information extraction; location trajectories; location-based services; low-sampling-rate trajectories; map-matching problem; taxi trajectory dataset; time interval; travel patterns; uncertainty reduction; Artificial neural networks; Educational institutions; Global Positioning System; Inference algorithms; Roads; Trajectory; Uncertainty
【Paper Link】 【Pages】:1156-1167
【Authors】: Mehmet Kuzu ; Mohammad Saiful Islam ; Murat Kantarcioglu
【Abstract】: In recent years, due to the appealing features of cloud computing, large amounts of data have been stored in the cloud. Although cloud-based services offer many advantages, the privacy and security of sensitive data are a big concern. To mitigate these concerns, it is desirable to outsource sensitive data in encrypted form. Encrypted storage protects the data against illegal access, but it complicates some basic, yet important functionality such as search over the data. To achieve search over encrypted data without compromising privacy, a considerable number of searchable encryption schemes have been proposed in the literature. However, almost all of them handle exact query matching but not similarity matching, a crucial requirement for real-world applications. Although some sophisticated secure multi-party computation based cryptographic techniques are available for similarity tests, they are computationally intensive and do not scale for large data sources. In this paper, we propose an efficient scheme for similarity search over encrypted data. To do so, we utilize a state-of-the-art algorithm for fast near-neighbor search in high-dimensional spaces called locality-sensitive hashing. To ensure the confidentiality of the sensitive data, we provide a rigorous security definition and prove the security of the proposed scheme under the provided definition. In addition, we provide a real-world application of the proposed scheme and verify the theoretical results with empirical observations on a real dataset.
【Keywords】: cloud computing; cryptography; cloud based services; cloud computing; cryptographic techniques; data privacy; data security; encrypted data; locality sensitive hashing; multiparty computation; query matching; similarity search; state-of-the-art algorithm; Encryption; Feature extraction; Indexes; Measurement; Servers
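【Illustrative Sketch】: A Python sketch of the plaintext locality-sensitive hashing building block (random-hyperplane LSH for cosine similarity); the paper's scheme additionally protects the index and the queries cryptographically, which is not shown here. Data, dimensions, and bit counts are illustrative.

import random

def make_lsh(dim, n_bits, seed=0):
    """Return a signature function: n_bits random hyperplanes, one bit each."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    return lambda vec: tuple(
        sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes)

sig = make_lsh(dim=4, n_bits=8)
index = {}
data = {"doc1": [0.9, 0.1, 0.0, 0.2], "doc2": [0.8, 0.2, 0.1, 0.1],
        "doc3": [0.0, 0.1, 0.9, 0.7]}
for doc_id, vec in data.items():
    index.setdefault(sig(vec), []).append(doc_id)  # bucket by LSH signature

# A query is compared only against the documents in its own bucket; with
# several hash tables, near neighbors collide with high probability.
print(index.get(sig([0.85, 0.15, 0.05, 0.15]), []))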
【Paper Link】 【Pages】:1168-1179
【Authors】: HweeHwa Pang ; Xiaokui Xiao ; Jialie Shen
【Abstract】: The text search queries in an enterprise can reveal the users' topics of interest, and in turn confidential staff or business information. To safeguard the enterprise from the consequences of a disclosure of the query traces, it is desirable to obfuscate the true user intention from the search engine, without requiring it to be re-engineered. In this paper, we advocate a unique approach to profile the topics that are relevant to the user intention. Based on this approach, we introduce an (ε1, ε2)-privacy model that allows a user to stipulate that topics relevant to her intention at the ε1 level should appear to any adversary to be innocuous at the ε2 level. We then present a TopPriv algorithm to achieve the customized (ε1, ε2)-privacy requirement of individual users through injecting automatically formulated fake queries. The advantages of TopPriv over existing techniques are confirmed through benchmark queries on a real corpus, with experiment settings fashioned after an enterprise search application.
【Keywords】: data privacy; query processing; search engines; text analysis; (ε1, ε2)-privacy model; TopPriv algorithm; enterprise text search; fake queries; search engine; text search queries; user intention; Cryptography; Data privacy; Databases; Privacy; Search engines; Servers; Vectors
【Paper Link】 【Pages】:1180-1191
【Authors】: Katrin Eisenreich ; Jochen Adamek ; Philipp Rösch ; Volker Markl ; Gregor Hackenbroich
【Abstract】: Investigating potential dependencies in data and their effect on future business developments can help experts to prevent misestimations of risks and chances. This makes correlation a highly important factor in risk analysis tasks. Previous research on correlation in uncertain data management has mostly addressed the handling of dependencies between discrete rather than continuous distributions. Also, none of the existing approaches provides a clear method for extracting correlation structures from data and introducing assumptions about correlation to independently represented data. To enable risk analysis under correlation assumptions, we use an approximation technique based on copula functions. This technique enables analysts to introduce arbitrary correlation structures between arbitrary distributions and calculate relevant measures over the correlated data. The correlation information can either be extracted at runtime from historic data or be accessed from a parametrically precomputed structure. We discuss the construction, application, and querying of approximate correlation representations for different analysis tasks. Our experiments demonstrate the efficiency and accuracy of the proposed approach, and point out several possibilities for optimization.
【Keywords】: approximation theory; business data processing; correlation methods; data handling; database management systems; query processing; risk analysis; approximate correlation representation querying; approximation technique; copula functions; correlation information extraction; correlation structure extraction; correlation support; data dependencies; databases; discrete distributions; future business development; optimization; parametrically precomputed structure; risk analysis; risk evaluation; risk misestimation prevention; uncertain data management; Accuracy; Correlation; Data mining; Databases; Histograms; Insurance; Joints
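【Illustrative Sketch】: A small Python sketch of the copula idea in isolation: impose a chosen correlation structure (a Gaussian copula with parameter rho) on two arbitrary, independently specified marginals. The marginals and rho below are illustrative placeholders, not the paper's approximation or storage scheme.

import math
import random

def gaussian_copula_sample(rho, inv_cdf_x, inv_cdf_y, n, seed=0):
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho * rho) * rng.gauss(0, 1)
        # correlated normals -> uniforms via the standard normal CDF ...
        u1 = 0.5 * (1 + math.erf(z1 / math.sqrt(2)))
        u2 = 0.5 * (1 + math.erf(z2 / math.sqrt(2)))
        # ... -> target marginals via their inverse CDFs
        out.append((inv_cdf_x(u1), inv_cdf_y(u2)))
    return out

# Example: exponential loss severity correlated with a uniform risk factor.
exp_inv = lambda u: -math.log(1 - u) / 0.1   # Exponential(rate 0.1)
uni_inv = lambda u: 1000 * u                 # Uniform(0, 1000)
print(gaussian_copula_sample(0.7, exp_inv, uni_inv, n=5))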
【Paper Link】 【Pages】:1192-1203
【Authors】: Hyo-Sang Lim ; Gabriel Ghinita ; Elisa Bertino ; Murat Kantarcioglu
【Abstract】: Sensor networks are being increasingly deployed in many application domains ranging from environment monitoring to supervising critical infrastructure systems (e.g., the power grid). Due to their ability to continuously collect large amounts of data, sensor networks represent a key component in decision-making, enabling timely situation assessment and response. However, sensors deployed in hostile environments may be subject to attacks by adversaries who intend to inject false data into the system. In this context, data trustworthiness is an important concern, as false readings may result in wrong decisions with serious consequences (e.g., large-scale power outages). To defend against this threat, it is important to establish trust levels for sensor nodes and adjust node trustworthiness scores to account for malicious interferences. In this paper, we develop a game-theoretic defense strategy to protect sensor nodes from attacks and to guarantee a high level of trustworthiness for sensed data. We use a discrete time model, and we consider that there is a limited attack budget that bounds the capability of the attacker in each round. The defense strategy objective is to ensure that sufficient sensor nodes are protected in each round such that the discrepancy between the value accepted and the truthful sensed value is below a certain threshold. We model the attack-defense interaction as a Stackelberg game, and we derive the Nash equilibrium condition that is sufficient to ensure that the sensed data are truthful within a nominal error bound. We implement a prototype of the proposed strategy and we show through extensive experiments that our solution provides an effective and efficient way of protecting sensor networks from attacks.
【Keywords】: game theory; trusted computing; wireless sensor networks; Nash equilibrium condition; Stackelberg game; attack-defense interaction; critical infrastructure systems; data trustworthiness; environment monitoring; game-theoretic approach; game-theoretic defense strategy; hostile environments; malicious interferences; node trustworthiness scores; sensor networks; situation assessment; Computational modeling; Educational institutions; Games; Monitoring; Power grids; Reliability; Security
【Paper Link】 【Pages】:1204-1206
【Authors】: Oktie Hassanzadeh ; Anastasios Kementsietsidis ; Yannis Velegrakis
【Abstract】: We provide an overview of the current data management research issues in the context of the Semantic Web. The objective is to introduce the audience to the area of the Semantic Web, and to highlight the fact that the area provides many interesting research opportunities for the data management community. A new model, the Resource Description Framework (RDF), coupled with a new query language, called SPARQL, leads us to revisit some classical data management problems, including efficient storage, query optimization, and data integration. These are problems that the Semantic Web community has only recently started to explore, and therefore the experience and long tradition of the database community can prove valuable. We target both experienced and novice researchers who are looking for a thorough presentation of the area and its key research topics.
【Keywords】: SQL; data integration; meta data; query processing; semantic Web; RDF; SPARQL; data integration; data management issues; database community; query language; query optimization; resource description framework; semantic web; storage; Communities; Conferences; Data models; Databases; Resource description framework; Semantics
【Paper Link】 【Pages】:1207-1210
【Authors】: Emmanuel Müller ; Stephan Günnemann ; Ines Färber ; Thomas Seidl
【Abstract】: Traditional clustering algorithms identify just a single clustering of the data. Today's complex data, however, allow multiple interpretations, leading to several valid groupings hidden in different views of the database. Each of these multiple clustering solutions is valuable and interesting, as they provide different perspectives on the same data and several meaningful groupings for each object. Especially for high dimensional data, where each object is described by multiple attributes, alternative clusters in different attribute subsets are of major interest. In this tutorial, we describe several real world application scenarios for multiple clustering solutions. We abstract from these scenarios and present the general challenges in this emerging research area. We describe state-of-the-art paradigms, we highlight specific techniques, and we give an overview of this topic by providing a taxonomy of the existing clustering methods. By focusing on open challenges, we try to attract young researchers to participate in this emerging research field.
【Keywords】: data mining; pattern clustering; set theory; attribute subsets; database views; grouping objects; high dimensional data; multiple clustering solutions; multiple interpretations; open challenges; single data clustering; state-of-the-art paradigms; young researchers; Algorithm design and analysis; Clustering algorithms; Data mining; Databases; Scalability; Taxonomy; Tutorials; alternative clustering; data mining; disparate clustering; multi-view clustering; subspace clustering
【Paper Link】 【Pages】:1211-1213
【Authors】: Xin Luna Dong ; Divesh Srivastava
【Abstract】: The Web has enabled the availability of a vast amount of useful information in recent years. However, the web technologies that have enabled sources to share their information have also made it easy for sources to copy from each other and often publish without proper attribution. Understanding the copying relationships between sources has many benefits, including helping data providers protect their own rights, improving various aspects of data integration, and facilitating in-depth analysis of information flow. The importance of copy detection has led to a substantial amount of research in many disciplines of Computer Science, based on the type of information considered, such as text, images, videos, software code, and structured data. This seminar explores the similarities and differences between the techniques proposed for copy detection across the different types of information. We also examine the computational challenges associated with large-scale copy detection, indicating how they could be detected efficiently, and identify a range of open problems for the community.
【Keywords】: Internet; copy protection; data integration; reproduction (copying); Web technology; World Wide Web; clone detection; computational challenges; computer science; copy detection; copying relationships; data integration; data providers; in-depth analysis; information flow; information sharing; reuse detection; useful information; Cloning; Communities; Computer science; Databases; Seminars; Software; Videos
【Paper Link】 【Pages】:1214-1217
【Authors】: Jiawei Han ; Yizhou Sun ; Xifeng Yan ; Philip S. Yu
【Abstract】: Most objects and data in the real world are interconnected, forming complex, heterogeneous but often semistructured information networks. However, many database researchers consider a database merely as a data repository that supports storage and retrieval rather than an information-rich, inter-related and multi-typed information network that supports comprehensive data analysis, whereas many network researchers focus on homogeneous networks. Departing from both, we view interconnected, semi-structured datasets as heterogeneous, information-rich networks and study how to uncover hidden knowledge in such networks. For example, a university database can be viewed as a heterogeneous information network, where objects of multiple types, such as students, professors, courses, departments, and multiple typed relationships, such as teach and advise are intertwined together, providing abundant information. In this tutorial, we present an organized picture on mining heterogeneous information networks and introduce a set of interesting, effective and scalable network mining methods. The topics to be covered include (i) database as an information network, (ii) mining information networks: clustering, classification, ranking, similarity search, and meta path-guided analysis, (iii) construction of quality, informative networks by data mining, (iv) trend and evolution analysis in heterogeneous information networks, and (v) research frontiers. We show that heterogeneous information networks are informative, and link analysis on such networks is powerful at uncovering critical knowledge hidden in large semi-structured datasets. Finally, we also present a few promising research directions.
【Keywords】: data analysis; data mining; database management systems; information networks; information retrieval systems; comprehensive data analysis; data mining; data repository; data retrieval; data storage; database researchers; information network analysis; knowledge mining; link analysis; semistructured information networks; Cleaning; Data mining; Databases; Ontologies; Semantics; Sun
【Paper Link】 【Pages】:1218-1221
【Authors】: Arijit Khan ; Yinghui Wu ; Xifeng Yan
【Abstract】: In a wide array of disciplines, data can be modeled as an interconnected network of entities, where various attributes could be associated with both the entities and the relations among them. Knowledge is often hidden in the complex structure and attributes inside these networks. While querying and mining these linked datasets are essential for various applications, traditional graph queries may not be able to capture the rich semantics in these networks. With the advent of complex information networks, new graph queries are emerging, including graph pattern matching and mining, similarity search, ranking and expert finding, graph aggregation and OLAP. These queries require both the topology and content information of the network data, and hence differ from classical graph algorithms such as shortest path, reachability, and minimum cut, which depend only on the structure of the network. In this tutorial, we shall give an introduction to the emerging graph queries, their indexing and resolution techniques, the current challenges, and the future research directions.
【Keywords】: data mining; data structures; indexing; pattern matching; publishing; query processing; reachability analysis; OLAP; complex attributes; complex information networks; complex structure; entity interconnected network; expert finding; graph aggregation; graph algorithms; graph pattern matching; graph pattern mining; graph queries; indexing techniques; linked data; linked dataset mining; linked dataset querying; minimum cut; reachability; resolution techniques; shortest path; similarity ranking; similarity search; Data mining; Indexing; Keyword search; Pattern matching; Social network services; Tutorials
【Paper Link】 【Pages】:1222-1224
【Authors】: Jaideep Vaidya
【Abstract】: With the ubiquitous nature and sheer scale of data collection, the problem of data summarization is most critical for effective data management. Classical matrix decomposition techniques have often been used for this purpose, and have been the subject of much study. In recent years, several other forms of decomposition, including Boolean Matrix Decomposition, have become of significant practical interest. Since much of the data collected is categorical in nature, it can be viewed in terms of a Boolean matrix. Boolean matrix decomposition (BMD), wherein a Boolean matrix is expressed as a product of two Boolean matrices, can be used to provide concise and interpretable representations of Boolean data sets. The decomposed matrices give the set of meaningful concepts and their combination, which can be used to reconstruct the original data. Such decompositions are useful in a number of application domains including role engineering, text mining, as well as knowledge discovery from databases. In this seminar, we look at the theory underlying the BMD problem, study some of its variants and solutions, and examine different practical applications.
【Keywords】: Boolean algebra; data handling; data mining; matrix decomposition; BMD; Boolean data sets; Boolean matrix decomposition problem; data collection; data engineering; data management; data summarization; knowledge discovery; role engineering; text mining; Access control; Approximation methods; Computational modeling; Conferences; Data mining; Matrix decomposition; USA Councils
【Paper Link】 【Pages】:1225-1228
【Authors】: Thomas Kissinger ; Hannes Voigt ; Wolfgang Lehner
【Abstract】: As databases accumulate growing amounts of data at an increasing rate, adaptive indexing becomes more and more important. At the same time, applications and their use get more agile and flexible, resulting in less steady and less predictable workload characteristics. Being inert and coarse-grained, state-of-the-art index tuning techniques become less useful in such environments. In particular, the full-column indexing paradigm results in a lot of indexed but never-queried data and prohibitively high memory and maintenance costs. In our demonstration, we present Self-Managing Indexes, a novel, adaptive, fine-grained, autonomous indexing infrastructure. At its core, our approach builds on a novel access path that automatically collects useful index information, discards useless index information, and competes with its kind for resources to host its index information. Compared to existing technologies for adaptive indexing, we are able to dynamically grow and shrink our indexes, instead of incrementally enhancing the index granularity. In the demonstration, we visualize performance and system measures for different scenarios and allow the user to interactively change several system parameters.
【Keywords】: database indexing; query processing; SMIX live; access path; adaptive indexing infrastructure; autonomous indexing infrastructure; database; dynamic workload; fine-grained indexing infrastructure; index granularity; index information; queried data; self-managing index infrastructure; system measures; system parameter; workload characteristics; Aerospace electronics; Indexing; Radiation detectors; Tuning
【Paper Link】 【Pages】:1229-1232
【Authors】: Mohammad Sadoghi ; Rija Javed ; Naif Tarafdar ; Harsh Singh ; Rohan Palaniappan ; Hans-Arno Jacobsen
【Abstract】: We present an efficient multi-query event stream platform to support query processing over high-frequency event streams. Our platform is built over reconfigurable hardware -- FPGAs -- to achieve line-rate multi-query processing by exploiting unprecedented degrees of parallelism and potential for pipelining, only available through custom-built, application-specific and low-level logic design. Moreover, a multi-query event stream processing engine is at the core of a wide range of applications including real-time data analytics, algorithmic trading, targeted advertisement, and (complex) event processing.
【Keywords】: field programmable gate arrays; logic design; query processing; FPGA; algorithmic trading; complex event processing; high-frequency event streams; line-rate multiquery processing; low-level logic design; multiquery event stream platform; multiquery stream processing; parallelism; pipelining; real-time data analytics; reconfigurable hardware; targeted advertisement; Algebra; Bandwidth; Field programmable gate arrays; Hardware; Hardware design languages; Parallel processing; Semantics
【Paper Link】 【Pages】:1233-1236
【Authors】: Jia Xu ; Qiushi Bai ; Yu Gu ; Anthony K. H. Tung ; Guoren Wang ; Ge Yu ; Zhenjie Zhang
【Abstract】: The Earth Mover's Distance, or EMD for short, has been proven to be effective for content-based image retrieval. However, due to the cubic complexity of EMD computation, it remains difficult to use EMD in applications with stringent efficiency requirements. In this paper, we present our new system, called EUDEMON, which utilizes new techniques to support fast Online Video Frame Copy Detection based on the EMD. Given a group of registered frames as queries and a set of targeted detection videos, EUDEMON is capable of identifying relevant frames from the video stream in real time. The significant improvement in efficiency mainly relies on the primal-dual theory in linear programming and well-designed B+ tree filters for adaptive candidate pruning. Generally speaking, our system includes a variety of new features crucial to the deployment of EUDEMON in real applications. First, EUDEMON achieves high throughput even when a large number of queries are registered in the system. Second, EUDEMON contains a self-optimization component to automatically enhance the effectiveness of the filters based on the recent content of the video stream. Finally, EUDEMON provides a user-friendly visualization interface, named EMD Flow Chart, to help users better understand alarms from the perspective of the EMD.
【Keywords】: computational complexity; content-based retrieval; data visualisation; filtering theory; image registration; linear programming; object detection; statistical distributions; trees (mathematics); video retrieval; video streaming; B+ tree filter; EMD flow chart; EUDEMON; Earth mover distance; adaptive candidate pruning; content-based image retrieval; cubic complexity; frame identification; frame registration; linear programming; online video frame copy detection; primal dual theory; query processing; self-optimization component; target detection video; user-friendly visualization interface; video stream; Earth; Histograms; Image color analysis; Image retrieval; Robustness; Streaming media; Throughput
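【Illustrative Sketch】: A sketch of the EMD itself in the easy special case of one-dimensional histograms with equal total mass, where it reduces to the L1 distance between cumulative sums. For the general multi-dimensional histograms handled by EUDEMON, computing the EMD means solving a transportation problem, which is why cheap filters and lower bounds matter; none of the system's filtering machinery is reproduced here.

def emd_1d(hist_a, hist_b):
    """EMD between two 1-D histograms of equal total mass."""
    assert abs(sum(hist_a) - sum(hist_b)) < 1e-9, "equal total mass assumed"
    cum_a = cum_b = total = 0.0
    for a, b in zip(hist_a, hist_b):
        cum_a += a
        cum_b += b
        total += abs(cum_a - cum_b)  # mass still to be moved past this bin
    return total

print(emd_1d([0.2, 0.5, 0.3], [0.3, 0.3, 0.4]))  # ~0.2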
【Paper Link】 【Pages】:1237-1240
【Authors】: Meiyu Lu ; Srinivas Bangalore ; Graham Cormode ; Marios Hadjieleftheriou ; Divesh Srivastava
【Abstract】: A key step in validating a proposed idea or system is to evaluate it over a suitable dataset. However, to date there have been no useful tools for researchers to understand which datasets have been used for what purpose, or in what prior work. Instead, they have to manually browse through papers to find the suitable datasets and their corresponding URLs, which is laborious and inefficient. To better aid the dataset discovery process, and provide a better understanding of how and where datasets have been used, we propose a framework to effectively identify datasets within the scientific corpus. The key technical challenges are the identification of datasets, and the discovery of the association between a dataset and the URLs where it can be accessed. Based on this, we have built a user-friendly web-based search interface for users to conveniently explore the dataset-paper relationships, and find relevant datasets and their properties.
【Keywords】: data analysis; document handling; search engines; user interfaces; URL; dataset discovery process; dataset search engine; dataset-paper relationships; research document corpus; scientific corpus; user friendly Web-based search interface; Benchmark testing; Bibliographies; Data mining; Feature extraction; Libraries; Portable document format; Search engines
【Paper Link】 【Pages】:1241-1244
【Authors】: Keivan Kianmehr ; Negar Koochakzadeh ; Reda Alhajj
【Abstract】: The user-centric query interface is a very common application that allows expressing both the input and the output using fuzzy terms. This is becoming a need in the evolving Internet-based era, where web-based applications are very common and the number of users accessing structured databases is increasing rapidly. Restricting the user group to only experts in query coding must be avoided. The AskFuzzy system has been developed to address this vital issue, which has social and industrial impact. It is an attractive and friendly visual user interface that facilitates expressing queries using both fuzziness and traditional methods. The fuzziness is not expressed explicitly inside the database; it is rather absorbed and effectively handled by an intermediate layer which is cleverly incorporated between the front-end visual user interface and the back-end database.
【Keywords】: Internet; fuzzy set theory; query processing; user interfaces; visual programming; AskFuzzy; Internet-based era; Web-based applications; attractive visual fuzzy query builder; back-end database; front-end visual user-interface; query coding; structured databases; user-centric query interface; Educational institutions; Fuzzy sets; Optimization; Pragmatics; Relational databases; Visualization; fuzzy system; query interface; user-centric application; visual queries
【Paper Link】 【Pages】:1245-1248
【Authors】: Ulrike Fischer ; Frank Rosenthal ; Wolfgang Lehner
【Abstract】: Forecasts are important to decision-making and risk assessment in many domains. Since current database systems do not provide integrated support for forecasting, it is usually done outside the database system by specially trained experts using forecast models. However, integrating model-based forecasting as a first-class citizen inside a DBMS speeds up the forecasting process by avoiding exporting the data and by applying database-related optimizations like reusing created forecast models. It especially allows subsequent processing of forecast results inside the database. In this demo, we present our prototype F2DB based on PostgreSQL, which allows for transparent processing of forecast queries. Our system automatically takes care of model maintenance when the underlying dataset changes. In addition, we offer optimizations to save maintenance costs and increase accuracy by using derivation schemes for multidimensional data. Our approach reduces the required expert knowledge by enabling arbitrary users to apply forecasting in a declarative way.
【Keywords】: SQL; database management systems; decision making; query processing; risk management; DBMS; F2DB; PostgreSQL; database-related optimization; decision making; derivation scheme; flash-forward database system; maintenance cost; model maintenance; model-based forecasting; multidimensional data; risk assessment; transparent forecast query processing; Analytical models; Data models; Forecasting; Maintenance engineering; Mathematical model; Predictive models; Time series analysis
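【Illustrative Sketch】: A sketch of what a declarative forecast query conceptually returns: future values produced by a model fitted to a stored time series. F2DB maintains such models inside the DBMS and keeps them up to date as data changes; the single-exponential-smoothing model below is only an illustrative stand-in, not the system's model class or query syntax.

def forecast(series, horizon, alpha=0.5):
    """Single exponential smoothing: fold the series into a level estimate,
    then emit a flat h-step-ahead forecast."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return [level] * horizon

monthly_sales = [12, 13, 15, 14, 16, 18, 17]   # hypothetical stored measure
print(forecast(monthly_sales, horizon=3))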
【Paper Link】 【Pages】:1249-1252
【Authors】: Robert Ikeda ; Junsang Cho ; Charlie Fang ; Semih Salihoglu ; Satoshi Torikai ; Jennifer Widom
【Abstract】: We demonstrate Panda (for Provenance and Data), a system for data-oriented workflows that supports debugging and drill-down using logical provenance, i.e., provenance information stored at the processing-node level. In this demonstration, Panda is used to integrate, process, and analyze actual education data from multiple sources.
【Keywords】: data analysis; data integration; educational administrative data processing; program debugging; Panda; data drill-down; data integration; data processing; data-oriented workflows; education data analysis; logical provenance-provenance information storage; processing-node level; provenance and data; provenance-based debugging; Data mining; Debugging; Educational institutions; Electronic publishing; Graphical user interfaces; Information services
【Paper Link】 【Pages】:1253-1256
【Authors】: Ahmed M. Aly ; Asmaa Sallam ; Bala M. Gnanasekaran ; Long-Van Nguyen-Dinh ; Walid G. Aref ; Mourad Ouzzani ; Arif Ghafoor
【Abstract】: The continuous growth of social web applications along with the development of sensor capabilities in electronic devices is creating countless opportunities to analyze the enormous amounts of data that are continuously streaming from these applications and devices. To process large scale data on large scale computing clusters, MapReduce has been introduced as a framework for parallel computing. However, most of the current implementations of the MapReduce framework support only the execution of fixed-input jobs. Such a restriction makes these implementations inapplicable for most streaming applications, in which queries are continuous in nature, and input data streams are continuously received at high arrival rates. In this demonstration, we showcase M3, a prototype implementation of the MapReduce framework in which continuous queries over streams of data can be efficiently answered. M3 extends Hadoop, the open source implementation of MapReduce, bypassing the Hadoop Distributed File System (HDFS) to support main-memory-only processing. Moreover, M3 supports continuous execution of the Map and Reduce phases where individual Mappers and Reducers never terminate.
【Keywords】: Internet; file organisation; parallel processing; workstation clusters; Hadoop distributed file system; MapReduce framework; electronic devices; large scale computing clusters; main-memory MapReduce; main-memory-only processing; parallel computing; sensor capabilities; social Web applications; stream processing; streaming applications; Batch production systems; Fault tolerance; Fault tolerant systems; File systems; Monitoring; Query processing; Road transportation
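【Illustrative Sketch】: A toy Python sketch of one map/reduce cycle over an in-memory stream batch, the pattern M3 keeps running continuously instead of materializing inputs and outputs in HDFS. The word-count job and the single-batch "window" are placeholders, not M3's API.

from collections import defaultdict

def map_phase(batch, map_fn):
    pairs = []
    for record in batch:
        pairs.extend(map_fn(record))
    return pairs

def reduce_phase(pairs, reduce_fn):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

word_map = lambda line: [(w, 1) for w in line.split()]
word_reduce = lambda key, values: sum(values)

stream_window = ["to be or not to be", "be quick"]  # one in-memory batch
print(reduce_phase(map_phase(stream_window, word_map), word_reduce))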
【Paper Link】 【Pages】:1257-1260
【Authors】: Torsten Grust ; Manuel Mayr
【Abstract】: We demonstrate SWITCH, a deep embedding of relational queries into Ruby and Ruby on Rails. With SWITCH, there is no syntactic or stylistic difference between Ruby programs that operate over in-memory array objects or database-resident tables, even if these programs rely on array order or nesting. SWITCH's built-in compiler and SQL code generator guarantee to emit few queries, addressing long-standing performance problems that trace back to Rails' Active Record database binding. "Looks like Ruby, but performs like handcrafted SQL" is the ideal that drives the research and development effort behind SWITCH.
【Keywords】: SQL; program compilers; query processing; relational databases; Rails Active Record database binding; Ruby on Rails; Ruby program; SQL code generator; SWITCH; array order; built-in compiler; database-resident tables; in-memory array object; nesting; relational query deep embedding; stylistic difference; syntactic difference; Arrays; DSL; Databases; Rails; Runtime; Switches; Syntactics
【Paper Link】 【Pages】:1261-1264
【Authors】: Rubi Boim ; Ohad Greenshpan ; Tova Milo ; Slava Novgorodov ; Neoklis Polyzotis ; Wang Chiew Tan
【Abstract】: Crowd-based data sourcing is a new and powerful data procurement paradigm that engages Web users to collectively contribute information. In this work, we target the problem of gathering data from the crowd in an economical and principled fashion. We present Ask It!, a system that allows interactive data sourcing applications to effectively determine which questions should be directed to which users for reducing the uncertainty about the collected data. Ask It! uses a set of novel algorithms for minimizing the number of probing (questions) required from the different users. We demonstrate the challenge and our solution in the context of a multiple-choice question game played by the ICDE'12 attendees, targeted to gather information on the conference's publications, authors and colleagues.
【Keywords】: Internet; data analysis; question answering (information retrieval); Ask It!; Web users; crowd-based data sourcing; data procurement paradigm; interactive data sourcing applications; multiple-choice question game; Databases; Entropy; Games; Measurement; Optimization; Probes; Uncertainty
【Paper Link】 【Pages】:1265-1268
【Authors】: Chunbin Lin ; Jiaheng Lu ; Tok Wang Ling ; Bogdan Cautis
【Abstract】: The existing query languages for XML (e.g., XQuery) require professional programming skills to formulate; moreover, such complex query languages burden query processing. In addition, when issuing an XML query, users are required to be familiar with the content (including the structural and textual information) of the hierarchical XML, which is difficult for common users. The need for designing user-friendly interfaces to reduce the burden of query formulation is fundamental to the spread of the XML community. We present a twig-based XML graphical search system, called LotusX, that provides a graphical interface to simplify query processing without the need to learn a query language or data schemas, or to know the content of the XML document. The basic idea is that LotusX proposes "position-aware" and "auto-completion" features to help users create tree-modeled queries (twig patterns) by providing the possible candidates on-the-fly. In addition, complex twig queries (including order-sensitive queries) are supported in LotusX. Furthermore, a new ranking strategy and a query rewriting solution are implemented to rank and rewrite the query effectively. We provide an online demo for the LotusX system: http://datasearch.ruc.edu.cn:8080/LotusX.
【Keywords】: XML; graphical user interfaces; query formulation; query languages; query processing; trees (mathematics); LotusX; XML query; XQuery; autocompletion feature; data schemas; graphical interface; hierarchical XML; order sensitive queries; position-aware XML graphical search system; query formulation; query languages; query processing; query rewriting solution; ranking strategy; tree-modeled queries; twig pattern; twig queries; Books; Database languages; Indexes; Keyword search; Programming; Query processing; XML
【Paper Link】 【Pages】:1269-1272
【Authors】: Mehdi Kargar ; Aijun An
【Abstract】: A system for efficient keyword search in graphs is demonstrated. The system has two components: a search, restricted to the nodes containing the input keywords, for a set of nodes that are close to each other and together cover the input keywords, and an exploration for finding how these nodes are related to each other. The system generates all or top-k answers with polynomial delay. Answers are presented to the user according to a ranking criterion so that answers with nodes closer to each other are presented before those with nodes farther away from each other. In addition, the set of answers produced by our system is duplication free. The system uses two methods for presenting the final answer to the user. The presentation methods reveal relationships among the nodes in an answer through a tree or a multi-center graph. We will show that each method has its own advantages and disadvantages. The system is demonstrated using two challenging datasets, the very large DBLP and the highly cyclic Mondial. Challenges and difficulties in implementing an efficient keyword search system are also demonstrated.
【Keywords】: approximation theory; information retrieval; polynomials; tree data structures; trees (mathematics); Mondial; approximation algorithm; data structure; duplication free; information extraction; multicenter graph; polynomial delay; presentation methods; ranking criterion; top-k keyword search system; very large DBLP; Approximation algorithms; Communities; Delay; Indexes; Keyword search; Polynomials
【Paper Link】 【Pages】:1273-1276
【Authors】: Rui Li ; Kin Hou Lei ; Ravi Khadiwala ; Kevin Chen-Chuan Chang
【Abstract】: Witnessing the emergence of Twitter, we propose a Twitter-based Event Detection and Analysis System (TEDAS), which helps to (1) detect new events, to (2) analyze the spatial and temporal pattern of an event, and to (3) identify importance of events. In this demonstration, we show the overall system architecture, explain in detail the implementation of the components that crawl, classify, and rank tweets and extract location from tweets, and present some interesting results of our system.
【Keywords】: social networking (online); TEDAS; Twitter-based event detection and analysis system; event spatial pattern; event temporal pattern; overall system architecture; Conferences; Data engineering
【Paper Link】 【Pages】:1277-1280
【Authors】: Fei Chiang ; Periklis Andritsos ; Erkang Zhu ; Renée J. Miller
【Abstract】: An attribute dictionary is a set of attributes together with a set of common values of each attribute. Such dictionaries are valuable in understanding unstructured or loosely structured textual descriptions of entity collections, such as product catalogs. Dictionaries provide the supervised data for learning product or entity descriptions. In this demonstration, we will present AutoDict, a system that analyzes input data records, and discovers high quality dictionaries using information theoretic techniques. To the best of our knowledge, AutoDict is the first end-to-end system for building attribute dictionaries. Our demonstration will showcase the different information analysis and extraction features within AutoDict, and highlight the process of generating high quality attribute dictionaries.
【Keywords】: cataloguing; dictionaries; information retrieval; text analysis; AutoDict; attribute dictionary; automated dictionary discovery; data record; end-to-end system; entity collection; entity description; high quality dictionaries; information analysis; information extraction; information theoretic technique; learning product; loosely structured textual description; product catalog; unstructured textual description; Data mining; Data models; Dictionaries; Frequency measurement; Hidden Markov models; TV; Tagging
【Paper Link】 【Pages】:1281-1284
【Authors】: Barbara Carminati ; Elena Ferrari ; Jacopo Girardi
【Abstract】: At the beginning of the Web 2.0 era, Online Social Networks (OSNs) appeared as just another phenomenon among wikis, blogs, video sharing, and so on. However, they soon became one of the biggest revolutions of the Internet era. Statistics confirm the continuing rise in the importance of social networking sites in terms of number of users (e.g., Facebook reaches 750 million users, Twitter 200 million, LinkedIn 100 million), time spent in social networking sites, and amount of data flowing (e.g., Facebook users interact with about 900 million pieces of data in terms of pages, groups, events and community pages). This success makes OSNs one of the most promising paradigms for information sharing on the Web.
【Keywords】: Internet; social networking (online); trusted computing; Internet; OSN; Web 2.0; blogs; online social networks; trusted information sharing; video sharing; wikis; Access control; Educational institutions; Facebook; Google; Privacy; Servers
【Paper Link】 【Pages】:1285-1288
【Authors】: Elke Achtert ; Sascha Goldhofer ; Hans-Peter Kriegel ; Erich Schubert ; Arthur Zimek
【Abstract】: When comparing clustering results, any evaluation metric breaks down the available information to a single number. However, many evaluation metrics exist, and they are neither always concordant nor easily interpretable when judging the agreement of a pair of clusterings. Here, we provide a tool to visually support the assessment of clustering results when comparing multiple clusterings. Along the way, the suitability of several clustering comparison measures can be judged in different scenarios.
【Keywords】: data visualisation; pattern clustering; clustering evaluation; evaluation metrics; multiple clusterings; visual support; Clustering algorithms; Complexity theory; Conferences; Data visualization; Measurement; Noise; Visualization
【Paper Link】 【Pages】:1289-1292
【Authors】: Mohamed Sarwat ; Sameh Elnikety ; Yuxiong He ; Gabriel Kliot
【Abstract】: Graphs are used in many large-scale applications, such as social networking. The management of these graphs poses new challenges as such graphs are too large for a single server to manage efficiently. Current distributed techniques such as map-reduce and Pregel are not well-suited to processing interactive ad-hoc queries against large graphs. In this paper we demonstrate Horton, a distributed interactive query execution engine for large graphs. Horton defines a query language that allows the expression of regular-language reachability queries and provides a query execution engine with a query optimizer that allows interactive execution of queries on large distributed graphs in parallel. In the demo, we show the functionality of Horton managing a large graph for a social networking application called Codebook, whose graph represents data on software components, developers, development artifacts such as bug reports, and their interactions in large software projects.
【Keywords】: interactive systems; query languages; query processing; reachability analysis; social networking (online); Codebook; Horton; development artifacts; distributed interactive query execution engine; graph management; graph models; interactive ad hoc query processing; large distributed graphs; large software projects; online query execution engine; query language; query optimizer; regular language reachability queries; social networking application; software components; Data models; Database languages; Engines; Libraries; Servers; Social network services; Software
【Paper Link】 【Pages】:1293-1296
【Authors】: Peter M. Fischer ; Jens Teubner
【Abstract】: We demonstrate MXQuery/H, a modified version of MXQuery that uses hardware acceleration to speed up XML processing. The main goal of this demonstration is to give an interactive example of hardware/software co-design and show how system performance and energy efficiency can be improved by off-loading tasks to FPGA hardware. To this end, we equipped MXQuery/H with various hooks to inspect the different parts of the system. Moreover, our system can finally leverage the idea of XML projection. Though the idea of projection has been around for a while, its effectiveness has remained limited because of the unavoidably high parsing overhead. By performing the task in hardware, we relieve the software part of this overhead and achieve processing speed-ups of several factors.
【Keywords】: XML; field programmable gate arrays; hardware-software codesign; query processing; FPGA hardware; MXQuery; MXQuery/H; XML processing; hardware acceleration; hardware/software co-design; parsing overhead; Automata; Field programmable gate arrays; Hardware; Runtime; Skeleton; Software; XML
【Paper Link】 【Pages】:1297-1300
【Authors】: Steffen Hirte ; Andreas Seifert ; Stephan Baumann ; Daniel Klan ; Kai-Uwe Sattler
【Abstract】: Motion sensing input devices like Microsoft's Kinect offer an alternative to traditional computer input devices like keyboards and mice. New applications using this interface appear daily. Most of them implement their own gesture detection. In our demonstration we show a new approach using the data stream engine AnduIN. The gesture detection is done based on AnduIN's complex event processing functionality. This way, we build a system that allows new and complex gestures to be defined via a declarative programming interface. On this basis, our demonstration data3 provides a basic natural-interaction OLAP interface for a sample star schema database using Microsoft's Kinect.
【Keywords】: data mining; database management systems; gesture recognition; AnduIN; Kinect interface; Microsoft Kinect; OLAP interface; complex event processing; data stream engine; data3; declarative programming interface; gesture detection; motion sensing input device; natural interaction; sample star schema database; Cameras; Data visualization; Databases; Engines; Middleware; Navigation; Sensors
【Paper Link】 【Pages】:1301-1304
【Authors】: Anisoara Nica ; Ian Charlesworth ; Maysum Panju
【Abstract】: Search spaces generated by query optimizers during the optimization process encapsulate characteristics of the join enumeration algorithms, the cost models, as well as critical decisions made for pruning and choosing the best plan. We demonstrate the Join Enumeration Viewer which is a tool designed for visualizing, mining, and comparing plan search spaces generated by different join enumeration algorithms when optimizing same SQL statement. We have enhanced Sybase SQL Anywhere relational database management system to log, in a very compact format, its search space during an optimization process. Such optimization log can then be analyzed by the Join Enumeration Viewer which internally builds the logical and physical plan graphs representing complete and partial plans considered during the optimization process. The optimization logs also contain statistics of the resource consumption during the query optimization such as optimization time breakdown, for example, for logical join enumeration versus costing physical plans, and memory allocation for different optimization structures. The SQL Anywhere Optimizer implements a highly adaptable, self-managing, search space generation algorithm by having several join enumeration algorithms to choose from, each enhanced with different ordering and pruning techniques. The emphasis of the demonstration will be on comparing and contrasting these join enumeration algorithms by analyzing their optimization logs. The demonstration scenarios will include optimizing SQL statements under various conditions which will exercise different algorithms, pruning and ordering techniques. These search spaces will then be visualized and compared using the Join Enumeration Viewer.
【Keywords】: SQL; data mining; data visualisation; graph theory; query processing; relational databases; resource allocation; statistical analysis; storage management; Join Enumeration Viewer tool; SQL Anywhere Optimizer; SQL statement optimization; Sybase SQL Anywhere relational database management system; cost model; costing physical plan; critical decision; highly adaptable algorithm; join enumeration algorithm; logical join enumeration; logical plan graph; memory allocation; optimization log; optimization structure; optimization time breakdown; ordering technique; physical plan graph; pruning technique; query optimization process analysis; resource consumption statistics; search space comparison; search space generation algorithm; search space mining; search space visualization; self-managing algorithm; Algorithm design and analysis; Conferences; Costing; Heuristic algorithms; Optimization; Partitioning algorithms; Visualization
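The Join Enumeration Viewer above visualizes the search space a join enumerator explores. For orientation, here is a hedged sketch of a size-driven dynamic-programming enumerator that records every partial plan it considers, i.e., the kind of log such a tool would consume; the cardinalities and the cost model are deliberately naive and are not SQL Anywhere's.

```python
# Hedged sketch of size-driven dynamic-programming join enumeration.
# Cardinalities, the cost model, and the logging format are illustrative.
from itertools import combinations

CARD = {"A": 1000, "B": 100, "C": 10, "D": 5000}   # base-table cardinalities

def cost(left_plan, right_plan):
    # toy cost model: cost of the inputs plus the product of their cardinalities
    return left_plan[1] + right_plan[1] + left_plan[2] * right_plan[2]

def enumerate_joins(tables):
    # best[frozenset] = (plan_tree, cost, estimated_cardinality)
    best, log = {}, []
    for t in tables:
        best[frozenset([t])] = (t, 0, CARD[t])
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            for k in range(1, size):
                for left in map(frozenset, combinations(subset, k)):
                    right = subset - left
                    lp, rp = best[left], best[right]
                    c = cost(lp, rp)
                    plan = (lp[0], rp[0])
                    log.append((plan, c))                 # search-space entry
                    if subset not in best or c < best[subset][1]:
                        best[subset] = (plan, c, lp[2] * rp[2])
    return best[frozenset(tables)], log

best_plan, considered = enumerate_joins(["A", "B", "C", "D"])
print("best plan:", best_plan)
print("partial plans considered:", len(considered))
```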
【Paper Link】 【Pages】:1305-1308
【Authors】: Yonghui Xiao ; James J. Gardner ; Li Xiong
【Abstract】: We demonstrate DPCube, a component in our Health Information DE-identification (HIDE) framework, for releasing differentially private data cubes (or multidimensional histograms) over sensitive data. HIDE is a framework we developed for integrating heterogeneous structured and unstructured health information, and it provides methods for privacy-preserving data publishing. The DPCube component uses differentially private access mechanisms and an innovative two-phase multidimensional partitioning strategy to publish a multidimensional data cube or histogram that achieves good utility while satisfying differential privacy. We demonstrate that the released data cubes can serve as a sanitized synopsis of the raw database and, together with an optional synthesized dataset based on the data cubes, can support various Online Analytical Processing (OLAP) queries and learning tasks.
【Keywords】: data mining; data privacy; database management systems; medical information systems; object-oriented programming; query processing; DPCube component; HIDE framework; OLAP query; differential privacy; differentially private access mechanisms; differentially private data cubes; health information de-identification framework; heterogenous structured health information; innovative 2-phase multidimensional partitioning strategy; learning tasks; multidimensional data cube; multidimensional histograms; online analytical processing query; optional synthesized dataset; privacy preserving data publishing; raw database; released data cubes; sanitized synopsis; sensitive data; unstructured health information; Data analysis; Data privacy; Databases; Estimation; Histograms; Privacy; Publishing
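DPCube above releases differentially private multidimensional histograms. The sketch below shows only the basic building block, perturbing each cell count with Laplace noise of scale 1/epsilon; the two-phase partitioning strategy that gives DPCube its utility is not reproduced, and the data and bins are synthetic.

```python
# Hedged sketch: an epsilon-differentially-private 2-D histogram obtained by
# adding Laplace(1/epsilon) noise to each cell count. Under add/remove-one-
# record neighbouring, the histogram's L1 sensitivity is 1, so this satisfies
# epsilon-DP. Data and bins are synthetic.
import random

def dp_histogram(records, age_bins, income_bins, epsilon):
    counts = {(a, i): 0 for a in range(len(age_bins) - 1)
                        for i in range(len(income_bins) - 1)}
    for age, income in records:
        a = max(j for j in range(len(age_bins) - 1) if age >= age_bins[j])
        i = max(j for j in range(len(income_bins) - 1) if income >= income_bins[j])
        counts[(a, i)] += 1
    scale = 1.0 / epsilon
    # difference of two Exp(1/scale) variables is Laplace with scale 1/epsilon
    return {cell: c + random.expovariate(1 / scale) - random.expovariate(1 / scale)
            for cell, c in counts.items()}

random.seed(0)
data = [(random.randint(20, 80), random.randint(10, 200)) for _ in range(1000)]
noisy = dp_histogram(data, age_bins=[20, 40, 60, 81],
                     income_bins=[10, 100, 201], epsilon=0.5)
print(noisy)
```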
【Paper Link】 【Pages】:1309-1312
【Authors】: Roberto De Virgilio ; Giorgio Orsi ; Letizia Tanca ; Riccardo Torlone
【Abstract】: We present NYAYA, a flexible system for the management of large-scale semantic data which couples a general-purpose storage mechanism with efficient ontological query answering. NYAYA rapidly imports semantic data expressed in different formalisms into semantic data kiosks. Each kiosk exposes the native ontological constraints in a uniform fashion using Datalog±, a very general rule-based language for the representation of ontological constraints. A group of kiosks forms a semantic data market where the data in each kiosk can be uniformly accessed using conjunctive queries and where users can specify user-defined constraints over the data. NYAYA is easily extensible, robust to updates of both data and meta-data in the kiosks, and can readily adapt to different logical organizations of the persistent storage. In the demonstration, we will show the capabilities of NYAYA over real-world case studies and demonstrate its efficiency over well-known benchmarks.
【Keywords】: meta data; ontologies (artificial intelligence); query processing; semantic Web; storage management; NYAYA; conjunctive queries; data log; general-purpose storage mechanism; large-scale semantic data; meta-data; native ontological constraints; ontological query answering; semantic Web; semantic data kiosks; uniform management; Cognition; Electronic publishing; Encyclopedias; Internet; Ontologies; Semantics
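NYAYA above represents ontological constraints as Datalog± rules and answers conjunctive queries over kiosks. To convey the rule-based flavour only, here is a naive forward-chaining evaluator for plain Datalog with invented predicates; Datalog± additionally allows existential variables in rule heads and needs termination guarantees, neither of which is modelled here.

```python
# Hedged sketch: naive forward chaining over plain Datalog rules.
# Datalog+/- additionally allows existential variables in rule heads and
# needs termination guarantees (e.g. guardedness); none of that is modelled.
def match(atom, fact, binding):
    """Try to unify a rule atom with a ground fact under a partial binding."""
    pred, args = atom
    if pred != fact[0] or len(args) != len(fact[1]):
        return None
    binding = dict(binding)
    for a, value in zip(args, fact[1]):
        if a.isupper():                      # upper-case argument = variable
            if binding.setdefault(a, value) != value:
                return None
        elif a != value:                     # constant mismatch
            return None
    return binding

def forward_chain(facts, rules):
    facts, changed = set(facts), True
    while changed:
        changed = False
        for head, body in rules:
            bindings = [{}]
            for atom in body:                # join the body atoms against the facts
                bindings = [b2 for b in bindings for f in facts
                            if (b2 := match(atom, f, b)) is not None]
            for b in bindings:
                new = (head[0], tuple(b[a] if a.isupper() else a for a in head[1]))
                if new not in facts:
                    facts.add(new)
                    changed = True
    return facts

facts = [("researcher", ("ada",)), ("worksAt", ("ada", "uni"))]
rules = [(("person", ("X",)), [("researcher", ("X",))]),
         (("affiliatedWith", ("X", "Y")), [("person", ("X",)), ("worksAt", ("X", "Y"))])]
print(sorted(forward_chain(facts, rules)))
```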
【Paper Link】 【Pages】:1313-1316
【Authors】: Songling Liu ; Juan P. Cedeño ; K. Selçuk Candan ; Maria Luisa Sapino ; Shengyu Huang ; Xinsheng Li
【Abstract】: Existing RDF query languages and RDF stores fail to support a large class of knowledge applications that associate utilities or costs with the available knowledge statements. A recent proposal includes (a) a ranked RDF (R2DF) specification to enhance RDF triples with application-specific weights and (b) a SPARankQL query language specification, which provides novel primitives on top of the SPARQL language to express top-k queries using traditional query patterns as well as novel flexible path predicates. We introduce and demonstrate R2DB, a database system for querying weighted RDF graphs. R2DB relies on the AR2Q query processing engine, which leverages novel index structures to support efficient ranked path search and includes query optimization strategies based on proximity and sub-result inter-arrival times. In addition to being the first data management system for the R2DF data model, R2DB also provides an innovative features-of-interest (FoI) based method for visualizing large sets of query results (i.e., subgraphs of the data graph).
【Keywords】: data models; data visualisation; database indexing; graph theory; query languages; query processing; AR2Q query processing engine; FoI based method; R2DF data model; R2DF specification; R2DB; RDF query language; RDF stores; RDF triples; SPA rank QL query language specification; SPARQL language; application specific weights; data graph; data management system; database system; features-of-interest based method; flexible path predicates; index structure; knowledge application; knowledge statement; large set visualization; proximity inter-arrival times; query optimization strategy; query pattern; ranked RDF; ranked path search; resource description framework; subgraph; subresult inter-arrival times; top-k queries; weighted RDF graph querying; Data visualization; Database languages; Engines; Indexes; Pattern matching; Query processing; Resource description framework
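R2DB above supports ranked (top-k) path search over weighted RDF. The sketch below shows the general idea with a best-first search that yields the k lowest-weight paths between two resources in a small weighted triple graph; the graph, the weights, and the algorithm are illustrative and are not AR2Q's.

```python
# Hedged sketch: yield the k cheapest paths between two nodes of a weighted,
# directed triple graph using best-first search. Triples and weights invented.
import heapq
from collections import defaultdict

triples = [("alice", "knows", "bob", 0.2),
           ("bob", "knows", "carol", 0.5),
           ("alice", "worksWith", "carol", 1.0),
           ("carol", "knows", "dave", 0.3),
           ("bob", "worksWith", "dave", 0.9)]

def k_cheapest_paths(triples, source, target, k):
    adj = defaultdict(list)
    for s, p, o, w in triples:
        adj[s].append((o, p, w))
    heap, found = [(0.0, source, [])], []
    while heap and len(found) < k:
        total, node, path = heapq.heappop(heap)
        if node == target:
            found.append((total, path))
            continue
        if len(path) > len(triples):          # crude guard against cycles
            continue
        for nxt, pred, w in adj[node]:
            heapq.heappush(heap, (total + w, nxt, path + [(node, pred, nxt)]))
    return found

for total, path in k_cheapest_paths(triples, "alice", "dave", k=2):
    print(round(total, 2), path)
```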
【Paper Link】 【Pages】:1317-1320
【Authors】: Roger S. Barga ; Jaliya Ekanayake ; Wei Lu
【Abstract】: Spreadsheets are established data collection and analysis tools in business, technical computing, and academic research. Excel, for example, offers an attractive user interface, provides an easy-to-use data entry model, and offers substantial interactivity for what-if analysis. However, spreadsheets and other common client applications do not offer scalable computation for large-scale data analytics and exploration. Increasingly, researchers in domains ranging from the social sciences to the environmental sciences are faced with a deluge of data, often sitting in spreadsheets such as Excel or in other client applications, and they lack a convenient way to explore the data, to find related data sets, or to invoke scalable analytical models over the data. To address these limitations, we have developed a cloud data analytics service based on Daytona, an iterative MapReduce runtime optimized for data analytics. In our model, Excel and other existing client applications provide the data entry and user interaction surfaces, Daytona provides a scalable runtime in the cloud for data analytics, and our service seamlessly bridges the gap between client and cloud. Any analyst can use our data analytics service to discover and import data from the cloud, invoke cloud-scale data analytics algorithms to extract information from large datasets, invoke data visualization, and then store the data back to the cloud, all through a spreadsheet or other client application they are already familiar with.
【Keywords】: business data processing; cloud computing; data analysis; data visualisation; environmental factors; social sciences; spreadsheet programs; user interfaces; Excel; Project Daytona; academic research; business; cloud service; data analytics; data collection; data visualization; environmental sciences; information extraction; iterative MapReduce; social sciences; spreadsheets; technical computing; user interface; Algorithm design and analysis; Analytical models; Cloud computing; Computational modeling; Data models; Distributed databases; Runtime
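The service above is built on Daytona, described as an iterative MapReduce runtime. The following single-process sketch shows the iteration pattern this implies, with a driver that repeatedly runs a map phase and a reduce phase (here, one k-means step over 1-D data) until the model converges; the functions and data are invented, and this is not Daytona's API.

```python
# Hedged sketch of iterative MapReduce: k-means over 1-D points, where each
# iteration is one map phase (assign each point to its nearest centroid) and
# one reduce phase (recompute centroids). The driver loops until convergence.
from collections import defaultdict

def map_phase(points, centroids):
    for x in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        yield nearest, x                       # (key, value) pairs

def reduce_phase(pairs):
    groups = defaultdict(list)
    for key, x in pairs:
        groups[key].append(x)
    return {key: sum(xs) / len(xs) for key, xs in groups.items()}

def kmeans(points, centroids, max_iters=50, tol=1e-6):
    for _ in range(max_iters):
        new = reduce_phase(map_phase(points, centroids))
        updated = [new.get(i, c) for i, c in enumerate(centroids)]
        if max(abs(a - b) for a, b in zip(updated, centroids)) < tol:
            return updated
        centroids = updated
    return centroids

points = [1.0, 1.2, 0.8, 9.5, 10.1, 10.4, 5.0]
print(kmeans(points, centroids=[0.0, 5.0, 10.0]))
```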
【Paper Link】 【Pages】:1321-1324
【Authors】: Isabel F. Cruz ; Cosmin Stroe ; Matteo Palmonari
【Abstract】: When compared to a gold standard, the set of mappings generated by an automatic ontology matching process is neither complete nor are the individual mappings always correct. However, given the explosion in the number, size, and complexity of available ontologies, domain experts no longer have the capability to create ontology mappings without considerable effort. We present a solution to this problem that consists of making the ontology matching process interactive so as to incorporate user feedback in the loop. Our approach clusters mappings to identify where user feedback will be most beneficial in reducing the number of user interactions and system iterations. This feedback process has been implemented in the AgreementMaker system and is supported by visual analytic techniques that help users better understand the matching process. Experimental results using the OAEI benchmarks show the effectiveness of our approach. We will demonstrate how users can interact with the ontology matching process through the AgreementMaker user interface to match real-world ontologies.
【Keywords】: ontologies (artificial intelligence); pattern matching; OAEI benchmark; agreement maker system; automatic ontology matching process; clusters mapping; domain expert; interactive user feedback; signature vector; system iteration; user interaction; visual analytic technique; Benchmark testing; Conferences; Feedback loop; Ontologies; User interfaces; Vectors; Visual analytics
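The abstract above says AgreementMaker clusters candidate mappings to decide where user feedback pays off most. The sketch below illustrates one plausible selection criterion, not necessarily the paper's: rank candidate mappings by how strongly the individual matchers disagree and ask the user about the most contested ones. The matcher scores and concept names are invented.

```python
# Hedged sketch: pick the candidate ontology mappings on which automatic
# matchers disagree the most, as the ones worth asking the user about.
# Mappings, scores, and the disagreement measure are illustrative.
def disagreement(scores):
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)   # variance

candidates = {
    ("Author", "Writer"):         [0.92, 0.88, 0.90],  # matchers agree: likely correct
    ("Chapter", "Section"):       [0.75, 0.20, 0.55],  # matchers disagree: ask the user
    ("Publisher", "Institution"): [0.10, 0.15, 0.05],  # matchers agree: likely wrong
}

to_review = sorted(candidates, key=lambda m: disagreement(candidates[m]), reverse=True)
for mapping in to_review[:1]:
    print("request user feedback on:", mapping)
```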
【Paper Link】 【Pages】:1325-1328
【Authors】: Pawel Jurczyk ; Li Xiong ; Slawomir Goryczka
【Abstract】: The emergence of cloud computing implies and facilitates managing large collections of highly distributed, autonomous, and possibly private databases. While there is an increasing need for services that allow integration and sharing of various data repositories, it remains a challenge to ensure the privacy, interoperability, and scalability of such services. In this paper we demonstrate a scalable and extensible framework aimed at enabling privacy-preserving data federations. The framework is built on top of a distributed mediator-wrapper architecture in which nodes can form collaborative groups for secure anonymization and secure query processing when private data need to be accessed. New anonymization models and protocols that counter potential attacks in the distributed setting will be demonstrated.
【Keywords】: cloud computing; data privacy; distributed databases; open systems; protocols; query processing; security of data; DObjects+; anonymization security; cloud computing; collaborative groups nodes; data repository integration; data repository sharing; data service interoperability; data service privacy; data service scalability; distributed anonymization protocols; distributed mediator-wrapper architecture; distributed-autonomous-private databases; privacy-preserving data federation services; query processing security; Computer architecture; Data privacy; Distributed databases; Privacy; Protocols; Query processing
【Paper Link】 【Pages】:1329-1332
【Authors】: Kyriacos E. Pavlou ; Richard T. Snodgrass
【Abstract】: Regulations and societal expectations have recently emphasized the need to mediate access to valuable databases, even access by insiders. Fraud occurs when a person, often an insider, tries to hide illegal activity. Companies would like to be assured that such tampering has not occurred, or, if it does, that it will be quickly discovered and used to identify the perpetrator. At one end of the compliance spectrum lies the approach of restricting access to information; at the other lies information accountability. We focus on effecting information accountability for data stored in high-performance databases. The demonstrated work ensures appropriate use, and thus end-to-end accountability, of database information via a continuous assurance technology based on cryptographic hashing techniques. A prototype tamper detection and forensic analysis system named DRAGOON was designed and implemented to determine when any tampering occurred and what data were tampered with. DRAGOON is scalable, customizable, and intuitive. This work shows that information accountability is a viable alternative to information restriction for ensuring the correct storage, use, and maintenance of databases on extant DBMSes.
【Keywords】: authorisation; computer forensics; cryptography; database management systems; fraud; DBMSes; DRAGOON; compliance spectrum; continuous assurance technology; cryptographic hashing technique; database forensic analysis safeguard of Arizona; fraud; high-performance databases; information accountability system; prototype tamper detection; tampering; Algorithm design and analysis; Companies; Databases; Forensics; Graphical user interfaces; Monitoring; Security
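DRAGOON's continuous assurance above rests on cryptographic hashing of database state. Here is a minimal sketch of the underlying idea, assuming a simple hash chain over append-only audit records; DRAGOON's actual hashing and notarization scheme is more elaborate.

```python
# Hedged sketch: a hash chain over audit records. Each link hashes the
# previous digest together with the record, so silently altering any past
# record breaks every digest after it.
import hashlib
import json

def chain(records):
    digests, prev = [], b"genesis"
    for rec in records:
        h = hashlib.sha256(prev + json.dumps(rec, sort_keys=True).encode()).digest()
        digests.append(h)
        prev = h
    return digests

def verify(records, digests):
    return digests == chain(records)

audit_log = [{"txn": 1, "row": "acct_42", "new_balance": 100},
             {"txn": 2, "row": "acct_42", "new_balance": 90}]
trusted = chain(audit_log)                     # digests kept with a notary

audit_log[0]["new_balance"] = 1_000_000        # an insider tampers with history
print(verify(audit_log, trusted))              # -> False: tampering detected
```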
【Paper Link】 【Pages】:1333-1336
【Authors】: Kenneth P. Smith ; Ameet Kini ; William Wang ; Chris Wolf ; M. David Allen ; Andrew Sillers
【Abstract】: The encrypted execution of database queries promises powerful security protections; however, users are currently unlikely to benefit without significant expertise. In this demonstration, we illustrate a simple workflow enabling users to design secure executions of their queries. The demonstrated DataStorm system simplifies both the design and execution of encrypted execution plans, and it represents progress toward the challenge of developing a general planner for encrypted query execution.
【Keywords】: cryptography; query processing; DataStorm system; database queries; encrypted execution planning; encrypted query execution; secure queries execution; security protections; Databases; Encryption; Lattices; Planning; Servers
【Paper Link】 【Pages】:1337-1348
【Authors】: Yasin N. Silva ; Per-Åke Larson ; Jingren Zhou
【Abstract】: Many companies now routinely run massive data analysis jobs -- expressed in some scripting language -- on large clusters of low-end servers. Many analysis scripts are complex and contain common subexpressions, that is, intermediate results that are subsequently joined and aggregated in multiple different ways. Applying conventional optimization techniques to such scripts produces plans that execute a common subexpression multiple times, once for each consumer, which is clearly wasteful. Moreover, different consumers may have different physical requirements on the result: one consumer may want it partitioned on column A and another partitioned on column B. To find a truly optimal plan, the optimizer must trade off such conflicting requirements in a cost-based manner. In this paper we show how to extend a Cascades-style optimizer to correctly optimize scripts containing common subexpressions. The approach has been prototyped in SCOPE, Microsoft's system for massive data analysis. Experimental analysis of both simple and large real-world scripts shows that the extended optimizer produces plans with 21% to 57% lower estimated costs.
【Keywords】: authoring languages; cloud computing; costing; data analysis; optimisation; query processing; Microsoft system; SCOPE; analysis scripts; cascade-style optimizer; cloud query processing; common subexpressions; conflicting requirements; conventional optimization techniques; cost-based manner; estimated costs; extended optimizer; low-end servers; massive data analysis jobs; physical requirements; real-world scripts; scripting language; sub expression multiple times; Companies; Data analysis; History; Object recognition; Optimization; Query processing; USA Councils
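The paper above extends a Cascades-style optimizer to share common subexpressions across consumers. As a toy illustration of the first step, the sketch below finds repeated subtrees in an operator tree by hashing canonical forms so that a shared result could be computed once; the operators and script are invented, and the cost-based handling of conflicting partitioning requirements is omitted.

```python
# Hedged sketch: detect subexpressions that occur more than once in an
# operator tree so they could be computed once and shared. The operators and
# the script are invented; the partitioning trade-off is not modelled.
from collections import defaultdict

scan = ("SCAN", "clicks")
filtered = ("FILTER", "country = 'US'", scan)
by_user = ("AGG", "user", filtered)            # consumed twice below
plan = ("JOIN", ("TOPK", 10, by_user), ("AGG", "day", by_user))

def shared_subtrees(expr):
    counts = defaultdict(int)
    def walk(e):
        if isinstance(e, tuple):
            counts[e] += 1
            for child in e[1:]:
                walk(child)
    walk(expr)
    shared = {e for e, n in counts.items() if n > 1}
    def contains(root, e):
        return root == e or (isinstance(root, tuple)
                             and any(contains(c, e) for c in root[1:]))
    # report only maximal shared subtrees, not pieces nested inside them
    return [e for e in shared
            if not any(e != other and contains(other, e) for other in shared)]

for sub in shared_subtrees(plan):
    print("compute once, consume twice:", sub)
```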
【Paper Link】 【Pages】:1349-1350
【Authors】: Marcin Zukowski ; Mark van de Wiel ; Peter A. Boncz
【Abstract】: Vectorwise is a new entrant in the analytical database marketplace whose technology comes straight from innovations in the database research community in recent years. The product has since made waves due to its excellent performance in analytical customer workloads as well as benchmarks. We describe the history of Vectorwise, its basic architecture, and the experience of turning a technology developed in an academic context into a commercial-grade product. Finally, we turn our attention to recent performance results, most notably on the TPC-H benchmark at various sizes.
【Keywords】: SQL; query processing; relational databases; TPC-H benchmark; analytical database marketplace; commercial-grade product; database research community; query pipeline; vectorized analytical DBMS; vectorwise; Benchmark testing; Database systems; Engines; History; Linux; Technological innovation
【Paper Link】 【Pages】:1351-1359
【Authors】: Yuanyuan Tian ; Shirish Tatikonda ; Berthold Reinwald
【Abstract】: With the exponential growth in the amount of data that is being generated in recent years, there is a pressing need for applying machine learning algorithms to large data sets. SystemML is a framework that employs a declarative approach for large scale data analytics. In SystemML, machine learning algorithms are expressed as scripts in a high-level language, called DML, which is syntactically similar to R. DML scripts are compiled, optimized, and executed in the SystemML runtime that is built on top of MapReduce. As the basis of virtually every quantitative analysis, descriptive statistics provide powerful tools to explore data in SystemML. In this paper, we describe our experience in implementing descriptive statistics in SystemML. In particular, we elaborate on how to overcome the two major challenges: (1) achieving numerical stability while operating on large data sets in a distributed setting of MapReduce, and (2) designing scalable algorithms to compute order statistics in MapReduce. By empirically comparing to algorithms commonly used in existing tools and systems, we demonstrate the numerical accuracy achieved by SystemML. We also highlight the valuable lessons we have learned in this exercise.
【Keywords】: data analysis; learning (artificial intelligence); numerical stability; specification languages; statistical analysis; DML; MapReduce; SystemML; declarative approach; high-level language; large scale data analytics; machine learning; numerical stability; numerically stable descriptive statistics; order statistics; scalable descriptive statistics; Accuracy; Approximation algorithms; Correlation; Equations; Higher order statistics; Numerical stability; Standards
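The SystemML work above stresses numerically stable descriptive statistics in a distributed MapReduce setting. The sketch below shows the textbook building block, not SystemML's code: each partition emits a (count, mean, M2) partial and the partials are merged pairwise (Chan et al.), avoiding the catastrophic cancellation of the naive E[x^2] - E[x]^2 formula.

```python
# Hedged sketch: combiner-friendly, numerically stable variance. Each
# partition produces (count, mean, M2) and partials are merged pairwise,
# which is what a MapReduce combiner/reducer would do.
def partial(xs):
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:                      # Welford's one-pass update
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2

def merge(a, b):
    (na, ma, m2a), (nb, mb, m2b) = a, b
    n = na + nb
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2

data = [1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]   # large offset stresses the naive formula
left, right = partial(data[:2]), partial(data[2:])
n, mean, m2 = merge(left, right)
print("mean =", mean, "variance =", m2 / n)         # population variance: 22.5
```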
【Paper Link】 【Pages】:1360-1369
【Authors】: Michael Busch ; Krishna Gade ; Brian Larson ; Patrick Lok ; Samuel Luckenbill ; Jimmy Lin
【Abstract】: The web today is increasingly characterized by social and real-time signals, which we believe represent two frontiers in information retrieval. In this paper, we present Earlybird, the core retrieval engine that powers Twitter's real-time search service. Although Earlybird builds and maintains inverted indexes like nearly all modern retrieval engines, its index structures differ from those built to support traditional web search. We describe these differences and present the rationale behind our design. A key requirement of real-time search is the ability to ingest content rapidly and make it searchable immediately, while concurrently supporting low-latency, high-throughput query evaluation. These demands are met with a single-writer, multiple-reader concurrency model and the targeted use of memory barriers. Earlybird represents a point in the design space of real-time search engines that has worked well for Twitter's needs. By sharing our experiences, we hope to spur additional interest and innovation in this exciting space.
【Keywords】: Internet; query processing; real-time systems; search engines; social networking (online); Earlybird; Twitter; Web search; World Wide Web; information retrieval; query evaluation; real-time search; retrieval engines; Arrays; Indexing; Query processing; Real time systems; Twitter; Web search
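Earlybird's ingestion, as described above, uses a single-writer, multiple-reader model: the writer appends postings first and only then publishes a new searchable high-water mark, so readers never see partially written documents. The Python sketch below illustrates the publication discipline only; the real system is Java and depends on JVM memory-model guarantees and explicit memory barriers that Python does not expose.

```python
# Hedged sketch of single-writer / multiple-reader publication: the writer
# appends a document's postings first and bumps the searchable high-water
# mark last; readers only look at ids up to that mark. (Illustrative only.)
class TinyIndex:
    def __init__(self):
        self.postings = {}            # term -> list of doc ids
        self.max_searchable_doc = -1  # published high-water mark

    def add_document(self, doc_id, terms):       # called by the single writer
        for term in terms:
            self.postings.setdefault(term, []).append(doc_id)
        self.max_searchable_doc = doc_id          # publish *after* the writes

    def search(self, term):                       # called by many readers
        limit = self.max_searchable_doc           # read the mark once
        return [d for d in self.postings.get(term, []) if d <= limit]

index = TinyIndex()
index.add_document(0, ["realtime", "search"])
index.add_document(1, ["realtime", "index"])
print(index.search("realtime"))    # -> [0, 1]
```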
【Paper Link】 【Pages】:1370-1381
【Authors】: Aditya Auradkar ; Chavdar Botev ; Shirshanka Das ; Dave De Maagd ; Alex Feinberg ; Phanindra Ganti ; Lei Gao ; Bhaskar Ghosh ; Kishore Gopalakrishna ; Brendan Harris ; Joel Koshy ; Kevin Krawez ; Jay Kreps ; Shi Lu ; Sunil Nagaraj ; Neha Narkhede ; Sasha Pachev ; Igor Perisic ; Lin Qiao ; Tom Quiggle ; Jun Rao ; Bob Schulman ; Abraham Sebastian ; Oliver Seeliger ; Adam Silberstein ; Boris Shkolnik ; Chinmay Soman ; Roshan Sumbaly ; Kapil Surlaker ; Sajid Topiwala ; Cuong Tran ; Balaji Varadarajan ; Jemiah Westerman ; Zach White ; David Zhang ; Jason Zhang
【Abstract】: LinkedIn is among the largest social networking sites in the world. As the company has grown, our core data sets and request processing requirements have grown as well. In this paper, we describe a few selected data infrastructure projects at LinkedIn that have helped us accommodate this increasing scale. Most of these projects build on existing open source projects and are themselves available as open source. The projects covered in this paper include: (1) Voldemort: a scalable and fault-tolerant key-value store; (2) Databus: a framework for delivering database changes to downstream applications; (3) Espresso: a distributed data store that supports flexible schemas and secondary indexing; and (4) Kafka: a scalable and efficient messaging system for collecting various user activity events and log data.
【Keywords】: database indexing; distributed databases; public domain software; social networking (online); software fault tolerance; storage management; Data bus; Espresso; Kafka; LinkedIn; Voldemort; activity events; core data sets; data infrastructure projects; database change delivery; distributed data store; fault tolerant key-value store; flexible schemas; log data; messaging system; open source projects; request processing requirements; secondary indexing; social networking sites; Companies; Indexes; LinkedIn; Pipelines; Routing; Servers
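Among the LinkedIn systems above, Voldemort is a partitioned, fault-tolerant key-value store. The sketch below shows consistent hashing with virtual nodes and replicas, the placement scheme such stores commonly use; the node names, virtual-node count, and replication factor are invented, and this is not Voldemort's source.

```python
# Hedged sketch: consistent hashing with virtual nodes, a common placement
# scheme for partitioned key-value stores. All parameters are illustrative.
import bisect
import hashlib

def h(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=64, replicas=2):
        self.replicas = replicas
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def owners(self, key):
        start = bisect.bisect(self.keys, h(key)) % len(self.ring)
        owners, i = [], start
        while len(owners) < self.replicas:
            node = self.ring[i % len(self.ring)][1]
            if node not in owners:           # skip virtual nodes of the same host
                owners.append(node)
            i += 1
        return owners

ring = Ring(["node-a", "node-b", "node-c"])
for key in ["member:42", "member:77", "msg:9001"]:
    print(key, "->", ring.owners(key))
```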
【Paper Link】 【Pages】:1382-1393
【Authors】: Claudio Jossen ; Lukas Blunschi ; Magdalini Mori ; Donald Kossmann ; Kurt Stockinger
【Abstract】: This paper describes the meta-data warehouse of Credit Suisse, which has been in production since 2009. Like most other large organizations, Credit Suisse has a complex application landscape and several data warehouses in order to meet the information needs of its users. The problem addressed by the meta-data warehouse is to increase the agility and flexibility of the organization with regard to changes such as the development of a new business process, a new business analytics report, or the implementation of a new regulatory requirement. The meta-data warehouse supports these changes by providing services to search for information items in the data warehouses and to extract the lineage of information items. One difficulty in the design of such a meta-data warehouse is that there is no standard or well-known meta-data model that can be used to support such search services. Instead, the meta-data structures need to be flexible themselves and evolve with the changing IT landscape. This paper describes the current data structures and implementation of the Credit Suisse meta-data warehouse and shows how its services help to increase the flexibility of the whole organization. A series of example meta-data structures, use cases, and screenshots is given to illustrate the concepts used and the lessons learned, based on feedback from real business and IT users within Credit Suisse.
【Keywords】: data warehouses; feedback; information retrieval; meta data; organisational aspects; Credit Suisse; IT landscape; business analytics; business process; feedback; information item extraction; meta data structure; meta data warehouse; organization agility; organization flexibility; Data models; Data warehouses; Databases; Organizations; Resource description framework; Standards organizations
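A core service of the Credit Suisse meta-data warehouse above is extracting the lineage of information items. The sketch below treats lineage as reverse reachability over a metadata graph of "derived from" edges; the item names and edges are invented and do not reflect Credit Suisse's model.

```python
# Hedged sketch: lineage of a report column as reverse reachability over a
# metadata graph of "derived from" edges. All names are invented.
from collections import deque

derived_from = {
    "report.credit_risk.total": ["mart.exposure.amount", "mart.ratings.grade"],
    "mart.exposure.amount":     ["dwh.positions.notional", "dwh.fx.rate"],
    "mart.ratings.grade":       ["dwh.counterparty.rating"],
}

def lineage(item):
    seen, queue = [], deque([item])
    while queue:
        current = queue.popleft()
        for source in derived_from.get(current, []):
            if source not in seen:
                seen.append(source)
                queue.append(source)
    return seen

print(lineage("report.credit_risk.total"))
```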
【Paper Link】 【Pages】:1394-1404
【Authors】: Zhen Hua Liu ; Hui J. Chang ; Balasubramanyam Sthanikam
【Abstract】: The XQuery Update Facility (XQUF), which provides a declarative way of updating XML, has become a W3C Recommendation. The SQL/XML standard, on the other hand, defines XMLType as a column data type in the RDBMS environment and defines standard SQL/XML operators, such as XMLQuery(), to embed XQuery for querying XMLType columns in an RDBMS. Based on this SQL/XML standard, XML-enabled RDBMSs have become industrial-strength platforms for hosting XML applications in a standards-compliant way by providing XML storage and query capability. However, support for updating XML remained proprietary in RDBMSs until XQUF became a recommendation. XQUF is agnostic to how XML is stored, so propagation of actual updates to any persistent XML store is beyond its scope. In this paper, we show how XQUF can be incorporated into XMLQuery() to effectively update XML stored in XMLType columns in an XML-enabled RDBMS such as Oracle XMLDB. We present various compile-time and run-time optimisation techniques that show how XQUF can be implemented efficiently to declaratively update XML stored in an RDBMS. We present our approaches to optimising XQUF for two common physical XML storage models: native binary XML storage and relational decomposition of XML. Although our study was done using Oracle XMLDB, all of the presented optimisation techniques are generic to XML stores that need to support updates of persistent XML and are not specific to the Oracle XMLDB implementation.
【Keywords】: SQL; XML; optimisation; query processing; relational databases; Oracle XMLDB; SQL standard; W3C; XML capability support; XML enabled RDBMS; XML standard; XMLQuery(); XMLType; XQUF; XQuery update facility; compile time techniques; industrial strength platforms; physical XML storage models; run time optimisation techniques; Computational modeling; Indexes; Navigation; Optimization; Standards; Transforms; XML
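For readers unfamiliar with XQUF, the sketch below assembles a standard XQuery Update Facility updating expression as a string and hands it to a placeholder function; how such an expression is actually embedded in an XMLQuery() call over an XMLType column is Oracle-specific and is deliberately abstracted away here.

```python
# Hedged sketch: a standard XQuery Update Facility (XQUF) updating expression,
# assembled as a string. run_update() is a hypothetical stand-in for however
# the host system executes it against its persistent XML store.
xquf_update = """
insert node <phone type="mobile">555-0100</phone>
  as last into doc("customers.xml")/customers/customer[@id = "42"]/contacts
"""

def run_update(expression):
    # placeholder: a real system would parse, optimize, and apply the
    # resulting pending update list against the stored XML
    print("would execute XQUF expression:\n", expression.strip())

run_update(xquf_update)
```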
【Paper Link】 【Pages】:1405-1416
【Authors】: Souripriya Das ; Seema Sundara ; Matthew Perry ; Jagannathan Srinivasan ; Jayanta Banerjee ; Aravind Yalamanchi
【Abstract】: This paper describes the Semantic Indexing feature introduced in Oracle Database for indexing unstructured text (document) columns. This capability enables searching for concepts (such as people, places, organizations, and events), in addition to words or phrases, with further options for sense disambiguation and term expansion by consulting knowledge captured in OWL/RDF ontologies. The distinguishing aspects of our approach are: 1) Indexing: Instead of building a traditional inverted index of (annotated) token and/or named entity occurrences, we extract the entities, associations, and events present in text column data and store them as RDF named graphs in the Oracle Database Semantic Store. This base content can be further augmented with knowledge bases and inferred triples (obtained by applying domain-specific ontologies and rule bases). 2) Querying: Instead of relying on proprietary extensions for specifying a search, we allow users to specify a complete SPARQL query pattern that can capture arbitrarily complex relationships between query terms. We have implemented this feature by introducing a sem_contains SQL operator and the associated sem_indextype indexing scheme. The indexing scheme employs an extensible architecture that supports indexing of unstructured text using native as well as third-party text extraction tools. The paper presents a model for the semantic index and querying, describes the feature, and outlines its implementation, leveraging Oracle's native support for RDF/OWL storage, inferencing, and querying. We also report a study involving the use of this feature on a TREC collection of over 130,000 news articles.
【Keywords】: SQL; database indexing; knowledge representation languages; object-oriented databases; ontologies (artificial intelligence); query processing; text analysis; OWL-RDF ontologies; Oracle database semantic store; Oracle native support; RDF-OWL storage; SPARQL query pattern; TREC collection; association extraction; entities extraction; entity occurrences; information extraction tools; inverted index; knowledge bases; query terms; search specification; sem-contains SQL operator; sem-indextype indexing scheme; semantic indexing; sense disambiguation; term expansion; text column data; unstructured data SPARQL; unstructured text columns indexing; Data mining; Indexing; Knowledge based systems; Ontologies; Resource description framework; Semantics
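The abstract above introduces a sem_contains SQL operator whose predicate is a SPARQL graph pattern over entities extracted from text. The snippet below sketches what such a query might look like; the argument list of sem_contains is simplified and the table, prefixes, and pattern are invented, so the exact syntax accepted by the shipped operator may differ.

```python
# Hedged sketch: a semantic-index query as a SQL filter whose predicate is a
# SPARQL graph pattern over entities extracted from the text column. The
# table, the prefixes, and the sem_contains argument list are assumptions.
sparql_pattern = """
{ ?org rdf:type      ex:Organization .
  ?org ex:locatedIn  ex:Washington_DC . }
"""

sql = f"""
SELECT article_id, title
FROM   news_articles
WHERE  sem_contains(body, '{sparql_pattern.strip()}') = 1
"""

print(sql)   # a client would submit this string through its usual database driver
```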
【Paper Link】 【Pages】:1417-1428
【Authors】: Sonia Bergamaschi ; Matteo Interlandi ; Mario Longo ; Laura Po ; Maurizio Vincini
【Abstract】: The adoption of business intelligence technology in industry is growing rapidly. Business managers are not satisfied with ad hoc and static reports, and they ask for more flexible and easy-to-use data analysis tools. Recently, application interfaces that expand the range of operations available to the user, hiding the underlying complexity, have been developed. The paper presents eLog, a business intelligence solution designed and developed in collaboration between the database group of the University of Modena and Reggio Emilia and eBilling, an Italian SME supplier of solutions for the design, production, and automation of documentary processes for top Italian companies. eLog enables business managers to define OLAP reports by means of a web interface and to customize analysis indicators using a simple meta-language. The framework translates the user's reports into MDX queries and is able to automatically select the data cube suitable for each query. Over 140 medium and large companies have used the technological services of eBilling S.p.A. to manage their document flows. In particular, eLog services have been used by major Italian media and telecommunications companies and their foreign affiliates, such as Sky, Mediaset, H3G, and TIM Brasil. The largest customer can produce up to 30 million mail pieces within 6 months (about 200 GB of data in the relational DBMS). Over a period of 18 months, eLog may have to handle up to 150 million mail pieces (1 TB of data).
【Keywords】: business data processing; competitive intelligence; data analysis; data mining; query languages; relational databases; MDX queries; OLAP; Web interface; business intelligence; data analysis tool; data cube; eLog business solution; meta-language; relational DBMS; Companies; Data analysis; Data models; Data warehouses; Production; Servers
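eLog above translates user-defined reports into MDX queries over an automatically selected cube. The sketch below shows that translation step in miniature, with invented cube, dimension, and measure names; it is not eLog's meta-language or its cube-selection logic.

```python
# Hedged sketch: turning a simple report definition into an MDX query.
# Cube, dimension, and measure names are invented for illustration.
report = {
    "cube": "MailPieces",
    "measure": "Piece Count",
    "rows": ("Time", "Month"),
    "slicer": ("Customer", "Sky"),
}

def to_mdx(r):
    return (
        f"SELECT {{[Measures].[{r['measure']}]}} ON COLUMNS,\n"
        f"       [{r['rows'][0]}].[{r['rows'][1]}].Members ON ROWS\n"
        f"FROM [{r['cube']}]\n"
        f"WHERE ([{r['slicer'][0]}].[{r['slicer'][1]}])"
    )

print(to_mdx(report))
```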