ACM SIGMOD Conference 2011: Athens, Greece

Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011. ACM. 【DBLP Link】

Paper Num: 141 || Session Num: 28

Databases on new hardware 4

1. LazyFTL: a page-level flash translation layer optimized for NAND flash memory.

【Paper Link】 【Pages】:1-12

【Authors】: Dongzhe Ma ; Jianhua Feng ; Guoliang Li

【Abstract】: Flash is a type of electronically erasable programmable read-only memory (EEPROM), which has many advantages over traditional magnetic disks, such as lower access latency, lower power consumption, lack of noise, and shock resistance. However, due to its special characteristics, flash memory cannot be deployed directly in the place of traditional magnetic disks. The Flash Translation Layer (FTL) is a software layer built on raw flash memory that carries out garbage collection and wear leveling strategies and hides the special characteristics of flash memory from upper file systems by emulating a normal block device like magnetic disks. Most existing FTL schemes are optimized for some specific access patterns or bring about significant overhead of merge operations under certain circumstances. In this paper, we propose a novel FTL scheme named LazyFTL that exhibits low response latency and high scalability, and at the same time, eliminates the overhead of merge operations completely. Experimental results show that LazyFTL outperforms all the typical existing FTL schemes and is very close to the theoretically optimal solution. We also provide a basic design that assists LazyFTL to recover from system failures.

【Keywords】: LazyFTL; address translation; flash translation layer; garbage collection
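
The abstract above describes a page-level FTL: every logical page is remapped individually, so an update can be written out-of-place to any free physical page. Below is a minimal sketch of that page-level mapping idea under simplified assumptions (a plain dictionary as the mapping table, no garbage collection or wear leveling); it illustrates the general mechanism, not LazyFTL itself.

```python
# Generic page-level flash mapping sketch (illustrative; not the LazyFTL scheme).
class PageLevelFTL:
    def __init__(self, num_physical_pages):
        self.mapping = {}                             # logical page -> physical page
        self.free = list(range(num_physical_pages))   # free physical pages
        self.valid = {}                               # physical page -> logical page

    def write(self, logical_page, data, flash):
        """Out-of-place update: program a fresh page, remap, invalidate the old copy."""
        if not self.free:
            raise RuntimeError("no free pages: garbage collection would run here")
        old = self.mapping.get(logical_page)
        new = self.free.pop()
        flash[new] = data                             # program the new physical page
        self.mapping[logical_page] = new
        self.valid[new] = logical_page
        if old is not None:
            del self.valid[old]                       # old copy becomes stale

    def read(self, logical_page, flash):
        return flash[self.mapping[logical_page]]

flash = {}                                            # stand-in for raw NAND pages
ftl = PageLevelFTL(num_physical_pages=8)
ftl.write(3, b"v1", flash)
ftl.write(3, b"v2", flash)                            # rewrite lands on a different physical page
print(ftl.read(3, flash))                             # b'v2'
```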

2. Operation-aware buffer management in flash-based systems.

【Paper Link】 【Pages】:13-24

【Authors】: Yanfei Lv ; Bin Cui ; Bingsheng He ; Xuexuan Chen

【Abstract】: The inherent asymmetry of read and write speeds of flash memory poses great challenges for buffer management design. Most existing flash-based buffer management policies adopt disk-oriented strategies by giving a specific priority to dirty pages, while not fully exploiting the characteristics of the flash memory. In this paper, we propose a novel buffer replacement algorithm named FOR, which stands for Flash-based Operation-aware buffer Replacement. The core idea of FOR is based on novel operation-aware page weight determination for buffer replacement. The weight metric not only measures the locality of read/write operations on a page, but also takes the cost difference of read/write operations into account. We further develop an efficient implementation, FOR+, with the time complexity of O(1) for each operation. Experiments on synthetic and benchmark traces demonstrate the efficiency of the proposed strategy, which yields better performance compared with some state-of-the-art flash-based buffer management policies.

【Keywords】: SSD; buffer management; flash memory; operation aware
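
The weight metric sketched in the abstract combines read/write locality with the asymmetric cost of flash reads and writes. The sketch below is a generic cost-weighted eviction heuristic under assumed cost constants; the weight formula is illustrative only and is not the FOR or FOR+ policy from the paper.

```python
# Generic cost-aware buffer eviction sketch (illustrative; not the FOR/FOR+ policy).
COST_READ, COST_WRITE = 1.0, 4.0          # assumed asymmetry: a flash write costs more than a read

class BufferedPage:
    def __init__(self):
        self.read_hits = 0                # crude stand-in for read locality
        self.dirty = False                # evicting a dirty page forces a flash write

def eviction_weight(page):
    refetch_cost = COST_READ * page.read_hits              # likely cost of reading it back later
    writeback_cost = COST_WRITE if page.dirty else 0.0     # immediate cost of flushing it now
    return refetch_cost + writeback_cost

def choose_victim(pool):
    """Evict the page whose loss is cheapest under the combined read/write cost."""
    return min(pool, key=lambda pid: eviction_weight(pool[pid]))

pool = {1: BufferedPage(), 2: BufferedPage()}
pool[1].read_hits = 5                     # frequently re-read, clean page (weight 5.0)
pool[2].dirty = True                      # rarely read, dirty page (weight 4.0)
print(choose_victim(pool))                # 2
```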

3. SkimpyStash: RAM space skimpy key-value store on flash-based storage.

【Paper Link】 【Pages】:25-36

【Authors】: Biplob K. Debnath ; Sudipta Sengupta ; Jin Li

【Abstract】: We present SkimpyStash, a RAM space skimpy key-value store on flash-based storage, designed for high throughput, low latency server applications. The distinguishing feature of SkimpyStash is the design goal of extremely low RAM footprint at about 1 (± 0.5) byte per key-value pair, which is more aggressive than earlier designs. SkimpyStash uses a hash table directory in RAM to index key-value pairs stored in a log-structured manner on flash. To break the barrier of a flash pointer (say, 4 bytes) worth of RAM overhead per key, it "moves" most of the pointers that locate each key-value pair from RAM to flash itself. This is realized by (i) resolving hash table collisions using linear chaining, where multiple keys that resolve (collide) to the same hash table bucket are chained in a linked list, and (ii) storing the linked lists on flash itself with a pointer in each hash table bucket in RAM pointing to the beginning record of the chain on flash, hence incurring multiple flash reads per lookup. Two further techniques are used to improve performance: (iii) two-choice based load balancing to reduce wide variation in bucket sizes (hence, chain lengths and associated lookup times), and a bloom filter in each hash table directory slot in RAM to disambiguate the choice during lookup, and (iv) compaction procedure to pack bucket chain records contiguously onto flash pages so as to reduce flash reads during lookup. The average bucket size is the critical design parameter that serves as a powerful knob for making a continuum of tradeoffs between low RAM usage and low lookup latencies. Our evaluations on commodity server platforms with real-world data center applications show that SkimpyStash provides throughputs from few 10,000s to upwards of 100,000 get-set operations/sec.

【Keywords】: flash memory; indexing; key-value store; log-structured index; ram space efficient index
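
A minimal sketch of the RAM-directory-plus-flash-chain layout described above: each RAM bucket keeps only a pointer to the newest record of its chain, and every record appended to the log carries a back-pointer to the previous record of the same bucket, so a lookup walks the chain on "flash". The bucket count, hash function, and the in-memory list standing in for flash are simplifying assumptions; two-choice balancing, bloom filters, and compaction are omitted.

```python
# Simplified sketch of a RAM-skimpy, log-structured key-value store.
NUM_BUCKETS = 4

class SkimpyStore:
    def __init__(self):
        self.log = []                              # append-only log standing in for flash
        self.buckets = [None] * NUM_BUCKETS        # RAM: one chain-head offset per bucket

    def _bucket(self, key):
        return hash(key) % NUM_BUCKETS

    def put(self, key, value):
        b = self._bucket(key)
        # The new record points back to the previous head of this bucket's chain.
        self.log.append((key, value, self.buckets[b]))
        self.buckets[b] = len(self.log) - 1        # RAM stores only the new chain head

    def get(self, key):
        ptr = self.buckets[self._bucket(key)]
        while ptr is not None:                     # multiple "flash reads" per lookup
            k, v, prev = self.log[ptr]
            if k == key:
                return v
            ptr = prev
        raise KeyError(key)

store = SkimpyStore()
store.put("a", 1)
store.put("b", 2)
store.put("a", 3)            # the newer record shadows the older one during lookup
print(store.get("a"))        # 3
```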

4. Design and evaluation of main memory hash join algorithms for multi-core CPUs.

【Paper Link】 【Pages】:37-48

【Authors】: Spyros Blanas ; Yinan Li ; Jignesh M. Patel

【Abstract】: The focus of this paper is on investigating efficient hash join algorithms for modern multi-core processors in main memory environments. This paper dissects each internal phase of a typical hash join algorithm and considers different alternatives for implementing each phase, producing a family of hash join algorithms. Then, we implement these main memory algorithms on two radically different modern multi-processor systems, and carefully examine the factors that impact the performance of each method. Our analysis reveals some interesting results -- a very simple hash join algorithm is very competitive with the other, more complex methods. This simple join algorithm builds a shared hash table and does not partition the input relations. Its simplicity implies that it requires fewer parameter settings, thereby making it far easier for query optimizers and execution engines to use it in practice. Furthermore, the performance of this simple algorithm improves dramatically as the skew in the input data increases, and it quickly starts to outperform all other algorithms.

【Keywords】: hash join; main memory; multi-core
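
The "very simple" variant highlighted above builds one hash table over the build relation, without partitioning either input, and probes it with the other relation. A single-threaded sketch of that build/probe structure follows; the shared-table synchronization across cores that the paper actually studies is omitted.

```python
# No-partitioning hash join sketch: build a hash table on R, probe it with S.
from collections import defaultdict

def hash_join(build_rel, probe_rel, build_key, probe_key):
    table = defaultdict(list)                  # build phase
    for r in build_rel:
        table[r[build_key]].append(r)
    for s in probe_rel:                        # probe phase
        for r in table.get(s[probe_key], ()):
            yield {**r, **s}

R = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
S = [{"rid": 1, "amount": 10}, {"rid": 1, "amount": 20}, {"rid": 3, "amount": 5}]
print(list(hash_join(R, S, build_key="id", probe_key="rid")))   # two matches, both for id 1
```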

Query processing and optimization 4

5. Query optimization techniques for partitioned tables.

【Paper Link】 【Pages】:49-60

【Authors】: Herodotos Herodotou ; Nedyalko Borisov ; Shivnath Babu

【Abstract】: Table partitioning splits a table into smaller parts that can be accessed, stored, and maintained independent of one another. From their traditional use in improving query performance, partitioning strategies have evolved into a powerful mechanism to improve the overall manageability of database systems. Table partitioning simplifies administrative tasks like data loading, removal, backup, statistics maintenance, and storage provisioning. Query language extensions now enable applications and user queries to specify how their results should be partitioned for further use. However, query optimization techniques have not kept pace with the rapid advances in usage and user control of table partitioning. We address this gap by developing new techniques to generate efficient plans for SQL queries involving multiway joins over partitioned tables. Our techniques are designed for easy incorporation into bottom-up query optimizers that are in wide use today. We have prototyped these techniques in the PostgreSQL optimizer. An extensive evaluation shows that our partition-aware optimization techniques, with low optimization overhead, generate plans that can be an order of magnitude better than plans produced by current optimizers.

【Keywords】: partitioning; query optimization

6. CrowdDB: answering queries with crowdsourcing.

【Paper Link】 【Pages】:61-72

【Authors】: Michael J. Franklin ; Donald Kossmann ; Tim Kraska ; Sukriti Ramesh ; Reynold Xin

【Abstract】: Some queries cannot be answered by machines only. Processing such queries requires human input for providing information that is missing from the database, for performing computationally difficult functions, and for matching, ranking, or aggregating results based on fuzzy criteria. CrowdDB uses human input via crowdsourcing to process queries that neither database systems nor search engines can adequately answer. It uses SQL both as a language for posing complex queries and as a way to model data. While CrowdDB leverages many aspects of traditional database systems, there are also important differences. Conceptually, a major change is that the traditional closed-world assumption for query processing does not hold for human input. From an implementation perspective, human-oriented query operators are needed to solicit, integrate and cleanse crowdsourced data. Furthermore, performance and cost depend on a number of new factors including worker affinity, training, fatigue, motivation and location. We describe the design of CrowdDB, report on an initial set of experiments using Amazon Mechanical Turk, and outline important avenues for future work in the development of crowdsourced query processing systems.

【Keywords】: architecture; crowd; crowdsourcing; database; hybrid system; query processing

7. Skyline query processing over joins.

【Paper Link】 【Pages】:73-84

【Authors】: Akrivi Vlachou ; Christos Doulkeridis ; Neoklis Polyzotis

【Abstract】: This paper addresses the problem of efficiently computing the skyline set of a relational join. Existing techniques either require accessing all tuples of the input relations or demand specialized multi-dimensional access methods to generate the skyline join result. To avoid these inefficiencies, we introduce the novel SFSJ algorithm that fuses the identification of skyline tuples with the computation of the join. SFSJ is able to compute the correct skyline set by accessing only a subset of the input tuples, i.e., it has the property of early termination. SFSJ employs standard access methods for reading the input tuples and is readily implementable in an existing database system. Moreover, it can be used in pipelined execution plans, as it generates the skyline tuples progressively. Additionally, we formally analyze the performance of SFSJ and propose a novel strategy for accessing the input tuples that is proven to be optimal for SFSJ. Finally, we present an extensive experimental study that validates the effectiveness of SFSJ and demonstrates its advantages over existing techniques.

【Keywords】: join operator; skyline query

8. Efficient parallel skyline processing using hyperplane projections.

【Paper Link】 【Pages】:85-96

【Authors】: Henning Köhler ; Jing Yang ; Xiaofang Zhou

【Abstract】: The skyline of a set of multi-dimensional points (tuples) consists of those points for which no clearly better point exists in the given set, using component-wise comparison on domains of interest. Skyline queries, i.e., queries that involve computation of a skyline, can be computationally expensive, so it is natural to consider parallelized approaches which make good use of multiple processors. We approach this problem by using hyperplane projections to obtain useful partitions of the data set for parallel processing. These partitions not only ensure small local skyline sets, but enable efficient merging of results as well. Our experiments show that our method consistently outperforms similar approaches for parallel skyline computation, regardless of data distribution, and provides insights on the impacts of different optimization strategies.

【Keywords】: data partitioning; hyperplane projection; parallel processing; skyline query
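
Using the component-wise dominance definition given in the abstract above, a naive single-machine baseline is easy to state; the paper's contribution is partitioning this work across processors via hyperplane projections so that local skylines stay small and merge cheaply, which this sketch (assuming "smaller is better" on every dimension) does not attempt.

```python
# Naive skyline baseline under "smaller is better" on every dimension.
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

pts = [(1, 4), (2, 2), (3, 1), (3, 3), (4, 4)]
print(skyline(pts))   # [(1, 4), (2, 2), (3, 1)]; (3, 3) and (4, 4) are dominated by (2, 2)
```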

Schema mapping and data integration 4

9. Scalable query rewriting: a graph-based approach.

【Paper Link】 【Pages】:97-108

【Authors】: George Konstantinidis ; José Luis Ambite

【Abstract】: In this paper we consider the problem of answering queries using views, which is important for data integration, query optimization, and data warehouses. We consider its simplest form, conjunctive queries and views, which already is NP-complete. Our context is data integration, so we search for maximally-contained rewritings. By looking at the problem from a graph perspective we are able to gain better insight and develop an algorithm which compactly represents common patterns in the source descriptions, and (optionally) pushes some computation offline. This, together with other optimizations, results in experimental performance about two orders of magnitude faster than current state-of-the-art algorithms, rewriting queries using over 10000 views within seconds.

【Keywords】: data integration; information integration; query answering using views; query containment; query graphs; query reformulation

10. Automatic discovery of attributes in relational databases.

【Paper Link】 【Pages】:109-120

【Authors】: Meihui Zhang ; Marios Hadjieleftheriou ; Beng Chin Ooi ; Cecilia M. Procopiuc ; Divesh Srivastava

【Abstract】: In this work we design algorithms for clustering relational columns into attributes, i.e., for identifying strong relationships between columns based on the common properties and characteristics of the values they contain. For example, identifying whether a certain set of columns refers to telephone numbers versus social security numbers, or names of customers versus names of nations. Traditional relational database schema languages use very limited primitive data types and simple foreign key constraints to express relationships between columns. Object oriented schema languages allow the definition of custom data types; still, certain relationships between columns might be unknown at design time or they might appear only in a particular database instance. Nevertheless, these relationships are an invaluable tool for schema matching, and generally for better understanding and working with the data. Here, we introduce data oriented solutions (we do not consider solutions that assume the existence of any external knowledge) that use statistical measures to identify strong relationships between the values of a set of columns. Interpreting the database as a graph where nodes correspond to database columns and edges correspond to column relationships, we decompose the graph into connected components and cluster sets of columns into attributes. To test the quality of our solution, we also provide a comprehensive experimental evaluation using real and synthetic datasets.

【Keywords】: attribute discovery; schema matching
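
The graph step described above (columns as nodes, strong value-based relationships as edges, connected components as attributes) can be pictured with a small union-find sketch. The similarity measure below (Jaccard overlap of value sets against a fixed threshold) is a placeholder assumption standing in for the paper's statistical measures.

```python
# Cluster columns into attributes = connected components of a column-similarity graph.
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_columns(columns, threshold=0.3):
    parent = {name: name for name in columns}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path compression
            x = parent[x]
        return x
    for (n1, v1), (n2, v2) in combinations(columns.items(), 2):
        if jaccard(v1, v2) >= threshold:       # an edge puts both columns in one cluster
            parent[find(n1)] = find(n2)
    clusters = {}
    for name in columns:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())

cols = {
    "cust_phone": ["555-0101", "555-0102", "555-0103"],
    "contact_no": ["555-0102", "555-0103", "555-0199"],
    "nation":     ["FRANCE", "GERMANY", "JAPAN"],
}
print(cluster_columns(cols))   # [['cust_phone', 'contact_no'], ['nation']]
```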

11. Leveraging query logs for schema mapping generation in U-MAP.

【Paper Link】 【Pages】:121-132

【Authors】: Hazem Elmeleegy ; Ahmed K. Elmagarmid ; Jaewoo Lee

【Abstract】: In this paper, we introduce U-MAP, a new system for schema mapping generation. U-MAP builds upon and extends existing schema mapping techniques. However, it mitigates some key problems in this area, which have not been previously addressed. The key tenet of U-MAP is to exploit the usage information extracted from the query logs associated with the schemas being mapped. We describe our experience in applying our proposed system to realistic datasets from the retail and life sciences domains. Our results demonstrate the effectiveness and efficiency of U-MAP compared to traditional approaches.

【Keywords】: data exchange; data integration; query logs; schema mapping; usage

12. Designing and refining schema mappings via data examples.

【Paper Link】 【Pages】:133-144

【Authors】: Bogdan Alexe ; Balder ten Cate ; Phokion G. Kolaitis ; Wang Chiew Tan

【Abstract】: A schema mapping is a specification of the relationship between a source schema and a target schema. Schema mappings are fundamental building blocks in data integration and data exchange and, as such, obtaining the right schema mapping constitutes a major step towards the integration or exchange of data. Up to now, schema mappings have typically been specified manually or have been derived using mapping-design systems that automatically generate a schema mapping from a visual specification of the relationship between two schemas. We present a novel paradigm and develop a system for the interactive design of schema mappings via data examples. Each data example represents a partial specification of the semantics of the desired schema mapping. At the core of our system lies a sound and complete algorithm that, given a finite set of data examples, decides whether or not there exists a GLAV schema mapping (i.e., a schema mapping specified by Global-and-Local-As-View constraints) that "fits" these data examples. If such a fitting GLAV schema mapping exists, then our system constructs the "most general" one. We give a rigorous computational complexity analysis of the underlying decision problem concerning the existence of a fitting GLAV schema mapping, given a set of data examples. Specifically, we prove that this problem is complete for the second level of the polynomial hierarchy, hence, in a precise sense, harder than NP-complete. This worst-case complexity analysis notwithstanding, we conduct an experimental evaluation of our prototype implementation that demonstrates the feasibility of interactively designing schema mappings using data examples. In particular, our experiments show that our system achieves very good performance in real-life scenarios.

【Keywords】: data examples; data exchange; data integration; schema mappings

Data on the web 4

13. Apples and oranges: a comparison of RDF benchmarks and real RDF datasets.

【Paper Link】 【Pages】:145-156

【Authors】: Songyun Duan ; Anastasios Kementsietsidis ; Kavitha Srinivas ; Octavian Udrea

【Abstract】: The widespread adoption of the Resource Description Framework (RDF) for the representation of both open web and enterprise data is the driving force behind the increasing research interest in RDF data management. As RDF data management systems proliferate, so do benchmarks that test the scalability and performance of these systems under data and workloads with various characteristics. In this paper, we compare data generated with existing RDF benchmarks and data found in widely used real RDF datasets. The results of our comparison illustrate that existing benchmark data have little in common with real data. Therefore, any conclusions drawn from existing benchmark tests might not actually translate to expected behaviours in real settings. In terms of the comparison itself, we show that simple primitive data metrics are inadequate to flesh out the fundamental differences between real and benchmark data. We make two contributions in this paper: (1) To address the limitations of the primitive metrics, we introduce intuitive and novel metrics that can indeed highlight the key differences between distinct datasets; (2) To address the limitations of existing benchmarks, we introduce a new benchmark generator with the following novel characteristics: (a) the generator can use any (real or synthetic) dataset and convert it into a benchmark dataset; (b) the generator can generate data that mimic the characteristics of real datasets with user-specified data properties. On the technical side, we formulate the benchmark generation problem as an integer programming problem whose solution provides us with the desired benchmark datasets. To our knowledge, this is the first methodological study of RDF benchmarks, as well as the first attempt at generating RDF benchmarks in a principled way.

【Keywords】: RDF; benchmark

14. Efficient query answering in probabilistic RDF graphs.

【Paper Link】 【Pages】:157-168

【Authors】: Xiang Lian ; Lei Chen

【Abstract】: In this paper, we tackle the problem of efficiently answering queries on probabilistic RDF data graphs. Specifically, we model RDF data by probabilistic graphs, and an RDF query is equivalent to a search over subgraphs of probabilistic graphs that have high probabilities to match with a given query graph. To efficiently process queries on probabilistic RDF graphs, we propose effective pruning mechanisms, structural and probabilistic pruning. For the structural pruning, we carefully design synopses for vertex/edge labels by considering their distributions and other structural information, in order to improve the pruning power. For the probabilistic pruning, we derive a cost model to guide the pre-computation of probability upper bounds such that the query cost is expected to be low. We construct an index structure that integrates synopses/statistics for structural and probabilistic pruning, and propose an efficient approach to answer queries on probabilistic RDF graph data. The efficiency of our solutions has been verified through extensive experiments.

【Keywords】: RDF; probabilistic RDF graph; probabilistic data

15. Facet discovery for structured web search: a query-log mining approach.

【Paper Link】 【Pages】:169-180

【Authors】: Jeffrey Pound ; Stelios Paparizos ; Panayiotis Tsaparas

【Abstract】: In recent years, there has been a strong trend of incorporating results from structured data sources into keyword-based web search systems such as Bing or Amazon. When presenting structured data, facets are a powerful tool for navigating, refining, and grouping the results. For a given structured data source, a fundamental problem in supporting faceted search is finding an ordered selection of attributes and values that will populate the facets. This creates two sets of challenges. First, because of the limited screen real-estate, it is important that the top facets best match the anticipated user intent. Second, the huge scale of available data to such engines demands an automated unsupervised solution. In this paper, we model the user faceted-search behavior using the intersection of web query-logs with existing structured data. Since web queries are formulated as free-text queries, a challenge in our approach is the inherent ambiguity in mapping keywords to the different possible attributes of a given entity type. We present an automated solution that elicits user preferences on attributes and values, employing different disambiguation techniques ranging from simple keyword matching, to more sophisticated probabilistic models. We demonstrate experimentally the scalability of our solution by running it on over a thousand categories of diverse entity types and measure the facet quality with a real-user study.

【Keywords】: faceted search; facets; structured web-search

16. Schema-as-you-go: on probabilistic tagging and querying of wide tables.

【Paper Link】 【Pages】:181-192

【Authors】: Meiyu Lu ; Divyakant Agrawal ; Bing Tian Dai ; Anthony K. H. Tung

【Abstract】: The emergence of Web 2.0 has resulted in a huge amount of heterogeneous data that are contributed by a large number of users, engendering new challenges for data management and query processing. Given that the data are unified from various sources and accessed by numerous users, providing users with a unified mediated schema as data integration is insufficient. On one hand, a deterministic mediated schema restricts users' freedom to express queries in their preferred vocabulary; on the other hand, it is not realistic for users to remember the numerous attribute names that arise from integrating various data sources. As such, a user-oriented data management and query interface is required. In this paper, we propose an out-of-the-box approach that separates users' actions from database operations. This separating layer deals with the challenges from a semantic perspective. It interprets the semantics of each data value through tags that are provided by users, and then inserts the value into the database together with these tags. When querying the database, this layer also serves as a platform for retrieving data by interpreting the semantics of the queried tags from the users. Experiments are conducted to illustrate both the effectiveness and efficiency of our approach.

【Keywords】: dynamic instantiation; probabilistic tagging; wide table

Data privacy and security 4

17. No free lunch in data privacy.

【Paper Link】 【Pages】:193-204

【Authors】: Daniel Kifer ; Ashwin Machanavajjhala

【Abstract】: Differential privacy is a powerful tool for providing privacy-preserving noisy query answers over statistical databases. It guarantees that the distribution of noisy query answers changes very little with the addition or deletion of any tuple. It is frequently accompanied by popularized claims that it provides privacy without any assumptions about the data and that it protects against attackers who know all but one record. In this paper we critically analyze the privacy protections offered by differential privacy. First, we use a no-free-lunch theorem, which defines non-privacy as a game, to argue that it is not possible to provide privacy and utility without making assumptions about how the data are generated. Then we explain where assumptions are needed. We argue that privacy of an individual is preserved when it is possible to limit the inference of an attacker about the participation of the individual in the data generating process. This is different from limiting the inference about the presence of a tuple (for example, Bob's participation in a social network may cause edges to form between pairs of his friends, so that it affects more than just the tuple labeled as "Bob"). The definition of evidence of participation, in turn, depends on how the data are generated -- this is how assumptions enter the picture. We explain these ideas using examples from social network research as well as tabular data for which deterministic statistics have been previously released. In both cases the notion of participation varies, the use of differential privacy can lead to privacy breaches, and differential privacy does not always adequately limit inference about participation.

【Keywords】: differential privacy; privacy
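
For reference, the "noisy query answers" mechanism the abstract critiques is typically the Laplace mechanism: noise whose scale is the query's sensitivity divided by epsilon. A count query, whose sensitivity is 1, is the textbook case; the sketch below illustrates the guarantee under discussion, not the paper's no-free-lunch argument.

```python
# Laplace mechanism for an epsilon-differentially-private count (sensitivity 1).
import random

def laplace_noise(scale):
    # The difference of two independent exponentials is Laplace(0, scale).
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def noisy_count(records, predicate, epsilon):
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(scale=1.0 / epsilon)   # sensitivity of a count is 1

ages = [23, 35, 41, 29, 52, 61]
print(noisy_count(ages, lambda a: a >= 40, epsilon=0.5))     # 3 plus Laplace(0, 2) noise
```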

18. TrustedDB: a trusted hardware based database with privacy and data confidentiality.

【Paper Link】 【Pages】:205-216

【Authors】: Sumeet Bajaj ; Radu Sion

【Abstract】: TrustedDB is an outsourced database prototype that allows clients to execute SQL queries with privacy and under regulatory compliance constraints without having to trust the service provider. TrustedDB achieves this by leveraging server-hosted tamper-proof trusted hardware in critical query processing stages. TrustedDB does not limit the query expressiveness of supported queries. And, despite the cost overhead and performance limitations of trusted hardware, the costs per query are orders of magnitude lower than any (existing or) potential future software-only mechanisms. TrustedDB is built and runs on actual hardware, and its performance and costs are evaluated here.

【Keywords】: database; trusted hardware

19. Differentially private data cubes: optimizing noise sources and consistency.

【Paper Link】 【Pages】:217-228

【Authors】: Bolin Ding ; Marianne Winslett ; Jiawei Han ; Zhenhui Li

【Abstract】: Data cubes play an essential role in data analysis and decision support. In a data cube, data from a fact table is aggregated on subsets of the table's dimensions, forming a collection of smaller tables called cuboids. When the fact table includes sensitive data such as salary or diagnosis, publishing even a subset of its cuboids may compromise individuals' privacy. In this paper, we address this problem using differential privacy (DP), which provides provable privacy guarantees for individuals by adding noise to query answers. We choose an initial subset of cuboids to compute directly from the fact table, injecting DP noise as usual; and then compute the remaining cuboids from the initial set. Given a fixed privacy guarantee, we show that it is NP-hard to choose the initial set of cuboids so that the maximal noise over all published cuboids is minimized, or so that the number of cuboids with noise below a given threshold (precise cuboids) is maximized. We provide an efficient procedure with running time polynomial in the number of cuboids to select the initial set of cuboids, such that the maximal noise in all published cuboids will be within a factor (ln|L| + 1)^2 of the optimal, where |L| is the number of cuboids to be published, or the number of precise cuboids will be within a factor (1 - 1/e) of the optimal. We also show how to enforce consistency in the published cuboids while simultaneously improving their utility (reducing error). In an empirical evaluation on real and synthetic data, we report the amounts of error of different publishing algorithms, and show that our approaches outperform baselines significantly.

【Keywords】: OLAP; data cube; differential privacy; private data analysis

20. iReduct: differential privacy with reduced relative errors.

【Paper Link】 【Pages】:229-240

【Authors】: Xiaokui Xiao ; Gabriel Bender ; Michael Hay ; Johannes Gehrke

【Abstract】: Prior work in differential privacy has produced techniques for answering aggregate queries over sensitive data in a privacy-preserving way. These techniques achieve privacy by adding noise to the query answers. Their objective is typically to minimize absolute errors while satisfying differential privacy. Thus, query answers are injected with noise whose scale is independent of whether the answers are large or small. The noisy results for queries whose true answers are small therefore tend to be dominated by noise, which leads to inferior data utility. This paper introduces iReduct, a differentially private algorithm for computing answers with reduced relative error. The basic idea of iReduct is to inject different amounts of noise to different query results, so that smaller (larger) values are more likely to be injected with less (more) noise. The algorithm is based on a novel resampling technique that employs correlated noise to improve data utility. Performance is evaluated on an instantiation of iReduct that generates marginals, i.e., projections of multi-dimensional histograms onto subsets of their attributes. Experiments on real data demonstrate the effectiveness of our solution.

【Keywords】: differential privacy; privacy

Data consistency and parallel DB 4

21. A latency and fault-tolerance optimizer for online parallel query plans.

【Paper Link】 【Pages】:241-252

【Authors】: Prasang Upadhyaya ; YongChul Kwon ; Magdalena Balazinska

【Abstract】: We address the problem of making online, parallel query plans fault-tolerant: i.e., provide intra-query fault-tolerance without blocking. We develop an approach that not only achieves this goal but does so through the use of different fault-tolerance techniques at different operators within a query plan. Enabling each operator to use a different fault-tolerance strategy leads to a space of fault-tolerance plans amenable to cost-based optimization. We develop FTOpt, a cost-based fault-tolerance optimizer that automatically selects the best strategy for each operator in a query plan in a manner that minimizes the expected processing time with failures for the entire query. We implement our approach in a prototype parallel query-processing engine. Our experiments demonstrate that (1) there is no single best fault-tolerance strategy for all query plans, (2) often hybrid strategies that mix-and-match recovery techniques outperform any uniform strategy, and (3) our optimizer correctly identifies winning fault-tolerance configurations.

【Keywords】: database; long-running; management; online; parallel; queries; systems

22. ArrayStore: a storage manager for complex parallel array processing.

【Paper Link】 【Pages】:253-264

【Authors】: Emad Soroush ; Magdalena Balazinska ; Daniel L. Wang

【Abstract】: We present the design, implementation, and evaluation of ArrayStore, a new storage manager for complex, parallel array processing. ArrayStore builds on prior work in the area of multidimensional data storage, but considers the new problem of supporting a parallel and more varied workload comprising not only range-queries, but also binary operations such as joins and complex user-defined functions. This paper makes two key contributions. First, it examines several existing single-site storage management strategies and array partitioning strategies to identify which combination is best suited for the array-processing workload above. Second, it develops a new and efficient storage-management mechanism that enables parallel processing of operations that must access data from adjacent partitions. We evaluate ArrayStore on over 80GB of real data from two scientific domains and real operators used in these domains. We show that ArrayStore outperforms previously proposed storage management strategies in the context of its diverse target workload.

【Keywords】: overlap execution strategy; parallel databases; query processing; scientific databases

23. Fast checkpoint recovery algorithms for frequently consistent applications.

【Paper Link】 【Pages】:265-276

【Authors】: Tuan Cao ; Marcos Antonio Vaz Salles ; Benjamin Sowell ; Yao Yue ; Alan J. Demers ; Johannes Gehrke ; Walker M. White

【Abstract】: Advances in hardware have enabled many long-running applications to execute entirely in main memory. As a result, these applications have increasingly turned to database techniques to ensure durability in the event of a crash. However, many of these applications, such as massively multiplayer online games and main-memory OLTP systems, must sustain extremely high update rates - often hundreds of thousands of updates per second. Providing durability for these applications without introducing excessive overhead or latency spikes remains a challenge for application developers. In this paper, we take advantage of frequent points of consistency in many of these applications to develop novel checkpoint recovery algorithms that trade additional space in main memory for significantly lower overhead and latency. Compared to previous work, our new algorithms do not require any locking or bulk copies of the application state. Our experimental evaluation shows that one of our new algorithms attains nearly constant latency and reduces overhead by more than an order of magnitude for low to medium update rates. Additionally, in a heavily loaded main-memory transaction processing system, it still reduces overhead by more than a factor of two.

【Keywords】: checkpointing; database recovery; durability; logical logging; main-memory databases

24. Warding off the dangers of data corruption with amulet.

【Paper Link】 【Pages】:277-288

【Authors】: Nedyalko Borisov ; Shivnath Babu ; NagaPramod Mandagere ; Sandeep Uttamchandani

【Abstract】: Occasional corruption of stored data is an unfortunate byproduct of the complexity of modern systems. Hardware errors, software bugs, and mistakes by human administrators can corrupt important sources of data. The dominant practice to deal with data corruption today involves administrators writing ad hoc scripts that run data-integrity tests at the application, database, file-system, and storage levels. This manual approach is tedious, error-prone, and provides no understanding of the potential system unavailability and data loss if a corruption were to occur. We introduce the Amulet system that addresses the problem of verifying the correctness of stored data proactively and continuously. To our knowledge, Amulet is the first system that: (i) gives administrators a declarative language to specify their objectives regarding the detection and repair of data corruption; (ii) contains optimization and execution algorithms to ensure that the administrator's objectives are met robustly and with least cost, e.g., using pay-as-you-go cloud resources; and (iii) provides timely notification when corruption is detected, allowing proactive repair of corruption before it impacts users and applications. We describe the implementation and a comprehensive evaluation of Amulet for a database software stack deployed on an infrastructure-as-a-service cloud provider.

【Keywords】: data protection

Service oriented computing, data management in the cloud 5

25. Schedule optimization for data processing flows on the cloud.

【Paper Link】 【Pages】:289-300

【Authors】: Herald Kllapi ; Eva Sitaridi ; Manolis M. Tsangaris ; Yannis E. Ioannidis

【Abstract】: Scheduling data processing workflows (dataflows) on the cloud is a very complex and challenging task. It is essentially an optimization problem, very similar to query optimization, that is characteristically different from traditional problems in two aspects: Its space of alternative schedules is very rich, due to various optimization opportunities that cloud computing offers; its optimization criterion is at least two-dimensional, with monetary cost of using the cloud being at least as important as query completion time. In this paper, we study scheduling of dataflows that involve arbitrary data processing operators in the context of three different problems: 1) minimize completion time given a fixed budget, 2) minimize monetary cost given a deadline, and 3) find trade-offs between completion time and monetary cost without any a-priori constraints. We formulate these problems and present an approximate optimization framework to address them that uses resource elasticity in the cloud. To investigate the effectiveness of our approach, we incorporate the devised framework into a prototype system for dataflow evaluation and instantiate it with several greedy, probabilistic, and exhaustive search algorithms. Finally, through several experiments that we have conducted with the prototype elastic optimizer on numerous scientific and synthetic dataflows, we identify several interesting general characteristics of the space of alternative schedules as well as the advantages and disadvantages of the various search algorithms. The overall results are quite promising and indicate the effectiveness of our approach.

【Keywords】: cloud computing; dataflows; query optimization; scheduling

26. Zephyr: live migration in shared nothing databases for elastic cloud platforms.

【Paper Link】 【Pages】:301-312

【Authors】: Aaron J. Elmore ; Sudipto Das ; Divyakant Agrawal ; Amr El Abbadi

【Abstract】: Multitenant data infrastructures for large cloud platforms hosting hundreds of thousands of applications face the challenge of serving applications characterized by small data footprint and unpredictable load patterns. When such a platform is built on an elastic pay-per-use infrastructure, an added challenge is to minimize the system's operating cost while guaranteeing the tenants' service level agreements (SLA). Elastic load balancing is therefore an important feature to enable scale-up during high load while scaling down when the load is low. Live migration, a technique to migrate tenants with minimal service interruption and no downtime, is critical to allow lightweight elastic scaling. We focus on the problem of live migration in the database layer. We propose Zephyr, a technique to efficiently migrate a live database in a shared nothing transactional database architecture. Zephyr uses phases of on-demand pull and asynchronous push of data, requires minimal synchronization, results in no service unavailability and few or no aborted transactions, minimizes the data transfer overhead, provides ACID guarantees during migration, and ensures correctness in the presence of failures. We outline a prototype implementation using an open source relational database engine and present a thorough evaluation using various transactional workloads. Zephyr's efficiency is evident from the few tens of failed operations, 10-20% change in average transaction latency, minimal messaging, and no overhead during normal operation when migrating a live database.

【Keywords】: cloud computing; database migration; elastic data management; multitenancy; shared nothing architectures

27. Workload-aware database monitoring and consolidation.

【Paper Link】 【Pages】:313-324

【Authors】: Carlo Curino ; Evan P. C. Jones ; Samuel Madden ; Hari Balakrishnan

【Abstract】: In most enterprises, databases are deployed on dedicated database servers. Often, these servers are underutilized much of the time. For example, in traces from almost 200 production servers from different organizations, we see an average CPU utilization of less than 4%. This unused capacity can be potentially harnessed to consolidate multiple databases on fewer machines, reducing hardware and operational costs. Virtual machine (VM) technology is one popular way to approach this problem. However, as we demonstrate in this paper, VMs fail to adequately support database consolidation, because databases place a unique and challenging set of demands on hardware resources, which are not well-suited to the assumptions made by VM-based consolidation. Instead, our system for database consolidation, named Kairos, uses novel techniques to measure the hardware requirements of database workloads, as well as models to predict the combined resource utilization of those workloads. We formalize the consolidation problem as a non-linear optimization program, aiming to minimize the number of servers and balance load, while achieving near-zero performance degradation. We compare Kairos against virtual machines, showing up to a factor of 12× higher throughput on a TPC-C-like benchmark. We also tested the effectiveness of our approach on real-world data collected from production servers at Wikia.com, Wikipedia, Second Life, and MIT CSAIL, showing absolute consolidation ratios ranging between 5.5:1 and 17:1.

【Keywords】: consolidation; multi-tenant databases

28. Predicting cost amortization for query services.

【Paper Link】 【Pages】:325-336

【Authors】: Verena Kantere ; Debabrata Dash ; Georgios Gratsias ; Anastasia Ailamaki

【Abstract】: Emerging providers of online services offer access to data collections. Such data service providers need to build data structures, e.g. materialized views and indexes, in order to offer better performance for user query execution. The cost of such structures is charged to the user as part of the overall query service cost. In order to ensure the economic viability of the provider, the building and maintenance cost of new structures has to be amortized to a set of prospective query services that will use them. This work proposes a novel stochastic model that predicts the extent of cost amortization in time and number of services. The model is completed with a novel method that regresses query traffic statistics and provides input to the prediction model. In order to demonstrate the effectiveness of the prediction model, we study its application on an extension of an existing economy model for the management of a cloud DBMS. A thorough experimental study shows that the prediction model ensures the economic viability of the cloud DBMS while enabling the offer of fast and cheap query services.

【Keywords】: cloud data services; cost amortization; prediction models

29. Performance prediction for concurrent database workloads.

【Paper Link】 【Pages】:337-348

【Authors】: Jennie Duggan ; Ugur Çetintemel ; Olga Papaemmanouil ; Eli Upfal

【Abstract】: Current trends in data management systems, such as cloud and multi-tenant databases, are leading to data processing environments that concurrently execute heterogeneous query workloads. At the same time, these systems need to satisfy diverse performance expectations. In these newly-emerging settings, avoiding potential Quality-of-Service (QoS) violations heavily relies on performance predictability, i.e., the ability to estimate the impact of concurrent query execution on the performance of individual queries in a continuously evolving workload. This paper presents a modeling approach to estimate the impact of concurrency on query performance for analytical workloads. Our solution relies on the analysis of query behavior in isolation, pairwise query interactions and sampling techniques to predict resource contention under various query mixes and concurrency levels. We introduce a simple yet powerful metric that accurately captures the joint effects of disk and memory contention on query performance in a single value. We also discuss predicting the execution behavior of a time-varying query workload through query-interaction timelines, i.e., a fine-grained estimation of the time segments during which discrete mixes will be executed concurrently. Our experimental evaluation on top of PostgreSQL/TPC-H demonstrates that our models can provide query latency predictions within approximately 20% of the actual values in the average case.

【Keywords】: concurrency; query performance prediction

Spatial and temporal data management 5

30. Reverse spatial and textual k nearest neighbor search.

【Paper Link】 【Pages】:349-360

【Authors】: Jiaheng Lu ; Ying Lu ; Gao Cong

【Abstract】: Geographic objects associated with descriptive texts are becoming prevalent. This gives prominence to spatial keyword queries that take into account both the locations and textual descriptions of content. Specifically, the relevance of an object to a query is measured by spatial-textual similarity that is based on both spatial proximity and textual similarity. In this paper, we define Reverse Spatial Textual k Nearest Neighbor (RSTkNN) query, i.e., finding objects that take the query object as one of their k most spatial-textual similar objects. Existing works on reverse kNN queries focus solely on spatial locations but ignore text relevance. To answer RSTkNN queries efficiently, we propose a hybrid index tree called IUR-tree (Intersection-Union R-Tree) that effectively combines location proximity with textual similarity. Based on the IUR-tree, we design a branch-and-bound search algorithm. To further accelerate the query processing, we propose an enhanced variant of the IUR-tree called clustered IUR-tree and two corresponding optimization algorithms. Empirical studies show that the proposed algorithms offer scalability and are capable of excellent performance.

【Keywords】: reverse k nearest neighbor; spatial-keyword query

31. Location-aware type ahead search on spatial databases: semantics and efficiency.

【Paper Link】 【Pages】:361-372

【Authors】: Senjuti Basu Roy ; Kaushik Chakrabarti

【Abstract】: Users often search spatial databases like yellow page data using keywords to find businesses near their current location. Typing the entire query is cumbersome and prone to errors, especially from mobile phones. We address this problem by introducing type-ahead search functionality on spatial databases. Like keyword search on spatial data, type-ahead search needs to be location-aware, i.e., with every letter being typed, it needs to return spatial objects whose names (or descriptions) are valid completions of the query string typed so far, and which rank highest in terms of proximity to the user's location and other static scores. Existing solutions for type-ahead search cannot be used directly as they are not location-aware. We show that a straightforward combination of existing techniques for performing type-ahead search with those for performing proximity search performs poorly. We propose a formal model for query processing cost and develop novel techniques that optimize that cost. Our empirical evaluations on real and synthetic datasets demonstrate the effectiveness of our techniques. To the best of our knowledge, this is the first work on location-aware type-ahead search.

【Keywords】: type ahead search

32. Collective spatial keyword querying.

【Paper Link】 【Pages】:373-384

【Authors】: Xin Cao ; Gao Cong ; Christian S. Jensen ; Beng Chin Ooi

【Abstract】: With the proliferation of geo-positioning and geo-tagging, spatial web objects that possess both a geographical location and a textual description are gaining in prevalence, and spatial keyword queries that exploit both location and textual description are gaining in prominence. However, the queries studied so far generally focus on finding individual objects that each satisfy a query rather than finding groups of objects where the objects in a group collectively satisfy a query. We define the problem of retrieving a group of spatial web objects such that the group's keywords cover the query's keywords and such that objects are nearest to the query location and have the lowest inter-object distances. Specifically, we study two variants of this problem, both of which are NP-complete. We devise exact solutions as well as approximate solutions with provable approximation bounds to the problems. We present empirical studies that offer insight into the efficiency and accuracy of the solutions.

【Keywords】: spatial group keyword query; spatial keyword query
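
A natural baseline for the "group collectively covers the query keywords at low cost" formulation above is a greedy weighted-cover heuristic: repeatedly add the object that covers the most still-uncovered keywords per unit of distance to the query location. The sketch below is a generic baseline under that assumption, not one of the paper's exact or approximation algorithms.

```python
# Greedy cover baseline for collective spatial keyword querying.
import math

def greedy_group(objects, query_loc, query_keywords):
    uncovered = set(query_keywords)
    group = []
    while uncovered:
        def gain(obj):
            covered = len(uncovered & obj["keywords"])
            return covered / (math.dist(query_loc, obj["loc"]) + 1e-9)
        best = max((o for o in objects if uncovered & o["keywords"]),
                   key=gain, default=None)
        if best is None:
            return None                         # the query keywords cannot be covered
        group.append(best["name"])
        uncovered -= best["keywords"]
    return group

objs = [
    {"name": "cafe",  "loc": (1, 1), "keywords": {"coffee", "wifi"}},
    {"name": "print", "loc": (2, 0), "keywords": {"printing"}},
    {"name": "hub",   "loc": (9, 9), "keywords": {"coffee", "wifi", "printing"}},
]
print(greedy_group(objs, (0, 0), {"coffee", "printing"}))   # ['cafe', 'print']
```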

33. Finding semantics in time series.

【Paper Link】 【Pages】:385-396

【Authors】: Peng Wang ; Haixun Wang ; Wei Wang

【Abstract】: In order to understand a complex system, we analyze its output or its log data. For example, we track a system's resource consumption (CPU, memory, message queues of different types, etc) to help avert system failures; we examine economic indicators to assess the severity of a recession; we monitor a patient's heart rate or EEG for disease diagnosis. Time series data is involved in many such applications. Much work has been devoted to pattern discovery from time series data, but not much has attempted to use the time series data to unveil a system's internal dynamics. In this paper, we go beyond learning patterns from time series data. We focus on obtaining a better understanding of its data generating mechanism, and we regard patterns and their temporal relations as organic components of the hidden mechanism. Specifically, we propose to model time series data using a novel pattern-based hidden Markov model (pHMM), which aims at revealing a global picture of the system that generates the time series data. We propose an iterative approach to refine pHMMs learned from the data. In each iteration, we use the current pHMM to guide time series segmentation and clustering, which enables us to learn a more accurate pHMM. Furthermore, we propose three pruning strategies to speed up the refinement process. Empirical results on real datasets demonstrate the feasibility and effectiveness of the proposed approach.

【Keywords】: hidden Markov model

34. Querying contract databases based on temporal behavior.

【Paper Link】 【Pages】:397-408

【Authors】: Elio Damaggio ; Alin Deutsch ; Dayou Zhou

【Abstract】: Considering a broad definition for service contracts (beyond web services and software, e.g. airline tickets and insurance policies), we tackle the challenges of building a high performance broker in which contracts are both specified and queried through their temporal behavior. The temporal dimension, in conjunction with traditional relational attributes, enables our system to better address difficulties arising from the great deal of information regarding the temporal interaction of the various events cited in contracts (e.g. "No refunds are allowed after a reschedule of the flight, which can be requested only before any flight leg has been used"). On the other hand, querying large repositories of temporal specifications poses an interesting indexing challenge. In this paper, we introduce two distinct and complementary indexing techniques that enable our system to scale the evaluation of a novel and theoretically sound notion of permission of a temporal query by a service contract. Our notion of permission is inspired by previous work on model checking but, given the specific characteristic of our problem, does not reduce to it. We evaluate experimentally our implementation, showing that it scales well with both the number and the complexity of the contracts.

【Keywords】: LTL; Büchi automata; contracts; indexing; linear temporal logic; temporal behavior

Shortest paths and sequence data 5

35. Neighborhood-privacy protected shortest distance computing in cloud.

【Paper Link】 【Pages】:409-420

【Authors】: Jun Gao ; Jeffrey Xu Yu ; Ruoming Jin ; Jiashuai Zhou ; Tengjiao Wang ; Dongqing Yang

【Abstract】: With the advent of cloud computing, it becomes desirable to utilize cloud computing to efficiently process complex operations on large graphs without compromising their sensitive information. This paper studies shortest distance computing in the cloud, which aims at the following goals: i) preventing outsourced graphs from neighborhood attack, ii) preserving shortest distances in outsourced graphs, iii) minimizing overhead on the client side. The basic idea of this paper is to transform an original graph G into a link graph Gl kept locally and a set of outsourced graphs Go. Each outsourced graph should meet the requirement of a new security model called 1-neighborhood-d-radius. In addition, the shortest distance query can be answered using Gl and Go. Our objective is to minimize the space cost on the client side when both security and utility requirements are satisfied. We devise a greedy method to produce Gl and Go, which can exactly answer the shortest distance queries. We also develop an efficient transformation method to support approximate shortest distance answering under a given additive error bound. The final experimental results illustrate the effectiveness and efficiency of our method.

【Keywords】: graph transformation; outsource; privacy; shortest distance

36. On k-skip shortest paths.

【Paper Link】 【Pages】:421-432

【Authors】: Yufei Tao ; Cheng Sheng ; Jian Pei

【Abstract】: Given two vertices s, t in a graph, let P be the shortest path (SP) from s to t, and P* a subset of the vertices in P. P* is a k-skip shortest path from s to t, if it includes at least one vertex out of every k consecutive vertices in P. In general, P* succinctly describes P by sampling the vertices in P with a rate of at least 1/k. This makes P* a natural substitute in scenarios where reporting every single vertex of P is unnecessary or even undesired. This paper studies k-skip SP computation in the context of spatial network databases (SNDB). Our technique has two properties crucial for real-time query processing in SNDB. First, our solution is able to answer k-skip queries significantly faster than finding the original SPs in their entirety. Second, the previous objective is achieved with a structure that occupies less space than storing the underlying road network. The proposed algorithms are the outcome of a careful theoretical analysis that reveals valuable insight into the characteristics of the k-skip SP problem. Their efficiency has been confirmed by extensive experiments with real data.

【Keywords】: k-skip; road network; shortest path
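
The k-skip definition above (keep at least one vertex out of every k consecutive vertices of the shortest path) can be illustrated by computing the full shortest path and then sampling it. The point of the paper is precisely that a k-skip path can be found faster than materializing the whole path first; this naive sketch does not attempt that.

```python
# Naive illustration of a k-skip shortest path: compute the full SP, then sample it.
import heapq

def shortest_path(graph, s, t):
    """Dijkstra over an adjacency dict {u: [(v, weight), ...]}; returns the SP vertex list."""
    dist, prev, pq = {s: 0}, {}, [(0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == t:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, u = [t], t
    while u != s:
        u = prev[u]
        path.append(u)
    return path[::-1]

def k_skip(path, k):
    sampled = path[::k]                 # at least one vertex in every k consecutive vertices
    if sampled[-1] != path[-1]:
        sampled.append(path[-1])        # always keep the destination
    return sampled

g = {"a": [("b", 1)], "b": [("c", 1)], "c": [("d", 1)], "d": [("e", 1)], "e": []}
print(k_skip(shortest_path(g, "a", "e"), k=2))   # ['a', 'c', 'e']
```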

37. Finding shortest path on land surface.

【Paper Link】 【Pages】:433-444

【Authors】: Lian Liu ; Raymond Chi-Wing Wong

【Abstract】: Finding shortest paths is a fundamental operator in spatial databases. Recently, terrain datasets have attracted a lot of attention from both industry and academia. There are some interesting issues to be studied in terrain datasets which cannot be found in a traditional two-dimensional space. In this paper, we study one of the issues called a slope constraint which exists in terrain datasets. In this paper, we propose a problem of finding shortest paths with the slope constraint. Then, we show that this new problem is more general than the traditional problem of finding shortest paths without considering the slope constraint. Since finding shortest paths with the slope constraint is costly, we propose a new framework called surface simplification so that we can compute shortest paths with the slope constraint efficiently. Under this framework, the surface is "simplified" such that the complexity of finding shortest paths on this simplified surface is lower. We conducted experiments to show that the surface simplification is very efficient and effective not only for the new problem with the slope constraint but also the traditional problem without the slope constraint.

【Keywords】: land surface; shortest path; spatial databases; surface simplification; terrain; triangular irregular network (TIN) model

38. WHAM: a high-throughput sequence alignment method.

【Paper Link】 【Pages】:445-456

【Authors】: Yinan Li ; Allison Terrell ; Jignesh M. Patel

【Abstract】: Over the last decade the cost of producing genomic sequences has dropped dramatically due to the current so called "next-gen" sequencing methods. However, these next-gen sequencing methods are critically dependent on fast and sophisticated data processing methods for aligning a set of query sequences to a reference genome using rich string matching models. The focus of this work is on the design, development and evaluation of a data processing system for this crucial "short read alignment" problem. Our system, called WHAM, employs novel hash-based indexing methods and bitwise operations for sequence alignments. It allows richer match models than existing methods and it is significantly faster than the existing state-of-the-art method. In addition, its relative speedup over the existing method is poised to increase in the future in which read sequence lengths will increase. The WHAM code is available at http://www.cs.wisc.edu/wham/.

【Keywords】: approximate string matching; bit-parallelism; sequence alignment
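
【Example】: As a hedged illustration of the bit-parallel flavor of such aligners, the classic Shift-And algorithm below finds exact occurrences of a short pattern using only shifts and bitwise ANDs; WHAM's actual match models (mismatches, richer scoring) are considerably more general.

def shift_and(text, pattern):
    # one bit mask per character: bit i is set if pattern[i] == ch
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept, state, hits = 1 << (len(pattern) - 1), 0, []
    for pos, ch in enumerate(text):
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos - len(pattern) + 1)
    return hits

print(shift_and("ACGTACGACGT", "ACG"))  # [0, 4, 7]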

39. A new approach for processing ranked subsequence matching based on ranked union.

Paper Link】 【Pages】:457-468

【Authors】: Wook-Shin Han ; Jinsoo Lee ; Yang-Sae Moon ; Seung-won Hwang ; Hwanjo Yu

【Abstract】: Ranked subsequence matching finds top-k subsequences most similar to a given query sequence from data sequences. Recently, Han et al. [12] proposed a solution (referred to here as HLMJ) to this problem by using the concept of the minimum distance matching window pair (MDMWP) and a global priority queue. By using the concept of MDMWP, HLMJ can prune many unnecessary accesses to data subsequences using a lower bound distance. However, we notice that HLMJ may incur serious performance overhead for important types of queries. In this paper, we propose a novel systematic framework to solve this problem by viewing ranked subsequence matching as ranked union. Specifically, we propose a notion of the matching subsequence equivalence class (MSEQ) and a novel lower bound called the MSEQ-distance. To completely eliminate the performance problem of HLMJ, we also propose a cost-aware density-based scheduling technique, where we consider both the density and cost of the priority queue. Extensive experimental results with many real datasets show that the proposed algorithm outperforms HLMJ and the adapted PSM [22], a state-of-the-art index-based merge algorithm supporting non-monotonic distance functions, by up to two to three orders of magnitude, respectively.

【Keywords】: ranked subsequence matching; ranked union; time-series data

Data provenance, workflow and cleaning 4

40. Interaction between record matching and data repairing.

Paper Link】 【Pages】:469-480

【Authors】: Wenfei Fan ; Jianzhong Li ; Shuai Ma ; Nan Tang ; Wenyuan Yu

【Abstract】: Central to a data cleaning system are record matching and data repairing. Matching aims to identify tuples that refer to the same real-world object, and repairing is to make a database consistent by fixing errors in the data by using constraints. These are treated as separate processes in current data cleaning systems, based on heuristic solutions. This paper studies a new problem, namely, the interaction between record matching and data repairing. We show that repairing can effectively help us identify matches, and vice versa. To capture the interaction, we propose a uniform framework that seamlessly unifies repairing and matching operations, to clean a database based on integrity constraints, matching rules and master data. We give a full treatment of fundamental problems associated with data cleaning via matching and repairing, including the static analyses of constraints and rules taken together, and the complexity, termination and determinism analyses of data cleaning. We show that these problems are hard, ranging from NP- or coNP-complete, to PSPACE-complete. Nevertheless, we propose efficient algorithms to clean data via both matching and repairing. The algorithms find deterministic fixes and reliable fixes based on confidence and entropy analysis, respectively, which are more accurate than possible fixes generated by heuristics. We experimentally verify that our techniques significantly improve the accuracy of record matching and data repairing taken as separate processes, using real-life data.

【Keywords】: conditional functional dependency; data cleaning; matching dependency

41. We challenge you to certify your updates.

Paper Link】 【Pages】:481-492

【Authors】: Su Chen ; Xin Luna Dong ; Laks V. S. Lakshmanan ; Divesh Srivastava

【Abstract】: Correctness of data residing in a database is vital. While integrity constraint enforcement can often ensure data consistency, it is inadequate to protect against updates that involve careless, unintentional errors, e.g., whether a specified update to an employee's record was for the intended employee. We propose a novel approach that is complementary to existing integrity enforcement techniques, to guard against such erroneous updates. Our approach is based on (a) updaters providing an update certificate with each database update, and (b) the database system verifying the correctness of the update certificate provided before performing the update. We formalize a certificate as a (challenge, response) pair, and characterize good certificates as those that are easy for updaters to provide and, when correct, give the system enough confidence that the update was indeed intended. We present algorithms that efficiently enumerate good challenges, without exhaustively exploring the search space of all challenges. We experimentally demonstrate that (i) databases have many good challenges, (ii) these challenges can be efficiently identified, (iii) certificates can be quickly verified for correctness, (iv) under natural models of an updater's knowledge of the database, update certificates catch a high percentage of the erroneous updates without imposing undue burden on the updaters performing correct updates, and (v) our techniques are robust across a wide range of challenge parameter settings.

【Keywords】: certificate; data quality; update
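
【Example】: A minimal sketch of the certify-then-update idea, using an in-memory SQLite table; the table, the challenge query, and the values are invented for illustration and are not the paper's algorithms for choosing good challenges.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp(id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary INT)")
conn.execute("INSERT INTO emp VALUES (7, 'Ann', 'Sales', 50000)")

def certified_update(emp_id, new_salary, response):
    # challenge: the updater must state the department of the targeted employee
    (truth,) = conn.execute("SELECT dept FROM emp WHERE id = ?", (emp_id,)).fetchone()
    if response != truth:
        return False                      # certificate rejected, update refused
    conn.execute("UPDATE emp SET salary = ? WHERE id = ?", (new_salary, emp_id))
    return True

print(certified_update(7, 55000, "Marketing"))  # False: likely the wrong employee
print(certified_update(7, 55000, "Sales"))      # True: update applied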

42. Labeling recursive workflow executions on-the-fly.

Paper Link】 【Pages】:493-504

【Authors】: Zhuowei Bao ; Susan B. Davidson ; Tova Milo

【Abstract】: This paper presents a compact labeling scheme for answering reachability queries over workflow executions. In contrast to previous work, our scheme allows nodes (processes and data) in the execution graph to be labeled on-the-fly, i.e., in a dynamic fashion. In this way, reachability queries can be answered as soon as the relevant data is produced. We first show that, in general, for workflows that contain recursion, dynamic labeling of executions requires long (linear-size) labels. Fortunately, most real-life scientific workflows are linear recursive, and for this natural class we show that dynamic, yet compact (logarithmic-size) labeling is possible. Moreover, our scheme labels the executions in linear time, and answers any reachability query in constant time. We also show that linear recursive workflows are, in some sense, the largest class of workflows that allow compact, dynamic labeling schemes. Interestingly, the empirical evaluation, performed over both real and synthetic workflows, shows that our proposed dynamic scheme outperforms the state-of-the-art static scheme for large executions, and creates labels that are shorter by a factor of almost 3.

【Keywords】: dynamic labeling scheme; reachability query; recursive workflow
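
【Example】: For background, the sketch below shows a classic static interval labeling for trees: a DFS assigns each node an interval, and u reaches v iff u's interval contains v's. The paper's contribution is assigning comparably compact labels on-the-fly for (linear) recursive workflow executions, which this static toy does not attempt; the example workflow graph is invented.

def label(tree, root):
    labels, clock = {}, [0]
    def dfs(u):
        start = clock[0]; clock[0] += 1
        for child in tree.get(u, []):
            dfs(child)
        labels[u] = (start, clock[0])     # (preorder rank, end of subtree)
    dfs(root)
    return labels

def reaches(labels, u, v):
    return labels[u][0] <= labels[v][0] and labels[v][1] <= labels[u][1]

tree = {"run": ["align", "merge"], "align": ["a1", "a2"], "merge": []}
L = label(tree, "run")
print(reaches(L, "run", "a2"), reaches(L, "align", "merge"))  # True False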

43. Tracing data errors with view-conditioned causality.

Paper Link】 【Pages】:505-516

【Authors】: Alexandra Meliou ; Wolfgang Gatterbauer ; Suman Nath ; Dan Suciu

【Abstract】: A surprising query result is often an indication of errors in the query or the underlying data. Recent work suggests using causal reasoning to find explanations for the surprising result. In practice, however, one often has multiple queries and/or multiple answers, some of which may be considered correct and others unexpected. In this paper, we focus on determining the causes of a set of unexpected results, possibly conditioned on some prior knowledge of the correctness of another set of results. We call this problem View-Conditioned Causality. We adapt the definitions of causality and responsibility for the case of multiple answers/views and provide a non-trivial algorithm that reduces the problem of finding causes and their responsibility to a satisfiability problem that can be solved with existing tools. We evaluate both the accuracy and effectiveness of our approach on a real dataset of user-generated mobile device tracking data, and demonstrate that it can identify causes of error more effectively than static Boolean influence and alternative notions of causality.

【Keywords】: causality; error-tracing; view-conditioning

Information extraction 4

44. Hybrid in-database inference for declarative information extraction.

Paper Link】 【Pages】:517-528

【Authors】: Daisy Zhe Wang ; Michael J. Franklin ; Minos N. Garofalakis ; Joseph M. Hellerstein ; Michael L. Wick

【Abstract】: In the database community, work on information extraction (IE) has centered on two themes: how to effectively manage IE tasks, and how to manage the uncertainties that arise in the IE process in a scalable manner. Recent work has proposed a probabilistic database (PDB) based declarative IE system that supports a leading statistical IE model, and an associated inference algorithm to answer top-k-style queries over the probabilistic IE outcome. Still, the broader problem of effectively supporting general probabilistic inference inside a PDB-based declarative IE system remains open. In this paper, we explore the in-database implementations of a wide variety of inference algorithms suited to IE, including two Markov chain Monte Carlo algorithms, the Viterbi and the sum-product algorithms. We describe the rules for choosing appropriate inference algorithms based on the model, the query and the text, considering the trade-off between accuracy and runtime. Based on these rules, we describe a hybrid approach to optimize the execution of a single probabilistic IE query to employ different inference algorithms appropriate for different records. We show that our techniques can achieve up to 10-fold speedups compared to the non-hybrid solutions proposed in the literature.

【Keywords】: Markov chain monte carlo algorithms; conditional random fields; information extraction; probabilistic database; probabilistic graphical models; query optimization; viterbi
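
【Example】: A minimal Viterbi sketch over a toy two-state chain model; the (log-)probabilities are invented and stand in for the CRF scores an IE system would compute, and the in-database, set-oriented execution studied in the paper is not modeled.

import math

def viterbi(obs, states, start, trans, emit):
    # dp[s] = (best log-score of a labeling ending in state s, that labeling)
    dp = {s: (start[s] + emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        dp = {s: max(((dp[p][0] + trans[p][s] + emit[s][o], dp[p][1] + [s])
                      for p in states), key=lambda t: t[0])
              for s in states}
    return max(dp.values(), key=lambda t: t[0])

states = ("PER", "OTH")
start = {"PER": math.log(0.3), "OTH": math.log(0.7)}
trans = {"PER": {"PER": math.log(0.6), "OTH": math.log(0.4)},
         "OTH": {"PER": math.log(0.2), "OTH": math.log(0.8)}}
emit = {"PER": {"ann": math.log(0.7), "runs": math.log(0.3)},
        "OTH": {"ann": math.log(0.1), "runs": math.log(0.9)}}
score, labels = viterbi(["ann", "runs"], states, start, trans, emit)
print(labels)  # ['PER', 'OTH']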

45. Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction.

Paper Link】 【Pages】:529-540

【Authors】: Guoliang Li ; Dong Deng ; Jianhua Feng

【Abstract】: Dictionary-based entity extraction identifies predefined entities (e.g., person names or locations) in a document. A recent trend for improving extraction recall is to support approximate entity extraction, which finds all substrings in the document that approximately match entities in a given dictionary. Existing methods address this problem for either token-based similarity (e.g., Jaccard similarity) or character-based dissimilarity (e.g., edit distance). This calls for a unified method that supports various similarity/dissimilarity functions, since a unified method reduces programming effort, hardware requirements, and manpower. In addition, many substrings in the document overlap, which creates an opportunity to share computation across the overlaps and avoid redundant work. In this paper, we propose a unified framework that supports many similarity/dissimilarity functions, such as Jaccard similarity, cosine similarity, Dice similarity, edit similarity, and edit distance. We devise efficient filtering algorithms that exploit the shared computation and develop effective pruning techniques to improve performance. The experimental results show that our method achieves high performance and outperforms state-of-the-art approaches.

【Keywords】: approximate entity extraction; filtering algorithms

46. Joint unsupervised structure discovery and information extraction.

Paper Link】 【Pages】:541-552

【Authors】: Eli Cortez ; Daniel Oliveira ; Altigran Soares da Silva ; Edleno Silva de Moura ; Alberto H. F. Laender

【Abstract】: In this paper we present JUDIE (Joint Unsupervised Structure Discovery and Information Extraction), a new method for automatically extracting semi-structured data records that appear as continuous text (e.g., bibliographic citations, postal addresses, classified ads, etc.) with no explicit delimiters between them. While in state-of-the-art information extraction methods the structure of the data records is manually supplied by the user as a training step, JUDIE is capable of detecting the structure of each individual record being extracted without any user assistance. This is accomplished by a novel structure discovery algorithm that, given a sequence of labels representing attributes assigned to potential values, groups these labels into individual records by looking for frequent patterns of label repetition in the given sequence. We also show how to integrate this algorithm into the information extraction process by means of successive refinement steps that alternate information extraction and structure discovery. Through an extensive experimental evaluation with different datasets in distinct domains, we compare JUDIE with state-of-the-art information extraction methods and conclude that, even without any user intervention, it achieves high-quality results on the tasks of discovering the structure of the records and extracting information from them.

【Keywords】: data management; information extraction; text segmentation

47. Attribute domain discovery for hidden web databases.

Paper Link】 【Pages】:553-564

【Authors】: Xin Jin ; Nan Zhang ; Gautam Das

【Abstract】: Many web databases are hidden behind restrictive form-like interfaces which may or may not provide domain information for an attribute. When attribute domains are not available, domain discovery becomes a critical challenge facing the application of a broad range of existing techniques on third-party analytical and mash-up applications over hidden databases. In this paper, we consider the problem of domain discovery over a hidden database through its web interface. We prove that for any database schema, an achievability guarantee on domain discovery can be made based solely upon the interface design. We also develop novel techniques which provide effective guarantees on the comprehensiveness of domain discovery. We present theoretical analysis and extensive experiments to illustrate the effectiveness of our approach.

【Keywords】: domain discovery; hidden web database

Paper Link】 【Pages】:565-576

【Authors】: Sonia Bergamaschi ; Elton Domnori ; Francesco Guerra ; Raquel Trillo Lado ; Yannis Velegrakis

【Abstract】: Keyword queries offer a convenient alternative to traditional SQL for querying relational databases with large, often unknown, schemas and instances. The challenge in answering such queries is to discover their intended semantics, construct the SQL queries that describe them, and use those queries to retrieve the respective tuples. Existing approaches typically rely on indices built a priori on the database content. This seriously limits their applicability when a priori access to the database content is not possible, for example for online databases accessed through web interfaces, or for sources in information integration systems that operate behind wrappers with specific query capabilities. Furthermore, existing literature has not studied to its full extent the inter-dependencies across the ways the different keywords are mapped into database values and schema elements. In this work, we describe a novel technique for translating keyword queries into SQL based on the Munkres (a.k.a. Hungarian) algorithm. Our approach not only tackles the above two limitations, but also offers significant improvements in identifying the semantically meaningful SQL queries that describe the intended keyword query semantics. We provide details of the implementation and an extensive experimental evaluation.

【Keywords】: intensional knowledge; metadata; relational databases; semantic keyword search
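
【Example】: A minimal sketch of the assignment step: scipy's implementation of the Hungarian (Munkres) algorithm maps each keyword to a distinct schema term so that the total similarity is maximized. The keywords, schema terms, and similarity scores are invented; the paper's actual weighting of metadata and values is not reproduced here.

import numpy as np
from scipy.optimize import linear_sum_assignment

keywords = ["director", "title", "2010"]
schema_terms = ["movie.title", "movie.year", "person.name", "person.role"]
# similarity[i][j]: how well keyword i matches schema term j (higher is better)
similarity = np.array([[0.1, 0.0, 0.4, 0.8],
                       [0.9, 0.1, 0.2, 0.1],
                       [0.2, 0.7, 0.0, 0.0]])
rows, cols = linear_sum_assignment(-similarity)   # negate to maximize
for i, j in zip(rows, cols):
    print(keywords[i], "->", schema_terms[j])
# director -> person.role, title -> movie.title, 2010 -> movie.year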

Paper Link】 【Pages】:577-588

【Authors】: Marie Jacob ; Zachary G. Ives

【Abstract】: Keyword search is an important means of allowing non-expert end-users to pose ad hoc queries, whether over single databases or data integration systems. Given a set of keywords, the query processor finds matches across different tuples and tables. It computes and executes a set of relational sub-queries whose results are combined to produce the k highest-ranking answers. Work on keyword search primarily focuses on single-database, single-query settings: each query is answered in isolation, despite possible overlap between queries posed by different users or at different times; and the number of relevant tables is assumed to be small, meaning that sub-queries can be processed without using cost-based methods to combine work. As we apply keyword search to support ad hoc data integration queries over scientific or other databases on the Web, we must reuse and combine computation. In this paper, we propose an architecture that continuously receives sets of ranked keyword queries and seeks to reuse work across these queries. We extend multiple query optimization and continuous query techniques, and develop a new query plan scheduling module we call the ATC (based on its analogy to an air traffic controller). The ATC manages the flow of tuples among a multitude of pipelined operators, minimizing the work needed to return the top-k answers for all queries. We also develop techniques to manage the sharing and reuse of state as queries complete and input data streams are exhausted. We show the effectiveness of our techniques in handling queries over real and synthetic data sets.

【Keywords】: keyword search; multiple query optimization; top-k queries

Paper Link】 【Pages】:589-600

【Authors】: Yufei Tao ; Stavros Papadopoulos ; Cheng Sheng ; Kostas Stefanidis

【Abstract】: This paper studies the nearest keyword (NK) problem on XML documents. In general, the dataset is a tree where each node is associated with one or more keywords. Given a node q and a keyword w, an NK query returns the node that is nearest to q among all the nodes associated with w. NK search is not only useful as a stand-alone operator but also as a building block for important tasks such as XPath query evaluation and keyword search. We present an indexing scheme that answers NK queries efficiently, in terms of both practical and worst-case performance. The query cost is provably logarithmic in the number of nodes carrying the query keyword. The proposed scheme occupies space linear in the dataset size, and can be constructed by a fast algorithm. Extensive experimentation confirms our theoretical findings, and demonstrates the effectiveness of NK retrieval as a primitive operator in XML databases.

【Keywords】: group steiner tree; keyword search; nearest keyword; xpath

51. Efficient and generic evaluation of ranked queries.

Paper Link】 【Pages】:601-612

【Authors】: Wen Jin ; Jignesh M. Patel

【Abstract】: An important feature of existing methods for ranked top-k processing is that they avoid searching all the objects in the underlying dataset and limit the number of random accesses to the data. However, the performance of these methods degrades rapidly as the number of random accesses increases. In this paper, we propose a novel and general sequential access scheme for top-k query evaluation that outperforms existing methods. We extend this scheme to efficiently answer top-k queries in subspaces and on dynamic data. We also study the "dual" form of top-k queries, called "ranking" queries, which return the rank of a specified record/object, and propose an exact solution as well as two approximate ones. An extensive empirical evaluation validates the robustness and efficiency of our techniques.

【Keywords】: ranked queries

Stream and complex event processing 4

52. Changing flights in mid-air: a model for safely modifying continuous queries.

Paper Link】 【Pages】:613-624

【Authors】: Kyumars Sheykh Esmaili ; Tahmineh Sanamrad ; Peter M. Fischer ; Nesime Tatbul

【Abstract】: Continuous queries can run for unpredictably long periods of time. During their lifetime, these queries may need to be adapted either due to changes in application semantics (e.g., the implementation of a new alert detection policy), or due to changes in the system's behavior (e.g., adapting performance to a changing load). While in previous works query modification has been implicitly utilized to serve specific purposes (e.g., load management), to date no research has been done that defines a general-purpose, reliable, and efficiently implementable model for modifying continuous queries at run-time. In this paper, we introduce a punctuation-based framework that can formally express arbitrary lifecycle operations on the basis of input-output mappings and basic control elements such as start or stop of queries. On top of this foundation, we derive all possible query change methods, each providing different levels of correctness guarantees and performance. We further show how these models can be efficiently realized in a state-of-the-art stream processing engine; we also provide experimental results demonstrating the key performance tradeoffs of the change methods.

【Keywords】: continuous query; query lifecycle; query modification

53. How soccer players would do stream joins.

Paper Link】 【Pages】:625-636

【Authors】: Jens Teubner ; René Müller

【Abstract】: In spite of the omnipresence of parallel (multi-core) systems, the predominant strategy to evaluate window-based stream joins is still strictly sequential, mostly just straightforward along the definition of the operation semantics. In this work we present handshake join, a way of describing and executing window-based stream joins that is highly amenable to parallelized execution. Handshake join naturally leverages available hardware parallelism, which we demonstrate with an implementation on a modern multi-core system and on top of field-programmable gate arrays (FPGAs), an emerging technology that has shown distinctive advantages for high-throughput data processing. On the practical side, we provide a join implementation that substantially outperforms CellJoin (the fastest published result) and that will directly turn any degree of parallelism into higher throughput or larger supported window sizes. On the semantic side, our work gives a new intuition of window semantics, which we believe could inspire other stream processing algorithms or ongoing standardization efforts for stream query languages.

【Keywords】: data flow; parallelism; stream joins
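
【Example】: A sequential sketch of the window-join semantics that handshake join parallelizes: each newly arriving tuple is compared against the other stream's current window. The "soccer player" data flow, in which tuples of the two streams stream past each other across many cores, is deliberately not modeled here.

from collections import deque

def window_join(stream_r, stream_s, window=3, match=lambda r, s: r == s):
    win_r, win_s, out = deque(maxlen=window), deque(maxlen=window), []
    for r, s in zip(stream_r, stream_s):          # tuples arrive in lockstep
        out += [(r, x) for x in win_s if match(r, x)]
        out += [(x, s) for x in win_r if match(x, s)]
        if match(r, s):
            out.append((r, s))
        win_r.append(r)
        win_s.append(s)
    return out

print(window_join([1, 2, 3, 4], [9, 1, 2, 9]))  # [(1, 1), (2, 2)]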

54. BE-tree: an index structure to efficiently match boolean expressions over high-dimensional discrete space.

Paper Link】 【Pages】:637-648

【Authors】: Mohammad Sadoghi ; Hans-Arno Jacobsen

【Abstract】: BE-Tree is a novel dynamic tree data structure designed to efficiently index Boolean expressions over a high-dimensional discrete space. BE-Tree copes with both high-dimensionality and expressiveness of Boolean expressions by introducing a novel two-phase space-cutting technique that specifically utilizes the discrete and finite domain properties of the space. Furthermore, BE-Tree employs self-adjustment policies to dynamically adapt the tree as the workload changes. We conduct a comprehensive evaluation to demonstrate the superiority of BE-Tree in comparison with state-of-the-art index structures designed for matching Boolean expressions.

【Keywords】: algorithm; boolean expressions; complex event processing; data structure; publish/subscribe

Paper Link】 【Pages】:649-660

【Authors】: Chun Chen ; Feng Li ; Beng Chin Ooi ; Sai Wu

【Abstract】: Real-time search dictates that new content be made available for search immediately following its creation. From the database perspective, this requirement may be quite easily met by creating an up-to-date index for the content and measuring search quality by the time gap between insertion and index availability. This approach, however, poses new challenges for micro-blogging systems, where thousands of concurrent users may upload their micro-blogs or tweets simultaneously. Due to the high update and query loads, conventional approaches would either fail to index the huge amount of newly created content in real time or fall short of providing a scalable indexing service. In this paper, we propose the TI (Tweet Index), an adaptive indexing scheme for micro-blogging systems such as Twitter. The intuition behind the TI is to index the tweets that are likely to appear in a search result and to delay indexing the other tweets. This strategy significantly reduces the indexing cost without compromising the quality of the search results. In the TI, we also devise a new ranking scheme that combines the relationships between users and tweets. We group tweets into topics and update the ranking of a topic dynamically. Experiments on a real Twitter dataset confirm the efficiency of the TI.

【Keywords】: index; ranking; real-time search

Query processing 4

56. More efficient datalog queries: subsumptive tabling beats magic sets.

Paper Link】 【Pages】:661-672

【Authors】: K. Tuncay Tekle ; Yanhong A. Liu

【Abstract】: Given a set of Datalog rules, facts, and a query, answers to the query can be inferred bottom-up starting with the facts or top-down starting with the query. The dominant strategies for improving the performance of query answering are reusing answers to subqueries in top-down methods, and transforming rules based on demand from the query, such as the well-known magic sets transformation, in bottom-up methods. However, the performance of these strategies varies drastically, and the most effective method has remained unknown. This paper gives a precise time and space complexity analysis for efficient implementation of Datalog queries using subsumptive tabling, a top-down evaluation method with more reuse of answers than the dominant tabling strategy, and shows that subsumptive tabling beats bottom-up evaluation of rules after the magic sets transformation in both time and space complexity. It also describes subsumptive demand transformation, a novel method for transforming the rules so that bottom-up evaluation of the transformed rules mimics subsumptive tabling; we show that the time complexity of bottom-up evaluation after this transformation equals the time complexity of top-down evaluation with subsumptive tabling. The paper further describes subsumption optimization, an optimization that increases the use of subsumption in subsumptive methods, and shows its application in the derivation of a well-known demand-driven pointer analysis algorithm. We support our analyses and comparisons through experiments with applications in ontology queries and program analysis.

【Keywords】: bottom-up evaluation; complexity analysis; datalog; tabling
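
【Example】: A toy illustration of tabling: answers to subgoals are memoized so that repeated subgoals of an ancestor query over parent facts are never re-derived. The data is assumed acyclic; real tabling engines handle cycles via fixpoint computation, and subsumptive tabling additionally reuses the tables of more general subgoals, which this sketch does not show.

from functools import lru_cache

parent = {("ann", "bob"), ("bob", "cal"), ("bob", "dan"), ("ann", "eve")}

@lru_cache(maxsize=None)          # the "table": one memoized entry per subgoal
def anc(x):
    kids = frozenset(c for (p, c) in parent if p == x)
    return kids | frozenset(d for c in kids for d in anc(c))

print(sorted(anc("ann")))  # ['bob', 'cal', 'dan', 'eve']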

57. Entangled queries: enabling declarative data-driven coordination.

Paper Link】 【Pages】:673-684

【Authors】: Nitin Gupta ; Lucja Kot ; Sudip Roy ; Gabriel Bender ; Johannes Gehrke ; Christoph Koch

【Abstract】: Many data-driven social and Web applications involve collaboration and coordination. The vision of declarative data-driven coordination (D3C), proposed in [9], is to support coordination in the spirit of data management: to make it data-centric and to specify it using convenient declarative languages. This paper introduces entangled queries, a language that extends SQL with constraints that allow for the coordinated choice of result tuples across queries originating from different users or applications. It is nontrivial to define a declarative coordination formalism without arriving at the general (NP-complete) constraint satisfaction problem from AI. In this paper, we propose an efficiently enforceable syntactic safety condition that we argue is at the sweet spot where interesting declarative power meets applicability in large-scale data management systems and applications. The key computational problem of D3C is to match entangled queries to achieve coordination. We present an efficient matching algorithm that statically analyzes query workloads and merges coordinating entangled queries into compound SQL queries. These can be sent to a standard database system and return only coordinated results. We present the overall architecture of an implemented system that contains our evaluation algorithm, and we evaluate the performance of the matching algorithm experimentally on realistic coordination workloads.

【Keywords】: D3C; coordination; entangled queries

58. Data generation using declarative constraints.

Paper Link】 【Pages】:685-696

【Authors】: Arvind Arasu ; Raghav Kaushik ; Jian Li

【Abstract】: We study the problem of generating synthetic databases with declaratively specified characteristics. This problem is motivated by database system and application testing, data masking, and benchmarking. While the data generation problem has been studied before, prior approaches are either non-declarative or have fundamental limitations in the data characteristics that they can capture and efficiently support. We argue that a natural, expressive, and declarative mechanism for specifying data characteristics is cardinality constraints; a cardinality constraint specifies that the output of a query over the generated database must have a certain cardinality. While the data generation problem is intractable in general, we present efficient algorithms that handle a large and useful class of constraints. We include a thorough empirical evaluation illustrating that our algorithms handle complex constraints, scale well as the number of constraints increases, and outperform applicable prior techniques.

【Keywords】: benchmarking; constraints; data generation; masking; testing

59. Efficient auditing for complex SQL queries.

Paper Link】 【Pages】:697-708

【Authors】: Raghav Kaushik ; Ravishankar Ramamurthy

【Abstract】: We address the problem of data auditing, which asks for an audit trail of all users and queries that potentially breached information about sensitive data. Much of the previous work in data auditing has focused on providing strong privacy guarantees and studied the class of queries that can be audited efficiently while retaining those guarantees. In this paper, we approach data auditing from a different perspective. Our goal is to design an auditing system for arbitrary SQL queries containing constructs such as grouping, aggregation, and correlated subqueries. Pivoted on the ability to feasibly address arbitrary queries, we study (1) what privacy guarantees we can expect, and (2) how we can efficiently perform auditing.

【Keywords】: access control; auditing; privacy; query processing; security

Data mining 4

60. Exact indexing for support vector machines.

Paper Link】 【Pages】:709-720

【Authors】: Hwanjo Yu ; Ilhwan Ko ; Youngdae Kim ; Seung-won Hwang ; Wook-Shin Han

【Abstract】: SVM (Support Vector Machine) is a well-established machine learning methodology popularly used for classification, regression, and ranking. Recently SVM has been actively researched for rank learning and applied to various applications including search engines and relevance feedback systems. A query in such systems is the ranking function F learned by SVM. Once a function F is learned, i.e., the query is formulated, processing the query to find the top-k results requires evaluating the entire database by F. So far, there exists no exact indexing solution for SVM functions. Existing top-k query processing algorithms are not applicable to machine-learned ranking functions, as they often make restrictive assumptions about the query, such as linearity or monotonicity of functions. Existing metric-based or reference-based indexing methods are also not applicable, because data points are invisible in the kernel space (SVM feature space) on which the index must be built. Existing kernel indexing methods return approximate results or fix kernel parameters. This paper proposes an exact indexing solution for SVM functions with varying kernel parameters. We first propose key geometric properties of the kernel space -- ranking instability and ordering stability -- which are crucial for building indices in the kernel space. Based on them, we develop an index structure, iKernel, and processing algorithms. We then present clustering techniques in the kernel space to enhance the pruning effectiveness of the index. According to our experiments, iKernel is highly effective overall, producing evaluation ratios of 1-5% on large data sets. To the best of our knowledge, iKernel is the first indexing solution that finds exact top-k results of SVM functions without a full scan of the data set.

【Keywords】: indexing; support vector machines

61. Local graph sparsification for scalable clustering.

Paper Link】 【Pages】:721-732

【Authors】: Venu Satuluri ; Srinivasan Parthasarathy ; Yiye Ruan

【Abstract】: In this paper we look at how to sparsify a graph, i.e., how to reduce the edge set while keeping the nodes intact, so as to enable faster graph clustering without sacrificing quality. The main idea behind our approach is to preferentially retain the edges that are likely to be part of the same cluster. We propose to rank edges using a simple similarity-based heuristic that we compute efficiently by comparing the minhash signatures of the nodes incident to each edge. For each node, we select the top few edges to be retained in the sparsified graph. Extensive empirical results on several real networks, using four state-of-the-art graph clustering and community discovery algorithms, reveal that our proposed approach realizes excellent speedups (often in the range of 10-50x), with little or no deterioration in the quality of the resulting clusters. In fact, for at least two of the four clustering algorithms, our sparsification consistently enables higher clustering accuracies.

【Keywords】: graph clustering; graph sparsification; minwise hashing
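
【Example】: A minimal sketch of similarity-based sparsification that uses exact Jaccard similarity of neighbor sets in place of the minhash estimate the paper uses for speed: each node ranks its edges and only the top few are kept. The toy adjacency lists are invented.

adj = {"a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
       "d": {"a", "b", "c", "e"}, "e": {"d", "f"}, "f": {"e"}}

def jaccard(u, v):
    return len(adj[u] & adj[v]) / len(adj[u] | adj[v])

def sparsify(adj, keep=2):
    kept = set()
    for u in adj:
        ranked = sorted(adj[u], key=lambda v: jaccard(u, v), reverse=True)
        kept |= {frozenset((u, v)) for v in ranked[:keep]}   # top edges per node
    return kept

for edge in sorted(map(sorted, sparsify(adj))):
    print(edge)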

62. Advancing data clustering via projective clustering ensembles.

Paper Link】 【Pages】:733-744

【Authors】: Francesco Gullo ; Carlotta Domeniconi ; Andrea Tagarelli

【Abstract】: Projective Clustering Ensembles (PCE) are a very recent advance in data clustering research that combines the two powerful tools of clustering ensembles and projective clustering. Specifically, PCE enables clustering ensemble methods to handle ensembles composed of projective clustering solutions. PCE has been formalized as an optimization problem with either a two-objective or a single-objective function. Two-objective PCE has been shown to generally produce more accurate clustering results than its single-objective counterpart, although it can handle the object-based and feature-based cluster representations only independently of one another. Moreover, neither of the early formulations of PCE follows any of the standard approaches to clustering ensembles, namely instance-based, cluster-based, and hybrid. In this paper, we propose an alternative formulation of the PCE problem which overcomes the above issues. We investigate the drawbacks of the early formulations of PCE and define a new single-objective formulation of the problem. This formulation is capable of treating the object- and feature-based cluster representations as a whole, essentially tying them together in a distance computation between a projective clustering solution and a given ensemble. We propose two cluster-based algorithms for computing approximations to the proposed PCE formulation, which have the common merit of conforming to one of the standard approaches of clustering ensembles. Experiments on benchmark datasets show the significance of our PCE formulation, as both of the proposed heuristics outperform existing PCE methods.

【Keywords】: clustering; clustering ensembles; data mining; dimensionality reduction; optimization; projective clustering; subspace clustering

63. Sampling based algorithms for quantile computation in sensor networks.

Paper Link】 【Pages】:745-756

【Authors】: Zengfeng Huang ; Lu Wang ; Ke Yi ; Yunhao Liu

【Abstract】: We study the problem of computing approximate quantiles in large-scale sensor networks communication-efficiently, a problem previously studied by Greenwald and Khanna [12] and Shrivastava et al. [21]. Their algorithms have total communication costs of O(k log^2 n / ε) and O(k log u / ε), respectively, where k is the number of nodes in the network, n is the total size of the data sets held by all the nodes, u is the universe size, and ε is the required approximation error. In this paper, we present a sampling-based quantile computation algorithm with O(√(kh)/ε) total communication (h is the height of the routing tree), which grows sublinearly with the network size except in the pathological case h = Θ(k). In our experiments on both synthetic and real data sets, this improvement translates into a 10- to 100-fold communication reduction for achieving the same accuracy in the computed quantiles. Meanwhile, the maximum individual node communication of our algorithm is no higher than that of the previous two algorithms.

【Keywords】: quantiles; sensor networks
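
【Example】: The sampling idea can be sketched in a few lines: a φ-quantile read off a random sample of size about 1/ε² is, with good probability, within εn of the true quantile. The in-network aggregation along the routing tree, which is where the paper's communication savings come from, is not modeled.

import random

def approx_quantile(data, phi, eps, seed=0):
    random.seed(seed)
    size = min(len(data), int(1 / eps ** 2))
    sample = sorted(random.sample(data, size))
    return sample[min(int(phi * size), size - 1)]

data = list(range(100000))
print(approx_quantile(data, 0.5, 0.05))   # close to the true median, 50000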

Information retrieval 4

64. Context-sensitive ranking for document retrieval.

Paper Link】 【Pages】:757-768

【Authors】: Liang Jeff Chen ; Yannis Papakonstantinou

【Abstract】: We study the problem of context-sensitive ranking for document retrieval, where a context is defined as a sub-collection of documents, and is specified by queries provided by domain-interested users. The motivation of context-sensitive search is that the ranking of the same keyword query generally depends on the context. The reason is that the underlying keyword statistics differ significantly from one context to another. The query evaluation challenge is the computation of keyword statistics at runtime, which involves expensive online aggregations. We appropriately leverage and extend materialized view research in order to deliver algorithms and data structures that evaluate context-sensitive queries efficiently. Specifically, a number of views are selected and materialized, each corresponding to one or more large contexts. Materialized views are used at query time to compute statistics which are used to compute ranking scores. Experimental results show that the context-sensitive ranking generally improves the ranking quality, while our materialized view-based technique improves the query efficiency.

【Keywords】: context-sensitive ranking; materialized views; view selection

Paper Link】 【Pages】:769-780

【Authors】: Nathan Bales ; Alin Deutsch ; Vasilis Vassalos

【Abstract】: We address two open problems involving algebraic execution of full-text search queries. First, we show how to correctly apply traditional database rewrite optimizations to full-text algebra plans with integrated scoring, and explain why existing techniques fail. Second, we show how our techniques are applied in a generic scoring framework that supports a wide class of scoring algorithms, including algorithms seen in the literature and user-defined scoring.

【Keywords】: GRAFT; full-text search; optimization; score consistency; text algebra; text calculus

66. Efficient diversity-aware search.

Paper Link】 【Pages】:781-792

【Authors】: Albert Angel ; Nick Koudas

【Abstract】: Typical approaches of ranking information in response to a user's query that return the most relevant results ignore important factors contributing to user satisfaction; for instance, the contents of a result document may be redundant given the results already examined. Motivated by emerging applications, in this work we study the problem of Diversity-Aware Search, the essence of which is ranking search results based on both their relevance, as well as their dissimilarity to other results reported. Diversity-Aware Search is generally a hard problem, and even tractable instances thereof cannot be efficiently solved by adapting existing approaches. We propose DIVGEN, an efficient algorithm for diversity-aware search, which achieves significant performance improvements via novel data access primitives. Although selecting the optimal schedule of data accesses is a hard problem, we devise the first low-overhead data access prioritization scheme with theoretical quality guarantees, and good performance in practice. A comprehensive evaluation on real and synthetic large-scale corpora demonstrates the efficiency and effectiveness of our approach.

【Keywords】: data access scheduling; diversification; search
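
【Example】: A hedged sketch of greedy relevance-plus-diversity selection (an MMR-style heuristic, not the paper's DIVGEN algorithm or its data access scheduling): repeatedly pick the candidate with the best trade-off between its own relevance and its maximum similarity to results already chosen. Documents, relevance scores, and the similarity function are invented.

def diversify(candidates, relevance, sim, k, lam=0.5):
    chosen = []
    while candidates and len(chosen) < k:
        def score(c):
            redundancy = max((sim(c, d) for d in chosen), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        chosen.append(best)
        candidates.remove(best)
    return chosen

docs = {"d1": {"apple", "pie"}, "d2": {"apple", "tart"}, "d3": {"car", "review"}}
rel = {"d1": 0.9, "d2": 0.85, "d3": 0.6}
jac = lambda a, b: len(docs[a] & docs[b]) / len(docs[a] | docs[b])
print(diversify(list(docs), rel, jac, k=2))  # ['d1', 'd3']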

Paper Link】 【Pages】:793-804

【Authors】: Mingyang Zhang ; Nan Zhang ; Gautam Das

【Abstract】: Search engines over document corpora typically provide keyword-search interfaces. Examples include search engines over the web as well as those over enterprise and government websites. The corpus of such a search engine forms a rich source of information of analytical interest to third parties, but the only available access is by issuing search queries through its interface. To support data analytics over a search engine's corpus, one needs to address two main problems, the sampling of documents (for offline analytics) and the direct (online) estimation of aggregates, while issuing only a small number of queries through the keyword-search interface. Existing work on sampling produces samples with unknown bias and may incur an extremely high query cost. Existing aggregate estimation techniques suffer from a similar problem, as the estimation error and query cost can both be large for certain aggregates. We propose novel techniques that produce unbiased samples as well as unbiased aggregate estimates with small variances, while incurring a query cost an order of magnitude smaller than existing techniques. We present theoretical analysis and extensive experiments to illustrate the effectiveness of our approach.

【Keywords】: aggregate query processing; sampling; search engine

Probabilistic and uncertain databases 4

68. Ranking with uncertain scoring functions: semantics and sensitivity measures.

Paper Link】 【Pages】:805-816

【Authors】: Mohamed A. Soliman ; Ihab F. Ilyas ; Davide Martinenghi ; Marco Tagliasacchi

【Abstract】: Ranking queries report the top-K results according to a user-defined scoring function. A widely used scoring function is the weighted summation of multiple scores. Users often cannot precisely specify the weights in such functions so as to produce the preferred order of results. Adopting uncertain/incomplete scoring functions (e.g., using weight ranges and partially specified weight preferences) can better capture a user's preferences in this scenario. In this paper, we study two aspects of uncertain scoring functions. The first aspect is the semantics of ranking queries, and the second is the sensitivity of computed results to refinements made by the user. We formalize and solve multiple problems under both aspects, and present novel techniques that compute query results efficiently to comply with the interactive nature of these problems.

【Keywords】: aggregation; ranking; scoring; top-k; uncertainty

69. Querying uncertain data with aggregate constraints.

Paper Link】 【Pages】:817-828

【Authors】: Mohan Yang ; Haixun Wang ; Haiquan Chen ; Wei-Shinn Ku

【Abstract】: Data uncertainty arises in many situations. A common approach to processing queries over uncertain data is to sample many "possible worlds" from the uncertain data and to run the queries against these possible worlds. However, sampling is not a trivial task, as a randomly sampled possible world may not satisfy known constraints imposed on the data. In this paper, we focus on an important category of constraints, aggregate constraints. An aggregate constraint is placed on a set of records instead of on a single record, and a real-life system usually has a large number of aggregate constraints. Finding qualified possible worlds in this scenario is challenging, since tuple-by-tuple sampling is extremely inefficient: it rarely leads to a qualified possible world. In this paper, we introduce two approaches for querying uncertain data with aggregate constraints: constraint-aware sampling and MCMC sampling. Our experiments show that the new approaches lead to high-quality query results at reasonable cost.

【Keywords】: MCMC; aggregate constraint; uncertain data
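
【Example】: A minimal sketch of constraint-aware sampling by simple rejection: possible worlds are drawn from independent tuple probabilities and kept only if they satisfy an aggregate constraint (here, the selected tuples' sales must sum to 30). The tuples and constraint are invented; the paper's constraint-aware and MCMC samplers are designed precisely to avoid the wasted draws this naive approach incurs.

import random

random.seed(1)
tuples = [("t1", 10, 0.6), ("t2", 20, 0.5), ("t3", 30, 0.4)]  # (id, sales, prob)

def sample_world():
    return [(tid, sales) for tid, sales, p in tuples if random.random() < p]

def constrained_worlds(n, target=30):
    worlds = []
    while len(worlds) < n:                 # rejection loop
        world = sample_world()
        if sum(sales for _, sales in world) == target:
            worlds.append(world)
    return worlds

print(constrained_worlds(3))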

70. Jigsaw: efficient optimization over uncertain enterprise data.

Paper Link】 【Pages】:829-840

【Authors】: Oliver Kennedy ; Suman Nath

【Abstract】: Probabilistic databases, in particular ones that allow users to externally define models or probability distributions -- so-called VG-functions -- are an ideal tool for constructing, simulating, and analyzing hypothetical business scenarios. Enterprises often use such tools with parameterized models and need to explore a large parameter space in order to discover parameter values that optimize a given goal. The parameter space is usually very large, making such exploration extremely expensive. We present Jigsaw, a probabilistic-database-based simulation framework that addresses this performance problem. In Jigsaw, users define what-if style scenarios as parameterized probabilistic database queries and identify parameter values that achieve desired properties. Jigsaw uses a novel "fingerprinting" technique that efficiently identifies correlations between a query's output distribution for different parameter values. Using fingerprints, Jigsaw is able to reuse work performed for different parameter values, and obtains speedups of as much as two orders of magnitude for several real business scenarios.

【Keywords】: Monte Carlo; black box; probabilistic database; simulation

71. Sensitivity analysis and explanations for robust query evaluation in probabilistic databases.

Paper Link】 【Pages】:841-852

【Authors】: Bhargav Kanagal ; Jian Li ; Amol Deshpande

【Abstract】: Probabilistic database systems have successfully established themselves as a tool for managing uncertain data. However, much of the research in this area has focused on efficient query evaluation and has largely ignored two key issues that commonly arise in uncertain data management. First, how to provide explanations for query results, e.g., "Why is this tuple in my result?" or "Why does this output tuple have such a high probability?". Second, how to determine the sensitive input tuples for a given query; e.g., users want to know the input tuples that can substantially alter the output when their probabilities are modified (since they may be unsure about the input probability values). Existing systems provide the lineage/provenance of each output tuple in addition to the output probabilities; the lineage is a Boolean formula indicating the dependence of the output tuple on the input tuples. However, lineage does not immediately provide a quantitative relationship, and it is not informative when there are multiple output tuples. In this paper, we propose a unified framework that handles both of the issues mentioned above to facilitate robust query processing. We formally define the notions of influence and explanation and provide algorithms to determine the top-l influential set of variables and the top-l set of explanations for a variety of queries, including conjunctive queries, probabilistic threshold queries, top-k queries, and aggregation queries. Further, our framework naturally enables highly efficient incremental evaluation when input probabilities are modified (e.g., when uncertainty is resolved). Our preliminary experimental results demonstrate the benefits of our framework for robust query processing over probabilistic databases.

【Keywords】: boolean algebra; lineage; probabilistic databases; query processing; sensitivity analysis

OLAP and its applications 4

72. Graph cube: on warehousing and OLAP multidimensional networks.

Paper Link】 【Pages】:853-864

【Authors】: Peixiang Zhao ; Xiaolei Li ; Dong Xin ; Jiawei Han

【Abstract】: We consider extending decision support facilities toward large sophisticated networks, upon which multidimensional attributes are associated with network entities, thereby forming the so-called multidimensional networks. Data warehouses and OLAP (Online Analytical Processing) technology have proven to be effective tools for decision support on relational data. However, they are not well-equipped to handle the new yet important multidimensional networks. In this paper, we introduce Graph Cube, a new data warehousing model that supports OLAP queries effectively on large multidimensional networks. By taking account of both attribute aggregation and structure summarization of the networks, Graph Cube goes beyond the traditional data cube model involved solely with numeric value based group-by's, thus resulting in a more insightful and structure-enriched aggregate network within every possible multidimensional space. Besides traditional cuboid queries, a new class of OLAP queries, crossboid, is introduced that is uniquely useful in multidimensional networks and has not been studied before. We implement Graph Cube by combining special characteristics of multidimensional networks with the existing well-studied data cube techniques. We perform extensive experimental studies on a series of real world data sets and Graph Cube is shown to be a powerful and efficient tool for decision support on large multidimensional networks.

【Keywords】: OLAP; data cube; data warehouse; graph cube; multidimensional network
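
【Example】: A minimal sketch of the structure summarization behind one cuboid: nodes are grouped by a single attribute (here "city") and the aggregate network records the size of each group and the number of edges between groups. The attribute values and edges are invented.

from collections import Counter

nodes = {"u1": "NYC", "u2": "NYC", "u3": "LA", "u4": "LA", "u5": "SF"}
edges = [("u1", "u2"), ("u1", "u3"), ("u2", "u4"), ("u3", "u4"), ("u4", "u5")]

group_size = Counter(nodes.values())
group_edges = Counter(tuple(sorted((nodes[u], nodes[v]))) for u, v in edges)
print(dict(group_size))   # {'NYC': 2, 'LA': 2, 'SF': 1}
print(dict(group_edges))  # {('NYC', 'NYC'): 1, ('LA', 'NYC'): 2, ('LA', 'LA'): 1, ('LA', 'SF'): 1}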

73. MaSM: efficient online updates in data warehouses.

Paper Link】 【Pages】:865-876

【Authors】: Manos Athanassoulis ; Shimin Chen ; Anastasia Ailamaki ; Phillip B. Gibbons ; Radu Stoica

【Abstract】: Data warehouses have been traditionally optimized for read-only query performance, allowing only offline updates at night, essentially trading off data freshness for performance. The need for 24x7 operations in global markets and the rise of online and other quickly-reacting businesses make concurrent online updates increasingly desirable. Unfortunately, state-of-the-art approaches fall short of supporting fast analysis queries over fresh data. The conventional approach of performing updates in place can dramatically slow down query performance, while prior proposals using differential updates either require large in-memory buffers or may incur significant update migration cost. This paper presents a novel approach for supporting online updates in data warehouses that overcomes the limitations of prior approaches, by making judicious use of available SSDs to cache incoming updates. We model the problem of query processing with differential updates as a type of outer join between the data residing on disks and the updates residing on SSDs. We present MaSM algorithms for performing such joins and periodic migrations, with small memory footprints, low query overhead, low SSD writes, efficient in-place migration of updates, and correct ACID support. Our experiments show that MaSM incurs only up to 7% overhead both on synthetic range scans (varying range size from 100GB to 4KB) and in a TPC-H query replay study, while also increasing the update throughput by orders of magnitude.

【Keywords】: SSDs; data warehouses; materialized sort merge; online updates
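
【Example】: A minimal sketch of answering a range scan with differential updates: base rows (on disk) are merged at query time with a sorted run of cached updates (on SSD), so the base data is never modified in place. Updates are simplified to at most one upsert or delete per key; the paper's memory-efficient materialized sort-merge machinery and ACID handling are not modeled.

def merge_scan(base_rows, updates):
    # base_rows: sorted (key, value) pairs; updates: sorted (key, op, value) triples
    out, i, j = [], 0, 0
    while i < len(base_rows) or j < len(updates):
        if j == len(updates) or (i < len(base_rows) and base_rows[i][0] < updates[j][0]):
            out.append(base_rows[i]); i += 1
        else:
            key, op, value = updates[j]; j += 1
            if i < len(base_rows) and base_rows[i][0] == key:
                i += 1                       # the update shadows the base row
            if op == "upsert":
                out.append((key, value))
    return out

base = [(1, "a"), (2, "b"), (4, "d")]
ssd_updates = [(2, "upsert", "b2"), (3, "upsert", "c"), (4, "delete", None)]
print(merge_scan(base, ssd_updates))  # [(1, 'a'), (2, 'b2'), (3, 'c')]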

74. Latent OLAP: data cubes over latent variables.

Paper Link】 【Pages】:877-888

【Authors】: Deepak Agarwal ; Bee-Chung Chen

【Abstract】: We introduce a novel class of data cube, called latent-variable cube. For many data analysis tasks, data in a database can be represented as points in a multi-dimensional space. Ordinary data cubes compute aggregate functions over these "observed" data points for each cell (i.e., region) in the space, where the cells have different granularities defined by hierarchies. While useful, data cubes do not provide sufficient capability for analyzing "latent variables" that are often of interest but not directly observed in data. For example, when analyzing users' interaction with online advertisements, observed data informs whether a user clicked an ad or not. However, the real interest is often in knowing the click probabilities of ads for different user populations. In this example, click probabilities are latent variables that are not observed but have to be estimated from data. We argue that latent variables are a useful construct for a number of OLAP application scenarios. To facilitate such analyses, we propose cubes that compute aggregate functions over latent variables. Specifically, we discuss the pitfalls of common practice in scenarios where latent variables should, but are not considered; we rigorously define latent-variable cube based on Bayesian hierarchical models and provide efficient algorithms. Through extensive experiments on both simulated and real data, we show that our method is accurate and runs orders of magnitude faster than the baseline.

【Keywords】: algebraic aggregation; bayesian hierarchical models; posterior mode; variance estimation

75. E-Cube: multi-dimensional event sequence analysis using hierarchical pattern query sharing.

Paper Link】 【Pages】:889-900

【Authors】: Mo Liu ; Elke A. Rundensteiner ; Kara Greenfield ; Chetan Gupta ; Song Wang ; Ismail Ari ; Abhay Mehta

【Abstract】: Many modern applications, including online financial feeds, tag-based mass transit systems, and RFID-based supply chain management systems, transmit real-time data streams. There is a need for event stream processing technology to analyze this vast amount of sequential data and enable online operational decision making. Existing techniques such as traditional online analytical processing (OLAP) systems are not designed for real-time pattern-based operations, while state-of-the-art Complex Event Processing (CEP) systems designed for sequence detection do not support OLAP operations. We propose a novel E-Cube model which combines CEP and OLAP techniques for efficient multi-dimensional event pattern analysis at different abstraction levels. Our analysis of the interrelationships in both concept abstraction and pattern refinement among queries facilitates the composition of these queries into an integrated E-Cube hierarchy. Based on this hierarchy, strategies of drill-down (refinement from abstract to more specific patterns) and of roll-up (generalization from specific to more abstract patterns) are developed for efficient workload evaluation. Our proposed execution strategies reuse intermediate results along both the concept and the pattern refinement relationships between queries. On this foundation, we design a cost-driven adaptive optimizer called Chase that exploits these reuse strategies for optimal E-Cube hierarchy execution. Our experimental studies, comparing alternate strategies on a real-world financial data stream under different workload conditions, demonstrate the superiority of the Chase method. In particular, our Chase execution in many cases performs tenfold faster than the state-of-the-art strategy for real stock market query workloads.

【Keywords】: OLAP; algorithm; complex event processing; optimization; streaming

Graph management 4

Paper Link】 【Pages】:901-912

【Authors】: Arijit Khan ; Nan Li ; Xifeng Yan ; Ziyu Guan ; Supriyo Chakraborty ; Shu Tao

【Abstract】: Search over complex social and information networks is becoming important in a variety of applications. At the core of these applications lies a common and critical problem: given a labeled network and a query graph, how to efficiently search for the query graph in the target network. The presence of noise and incomplete knowledge about the structure and content of the target network make it unrealistic to look for an exact match. Rather, it is more appealing to find the top-k approximate matches. In this paper, we propose a neighborhood-based similarity measure that avoids costly graph isomorphism and edit distance computation. Under this new measure, we prove that subgraph similarity search is NP-hard, while graph similarity matching is polynomial. By studying the principles behind this measure, we found an information propagation model that is able to convert a large network into a set of multidimensional vectors, where sophisticated indexing and similarity search algorithms are available. The proposed method, called Ness (Neighborhood Based Similarity Search), is appropriate for graphs with low automorphism and high noise, which are common in many social and information networks. Ness is not only efficient, but also robust against structural noise and information loss. Empirical results show that it can quickly and accurately find high-quality matches in large networks, with negligible cost.

【Keywords】: RDF; graph alignment; graph query; graph search

77. A memory efficient reachability data structure through bit vector compression.

Paper Link】 【Pages】:913-924

【Authors】: Sebastiaan J. van Schaik ; Oege de Moor

【Abstract】: When answering many reachability queries on a large graph, the principal challenge is to represent the transitive closure of the graph compactly, while still allowing fast membership tests on that transitive closure. Recent attempts to address this problem are complex data structures and algorithms such as Path-Tree and 3-HOP. We propose a simple alternative based on a novel form of bit-vector compression. Our starting point is the observation that when computing the transitive closure, reachable vertices tend to cluster together. We adapt the well-known scheme of word-aligned hybrid compression (WAH) to work more efficiently by introducing word partitions. We prove that the resulting scheme leads to a more compact data structure than its closest competitor, namely interval lists. In extensive and detailed experiments, this is confirmed in practice. We also demonstrate that the new technique can handle much larger graphs than alternative algorithms.

【Keywords】: PWAH; WAH; bit vector compression; graph indexing; reachability queries; transitive closure

78. Incremental graph pattern matching.

Paper Link】 【Pages】:925-936

【Authors】: Wenfei Fan ; Jianzhong Li ; Jizhou Luo ; Zijing Tan ; Xin Wang ; Yinghui Wu

【Abstract】: Graph pattern matching has become a routine process in emerging applications such as social networks. In practice a data graph is typically large, and is frequently updated with small changes. It is often prohibitively expensive to recompute matches from scratch via batch algorithms when the graph is updated. With this comes the need for incremental algorithms that compute changes to the matches in response to updates, to minimize unnecessary recomputation. This paper investigates incremental algorithms for graph pattern matching defined in terms of graph simulation, bounded simulation and subgraph isomorphism. (1) For simulation, we provide incremental algorithms for unit updates and certain graph patterns. These algorithms are optimal: they run in time linear in the size of the changes in the input and output, which characterizes the cost inherent to the problem itself. For general patterns we show that the incremental matching problem is unbounded, i.e., its cost is not determined by the size of the changes alone. (2) For bounded simulation, we show that the problem is unbounded even for unit updates and path patterns. (3) For subgraph isomorphism, we show that the problem is intractable and unbounded for unit updates and path patterns. (4) For multiple updates, we develop an incremental algorithm for each of simulation, bounded simulation and subgraph isomorphism. We experimentally verify that these incremental algorithms significantly outperform their batch counterparts in response to small changes, using real-life data and synthetic data.
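
For readers unfamiliar with graph simulation, its batch semantics can be captured by a small fixpoint computation; the naive sketch below only illustrates that semantics, while the paper's contribution is maintaining the result incrementally under updates.

```python
# Naive batch graph simulation: sim(u) holds data-graph nodes v whose label
# matches u and that can mimic every outgoing edge of u. The incremental
# algorithms in the paper maintain this relation under updates instead of
# recomputing the fixpoint; this sketch only shows the batch semantics.
def graph_simulation(p_adj, p_label, g_adj, g_label):
    sim = {u: {v for v in g_label if g_label[v] == p_label[u]} for u in p_label}
    changed = True
    while changed:
        changed = False
        for u in p_label:
            for v in list(sim[u]):
                # v must have, for every pattern edge u -> u2, a successor in sim(u2)
                ok = all(any(w in sim[u2] for w in g_adj.get(v, ()))
                         for u2 in p_adj.get(u, ()))
                if not ok:
                    sim[u].discard(v)
                    changed = True
    return sim

# pattern: a manager supervising a DB developer
p_adj = {'PM': ['DB'], 'DB': []}
p_label = {'PM': 'manager', 'DB': 'dbdev'}
g_adj = {'m1': ['d1'], 'm2': [], 'd1': []}
g_label = {'m1': 'manager', 'm2': 'manager', 'd1': 'dbdev'}
print(graph_simulation(p_adj, p_label, g_adj, g_label))
```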

【Keywords】: affected area; bounded incremental matching algorithms

79. Assessing and ranking structural correlations in graphs.

Paper Link】 【Pages】:937-948

【Authors】: Ziyu Guan ; Jian Wu ; Qing Zhang ; Ambuj K. Singh ; Xifeng Yan

【Abstract】: Real-life graphs not only have nodes and edges, but also have events taking place, e.g., product sales in social networks and virus infection in communication networks. Among different events, some exhibit strong correlation with the network structure, while others do not. Such structural correlation will shed light on viral influence existing in the corresponding network. Unfortunately, the traditional association mining concept is not applicable in graphs since it only works on homogeneous datasets like transactions and baskets. We propose a novel measure for assessing such structural correlations in heterogeneous graph datasets with events. The measure applies hitting time to aggregate the proximity among nodes that have the same event. In order to calculate the correlation scores for many events in a large network, we develop a scalable framework, called gScore, using sampling and approximation. By comparing to the situation where events are randomly distributed in the same network, our method is able to discover events that are highly correlated with the graph structure. gScore is scalable and was successfully applied to the co-author DBLP network and social networks extracted from TaoBao.com, the largest online shopping network in China, with many interesting discoveries.
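
A Monte Carlo sketch of the hitting-time ingredient: estimate, via truncated random walks, how quickly a node reaches the set of nodes carrying an event. The walk count and truncation length are illustrative assumptions; gScore's sampling and approximation framework is considerably more involved.

```python
import random

# Sketch: estimate the (truncated) hitting time from node `start` to any node
# in `event_nodes` by sampling random walks. Illustrative only; gScore combines
# such proximity estimates over sampled event nodes and compares them with a
# randomized baseline to score structural correlation.
def estimate_hitting_time(adj, start, event_nodes, walks=1000, max_steps=50):
    total = 0.0
    for _ in range(walks):
        node, steps = start, 0
        while node not in event_nodes and steps < max_steps:
            node = random.choice(adj[node])
            steps += 1
        total += steps
    return total / walks

adj = {'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a', 'b', 'd'], 'd': ['c']}
event_nodes = {'d'}                      # nodes where the event occurred
print(estimate_hitting_time(adj, 'a', event_nodes))
```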

【Keywords】: graph; hitting time; structural correlation

Scalable data analytics 4

80. Processing theta-joins using MapReduce.

Paper Link】 【Pages】:949-960

【Authors】: Alper Okcan ; Mirek Riedewald

【Abstract】: Joins are essential for many data analysis tasks, but are not supported directly by the MapReduce paradigm. While there has been progress on equi-joins, implementation of join algorithms in MapReduce in general is not sufficiently understood. We study the problem of how to map arbitrary join conditions to Map and Reduce functions, i.e., a parallel infrastructure that controls data flow based on key-equality only. Our proposed join model simplifies creation of and reasoning about joins in MapReduce. Using this model, we derive a surprisingly simple randomized algorithm, called 1-Bucket-Theta, for implementing arbitrary joins (theta-joins) in a single MapReduce job. This algorithm only requires minimal statistics (input cardinality) and we provide evidence that for a variety of join problems, it is either close to optimal or the best possible option. For some of the problems where 1-Bucket-Theta is not the best choice, we show how to achieve better performance by exploiting additional input statistics. All algorithms can be made 'memory-aware', and they do not require any modifications to the MapReduce environment. Experiments show the effectiveness of our approach.
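
The core of 1-Bucket-Theta can be illustrated outside Hadoop: cover the |S| x |T| join matrix with a grid of reducer regions, send each S tuple to every region in a randomly chosen row band and each T tuple to every region in a randomly chosen column band, and evaluate the theta predicate locally in each region. The sketch below simulates this partitioning in plain Python; the grid shape and predicate are assumptions for illustration.

```python
import random
from collections import defaultdict

# Simulated 1-Bucket-Theta partitioning: a rows x cols grid of "reducers"
# covers the |S| x |T| join matrix. Each S tuple picks a random row band and is
# replicated to all reducers in that band; each T tuple picks a random column
# band. Every (s, t) pair then meets in exactly one reducer.
def one_bucket_theta(S, T, theta, rows=2, cols=2):
    buckets = defaultdict(lambda: ([], []))      # region -> (S part, T part)
    for s in S:
        r = random.randrange(rows)
        for c in range(cols):
            buckets[(r, c)][0].append(s)
    for t in T:
        c = random.randrange(cols)
        for r in range(rows):
            buckets[(r, c)][1].append(t)
    out = []
    for s_part, t_part in buckets.values():      # "reduce" phase
        out.extend((s, t) for s in s_part for t in t_part if theta(s, t))
    return out

S = [1, 3, 5, 7]
T = [2, 4, 6]
result = one_bucket_theta(S, T, theta=lambda s, t: s < t)   # an inequality join
assert sorted(result) == sorted((s, t) for s in S for t in T if s < t)
print(len(result), 'result pairs')
```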

【Keywords】: MapReduce; skew; theta join processing

81. Llama: leveraging columnar storage for scalable join processing in the MapReduce framework.

Paper Link】 【Pages】:961-972

【Authors】: Yuting Lin ; Divyakant Agrawal ; Chun Chen ; Beng Chin Ooi ; Sai Wu

【Abstract】: To achieve high reliability and scalability, most large-scale data warehouse systems have adopted the cluster-based architecture. In this paper, we propose the design of a new cluster-based data warehouse system, Llama, a hybrid data management system which combines the features of row-wise and column-wise database systems. In Llama, columns are formed into correlation groups to provide the basis for the vertical partitioning of tables. Llama employs a distributed file system (DFS) to disseminate data among cluster nodes. Above the DFS, a MapReduce-based query engine is supported. We design a new join algorithm to facilitate fast join processing. We present a performance study on the TPC-H dataset and compare Llama with Hive, a data warehouse infrastructure built on top of Hadoop. The experiment is conducted on EC2. The results show that Llama has an excellent load performance and its query performance is significantly better than the traditional MapReduce framework based on row-wise storage.

【Keywords】: MapReduce; column store; join

82. Fast personalized PageRank on MapReduce.

Paper Link】 【Pages】:973-984

【Authors】: Bahman Bahmani ; Kaushik Chakrabarti ; Dong Xin

【Abstract】: In this paper, we design a fast MapReduce algorithm for Monte Carlo approximation of personalized PageRank vectors of all the nodes in a graph. The basic idea is to very efficiently perform single random walks of a given length starting at each node in the graph. More precisely, we design a MapReduce algorithm, which given a graph G and a length λ, outputs a single random walk of length λ starting at each node in G. We will show that the number of MapReduce iterations used by our algorithm is optimal among a broad family of algorithms for the problem, and its I/O efficiency is much better than the existing candidates. We will then show how we can use this algorithm to very efficiently approximate all the personalized PageRank vectors. Our empirical evaluation on real-life graph data and in a production MapReduce environment shows that our algorithm is significantly more efficient than all the existing algorithms in the MapReduce setting.
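
The underlying Monte Carlo estimator, independent of MapReduce, can be sketched as follows; the restart probability and walk-count choices are illustrative, and the single-machine loop does not capture the paper's I/O-efficient walk generation.

```python
import random
from collections import Counter

# Sketch of Monte Carlo personalized PageRank for a source node: run many
# short random walks that terminate with probability alpha at each step, and
# use normalized visit counts as the PPR estimate.
def monte_carlo_ppr(adj, source, alpha=0.15, num_walks=10000):
    visits = Counter()
    for _ in range(num_walks):
        node = source
        while True:
            visits[node] += 1
            if random.random() < alpha or not adj.get(node):
                break                       # walk ends (restart at the source)
            node = random.choice(adj[node])
    total = sum(visits.values())
    return {v: c / total for v, c in visits.items()}

adj = {'u': ['v', 'w'], 'v': ['w'], 'w': ['u']}
ppr = monte_carlo_ppr(adj, 'u')
print(sorted(ppr.items(), key=lambda kv: -kv[1]))
```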

【Keywords】: MapReduce; personalized pagerank

83. A platform for scalable one-pass analytics using MapReduce.

Paper Link】 【Pages】:985-996

【Authors】: Boduo Li ; Edward Mazur ; Yanlei Diao ; Andrew McGregor ; Prashant J. Shenoy

【Abstract】: Today's one-pass analytics applications tend to be data-intensive in nature and require the ability to process high volumes of data efficiently. MapReduce is a popular programming model for processing large datasets using a cluster of machines. However, the traditional MapReduce model is not well-suited for one-pass analytics, since it is geared towards batch processing and requires the data set to be fully loaded into the cluster before running analytical queries. This paper examines, from a systems standpoint, what architectural design changes are necessary to bring the benefits of the MapReduce model to incremental one-pass analytics. Our empirical and theoretical analyses of Hadoop-based MapReduce systems show that the widely-used sort-merge implementation for partitioning and parallel processing poses a fundamental barrier to incremental one-pass analytics, despite various optimizations. To address these limitations, we propose a new data analysis platform that employs hash techniques to enable fast in-memory processing, and a new frequent key based technique to extend such processing to workloads that require a large key-state space. Evaluation of our Hadoop-based prototype using real-world workloads shows that our new platform significantly improves the progress of map tasks, allows the reduce progress to keep up with the map progress, with up to 3 orders of magnitude reduction of internal data spills, and enables results to be returned continuously during the job.
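
The contrast that motivates the platform can be seen in a tiny, generic sketch: hash-based grouping folds each record into in-memory state and can emit partial answers at any time, whereas a sort-merge reducer must first buffer and sort its input. This is a generic illustration, not the system's implementation.

```python
# Generic illustration: incremental hash-based aggregation over a stream of
# (key, value) records. Partial results are available at any point, which is
# what enables one-pass / incremental answers; a sort-merge pipeline would
# first buffer and sort all records by key before producing any output.
def hash_aggregate(records, report_every=3):
    state = {}                                   # key -> running sum
    for i, (key, value) in enumerate(records, 1):
        state[key] = state.get(key, 0) + value
        if i % report_every == 0:
            print('after', i, 'records:', dict(state))   # early partial answers
    return state

records = [('a', 1), ('b', 2), ('a', 3), ('c', 4), ('b', 5), ('a', 6)]
print('final:', hash_aggregate(records))
```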

【Keywords】: incremental computation; one-pass analytics; parallel processing

84. ATLAS: a probabilistic algorithm for high dimensional similarity search.

Paper Link】 【Pages】:997-1008

【Authors】: Jiaqi Zhai ; Yin Lou ; Johannes Gehrke

【Abstract】: Given a set of high dimensional binary vectors and a similarity function (such as Jaccard and Cosine), we study the problem of finding all pairs of vectors whose similarity exceeds a given threshold. The solution to this problem is a key component in many applications with feature-rich objects, such as text, images, music, videos, or social networks. In particular, there are many important emerging applications that require the use of relatively low similarity thresholds. We propose ATLAS, a probabilistic similarity search algorithm that in expectation finds a 1 - δ fraction of all similar vector pairs. ATLAS uses truly random permutations both to filter candidate pairs of vectors and to estimate the similarity between vectors. At a 97.5% recall rate, ATLAS consistently outperforms all state-of-the-art approaches and achieves a speed-up of up to two orders of magnitude over both exact and approximate algorithms.
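
The permutation-based estimation at the heart of the approach can be sketched with min-wise random permutations for Jaccard similarity; the parameters and the filter/verify split below are illustrative and do not reproduce ATLAS's exact construction.

```python
import random

# Sketch: estimate Jaccard similarity of binary vectors (given as sets of set
# positions) with k random permutations of the universe; the fraction of
# permutations on which both sets share the minimum-ranked element is an
# unbiased estimate of their Jaccard similarity. Pairs whose estimate clears
# the threshold would then be verified exactly.
def make_permutations(universe, k, seed=0):
    rng = random.Random(seed)
    perms = []
    for _ in range(k):
        p = list(universe)
        rng.shuffle(p)
        perms.append({x: i for i, x in enumerate(p)})    # element -> rank
    return perms

def estimate_jaccard(a, b, perms):
    agree = sum(min(a, key=p.__getitem__) == min(b, key=p.__getitem__)
                for p in perms)
    return agree / len(perms)

universe = range(100)
perms = make_permutations(universe, k=200)
a = {1, 2, 3, 4, 5, 6}
b = {4, 5, 6, 7, 8}
print('estimate:', estimate_jaccard(a, b, perms),
      'exact:', len(a & b) / len(a | b))
```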

【Keywords】: data mining; set similarity join; similarity search

85. Flexible aggregate similarity search.

Paper Link】 【Pages】:1009-1020

【Authors】: Yang Li ; Feifei Li ; Ke Yi ; Bin Yao ; Min Wang

【Abstract】: Aggregate similarity search, a.k.a. aggregate nearest neighbor (Ann) query, finds many useful applications in spatial and multimedia databases. Given a group Q of M query objects, it retrieves the most (or top-k) similar object to Q from a database P, where the similarity is an aggregation (e.g., sum, max) of the distances between the retrieved object p and all the objects in Q. In this paper, we propose an added flexibility to the query definition, where the similarity is an aggregation over the distances between p and any subset of φM objects in Q for some support 0 < φ ≤ 1. We call this new definition flexible aggregate similarity (Fann) search, which generalizes the Ann problem. Next, we present algorithms for answering Fann queries exactly and approximately. Our approximation algorithms are especially appealing: they are simple, highly efficient, and work well in both low and high dimensions. They also return near-optimal answers with guaranteed constant-factor approximations in any dimension. Extensive experiments on large real and synthetic datasets from 2 to 74 dimensions have demonstrated their superior efficiency and high quality.
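
The query definition itself is easy to state in code. Below is a brute-force sketch of a Fann query (Euclidean distance, sum/max aggregates assumed), which simply realizes the definition; the paper's exact and approximate algorithms avoid this exhaustive scan.

```python
import math

# Brute-force flexible aggregate similarity (Fann) query: for each candidate
# p, aggregate its distances to the closest ceil(phi * |Q|) query objects and
# return the candidate minimizing that aggregate. For monotone aggregates such
# as sum and max, the best subset of that size is exactly the nearest one.
def fann(P, Q, phi=0.5, agg=sum):
    m = max(1, math.ceil(phi * len(Q)))
    best, best_score = None, float('inf')
    for p in P:
        dists = sorted(math.dist(p, q) for q in Q)
        score = agg(dists[:m])                  # best subset of size ceil(phi*|Q|)
        if score < best_score:
            best, best_score = p, score
    return best, best_score

P = [(0, 0), (5, 5), (9, 1)]
Q = [(1, 0), (0, 1), (10, 10)]
print(fann(P, Q, phi=0.5, agg=sum))
print(fann(P, Q, phi=1.0, agg=max))
```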

【Keywords】: aggregate nearest neighbor; aggregate similarity search; nearest neighbor; similarity search

86. Effective data co-reduction for multimedia similarity search.

Paper Link】 【Pages】:1021-1032

【Authors】: Zi Huang ; Heng Tao Shen ; Jiajun Liu ; Xiaofang Zhou

【Abstract】: Multimedia similarity search has been playing a critical role in many novel applications. Typically, multimedia objects are described by high-dimensional feature vectors (or points) which are organized in databases for retrieval. Although many high-dimensional indexing methods have been proposed to facilitate the search process, efficient retrieval over large, sparse and extremely high-dimensional databases remains challenging due to the continuous increases in data size and feature dimensionality. In this paper, we propose the first framework for Data Co-Reduction (DCR) on both data size and feature dimensionality. By utilizing recently developed co-clustering methods, DCR simultaneously reduces both size and dimensionality of the original data into a compact subspace, where lower bounds of the actual distances in the original space can be efficiently established to achieve fast and lossless similarity search in the filter-and-refine approach. Particularly, DCR considers the duality between size and dimensionality, and achieves the optimal co-reduction, which generates the least number of candidates for actual distance computations. We conduct an extensive experimental study on large and real-life multimedia datasets, with dimensionality ranging from 432 to 1936. Our results demonstrate that DCR outperforms existing methods significantly for lossless retrieval, especially in the presence of extremely high dimensionality.

【Keywords】: data co-reduction; high-dimensional indexing; search

87. Efficient exact edit similarity query processing with the asymmetric signature scheme.

Paper Link】 【Pages】:1033-1044

【Authors】: Jianbin Qin ; Wei Wang ; Yifei Lu ; Chuan Xiao ; Xuemin Lin

【Abstract】: Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold t. Most existing methods for answering edit similarity queries rely on a signature scheme to generate candidates given the query string. We observe that the number of signatures generated by existing methods is far greater than the lower bound, and this results in high query time and index space complexities. In this paper, we show that the minimum signature size lower bound is t + 1. We then propose asymmetric signature schemes that achieve this lower bound. We develop efficient query processing algorithms based on the new scheme. Several dynamic programming-based candidate pruning methods are also developed to further speed up the performance. We have conducted a comprehensive experimental study involving nine state-of-the-art algorithms. The experiment results clearly demonstrate the efficiency of our methods.
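
The flavor of the t + 1 bound can be seen in the classic partition-based (pigeonhole) filter sketched below: split each data string into t + 1 chunks, and if the edit distance to the query is at most t, at least one chunk survives unchanged and must occur in the query. This illustrates the bound only; it is not the paper's asymmetric signature scheme.

```python
# Pigeonhole-style candidate generation for edit similarity search: a data
# string is split into t+1 chunks; ed(query, s) <= t implies at least one
# chunk occurs verbatim in the query, so chunk containment is a cheap filter
# before exact edit-distance verification. Illustrative sketch only.
def chunks(s, t):
    k = t + 1
    n = len(s)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [s[bounds[i]:bounds[i + 1]] for i in range(k)]

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def edit_search(query, strings, t):
    results = []
    for s in strings:
        if any(c and c in query for c in chunks(s, t)):   # filter
            if edit_distance(query, s) <= t:               # verify
                results.append(s)
    return results

db = ['kaushik', 'kaushuk', 'chakra', 'chakrabarti']
print(edit_search('caushik', db, t=1))
```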

【Keywords】: Q-gram; approximate pattern matching; edit distance; similarity join; similarity search

Keynote address 2

88. Managing scientific data: lessons, challenges, and opportunities.

Paper Link】 【Pages】:1045-1046

【Authors】: Anastasia Ailamaki

【Abstract】: Today's scientific processes heavily depend on fast and accurate analysis of experimental data. Scientists are routinely overwhelmed by the effort needed to manage the volumes of data produced either by observing phenomena or by sophisticated simulations. As database systems have proven inefficient, inadequate, or insufficient to meet the needs of scientific applications, the scientific community typically uses special-purpose legacy software. When compared to a general-purpose DBMS, however, application-specific systems require more resources to maintain, and in order to achieve acceptable performance they often sacrifice data independence and hinder the reuse of knowledge. Nowadays, scientific datasets are growing at unprecedented rates, a result of increasing complexity of the simulated models and ever-improving instrument precision; consequently, scientists' queries become more sophisticated as they try to interpret the data correctly. Datasets and scientific query complexity are likely to continue to grow indefinitely, rendering legacy systems increasingly inadequate. To respond to the challenge, the data management community aspires to solve scientific data management problems by carefully examining the problems of scientific applications and by developing special- or general-purpose scientific data management techniques and systems. This talk discusses the work of teams around the world in an effort to surface the most critical requirements of such an undertaking, and the technological innovations needed to satisfy them.

【Keywords】: astronomy; brain simulations; earthquake simulations

89. Internet scale storage.

Paper Link】 【Pages】:1047-1048

【Authors】: James R. Hamilton

【Abstract】: The pace of innovation in data center design has been rapidly accelerating over the last five years, driven by the mega-service operators. I believe we have seen more infrastructure innovation in the last five years than we did in the previous fifteen. Most very large service operators have teams of experts focused on server design, data center power distribution and redundancy, mechanical designs, real estate acquisition, and network hardware and protocols. At low scale, with only a data center or two, it would be crazy to have all these full-time engineers and specialists focused on infrastructural improvements and expansion. But, at high scale with tens of data centers, it would be crazy not to invest deeply in advancing the state of the art. Looking specifically at cloud services, the cost of the infrastructure is the difference between an unsuccessful cloud service and a profitable, self-sustaining business. With continued innovation driving down infrastructure costs, investment capital is available, services can be added and improved, and value can be passed on to customers through price reductions. Amazon Web Services, for example, has had eleven price reductions in four years. I don't recall that happening in my first twenty years working on enterprise software. It really is an exciting time in our industry. I started working on database systems twenty years ago during a period of incredibly rapid change. We improved DB2 performance measured using TPC-A by a factor of ten in a single release. The next release, we made a further four-fold improvement. It's rare to be able to improve a product forty-fold in three years but, admittedly, one of the secrets is to begin from a position where work is truly needed. Back then, the database industry was in its infancy. Customers loved the products and were using them heavily, but we were not anywhere close to delivering on the full promise of the technology. That's exactly where cloud computing is today--just where the database world was twenty years ago. Customers are getting great value from cloud computing but, at the same time, we have much more to do and many of the most interesting problems are yet to be solved. I could easily imagine tenfold improvement across several dimensions over the next five years. What ties these two problems from different decades together is that some of the biggest problems in cloud computing are problems in persistent state management. What's different is that we now have to tackle these problems in a multi-tenant, high-scale, multi-datacenter environment. It's a new vista for database and storage problems. In this talk, we'll analyze an internet-scale data center looking at the cost of power distribution, servers, storage, networking, and cooling on the belief that understanding what drives cost helps us focus on the most valuable research directions. We'll look at some of the fundamental technology limits approached in cloud database and storage solutions on the belief that, at scale, these limits will constrain practical solutions. And we'll consider existing cloud services since they form the foundation on which future solutions might be built.

【Keywords】: storage

Industrial papers 14

90. LCI: a social channel analysis platform for live customer intelligence.

Paper Link】 【Pages】:1049-1058

【Authors】: Malú Castellanos ; Umeshwar Dayal ; Meichun Hsu ; Riddhiman Ghosh ; Mohamed Dekhil ; Yue Lu ; Lei Zhang ; Mark Schreiman

【Abstract】: The rise of Web 2.0 with its increasingly popular social sites like Twitter, Facebook, blogs and review sites has motivated people to express their opinions publicly and more frequently than ever before. This has fueled the emerging field known as sentiment analysis whose goal is to translate the vagaries of human emotion into hard data. LCI is a social channel analysis platform that taps into what is being said to understand the sentiment with the particular ability of doing so in near real-time. LCI integrates novel algorithms for sentiment analysis and a configurable dashboard with different kinds of charts including dynamic ones that change as new data is ingested. LCI has been researched and prototyped at HP Labs in close interaction with the Business Intelligence Solutions (BIS) Division and a few customers. This paper presents an overview of the architecture and some of its key components and algorithms, focusing in particular on how LCI deals with Twitter and illustrating its capabilities with selected use cases.

【Keywords】: business intelligence; customer intelligence; opinion lexicon; opinion mining; sentiment analysis; sentiment monitoring; social media; streaming data

91. Bistro data feed management system.

Paper Link】 【Pages】:1059-1070

【Authors】: Vladislav Shkapenyuk ; Theodore Johnson ; Divesh Srivastava

【Abstract】: Data feed management is a critical component of many data intensive applications that depend on reliable data delivery to support real-time data collection, correlation and analysis. Data is typically collected from a wide variety of sources and organizations, using a range of mechanisms - some data are streamed in real time, while other data are obtained at regular intervals or collected in an ad hoc fashion. Individual applications are forced to make separate arrangements with feed providers, learn the structure of incoming files, monitor data quality, and trigger any processing necessary. The Bistro data feed manager, designed and implemented at AT&T Labs-Research, simplifies and automates this complex task of data feed management: efficiently handling incoming raw files, identifying data feeds and distributing them to remote subscribers. Bistro supports a flexible specification language to define logical data feeds using the naming structure of physical data files, and to identify feed subscribers. Based on the specification, Bistro matches data files to feeds, performs file normalization and compression, efficiently delivers files, and notifies subscribers using a trigger mechanism. We describe our feed analyzer that discovers the naming structure of incoming data files to detect new feeds, dropped feeds, feed changes, or lost data in an existing feed. Bistro is currently deployed within AT&T Labs and is responsible for the real-time delivery of over 100 different raw feeds, distributing data to several large-scale stream warehouses.

【Keywords】: data feed management; distributed systems

92. Apache hadoop goes realtime at Facebook.

Paper Link】 【Pages】:1071-1080

【Authors】: Dhruba Borthakur ; Jonathan Gray ; Joydeep Sen Sarma ; Kannan Muthukkaruppan ; Nicolas Spiegelberg ; Hairong Kuang ; Karthik Ranganathan ; Dmytro Molkov ; Aravind Menon ; Samuel Rash ; Rodrigo Schmidt ; Amitanand S. Aiyer

【Abstract】: Facebook recently deployed Facebook Messages, its first ever user-facing application built on the Apache Hadoop platform. Apache HBase is a database-like layer built on Hadoop designed to support billions of messages per day. This paper describes the reasons why Facebook chose Hadoop and HBase over other systems such as Apache Cassandra and Voldemort and discusses the application's requirements for consistency, availability, partition tolerance, data model and scalability. We explore the enhancements made to Hadoop to make it a more effective realtime system, the tradeoffs we made while configuring the system, and how this solution has significant advantages over the sharded MySQL database scheme used in other applications at Facebook and many other web-scale companies. We discuss the motivations behind our design choices, the challenges that we face in day-to-day operations, and future capabilities and improvements still under development. We offer these observations on the deployment as a model for other companies who are contemplating a Hadoop-based solution over traditional sharded RDBMS deployments.

【Keywords】: HBase; data; distributed systems; hadoop; hive; scalability; scribe

93. Nova: continuous Pig/Hadoop workflows.

Paper Link】 【Pages】:1081-1090

【Authors】: Christopher Olston ; Greg Chiou ; Laukik Chitnis ; Francis Liu ; Yiping Han ; Mattias Larsson ; Andreas Neumann ; Vellanki B. N. Rao ; Vijayanand Sankarasubramanian ; Siddharth Seth ; Chao Tian ; Topher ZiCornell ; Xiaodan Wang

【Abstract】: This paper describes a workflow manager developed and deployed at Yahoo called Nova, which pushes continually-arriving data through graphs of Pig programs executing on Hadoop clusters. (Pig is a structured dataflow language and runtime for the Hadoop map-reduce system.) Nova is like data stream managers in its support for stateful incremental processing, but unlike them in that it deals with data in large batches using disk-based processing. Batched incremental processing is a good fit for a large fraction of Yahoo's data processing use-cases, which deal with continually-arriving data and benefit from incremental algorithms, but do not require ultra-low-latency processing.

【Keywords】: hadoop; incremental processing; workflow

94. A Hadoop based distributed loading approach to parallel data warehouses.

Paper Link】 【Pages】:1091-1100

【Authors】: Yu Xu ; Pekka Kostamaa ; Yan Qi ; Jian Wen ; Kevin Keliang Zhao

【Abstract】: One critical part of building and running a data warehouse is the ETL (Extraction Transformation Loading) process. In fact, the growing ETL tool market is already a multi-billion-dollar market. Getting data into data warehouses has been a hindering factor to wider potential database applications such as scientific computing, as discussed in recent panels at various database conferences. One particular problem with the current load approaches to data warehouses is that while data are partitioned and replicated across all nodes in data warehouses powered by parallel DBMSs (PDBMSs), load utilities typically reside on a single node, which faces the issues of i) data loss/data availability if the node or its hard drives crash; ii) the file size limit on a single node; and iii) load performance. All of these issues are mostly handled manually or only helped to some degree by tools. We notice that one common thing between Hadoop and the Teradata Enterprise Data Warehouse (EDW) is that data in both systems are partitioned across multiple nodes for parallel computing, which creates parallel loading opportunities not possible for DBMSs running on a single node. In this paper we describe our approach of using Hadoop as a distributed load strategy for Teradata EDW. We use Hadoop as the intermediate load server to store data to be loaded to Teradata EDW. We gain all the benefits from HDFS (Hadoop Distributed File System): i) significantly increased disk space for the file to be loaded; ii) once the data is written to HDFS, it is not necessary for the data sources to keep the data even before the file is loaded to Teradata EDW; iii) MapReduce programs can be used to transform and add structure to unstructured or semi-structured data; iv) more importantly, since a file is distributed in HDFS, the file can be loaded more quickly in parallel to Teradata EDW, which is the main focus of this paper. When both Hadoop and Teradata EDW coexist on the same hardware platform, as is increasingly required by customers because of reduced hardware and system administration costs, we have another optimization opportunity to directly load HDFS data blocks to Teradata parallel units on the same nodes. However, due to the inherent non-uniform data distribution in HDFS, we can rarely avoid transferring HDFS blocks to remote Teradata nodes. We designed a polynomial-time optimal algorithm and a polynomial-time approximate algorithm to assign HDFS blocks to Teradata parallel units evenly and minimize network traffic. We performed experiments on synthetic and real data sets to compare the performance of the algorithms.
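
A greedy sketch of the assignment objective when Hadoop and the warehouse share nodes: balance the per-node load while preferring nodes that already hold a local replica of a block. This only illustrates the goal; the paper presents an optimal polynomial-time algorithm and an approximate variant, neither of which is reproduced here.

```python
# Greedy illustration of block assignment: give each HDFS block to the
# least-loaded node, breaking ties in favor of a node that already holds a
# local replica of the block (so no network transfer is needed).
def assign_blocks(blocks, nodes):
    load = {n: 0 for n in nodes}
    assignment = {}
    for block, replicas in blocks.items():
        # sort candidates by (current load, non-locality): prefer light, local nodes
        best = min(nodes, key=lambda n: (load[n], n not in replicas))
        assignment[block] = best
        load[best] += 1
    return assignment, load

blocks = {                      # block id -> nodes holding an HDFS replica
    'b1': {'n1', 'n2'}, 'b2': {'n2'}, 'b3': {'n3'},
    'b4': {'n1'}, 'b5': {'n2', 'n3'}, 'b6': {'n1', 'n3'},
}
nodes = ['n1', 'n2', 'n3']
assignment, load = assign_blocks(blocks, nodes)
print(assignment)
print('load per node:', load)
```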

【Keywords】: data load; hadoop; parallel DBMS

95. A batch of PNUTS: experiences connecting cloud batch and serving systems.

Paper Link】 【Pages】:1101-1112

【Authors】: Adam Silberstein ; Russell Sears ; Wenchao Zhou ; Brian F. Cooper

【Abstract】: Cloud data management systems are growing in prominence, particularly at large Internet companies like Google, Yahoo!, and Amazon, which prize them for their scalability and elasticity. Each of these systems trades off between low-latency serving performance and batch processing throughput. In this paper, we discuss our experience running batch-oriented Hadoop on top of Yahoo's serving-oriented PNUTS system instead of the standard HDFS file system. Though PNUTS is optimized for and primarily used for serving, a number of applications at Yahoo! must run batch-oriented jobs that read or write data that is stored in PNUTS. Combining these systems reveals several key areas where the fundamental properties of each system are mismatched. We discuss our approaches to accommodating these mismatches, by either bending the batch and serving abstractions, or inventing new ones. Batch systems like Hadoop provide coarse task-level recovery, while serving systems like PNUTS provide finer record or transaction-level recovery. We combine both types to log record-level errors, while detecting and recovering from large-scale errors. Batch systems optimize for read and write throughput of large requests, while serving systems use indexing to provide low latency access to individual records. To improve latency-insensitive write throughput to PNUTS, we introduce a batch write path. The systems provide conflicting consistency models, and we discuss techniques to isolate them from one another.

【Keywords】: PNUTS; batch; bulk load; hadoop; hybrid; serving

96. Turbocharging DBMS buffer pool using SSDs.

Paper Link】 【Pages】:1113-1124

【Authors】: Jaeyoung Do ; Donghui Zhang ; Jignesh M. Patel ; David J. DeWitt ; Jeffrey F. Naughton ; Alan Halverson

【Abstract】: Flash solid-state drives (SSDs) are changing the I/O landscape, which has largely been dominated by traditional hard disk drives (HDDs) for the last 50 years. In this paper we propose and systematically explore designs for using an SSD to improve the performance of a DBMS buffer manager. We propose three alternatives that differ mainly in the way that they deal with the dirty pages evicted from the buffer pool. We implemented these alternatives, as well as another recently proposed algorithm for this task (TAC), in SQL Server, and ran experiments using a variety of benchmarks (TPC-C, TPC-E, and TPC-H) at multiple scale factors. Our empirical evaluation shows significant performance improvements of our methods over the default HDD configuration (up to 9.4X), and up to a 6.8X speedup over TAC.

【Keywords】: buffer pool; flash SSD

97. Online reorganization in read optimized MMDBS.

Paper Link】 【Pages】:1125-1136

【Authors】: Felix Beier ; Knut Stolze ; Kai-Uwe Sattler

【Abstract】: Query performance is a critical factor in modern business intelligence and data warehouse systems. An increasing number of companies use detailed analyses for conducting daily business and supporting management decisions. Thus, several techniques have been developed for achieving near-realtime response times - techniques which try to alleviate I/O bottlenecks while increasing the throughput of available processing units, i.e., by keeping relevant data in compressed main-memory data structures and exploiting the read-only characteristics of analytical workloads. However, update processing and skews in data distribution result in degradations of these densely packed and highly compressed data structures, negatively affecting memory efficiency and query performance. Reorganization tasks can repair these data structures, but -- since these are usually costly operations -- require a well-considered decision about which of several possible strategies should be executed and when, in order to reduce system downtimes. In this paper, we address these problems by presenting an approach for online reorganization in main-memory database systems (MMDBS). Based on a discussion of necessary reorganization strategies in IBM Smart Analytics Optimizer, a read-optimized parallel MMDBS, we introduce a framework for executing arbitrary reorganization tasks online, i.e. in the background of normal user workloads, without disrupting query results or performance.

【Keywords】: MMDBS; main memory database; online reorganization

98. Automated partitioning design in parallel database systems.

Paper Link】 【Pages】:1137-1148

【Authors】: Rimma V. Nehme ; Nicolas Bruno

【Abstract】: In recent years, Massively Parallel Processors (MPPs) have gained ground enabling vast amounts of data processing. In such environments, data is partitioned across multiple compute nodes, which results in dramatic performance improvements during parallel query execution. To evaluate certain relational operators in a query correctly, data sometimes needs to be re-partitioned (i.e., moved) across compute nodes. Since data movement operations are much more expensive than relational operations, it is crucial to design a suitable data partitioning strategy that minimizes the cost of such expensive data transfers. A good partitioning strategy strongly depends on how the parallel system would be used. In this paper we present a partitioning advisor that recommends the best partitioning design for an expected workload. Our tool recommends which tables should be replicated (i.e., copied into every compute node) and which ones should be distributed according to specific column(s) so that the cost of evaluating similar workloads is minimized. In contrast to previous work, our techniques are deeply integrated with the underlying parallel query optimizer, which results in more accurate recommendations in a shorter amount of time. Our experimental evaluation using a real MPP system, Microsoft SQL Server 2008 Parallel Data Warehouse, with both real and synthetic workloads shows the effectiveness of the proposed techniques and the importance of deep integration of the partitioning advisor with the underlying query optimizer.

【Keywords】: distributed databases

99. Oracle database filesystem.

Paper Link】 【Pages】:1149-1160

【Authors】: Krishna Kunchithapadam ; Wei Zhang ; Amit Ganesh ; Niloy Mukherjee

【Abstract】: Modern enterprise, web, and multimedia applications are generating unstructured content at unforeseen volumes in the form of documents, texts, and media files. Such content is generally associated with relational data such as user names, location tags, and timestamps. Storage of unstructured content in a relational database would guarantee the same robustness, transactional consistency, data integrity, data recoverability and other data management features consolidated across files and relational contents. Although database systems are preferred for relational data management, poor performance of unstructured data storage, limited data transformation functionalities, and lack of interfaces based on filesystem standards may keep more than eighty-five percent of non-relational unstructured content out of databases in the coming decades. We introduce Oracle Database Filesystem (DBFS) as a consolidated solution that unifies state-of-the-art network filesystem features with relational database management ones. DBFS is a novel shared-storage network filesystem developed in the RDBMS kernel that allows content management applications to transparently store and organize files using standard filesystem interfaces, in the same database that stores the associated relational content. The server component of DBFS is based on Oracle SecureFiles, a novel unstructured data storage engine within the RDBMS that provides filesystem-like or better storage performance for files within the database while fully leveraging relational data management features such as transaction atomicity, isolation, read consistency, temporality, and information lifecycle management. We present a preliminary performance evaluation of DBFS that demonstrates more than 10TB/hr throughput of filesystem read and write operations consistently over a period of 12 hours on an Oracle Exadata Database cluster of four server nodes. In terms of file storage, such extreme performance is equivalent to the ingestion of more than 2500 million 100KB document files in a single day. These initial results look very promising for DBFS becoming the universal storage solution for both relational and unstructured content.

【Keywords】: RDBMS; atomicity; database filesystem; performance and scalability; read consistency; securefiles; temporality; unstructured data

100. Emerging trends in the enterprise data analytics: connecting Hadoop and DB2 warehouse.

Paper Link】 【Pages】:1161-1164

【Authors】: Fatma Özcan ; David Hoa ; Kevin S. Beyer ; Andrey Balmin ; Chuan Jie Liu ; Yu Li

【Abstract】: Enterprises are dealing with ever increasing volumes of data, reaching into the petabyte scale. With many of our customer engagements, we are observing an emerging trend: They are using Hadoop-based solutions in conjunction with their data warehouses. They are using Hadoop to deal with the data volume, as well as the lack of strict structure in their data to conduct various analyses, including but not limited to Web log analysis, sophisticated data mining, machine learning and model building. This first stage of the analysis is off-line and suitable for Hadoop. But, once their data is summarized or cleansed enough, and their models are built, they are loading the results into a warehouse for interactive querying and report generation. At this later stage, they leverage the wealth of business intelligence tools, which they are accustomed to, that exist for warehouses. In this paper, we outline this use case and discuss the bidirectional connectors we developed between IBM DB2 and IBM InfoSphere BigInsights.

【Keywords】: DB2; Hadoop; federated database systems; map/reduce; system integration

101. Efficient processing of data warehousing queries in a split execution environment.

Paper Link】 【Pages】:1165-1176

【Authors】: Kamil Bajda-Pawlikowski ; Daniel J. Abadi ; Avi Silberschatz ; Erik Paulson

【Abstract】: Hadapt is a start-up company currently commercializing the Yale University research project called HadoopDB. The company focuses on building a platform for Big Data analytics in the cloud by introducing a storage layer optimized for structured data and by providing a framework for executing SQL queries efficiently. This work considers processing data warehousing queries over very large datasets. Our goal is to maximize performance while, at the same time, not giving up fault tolerance and scalability. We analyze the complexity of this problem in the split execution environment of HadoopDB. Here, incoming queries are examined; parts of the query are pushed down and executed inside the higher performing database layer; and the rest of the query is processed in a more generic MapReduce framework. In this paper, we discuss in detail performance-oriented query execution strategies for data warehouse queries in split execution environments, with particular focus on join and aggregation operations. The efficiency of our techniques is demonstrated by running experiments using the TPC-H benchmark with 3TB of data. In these experiments we compare our results with a standard commercial parallel database and an open-source MapReduce implementation featuring a SQL interface (Hive). We show that HadoopDB successfully competes with other systems.

【Keywords】: Hadoop; mapreduce; query execution

102. SQL server column store indexes.

Paper Link】 【Pages】:1177-1184

【Authors】: Per-Åke Larson ; Cipri Clinciu ; Eric N. Hanson ; Artem Oks ; Susan L. Price ; Srikumar Rangarajan ; Aleksandras Surna ; Qingqing Zhou

【Abstract】: The SQL Server 11 release (code named "Denali") introduces a new data warehouse query acceleration feature based on a new index type called a column store index. The new index type combined with new query operators processing batches of rows greatly improves data warehouse query performance: in some cases by hundreds of times and routinely a tenfold speedup for a broad range of decision support queries. Column store indexes are fully integrated with the rest of the system, including query processing and optimization. This paper gives an overview of the design and implementation of column store indexes including enhancements to query processing and query optimization to take full advantage of the new indexes. The resulting performance improvements are illustrated by a number of example queries.

【Keywords】: column store; columnar index; data warehousing; olap

103. An analytic data engine for visualization in tableau.

Paper Link】 【Pages】:1185-1194

【Authors】: Richard Michael Grantham Wesley ; Matthew Eldridge ; Pawel Terlecki

【Abstract】: Efficient data processing is critical for interactive visualization of analytic data sets. Inspired by the large amount of recent research on column-oriented stores, we have developed a new specialized analytic data engine tightly-coupled with the Tableau data visualization system. The Tableau Data Engine ships as an integral part of Tableau 6.0 and is intended for the desktop and server environments. This paper covers the main requirements of our project, system architecture and query-processing pipeline. We use real-life visualization scenarios to illustrate basic concepts and provide experimental evaluation.

【Keywords】: TDE; column store; data visualization; query optimization; tableau data engine

Tutorials 6

104. Learning statistical models from relational data.

Paper Link】 【Pages】:1195-1198

【Authors】: Lise Getoor ; Lilyana Mihalkova

【Abstract】: Statistical Relational Learning (SRL) is a subarea of machine learning which combines elements from statistical and probabilistic modeling with languages which support structured data representations. In this survey, we will: 1) provide an introduction to SRL, 2) describe some of the distinguishing characteristics of SRL systems, including relational feature construction and collective classification, 3) describe three SRL systems in detail, 4) discuss applications of SRL techniques to important data management problems such as entity resolution, selectivity estimation, and information integration, and 5) discuss connections between SRL methods and existing database research such as probabilistic databases.

【Keywords】: machine learning; probabilistic graphical models; statistical relational learning

105. Web data management.

Paper Link】 【Pages】:1199-1200

【Authors】: Michael J. Cafarella ; Alon Y. Halevy

【Abstract】: Web Data Management (or WDM) refers to a body of work concerned with leveraging the large collections of structured data that can be extracted from the Web. Over the past few years, several research and commercial efforts have explored these collections of data with the goal of improving Web search and developing mechanisms for surfacing different kinds of search answers. This work has leveraged (1) collections of structured data such as HTML tables, lists and forms, (2) recent ontologies and knowledge bases created by crowd-sourcing, such as Wikipedia and its derivatives, DBPedia, YAGO and Freebase, and (3) the collection of text documents from the Web, from which facts could be extracted in a domain-independent fashion. The promise of this line of work is based on the observation that new kinds of results can be obtained by leveraging a huge collection of independently created fragments of data, and typically in ways that are wholly unrelated to the authors' original intent. For example, we might use many database schemas to compute a schema thesaurus. Or we might examine many spreadsheets of scientific data that reveal the aggregate practice of an entire scientific field. As such, WDM is tightly linked to Web-enabled collaboration, even (or especially) if the collaborators are unwitting ones. We will cover the key techniques, principles and insights obtained so far in the area of Web Data Management.

【Keywords】: web data management information extraction

106. Privacy-aware data management in information networks.

Paper Link】 【Pages】:1201-1204

【Authors】: Michael Hay ; Kun Liu ; Gerome Miklau ; Jian Pei ; Evimaria Terzi

【Abstract】: The proliferation of information networks, as a means of sharing information, has raised privacy concerns for enterprises who manage such networks and for individual users that participate in such networks. For enterprises, the main challenge is to satisfy two competing goals: releasing network data for useful data analysis and also preserving the identities or sensitive relationships of the individuals participating in the network. Individual users, on the other hand, require personalized methods that increase their awareness of the visibility of their private information. This tutorial provides a systematic survey of the problems and state-of-the-art methods related to both enterprise and personalized privacy in information networks. The tutorial discusses privacy threats, privacy attacks, and privacy-preserving mechanisms tailored specifically to network data.

【Keywords】: anonymization; differential privacy; networks; privacy

107. Large-scale copy detection.

Paper Link】 【Pages】:1205-1208

【Authors】: Xin Luna Dong ; Divesh Srivastava

【Abstract】: The Web has enabled the availability of a vast amount of useful information in recent years. However, the web technologies that have enabled sources to share their information have also made it easy for sources to copy from each other and often publish without proper attribution. Understanding the copying relationships between sources has many benefits, including helping data providers protect their own rights, improving various aspects of data integration, and facilitating in-depth analysis of information flow. The importance of copy detection has led to a substantial amount of research in many disciplines of Computer Science, based on the type of information considered, such as text, images, videos, software code, and structured data. This tutorial explores the similarities and differences between the techniques proposed for copy detection across the different types of information. We also examine the computational challenges associated with large-scale copy detection, indicating how they could be detected efficiently, and identify a range of open problems for the community.

【Keywords】: copy detection; data integration

108. Data management over flash memory.

Paper Link】 【Pages】:1209-1212

【Authors】: Ioannis Koltsidas ; Stratis Viglas

【Abstract】: Flash SSDs are quickly becoming mainstream and emerge as alternatives to magnetic disks. It is therefore imperative to incorporate them seamlessly into the enterprise. We present the salient results of research in the area, touching all aspects of the data management stack: from the fundamentals of flash technology, through storage for database systems and the manipulation of SSD-resident data, to query processing.

【Keywords】: flash memory; solid state drives; storage

109. Datalog and emerging applications: an interactive tutorial.

Paper Link】 【Pages】:1213-1216

【Authors】: Shan Shan Huang ; Todd Jeffrey Green ; Boon Thau Loo

【Abstract】: We are witnessing an exciting revival of interest in recursive Datalog queries in a variety of emerging application domains such as data integration, information extraction, networking, program analysis, security, and cloud computing. This tutorial briefly reviews the Datalog language and recursive query processing and optimization techniques, then discusses applications of Datalog in three application domains: data integration, declarative networking, and program analysis. Throughout the tutorial, we use LogicBlox, a commercial Datalog engine for enterprise software systems, to allow the audience to walk through code examples presented in the tutorial.

【Keywords】: data exchange; data integration; datalog; declarative networking; program analysis; recursive query processing

Demonstrations on systems and performance 8

110. One-pass data mining algorithms in a DBMS with UDFs.

Paper Link】 【Pages】:1217-1220

【Authors】: Carlos Ordonez ; Sasi K. Pitchaimalai

【Abstract】: Data mining research is extensive, but most work has proposed efficient algorithms, data structures and optimizations that work outside a DBMS, mostly on flat files. In contrast, we present a data mining system that can work on top of a relational DBMS based on a combination of SQL queries and User-Defined Functions (UDFs), debunking the common perception that SQL is inefficient or inadequate for data mining. We show our system can analyze large data sets significantly faster than external data mining tools. Moreover, our UDF-based algorithms can process a data set in one pass and have linear scalability.
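
One-pass data mining typically rests on accumulating sufficient statistics; the sketch below shows the idea for per-group count/sum/sum-of-squares in plain Python as a stand-in for the SQL/UDF implementation described above (the UDF interface itself is not reproduced).

```python
# One-pass accumulation of sufficient statistics (n, sum, sum of squares) per
# group, from which mean and variance follow without a second scan. An
# aggregate UDF inside the DBMS would maintain this kind of state; the Python
# version is only an illustration.
def sufficient_stats(rows):
    stats = {}                                   # group -> (n, s, q)
    for group, x in rows:
        n, s, q = stats.get(group, (0, 0.0, 0.0))
        stats[group] = (n + 1, s + x, q + x * x)
    return stats

def summarize(stats):
    out = {}
    for g, (n, s, q) in stats.items():
        mean = s / n
        out[g] = {'n': n, 'mean': mean, 'var': q / n - mean * mean}
    return out

rows = [('a', 1.0), ('a', 3.0), ('b', 2.0), ('b', 2.0), ('b', 5.0)]
print(summarize(sufficient_stats(rows)))
```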

【Keywords】: DBMS; SQL; UDF; data mining

111. Inspector gadget: a framework for custom monitoring and debugging of distributed dataflows.

Paper Link】 【Pages】:1221-1224

【Authors】: Christopher Olston ; Benjamin Reed

【Abstract】: We demonstrate a novel dataflow introspection framework called Inspector Gadget, which makes it easy to create custom monitoring and debugging add-ons to an existing dataflow engine such as Pig. The framework is motivated by a series of informal user interviews, which revealed that dataflow monitoring and debugging needs are both pressing and diverse. Of the 14 monitoring/debugging behaviors requested by users, we were able to implement 12 in Inspector Gadget, in just a few hundred lines of (Java) code each.

【Keywords】: debugging; hadoop; instrumentation; query processing

112. RAFT at work: speeding-up mapreduce applications under task and node failures.

Paper Link】 【Pages】:1225-1228

【Authors】: Jorge-Arnulfo Quiané-Ruiz ; Christoph Pinkel ; Jörg Schad ; Jens Dittrich

【Abstract】: The MapReduce framework is typically deployed on very large computing clusters where task and node failures are no longer an exception but the rule. Thus, fault-tolerance is an important aspect for the efficient operation of MapReduce jobs. However, currently MapReduce implementations fully recompute failed tasks (subparts of a job) from the beginning. This can significantly decrease the runtime performance of MapReduce applications. We present an alternative system that implements RAFT ideas. RAFT is a family of powerful and inexpensive Recovery Algorithms for Fast-Tracking MapReduce jobs under task and node failures. To recover from task failures, RAFT exploits the intermediate results persisted by MapReduce at several points in time. RAFT piggybacks checkpoints on the task progress computation. To recover from node failures, RAFT maintains a per-map task list of all input key-value pairs producing intermediate results and pushes intermediate results to reducers. In this demo, we demonstrate that RAFT recovers efficiently from both task and node failures. Further, the audience can compare RAFT with Hadoop via an easy-to-use web interface.

【Keywords】: checkpointing; fault-tolerance; hadoop; mapreduce; node failures; recovery

113. WattDB: an energy-proportional cluster of wimpy nodes.

Paper Link】 【Pages】:1229-1232

【Authors】: Daniel Schall ; Volker Hudlet

【Abstract】: The constant growth of data in all businesses leads to bigger database servers. While peak load times require fast and heavyweight hardware to guarantee performance, idle times are a waste of energy and money. Today's DBMSs have the ability to cluster several servers for performance and fault tolerance. Nevertheless, they do not support dynamic powering of the cluster's nodes based on the current workload. In this demo, we propose a newly developed DBMS running on clustered commodity hardware, which is able to dynamically power nodes. The demo allows the user to interact with the DBMS and adjust workloads, while the cluster's reaction is shown in real-time.

【Keywords】: energy proportionality

114. BRRL: a recovery library for main-memory applications in the cloud.

Paper Link】 【Pages】:1233-1236

【Authors】: Tuan Cao ; Benjamin Sowell ; Marcos Antonio Vaz Salles ; Alan J. Demers ; Johannes Gehrke

【Abstract】: In this demonstration we present BRRL, a library for making distributed main-memory applications fault tolerant. BRRL is optimized for cloud applications with frequent points of consistency that use data-parallelism to avoid complex concurrency control mechanisms. BRRL differs from existing recovery libraries by providing a simple table abstraction and using schema information to optimize checkpointing. We will demonstrate the utility of BRRL using a distributed transaction processing system and a platform for scientific behavioral simulations.

【Keywords】: checkpointing; database recovery; durability; logical logging

115. A data-oriented transaction execution engine and supporting tools.

Paper Link】 【Pages】:1237-1240

【Authors】: Ippokratis Pandis ; Pinar Tözün ; Miguel Branco ; Dimitris Karampinas ; Danica Porobic ; Ryan Johnson ; Anastasia Ailamaki

【Abstract】: Conventional OLTP systems assign each transaction to a worker thread and that thread accesses data as the transaction dictates. This thread-to-transaction work assignment policy leads to unpredictable accesses. The unpredictability forces each thread to enter a large number of critical sections for the completion of even the simplest of transactions, leading to poor performance and scalability on modern manycore hardware. This demonstration highlights the chaotic access patterns of conventional OLTP designs, which are the source of scalability problems. Then, it presents a working prototype of a transaction processing engine that follows a non-conventional architecture, called data-oriented or DORA. DORA is designed around the thread-to-data work assignment policy. It distributes the transaction execution to multiple threads and offers predictable accesses. By design, DORA can decentralize the lock management service, and thereby eliminate the critical sections executed inside the lock manager. We explain the design of the system and show that it more efficiently utilizes the abundant processing power of modern hardware, always contrasting it against the conventional execution. In addition, we present different components of the system, such as a dynamic load balancer. Finally, we present a set of tools that enable the development of applications that use DORA.

【Keywords】: DORA; data-oriented transaction execution; partitioning; runtime re-balancing; transaction flow graph generation

116. iGraph in action: performance analysis of disk-based graph indexing techniques.

Paper Link】 【Pages】:1241-1242

【Authors】: Wook-Shin Han ; Minh-Duc Pham ; Jinsoo Lee ; Romans Kasperovics ; Jeffrey Xu Yu

【Abstract】: Graphs provide a powerful way to model complex structures such as chemical compounds, proteins, images, and program dependence. The previous practice for experiments in graph indexing techniques is that the author of a newly proposed technique does not implement existing indexes on his own code base, but instead uses the original authors' binary executables and reports only the wall clock time. However, we observed that this practice may result in several problems [6]. In order to address these problems, we have implemented all representative graph indexing techniques on a common framework called iGraph [6]. In this demonstration we showcase iGraph and its visual tools using several real datasets and their workloads. For selected queries of the workloads, we show several unique features including visual performance analysis.

【Keywords】: graph indexing techniques; performance analysis; visualization

117. StreamRec: a real-time recommender system.

Paper Link】 【Pages】:1243-1246

【Authors】: Badrish Chandramouli ; Justin J. Levandoski ; Ahmed Eldawy ; Mohamed F. Mokbel

【Abstract】:

【Keywords】: recommender systems

Demonstrations on ranking, the web, and social media 8

118. SkylineSearch: semantic ranking and result visualization for pubmed.

Paper Link】 【Pages】:1247-1250

【Authors】: Julia Stoyanovich ; Mayur Lodha ; William Mee ; Kenneth A. Ross

【Abstract】: Life sciences researchers perform scientific literature search as part of their daily activities. Many such searches are executed against PubMed, a central repository of life sciences articles, and often return hundreds, or even thousands, of results, pointing to the need for data exploration tools. In this demonstration we present SkylineSearch, a semantic ranking and result visualization system designed specifically for PubMed, and available to the scientific community at skyline.cs.columbia.edu. Our system leverages semantic annotations of articles with terms from the MeSH controlled vocabulary, and presents results as a two-dimensional skyline, plotting relevance against publication date. We demonstrate that SkylineSearch supports a richer data exploration experience than does the search functionality of PubMed, allowing users to find relevant references more easily. We also show that SkylineSearch executes queries and presents results in interactive time.
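
For a two-dimensional skyline (relevance versus publication date), the computation is a short sweep. The sketch below is generic and makes no claim about SkylineSearch's internals; the article tuples and scores are invented for illustration.

```python
# Generic 2D skyline sweep: sort articles by relevance (descending), then keep
# an article only if it is newer than every article already on the skyline.
# The result is the set of articles not dominated in both relevance and date.
def skyline(articles):
    # articles: list of (title, relevance, year); higher is better for both
    result, best_year = [], float('-inf')
    for title, rel, year in sorted(articles, key=lambda a: (-a[1], -a[2])):
        if year > best_year:
            result.append((title, rel, year))
            best_year = year
    return result

articles = [
    ('p53 review', 0.9, 2005),
    ('recent p53 study', 0.7, 2010),
    ('old p53 note', 0.5, 2001),
    ('survey', 0.8, 2011),
]
print(skyline(articles))
```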

【Keywords】: data exploration; skyline
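
The two-dimensional skyline described in the SkylineSearch abstract above can be illustrated with a short sort-and-scan sketch; the article records, field names, and scores below are hypothetical, not the system's API.

```python
# Minimal sketch of a two-dimensional skyline over (relevance, publication year),
# both "higher is better". Data and field names are hypothetical.
def skyline(articles):
    """Sort-and-scan skyline: keep articles not dominated in (relevance, year)."""
    ordered = sorted(articles, key=lambda a: (a["relevance"], a["year"]), reverse=True)
    result, best_year = [], None
    for a in ordered:
        # Everything scanned so far is at least as relevant as `a`, so `a`
        # survives only if it is strictly newer than all of it.
        if best_year is None or a["year"] > best_year:
            result.append(a)
            best_year = a["year"]
    return result

articles = [
    {"title": "A", "relevance": 0.9, "year": 2005},
    {"title": "B", "relevance": 0.7, "year": 2010},
    {"title": "C", "relevance": 0.6, "year": 2008},   # dominated by B
    {"title": "D", "relevance": 0.95, "year": 2001},
]
print([a["title"] for a in skyline(articles)])        # ['D', 'A', 'B']
```

The scan keeps the newest year seen among records that are at least as relevant, so a record survives exactly when no other record beats it on both axes, which is the dominance relation the system plots.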

119. A cross-service travel engine for trip planning.

Paper Link】 【Pages】:1251-1254

【Authors】: Gang Chen ; Chen Liu ; Meiyu Lu ; Beng Chin Ooi ; Shanshan Ying ; Anthony K. H. Tung ; Dongxiang Zhang ; Meihui Zhang

【Abstract】: Online travel services and resources are far from well organized and integrated. Trip planning is still a laborious job that requires interacting with a combination of services such as travel guides, personal travel blogs, map services and public transportation to piece together an itinerary. To facilitate this process, we have designed a cross-service travel engine for trip planners. Our system seamlessly and semantically integrates various types of travel services and resources based on a geographical ontology. We have also built a user-friendly visualization tool for travellers to conveniently browse and design personal itineraries on Google Maps.

【Keywords】: cross-service; data integration; geographical ontology; ranking

120. WINACS: construction and analysis of web-based computer science information networks.

Paper Link】 【Pages】:1255-1258

【Authors】: Tim Weninger ; Marina Danilevsky ; Fabio Fumarola ; Joshua M. Hailpern ; Jiawei Han ; Thomas J. Johnston ; Surya Kallumadi ; Hyungsul Kim ; Zhijin Li ; David McCloskey ; Yizhou Sun ; Nathan E. TeGrotenhuis ; Chi Wang ; Xiao Yu

【Abstract】: WINACS (Web-based Information Network Analysis for Computer Science) is a project that incorporates many recent, exciting developments in data sciences to construct a Web-based computer science information network and to discover, retrieve, rank, cluster, and analyze such an information network. With the rapid development of the Web, huge amounts of information are available in the form of Web documents, structures, and links. It has been a dream of the database and Web communities to harvest such information and reconcile the unstructured nature of the Web with the neat, semi-structured schemas of the database paradigm. Taking computer science as a dedicated domain, WINACS first discovers related Web entity structures, and then constructs a heterogeneous computer science information network in order to rank, cluster and analyze this network and support intelligent and analytical queries.

【Keywords】: WINACS; information networks; web mining

121. Tweets as data: demonstration of TweeQL and Twitinfo.

Paper Link】 【Pages】:1259-1262

【Authors】: Adam Marcus ; Michael S. Bernstein ; Osama Badar ; David R. Karger ; Samuel Madden ; Robert C. Miller

【Abstract】: Microblogs such as Twitter are a tremendous repository of user-generated content. Increasingly, we see tweets used as data sources for novel applications such as disaster mapping, brand sentiment analysis, and real-time visualizations. In each scenario, the workflow for processing tweets is ad-hoc, and a lot of unnecessary work goes into repeating common data processing patterns. We introduce TweeQL, a stream query processing language that presents a SQL-like query interface for unstructured tweets to generate structured data for downstream applications. We have built several tools on top of TweeQL, most notably TwitInfo, an event timeline generation and exploration interface that summarizes events as they are discussed on Twitter. Our demonstration will allow the audience to interact with both TweeQL and TwitInfo to convey the value of data embedded in tweets.

【Keywords】: stream processing; twitter; visualization
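
As a rough illustration of turning an unstructured tweet stream into structured rows the way a SQL-like query would, here is a sketch of a filter-and-project pipeline with a toy sentiment UDF. The query shown in the comment and the scoring function are stand-ins, not TweeQL's actual grammar or operators.

```python
# Illustrative sketch of the idea behind a SQL-like tweet query along the lines of
#   SELECT user, text, sentiment(text) FROM stream WHERE text LIKE '%earthquake%'
# This is not TweeQL's grammar or API; the sentiment scorer is a toy stand-in.
POSITIVE = {"good", "great", "safe"}
NEGATIVE = {"bad", "scary", "damage"}

def sentiment(text):
    """Toy lexicon-based sentiment score clamped to [-1, 1]."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return max(-1, min(1, score))

def query(stream, keyword):
    """Filter + project an unstructured tweet stream into structured rows."""
    for tweet in stream:
        if keyword in tweet["text"].lower():
            yield {"user": tweet["user"],
                   "text": tweet["text"],
                   "sentiment": sentiment(tweet["text"])}

stream = [
    {"user": "alice", "text": "Earthquake damage downtown, scary"},
    {"user": "bob",   "text": "Great coffee this morning"},
    {"user": "carol", "text": "Earthquake felt here but everyone is safe"},
]
for row in query(stream, "earthquake"):
    print(row)
```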

122. MOBIES: mobile-interface enhancement service for hidden web database.

Paper Link】 【Pages】:1263-1266

【Authors】: Xin Jin ; Aditya Mone ; Nan Zhang ; Gautam Das

【Abstract】: Many web databases are hidden behind form-based interfaces which are not always easy-to-use on mobile devices because of limitations such as small screen sizes, trickier text entry, etc. In this demonstration, we have developed MOBIES, a third-party system that generates mobile-user-friendly interfaces by exploiting data analytics specific to the hidden web databases. Our user studies show the effectiveness of MOBIES on improving user experience over a hidden web database.

【Keywords】: data analytics; hidden web database; mobile HCI

123. Search computing: multi-domain search on ranked data.

Paper Link】 【Pages】:1267-1270

【Authors】: Alessandro Bozzon ; Daniele Braga ; Marco Brambilla ; Stefano Ceri ; Francesco Corcoglioniti ; Piero Fraternali ; Salvatore Vadacca

【Abstract】: We demonstrate the Search Computing framework for multi-domain queries over ranked data collected from Web sources. Search Computing answers queries like "Find a good Jazz concert close to a specified location, a good restaurant and a hotel at walking distance" and fills the gap between generic and domain-specific search engines by proposing new methods, techniques, interfaces, and tools for building search-based applications that span multiple data services. The main enabling technology is an execution engine supporting methods for rank-join execution over ranked data sources, abstracted and wrapped by means of a unifying service model. The demo walks through the interface for formulating multi-domain queries and follows the steps of the query engine as it builds the result, with the help of run-time monitors that clearly explain the system's behavior. Once results are extracted, the demonstration shows several approaches for visualizing results and exploring the information space.

【Keywords】: exploratory search; multi-domain queries; search computing
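
Rank-join is the core operation mentioned above. The sketch below implements a simple HRJN-style variant under the assumptions that both services return results sorted by descending score and that the combined score is their sum; it is illustrative only, not the Search Computing execution engine.

```python
# Simplified HRJN-style rank-join sketch (not the Search Computing engine).
# Inputs are lists of (score, city) sorted by descending score; we join on the
# city and return the top-k pairs by summed score without reading either list fully.
import heapq

def rank_join(left, right, k):
    seen_l, seen_r = {}, {}            # city -> scores read so far from each side
    candidates, out = [], []           # max-heap (negated total) of formed joins
    i = j = 0
    while len(out) < k and (i < len(left) or j < len(right)):
        # Pull from whichever input is less consumed (or the only one left).
        if j >= len(right) or (i < len(left) and i <= j):
            score, city = left[i]; i += 1
            seen_l.setdefault(city, []).append(score)
            for s in seen_r.get(city, []):
                heapq.heappush(candidates, (-(score + s), city))
        else:
            score, city = right[j]; j += 1
            seen_r.setdefault(city, []).append(score)
            for s in seen_l.get(city, []):
                heapq.heappush(candidates, (-(s + score), city))
        # Upper bound on any join not formed yet: it must involve an unread tuple.
        top_l = left[0][0] if left else float("-inf")
        top_r = right[0][0] if right else float("-inf")
        next_l = left[i][0] if i < len(left) else float("-inf")
        next_r = right[j][0] if j < len(right) else float("-inf")
        threshold = max(next_l + top_r, top_l + next_r)
        while candidates and -candidates[0][0] >= threshold and len(out) < k:
            total, city = heapq.heappop(candidates)
            out.append((city, -total))
    while candidates and len(out) < k:  # drain leftovers if both inputs ran out
        total, city = heapq.heappop(candidates)
        out.append((city, -total))
    return out

hotels   = [(0.9, "athens"), (0.8, "rome"), (0.5, "paris")]
concerts = [(0.95, "rome"), (0.7, "athens"), (0.4, "paris")]
print(rank_join(hotels, concerts, 2))   # rome first, then athens, by combined score
```

The threshold bounds the best score any not-yet-formed join could reach, which is what allows results to be emitted before either input is fully consumed.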

124. EnBlogue: emergent topic detection in web 2.0 streams.

Paper Link】 【Pages】:1271-1274

【Authors】: Foteini Alvanaki ; Sebastian Michel ; Krithi Ramamritham ; Gerhard Weikum

【Abstract】: Emergent topics are newly arising themes in news, blogs, or tweets, often implied by interesting and unexpected correlations of tags or entities. We present the enBlogue system for emergent topic detection. The name enBlogue reflects the analogy with emerging trends in fashion, often referred to as en vogue. EnBlogue continuously monitors Web 2.0 streams and keeps track of sudden changes in tag correlations, and this tracking can be personalized to reflect particular user interests. We demonstrate enBlogue with several real-time monitoring scenarios as well as in time-lapse mode on archived data.

【Keywords】: emergent topics; web 2.0 streams
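
A toy version of the tag-correlation monitoring idea above: compute a Jaccard-style co-occurrence strength for each tag pair in two consecutive windows and report pairs whose strength jumps. The measure, the threshold, and the data are illustrative assumptions, not enBlogue's actual statistic.

```python
# Toy sketch of detecting a sudden shift in tag-pair correlation between two
# time windows; the Jaccard-based measure is illustrative only.
from collections import Counter
from itertools import combinations

def pair_strength(window):
    """Jaccard-like co-occurrence strength for each tag pair in a window of tag sets."""
    tag_count, pair_count = Counter(), Counter()
    for tags in window:
        tag_count.update(tags)
        pair_count.update(combinations(sorted(tags), 2))
    return {p: c / (tag_count[p[0]] + tag_count[p[1]] - c)
            for p, c in pair_count.items()}

def emergent_pairs(prev_window, curr_window, min_jump=0.3):
    """Return pairs whose co-occurrence strength jumped by at least min_jump."""
    before, after = pair_strength(prev_window), pair_strength(curr_window)
    return sorted(((p, after[p] - before.get(p, 0.0)) for p in after
                   if after[p] - before.get(p, 0.0) >= min_jump),
                  key=lambda x: -x[1])

prev = [{"music", "video"}, {"sports"}, {"music"}]
curr = [{"volcano", "iceland"}, {"volcano", "iceland", "flights"}, {"music"}]
print(emergent_pairs(prev, curr))   # the volcano/iceland pair surfaces first
```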

125. NOAM: news outlets analysis and monitoring system.

Paper Link】 【Pages】:1275-1278

【Authors】: Ilias N. Flaounas ; Omar Ali ; Marco Turchi ; Tristan Snowsill ; Florent Nicart ; Tijl De Bie ; Nello Cristianini

【Abstract】: We present NOAM, an integrated platform for the monitoring and analysis of news media content. NOAM is the data management system behind various applications and scientific studies aimed at modelling the mediasphere. The system is also intended to address the need in the AI community for platforms where various AI technologies are integrated and deployed in the real world. It combines a relational database (DB) with state-of-the-art AI technologies, including data mining, machine learning and natural language processing. These technologies are organised in a robust, distributed architecture of collaborating modules that populate and annotate the DB. NOAM manages tens of millions of news items in multiple languages, automatically annotating them to enable queries based on their semantic properties. The system also includes a unified user interface for interacting with its various modules.

【Keywords】: data mining; large scale; machine learning; media content analysis; news analysis; text analysis

Demonstrations on data integration and probabilistic databases 8

126. Pay-as-you-go mapping selection in dataspaces.

Paper Link】 【Pages】:1279-1282

【Authors】: Cornelia Hedeler ; Khalid Belhajjame ; Norman W. Paton ; Alvaro A. A. Fernandes ; Suzanne M. Embury ; Lu Mao ; Chenjuan Guo

【Abstract】: The vision of dataspaces proposes an alternative to classical data integration approaches with reduced up-front costs followed by incremental improvement on a pay-as-you-go basis. In this paper, we demonstrate DSToolkit, a system that allows users to provide feedback on results of queries posed over an integration schema. Such feedback is then used to annotate the mappings with their respective precision and recall. The system then allows a user to state the expected levels of precision (or recall) that the query results should exhibit and, in order to produce those results, the system selects those mappings that are predicted to meet the stated constraints.

【Keywords】: dataspaces; user feedback
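
A minimal sketch of the selection step described above, assuming each mapping has already been annotated with feedback-derived precision and recall estimates; the numbers and the simple filtering rule are illustrative, not DSToolkit's algorithm.

```python
# Toy sketch of selecting mappings annotated with feedback-derived precision and
# recall so that query results meet a user-stated precision constraint.
mappings = [
    {"name": "m1", "precision": 0.95, "recall": 0.40},
    {"name": "m2", "precision": 0.80, "recall": 0.70},
    {"name": "m3", "precision": 0.55, "recall": 0.90},
]

def select(mappings, min_precision):
    """Keep every mapping whose estimated precision meets the stated threshold."""
    chosen = [m for m in mappings if m["precision"] >= min_precision]
    # Report the recall the user can expect from the best retained mapping.
    expected_recall = max((m["recall"] for m in chosen), default=0.0)
    return chosen, expected_recall

chosen, recall = select(mappings, min_precision=0.75)
print([m["name"] for m in chosen], recall)   # ['m1', 'm2'] 0.7
```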

127. Exelixis: evolving ontology-based data integration system.

Paper Link】 【Pages】:1283-1286

【Authors】: Haridimos Kondylakis ; Dimitris Plexousakis

【Abstract】: The evolution of ontologies is an undisputed necessity in ontology-based data integration. Yet few research efforts have focused on the need to reflect ontology evolution onto the underlying data integration systems. We present Exelixis, a web platform that enables query answering over evolving ontologies without mapping redefinition. This is achieved by rewriting queries among ontology versions. First, changes between ontologies are automatically detected and described using a high-level language of changes. Those changes are interpreted as sound global-as-view (GAV) mappings. Then query expansion is applied to incorporate constraints from the ontology, followed by unfolding to apply the GAV mappings. Whenever equivalent rewritings cannot be produced, we (a) guide query redefinition and/or (b) provide the best "over-approximations", i.e., the minimally-containing and minimally-generalized rewritings. For the demonstration we will use four versions of the CIDOC-CRM ontology and real user queries to show the functionality of the system. We will then allow conference participants to interact directly with the system to test its capabilities.

【Keywords】: data integration; ontologies; query rewriting
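
The GAV-based rewriting can be illustrated with a very small unfolding sketch: each predicate of the new ontology version is defined over the old version, and query atoms are substituted accordingly. The predicate names and mappings are invented for illustration and do not come from CIDOC-CRM; a real rewriter would also handle variable renaming and the constraint-driven expansion step.

```python
# Minimal sketch of GAV-style query unfolding over two ontology versions.
# Mappings and predicates are invented; atoms share variable names for simplicity.

# GAV mappings: new-version predicate -> conjunction of old-version atoms.
gav = {
    "Artwork(x)":      ["ManMadeObject(x)", "hasType(x, 'art')"],
    "createdBy(x, y)": ["produced(y, x)"],      # the new version flipped the roles
}

def unfold(query_atoms):
    """Replace every atom that has a GAV definition by its old-version atoms."""
    rewritten = []
    for atom in query_atoms:
        rewritten.extend(gav.get(atom, [atom]))   # keep unmapped atoms as-is
    return rewritten

query = ["Artwork(x)", "createdBy(x, y)", "Person(y)"]
print(unfold(query))
# ['ManMadeObject(x)', "hasType(x, 'art')", 'produced(y, x)', 'Person(y)']
```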

128. U-MAP: a system for usage-based schema matching and mapping.

Paper Link】 【Pages】:1287-1290

【Authors】: Hazem Elmeleegy ; Jaewoo Lee ; El Kindi Rezig ; Mourad Ouzzani ; Ahmed K. Elmagarmid

【Abstract】: This demo shows how usage information buried in query logs can play a central role in data integration and data exchange. More specifically, our system U-Map uses query logs to generate correspondences between the attributes of two different schemas and the complex mapping rules to transform and restructure data records from one of these schemas to another. We introduce several novel features showing the benefit of incorporating query log analysis into these key components of data integration and data exchange systems.

【Keywords】: query logs; schema mapping; schema matching

129. The SystemT IDE: an integrated development environment for information extraction rules.

Paper Link】 【Pages】:1291-1294

【Authors】: Laura Chiticariu ; Vivian Chu ; Sajib Dasgupta ; Thilo W. Goetz ; Howard Ho ; Rajasekar Krishnamurthy ; Alexander Lang ; Yunyao Li ; Bin Liu ; Sriram Raghavan ; Frederick Reiss ; Shivakumar Vaithyanathan ; Huaiyu Zhu

【Abstract】: Information Extraction (IE), the problem of extracting structured information from unstructured text, has become the key enabler for many enterprise applications such as semantic search, business analytics and regulatory compliance. While rule-based IE systems are widely used in practice due to their well-known "explainability", developing high-quality information extraction rules is known to be a labor-intensive and time-consuming iterative process. Our demonstration showcases SystemT IDE, the integrated development environment for SystemT, a state-of-the-art rule-based IE system from IBM Research that has been successfully embedded in multiple IBM enterprise products. SystemT IDE facilitates the development, testing and analysis of high-quality IE rules by means of sophisticated techniques, ranging from data management to machine learning. We show how to build high-quality IE annotators using a suite of tools provided by SystemT IDE, including computing data provenance, learning basic features such as regular expressions and dictionaries, and automatically refining rules based on labeled examples.

【Keywords】: AQL; SystemT; information extraction; pattern discovery; provenance; rule learning

130. ProApproX: a lightweight approximation query processor over probabilistic trees.

Paper Link】 【Pages】:1295-1298

【Authors】: Pierre Senellart ; Asma Souihli

【Abstract】: We demonstrate a system for querying probabilistic XML documents with simple XPath queries. A user chooses between a variety of query answering techniques, both exact and approximate, and observes the running behavior, pros, and cons of each method in terms of efficiency, precision of the result, and the data model and query language supported.

【Keywords】: XML; approximations; probabilistic data; query processing

131. SPROUT²: a squared query engine for uncertain web data.

Paper Link】 【Pages】:1299-1302

【Authors】: Robert Fink ; Andrew Hogue ; Dan Olteanu ; Swaroop Rath

【Abstract】: SPROUT² is a query answering system that allows users to ask structured queries over tables embedded in Web pages, over Google Fusion tables, and over uncertain tables that can be extracted from answers to Google Squared. At the core of this service lies SPROUT, a query engine for probabilistic databases. This demonstration allows users to compose and ask ad-hoc queries of their choice and also to take a tour through the system's capabilities along pre-arranged scenarios on, e.g., movie actors and directors, biomass facilities, or leveraging corporate databases.

【Keywords】: probabilistic databases; web data management

132. Fuzzy prophet: parameter exploration in uncertain enterprise scenarios.

Paper Link】 【Pages】:1303-1306

【Authors】: Oliver Kennedy ; Steve Lee ; Charles Loboz ; Slawek Smyl ; Suman Nath

【Abstract】: We present Fuzzy Prophet, a probabilistic database tool for constructing, simulating and analyzing business scenarios with uncertain data. Fuzzy Prophet takes externally defined probability distributions (so-called VG-functions) and a declarative description of a target scenario, and performs Monte Carlo simulation to compute the probability distribution of the scenario's outcomes. In addition, Fuzzy Prophet supports parameter optimization, where probabilistic models are parameterized and a large parameter space must be explored to find parameter values that optimize or achieve a desired goal. Fuzzy Prophet's key innovation is the use of 'fingerprints' that identify parameter values producing correlated outputs of a user-provided stochastic function, allowing computations to be reused across such values. Fingerprints significantly expedite parameter exploration in offline optimization and interactive what-if analysis tasks.

【Keywords】: Monte Carlo; black box; probabilistic database; simulation
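
The reuse idea can be sketched as follows: a cheap fingerprint (here, the scenario's outputs on a few fixed probe seeds) identifies parameter values that behave identically, so the expensive Monte Carlo run happens only once per distinct fingerprint. The scenario function, the stand-in for a VG-function, and the fingerprint definition are all illustrative assumptions, not Fuzzy Prophet's implementation.

```python
# Toy sketch of Monte Carlo what-if exploration with result reuse across
# parameter values that produce identical behaviour. Illustrative only.
import random

def scenario(demand_cap, seed):
    """Hypothetical stochastic model: profit with demand capped at demand_cap."""
    rng = random.Random(seed)
    demand = min(rng.gauss(100, 20), demand_cap)
    return demand * 5 - 300                         # revenue minus fixed cost

def fingerprint(demand_cap, probe_seeds=range(5)):
    """Cheap signature: the scenario's outputs on a few fixed probe seeds."""
    return tuple(round(scenario(demand_cap, s), 6) for s in probe_seeds)

def expected_profit(demand_cap, trials=20000):
    """Full Monte Carlo estimate (the expensive step we want to avoid repeating)."""
    return sum(scenario(demand_cap, s) for s in range(trials)) / trials

cache, results = {}, {}
for cap in [60, 90, 300, 400, 500]:                 # parameter sweep
    fp = fingerprint(cap)
    if fp not in cache:                             # simulate only new behaviours
        cache[fp] = expected_profit(cap)
    results[cap] = cache[fp]

print(results)   # caps far above any plausible demand share one cached simulation
```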

133. LinkDB: a probabilistic linkage database system.

Paper Link】 【Pages】:1307-1310

【Authors】: Ekaterini Ioannou ; Wolfgang Nejdl ; Claudia Niederée ; Yannis Velegrakis

【Abstract】: Entity linkage deals with the problem of identifying whether two pieces of information represent the same real world object. The traditional methodology computes the similarity among the entities, and then merges those with similarity above some specific threshold. We demonstrate LinkDB, an original entity storage and querying system that deals with the entity linkage problem in a novel way. LinkDB is a probabilistic linkage database that uses existing linkage techniques to generate linkages among entities, but instead of performing the merges based on these linkages, it stores them alongside the data and performs only the required merges at run-time, by effectively taking into consideration the query specifications. We explain the technical challenges behind this kind of query answering, and we show how this new mechanism is able to provide answers that traditional entity linkage mechanisms cannot.

【Keywords】: data integration; entity linkage; entity resolution

Demonstrations on user support and development environments 8

134. CONFLuEnCE: CONtinuous workFLow ExeCution Engine.

Paper Link】 【Pages】:1311-1314

【Authors】: Panayiotis Neophytou ; Panos K. Chrysanthis ; Alexandros Labrinidis

【Abstract】: Traditional workflow enactment systems view a workflow as a one-time interaction with various data sources, executing a series of steps once, whenever the workflow results are requested. The fundamental underlying assumption has been that data sources are passive and all interactions are structured along the request/reply (query) model. Hence, traditional Workflow Management Systems cannot effectively support business or scientific reactive applications that require the processing of continuous data streams. In this demo, we will present our prototype which transforms workflow execution from the traditional step-wise workflow execution model to a continuous execution model, in order to handle data streams published and delivered asynchronously from multiple sources. We will demonstrate a supply chain management scenario which takes advantage of our continuous execution model to enable on-line interaction between different user roles as well as streaming data coming from various sources.

【Keywords】: continuous workflows; data streams; patterns; scientific workflows; supply chain management; workflow

135. Demonstration of Qurk: a query processor for human operators.

Paper Link】 【Pages】:1315-1318

【Authors】: Adam Marcus ; Eugene Wu ; David R. Karger ; Samuel Madden ; Robert C. Miller

【Abstract】: Crowdsourcing technologies such as Amazon's Mechanical Turk ("MTurk") service have exploded in popularity in recent years. These services are increasingly used for complex human-reliant data processing tasks, such as labelling a collection of images, combining two sets of images to identify people who appear in both, or extracting sentiment from a corpus of text snippets. There are several challenges in designing a workflow that filters, aggregates, sorts and joins human-generated data sources. Currently, crowdsourcing-based workflows are hand-built, resulting in increasingly complex programs. Additionally, developers must hand-optimize tradeoffs among monetary cost, accuracy, and time to completion of results. These challenges are well suited to a declarative query interface that allows developers to describe their workflow at a high level and automatically optimizes workflow and tuning parameters. In this demonstration, we will present Qurk, a novel query system that brings human-based processing to relational databases. The audience will interact with the system to build queries and monitor their progress. The audience will also see Qurk from an MTurk user's perspective, and complete several tasks to better understand how a query is processed.

【Keywords】: crowdsourcing; databases; human computing

136. Automatic example queries for ad hoc databases.

Paper Link】 【Pages】:1319-1322

【Authors】: Bill Howe ; Garrett Cole ; Nodira Khoussainova ; Leilani Battle

【Abstract】: Motivated by eScience applications, we explore automatic generation of example "starter" queries over unstructured collections of tables without relying on a schema, a query log, or prior input from users. Such example queries are demonstrably sufficient to have non-experts self-train and become productive using SQL, helping to increase the uptake of database technology among scientists. Our method is to learn a model for each relational operator based on example queries from public databases, then assemble queries syntactically operator-by-operator. For example, the likelihood that a pair of attributes will be used as a join condition in an example query depends on the cardinality of their intersection, among other features. Our demonstration illustrates that datasets with different statistical properties lead to different sets of example queries with different properties.

【Keywords】: dataspaces; query recommendation; scientific databases
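
The join-condition feature mentioned above, the intersection cardinality of two attributes, can be sketched directly; the tables and the containment-style score below are illustrative stand-ins for the learned per-operator model.

```python
# Toy sketch of scoring candidate join predicates between two schema-less tables
# by the overlap of their attribute value sets. A stand-in for the learned model.
def join_candidates(table_a, table_b):
    """Rank attribute pairs (a, b) by |values(a) & values(b)| / min(|a|, |b|)."""
    scored = []
    for col_a, vals_a in table_a.items():
        for col_b, vals_b in table_b.items():
            sa, sb = set(vals_a), set(vals_b)
            if not sa or not sb:
                continue
            overlap = len(sa & sb) / min(len(sa), len(sb))
            scored.append((overlap, col_a, col_b))
    return sorted(scored, reverse=True)

stations = {"station_id": [1, 2, 3, 4], "depth_m": [10, 40, 200, 950]}
samples  = {"sid": [1, 1, 2, 4], "chlorophyll": [0.2, 0.4, 0.1, 0.9]}
for score, a, b in join_candidates(stations, samples):
    print(f"{a} = {b}  score={score:.2f}")
# station_id = sid scores highest and would be suggested as the example join condition
```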

137. NetTrails: a declarative platform for maintaining and querying provenance in distributed systems.

Paper Link】 【Pages】:1323-1326

【Authors】: Wenchao Zhou ; Qiong Fei ; Shengzhi Sun ; Tao Tao ; Andreas Haeberlen ; Zachary G. Ives ; Boon Thau Loo ; Micah Sherr

【Abstract】: We demonstrate NetTrails, a declarative platform for maintaining and interactively querying network provenance in a distributed system. Network provenance describes the history and derivations of network state that result from the execution of a distributed protocol. It has broad applicability in the management, diagnosis, and security analysis of networks. Our demonstration shows the use of NetTrails for maintaining and querying network provenance in a variety of distributed settings, ranging from declarative networks to unmodified legacy distributed systems. We conclude our demonstration with a discussion of our ongoing research on enhancing the query language and security guarantees.

【Keywords】: declarative networking; provenance; visualization

138. GBLENDER: visual subgraph query formulation meets query processing.

Paper Link】 【Pages】:1327-1330

【Authors】: Changjiu Jin ; Sourav S. Bhowmick ; Xiaokui Xiao ; Byron Choi ; Shuigeng Zhou

【Abstract】: Due to the complexity of graph query languages, visual query interfaces that reduce the burden of query formulation are fundamental to spreading graph data management tools to a wider community. We present a novel HCI (human-computer interaction)-aware graph query processing paradigm: instead of processing a query graph only after its construction, it interleaves visual query construction and processing to improve system response time. We demonstrate a system called GBLENDER that exploits GUI latency to prune false results and prefetch candidate data graphs by employing a novel action-aware indexing scheme and a data structure called spindle-shaped graphs (SPIG). We demonstrate various innovative features of GBLENDER and its promising performance in evaluating subgraph containment and similarity queries.

【Keywords】: frequent subgraphs; graph databases; graph indexing; infrequent subgraphs; prefetching; visual query formulation

139. Coordination through querying in the youtopia system.

Paper Link】 【Pages】:1331-1334

【Authors】: Nitin Gupta ; Lucja Kot ; Gabriel Bender ; Sudip Roy ; Johannes Gehrke ; Christoph Koch

【Abstract】: In a previous paper, we laid out the vision of declarative data-driven coordination (D3C) where users are provided with novel abstractions that enable them to communicate and coordinate through declarative specifications [3]. In this demo, we will show Youtopia, a novel database system which is our first attempt at implementing this vision. Youtopia provides coordination abstractions within the DBMS. Users submit queries that come with explicit coordination constraints to be met by other queries in the system. Such queries are evaluated together; the system ensures that their joint execution results in the satisfaction of all coordination constraints. That is, the queries coordinate their answers in the manner specified by the users. We show how Youtopia and its abstractions simplify the implementation of a three-tier flight reservation application that allows users to coordinate travel arrangements with their friends.

【Keywords】: coordination; d3c; entangled queries

140. DBWiki: a structured wiki for curated data and collaborative data management.

Paper Link】 【Pages】:1335-1338

【Authors】: Peter Buneman ; James Cheney ; Sam Lindley ; Heiko Müller

【Abstract】: Wikis have proved enormously successful as a means to collaborate in the creation and publication of textual information. At the same time, a large number of curated databases have been developed through collaboration for the dissemination of structured data in specific domains, particularly bioinformatics. We demonstrate a general-purpose platform for collaborative data management, DBWiki, designed to achieve the best of both worlds. Our system not only facilitates the collaborative creation of a database; it also provides features not usually provided by database technology such as versioning, provenance tracking, citability, and annotation. In our demonstration we will show how DBWiki makes it easy to create, correct, discuss and query structured data, placing more power in the hands of users while managing tedious details of data curation automatically.

【Keywords】: curation; databases; wikis

141. Rapid development of web-based query interfaces for XML datasets with QURSED.

Paper Link】 【Pages】:1339-1342

【Authors】: Abhijith Kashyap ; Michalis Petropoulos

【Abstract】: We present QURSED, a system that automates the development of web-based query forms and reports (QFRs) for semi-structured XML data. Whereas many tools for the automated development of QFRs have been proposed for relational datasets, QURSED is, to the best of our knowledge, the first tool that facilitates the development of web-based QFRs over XML data. The QURSED system is available online at http://db.cse.buffalo.edu/qursed.

【Keywords】: XML; query interface; user interfaces