21st ACM International Conference on Information and Knowledge Management, CIKM'12, Maui, HI, USA, October 29 - November 02, 2012. ACM 【DBLP Link】
【Paper Link】 【Pages】:1-2
【Authors】: Ricardo A. Baeza-Yates ; Mounia Lalmas
【Abstract】: In the online world, user engagement refers to the quality of the user experience that emphasizes the positive aspects of the interaction with a web application and, in particular, the phenomena associated with wanting to use that application longer and frequently. This definition is motivated by the observation that successful web applications are not just used, but they are engaged with. Users invest time, attention, and emotion into them. Online providers aim not only to engage users with each service, but across all services in their network. They spend increasing effort to direct users to various services (e.g., using hyperlinks to help users navigate to and explore other services), to increase user traffic between their services. Nothing is known about how users engage across such a network of Web sites, something we call networked user engagement. We address this problem by combining techniques from web analytics and mining, information retrieval evaluation, and existing works on user engagement coming from the domains of information science, multimodal human computer interaction and cognitive psychology. In this way, we can combine insights from big data with deep analysis of human behavior in the lab or through crowd-sourcing experiments.
【Keywords】: metrics; network of services; user engagement; web sites
【Paper Link】 【Pages】:3
【Authors】: William W. Cohen
【Abstract】: We describe a novel learnable proximity measure based on personalized PageRank (also known as "random walk with reset"). Instead of introducing one weight per edge label, as in most prior work, we introduce one weight for each edge label sequence. We show that this approach is advantageous for a number of real-world tasks, including querying graph databases, recommendation tasks, and inference in large, noisy knowledge bases.
【Keywords】: machine learning; personalized pagerank
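As a rough illustration of the "random walk with reset" that the abstract above builds on, the following sketch runs plain personalized PageRank by power iteration. It uses a single reset probability and uniform edge weights; it does not implement the learned per-edge-label-sequence weights the abstract describes. The graph encoding (a dict mapping each node to a list of out-neighbours, all of which are also keys) and the names `personalized_pagerank`, `reset`, and `iterations` are assumptions made for this example.

```python
def personalized_pagerank(graph, seed, reset=0.15, iterations=50):
    """Power iteration for a random walk that resets to `seed`.

    `graph` maps every node to a list of out-neighbours; each neighbour
    must itself be a key of `graph` (an assumption of this sketch).
    """
    nodes = list(graph)
    scores = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            out = graph[n]
            if not out:
                # Dangling node: return its walking mass to the seed.
                nxt[seed] += (1 - reset) * scores[n]
                continue
            share = (1 - reset) * scores[n] / len(out)
            for m in out:
                nxt[m] += share
        # Total mass is 1, so the reset step contributes exactly `reset`.
        nxt[seed] += reset
        scores = nxt
    return scores


graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
scores = personalized_pagerank(graph, "a")
# Nodes closer to the seed "a" receive higher proximity scores.
```

The scores always sum to 1, so they can be read as a proximity distribution around the seed node.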
【Paper Link】 【Pages】:4-5
【Authors】: Jeffrey Scott Vitter
【Abstract】: We describe recent breakthroughs in the field of compressed data structures, in which the data structure is stored in a compressed representation that still allows fast answers to queries. We focus in particular on compressed data structures to support the important application of pattern matching on massive document collections. Given an arbitrary query pattern in textual form, the job of the data structure is to report all the locations where the pattern appears. Another variant is to report all the documents that contain at least one instance of the pattern. We are particularly interested in reporting only the most relevant documents, using a variety of notions of relevance. We discuss recently developed techniques that support fast search in these contexts as well as under additional positional and temporal constraints.
【Keywords】: compressed data structure; data compression; entropy; external memory; index; pattern matching; search
【Paper Link】 【Pages】:6-15
【Authors】: Dhruv Kumar Mahajan ; Rajeev Rastogi ; Charu Tiwari ; Adway Mitra
【Abstract】: The highly dynamic nature of online commenting environments makes accurate ratings prediction for new comments challenging. In such a setting, in addition to exploiting comments with high predicted ratings, it is also critical to explore comments with high uncertainty in the predictions. In this paper, we propose a novel upper confidence bound (UCB) algorithm called LOGUCB that balances exploration with exploitation when the average rating of a comment is modeled using logistic regression on its features. At the core of our LOGUCB algorithm lies a novel variance approximation technique for the Bayesian logistic regression model that is used to compute the UCB value for each comment. In experiments with a real-life comments dataset from Yahoo! News, we show that LOGUCB with bag-of-words and topic features outperforms state-of-the-art explore-exploit algorithms.
【Keywords】: comment ratings; explore-exploit; logistic regression; upper confidence bound
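The explore-exploit idea in the abstract above can be sketched with a generic upper confidence bound rule: rank each comment by its predicted mean rating plus a multiple of the uncertainty in that prediction. This is not the LOGUCB algorithm; in particular, the variance approximation for Bayesian logistic regression is not reproduced here, and the means and variances are simply taken as given inputs. The name `pick_by_ucb` and the parameter `alpha` are assumptions for this example.

```python
import math


def pick_by_ucb(estimates, alpha=1.0):
    """Return the comment id with the highest upper confidence bound.

    `estimates` maps a comment id to (predicted mean rating, variance of
    that prediction). `alpha` controls how strongly uncertainty is rewarded.
    """
    def ucb(item):
        mean, variance = estimates[item]
        return mean + alpha * math.sqrt(variance)

    return max(estimates, key=ucb)


estimates = {
    "old_comment": (0.70, 0.0001),  # well-observed, confident estimate
    "new_comment": (0.55, 0.09),    # few observations, high uncertainty
}
chosen = pick_by_ucb(estimates, alpha=1.0)
```

With `alpha=0` the rule degenerates to pure exploitation and selects the comment with the highest predicted mean.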
【Paper Link】 【Pages】:16-25
【Authors】: Ruirui Li ; Ben Kao ; Bin Bi ; Reynold Cheng ; Eric Lo
【Abstract】: Web search queries issued by casual users are often short and with limited expressiveness. Query recommendation is a popular technique employed by search engines to help users refine their queries. Traditional similarity-based methods, however, often result in redundant and monotonic recommendations. We identify five basic requirements of a query recommendation system. In particular, we focus on the requirements of redundancy-free and diversified recommendations. We propose the DQR framework, which mines a search log to achieve two goals: (1) It clusters search log queries to extract query concepts, based on which recommended queries are selected. (2) It employs a probabilistic model and a greedy heuristic algorithm to achieve recommendation diversification. Through a comprehensive user study we compare DQR against five other recommendation methods. Our experiment shows that DQR outperforms the other methods in terms of relevancy, diversity, and ranking performance of the recommendations.
【Keywords】: diversification; query concept; query recommendation
【Paper Link】 【Pages】:26-34
【Authors】: Ioannis Antonellis ; Anish Das Sarma ; Shaddin Dughmi
【Abstract】: In this paper, we identify a fundamental algorithmic problem that we term succinct dynamic covering (SDC), arising in many modern-day web applications, including ad-serving and online recommendation systems such as in eBay, Netflix, and Amazon. Roughly speaking, SDC applies two restrictions to the well-studied Max-Coverage problem [14]: Given an integer $k$, $X=\{1,2,\ldots,n\}$ and $I=\{S_1,\ldots,S_m\}$, $S_i \subseteq X$, find $J \subseteq I$ such that $|J| \le k$ and $\bigcup_{S \in J} S$ is as large as possible. The two restrictions applied by SDC are: (1) Dynamic: At query-time, we are given a query $Q \subseteq X$, and our goal is to find $J$ such that $Q \cap \bigcup_{S \in J} S$ is as large as possible; (2) Space-constrained: We don't have enough space to store (and process) the entire input; specifically, we have $o(mn)$, and maybe as little as $O((m+n)\,\mathrm{polylog}(mn))$ space. A solution to SDC maintains a small data structure, and uses this data structure to answer most dynamic queries with high accuracy. We call such a scheme a Coverage Oracle. We present algorithms and complexity results for coverage oracles. We present deterministic and probabilistic near-tight upper and lower bounds on the approximation ratio of SDC as a function of the amount of space available to the oracle. Our lower bound results show that to obtain constant-factor approximations we need $\Omega(mn)$ space. Fortunately, our upper bounds present an explicit tradeoff between space and approximation ratio, allowing us to determine the amount of space needed to guarantee certain accuracy.
【Keywords】: dynamic covering; max-coverage problem; recommendation systems
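To make the objective in the abstract above concrete, here is a minimal sketch of the classical greedy heuristic for Max-Coverage, extended with an optional query set so that only elements of the query count towards coverage. It keeps the whole input in memory, so it deliberately ignores the space constraint that defines SDC and is not a coverage oracle. The function name `greedy_cover`, the dict-of-sets input, and the "at most `k` sets" budget are assumptions for this example.

```python
def greedy_cover(sets, k, query=None):
    """Pick up to k named sets that greedily maximise coverage.

    `sets` maps a name to a Python set of elements. If `query` is given,
    only elements inside the query count towards coverage.
    """
    wanted = set(query) if query is not None else None
    covered = set()
    chosen = []
    for _ in range(k):
        best, best_gain = None, 0
        for name, members in sets.items():
            if name in chosen:
                continue
            relevant = members if wanted is None else members & wanted
            gain = len(relevant - covered)
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            break  # no remaining set adds new coverage
        chosen.append(best)
        picked = sets[best] if wanted is None else sets[best] & wanted
        covered |= picked
    return chosen, covered


sets = {"S1": {1, 2, 3}, "S2": {3, 4}, "S3": {4, 5, 6, 7}}
all_elements = greedy_cover(sets, k=2)             # static Max-Coverage
query_only = greedy_cover(sets, k=2, query={3, 4})  # dynamic variant
```

For the static problem, this greedy rule is the standard (1 - 1/e)-approximation; the abstract's contribution concerns what can be achieved when the sets cannot all be stored.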
【Paper Link】 【Pages】:35-44
【Abstract】: Reciprocal recommender systems refer to systems from which users can obtain recommendations of other individuals by satisfying preferences of both parties being involved. Different from the traditional user-item recommendation, reciprocal recommenders focus on the preferences of both parties simultaneously, as well as some special properties in terms of "reciprocal". In this paper, we propose MEET -- a generalized framework for reciprocal recommendation, in which we model the correlations of users as a bipartite graph that maintains both local and global "reciprocal" utilities. The local utility captures users' mutual preferences, whereas the global utility manages the overall quality of the entire reciprocal network. Extensive empirical evaluation on two real-world data sets (online dating and online recruiting) demonstrates the effectiveness of our proposed framework compared with existing recommendation algorithms. Our analysis also provides deep insights into the special aspects of reciprocal recommenders that differentiate them from user-item recommender systems.
【Keywords】: bipartite graph; global and local regularization; reciprocal recommender
【Paper Link】 【Pages】:45-54
【Authors】: Meng Jiang ; Peng Cui ; Rui Liu ; Qiang Yang ; Fei Wang ; Wenwu Zhu ; Shiqiang Yang
【Abstract】: Exponential growth of information generated by online social networks demands effective recommender systems to give useful results. Traditional techniques become unqualified because they ignore social relation data; existing social recommendation approaches consider social network structure, but social context has not been fully considered. It is significant and challenging to fuse social contextual factors which are derived from users' motivation of social behaviors into social recommendation. In this paper, we investigate social recommendation on the basis of psychology and sociology studies, which exhibit two important factors: individual preference and interpersonal influence. We first present the particular importance of these two factors in online item adoption and recommendation. Then we propose a novel probabilistic matrix factorization method to fuse them in latent spaces. We conduct experiments on both Facebook style bidirectional and Twitter style unidirectional social network datasets in China. The empirical results and analysis on these two large datasets demonstrate that our method significantly outperforms the existing approaches.
【Keywords】: individual preference; interpersonal influence; matrix factorization; social recommendation
【Paper Link】 【Pages】:55-64
【Authors】: Mengchi Liu ; Jun-Feng Qu
【Abstract】: High utility itemsets refer to the sets of items with high utility like profit in a database, and efficient mining of high utility itemsets plays a crucial role in many real-life applications and is an important research issue in data mining area. To identify high utility itemsets, most existing algorithms first generate candidate itemsets by overestimating their utilities, and subsequently compute the exact utilities of these candidates. These algorithms incur the problem that a very large number of candidates are generated, but most of the candidates are found out to be not high utility after their exact utilities are computed. In this paper, we propose an algorithm, called HUI-Miner (High Utility Itemset Miner), for high utility itemset mining. HUI-Miner uses a novel structure, called utility-list, to store both the utility information about an itemset and the heuristic information for pruning the search space of HUI-Miner. By avoiding the costly generation and utility computation of numerous candidate itemsets, HUI-Miner can efficiently mine high utility itemsets from the utility-lists constructed from a mined database. We compared HUI-Miner with the state-of-the-art algorithms on various databases, and experimental results show that HUI-Miner outperforms these algorithms in terms of both running time and memory consumption.
【Keywords】: high utility itemset; mining algorithm
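The notion of itemset utility in the abstract above can be illustrated with the definition commonly used in high utility itemset mining: the utility of an itemset is the sum, over all transactions containing it, of purchased quantity times unit profit for each of its items. This sketch computes exact utilities by scanning the database; it does not build the utility-list structure or apply the pruning that HUI-Miner relies on. The data layout and the name `itemset_utility` are assumptions for this example.

```python
def itemset_utility(itemset, transactions, profit):
    """Exact utility of `itemset` over a transaction database.

    Each transaction maps an item to its purchased quantity, and
    `profit` maps an item to its unit profit.
    """
    items = set(itemset)
    total = 0
    for transaction in transactions:
        # Only transactions containing every item of the itemset count.
        if all(item in transaction for item in items):
            total += sum(transaction[item] * profit[item] for item in items)
    return total


transactions = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"b": 1},
]
profit = {"a": 5, "b": 2, "c": 1}
utility_a = itemset_utility({"a"}, transactions, profit)
utility_ab = itemset_utility({"a", "b"}, transactions, profit)
```

In this toy database the superset {a, b} has lower utility than {a}, while in other databases a superset can have higher utility; this lack of monotonicity is why candidate-based algorithms fall back on overestimated utilities for pruning.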
【Paper Link】 【Pages】:65-74
【Authors】: Weishan Dong ; Wei Fan ; Lei Shi ; Changjin Zhou ; Xifeng Yan
【Abstract】: Traditional pattern mining methods usually work on single data sources. However, in practice, there are often multiple and heterogeneous information sources. They collectively provide contextual information not available in any single source alone describing the same set of objects, and are useful for discovering hidden contextual patterns. One important challenge is to provide a general methodology to mine contextual patterns easily and efficiently. In this paper, we propose a general framework to encode contextual information from multiple sources into a coherent representation---Contextual Information Graph (CIG). The complexity of the encoding scheme is linear in both time and space. More importantly, CIG can be handled by any single-source pattern mining algorithms that accept taxonomies without any modification. We demonstrate by three applications of the contextual association rule, sequence and graph mining, that contextual patterns providing rich and insightful knowledge can be easily discovered by the proposed framework. It enables Contextual Pattern Mining (CPM) by reusing single-source methods, and is easy to deploy and use in real-world systems.
【Keywords】: contextual pattern mining; heterogeneous sources
【Paper Link】 【Pages】:75-84
【Authors】: Linpeng Tang ; Lei Zhang ; Ping Luo ; Min Wang
【Abstract】: Mining interesting patterns from transaction databases has attracted a lot of research interest for more than a decade. Most of those studies use frequency, the number of times a pattern appears in a transaction database, as the key measure for pattern interestingness. In this paper, we introduce a new measure of pattern interestingness, occupancy. The measure of occupancy is motivated by some real-world pattern recommendation applications which require that any interesting pattern X should occupy a large portion of the transactions it appears in. Namely, for any supporting transaction t of pattern X, the number of items in X should be close to the total number of items in t. In these pattern recommendation applications, patterns with higher occupancy may lead to higher recall while patterns with higher frequency lead to higher precision. With the definition of occupancy we call a pattern dominant if its occupancy is above a user-specified threshold. Then, our task is to identify the qualified patterns which are both frequent and dominant. Additionally, we also formulate the problem of mining top-k qualified patterns: finding the qualified patterns with the top-k values of any function (e.g. weighted sum of both occupancy and support). The challenge to these tasks is that the monotone or anti-monotone property does not hold on occupancy. In other words, the value of occupancy does not increase or decrease monotonically when we add more items to a given itemset. Thus, we propose an algorithm called DOFIA (DOminant and Frequent Itemset mining Algorithm), which explores the upper bound properties on occupancy to reduce the search process. The tradeoff between bound tightness and computational complexity is also systematically addressed. Finally, we show the effectiveness of DOFIA in a real-world application on print-area recommendation for Web pages, and also demonstrate the efficiency of DOFIA on several large synthetic data sets.
【Keywords】: DOFIA; constraint-based mining; frequent and dominant pattern; frequent pattern mining
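The occupancy measure described in the abstract above can be sketched as follows: for each supporting transaction t of a pattern X, take the ratio of the number of items in X to the number of items in t, then aggregate over the supporting transactions. The abstract does not state the aggregation function, so the arithmetic mean used here is an assumption, as is the function name `support_and_occupancy`. The sketch only evaluates one pattern and does not implement DOFIA or its upper bounds.

```python
def support_and_occupancy(pattern, transactions):
    """Return (support, occupancy) of `pattern` in a list of item sets.

    Occupancy is averaged over supporting transactions; the use of the
    arithmetic mean is an assumption of this sketch.
    """
    pattern = frozenset(pattern)
    supporting = [set(t) for t in transactions if pattern <= set(t)]
    if not supporting:
        return 0.0, 0.0
    support = len(supporting) / len(transactions)
    occupancy = sum(len(pattern) / len(t) for t in supporting) / len(supporting)
    return support, occupancy


transactions = [{"a", "b"}, {"a", "b", "c", "d"}, {"c"}]
single = support_and_occupancy({"a"}, transactions)
pair = support_and_occupancy({"a", "b"}, transactions)
```

Here {a} and {a, b} have the same support, but {a, b} fills a larger share of the transactions it appears in, which is the property the abstract ties to higher recall in pattern recommendation.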
【Paper Link】 【Pages】:85-94
【Authors】: Matteo Riondato ; Justin A. DeBrabant ; Rodrigo Fonseca ; Eli Upfal
【Abstract】: Frequent Itemsets and Association Rules Mining (FIM) is a key task in knowledge discovery from data. As the dataset grows, the cost of solving this task is dominated by the component that depends on the number of transactions in the dataset. We address this issue by proposing PARMA, a parallel algorithm for the MapReduce framework, which scales well with the size of the dataset (as number of transactions) while minimizing data replication and communication cost. PARMA cuts down the dataset-size-dependent part of the cost by using a random sampling approach to FIM. Each machine mines a small random sample of the dataset, of size independent from the dataset size. The results from each machine are then filtered and aggregated to produce a single output collection. The output will be a very close approximation of the collection of Frequent Itemsets (FI's) or Association Rules (AR's) with their frequencies and confidence levels. The quality of the output is probabilistically guaranteed by our analysis to be within the user-specified accuracy and error probability parameters. The sizes of the random samples are independent from the size of the dataset, as is the number of samples. They depend on the user-chosen accuracy and error probability parameters and on the parallel computational model. We implemented PARMA in Hadoop MapReduce and show experimentally that it runs faster than previously introduced FIM algorithms for the same platform, while 1) scaling almost linearly, and 2) offering even higher accuracy and confidence than what is guaranteed by the analysis.
【Keywords】: MapReduce; association rules; frequent itemsets; sampling
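The sample-then-aggregate idea in the abstract above can be sketched on a single machine: draw several fixed-size random samples, mine each one independently, and keep only itemsets that are reported frequent by enough samples. This is a simplified illustration, not PARMA. The majority filter, the averaging of frequencies, the brute-force miner limited to itemsets of size at most 2, and all names and default parameters are assumptions for this example, and no probabilistic guarantee is claimed.

```python
import random
from itertools import combinations


def frequent_in_sample(sample, min_freq, max_size=2):
    """Brute-force mining of itemsets up to `max_size` in one sample."""
    counts = {}
    for transaction in sample:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(transaction), size):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    n = len(sample)
    return {i: c / n for i, c in counts.items() if c / n >= min_freq}


def sampled_fim(transactions, min_freq, num_samples=5, sample_size=50, seed=0):
    """Mine several random samples and aggregate their results."""
    rng = random.Random(seed)
    merged = {}
    for _ in range(num_samples):
        # Sample with replacement; the size does not depend on the dataset.
        sample = [rng.choice(transactions) for _ in range(sample_size)]
        for itemset, freq in frequent_in_sample(sample, min_freq).items():
            merged.setdefault(itemset, []).append(freq)
    # Keep itemsets found frequent in a majority of samples (an assumption).
    return {
        itemset: sum(freqs) / len(freqs)
        for itemset, freqs in merged.items()
        if len(freqs) > num_samples / 2
    }


transactions = [{"a", "b"}] * 10
result = sampled_fim(transactions, min_freq=0.5)
```

In a MapReduce setting each sample would be mined by a separate worker and the merge step would run in a reducer; the loop above stands in for both.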
【Paper Link】 【Pages】:95-104
【Authors】: Mansurul Bhuiyan ; Snehasis Mukhopadhyay ; Mohammad Al Hasan
【Abstract】: Mining frequent patterns from a hidden dataset is an important task with various real-life applications. In this research, we propose a solution to this problem that is based on Markov Chain Monte Carlo (MCMC) sampling of frequent patterns. Instead of returning all the frequent patterns, the proposed paradigm returns a small set of randomly selected patterns so that the clandestinity of the dataset can be maintained. Our solution also allows interactive sampling, so that the sampled patterns can fulfill the user's requirement effectively. We show experimental results from several real life datasets to validate the capability and usefulness of our solution; in particular, we show examples that by using our proposed solution, an eCommerce marketplace can allow pattern mining on user session data without disclosing the data to the public; such a mining paradigm helps the sellers of the marketplace, which eventually boosts the marketplace's own revenue.
【Keywords】: MCMC sampling; interactive pattern mining
【Paper Link】 【Pages】:105-114
【Authors】: Gabriella Kazai ; Nick Craswell ; Emine Yilmaz ; Seyed M. M. Tahaghoghi
【Abstract】: Test collections are powerful mechanisms for the evaluation and optimization of information retrieval systems. However, there is reported evidence that experiment outcomes can be affected by changes to the judging guidelines or changes in the judge population. This paper examines such effects in a web search setting, comparing the judgments of four groups of judges: NIST Web Track judges, untrained crowd workers and two groups of trained judges of a commercial search engine. Our goal is to identify systematic judging errors by comparing the labels contributed by the different groups, working under the same or different judging guidelines. In particular, we focus on detecting systematic differences in judging depending on specific characteristics of the queries and URLs. For example, we ask whether a given population of judges, working under a given set of judging guidelines, are more likely to consistently overrate Wikipedia pages than another group judging under the same instructions. Our approach is to identify judging errors with respect to a consensus set, a judged gold set and a set of user clicks. We further demonstrate how such biases can affect the training of retrieval systems.
【Keywords】: bias; noise; relevance
【Paper Link】 【Pages】:115-124
【Authors】: Katja Hofmann ; Fritz Behr ; Filip Radlinski
【Abstract】: Information retrieval evaluation most often involves manually assessing the relevance of particular query-document pairs. In cases where this is difficult (such as personalized search), interleaved comparison methods are becoming increasingly common. These methods compare pairs of ranking functions based on user clicks on search results, thus better reflecting true user preferences. However, by depending on clicks, there is a potential for bias. For example, users have been previously shown to be more likely to click on results with attractive titles and snippets. An interleaving evaluation where one ranker tends to generate results that attract more clicks (without being more relevant) may thus be biased. We present an approach for detecting and compensating for this type of bias in interleaving evaluations. Introducing a new model of caption bias, we propose features that model bias based on (1) per-document effects, and (2) the (pairwise) relationships between a document and surrounding documents. We show that our model can effectively capture click behavior, with best results achieved by a model that combines both per-document and pairwise features. Applying this model to re-weight observed user clicks, we find a small overall effect on real interleaving comparisons, but also identify a case where initially detected preferences vanish after caption bias re-weighting is applied. Our results indicate that our model of caption bias is effective and can successfully identify interleaving experiments affected by caption bias.
【Keywords】: evaluation; implicit feedback; interleaving
【Paper Link】 【Pages】:125-134
【Authors】: William Webber ; Praveen Chandar ; Ben Carterette
【Abstract】: Assessors are well known to disagree frequently on the relevance of documents to a topic, but the factors leading to assessor disagreement are still poorly understood. In this paper, we examine the relationship between the rank at which a document is returned by a set of retrieval systems and the likelihood that a second assessor will disagree with the relevance assessment of the initial assessor, and find that there is a strong and consistent correlation between the two. We adopt a metarank method of summarizing a document's rank across multiple runs, and propose a logistic regression predictive model of second assessor disagreement given metarank and initially-assessed relevance. The consistency of the model parameters across different topics, assessor pairs, and collections is considered. The model gives comparatively accurate predictions of absolute system scores, but less consistent predictions of relative scores than a simpler rank-insensitive model. We demonstrate that the logistic regression model is robust to using sampled, rather than exhaustive, dual assessment. We demonstrate the use of the sampled predictive model to incorporate assessor disagreement into tests of statistical significance.
【Keywords】: evaluation; retrieval experiment; sampling
【Paper Link】 【Pages】:135-144
【Authors】: Ben Carterette ; Evangelos Kanoulas ; Emine Yilmaz
【Abstract】: Click logs present a wealth of evidence about how users interact with a search system. This evidence has been used for many things: learning rankings, personalizing, evaluating effectiveness, and more. But it is almost always distilled into point estimates of feature or parameter values, ignoring what may be the most salient feature of users---their variability. No two users interact with a system in exactly the same way, and even a single user may interact with results for the same query differently depending on information need, mood, time of day, and a host of other factors. We present a Bayesian approach to using logs to compute posterior distributions for probabilistic models of user interactions. Since they are distributions rather than point estimates, they naturally capture variability in the population. We show how to cluster posterior distributions to discover patterns of user interactions in logs, and discuss how to use the clusters to evaluate search engines according to a user model. Because the approach is Bayesian, our methods can be applied to very large logs (such as those possessed by Web search engines) as well as very small (such as those found in almost any other setting).
【Keywords】: evaluation; test collections; user logs
【Paper Link】 【Pages】:145-154
【Authors】: Shahzad Rajput ; Matthew Ekstrand-Abueg ; Virgiliu Pavlu ; Javed A. Aslam
【Abstract】: The goal of a typical information retrieval system is to satisfy a user's information need---e.g., by providing an answer or information "nugget"---while the actual search space of a typical information retrieval system consists of documents---i.e., collections of nuggets. In this paper, we characterize this relationship between nuggets and documents and discuss applications to system evaluation. In particular, for the problem of test collection construction for IR system evaluation, we demonstrate a highly efficient algorithm for simultaneously obtaining both relevant documents and relevant information. Our technique exploits the mutually reinforcing relationship between relevant documents and relevant information, yielding document-based test collections whose efficiency and efficacy exceed those of typical Cranfield-style test collections, while also generating sets of highly relevant information.
【Keywords】: evaluation; information retrieval; nuggets; relevance assessment
【Paper Link】 【Pages】:155-164
【Authors】: Chenliang Li ; Aixin Sun ; Anwitaman Datta
【Abstract】: Event detection from tweets is an important task to understand the current events/topics attracting a large number of common users. However, the unique characteristics of tweets (e.g. short and noisy content, diverse and fast changing topics, and large data volume) make event detection a challenging task. Most existing techniques proposed for well written documents (e.g. news articles) cannot be directly adopted. In this paper, we propose a segment-based event detection system for tweets, called Twevent. Twevent first detects bursty tweet segments as event segments and then clusters the event segments into events considering both their frequency distribution and content similarity. More specifically, each tweet is split into non-overlapping segments (i.e. phrases that possibly refer to named entities or semantically meaningful information units). The bursty segments are identified within a fixed time window based on their frequency patterns, and each bursty segment is described by the set of tweets containing the segment published within that time window. The similarity between a pair of bursty segments is computed using their associated tweets. After clustering bursty segments into candidate events, Wikipedia is exploited to identify the realistic events and to derive the most newsworthy segments to describe the identified events. We evaluate Twevent and compare it with the state-of-the-art method using 4.3 million tweets published by Singapore-based users in June 2010. In our experiments, Twevent outperforms the state-of-the-art method by a large margin in terms of both precision and recall. More importantly, the events detected by Twevent can be easily interpreted with little background knowledge because of the newsworthy segments. We also show that Twevent is efficient and scalable, leading to a desirable solution for event detection from tweets.
【Keywords】: event detection; microblogging; tweet segmentation; twitter
【Paper Link】 【Pages】:165-174
【Authors】: Marco Pennacchiotti ; Fabrizio Silvestri ; Hossein Vahabi ; Rossano Venturini
【Abstract】: In this paper we introduce the task of "tweet recommendation", the problem of suggesting tweets that match a user's interests and likes. We propose an Information-Retrieval-like model that leverages the content of the user's tweets and those of her friends, and that effectively retrieves a set of tweets that is personalized and varied in nature. Our approach could be easily leveraged to build, for example, a Twitter or Facebook timeline that collects messages that are of interest for the user, but that are not posted by her friends. We compare to typical approaches used in similar tasks, reporting significant gains in terms of overall precision, up to about +20%, on both a corpus-based evaluation and real world user study.
【Keywords】: information filtering; twitter recommendation
【Paper Link】 【Pages】:175-184
【Authors】: Chen Lin ; Chun Lin ; Jingxuan Li ; Dingding Wang ; Yang Chen ; Tao Li
【Abstract】: Microblogging service has emerged to be a dominant web medium for billions of individuals sharing and spreading instant news and information, therefore monitoring the event evolution on microblog sphere is crucial for providing both better user experience and deeper understanding on real-time events. In this paper we explore the problem of generating storylines from microblogs for user input queries. This problem is challenging due to the sparse, dynamic and social nature of microblogs. Given a query of an ongoing event, we propose to sketch the real-time storyline of the event by a two-level solution. We first propose a language model with dynamic pseudo relevance feedback to obtain relevant tweets, and then generate storylines via graph optimization. Comprehensive experiments on Twitter data sets demonstrate the effectiveness of the proposed methods in each level and the overall framework.
【Keywords】: dynamic pseudo relevance feedback; language model; microblog; social media; storyline
【Paper Link】 【Pages】:185-194
【Authors】: Marijn Koolen ; Jaap Kamps ; Gabriella Kazai
【Abstract】: The Web and social media give us access to a wealth of information, not only different in quantity but also in character---traditional descriptions from professionals are now supplemented with user generated content. This challenges modern search systems based on the classical model of topical relevance and ad hoc search: How does their effectiveness transfer to the changing nature of information and to the changing types of information needs and search tasks? We use the INEX 2011 Books and Social Search Track's collection of book descriptions from Amazon and social cataloguing site LibraryThing. We compare classical IR with social book search in the context of the LibraryThing discussion forums where members ask for book suggestions. Specifically, we compare book suggestions on the forum with Mechanical Turk judgements on topical relevance and recommendation, both the judgements directly and their resulting evaluation of retrieval systems. First, the book suggestions on the forum are a complete enough set of relevance judgements for system evaluation. Second, topical relevance judgements result in a different system ranking from evaluation based on the forum suggestions. Although it is an important aspect for social book search, topical relevance is not sufficient for evaluation. Third, professional metadata alone is often not enough to determine the topical relevance of a book. User reviews provide a better signal for topical relevance. Fourth, user-generated content is more effective for social book search than professional metadata. Based on our findings, we propose an experimental evaluation that better reflects the complexities of social book search.
【Keywords】: book search; evaluation; user-generated content
【Paper Link】 【Pages】:195-204
【Authors】: Krishna Yeswanth Kamath ; James Caverlee
【Abstract】: In this paper, we propose and evaluate a novel content-driven crowd discovery algorithm that can efficiently identify newly-formed communities of users from the real-time web. Short-lived crowds reflect the real-time interests of their constituents and provide a foundation for user-focused web monitoring. Three of the salient features of the algorithm are its: (i) prefix-tree based locality-sensitive hashing approach for discovering crowds from high-volume rapidly-evolving social media; (ii) efficient user profile updating for incorporating new user activities and fading older ones; and (iii) key dimension identification, so that crowd detection can be focused on the most active portions of the real-time web. Through extensive experimental study, we find significantly more efficient crowd discovery as compared to both a k-means clustering-based approach and a MapReduce-based implementation, while maintaining high-quality crowds as compared to an offline approach. Additionally, we find that expert crowds tend to be "stickier" and last longer in comparison to crowds of typical users.
【Keywords】: clustering; community detection; real-time web; social media
【Paper Link】 【Pages】:205-214
【Authors】: Yuanyuan Zhu ; Jeffrey Xu Yu ; Hong Cheng ; Lu Qin
【Abstract】: A graph models complex structural relationships among objects, and has been prevalently used in a wide range of applications. Building an automated graph classification model becomes very important for predicting unknown graphs or understanding complex structures between different classes. The graph classification framework being widely used consists of two steps, namely, feature selection and classification. The key issue is how to select important subgraph features from a graph database with a large number of graphs including positive graphs and negative graphs. Given the features selected, a generic classification approach can be used to build a classification model. In this paper, we focus on feature selection. We identify two main issues with the most widely used feature selection approach, which is based on a discriminative score to select frequent subgraph features, and introduce a new diversified discriminative score to select features that have a higher diversity. We analyze the properties of the newly proposed diversified discriminative score, and conduct extensive performance studies to demonstrate that such a diversified discriminative score makes positive/negative graphs separable and leads to a higher classification accuracy.
【Keywords】: diversity; feature selection; graph classification
【Paper Link】 【Pages】:215-224
【Authors】: Donghyuk Shin ; Si Si ; Inderjit S. Dhillon
【Abstract】: The automated analysis of social networks has become an important problem due to the proliferation of social networks, such as LiveJournal, Flickr and Facebook. The scale of these social networks is massive and continues to grow rapidly. An important problem in social network analysis is proximity estimation that infers the closeness of different users. Link prediction, in turn, is an important application of proximity estimation. However, many methods for computing proximity measures have high computational complexity and are thus prohibitive for large-scale link prediction problems. One way to address this problem is to estimate proximity measures via low-rank approximation. However, a single low-rank approximation may not be sufficient to represent the behavior of the entire network. In this paper, we propose Multi-Scale Link Prediction (MSLP), a framework for link prediction, which can handle massive networks. The basic idea of MSLP is to construct low-rank approximations of the network at multiple scales in an efficient manner. To achieve this, we propose a fast tree-structured approximation algorithm. Based on this approach, MSLP combines predictions at multiple scales to make robust and accurate predictions. Experimental results on real-life datasets with more than a million nodes show the superior performance and scalability of our method.
【Keywords】: hierarchical clustering; link prediction; low rank approximation; social network analysis
【Paper Link】 【Pages】:225-234
【Authors】: Hoda Eldardiry ; Jennifer Neville
【Abstract】: We present a theoretical analysis framework that shows how ensembles of collective classifiers can improve predictions for graph data. We show how collective ensemble classification reduces errors due to variance in learning and more interestingly inference. We also present an empirical framework that includes various ensemble techniques for classifying relational data using collective inference. The methods span single- and multiple-graph network approaches, and are tested on both synthetic and real world classification tasks. Our experimental results, supported by our theoretical justifications, confirm that ensemble algorithms that explicitly focus on both learning and inference processes and aim at reducing errors associated with both, are the best performers.
【Keywords】: collective classification; ensemble learning
【Paper Link】 【Pages】:235-244
【Authors】: Nan Li ; Xifeng Yan ; Zhen Wen ; Arijit Khan
【Abstract】: Given a large real-world graph where vertices are associated with labels, how do we quickly find interesting vertex sets according to a given query? In this paper, we study label-based proximity search in large graphs, which finds the top-k query-covering vertex sets with the smallest diameters. Each set has to cover all the labels in a query. Existing greedy algorithms only return approximate answers, and do not scale well to large graphs. We propose a novel framework, called gDensity, which uses density index and likelihood ranking to find vertex sets in an efficient and accurate manner. Promising vertices are ordered and examined according to their likelihood to produce answers, and the likelihood calculation is greatly facilitated by density indexing. Techniques such as progressive search and partial indexing are further proposed. Experiments on real-world graphs show the efficiency and scalability of gDensity.
【Keywords】: graph mining; indexing; proximity search
【Paper Link】 【Pages】:245-254
【Authors】: Hanghang Tong ; B. Aditya Prakash ; Tina Eliassi-Rad ; Michalis Faloutsos ; Christos Faloutsos
【Abstract】: Controlling the dissemination of an entity (e.g., a meme or a virus) on a large graph is an interesting problem in many disciplines, including epidemiology, computer security, and marketing. So far, previous studies have mostly focused on removing or inoculating nodes to achieve the desired outcome. We shift the problem to the level of edges and ask: which edges should we add or delete in order to speed up or contain a dissemination? First, we propose effective and scalable algorithms to solve these dissemination problems. Second, we conduct a theoretical study of the two problems and our methods, including the hardness of the problem, the accuracy and complexity of our methods, and the equivalence between the different strategies and problems. Third, we conduct experiments on real topologies of varying sizes to demonstrate the effectiveness and scalability of our approaches.
【Keywords】: edge manipulation; graph mining; immunization; scalability
【Paper Link】 【Pages】:255-264
【Authors】: Zhen Hai ; Kuiyu Chang ; Gao Cong
【Abstract】: Feature-based opinion analysis has attracted extensive attention recently. Identifying features associated with opinions expressed in reviews is essential for fine-grained opinion mining. One approach is to exploit the dependency relations that occur naturally between features and opinion words, and among features (or opinion words) themselves. In this paper, we propose a generalized approach to opinion feature extraction by incorporating robust statistical association analysis in a bootstrapping framework. The new approach starts with a small set of feature seeds, on which it iteratively enlarges by mining feature-opinion, feature-feature, and opinion-opinion dependency relations. Two association model types, namely likelihood ratio tests (LRT) and latent semantic analysis (LSA), are proposed for computing the pair-wise associations between terms (features or opinions). We accordingly propose two robust bootstrapping approaches, LRTBOOT and LSABOOT, both of which need just a handful of initial feature seeds to bootstrap opinion feature extraction. We benchmarked LRTBOOT and LSABOOT against existing approaches on a large number of real-life reviews crawled from the cellphone and hotel domains. Experimental results using varying number of feature seeds show that the proposed association-based bootstrapping approach significantly outperforms the competitors. In fact, one seed feature is all that is needed for LRTBOOT to significantly outperform the other methods. This seed feature can simply be the domain feature, e.g., "cellphone" or "hotel". The consequence of our discovery is far reaching: starting with just one feature seed, typically just the domain concept word, LRTBOOT can automatically extract a large set of high-quality opinion features from the corpus without any supervision or labeled features. This means that the automatic creation of a set of domain features is no longer a pipe dream!
【Keywords】: aspect; association; bootstrapping; feature; opinion mining; seed; sentiment analysis
【Paper Link】 【Pages】:265-274
【Authors】: Zongyang Ma ; Aixin Sun ; Quan Yuan ; Gao Cong
【Abstract】: Readers of a news article often read its comments contributed by other readers. By reading comments, readers obtain not only complementary information about this news article but also the opinions from other readers. However, the existing ranking mechanisms for comments (e.g., by recency or by user rating) fail to offer an overall picture of topics discussed in comments. In this paper, we first propose to study Topic-driven Reader Comments Summarization (Torcs) problem. We observe that many news articles from a news stream are related to each other; so are their comments. Hence, news articles and their associated comments provide context information for user commenting. To implicitly capture the context information, we propose two topic models to address the Torcs problem, namely, Master-Slave Topic Model (MSTM) and Extended Master-Slave Topic Model (EXTM). Both models treat a news article as a master document and each of its comments as a slave document. MSTM model constrains that the topics discussed in comments have to be derived from the commenting news article. On the other hand, EXTM model allows generating words of comments using both the topics derived from the commenting news article, and the topics derived from all comments themselves. Both models are used to group comments into topic clusters. We then use two ranking mechanisms Maximal Marginal Relevance (MMR) and Rating & Length (RL) to select a few most representative comments from each comment cluster. To evaluate the two models, we conducted experiments on 1005 Yahoo! News articles with more than one million comments. Our experimental results show that EXTM significantly outperforms MSTM by perplexity. Through a user study, we also confirm that the comment summary generated by EXTM achieves better intra-cluster topic cohesion and inter-cluster topic diversity.
【Keywords】: comments summarization; master-slave document; topic model
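The abstract above names Maximal Marginal Relevance (MMR) as one of the two mechanisms for picking representative comments from a cluster. The following is a minimal sketch of generic greedy MMR selection, not the authors' implementation: the function names `mmr_select` and `jaccard`, the trade-off weight `lam`, the word-overlap similarity, and the sample relevance scores are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two comments (illustrative choice)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def mmr_select(
    candidates: Sequence[str],
    relevance: Sequence[float],
    similarity: Callable[[str, str], float],
    k: int,
    lam: float = 0.5,
) -> List[str]:
    """Greedily pick up to k items, trading relevance against redundancy.

    Each step picks the item maximizing
        lam * relevance - (1 - lam) * (max similarity to already picked items).
    """
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            redundancy = max(
                (similarity(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]


# Hypothetical comments with hypothetical relevance scores.
comments = [
    "great phone battery",
    "great phone battery life",
    "terrible hotel service",
]
scores = [0.9, 0.85, 0.5]
summary = mmr_select(comments, scores, jaccard, k=2, lam=0.5)
```

With these sample inputs the second pick skips the near-duplicate comment in favour of the less relevant but dissimilar one, which is the behaviour MMR is designed to produce.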
【Paper Link】 【Pages】:275-284
【Authors】: Rui Yan ; Xiaojun Wan ; Mirella Lapata ; Wayne Xin Zhao ; Pu-Jen Cheng ; Xiaoming Li
【Abstract】: We present a novel graph-based framework for timeline summarization, the task of creating different summaries for different timestamps but for the same topic. Our work extends timeline summarization to a multimodal setting and creates timelines that are both textual and visual. Our approach exploits the fact that news documents are often accompanied by pictures and the two share some common content. Our model optimizes local summary creation and global timeline generation jointly following an iterative approach based on mutual reinforcement and co-ranking. In our algorithm, individual summaries are generated by taking into account the mutual dependencies between sentences and images, and are iteratively refined by considering how they contribute to the global timeline and its coherence. Experiments on real-world datasets show that the timelines produced by our model outperform several competitive baselines both in terms of ROUGE and when assessed by human evaluators.
【Keywords】: evolutionary summarization; iterative reinforcement; text-to-image translation; visual timeline
【Paper Link】 【Pages】:285-294
【Authors】: Xu Sun ; Anshumali Shrivastava ; Ping Li
【Abstract】: In this paper, we explore the use of a novel online multi-task learning framework for the task of search query spelling correction. In our procedure, correction candidates are initially generated by a ranker-based system and then re-ranked by our multi-task learning algorithm. With the proposed multi-task learning method, we are able to effectively transfer information from different and highly biased training datasets, for improving spelling correction on all datasets. Our experiments are conducted on three query spelling correction datasets including the well-known TREC benchmark dataset. The experimental results demonstrate that our proposed method considerably outperforms the existing baseline systems in terms of accuracy. Importantly, the proposed method is about one order of magnitude faster than baseline systems in terms of training speed. Compared to the commonly used online learning methods, which typically require many training passes (e.g., more than 60), our proposed method is able to closely reach the empirical optimum in about 5 passes.
【Keywords】: multi-task learning; query spelling correction
【Paper Link】 【Pages】:295-304
【Authors】: Yu Hong ; Xiaopei Zhou ; Tingting Che ; Jian-Min Yao ; Qiaoming Zhu ; Guodong Zhou
【Abstract】: Motivated by the critical importance of connectives in recognizing discourse relations, we present an unsupervised cross-argument inference mechanism for implicit discourse relation recognition. The basic idea is to infer the implicit discourse relation of an argument pair from a large number of comparable argument pairs, which are automatically retrieved from the web in an unsupervised way. In this way, the inference proceeds from explicit relations to implicit ones, with connectives serving as a bridge. This kind of pair-to-pair inference is based on the assumption that two argument pairs with high content similarity (i.e., comparable argument pairs) should have similar discourse relationships. Evaluation on PDTB shows the effectiveness of our inference mechanism in implicit relation recognition for the four level-1 relations. It also shows that our mechanism significantly outperforms other alternatives.
【Keywords】: implicit discourse relation; pair-to-pair inference
【Paper Link】 【Pages】:305-314
【Authors】: Jeffrey Pound ; Alexander K. Hudek ; Ihab F. Ilyas ; Grant E. Weddell
【Abstract】: Many keyword queries issued to Web search engines target information about real world entities, and interpreting these queries over Web knowledge bases can often enable the search system to provide exact answers to queries. Equally important is the problem of detecting when the reference knowledge base is not capable of answering the keyword query, due to lack of domain coverage. In this work we present an approach to computing structured representations of keyword queries over a reference knowledge base. We mine frequent query structures from a Web query log and map these structures into a reference knowledge base. Our approach exploits coarse linguistic structure in keyword queries, and combines it with rich structured query representations of information needs.
【Keywords】: knowledge bases; query interpretation; query understanding; semantic query understanding
【Paper Link】 【Pages】:315-324
【Authors】: Zhihong Chong ; He Chen ; Zhenjie Zhang ; Hu Shu ; Guilin Qi ; Aoying Zhou
【Abstract】: In the last few years, RDF has become the dominant data model used in the Semantic Web for knowledge representation and inference. In this paper, we revisit the problem of pattern matching queries in the RDF model, which are usually expensive because of the high cost of join operations. To alleviate this cost, view materialization techniques are usually deployed to accelerate query processing. However, given an arbitrary view, it remains difficult to identify how to reuse the view for a particular query, because of the NP-hardness of matching patterns against views. To fully exploit the benefit of materialized views, we propose a new paradigm to enhance their effectiveness. Instead of choosing materialized views in arbitrary form, our paradigm selects views only if they are sortable. The property of sortability yields large gains in pattern-view matching, bringing down the cost to linear complexity in terms of the pattern size. On the other hand, the costs of identifying sortable views and searching over the views using an inverted index are affordable. Moreover, sortable views generally improve the overall performance of pattern matching, by means of a cost model used to optimize the query rewriting on the most appropriate views. Finally, we present extensive experimental results to verify the superiority of our proposal in both efficiency and effectiveness.
【Keywords】: query processing; rdf indexing; rdf query
【Paper Link】 【Pages】:325-334
【Authors】: Wenqing Lin ; Xiaokui Xiao ; James Cheng ; Sourav S. Bhowmick
【Abstract】: We study a new type of graph queries, which injectively maps its edges to paths of the graphs in a given database, where the length of each path is constrained by a given threshold specified by the weight of the corresponding matching edge. We give important applications of the new graph query and identify new challenges of processing such a query. Then, we devise the cost model of the branch-and-bound algorithm framework for processing the graph query, and propose an efficient algorithm to minimize the cost overhead. We also develop three indexing techniques to efficiently answer the queries online. Finally, we verify the efficiency of our proposed indexes with extensive experiments on large real and synthetic datasets.
【Keywords】: graph databases; graph indexing; graph matching algorithm; graph querying
【Paper Link】 【Pages】:335-344
【Authors】: Sherif Sakr ; Sameh Elnikety ; Yuxiong He
【Abstract】: We propose a SPARQL-like language, G-SPARQL, for querying attributed graphs. The language expresses types of queries that are of great interest for applications which model their data as large graphs, such as pattern matching, reachability, and shortest path queries. Each query can combine both structural predicates and value-based predicates (on the attributes of the graph nodes and edges). We describe an algebraic compilation mechanism for our proposed query language which is extended from the relational algebra and based on the basic construct of building SPARQL queries, the Triple Pattern. We describe a hybrid Memory/Disk representation of large attributed graphs where only the topology of the graph is maintained in memory while the data of the graph is stored in a relational database. The execution engine of our proposed query language splits the query plan so that some parts are pushed inside the relational database while other parts are processed using memory-based algorithms, as necessary. Experimental results on real datasets demonstrate the efficiency and the scalability of our approach and show that our approach outperforms native graph databases by several factors.
【Keywords】: graphs; query; sparql
【Paper Link】 【Pages】:345-354
【Authors】: Wei Shen ; Jianyong Wang ; Ping Luo ; Min Wang
【Abstract】: Automatically populating ontology with named entities extracted from the unstructured text has become a key issue for Semantic Web and knowledge management techniques. This issue naturally consists of two subtasks: (1) for the entity mention whose mapping entity does not exist in the ontology, attach it to the right category in the ontology (i.e., fine-grained named entity classification), and (2) for the entity mention whose mapping entity is contained in the ontology, link it with its mapping real world entity in the ontology (i.e., entity linking). Previous studies only focus on one of the two subtasks and cannot solve this task of populating ontology with named entities integrally. This paper proposes APOLLO, a grAph-based aPproach for pOpuLating ontoLOgy with named entities. APOLLO leverages the rich semantic knowledge embedded in the Wikipedia to resolve this task via random walks on graphs. Meanwhile, APOLLO can be directly applied to either of the two subtasks with minimal revision. We have conducted a thorough experimental study to evaluate the performance of APOLLO. The experimental results show that APOLLO achieves significant accuracy improvement for the task of ontology population with named entities, and outperforms the baseline methods for both subtasks.
【Keywords】: entity linking; label propagation; named entity classification; ontology population
【Paper Link】 【Pages】:355-364
【Authors】: Mijung Kim ; K. Selçuk Candan
【Abstract】: For many multi-dimensional data applications, tensor operations as well as relational operations need to be supported throughout the data lifecycle. Although tensor decomposition is shown to be effective for multi-dimensional data analysis, the cost of tensor decomposition is often very high. We propose a novel decomposition-by-normalization scheme that first normalizes the given relation into smaller tensors based on the functional dependencies of the relation and then performs the decomposition using these smaller tensors. The decomposition and recombination steps of the decomposition-by-normalization scheme fit naturally in settings with multiple cores. This leads to a highly efficient, effective, and parallelized decomposition-by-normalization algorithm for both dense and sparse tensors. Experiments confirm the efficiency and effectiveness of the proposed decomposition-by-normalization scheme compared to the conventional nonnegative CP decomposition approach.
【Keywords】: tensor decomposition; tensor-based relational data model
【Paper Link】 【Pages】:365-374
【Authors】: Yifan Jin ; Reynold Cheng ; Ben Kao ; Kam-yiu Lam ; Yinuo Zhang
【Abstract】: In typical location-based services (LBS), moving objects (e.g., GPS-enabled mobile phones) report their locations through a wireless network. An LBS server can use the location information to answer various types of continuous queries. Due to hardware limitations, location data reported by the moving objects are often uncertain. In this paper, we study efficient methods for the execution of Continuous Possible Nearest Neighbor Query (CPoNNQ) that accesses imprecise location data. A CPoNNQ is a standing query (which is active during a period of time) such that, at any time point, all moving objects that have non-zero probabilities of being the nearest neighbor of a given query point are reported. To handle the continuous nature of a CPoNNQ, a simple solution is to require moving objects to continuously report their locations to the LBS server, which evaluates the query at every time step. To save communication bandwidth and mobile devices' batteries, we develop two filter-based protocols for CPoNNQ evaluation. Our protocols install "filter bounds" on moving objects, which suppress unnecessary location reporting and communication between the server and the moving objects. Through extensive experiments, we show that our protocols can effectively reduce communication costs while maintaining a high query quality.
【Keywords】: communication cost; continuous queries; uncertain database
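The query semantics described above (report every object with a non-zero probability of being the nearest neighbor) can be illustrated at a single time point. The sketch below is not the paper's filter-based protocol. It assumes each object's uncertainty region is a disc given as `(x, y, radius)`, and it uses the standard distance-bound test: an object stays a candidate if its minimum possible distance to the query does not exceed the smallest maximum possible distance over all objects. The function name and the sample data are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def possible_nearest_neighbors(
    query: Tuple[float, float],
    objects: Dict[str, Tuple[float, float, float]],
) -> List[str]:
    """Return ids of objects that could be the nearest neighbor of `query`.

    `objects` maps an id to (x, y, radius), an assumed circular uncertainty
    region. Ties are kept, so the result is a conservative superset.
    """
    bounds: Dict[str, Tuple[float, float]] = {}
    for oid, (x, y, r) in objects.items():
        d = math.hypot(x - query[0], y - query[1])
        # Closest and farthest the object can be from the query point.
        bounds[oid] = (max(0.0, d - r), d + r)
    if not bounds:
        return []
    # Some object is certainly within this distance of the query.
    cutoff = min(hi for _, hi in bounds.values())
    return sorted(oid for oid, (lo, _) in bounds.items() if lo <= cutoff)


# Hypothetical snapshot of three moving objects.
snapshot = {
    "a": (1.0, 0.0, 0.5),
    "b": (2.0, 0.0, 1.0),
    "c": (5.0, 0.0, 0.5),
}
candidates = possible_nearest_neighbors((0.0, 0.0), snapshot)
```

Re-running this test at every time step corresponds to the simple solution the abstract mentions, which requires continuous location reporting; the filter bounds in the paper are meant to avoid that communication cost.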
【Paper Link】 【Pages】:375-384
【Authors】: Da Yan ; Zhou Zhao ; Wilfred Ng
【Abstract】: RFID (radio frequency identification) technology has been widely used for object tracking in many real-life applications, such as inventory monitoring and product flow tracking. These applications usually rely on passive RFID technologies rather than active ones, since passive RFID tags are more attractive than active ones in many aspects, such as lower tag cost and simpler maintenance. RFID technology is also important for indoor location tracking systems that require high degree of accuracy. However, most existing systems estimate object locations by using active RFID tags, which usually incur localization error of more than one meter. Although recent studies begin to investigate the application of passive tags for indoor location tracking, these methods are far from deployable and research of this application is still in its infancy. In this paper, we propose a new indoor location tracking system, named PassTrack, which relies on the read rates of passive RFID tags for location estimation. PassTrack is designed to tolerate noise arising from external environmental factors, by probabilistically modeling the relationship between tag read rate and tag-reader distance, and updating the model parameters based on the current readings of reference tags. Besides tolerance of noise, PassTrack is also outstanding in terms of localization accuracy and efficiency. Several new approaches for location inference are supported by PassTrack, and the best one incurs an average error of around 30 cm, and is able to carry out over 7500 location estimations per second on an ordinary machine. Furthermore, as a result of using passive RFID tags, PassTrack also enjoys the many other benefits of passive RFID tags mentioned before. We have conducted extensive experiments on both real and synthetic datasets, which demonstrate that our PassTrack system outperforms the previous localization approaches in localization accuracy, tracking efficiency and space applicability.
【Keywords】: localization; passive; read rate; rfid; tag
【Paper Link】 【Pages】:385-394
【Authors】: Ruicheng Zhong ; Ju Fan ; Guoliang Li ; Kian-Lee Tan ; Lizhu Zhou
【Abstract】: Location-Based Services (LBS) have been widely accepted by mobile users recently. Existing LBS-based systems require users to type in complete keywords. However for mobile users it is rather difficult to type in complete keywords on mobile devices. To alleviate this problem, in this paper we study the location-aware instant search problem, which returns users location-aware answers as users type in queries letter by letter. The main challenge is to achieve high interactive speed. To address this challenge, in this paper we propose a novel index structure, prefix-region tree (called PR-Tree), to efficiently support location-aware instant search. PR-Tree is a tree-based index structure which seamlessly integrates the textual description and spatial information to index the spatial data. Using the PR-Tree, we develop efficient algorithms to support single prefix queries and multi-keyword queries. Experiments show that our method achieves high performance and significantly outperforms state-of-the-art methods.
【Keywords】: keywords search; spatial databases; type-ahead search
【Paper Link】 【Pages】:395-404
【Authors】: Tobias Emrich ; Hans-Peter Kriegel ; Nikos Mamoulis ; Matthias Renz ; Andreas Züfle
【Abstract】: The advances in sensing and telecommunication technologies allow the collection and management of vast amounts of spatio-temporal data combining location and time information. Due to physical and resource limitations of data collection devices (e.g., RFID readers, GPS receivers and other sensors) data are typically collected only at discrete points of time. In-between these discrete time instances, the positions of tracked moving objects are uncertain. In this work, we propose novel approximation techniques in order to probabilistically bound the uncertain movement of objects; these techniques allow for efficient and effective filtering during query evaluation using a hierarchical index structure. To the best of our knowledge, this is the first approach that supports query evaluation on very large uncertain spatio-temporal databases, adhering to possible worlds semantics. We experimentally show that it accelerates the existing, scan-based approach by orders of magnitude.
【Keywords】: indexing; uncertain spatio-temporal data; uncertain trajectory
【Paper Link】 【Pages】:405-414
【Authors】: Hao Huang ; Hong Qin ; Shinjae Yoo ; Dantong Yu
【Abstract】: Current popular anomaly detection algorithms are capable of detecting global anomalies but oftentimes fail to distinguish local anomalies from normal instances. This paper aims to improve unsupervised anomaly detection via the exploration of physics-based diffusion space. Building upon the embedding manifold derived from diffusion maps, we devise Local Anomaly Descriptor (LAD) whose originality results from faithfully preserving intrinsic and informative density-relevant neighborhood information. This robust and effective algorithm is designed with a weighted umbrella Laplacian operator to bridge global and local properties. To further enhance the efficacy of our proposed algorithm, we explore the utility of anisotropic Gaussian kernel (AGK) which can offer better manifold-aware affinity information. Comprehensive experiments on both synthetic and UCI real datasets verify that our LAD outperforms existing anomaly detection algorithms.
【Keywords】: LAD; anomaly detection; diffusion space
【Paper Link】 【Pages】:415-424
【Authors】: Leman Akoglu ; Hanghang Tong ; Jilles Vreeken ; Christos Faloutsos
【Abstract】: Spotting anomalies in large multi-dimensional databases is a crucial task with many applications in finance, health care, security, etc. We introduce COMPREX, a new approach for identifying anomalies using pattern-based compression. Informally, our method finds a collection of dictionaries that describe the norm of a database succinctly, and subsequently flags those points dissimilar to the norm---with high compression cost---as anomalies. Our approach exhibits four key features: 1) it is parameter-free; it builds dictionaries directly from data, and requires no user-specified parameters such as distance functions or density and similarity thresholds, 2) it is general; we show it works for a broad range of complex databases, including graph, image and relational databases that may contain both categorical and numerical features, 3) it is scalable; its running time grows linearly with respect to both database size as well as number of dimensions, and 4) it is effective; experiments on a broad range of datasets show large improvements in both compression, as well as precision in anomaly detection, outperforming its state-of-the-art competitors.
【Keywords】: anomaly detection; categorical data; data encoding
【Paper Link】 【Pages】:425-434
【Authors】: Orly Moreno ; Bracha Shapira ; Lior Rokach ; Guy Shani
【Abstract】: Most collaborative Recommender Systems (RS) operate in a single domain (such as movies, books, etc.) and are capable of providing recommendations based on historical usage data which is collected in the specific domain only. Cross-domain recommenders address the sparsity problem by using Machine Learning (ML) techniques to transfer knowledge from a dense domain into a sparse target domain. In this paper we propose a transfer learning technique that extracts knowledge from multiple domains containing rich data (e.g., movies and music) and generates recommendations for a sparse target domain (e.g., games). Our method learns the relatedness between the different source domains and the target domain, without requiring overlapping users between domains. The model integrates the appropriate amount of knowledge from each domain in order to enrich the target domain data. Experiments with several datasets reveal that, using multiple sources and the relatedness between domains improves accuracy of results.
【Keywords】: collaborative filtering; cross domains; recommender systems; transfer learning
【Paper Link】 【Pages】:435-444
【Authors】: Wei Liu ; Jeffrey Chan ; James Bailey ; Christopher Leckie ; Kotagiri Ramamohanarao
【Abstract】: In large and complex graphs of social, chemical/biological, or other relations, frequent substructures are commonly shared by different graphs or by graphs evolving through different time periods. Tensors are natural representations of these complex time-evolving graph data. A factorization of a tensor provides a high-quality low-rank compact basis for each dimension of the tensor, which facilitates the interpretation of frequent substructures of the original graphs. However, the high computational cost of tensor factorization makes it infeasible for conventional tensor factorization methods to handle large graphs that evolve frequently with time. To address this problem, in this paper we propose a novel iterative tensor factorization (ITF) method whose time complexity is linear in the cardinalities of all dimensions of a tensor. This low time complexity means that when using tensors to represent dynamic graphs, the computational cost of ITF is linear in the size (number of edges/vertices) of graphs and is also linear in the number of time periods over which the graph evolves. More importantly, an error estimation of ITF suggests that its factorization correctness is comparable to that of the standard factorization method. We empirically evaluate our method on publication networks and chemical compound graphs, and demonstrate that ITF is an order of magnitude faster than the conventional method and at the same time preserves factorization quality. To the best of our knowledge, this research is the first work that uses important frequent substructures to speed up tensor factorizations for mining dynamic graphs.
【Keywords】: dynamic graphs; scalability; tensor factorization
【Paper Link】 【Pages】:445-454
【Authors】: Farshad Kooti ; Winter A. Mason ; P. Krishna Gummadi ; Meeyoung Cha
【Abstract】: The way in which social conventions emerge in communities has been of interest to social scientists for decades. Here we report on the emergence of a particular social convention on Twitter---the way to indicate a tweet is being reposted and attributing the content to its source. Despite being invented at different times and having different adoption rates, only two variations became widely adopted. In this paper we describe this process in detail, highlighting the factors that come into play in deciding which variation individuals will adopt. Our classification analysis demonstrates that the date of adoption and the number of exposures are particularly important in the adoption process, while personal features (such as the number of followers and join date) and the number of adopter friends have less discriminative power in predicting adoptions. We discuss implications of these findings in the design of future Web applications and services.
【Keywords】: microblog; prediction; social conventions
【Paper Link】 【Pages】:455-464
【Authors】: Ze Li ; Haiying Shen ; Joseph Edward Grant
【Abstract】: Question and Answer (Q&A) websites such as Yahoo!Answers provide a platform where users can post questions and receive answers. These systems take advantage of the collective intelligence of users to find information. In this paper, we analyze the online social network (OSN) in Yahoo!Answers. Based on a large amount of collected data, we study the OSN's structural properties, which reveal strikingly distinct characteristics such as low link symmetry and weak correlation between indegree and outdegree. After studying the knowledge base and behaviors of the users, we find that a small number of top contributors answer most of the questions in the system. Also, each top contributor focuses on only a few knowledge categories. In addition, the knowledge categories of the users are highly clustered. We also study the knowledge base in a user's social network, which reveals that the members in a user's social network share only a few knowledge categories. Based on the findings, we provide guidance in the design of spammer detection algorithms and distributed Q&A systems. We also propose a friendship-knowledge oriented Q&A framework that synergistically combines current OSN-based Q&A and web Q&A. We believe that the results presented in this paper are crucial in understanding the collective intelligence in the web Q&A OSNs and lay a cornerstone for the evolution of next-generation Q&A systems.
【Keywords】: collective intelligence; knowledge networks; on-line social networks
【Paper Link】 【Pages】:465-474
【Authors】: Chunyan Wang ; Mao Ye ; Wang-Chien Lee
【Abstract】: The rapid development of on-line social networking sites has dramatically changed the way people live and communicate. One particularly interesting phenomenon that came along with this development is the prominent role that various on-line networking portals play in scheduling and organizing off-line group events and activities. In this paper, we focus on studying the face-to-face (f2f) groups formed through, or facilitated by, on-line portals. We first show the distinct characteristics of such f2f groups by analyzing datasets collected from Whrrl and Meetup. Next, we propose a dynamic model for group gathering based on the process of friend invitation to interpret how an f2f group is formed on-line. The results of our model are confirmed by empirical observations. Finally, we demonstrate that using such group information can effectively improve the accuracies of social tie inference and friend recommendation.
【Keywords】: group-gathering; social networks; social tie inference
【Paper Link】 【Pages】:475-484
【Authors】: Mingqiang Xue ; Panagiotis Karras ; Chedy Raïssi ; Panos Kalnis ; Hung Keng Pung
【Abstract】: Social network data analysis raises concerns about the privacy of related entities or individuals. To address this issue, organizations can publish data after simply replacing the identities of individuals with pseudonyms, leaving the overall structure of the social network unchanged. However, it has been shown that attacks based on structural identification (e.g., a walk-based attack) enable an adversary to re-identify selected individuals in an anonymized network. In this paper we explore the capacity of techniques based on random edge perturbation to thwart such attacks. We theoretically establish that any kind of structural identification attack can effectively be prevented using random edge perturbation and show that, surprisingly, important properties of the whole network, as well as of subgraphs thereof, can be accurately calculated and hence data analysis tasks performed on the perturbed data, given that the legitimate data recipient knows the perturbation probability as well. Yet we also examine ways to enhance the walk-based attack, proposing a variant we call probabilistic attack. Nevertheless, we demonstrate that such probabilistic attacks can also be prevented under sufficient perturbation. Finally, we conduct a thorough theoretical study of the probability of success of any structural attack as a function of the perturbation probability. Our analysis provides a powerful tool for delineating the identification risk of perturbed social network data; our extensive experiments with synthetic and real datasets confirm our expectations.
【Keywords】: graph utility; privacy; random perturbation; social network
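The idea of random edge perturbation in the abstract above can be illustrated with a minimal sketch. It assumes one simple scheme that the abstract does not spell out: every vertex pair has its edge status flipped independently with probability `mu`. The function names and the edge-count estimator are illustrative assumptions, not the authors' implementation. The estimator follows from E[m'] = m(1 - mu) + (N - m)mu, where N is the number of vertex pairs and m' is the observed edge count.

```python
import itertools
import random


def perturb_edges(n, edges, mu, seed=0):
    """Flip the edge status of every vertex pair independently with probability mu.

    Assumed scheme for illustration: an existing edge is removed with
    probability mu, and a missing edge is added with probability mu.
    """
    rng = random.Random(seed)
    original = {frozenset(e) for e in edges}
    perturbed = set()
    for u, v in itertools.combinations(range(n), 2):
        pair = frozenset((u, v))
        present = pair in original
        if rng.random() < mu:
            present = not present
        if present:
            perturbed.add(pair)
    return perturbed


def estimate_edge_count(n, perturbed_count, mu):
    """Unbiased estimate of the original edge count, valid for mu != 0.5."""
    total_pairs = n * (n - 1) // 2
    return (perturbed_count - mu * total_pairs) / (1 - 2 * mu)
```

A recipient who knows `mu` can call `estimate_edge_count(n, len(perturbed), mu)` to recover the expected number of original edges, which is one small instance of computing a network property from perturbed data.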
【Paper Link】 【Pages】:485-494
【Authors】: Tianbing Xu ; Ruofei Zhang ; Zhen Guo
【Abstract】: With the development of Web applications, large scale data are popular; and they are not only getting richer, but also ubiquitously interconnected with users and other objects in various ways, which brings about multi-view data with implicit structure. In this paper, we propose a novel hierarchical Bayesian mixture regression model, which discovers and then exploits the relationships among multiple views of the data to perform various machine learning tasks. A stochastic EM inference and learning algorithm is derived; and a parallel implementation in Hadoop MapReduce [9] paradigm is developed to scale up the learning. We apply the developed model and algorithm on click-through-rate (CTR) prediction and campaign targeting recommendation in online advertising to measure its effectiveness. The experiments on both synthetic data and large scale ads serving data from a real world online advertising exchange demonstrate the superior CTR prediction accuracy of our method compared to existing state-of-the-art methods. The results also show that our model can recommend high performance targeting features for online advertising campaigns.
【Keywords】: hierarchical bayesian regression; online advertising
【Paper Link】 【Pages】:495-504
【Authors】: Javad Azimi ; Ruofei Zhang ; Yang Zhou ; Vidhya Navalpakkam ; Jianchang Mao ; Xiaoli Fern
【Abstract】: One of the most important categories of online advertising is display advertising which provides publishers with significant revenue. Similar to other categories, the main goal in display advertising is to maximize user response rate for advertising campaigns, such as click through rates (CTR) or conversion rates. Previous studies have tried to optimize these parameters using objectives such as behavioral targeting. However, there is no published work so far to address the effect of the visual appearance of ads (creatives) on user response rate via a systematic data-driven approach. In this paper, we quantitatively study the relationship between the visual appearance and performance of creatives using large scale data in the world's largest display ads exchange system, RightMedia. We designed a set of 43 visual features, some of which are novel and others are inspired by related work. We extracted these features from real creatives served on RightMedia. We also designed and conducted a series of experiments to evaluate the effectiveness of visual features for CTR prediction, ranking and performance classification. Based on the evaluation results, we selected a subset of features that have the highest impact on CTR. We believe that the findings presented in this paper will be very useful for the online advertising industry in designing high-performance creatives. It also provides the research community with the first ever data set, initial insights into visual appearance's effect on user response propensity, and evaluation benchmarks for further study.
【Keywords】: creative recommendation; online advertising; visual features
【Paper Link】 【Pages】:505-514
【Authors】: Takehiro Yamamoto ; Tetsuya Sakai ; Mayu Iwata ; Chen Yu ; Ji-Rong Wen ; Katsumi Tanaka
【Abstract】: This paper tackles the problem of mining subgoals of a given search goal from data. For example, when a searcher wants to travel to London, she may need to accomplish several subtasks such as "book flights," "book a hotel," "find good restaurants" and "decide which sightseeing spots to visit." As another example, if a searcher wants to lose weight, there may exist several alternative solutions such as "do physical exercise," "take diet pills," and "control calorie intake." In this paper, we refer to such subtasks or solutions as subgoals, and propose to utilize sponsored search data for finding subgoals of a given query by means of query clustering. Advertisements (ads) reflect advertisers' tremendous efforts in trying to match a given query with implicit user needs. Moreover, ads are usually associated with a particular action or transaction. We therefore hypothesized that they are useful for subgoal mining. To our knowledge, our work is the first to use sponsored search data for this purpose. Our experimental results show that sponsored search data is a good resource for obtaining related queries and for identifying subgoals via query clustering. In particular, our method that combines ad impressions from sponsored search data and query co-occurrences from session data outperforms a state-of-the-art query clustering method that relies on document clicks rather than ad impressions in terms of purity, NMI, Rand Index, F1-measure and subgoal recall.
【Keywords】: query clustering; sponsored search; user intent
【Paper Link】 【Pages】:515-524
【Authors】: Shuai Yuan ; Jun Wang
【Abstract】: Online advertising has become a key source of revenue for both web search engines and online publishers. For them, the ability to allocate the right ads to the right webpages is critical because any mismatched ads would not only harm web users' satisfaction but also lower the ad income. In this paper, we study how online publishers could optimally select ads to maximize their ad incomes over time. The conventional offline, content-based matching between webpages and ads is a fine start but cannot solve the problem completely because good matching does not necessarily lead to good payoff. Moreover, with the limited display impressions, we need to balance the need of selecting ads to learn true ad payoffs (exploration) with that of allocating ads to generate high immediate payoffs based on the current belief (exploitation). In this paper, we address the problem by employing partially observable Markov decision processes (POMDPs) and discuss how to utilize the correlation of ads to improve the efficiency of the exploration and increase ad incomes in the long run. Our mathematical derivation shows that the belief states of correlated ads can be naturally updated using a formula similar to collaborative filtering. To test our model, a real world ad dataset from a major search engine is collected and categorized. Experimenting over the data, we provide an analysis of the effect of the underlying parameters, and demonstrate that our algorithms significantly outperform other strong baselines.
【Keywords】: computational advertising; correlation; pomdps; revenue optimisation; value iteration
【Paper Link】 【Pages】:525-534
【Authors】: Mostafa Keikha ; Fabio Crestani ; W. Bruce Croft
【Abstract】: Blog distillation (blog feed retrieval) is a task in blog retrieval where the goal is to rank blogs according to their recurrent relevance to a query topic. One of the main properties of blog feed retrieval is that the unit of retrieval is a collection of documents as opposed to a single document as in other IR tasks. This collection retrieval nature of blog distillation introduces new challenges and requires new investigations specific to this problem. Researchers have addressed this problem by considering a wide range of evidence and information resources. However, previous work has not studied the effect of on-topic diversity of blog posts in blog relevance. By on-topic diversity of blog posts we mean that those posts that are about the query topic need to have high diversity and cover different sub-topics of the query. In this study, we investigate three types of on-topic diversity and their effect on retrieval performance: topical diversity, temporal diversity and hybrid diversity. Our experiments over different blog collections and different baseline methods show that on-topic diversity can improve the performance of the retrieval system. Among the three types of diversity, hybrid diversity, that considers both topical and temporal diversities, achieves the best performance.
【Keywords】: blog retrieval; diversity; novelty
【Paper Link】 【Pages】:535-544
【Authors】: Noam Koenigstein ; Parikshit Ram ; Yuval Shavitt
【Abstract】: Low-rank Matrix Factorization (MF) methods provide one of the simplest and most effective approaches to collaborative filtering. This paper is the first to investigate the problem of efficient retrieval of recommendations in an MF framework. We reduce the retrieval in an MF model to an apparently simple task of finding the maximum dot-product for the user vector over the set of item vectors. However, to the best of our knowledge the problem of efficiently finding the maximum dot-product in the general case has never been studied. To this end, we propose two techniques for efficient search -- (i) We index the item vectors in a binary spatial-partitioning metric tree and use a simple branch-and-bound algorithm with a novel bounding scheme to efficiently obtain exact solutions. (ii) We use spherical clustering to index the users on the basis of their preferences and pre-compute recommendations only for the representative user of each cluster to obtain extremely efficient approximate solutions. We obtain a theoretical error bound which determines the quality of any approximate result and use it to control the approximation. Both these simple techniques are fairly independent of each other and hence are easily combined to further improve recommendation retrieval efficiency. We evaluate our algorithms on real-world collaborative-filtering datasets, demonstrating more than ×7 speedup (with respect to the naive linear search) for the exact solution and over ×250 speedup for approximate solutions by combining both techniques.
【Keywords】: collaborative filtering; fast retrieval; inner-product
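The retrieval task described in the abstract above, finding the item vectors with the largest dot-product against a user vector, can be sketched with the naive linear search that the paper uses as its baseline. This is only an illustrative sketch: the function names are assumptions, and the metric-tree and clustering techniques from the abstract are not implemented here. The second helper shows why the problem differs from ordinary nearest-neighbor search: an item with a large norm can win on dot-product while being far away in Euclidean distance.

```python
import math


def top_k_by_dot_product(user, items, k=1):
    """Naive linear scan: score every item vector and return the top-k indices."""
    scored = [
        (sum(u * x for u, x in zip(user, item)), idx)
        for idx, item in enumerate(items)
    ]
    # Highest score first; ties are broken by the lower item index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


def nearest_by_euclidean(user, items):
    """Index of the item closest to the user vector in Euclidean distance."""
    return min(range(len(items)), key=lambda i: math.dist(user, items[i]))


user = [1.0, 0.0]
items = [[1.0, 0.0], [3.0, 3.0], [0.5, 0.0]]
best_by_dot = top_k_by_dot_product(user, items, k=2)   # [1, 0]
closest = nearest_by_euclidean(user, items)            # 0
```

The scan costs one dot-product per item for each user, which is the cost that the indexing techniques in the abstract aim to reduce.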
【Paper Link】 【Pages】:545-554
【Authors】: Johannes Hoffart ; Stephan Seufert ; Dat Ba Nguyen ; Martin Theobald ; Gerhard Weikum
【Abstract】: Measuring the semantic relatedness between two entities is the basis for numerous tasks in IR, NLP, and Web-based knowledge extraction. This paper focuses on disambiguating names in a Web or text document by jointly mapping all names onto semantically related entities registered in a knowledge base. To this end, we have developed a novel notion of semantic relatedness between two entities represented as sets of weighted (multi-word) keyphrases, with consideration of partially overlapping phrases. This measure improves the quality of prior link-based models, and also eliminates the need for (usually Wikipedia-centric) explicit interlinkage between entities. Thus, our method is more versatile and can cope with long-tail and newly emerging entities that have few or no links associated with them. For efficiency, we have developed approximation techniques based on min-hash sketches and locality-sensitive hashing. Our experiments on semantic relatedness and on named entity disambiguation demonstrate the superiority of our method compared to state-of-the-art baselines.
【Keywords】: entity disambiguation; entity relatedness; locality-sensitive hashing; semantic relatedness
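The abstract above mentions min-hash sketches as an approximation technique. As a rough illustration of the general idea only, the sketch below estimates the Jaccard similarity of two keyphrase sets from fixed-length signatures. It is not the paper's relatedness measure, which uses weighted keyphrases with partial overlap. The function names and the use of salted SHA-1 as the hash family are assumptions made for this example.

```python
import hashlib


def minhash_signature(items, num_hashes=128):
    """Return a min-hash signature for a non-empty collection of strings.

    Each of the num_hashes hash functions is simulated by prefixing the
    item with a different integer salt before hashing with SHA-1.
    """
    signature = []
    for i in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{i}:{item}".encode()).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Two entities whose keyphrase sets overlap heavily will agree on many signature positions, so comparing short signatures can stand in for comparing the full sets.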
【Paper Link】 【Pages】:555-564
【Authors】: Anagha Kulkarni ; Almer S. Tigelaar ; Djoerd Hiemstra ; Jamie Callan
【Abstract】: Large document collections can be partitioned into 'topical shards' to facilitate distributed search. In a low-resource search environment only a few of the shards can be searched in parallel. Such a search environment faces two intertwined challenges. First, determining which shards to consult for a given query: shard ranking. Second, how many shards to consult from the ranking: cutoff estimation. In this paper we present a family of three algorithms that address both of these problems. As a basis we employ a commonly used data structure, the central sample index (CSI), to represent the shard contents. Running a query against the CSI yields a flat document ranking that each of our algorithms transforms into a tree structure. A bottom up traversal of the tree is used to infer a ranking of shards and also to estimate a stopping point in this ranking that yields cost-effective selective distributed search. As compared to a state-of-the-art shard ranking approach the proposed algorithms provide substantially higher search efficiency while providing comparable search effectiveness.
【Keywords】: distributed information retrieval; selective search
【Paper Link】 【Pages】:565-574
【Authors】: Theodoros Lappas ; Evimaria Terzi
【Abstract】: Daily-Deal Sites (DDS) like Groupon, LivingSocial, Amazon's Goldbox, and many more, have become particularly popular over the last three years, providing discounted offers to customers for restaurants, ticketed events, services etc. In this paper, we study the following problem: among a set of candidate deals, which are the ones that a DDS should feature as daily-deals in order to maximize its revenue? Our first contribution lies in providing two combinatorial formulations of this problem. Both formulations take into account factors like the diversification of daily deals and the limited consuming capacity of the userbase. We prove that our problems are NP-hard and devise pseudopolynomial-time approximation algorithms for their solution. We also propose a set of heuristics, and demonstrate their efficiency in our experiments. In the context of deal selection and scheduling, we acknowledge the importance of the ability to estimate the expected revenue of a candidate deal. We explore the nature of this task in the context of real data, and propose a framework for revenue-estimation. We demonstrate the effectiveness of our entire methodology in an experimental evaluation on a large dataset of daily-deals from Groupon.
【Keywords】: daily deals; deal selection; e-commerce; revenue maximization
【Paper Link】 【Pages】:575-584
【Authors】: Ariel Fuxman ; Anitha Kannan ; Zhenhui Li ; Panayiotis Tsaparas
【Abstract】: Advertisers typically have a fairly accurate idea of the interests of their target audience. However, today's online advertising systems are unable to leverage this information. The reasons are two-fold. First, there is no agreed upon vocabulary of interests for advertisers and advertising systems to communicate. More importantly, advertising systems lack a mechanism for mapping users to the interest vocabulary. In this paper, we tackle both problems. We present a system for direct interest-aware audience selection. This system takes the query histories of search engine users as input, extracts their interests, and describes them with interpretable labels. The labels are not drawn from a predefined taxonomy, but rather dynamically generated from the query histories, and are thus easy for the advertisers to interpret and use for targeting users. In addition, the system enables seamless addition of interest labels that may be provided by the advertiser.
【Keywords】: audience selection; user interests
【Paper Link】 【Pages】:585-594
【Authors】: Shahrzad Shirazipourazad ; Brian Bogard ; Harsh Vachhani ; Arunabha Sen ; Paul Horn
【Abstract】: It has been observed that individuals' decisions to adopt a product or innovation are often influenced by the recommendations of their friends and acquaintances. Motivated by this observation, the last few years have seen a number of studies on influence maximization in social networks. The primary goal of these studies is identification of k most influential nodes in a network. A major limitation of these studies is that they focus on a non-adversarial environment, where only one player is engaged in influencing the nodes. However, in a realistic scenario multiple players attempt to influence the nodes in a competitive fashion. The proposed model considers a competitive environment where a node that has not yet adopted an innovation, can adopt only one of the several competing innovations and once it adopts an innovation, it does not switch. The paper studies the scenario where the first player has already chosen a set of k nodes and the second player, with the knowledge of the choice of the first, attempts to identify a smallest set of nodes (excluding the ones already chosen by the first) so that when the influence propagation process ends, the number of nodes influenced by the second player is larger than the number of nodes influenced by the first. The paper studies two propagation models and shows that in both the models, the identification of the smallest set of nodes to defeat the adversary is NP-Hard. It provides an approximation algorithm and proves that the performance bound is tight. It also presents the results of extensive experimentation using the collaboration network data. Experimental results show that the second player can easily defeat the first with this algorithm, if the first utilizes the node degree or closeness centrality based algorithms for the selection of influential nodes. The proposed algorithm also provides better performance if the second player utilizes it instead of the greedy algorithm to maximize its influence.
【Keywords】: adversarial environment; influence maximization; social networks
【Paper Link】 【Pages】:595-604
【Authors】: Dan Shen ; Jean-David Ruvini ; Badrul Sarwar
【Abstract】: This paper studies the problem of leveraging computationally intensive classification algorithms for large scale text categorization problems. We propose a hierarchical approach which decomposes the classification problem into a coarse level task and a fine level task. A simple yet scalable classifier is applied to perform the coarse level classification while a more sophisticated model is used to separate classes at the fine level. However, instead of relying on a human-defined hierarchy to decompose the problem, we use a graph algorithm to automatically discover groups of highly similar classes. As an illustrative example, we apply our approach to real-world industrial data from eBay, a major e-commerce site where the goal is to classify live items into a large taxonomy of categories. In such an industrial setting, classification is very challenging due to the number of classes, the amount of training data, the size of the feature space and the real-world requirements on the response time. We demonstrate through extensive experimental evaluation that (1) the proposed hierarchical approach is superior to flat models, and (2) the data-driven extraction of latent groups works significantly better than the existing human-defined hierarchy.
【Keywords】: classification; text
【Paper Link】 【Pages】:605-614
【Authors】: Vishrawas Gopalakrishnan ; Suresh Parthasarathy Iyengar ; Amit Madaan ; Rajeev Rastogi ; Srinivasan H. Sengamedu
【Abstract】: Matching product titles from different data feeds that refer to the same underlying product entity is a key problem in online shopping. This matching problem is challenging because titles across the feeds have diverse representations with some missing important keywords like brand and others containing extraneous keywords related to product specifications. In this paper, we propose a novel unsupervised matching algorithm that leverages web search engines to (1) enrich product titles by adding important missing tokens that occur frequently in search results, and (2) compute importance scores for tokens based on their ability to retrieve other (enriched title) tokens in search results. Our matching scheme calculates the cosine similarity between enriched title pairs with tokens weighted by their importance scores. We propose an optimization that exploits the templatized structure of product titles to reduce the number of search queries. In experiments with real-life shopping datasets, we found that our matching algorithm has superior F1 scores compared to IDF-based cosine similarity.
【Keywords】: entity resolution; web-based enrichment
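The final matching step in the abstract above, cosine similarity between title pairs with tokens weighted by importance scores, can be sketched as follows. This is a minimal illustration under stated assumptions: titles are treated as token sets, the `importance` dictionary is a hypothetical stand-in for the scores the paper derives from search results, and tokens without a score fall back to a default weight. The enrichment step itself is not shown.

```python
import math


def weighted_cosine(tokens_a, tokens_b, importance, default=1.0):
    """Cosine similarity of two token sets, each token weighted by its importance."""
    a, b = set(tokens_a), set(tokens_b)

    def weight(token):
        return importance.get(token, default)

    dot = sum(weight(t) ** 2 for t in a & b)
    norm_a = math.sqrt(sum(weight(t) ** 2 for t in a))
    norm_b = math.sqrt(sum(weight(t) ** 2 for t in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical importance scores: brand and model tokens count more.
importance = {"canon": 2.0, "eos": 2.0, "camera": 1.0, "black": 1.0}
score = weighted_cosine(["canon", "eos", "camera"], ["canon", "eos", "black"], importance)
# score is 8/9: the shared high-importance tokens dominate the match.
```

With uniform weights the same two titles would score 2/3, so the weighting is what lets brand and model tokens outweigh incidental ones.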
【Paper Link】 【Pages】:615-624
【Authors】: Kai-Yang Chiang ; Joyce Jiyoung Whang ; Inderjit S. Dhillon
【Abstract】: We consider the general k-way clustering problem in signed social networks where relationships between entities can be either positive or negative. Motivated by social balance theory, the clustering problem in signed networks aims to find mutually antagonistic groups such that entities within the same group are friends with each other. A recent method proposed in [13] extended the spectral clustering algorithm to the signed network setting by considering the signed graph Laplacian. This has been shown to be equivalent to finding clusters that minimize the 2-way signed ratio cut. In this paper, we show that there is a fundamental weakness when we directly extend the signed Laplacian to the k-way clustering problem. To overcome this weakness, we formulate new k-way objectives for signed networks. In particular, we propose a criterion that is analogous to the normalized cut, called balance normalized cut, which is not only theoretically sound but also experimentally effective in k-way clustering. In addition, we prove that these objectives are equivalent to weighted kernel k-means objectives by choosing an appropriate kernel matrix. Employing this equivalence, we develop a multilevel clustering framework for signed networks. In this framework, we coarsen the graph level by level and refine the clustering results at each level via a k-means based algorithm so that the signed clustering objectives are optimized. This approach gives good quality clustering results, and is also highly efficient and scalable. In experiments, we see that our multilevel approach is competitive with other state-of-the-art methods, while it is much faster and more scalable. In particular, the largest graph we have considered in our experiments contains 1 million nodes and 100 million edges; this graph can be clustered in less than four hundred seconds using our algorithm.
【Keywords】: clustering; signed graph kernels; signed networks; social balance theory
【Paper Link】 【Pages】:625-634
【Authors】: Xuhui Fan ; Lin Zhu ; Longbing Cao ; Xia Cui ; Yew-Soon Ong
【Abstract】: Evolutionary data, such as topic changing blogs and evolving trading behaviors in capital markets, is widely seen in business and social applications. The time factor and intrinsic change embedded in evolutionary data greatly challenge evolutionary clustering. To incorporate the time factor, existing methods mainly regard the evolutionary clustering problem as a linear combination of snapshot cost and temporal cost, and reflect the time factor through the temporal cost. Despite promising results, this approach still faces accuracy and scalability challenges. This paper proposes a novel evolutionary clustering approach, evolutionary maximum margin clustering (e-MMC), to cluster large-scale evolutionary data from the maximum margin perspective. e-MMC incorporates two frameworks: Data Integration from the data changing perspective and Model Integration corresponding to model adjustment to tackle the time factor and change, with an adaptive label allocation mechanism. Three e-MMC clustering algorithms are proposed based on the two frameworks. Extensive experiments are performed on synthetic data, UCI data and real-world blog data, which confirm that e-MMC outperforms the state-of-the-art clustering algorithms in terms of accuracy, computational cost and scalability. It shows that e-MMC is particularly suitable for clustering large-scale evolving data.
【Keywords】: evolutionary data; maximum margin clustering
【Paper Link】 【Pages】:635-644
【Authors】: Tim Weninger ; Yonatan Bisk ; Jiawei Han
【Abstract】: Topic taxonomies present a multi-level view of a document collection, where general topics live towards the top of the taxonomy and more specific topics live towards the bottom. Topic taxonomies allow users to quickly drill down into their topic of interest to find documents. We show that hierarchies of documents, where documents live at the inner nodes of the hierarchy-tree, can also be inferred by combining document text with inter-document links. We present a Bayesian generative model by which an explicit hierarchy of documents is created. Experiments on three document-graph data sets show that the generated document hierarchies are able to fit the observed data, and that the levels in the constructed document hierarchy represent practical groupings.
【Keywords】: bayesian generative models; hierarchical clustering; model evaluation; topic models
【Paper Link】 【Pages】:645-653
【Authors】: Xiang Wang ; Buyue Qian ; Ian Davidson
【Abstract】: With the development of statistical machine translation, we have ready-to-use tools that can translate documents from one language to many other languages. These translations provide different yet correlated views of the same set of documents. This gives rise to an intriguing question: can we use the extra information to achieve a better clustering of the documents? Some recent work on multiview clustering provided positive answers to this question. In this work, we propose an alternative approach to address this problem using the constrained clustering framework. Unlike traditional Must-Link and Cannot-Link constraints, the constraints generated from machine translation are dense yet noisy. We show how to incorporate this type of constraint by presenting two algorithms, one parametric and one non-parametric. Our algorithms are easy to implement, efficient, and can consistently improve the clustering of real data, namely the Reuters RCV1/RCV2 Multilingual Dataset. In contrast to existing multiview clustering algorithms, our technique does not need the compatibility or the conditional independence assumption, nor does it involve subtle parameter tuning.
【Keywords】: constrained spectral clustering; document clustering; machine translation
【Paper Link】 【Pages】:654-663
【Authors】: Michail Vlachos ; Aleksander Wieczorek ; Johannes Schneider
【Abstract】: The emergence of cloud-based storage services is opening up new avenues in data exchange and data dissemination. This has amplified the interest in right-protection mechanisms for establishing ownership in case of data leakage. Current right-protection technologies, however, rarely provide strong guarantees on the dataset utility after the protection process. This work presents techniques that explicitly address this shortcoming and provably preserve the outcome of certain mining operations. In particular, we take special care to guarantee that the outcome of hierarchical clustering operations remains the same before and after right protection. We encode data ownership using watermarking principles. In the process, we derive fundamental bounds on the distortion incurred by the watermarking. We leverage our theoretical analysis to design fast algorithms for right protection without exhaustively searching the vast design space.
【Keywords】: watermarking
【Paper Link】 【Pages】:664-673
【Authors】: Azarias Reda ; Yubin Park ; Mitul Tiwari ; Christian Posse ; Sam Shah
【Abstract】: Search plays an important role in online social networks as it provides an essential mechanism for discovering members and content on the network. Related search recommendation is one of several mechanisms used for improving members' search experience in finding relevant results to their queries. This paper describes the design, implementation, and deployment of Metaphor, the related search recommendation system on LinkedIn, a professional social networking site with over 175 million members worldwide. Metaphor builds on a number of signals and filters that capture several dimensions of relatedness across member search activity. The system, which has been in live operation for over a year, has gone through multiple iterations and evaluation cycles. This paper makes three contributions. First, we provide a discussion of a large-scale related search recommendation system. Second, we describe a mechanism for effectively combining several signals in building a unified dataset for related search recommendations. Third, we introduce a query length model for capturing bias in recommendation click behavior. We also discuss some of the practical concerns in deploying related search recommendations.
【Keywords】: log analysis; query suggestions; recommender system
【Paper Link】 【Pages】:674-683
【Authors】: Xingjie Liu ; Yuan Tian ; Mao Ye ; Wang-Chien Lee
【Abstract】: Group activities are essential ingredients of people's social life. The rapid growth of online social networking services has greatly boosted group activities by providing a convenient platform for users to organize and participate in such activities. Therefore, recommender systems, as a critical component in social networking services, now face new challenges in supporting group activities. In this paper, we study the group recommendation problem, i.e., making recommendations to a group of people in social networking services. We analyze the decision making process in a group to propose a personal impact topic (PIT) model for group recommendations. The PIT model effectively identifies the group preference profile for a given group by considering the personal preferences and personal impacts of group members. Moreover, we enhance the discovery of personal impact with social network information to obtain an extended personal impact topic (E-PIT) model. We have conducted comprehensive data analysis and evaluations on three real datasets. The results show that our proposed group recommendation techniques outperform baseline approaches.
【Keywords】: group recommendation; probabilistic generative model; recommender systems
【Paper Link】 【Pages】:684-693
【Authors】: Yongli Ren ; Gang Li ; Jun Zhang ; Wanlei Zhou
【Abstract】: As each user tends to rate a small proportion of available items, the resulting Data Sparsity issue brings significant challenges to the research of recommender systems. This issue becomes even more severe for neighborhood-based collaborative filtering methods, as there are even lower numbers of ratings available in the neighborhood of the query item. In this paper, we aim to address the Data Sparsity issue in the context of neighborhood-based collaborative filtering. Given the (user, item) query, a set of key ratings is identified, and an auto-adaptive imputation method is proposed to fill the missing values in the set of key ratings. The proposed method can be used with any similarity metric, such as the Pearson Correlation Coefficient and Cosine-based similarity, and it is theoretically guaranteed to outperform the neighborhood-based collaborative filtering approaches. Results from experiments prove that the proposed method could significantly improve the accuracy of recommendations for neighborhood-based Collaborative Filtering algorithms.
【Keywords】: collaborative filtering; imputation; recommender systems
【Paper Link】 【Pages】:694-703
【Authors】: Deepak Agarwal ; Bee-Chung Chen ; Xuanhui Wang
【Abstract】: Personalized article recommendation is important for news portals to improve user engagement. Existing work quantifies engagement primarily through click rates. We suggest that quality of recommendations may be improved by exploiting different types of "post-read" engagement signals like sharing, commenting, printing and e-mailing article links. Specifically, we propose a multi-faceted ranking problem for recommending articles, where each facet corresponds to a ranking task that seeks to maximize actions of a particular post-read type (e.g., ranking articles to maximize sharing actions). Our approach is to predict the probability that a user would take a post-read action on an article, so that articles can be ranked according to such probabilities. However, post-read actions are rare events; enormous data sparsity makes the problem challenging. We meet the challenge by exploiting correlations across different post-read action types through a novel locally augmented tensor (LAT) model, so that the ranking performance of a particular action type can be improved by leveraging data from all other action types. Through extensive experiments, we show that our LAT model significantly outperforms a variety of state-of-the-art factor models, logistic regression and IR models.
【Keywords】: multi-faceted; post-read; tensor model
【Paper Link】 【Pages】:704-713
【Authors】: Thanasis G. Papaioannou ; Jean-Eudes Ranvier ; Alexandra Olteanu ; Karl Aberer
【Abstract】: An overwhelming and growing amount of data is available online. The problem of untrustworthy online information is augmented by its high economic potential and its dynamic nature, e.g. transient domain names, dynamic content, etc. In this paper, we address the problem of assessing the credibility of web pages by a decentralized social recommender system. Specifically, we concurrently employ i) item-based collaborative filtering (CF) based on specific web page features, ii) user-based CF based on friend ratings and iii) the ranking of the page in search results. These factors are appropriately combined into a single assessment based on adaptive weights that depend on their effectiveness for different topics and different fractions of malicious ratings. Simulation experiments with real traces of web page credibility evaluations suggest that our hybrid approach outperforms both its constituent components and classical content-based classification approaches.
【Keywords】: collaborative filtering; similarity metrics; social networks
【Paper Link】 【Pages】:714-723
【Authors】: Xiaorui Jiang ; Xiaoping Sun ; Hai Zhuge
【Abstract】: It is important to help researchers find valuable scientific papers from a large literature collection containing information of authors, papers and venues. Graph-based algorithms have been proposed to rank papers based on networks formed by citation and co-author relationships. This paper proposes a new graph-based ranking framework MutualRank that integrates mutual reinforcement relationships among networks of papers, researchers and venues to achieve a more synthetic, accurate and fair ranking result than previous graph-based methods. MutualRank leverages the network structure information among papers, authors, and their venues available from a literature collection dataset and sets up a unified mutual reinforcement model that involves both intra- and inter-network information for ranking papers, authors and venues simultaneously. To evaluate, we collect a set of recommended papers from websites of graduate-level computational linguistics courses of 15 top universities as the benchmark and apply different methods to estimate paper importance. The results show that MutualRank greatly outperforms the competitors including PageRank, HITS and CoRank in ranking papers as well as researchers. The experimental results also demonstrate that venues ranked by MutualRank are reasonable.
【Keywords】: iterative ranking; mutual reinforcement; time distortion
【Paper Link】 【Pages】:724-733
【Authors】: Tam T. Nguyen ; Kuiyu Chang ; Siu Cheung Hui
【Abstract】: We propose a math-aware search engine that is capable of handling both textual keywords as well as mathematical expressions. Our math feature extraction and representation framework captures the semantics of math expressions via a Finite State Machine model. We adapt the passive aggressive online learning binary classifier as the ranking model. We benchmarked our approach against three classical information retrieval (IR) strategies on math documents crawled from Math Overflow, a well-known online math question answering system. Experimental results show that our proposed approach can perform better than other methods by more than 9%.
【Keywords】: learning to rank; math document retrieval; math-aware search engine
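The abstract above names the passive aggressive online classifier as the basis of its ranking model. As a rough illustration only, the sketch below shows the basic binary passive-aggressive update (hinge loss, no slack parameter); it is not the authors' adapted ranking model, and the function name `pa_update` and the toy vectors are assumptions made for this example.

```python
from typing import List


def pa_update(w: List[float], x: List[float], y: int) -> List[float]:
    """One basic passive-aggressive step for a label y in {-1, +1}.

    Passive: if the example already has margin >= 1, w is returned unchanged.
    Aggressive: otherwise w moves just far enough to reach margin 1.
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)          # hinge loss on this example
    norm_sq = sum(xi * xi for xi in x)
    if loss == 0.0 or norm_sq == 0.0:
        return list(w)
    tau = loss / norm_sq                   # step size of the basic PA variant
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


# Hypothetical two-feature example, starting from a zero weight vector.
w0 = [0.0, 0.0]
w1 = pa_update(w0, [1.0, 2.0], +1)
w2 = pa_update(w1, [1.0, 2.0], +1)   # margin is now 1, so no further change
```

After one update the example sits exactly on the margin, which is why the second call leaves the weights untouched.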
【Paper Link】 【Pages】:734-743
【Authors】: Muhammad Ali Norozi ; Paavo Arvola ; Arjen P. de Vries
【Abstract】: Context surrounding hyperlinked semi-structured documents, externally in the form of citations and internally in the form of hierarchical structure, contains a wealth of useful but implicit evidence about a document's relevance. These rich sources of information should be exploited as contextual evidence. This paper proposes various methods of accumulating evidence from the context, and measures the effect of contextual evidence on retrieval effectiveness for document and focused retrieval of hyperlinked semi-structured documents. We propose a re-weighting model to contextualize (a) evidence from citations in a query-independent and query-dependent fashion (based on Markovian random walks) and (b) evidence accumulated from the internal tree structure of documents. The in-links and out-links of a node in the citation graph are used as external context, while the internal document structure provides internal, within-document context. We hypothesize that documents in a good context (having strong contextual evidence) should be good candidates to be relevant to the posed query, and vice versa. We tested several variants of contextualization and verified notable improvements in comparison with the baseline system and gold standards in the retrieval of full documents and focused elements.
【Keywords】: contextualization; random walks; re-weighting; schema agnostic search; semi-structured data; structural indices; xml retrieval
【Paper Link】 【Pages】:744-753
【Authors】: Jin Young Kim ; Henry Allen Feild ; Marc-Allen Cartright
【Abstract】: With the increased availability of e-books and digitized book collections, more users are searching the web for information about books. There are many online digital libraries containing book, author and subject data, which are accessed via internal search services as well as external web sites, such as Google. Although this is a common yet complex information-seeking behavior involving multiple search systems with different characteristics, little is known about how users find information in this scenario. In this work, we analyze web-based book search behavior using three months of logs from the Open Library, a globally accessible digital library. Our study encompasses the user behavior on web search engines and the digital library, unlike previous work which focused on institution-level digital libraries. Among our findings are (1) query characteristics and session-level behaviors are drastically different between internal and external searchers; (2) the field usage differs across the modes of interaction (keyword search, advanced search interface and faceted filtering); (3) users go through more iterations of faceted filtering than query reformulation. To facilitate future research on book search, we also create a book search test collection based on the log data. We then perform an evaluation of several retrieval methods, finding that field-based retrieval models have advantages over document-based models.
【Keywords】: book search; query log analysis; user modeling
【Paper Link】 【Pages】:754-763
【Authors】: Ruben Sipos ; Adith Swaminathan ; Pannaga Shivaswamy ; Thorsten Joachims
【Abstract】: In many areas of life, we now have almost complete electronic archives reaching back for well over two decades. This includes, for example, the body of research papers in computer science, all news articles written in the US, and most people's personal email. However, we have only rather limited methods for analyzing and understanding these collections. While keyword-based retrieval systems allow efficient access to individual documents in archives, we still lack methods for understanding a corpus as a whole. In this paper, we explore methods that provide a temporal summary of such corpora in terms of landmark documents, authors, and topics. In particular, we explicitly model the temporal nature of influence between documents and re-interpret summarization as a coverage problem over words anchored in time. The resulting models provide monotone sub-modular objectives for computing informative and non-redundant summaries over time, which can be efficiently optimized with greedy algorithms. Our empirical study shows the effectiveness of our approach over several baselines.
【Keywords】: submodular; summarization; temporal
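The abstract above states that its monotone submodular objectives can be optimized with greedy algorithms. A minimal sketch of that general idea follows, using plain set coverage over (word, year) pairs as a stand-in objective; the paper's actual temporal influence model is not reproduced here, and `greedy_cover` and the toy documents are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Unit = Tuple[str, int]  # a word anchored to a year, e.g. ("graph", 2001)


def greedy_cover(candidates: Dict[str, Set[Unit]], budget: int) -> List[str]:
    """Greedily pick up to `budget` documents by marginal coverage gain."""
    covered: Set[Unit] = set()
    chosen: List[str] = []
    for _ in range(budget):
        best_doc, best_gain = None, 0
        for doc in sorted(candidates):      # sorted for deterministic ties
            if doc in chosen:
                continue
            gain = len(candidates[doc] - covered)   # newly covered units
            if gain > best_gain:
                best_doc, best_gain = doc, gain
        if best_doc is None:                # nothing adds new coverage
            break
        chosen.append(best_doc)
        covered |= candidates[best_doc]
    return chosen


# Toy corpus: "b" is redundant with "a", so it is never selected.
docs = {
    "a": {("graph", 2001), ("rank", 2001), ("walk", 2002)},
    "b": {("graph", 2001), ("rank", 2001)},
    "c": {("topic", 2005), ("model", 2005)},
}
summary = greedy_cover(docs, 2)
```

Because coverage is monotone and submodular, this kind of greedy selection carries the classical (1 - 1/e) approximation guarantee for a cardinality budget.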
【Paper Link】 【Pages】:764-772
【Authors】: Guodong Long ; Ling Chen ; Xingquan Zhu ; Chengqi Zhang
【Abstract】: Short & sparse text is becoming more prevalent on the web, such as search snippets, micro-blogs and product reviews. Accurately classifying short & sparse text has emerged as an important yet challenging task. Existing work has considered utilizing external data (e.g. Wikipedia) to alleviate data sparseness, by appending topics detected from external data as new features. However, training a classifier on features concatenated from different spaces is not easy considering the features have different physical meanings and different significance to the classification task. Moreover, it exacerbates the "curse of dimensionality" problem. In this study, we propose a transfer classification method, TCSST, to exploit the external data to tackle the data sparsity issue. The transfer classifier will be learned in the original feature space. Considering that the labels of the external data may not be readily available or sufficient, TCSST further exploits the unlabeled external data to aid the transfer classification. We develop novel strategies to allow TCSST to iteratively select high quality unlabeled external data to help with the classification. We evaluate the performance of TCSST on both benchmark and real-world data sets. Our experimental results demonstrate that the proposed method is effective in classifying very short & sparse text, consistently outperforming existing and baseline methods.
【Keywords】: classification; external data; short & sparse text mining; transfer learning; wikipedia
【Paper Link】 【Pages】:773-782
【Authors】: Karla L. Caballero Espinosa ; Joel Barajas ; Ram Akella
【Abstract】: We present a new, robust and computationally efficient Hierarchical Bayesian model for effective topic correlation modeling. We model the prior distribution of topics by a Generalized Dirichlet distribution (GD) rather than a Dirichlet distribution as in Latent Dirichlet Allocation (LDA). We define this model as GD-LDA. This framework captures correlations between topics, as in the Correlated Topic Model (CTM) and Pachinko Allocation Model (PAM), and is faster to infer than CTM and PAM. GD-LDA effectively avoids over-fitting as the number of topics is increased. As a tree model, it accommodates the most important set of topics in the upper part of the tree based on their probability mass. Thus, GD-LDA provides the ability to choose significant topics effectively. To discover topic relationships, we perform hyper-parameter estimation based on Monte Carlo EM Estimation. We provide results using Empirical Likelihood (EL) in 4 public datasets from TREC and NIPS. Then, we present the performance of GD-LDA in ad hoc information retrieval (IR) based on MAP, P@10, and Discounted Gain. We discuss an empirical comparison of the fitting time. We demonstrate significant improvement over CTM, LDA, and PAM for EL estimation. For all the IR measures, GD-LDA shows higher performance than LDA, the dominant topic model in IR. All these improvements come with only a small increase in fitting time over LDA, in contrast to CTM and PAM.
【Keywords】: document representation; statistical topic modeling
【Paper Link】 【Pages】:783-792
【Authors】: Joon Hee Kim ; Dongwoo Kim ; Suin Kim ; Alice H. Oh
【Abstract】: Topic models such as latent Dirichlet allocation (LDA) and hierarchical Dirichlet processes (HDP) are simple solutions to discover topics from a set of unannotated documents. While they are simple and popular, a major shortcoming of LDA and HDP is that they do not organize the topics into a hierarchical structure which is naturally found in many datasets. We introduce the recursive Chinese restaurant process (rCRP) and a nonparametric topic model with rCRP as a prior for discovering a hierarchical topic structure with unbounded depth and width. Unlike previous models for discovering topic hierarchies, rCRP allows the documents to be generated from a mixture over the entire set of topics in the hierarchy. We apply rCRP to a corpus of New York Times articles, a dataset of MovieLens ratings, and a set of Wikipedia articles and show the discovered topic hierarchies. We compare the predictive power of rCRP with LDA, HDP, and nested Chinese restaurant process (nCRP) using heldout likelihood to show that rCRP outperforms the others. We suggest two metrics that quantify the characteristics of a topic hierarchy to compare the discovered topic hierarchies of rCRP and nCRP. The results show that rCRP discovers a hierarchy in which the topics become more specialized toward the leaves, and topics in the immediate family exhibit more affinity than topics beyond the immediate family.
【Keywords】: bayesian nonparametric models; hierarchical topic modeling
【Paper Link】 【Pages】:793-802
【Authors】: P. Deepak ; Karthik Visweswariah ; Nirmalie Wiratunga ; Sadiq Sani
【Abstract】: We consider the problem of segmenting text documents that have a two-part structure such as a problem part and a solution part. Documents of this genre include incident reports that typically involve description of events relating to a problem followed by those pertaining to the solution that was tried. Segmenting such documents into the two component parts would render them usable in knowledge reuse frameworks such as Case-Based Reasoning. This segmentation problem presents a hard case for traditional text segmentation due to the lexical inter-relatedness of the segments. We develop a two-part segmentation technique that can harness a corpus of similar documents to model the behavior of the two segments and their inter-relatedness using language models and translation models respectively. In particular, we use separate language models for the problem and solution segment types, whereas the inter-relatedness between segment types is modeled using an IBM Model 1 translation model. We model documents as being generated starting from the problem part that comprises words sampled from the problem language model, followed by the solution part whose words are sampled either from the solution language model or from a translation model conditioned on the words already chosen in the problem part. We show, through an extensive set of experiments on real-world data, that our approach outperforms the state-of-the-art text segmentation algorithms in the accuracy of segmentation, and that such improved accuracy translates well to improved usability in Case-Based Reasoning systems. We also analyze the robustness of our technique to varying amounts and types of noise and empirically illustrate that our technique is quite noise tolerant, and degrades gracefully with increasing amounts of noise.
【Keywords】: language models; segmentation; text; translation models
【Paper Link】 【Pages】:803-812
【Authors】: Samaneh Moghaddam ; Martin Ester
【Abstract】: Aspect-based opinion mining, which aims to extract aspects and their corresponding ratings from customer reviews, provides very useful information for customers to make purchase decisions. In the past few years several probabilistic graphical models have been proposed to address this problem, most of them based on Latent Dirichlet Allocation (LDA). While these models have a lot in common, there are some characteristics that distinguish them from each other. These fundamental differences correspond to major decisions that have been made in the design of the LDA models. While research papers typically claim that a new model outperforms the existing ones, there is normally no "one-size-fits-all" model. In this paper, we present a set of design guidelines for aspect-based opinion mining by discussing a series of increasingly sophisticated LDA models. We argue that these models represent the essence of the major published methods and allow us to distinguish the impact of various design decisions. We conduct extensive experiments on a very large real life dataset from Epinions.com (500K reviews) and compare the performance of different models in terms of the likelihood of the held-out test set and in terms of the accuracy of aspect identification and rating prediction.
【Keywords】: aspect identification; aspect-based opinion mining; latent dirichlet allocation; rating prediction; variational methods
【Paper Link】 【Pages】:813-822
【Authors】: Gad Markovits ; Anna Shtok ; Oren Kurland ; David Carmel
【Abstract】: Estimating the effectiveness of a search performed in response to a query in the absence of relevance judgments is the goal of query-performance prediction methods. Post-retrieval predictors analyze the result list of the most highly ranked documents. We address the prediction challenge for retrieval approaches wherein the final result list is produced by fusing document lists that were retrieved in response to a query. To that end, we present a novel fundamental prediction framework that accounts for the special characteristics of the fusion setting; i.e., the use of intermediate retrieved lists. The framework is based on integrating prediction performed upon the final result list with that performed upon the lists that were fused to create it; prediction integration is controlled based on inter-list similarities. We empirically demonstrate the merits of various predictors instantiated from the framework. As a case in point, their prediction quality substantially transcends that of applying state-of-the-art predictors upon the final result list.
【Keywords】: fusion; query-performance prediction
【Paper Link】 【Pages】:823-832
【Authors】: Oren Kurland ; Anna Shtok ; Shay Hummel ; Fiana Raiber ; David Carmel ; Ofri Rom
【Abstract】: The query-performance prediction task is estimating the effectiveness of a search performed in response to a query when no relevance judgments are available. Although there exist many effective prediction methods, these differ substantially in their basic principles, and rely on diverse hypotheses about the characteristics of effective retrieval. We present a novel fundamental probabilistic prediction framework. Using the framework, we derive and explain various previously proposed prediction methods that might seem completely different, but turn out to share the same formal basis. The derivations provide new perspectives on several predictors (e.g., Clarity). The framework is also used to devise new prediction approaches that outperform the state-of-the-art.
【Keywords】: query-performance prediction
【Paper Link】 【Pages】:833-842
【Authors】: Arvind Agarwal ; Hema Raghavan ; Karthik Subbian ; Prem Melville ; Richard D. Lawrence ; David Gondek ; James Fan
【Abstract】: This paper aims to solve the problem of improving the ranking of answer candidates for factoid based questions in a state-of-the-art Question Answering system. We first provide an extensive comparison of 5 ranking algorithms on two datasets, one from the Jeopardy quiz show and one from a medical domain. We then show the effectiveness of a cascading approach, where the ranking produced by one ranker is used as input to the next stage. The cascading approach shows sizeable gains on both datasets. We finally evaluate several rank aggregation techniques to combine these algorithms, and find that Supervised Kemeny aggregation is a robust technique that always beats the baseline ranking approach used by Watson for the Jeopardy competition. We further corroborate our results on TREC Question Answering datasets.
【Keywords】: question-answering; rank-aggregation; ranking
【Paper Link】 【Pages】:843-851
【Authors】: Maksims Volkovs ; Hugo Larochelle ; Richard S. Zemel
【Abstract】: We present a general treatment of the problem of aggregating preferences from several experts into a consensus ranking, in the context where information about a target ranking is available. Specifically, we describe how such problems can be converted into a standard learning-to-rank one on which existing learning solutions can be invoked. This transformation allows us to optimize the aggregating function for any target IR metric, such as Normalized Discounted Cumulative Gain, or Expected Reciprocal Rank. When applied to crowdsourcing and meta-search benchmarks, our new algorithm improves on state-of-the-art preference aggregation methods.
【Keywords】: crowdsourcing; meta-search; preference aggregation
【Paper Link】 【Pages】:852-861
【Authors】: Jian Zhou ; Hongyu Zhang
【Abstract】: For a large and complex software system, the project team could receive a large number of bug reports. Some bug reports could be duplicates as they essentially report the same problem. It is often tedious and costly to manually check if a newly reported bug is a duplicate of an already reported bug. In this paper, we propose BugSim, a method that can automatically retrieve duplicate bug reports given a new bug report. BugSim is based on learning to rank concepts. We identify textual and statistical features of bug reports and propose a similarity function for bug reports based on the features. We then construct a training set by assembling pairs of duplicate and non-duplicate bug reports. We train the weights of features by applying the stochastic gradient descent algorithm over the training set. For a new bug report, we retrieve candidate duplicate reports using the trained model. We evaluate BugSim using more than 45,100 real bug reports of twelve Eclipse projects. The evaluation results show that the proposed method is effective. On average, the recall rate for the top 10 retrieved reports is 76.11%. Furthermore, BugSim outperforms the previous state-of-the-art methods that are implemented using SVM and BM25Fext.
【Keywords】: bug reports; duplicate bug retrieval; duplicate documents; learning to rank; software maintenance
【Paper Link】 【Pages】:862-871
【Authors】: Zhou Zhao ; Wilfred Ng
【Abstract】: In recent years, RFID technologies have been used in many applications, such as inventory checking and object tracking. However, raw RFID data are inherently unreliable due to physical device limitations and different kinds of environmental noise. Currently, existing work mainly focuses on RFID data cleansing in a static environment (e.g. inventory checking). It is therefore difficult to cleanse RFID data streams in a mobile environment (e.g. object tracking) using the existing solutions, which do not address the data missing issue effectively. In this paper, we study how to cleanse RFID data streams for object tracking, which is a challenging problem, since a significant percentage of readings are routinely dropped. We propose a probabilistic model for object tracking in a mobile environment. We develop a Bayesian inference based approach for cleansing RFID data using the model. In order to sample data from the movement distribution, we devise a sequential sampler that cleans RFID data with high accuracy and efficiency. We validate the effectiveness and robustness of our solution through extensive simulations and demonstrate its performance by using two real RFID applications of human tracking and conveyor belt monitoring.
【Keywords】: data cleaning; probabilistic algorithms; uncertainty
【Paper Link】 【Pages】:872-881
【Authors】: Giansalvatore Mecca ; Paolo Papotti ; Salvatore Raunich ; Donatello Santoro
【Abstract】: Mapping and translating data across different representations is a crucial problem in information systems. Many formalisms and tools are currently used for this purpose, to the point that developers typically face a difficult question: "what is the right tool for my translation task?" In this paper, we introduce several techniques that contribute to answering this question. Among these are a fairly general definition of a data transformation system, a new and very efficient similarity measure to evaluate the outputs produced by such a system, and a metric to estimate user efforts. Based on these techniques, we are able to compare a wide range of systems on many translation tasks, to gain interesting insights about their effectiveness, and, ultimately, about their "intelligence".
【Keywords】: ETL; benchmarks; data transformation; schema mappings
【Paper Link】 【Pages】:882-891
【Authors】: Fereidoon Sadri
【Abstract】: Information integration has been a subject of research for several decades and still remains a very active research area. Many new applications depend on or benefit from large scale integration. Examples include large research projects in life sciences, need for data sharing among government agencies, reliance of corporations on business intelligence (which requires data integration from many heterogeneous sources), and integration of information on the web. The importance of information integration with uncertainty has been observed in recent years. Frequently, information from multiple sources is uncertain and possibly inconsistent. Further, the process of integration often depends on approximate schema mappings, another source of uncertainty. An integration system is useful only to the extent that the information it produces can be trusted. Hence, providing a measure of certainty for integrated information is of crucial importance in many applications. In this paper we study the problem of integration of uncertain information. We present a simple and intuitive approach to the representation and integration of uncertain information from multiple sources, and show that our integration approach coincides with a recent formalism for uncertain information integration. We extend the model to probabilistic possible-worlds, and show certain unintuitive constraints are imposed upon probabilities of possible-worlds of sources. In particular, we show the probabilities of possible worlds of a source are not independent; rather, they are dependent on probabilities of other sources. We study the problem of determining the probabilities for the result of integration. Finally, we present a practical approach to relaxing probabilistic constraints in integration.
【Keywords】: information integration; probabilistic data; uncertain data
【Paper Link】 【Pages】:892-901
【Authors】: Yusuke Kozawa ; Toshiyuki Amagasa ; Hiroyuki Kitagawa
【Abstract】: Uncertain databases have been widely developed to deal with the vast amount of data that contain uncertainty. To extract valuable information from the uncertain databases, several methods of frequent itemset mining, one of the major data mining techniques, have been proposed. However, their performance is not satisfactory because handling uncertainty incurs high processing costs. In order to address this problem, we utilize GPGPU (General-Purpose computation on GPU). GPGPU implies using a GPU (Graphics Processing Unit), which is originally designed for processing graphics, to accelerate general purpose computation. In this paper, we propose a method of frequent itemset mining from uncertain databases using GPGPU. The main idea is to speed up probability computations by making the best use of GPU's high parallelism and low-latency memory. We also employ an algorithm to manipulate a bitstring and data-parallel primitives to improve performance in the other parts of the method. Extensive experiments show that our proposed method is up to two orders of magnitude faster than existing methods.
【Keywords】: frequent itemset mining; gpgpu; uncertain databases
【Paper Link】 【Pages】:902-911
【Authors】: Werner Nutt ; Simon Razniewski
【Abstract】: Data completeness is an important aspect of data quality. We consider a setting where databases can be incomplete in two ways: records may be missing and records may contain null values. We (i) formalize when the answer set of a query is complete in spite of such incompleteness, and (ii) introduce table completeness statements, by which one can express that certain parts of a database are complete. We then study how to deduce from a set of table-completeness statements that a query can be answered completely. Null values as used in SQL are ambiguous. They can indicate either that no attribute value exists or that a value exists, but is unknown. We study completeness reasoning for the different interpretations. We show that in the combined case it is necessary to syntactically distinguish between different kinds of null values and present an encoding for doing that in standard SQL databases. With this technique, any SQL DBMS evaluates complete queries correctly with respect to the different meanings that nulls can carry. We study the complexity of completeness reasoning and provide algorithms that in most cases agree with the worst-case lower bounds.
【Keywords】: data completeness; data quality; metadata management
【Paper Link】 【Pages】:912-921
【Authors】: Aleksandar Stupar ; Sebastian Michel
【Abstract】: Focusing on the top-K items according to a ranking criterion constitutes an important functionality in many different query answering scenarios. The idea is to read only the necessary information, mostly from secondary storage, with the ultimate goal of achieving low latency. In this work, we consider processing such top-K queries under the constraint that the result items are members of a specific set, which is provided at query time. We call this restriction a set-defined selection criterion. Set-defined selections drastically influence the pros and cons of an id-ordered index vs. a score-ordered index. We present a mathematical model that allows deciding at runtime which index to choose, leading to a combined index. To improve the latency around the break-even point of the two indices, we show how to benefit from a partitioned score-ordered index and present an algorithm to create such partitions based on analyzing query logs. Further performance gains can be obtained using approximate top-K results, with tunable result quality. The presented approaches are evaluated using both real-world and synthetic data.
【Keywords】: index partitioning; top-k query processing
【Paper Link】 【Pages】:922-931
【Authors】: Liming Zhan ; Ying Zhang ; Wenjie Zhang ; Xuemin Lin
【Abstract】: Uncertainty is inherent in many important applications, such as location-based services (LBS), sensor monitoring and radio-frequency identification (RFID). Recently, considerable research efforts have been put into the field of uncertainty-aware spatial query processing. In this paper, we study the problem of finding top k most influential facilities over a set of uncertain objects, which is an important spatial query in the above applications. Based on the maximal utility principle, we propose a new ranking model to identify the top k most influential facilities, which carefully captures influence of facilities on the uncertain objects. By utilizing two uncertain object indexing techniques, R-tree and U-Quadtree, effective and efficient algorithms are proposed following the filtering and verification paradigm, which significantly improves the performance of the algorithms in terms of CPU and I/O costs. Comprehensive experiments on real datasets demonstrate the effectiveness and efficiency of our techniques.
【Keywords】: spatial; uncertain
【Paper Link】 【Pages】:932-941
【Authors】: Weihuang Huang ; Guoliang Li ; Kian-Lee Tan ; Jianhua Feng
【Abstract】: Many real-world applications need to support moving spatial keyword queries. For example, a tourist looking for the top-k "seafood restaurants" while walking in a city will continuously issue moving queries. However, existing spatial keyword search methods focus on static queries, which calls for new techniques to support moving queries efficiently. In this paper, we propose an effective method to support moving top-k spatial keyword queries. In addition to finding the top-k answers of a moving query, we also calculate a safe region such that if a new query's location falls in the safe region, we can directly use the existing answer set to answer the query. To this end, we propose an effective model to represent the safe region and devise efficient search algorithms to compute the safe region. We have implemented our method, and experimental results on real datasets show that our method achieves high efficiency and outperforms existing methods significantly.
【Keywords】: moving top-k spatial keyword queries; safe region
【Paper Link】 【Pages】:942-951
【Authors】: Da Yan ; Zhou Zhao ; Wilfred Ng
【Abstract】: Finding reverse nearest neighbors (RNNs) is an important operation in spatial databases. The problem of evaluating RNN queries has already received considerable attention due to its importance in many real-world applications, such as resource allocation and disaster response. While RNN query processing has been extensively studied in Euclidean space, no prior work has studied this problem on land surfaces. However, practical applications of RNN queries involve terrain surfaces that constrain object movements, which renders the existing algorithms inapplicable. In this paper, we investigate the evaluation of two types of RNN queries on land surfaces: monochromatic RNN (MRNN) queries and bichromatic RNN (BRNN) queries. On a land surface, the distance between two points is calculated as the length of the shortest path along the surface. However, the computational cost of the state-of-the-art shortest path algorithm on a land surface is quadratic in the size of the surface model, which is usually very large. As a result, surface RNN query processing is a challenging problem. Leveraging some newly discovered properties of Voronoi cell approximation structures, we make use of standard index structures such as an R-tree to design efficient algorithms that accelerate the evaluation of MRNN and BRNN queries on land surfaces. Our proposed algorithms are able to localize query evaluation by accessing just a small fraction of the surface data near the query point, which helps avoid shortest path evaluation on a large surface. Extensive experiments are conducted on large real-world datasets to demonstrate the efficiency of our algorithms.
【Keywords】: land surface; reverse nearest neighbor; terrain
【Paper Link】 【Pages】:952-961
【Authors】: Tom Crecelius ; Ralf Schenkel
【Abstract】: An important building block of many graph applications such as searching in social networks, keyword search in graphs, and retrieval of linked documents is retrieving the transitive neighbors of a node in ascending order of their distances. Since large graphs cannot be kept in memory and graph traversals at query time would be prohibitively expensive, the list of neighbors for each node is usually precomputed and stored in a compact form. While the problem of precomputing all-pairs shortest distances has been well studied for decades, efficiently maintaining this information when the graph changes is not as well understood. This paper presents an algorithm for maintaining nearest neighbor lists in weighted graphs under node insertions and decreasing edge weights. It considers the important case where queries are a lot more frequent than updates, and presents two approaches for transparently performing necessary index updates while executing queries. Extensive experiments with large graphs, including a subset of Twitter's user graph, demonstrate that the overhead for this maintenance is small.
【Keywords】: databases; incremental apsd; shortest paths
【Paper Link】 【Pages】:962-971
【Authors】: Krishna Yeswanth Kamath ; James Caverlee ; Zhiyuan Cheng ; Daniel Z. Sui
【Abstract】: In this paper we seek to understand and model the global spread of social media. How does social media spread from location to location across the globe? Can we model this spread and predict where social media will be popular in the future? Toward answering these questions, we develop a probabilistic model that synthesizes two conflicting hypotheses about the nature of online information spread: (i) the spatial influence model, which asserts that social media spreads to locations that are close by; and (ii) the community affinity influence model, which asserts that social media spreads between locations that are culturally connected, even if they are distant. Based on the geospatial footprint of 755 million geo-tagged hashtags spread through Twitter, we evaluate these models at predicting locations that will adopt hashtags in the future. We find that distance is the single most important explanation of future hashtag adoption since hashtags are fundamentally local. We also find that community affinities (like culture, language, and common interests) enhance the quality of purely spatial models, indicating the necessity of incorporating non-spatial features into models of global social media spread.
【Keywords】: information diffusion models; social media; virtual communities
【Paper Link】 【Pages】:972-981
【Authors】: Xuning Tang ; Christopher C. Yang
【Abstract】: The rapid development of online social media sites is accompanied by the generation of tremendous web contents. Web users are shifting from data consumers to data producers. As a result, topic detection and tracking without taking users' interests into account is not enough. This paper presents a statistical model that can detect interpretable trends and topics from document streams, where each trend (short for trending story) corresponds to a series of continuing events or a storyline. A topic is represented by a cluster of words frequently co-occurred. A trend can contain multiple topics and a topic can be shared by different trends. In addition, by leveraging a Recurrent Chinese Restaurant Process (RCRP), the number of trends in our model can be determined automatically without human intervention, so that our model can better generalize to unseen data. Furthermore, our proposed model incorporates user interest to fully simulate the generation process of web contents, which offers the opportunity for personalized recommendation in online social media. Experiments on three different datasets indicated that our proposed model can capture meaningful topics and trends, monitor rise and fall of detected trends, outperform baseline approach in terms of perplexity on held-out dataset, and improve the result of user participation prediction by leveraging users' interests to different trends.
【Keywords】: evolution; modeling; topic; trend; user interest
【Paper Link】 【Pages】:982-991
【Authors】: Shu Huang ; Min Chen ; Bo Luo ; Dongwon Lee
【Abstract】: How to accurately model and predict the future status of social networks has become an important problem in recent years. Conventional solutions to such a problem often employ topological structure of the sociogram, i.e., friendship links. However, they often disregard different levels of activeness of social actors and become insufficient to deal with complex dynamics of user behaviors. In this paper, to address this issue, we first refine the notion of social activity to better describe dynamic user behaviors in social networks. We then propose a Parameterized Social Activity Model (PSAM) using continuous-time stochastic process for predicting aggregate social activities. With social activities evolving over time, PSAM itself also evolves and therefore dynamically captures the real-time characteristics of the current active population. Our experiments using two real social networks (Facebook and CiteSeer) reveal that the proposed PSAM model is effective in simulating social activity evolution and predicting aggregate social activities accurately at different time scales.
【Keywords】: aggregate social activity; continuous-time stochastic process
【Paper Link】 【Pages】:992-1001
【Authors】: Partha Pratim Talukdar ; Derry Tanti Wijaya ; Tom M. Mitchell
【Abstract】: We consider the problem of automatically acquiring knowledge about the typical temporal orderings among relations (e.g., actedIn(person, film) typically occurs before wonPrize (film, award)), given only a database of known facts (relation instances) without time information, and a large document collection. Our approach is based on the conjecture that the narrative order of verb mentions within documents correlates with the temporal order of the relations they represent. We propose a family of algorithms based on this conjecture, utilizing a corpus of 890m dependency parsed sentences to obtain verbs that represent relations of interest, and utilizing Wikipedia documents to gather statistics on narrative order of verb mentions. Our proposed algorithm, GraphOrder, is a novel and scalable graph-based label propagation algorithm that takes transitivity of temporal order into account, as well as these statistics on narrative order of verb mentions. This algorithm achieves as high as 38.4% absolute improvement in F1 over a random baseline. Finally, we demonstrate the utility of this learned general knowledge about typical temporal orderings among relations, by showing that these temporal constraints can be successfully used by a joint inference framework to assign specific temporal scopes to individual facts.
【Keywords】: graph-based semi-supervised learning; knowledge bases; label propagation; narrative ordering; temporal ordering; temporal scoping
【Paper Link】 【Pages】:1015-1024
【Authors】: Matthias Hagen ; Martin Potthast ; Anna Beyer ; Benno Stein
【Abstract】: Query segmentation is the problem of identifying those keywords in a query that together form compound concepts or phrases like "new york times". Such segments can help a search engine to better interpret a user's intents and to tailor the search results more appropriately. Our contributions to this problem are threefold. (1) We conduct the first large-scale study of human segmentation behavior based on more than 500000 segmentations. (2) We show that the traditionally applied segmentation accuracy measures are not appropriate for such large-scale corpora and introduce new, more robust measures. (3) We develop a new query segmentation approach with the basic idea that, in cases of doubt, it is often better to (partially) leave queries without any segmentation. This new in-doubt-without approach chooses different segmentation strategies depending on query types. A large-scale evaluation shows substantial improvement upon the state of the art in terms of segmentation accuracy. To draw a complete picture, we also evaluate the impact of segmentation strategies on retrieval performance in a TREC setting. It turns out that more accurate segmentation does not necessarily yield better retrieval performance. Based on this insight, we propose an in-doubt-without variant which achieves the best retrieval performance despite leaving many queries unsegmented. But there is still room for improvement: the optimum segmentation strategy, which always chooses the segmentation that maximizes retrieval performance, significantly outperforms all other tested approaches.
【Keywords】: query segmentation
【Paper Link】 【Pages】:1025-1034
【Authors】: Abdigani Diriye ; Ryen White ; Georg Buscher ; Susan T. Dumais
【Abstract】: Users of search engines often abandon their searches. Despite the high frequency of Web search abandonment and its importance to Web search engines, little is known about why searchers abandon beyond that it can be for good or bad reasons. In this paper, we extend previous work by studying search abandonment using both a retrospective survey and an in-situ method that captures abandonment rationales at abandonment time. We show that although satisfaction is a common motivator for abandonment, one-in-five abandonment instances does not relate to satisfaction. We also studied the automatic prediction of the underlying reason for observed abandonment. We used features of the query and the results, interaction with the result page (e.g., cursor movements, scrolling, clicks), and the full search session. We show that our classifiers can learn to accurately predict the reasons for observed search abandonment. Such accurate predictions help search providers estimate user satisfaction for queries without clicks, affording a more complete understanding of search engine performance.
【Keywords】: abandonment rationales; web search abandonment
【Paper Link】 【Pages】:1035-1044
【Authors】: Huizhong Duan ; Emre Kiciman ; ChengXiang Zhai
【Abstract】: Understanding users' search intents is a critical component of modern search engines. A key limitation of most query log analyses is the assumption that each clicked web result represents one unique intent. However, there are many search tasks, such as comparison shopping or in-depth research, where a user's intent is to explore many documents. In these cases, the assumption of a one-to-one correspondence between clicked documents and user intent breaks down. To capture and understand such behaviors, we propose the use of click patterns. Click patterns capture the relationship among clicks on search results by treating the set of clicks made by a user as a single unit. We aggregate click patterns together using a hierarchical clustering algorithm to discover the common click patterns. By using click patterns as an empirical representation of user intent, we are able to create a rich representation of mixtures of multiple navigational and informational intents. We analyze real search logs and demonstrate that such complex mixtures of intents do occur in the wild and can be identified using click patterns. We further demonstrate the usefulness of click patterns by integrating them into a measure of query ambiguity and into a query recommendation task. We show that calculating query ambiguity as the entropy over the distribution of click patterns provides a measure of ambiguity with improved discriminative power, consistency and temporal stability as compared to previous measures of ambiguity. We explore the use of click pattern similarity and click pattern entropy in generating query recommendations and show promising results.
【Keywords】: click pattern; click profile; entropy; query ambiguity
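The ambiguity measure described in the abstract above, entropy over the distribution of click patterns, can be illustrated with a minimal sketch. It assumes a click pattern is simply the unordered set of results clicked in one query impression; the function name `click_pattern_entropy` and the toy logs are hypothetical, and the hierarchical clustering step the authors describe is not reproduced.

```python
import math
from collections import Counter


def click_pattern_entropy(impressions):
    """Shannon entropy (in bits) over the distribution of click patterns.

    Each impression is an iterable of clicked result identifiers. All clicks
    in one impression are treated as a single unit (an unordered set).
    """
    patterns = Counter(frozenset(clicks) for clicks in impressions)
    total = sum(patterns.values())
    return -sum((n / total) * math.log2(n / total) for n in patterns.values())


# Hypothetical toy logs for two queries (not data from the paper).
navigational = [["a.com"], ["a.com"], ["a.com"], ["a.com"]]
exploratory = [["a.com", "b.com"], ["b.com"], ["a.com"], ["c.com", "a.com"]]

low = click_pattern_entropy(navigational)   # one pattern only
high = click_pattern_entropy(exploratory)   # four equally likely patterns
```

Under this simplification, a query whose users always click the same result scores zero, while a query with many distinct click sets scores higher.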
【Paper Link】 【Pages】:1045-1054
【Authors】: Van Dang ; Giridhar Kumaran ; Adam D. Troy
【Abstract】: Query reformulation has been studied as a domain independent task. Existing work attempts to expand a query or substitute its terms with the same set of candidates regardless of the domain of this query. Since terms might be semantically related in one domain but not in others, it is more effective to provide candidates for queries with respect to their domain. This paper demonstrates the advantage of this domain dependent query reformulation approach, which learns its candidates, using a standard technique, for each domain from a separate sample of data derived automatically from a generic query log. Our results show that our approach statistically significantly outperforms the domain independent approach, which learns to reformulate from the same log using the same technique, on a large query set consisting of both health and commerce queries. Our results have very practical interpretation: while building different reformulation systems to handle queries from different domains does not require additional manual effort, it provides substantially better retrieval effectiveness than having a single system handling all queries. Additionally, we show that leveraging domain specific manually labelled data leads to further improvement.
【Keywords】: domain dependent; query log; query reformulation
【Paper Link】 【Pages】:1055-1064
【Authors】: Anish Das Sarma ; Ankur Jain ; Ashwin Machanavajjhala ; Philip Bohannon
【Abstract】: De-duplication - identification of distinct records referring to the same real-world entity - is a well-known challenge in data integration. Since very large datasets prohibit the comparison of every pair of records, blocking has been identified as a technique of dividing the dataset for pairwise comparisons, thereby trading off recall of identified duplicates for efficiency. Traditional de-duplication tasks, while challenging, typically involved a fixed schema such as Census data or medical records. However, with the presence of large, diverse sets of structured data on the web and the need to organize it effectively on content portals, de-duplication systems need to scale in a new dimension to handle a large number of schemas, tasks and data sets, while handling ever larger problem sizes. In addition, when working in a map-reduce framework it is important that canopy formation be implemented as a hash function, making the canopy design problem more challenging. We present CBLOCK, a system that addresses these challenges. CBLOCK learns hash functions automatically from attribute domains and a labeled dataset consisting of duplicates. Subsequently, CBLOCK expresses blocking functions using a hierarchical tree structure composed of atomic hash functions. The application may guide the automated blocking process based on architectural constraints, such as by specifying a maximum size of each block (based on memory requirements), impose disjointness of blocks (in a grid environment), or specify a particular objective function trading off recall for efficiency. As a post-processing step to automatically generated blocks, CBLOCK rolls-up smaller blocks to increase recall. We present experimental results on two large-scale de-duplication datasets from a commercial search engine - consisting of over 140K movies and 40K restaurants respectively - and demonstrate the utility of CBLOCK.
【Keywords】: blocking; canopy formation; de-duplication
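The general idea of blocking mentioned in the abstract above, hashing records into blocks and comparing pairs only within a block, can be sketched as follows. This is a generic illustration under stated assumptions, not CBLOCK itself: the hand-written key function, the record fields, and the helper names are hypothetical, whereas CBLOCK learns its hash functions and arranges them in a tree.

```python
from collections import defaultdict
from itertools import combinations


def build_blocks(records, key_fn):
    """Group records by a blocking key (a hash-function-style assignment)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    return blocks


def candidate_pairs(blocks):
    """Yield only the record pairs that share a block."""
    for members in blocks.values():
        yield from combinations(members, 2)


# Hypothetical toy records; ids 1, 2 and 4 refer to the same movie.
records = [
    {"id": 1, "title": "The Matrix", "year": 1999},
    {"id": 2, "title": "the matrix (1999)", "year": 1999},
    {"id": 3, "title": "Titanic", "year": 1997},
    {"id": 4, "title": "Matrix, The", "year": 1999},
]


def title_prefix_key(rec):
    # Assumed hand-written key: first three title letters plus year.
    return (rec["title"][:3].lower(), rec["year"])


blocks = build_blocks(records, title_prefix_key)
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
```

With this key only one of the six possible pairs is compared, and the duplicate with id 4 is missed, which shows the recall-for-efficiency trade-off the abstract refers to.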
【Paper Link】 【Pages】:1065-1074
【Authors】: Nelly Vouzoukidou ; Bernd Amann ; Vassilis Christophides
【Abstract】: In this work we are interested in the scalable processing of content filtering queries over text item streams. In particular, we are aiming to generalize state of the art solutions with non-homogeneous scoring functions combining query-independent item importance with query-dependent content relevance. While such complex ranking functions are widely used in web search engines this is to our knowledge the first scientific work studying their usage in a continuous query scenario. Our main contribution consists in the definition and the evaluation of new efficient in-memory data structures for indexing continuous top-k queries based on an original two-dimensional representation of text queries. We are exploring locally-optimal score bounds and heuristics that efficiently prune the search space of candidate top-k query results which have to be updated at the arrival of new stream items. Finally, we experimentally evaluate memory/matching time trade-offs of these index structures. In particular we experimentally illustrate their linear scaling behavior with respect to the number of indexed queries.
【Keywords】: continuous top-k query processing; non-homogeneous scoring functions; text streams
【Paper Link】 【Pages】:1075-1084
【Authors】: Abhijith Kashyap ; Vagelis Hristidis
【Abstract】: Result snippets are used by most search interfaces to preview query results. Snippets help users quickly decide the relevance of the results, thereby reducing the overall search time and effort. Most work on snippets has focused on text snippets for Web pages in Web search. However, little work has studied the problem of snippets for structured data, e.g., product catalogs. Furthermore, all prior work has focused on the important goal of creating informative snippets, but has ignored the amount of user effort required to comprehend, i.e., read and digest, the displayed snippets. In particular, it implicitly assumes that the comprehension effort or cost only depends on the length of the snippet, which we show is incorrect for structured data. We propose novel techniques to construct snippets of structured heterogeneous results, which not only select the most informative attributes for each result, but also minimize the expected user effort (time) to comprehend these snippets. We create a comprehension model to quantify the effort incurred by users in comprehending a list of result snippets. Our model is supported by an extensive user study. A key observation is that the user effort for comprehending an attribute across multiple snippets only depends on the number of unique positions (e.g., indentations) where this attribute is displayed and not on the number of occurrences. We analyze the complexity of the snippet construction problem and show that the problem is NP-hard, even when we only consider the comprehension cost. We present efficient approximate algorithms, and experimentally demonstrate their effectiveness and efficiency.
【Keywords】: information overload; query interfaces; result snippets
【Paper Link】 【Pages】:1085-1094
【Authors】: Xing Niu ; Shu Rong ; Haofen Wang ; Yong Yu
【Abstract】: Publishing structured data and linking them to Linking Open Data (LOD) is an ongoing effort to create a Web of data. Each newly involved data source may contain duplicated instances (entities) whose descriptions or schemata differ from those of the existing sources in LOD. To tackle this heterogeneity issue, several matching methods have been developed to link equivalent entities together. Many general-purpose matching methods which focus on similarity metrics suffer from very diverse matching results for different data source pairs. On the other hand, the dataset-specific ones leverage heuristic rules or even manual efforts to ensure the quality, which makes it impossible to apply them to other sources or domains. In this paper, we offer a third choice, a general method of automatically discovering dataset-specific matching rules. In particular, we propose a semi-supervised learning algorithm to iteratively refine matching rules and find new matches of high confidence based on these rules. This dramatically relieves the burden on users of defining rules but still gives high-quality matching results. We carry out experiments on real-world large scale data sources in LOD; the results show the effectiveness of our approach in terms of the precision of discovered matches and the number of missing matches found. Furthermore, we discuss several extensions (like similarity embedded rules, class restriction and SPARQL rewriting) to fit various applications with different requirements.
【Keywords】: association rule mining; em algorithm; instance matching; semi-supervised learning
【Paper Link】 【Pages】:1095-1104
【Authors】: Yi Jia ; Wenrong Zeng ; Jun Huan
【Abstract】: Non-stationary Dynamic Bayesian Networks (Non-stationary DBNs) are widely used to model the temporal changes of directed dependency structures from multivariate time series data. However, the existing change-points based non-stationary DBNs methods have several drawbacks including excessive computational cost, and low convergence speed. In this paper we proposed a novel non-stationary DBNs method. Our method is based on the perfect simulation model. We applied this approach for network structure inference from synthetic data and biological microarray gene expression data and compared it with other two state-of-the-art non-stationary DBNs methods. The experimental results demonstrated that our method outperformed two other state-of-the-art methods in both computational cost and structure prediction accuracy. The further sensitivity analysis showed that once converged our model is robust to large parameter ranges, which reduces the uncertainty of the model behavior.
【Keywords】: dynamic bayesian networks; markov chain monte carlo; perfect simulation
【Paper Link】 【Pages】:1105-1112
【Authors】: Ang Sun ; Ralph Grishman
【Abstract】: Relation extraction is the process of identifying instances of specified types of semantic relations in text; relation type extension involves extending a relation extraction system to recognize a new type of relation. We present LGCo-Testing, an active learning system for relation type extension based on local and global views of relation instances. Locally, we extract features from the sentence that contains the instance. Globally, we measure the distributional similarity between instances from a 2 billion token corpus. Evaluation on the ACE 2004 corpus shows that LGCo-Testing can reduce annotation cost by 97% while maintaining the performance level of supervised learning.
【Keywords】: co-testing; co-training; distributional similarity; active learning; information extraction; relation extraction
【Paper Link】 【Pages】:1113-1122
【Authors】: Sriram Srinivasan ; Sourangshu Bhattacharya ; Rudrasis Chakraborty
【Abstract】: Segmentation of a string of English language characters into a sequence of words has many applications. Here, we study two applications in the internet domain. The first application is web domain segmentation, which is crucial for monetization of broken URLs. Second, we propose and study a novel application of Twitter hashtag segmentation for increasing recall on Twitter searches. Existing methods for word segmentation use unsupervised language models. We find that when using multiple corpora, the joint probability model from multiple corpora performs significantly better than the individual corpora. Motivated by this, we propose a weighted joint probability model, with weights specific to each corpus. We propose to train the weights in a supervised manner using max-margin methods. The supervised probability models improve segmentation accuracy over joint probability models. Finally, we observe that the length of segments is an important parameter for word segmentation, and incorporate length-specific weights into our model. The length-specific models further improve segmentation accuracy over supervised probability models. For all models proposed here, the inference problem can be solved using a dynamic programming algorithm. We test our methods on five different datasets, two from web domains data, and three from news headlines data from an LDC dataset. The supervised length-specific models show significant improvements over unsupervised single-corpus and joint probability models. Cross-testing between the datasets confirms that supervised probability models trained on all datasets, and length-specific models trained on news headlines data, generalize well. Segmentation of hashtags results in significant improvement in recall on searches for Twitter trends.
【Keywords】: compound splitting; hashtag segmentation; structured learning; web domain segmentation; word segmentation
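The abstract above notes that inference for these segmentation models can be solved by dynamic programming. A minimal sketch of that kind of inference follows, assuming a plain single-corpus unigram model. The toy counts, the unknown-word penalty, and the names `segment` and `toy_logprob` are hypothetical, and the corpus-specific and length-specific weights learned in the paper are not modeled.

```python
import math


def segment(text, word_logprob, max_len=20):
    """Return the split of text with the highest total word log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start of the last word in that split
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            score = best[j] + word_logprob(text[j:i])
            if score > best[i]:
                best[i], back[i] = score, j
    # Walk the back-pointers from the end to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Hypothetical toy unigram counts (not from any corpus used in the paper).
COUNTS = {"new": 50, "york": 30, "times": 40, "time": 60, "s": 5}
TOTAL = sum(COUNTS.values())


def toy_logprob(word):
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    # Assumed penalty for unknown strings, growing with their length.
    return math.log(1.0 / TOTAL) - 5.0 * len(word)


result = segment("newyorktimes", toy_logprob)
```

Replacing `toy_logprob` with a weighted combination of per-corpus scores would leave the recurrence unchanged, since only the per-word score function differs.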
【Paper Link】 【Pages】:1123-1132
【Authors】: André Blessing ; Hinrich Schütze
【Abstract】: We propose crosslingual distant supervision (crosslingual DS) for relation extraction, an approach that automatically extracts labels from a pivot language for labeling one or more target languages. The approach has two benefits compared to standard DS: (i) increased coverage if target language labels are not available; and (ii) higher accuracy of automatically generated labels because noisy labels are eliminated in crosslingual filtering. An evaluation for two relations of different complexity shows that crosslingual DS increases the accuracy of relation extraction. Our approach is language independent; we successfully apply it to four different languages: Chinese, English, French and German.
【Keywords】: crosslingual distant supervision; relation extraction
【Paper Link】 【Pages】:1133-1142
【Authors】: Siddharth Patwardhan ; Branimir Boguraev ; Apoorv Agarwal ; Alessandro Moschitti ; Jennifer Chu-Carroll
【Abstract】: State-of-the-art approaches to token labeling within text documents typically cast the problem either as a classification task, without using complex structural characteristics of the input, or as a sequential labeling task, carried out by a Conditional Random Field (CRF) classifier. Here we explore principled ways for structure to be brought to bear on the task. In line with recent trends in statistical learning of structured natural language input, we use a Support Vector Machine (SVM) classification framework deploying tree kernels. We then propose tree transformations and decorations, as a methodology for modeling complex linguistic phenomena in highly multi-dimensional feature spaces. We develop a general purpose tree engineering framework, which enables us to transcend the typically complex and laborious process of feature engineering. We build kernel based classifiers for two token labeling tasks: fine-grained event recognition, and lexical answer type detection in questions. For both, we show that in comparison with a corresponding linear kernel SVM, our method of using tree kernels improves recognition, thanks to appropriately engineering tree structures for use by the tree kernel. We also observe significant improvements when comparing with a CRF-based realization of structured prediction, itself performing at levels comparable to state-of-the-art.
【Keywords】: support vector machines; token classification; tree kernels
【Paper Link】 【Pages】:1143-1152
【Authors】: Di Jiang ; Jan Vosecky ; Kenneth Wai-Ting Leung ; Wilfred Ng
【Abstract】: Search engine query log is an important information source that contains millions of users' interests and information needs. In this paper, we tackle the problem of discovering latent geographic search topics via mining search engine query logs. A novel framework G-WSTD that contains search session derivation, geographic information extraction and geographic search topic discovery is developed to support a variety of downstream web applications. The core components of the framework are two topic models, which discover geographic search topics from two different perspectives. The first one is the Discrete Search Topic Model (DSTM), which aims to capture the semantic commonalities across discrete geographic locations. The second one is the Regional Search Topic Model (RSTM), which focuses on a specific region on the map and discovers web search topics that demonstrate geographic locality. We evaluate our framework against several strong baselines on a real-life query log. The framework demonstrates improved data interpretability, better prediction performance and higher topic distinctiveness in the experimentation. The effectiveness of the framework is also verified by applications such as user profiling and URL annotation.
【Keywords】: geographic; query log; search engine
【Paper Link】 【Pages】:1153-1162
【Authors】: Chee Wee Leong ; Silviu Cucerzan
【Abstract】: Fact verification has become an important task due to the increased popularity of blogs, discussion groups, and social sites, as well as of encyclopedic collections that aggregate content from many contributors. We investigate the task of automatically retrieving supporting evidence from the Web for factual statements. Using Wikipedia as a starting point, we derive a large corpus of statements paired with supporting Web documents, which we employ further as training and test data under the assumption that the contributed references to Wikipedia represent some of the most relevant Web documents for supporting the corresponding statements. Given a factual statement, the proposed system first transforms it into a set of semantic terms by using machine learning techniques. It then employs a quasi-random strategy for selecting subsets of the semantic terms according to topical likelihood. These semantic terms are used to construct queries for retrieving Web documents via a Web search API. Finally, the retrieved documents are aggregated and re-ranked by employing additional measures of their suitability to support the factual statement. To gauge the quality of the retrieved evidence, we conduct a user study through Amazon Mechanical Turk, which shows that our system is capable of retrieving supporting Web documents comparable to those chosen by Wikipedia contributors.
【Keywords】: fact verification; semantic term extraction; supporting evidence; web references; web search; wikipedia
【Paper Link】 【Pages】:1163-1172
【Authors】: Haitao Yu ; Fuji Ren
【Abstract】: Understanding the information need or intent encoded within a query has long been regarded as an essential factor of effective information retrieval. For better query representation and understanding, two intent roles (kernel-object and modifier) are introduced to structurally parse a class of role-explicit queries, which constitute a majority of common user queries. Furthermore, we focus on two research problems: RP-1: Given a role-explicit query, how to identify the kernel-object and modifier, namely intent role annotation; RP-2: How to determine whether an arbitrary query is role-explicit or not. To solve RP-1, we propose a simplified word n-gram role model (SWNR), which quantifies the generating probability of a role-explicit query and performs intent role annotation effectively. Using a set of discriminative features, we build classifiers to address RP-2 in a supervised manner. The experimental results show that: (1) SWNR can achieve a satisfactory performance, more than 73% in terms of different metrics; (2) The classifiers can achieve more than 90% precision in identifying role-explicit queries; (3) Compared with traditional techniques for query representation and understanding, e.g., named entity recognition in queries and class-level query intent inference, intent role annotation provides a more flexible framework, and a number of applications can benefit from annotating role-explicit queries, such as intent mining and diversified document ranking.
【Keywords】: kernel-object; modifier; query understanding; role-explicit
【Paper Link】 【Pages】:1173-1182
【Authors】: Wei Gao ; Peng Li ; Kareem Darwish
【Abstract】: Social media streams such as Twitter are regarded as faster first-hand sources of information generated by massive numbers of users. The content diffused through this channel, although noisy, provides an important complement, and sometimes even a substitute, to traditional news media reporting. In this paper, we propose a novel unsupervised approach based on topic modeling to summarize trending subjects by jointly discovering the representative and complementary information from news and tweets. Our method captures the content that enriches the subject matter by reinforcing the identification of complementary sentence-tweet pairs. To evaluate the complementarity of a pair, we leverage topic modeling formalism by combining a two-dimensional topic-aspect model and a cross-collection approach in the multi-document summarization literature. The final summaries are generated by co-ranking the news sentences and tweets on both sides simultaneously. Experiments give promising results as compared to state-of-the-art baselines.
【Keywords】: complementary summary; cross-collection topic-aspect model; gibbs sampling; lda
【Paper Link】 【Pages】:1183-1192
【Authors】: Shirui Pan ; Xingquan Zhu
【Abstract】: In this paper, we propose to query correlated graphs in a data stream scenario, where, given a query graph q, an algorithm is required to retrieve all the subgraphs whose Pearson's correlation coefficients with q are greater than a threshold Θ over graph data flowing in a stream fashion. Due to the dynamically changing nature of the stream data and the inherent complexity of the graph query process, treating graph streams as static datasets is computationally infeasible or ineffective. In the paper, we propose a novel algorithm, CGStream, to identify correlated graphs from a data stream, by using a sliding window which covers a number of consecutive batches of stream data records. Our theme is to regard a stream query as a traversal along a data stream, with the query answered at a number of outlooks over the data stream. For each outlook, we derive a lower frequency bound to mine a set of frequent subgraph candidates, where the lower bound guarantees that no pattern is missing from the current outlook to the next outlook. On top of that, we derive an upper correlation bound and a heuristic rule to prune the candidate size, which helps reduce the computation cost at each outlook. Experimental results demonstrate that the proposed algorithm is several times, or even an order of magnitude, more efficient than the straightforward algorithm. Meanwhile, our algorithm achieves good performance in terms of query precision.
【Keywords】: correlated graph; data stream; pearson's correlation coefficient
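As a rough illustration of the quantity being thresholded in the abstract above, the sketch below computes Pearson's correlation between two 0/1 occurrence indicators over a fixed-size sliding window. It assumes the occurrence-based formulation (each stream record either contains a graph or does not); the names `pearson_binary` and `SlidingWindowCorrelation` are hypothetical, the window here counts records rather than batches, and none of the paper's frequency or correlation bounds are reproduced.

```python
from collections import deque
from math import sqrt


def pearson_binary(x, y):
    """Pearson correlation of two equal-length 0/1 occurrence vectors."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("vectors must be non-empty and of equal length")
    sup_x = sum(x) / n                                # fraction of records containing q
    sup_y = sum(y) / n                                # fraction of records containing g
    sup_xy = sum(a * b for a, b in zip(x, y)) / n     # fraction containing both
    denom = sqrt(sup_x * (1 - sup_x) * sup_y * (1 - sup_y))
    if denom == 0:
        # Correlation is undefined when a graph occurs in all or no records;
        # returning 0.0 is a choice made for this sketch only.
        return 0.0
    return (sup_xy - sup_x * sup_y) / denom


class SlidingWindowCorrelation:
    """Keeps the most recent `window_size` records (an illustrative simplification)."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)

    def add(self, has_q, has_g):
        # Record whether the incoming stream record contains q and/or g.
        self.window.append((int(has_q), int(has_g)))

    def correlation(self):
        xs = [a for a, _ in self.window]
        ys = [b for _, b in self.window]
        return pearson_binary(xs, ys)
```

A candidate g would be reported at an outlook when `correlation()` exceeds the threshold Θ; deciding which candidates to test at all is the part the paper's bounds address.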
【Paper Link】 【Pages】:1193-1202
【Authors】: Anastasios Arvanitis ; Antonios Deligiannakis ; Yannis Vassiliou
【Abstract】: The rapid growth of the social web has contributed vast amounts of user preference data. Analyzing this data and its relationships with products could have several practical applications, such as personalized advertising, market segmentation, and product feature promotion. In this work we develop novel algorithms for efficiently processing two important classes of queries involving user preferences, i.e., potential customer identification and product positioning. With regard to the first problem, we formulate product attractiveness based on the notion of reverse skyline queries. We then present a new algorithm, termed RSA, that significantly reduces the I/O cost, as well as the computation cost, when compared to the state-of-the-art reverse skyline algorithm, while at the same time being able to quickly report the first results. Several real-world applications require processing of a large number of queries, in order to identify the product characteristics that maximize the number of potential customers. Motivated by this problem, we also develop a batched extension of our RSA algorithm that significantly improves upon processing multiple queries individually, by grouping contiguous candidates, exploiting I/O commonalities and enabling shared processing. Our experimental study using both real and synthetic data sets demonstrates the superiority of our proposed algorithms for the studied classes of queries.
【Keywords】: market research; preferences; reverse skylines
【Paper Link】 【Pages】:1203-1212
【Authors】: Aditya G. Parameswaran ; Hyunjung Park ; Hector Garcia-Molina ; Neoklis Polyzotis ; Jennifer Widom
【Abstract】: Crowdsourcing enables programmers to incorporate "human computation" as a building block in algorithms that cannot be fully automated, such as text analysis and image recognition. Similarly, humans can be used as a building block in data-intensive applications--providing, comparing, and verifying data used by applications. Building upon the decades-long success of declarative approaches to conventional data management, we use a similar approach for data-intensive applications that incorporate humans. Specifically, declarative queries are posed over stored relational data as well as data computed on-demand from the crowd, and the underlying system orchestrates the computation of query answers. We present Deco, a database system for declarative crowdsourcing. We describe Deco's data model, query language, and our prototype. Deco's data model was designed to be general (it can be instantiated to other proposed models), flexible (it allows methods for data cleansing and external access to be plugged in), and principled (it has a precisely-defined semantics). Syntactically, Deco's query language is a simple extension to SQL. Based on Deco's data model, we define a precise semantics for arbitrary queries involving both stored data and data obtained from the crowd. We then describe the Deco query processor which uses a novel push-pull hybrid execution model to respect the Deco semantics while coping with the unique combination of latency, monetary cost, and uncertainty introduced in the crowdsourcing environment. Finally, we experimentally explore the query processing alternatives provided by Deco using our current prototype.
【Keywords】: declarative crowdsourcing; human computation
【Paper Link】 【Pages】:1213-1222
【Authors】: Shiwen Cheng ; Arash Termehchy ; Vagelis Hristidis
【Abstract】: Keyword query interfaces (KQIs) for databases provide easy access to data, but often suffer from low ranking quality, i.e. low precision and/or recall, as shown in recent benchmarks. It would be useful to be able to identify queries that are likely to have low ranking quality to improve the user satisfaction. For instance, the system may suggest to the user alternative queries for such hard queries. In this paper, we analyze the characteristics of hard queries and propose a novel framework to measure the degree of difficulty for a keyword query over a database, considering both the structure and the content of the database and the query results. We evaluate our query difficulty prediction model against two relevance judgment benchmarks for keyword search on databases, INEX and SemSearch. Our study shows that our model predicts the hard queries with high accuracy. Further, our prediction algorithms incur minimal time overhead.
【Keywords】: (semi-)structured data; database; keyword query; query performance
【Paper Link】 【Pages】:1223-1232
【Authors】: Yingjie Shi ; Xiaofeng Meng ; Fusheng Wang ; Yantao Gan
【Abstract】: Cloud-based data management systems are emerging as scalable, fault-tolerant, and efficient solutions to manage large volumes of data with cost effective infrastructures, and more and more data analysis applications are migrated to the cloud. As an attractive solution to provide a quick sketch of massive data before a long wait of the final accurate query result, online processing of aggregate queries in the cloud is of paramount importance. This problem is challenging to solve because of the large block based data organization and distributed processing mode in the cloud. In this paper, we present COLA, a system for Cloud Online Aggregation to provide progressive approximate answers for both single tables and joined multiple tables. We develop an online query processing algorithm for MapReduce to support incremental and continuous computing of aggregations on joins which minimizes the waiting time before an acceptable estimate is achieved. We formulate a statistical foundation that supports block-level sampling for single-table online aggregations and effective estimation of approximate results and confidence intervals of statistical significance. We also develop a two-phase stratified sampling method to support multi-table aggregations to improve the approximate query answers and speed up the convergence of confidence intervals. We implement COLA in Hadoop, and our experiments demonstrate that COLA can deliver reasonably precise online estimates within a time period two orders of magnitude shorter than that used to produce exact answers.
【Keywords】: cloud computing; mapreduce; online aggregation
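The general idea of online aggregation mentioned in the abstract above, a running estimate whose confidence interval tightens as more blocks are read, can be sketched as follows. This is not COLA's estimator: the function name `online_average` is hypothetical, blocks are visited in a random order, and rows are treated as independent samples, an assumption that real block-level sampling has to correct for because rows within a block are often correlated.

```python
import math
import random


def online_average(blocks, z=1.96, seed=0):
    """Yield (blocks_seen, estimate, half_width) after each randomly chosen block.

    `z` is the normal quantile for the interval (1.96 for roughly 95%).
    Assumption for this sketch: rows behave like independent samples.
    """
    rng = random.Random(seed)
    order = list(range(len(blocks)))
    rng.shuffle(order)                      # block-level sampling without replacement

    n = 0          # rows seen so far
    mean = 0.0     # running mean (Welford's algorithm)
    m2 = 0.0       # running sum of squared deviations from the mean

    for seen, idx in enumerate(order, start=1):
        for value in blocks[idx]:
            n += 1
            delta = value - mean
            mean += delta / n
            m2 += delta * (value - mean)
        if n > 1:
            variance = m2 / (n - 1)         # sample variance
            half_width = z * math.sqrt(variance / n)
        else:
            half_width = float("inf")       # no interval from fewer than two rows
        yield seen, mean, half_width
```

A caller can stop consuming the generator as soon as `half_width` falls below an acceptable error, which is the point at which an online system would return its approximate answer.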
【Paper Link】 【Pages】:1233-1242
【Authors】: Yilei Wang ; Bingzheng Wei ; Jun Yan ; Yang Hu ; Zhi-Hong Deng ; Zheng Chen
【Abstract】: In the past decades, machine learning models, especially supervised learning algorithms, have been widely used in various real world applications. However, no matter how strong a learning model is, it will suffer from prediction errors when it is applied to real world problems. Due to the black box nature of supervised learning models, it is a challenging problem to fix them by further learning from the failure cases they generate. In this paper, we propose a novel Local Patch Framework (LPF) to locally fix supervised learning models by learning from their predicted failure cases. Since learning models are generally globally optimized during the training process, our proposed LPF assumes that most of the learning errors are caused by local errors in the model. Thus we aim to break the black boxes of learning models by identifying and fixing the local errors of various models automatically. The proposed LPF has two key steps: local error region subspace learning and local patch model learning. In this way, we aim to fix the errors of learning models locally and automatically with certain generalization ability on unseen testing data. Experiments on both classification and ranking problems show that the proposed LPF is effective and outperforms the original algorithms and the incremental learning model.
【Keywords】: local patch; metric learning; model fixing; supervised learning model
【Paper Link】 【Pages】:1243-1252
【Authors】: Lifei Chen ; Shengrui Wang
【Abstract】: Naive Bayes (NB for short) is one of the popular methods for supervised classification in a knowledge management system. Currently, in many real-world applications, high-dimensional data pose a major challenge to conventional NB classifiers, due to noisy or redundant features and local relevance of these features to classes. In this paper, an automated feature weighting solution is proposed to result in a NB method effective in dealing with high-dimensional data. We first propose a locally weighted probability model, for Bayesian modeling in high-dimensional spaces, to implement a soft feature selection scheme. Then we propose an optimization algorithm to find the weights in linear time complexity, based on the Logitnormal priori distribution and the Maximum a Posteriori principle. Experimental studies show the effectiveness and suitability of the proposed model for high-dimensional data classification.
【Keywords】: classification; high-dimensionality; maximum a posteriori; naive bayes; soft feature weighting
【Paper Link】 【Pages】:1253-1262
【Authors】: Yuan An ; Xiaohua Hu ; Il-Yeol Song
【Abstract】: In order to realize the Semantic Web, various structures on the Web including Web forms need to be annotated with and mapped to domain ontologies. We present a machine learning-based automatic approach for discovering complex mappings from Web forms to ontologies. A complex mapping associates a set of semantically related elements on a form to a set of semantically related elements in an ontology. Existing schema mapping solutions mainly rely on integrity constraints to infer complex schema mappings. However, it is difficult to extract rich integrity constraints from forms. We show how machine learning techniques can be used to automatically discover complex mappings between Web forms and ontologies. The challenge is how to capture and learn the complicated knowledge encoded in existing complex mappings. We develop an initial solution that takes a naive Bayesian approach. We evaluated the performance of the solution on various domains. Our experimental results show that, for more than 80% of the test cases, the solution returns the expected mapping as the top-1 result, usually from among several hundred candidate mappings. Furthermore, the expected mappings are always returned as the top-k results with k<4. The experiments have demonstrated that the approach is effective and has the potential to save significant human effort.
【Keywords】: mapping discovery; ontologies; semantic mapping; web forms
【Paper Link】 【Pages】:1263-1272
【Authors】: Xin Chen ; Xiaohua Hu ; Zhongna Zhou ; Yuan An ; Tingting He ; E. K. Park
【Abstract】: In this paper, we deal with two research issues: the automation of visual attribute identification and semantic relation learning between visual attributes and object categories. The contribution is two-fold. First, we provide a uniform framework to reliably extract both categorical attributes and depictive attributes. Second, we incorporate the obtained semantic associations between visual attributes and object categories into a text-based topic model and extract descriptive latent topics from external textual knowledge sources. Specifically, we show that in mining natural language descriptions from external knowledge sources, the relation between semantic visual attributes and object categories can be encoded as Must-Links and Cannot-Links, which can be represented by a Dirichlet-Forest prior. To alleviate the workload of manual supervision and labeling in the image categorization process, we introduce a semi-supervised training framework using a soft-margin semi-supervised SVM classifier. We also show that large-scale image categorization results can be significantly improved by combining automatically acquired visual attributes. Experimental results show that the proposed model achieves better ability in describing object-related attributes and makes the inferred latent topics more descriptive.
【Keywords】: dirichlet-forest prior; topic model; visual attribute identification
【Paper Link】 【Pages】:1273-1282
【Authors】: Brian Quanz ; Jun Huan
【Abstract】: Multi-view semi-supervised learning methods try to exploit the combination of multiple views along with large amounts of unlabeled data in order to learn better predictive functions when limited labeled data is available. However, lack of complete view data limits the applicability of multi-view semi-supervised learning to real world data. Commonly, one data view is readily and cheaply available, but additional views may be costly or only available in some cases. This work aims to make multi-view semi-supervised learning approaches more applicable to real world data specifically by addressing the issue of missing views. We introduce CoNet, a feature generation method that learns a mapping from one view to another that is specifically designed to produce features that are useful for multi-view semi-supervised learning algorithms. The mapping is then used to fill in views as pre-processing. Our comprehensive experimental study demonstrates the utility of our method as compared to the state-of-the-art multi-view semi-supervised learning methods for this scenario of partially observed views.
【Keywords】: missing data; multi-view learning; semi-supervised learning
【Paper Link】 【Pages】:1283-1292
【Authors】: Krishna Kummamuru ; Ajith Jujjuru ; Mayuri Duggirala
【Abstract】: Designing interactive voice systems that impose an optimum cognitive load on callers has been an active research topic for quite some time. There have been many studies comparing user preferences for navigation trees with greater depth versus greater breadth. In this paper, we consider the navigation of structured data containing various types of attributes using phone-based interactions. This problem is particularly relevant to emerging economies in which innovative voice-based applications are being built to serve the semi-literate population. We address the problem of identifying the right sequence of facets to be presented to the user for phone-based navigation of the data in two stages. First, we perform extensive user studies in the target population to understand the relation between the nature of facets (attributes) of the data and the cognitive load. Second, we propose an algorithm to design optimum navigation trees based on the inferences made in the first stage. We compare the proposed algorithm with the traditional facet generation algorithms with respect to various factors and discuss the optimality of the proposed algorithm.
【Keywords】: cognitive load; navigation of structured data; phone-based interaction
【Paper Link】 【Pages】:1293-1302
【Authors】: Jaime Arguello ; Robert Capra
【Abstract】: Aggregated search is the task of blending results from different specialized search services, or verticals, into the web search results. Aggregated search coherence refers to the degree to which results from different systems focus on similar senses of the query. While cross-component coherence has been cited as an important criterion for whole-page evaluation, its effect on search behavior has not been deeply investigated in prior research. In this work, we focus on the coherence between two aggregated search components: images and web results. In particular, we investigate whether the query-senses associated with the blended image results can influence user interaction with the web results. For example, if a user wants web results about "jaguar" the animal, are they more likely to examine the web results if the image results contain pictures of the animal instead of pictures of the car? Based on two large user studies, our results show that the image results can systematically affect user interaction with the web results. If the web results are largely consistent with the search task, then the effect of the image results is small. However, if the web results are only marginally consistent with the search task, such as when they are highly diversified across query-senses, the image results have a significant effect on user interaction with the web results. Our findings have implications on current research in whole-page evaluation, aggregated search, and diversity ranking.
【Keywords】: aggregated search; aggregated search coherence; assimilation effects; evaluation; search behavior; user study
【Paper Link】 【Pages】:1303-1312
【Authors】: Lei Wang ; Dawei Song ; Eyad Elyan
【Abstract】: Most of the state-of-the-art approaches to Query-by-Example (QBE) video retrieval are based on the Bag-of-visual-Words (BovW) representation of visual content. It, however, ignores the spatial-temporal information, which is important for similarity measurement between videos. Direct incorporation of such information into the video data representation for a large scale data set is computationally expensive in terms of storage and similarity measurement. It is also static regardless of the change of discriminative power of visual words for different queries. To tackle these limitations, in this paper, we propose to discover Spatial-Temporal Correlations (STC) imposed by the query example to improve the BovW model for video retrieval. The STC, in terms of spatial proximity and relative motion coherence between different visual words, is crucial to identify the discriminative power of the visual words. We develop a novel technique to emphasize the most discriminative visual words for similarity measurement, and incorporate this STC-based approach into the standard inverted index architecture. Our approach is evaluated on the TRECVID2002 and CC_WEB_VIDEO datasets for two typical QBE video retrieval tasks respectively. The experimental results demonstrate that it substantially improves the BovW model as well as a state-of-the-art method that also utilizes spatial-temporal information for QBE video retrieval.
【Keywords】: bag-of-visual-word; content based video retrieval; discriminative visual word; query-by-example; spatial-temporal correlation
【Paper Link】 【Pages】:1313-1322
【Authors】: Jingjing Liu ; Chang Liu ; Michael J. Cole ; Nicholas J. Belkin ; Xiangmin Zhang
【Abstract】: We report on an investigation of behavioral differences between users in difficult and easy search tasks. Behavioral factors that can be used in real-time to predict task difficulty are identified. User data was collected in a controlled lab experiment (n=38) where each participant completed four search tasks in the genomics domain. We looked at user behaviors that can be obtained by systems at three levels, distinguished by the time point when the measurements can be made. They are: 1) first-round level at the beginning of the search, 2) accumulated level during the search, and 3) whole-session level by the end of the search. Results show that a number of user behaviors at all three levels differed between easy and difficult tasks. Models predicting task difficulty at all three levels were developed and evaluated. A real-time model incorporating first-round and accumulated levels of behaviors (FA) had fairly good prediction performance (accuracy 83%; precision 88%), which is comparable with the model using the whole-session level behaviors, which are not real-time (accuracy 75%; precision 92%). We also found that, for efficiency purposes, using only a limited number of significant variables (FC_FA) can obtain a prediction accuracy of 75%, with a precision of 88%. Our findings can help search systems predict task difficulty and adapt search results to users.
【Keywords】: accumulated level; difficulty prediction; first-round level; task difficulty; user behavior; user modeling; whole-session level
【Paper Link】 【Pages】:1323-1331
【Authors】: Nicolae Suditu ; François Fleuret
【Abstract】: Content-based image retrieval systems have to cope with two different regimes: understanding broadly the categories of interest to the user, and refining the search in this or these categories to converge to specific images among them. Here, in contrast with other types of retrieval systems, these two regimes are of great importance since the search initialization is hardly optimal (i.e. the page-zero problem) and the relevance feedback must tolerate the semantic gap of the image's visual features. We present a new approach that encompasses these two regimes, and infers from the user actions a seamless transition between them. Starting from a query-free approach meant to solve the page-zero problem, we propose an adaptive exploration/exploitation trade-off that transforms the original framework into a versatile retrieval framework with full searching capabilities. Our approach is compared to the state-of-the-art it extends by conducting user evaluations on a collection of 60,000 images from the ImageNet database.
【Keywords】: bayesian framework; iterative relevance feedback; query-free interactive image retrieval; user-based evaluation
【Paper Link】 【Pages】:1332-1341
【Authors】: Risi Thonangi ; Shivnath Babu ; Jun Yang
【Abstract】: Solid-state drives are becoming a viable alternative to magnetic disks in database systems, but their performance characteristics, particularly those caused by their erase-before-write behavior, make conventional database indexes a poor fit. There have been various proposals of indexes specialized for these devices, but to make such indexes practical, we must address the issue of concurrency control. Good concurrency control is especially critical to indexes on solid-state drives, because they typically rely on batch updates, which may take a long time and block concurrent index accesses. We design, implement, and evaluate an index structure called FD+tree and an associated concurrency control scheme called FD+FC. Our evaluation confirms significant performance advantages of our approach over less sophisticated ones, and brings out insights on data structure design and OLTP performance tuning on solid-state drives.
【Keywords】: concurrency control; performance evaluation; ssd indexes
【Paper Link】 【Pages】:1342-1351
【Authors】: Mu-Woong Lee ; Seung-won Hwang
【Abstract】: Multidimensional indexing is crucial for enabling a fast search over large-scale data. Owing to the unprecedented scale of data, extending such indexing technology has recently gained attention in distributed environments. The goal of existing efforts in distributed indexing has been the localization of queries to data residing at a small number of nodes (i.e., locality-preserving indexing) to minimize communication cost. However, considering that workloads often correlate with data locality, such indexing often generates hotspots. Location-based queries are typically skewed to disaster areas during certain periods of time, e.g., during Hurricane Irene, search traffic increased by more than 2000%. To alleviate such hotspots, we propose workload-balancing as an optimization goal. A cost model analytically supporting the need for load balancing is first developed, then a distributed index that evenly distributes the workload is presented. Our empirical study suggests that hotspots degrading search performance can be effectively alleviated. Specifically, when deployed to Amazon EC2, our proposed scheme showed a maximum speed-up of 127.7%. Even in hostile settings where workload is not at all correlated with the search criteria, the proposed scheme's performance is comparable to existing approaches optimized for such settings.
【Keywords】: distributed indexing
【Paper Link】 【Pages】:1352-1361
【Authors】: Zhifeng Bao ; Henning Köhler ; Liwei Wang ; Xiaofang Zhou ; Shazia Wasim Sadiq
【Abstract】: Provenance information is vital in many application areas as it helps explain data lineage and derivation. However, storing fine-grained provenance information can be expensive. In this paper, we present a framework for storing provenance information relating to data derived via database queries. In particular, we first propose a provenance tree data structure which matches the query structure and thereby presents a possibility to avoid redundant storage of information regarding the derivation process. Then we investigate two approaches for reducing storage costs. The first approach utilizes two ingenious rules to achieve reduction on provenance trees. The second one is a dynamic programming solution, which provides a way of optimizing the selection of query tree nodes where provenance information should be stored. The optimization algorithm runs in polynomial time in the query size and is linear in the size of the provenance information, thus enabling provenance tracking and optimization without incurring large overheads. Experiments show that our approaches guarantee significantly lower storage costs than existing approaches.
【Keywords】: provenance storage
【Paper Link】 【Pages】:1362-1371
【Authors】: Manuel Barbosa ; Alexandre Pinto ; Bruno Gomes
【Abstract】: This paper addresses the scenario of multi-release anonymization of datasets. We consider dynamic datasets where data can be inserted and deleted, and view this scenario as a case where each release is a small subset of the dataset corresponding, for example, to the results of a query. Compared to multiple releases of the full database, this has the obvious advantage of faster anonymization. We present an algorithm for post-processing anonymized queries that prevents anonymity attacks using multiple released queries. This algorithm can be used with several distinct protection principles and anonymization algorithms, which makes it generic and flexible. We give an experimental evaluation of the algorithm and compare it to $m$-invariance both in terms of efficiency and data quality. To this end, we propose two data quality metrics based on Shannon's entropy, and show that they can be seen as a refinement of existing metrics.
【Keywords】: anonymization; dynamic datasets; multiple-releases
【Paper Link】 【Pages】:1372-1381
【Authors】: Duncan Yung ; Eric Lo ; Man Lung Yiu
【Abstract】: A moving range query continuously reports the query results (e.g., restaurants) that are within radius $r$ of a moving query point (e.g., a moving tourist). To minimize the communication cost with the mobile clients, a service provider that evaluates moving range queries also returns a safe region that bounds the validity of query results. However, an untrustworthy service provider may report incorrect safe regions to mobile clients. In this paper, we present efficient techniques for authenticating the safe regions of moving range queries. We prove theoretically that our methods for authenticating moving range queries can minimize the data sent between the service provider and the mobile clients. Extensive experiments are carried out using both real and synthetic datasets, and the results show that our methods incur small communication costs and overhead.
【Keywords】: authentication; moving queries
【Paper Link】 【Pages】:1382-1391
【Authors】: Wei Wei ; Xuhui Fan ; Jinyan Li ; Longbing Cao
【Abstract】: Financial variables such as asset returns in the massive market contain various hierarchical and horizontal relationships forming complicated dependence structures. Modeling and mining of these structures is challenging due to their own high structural complexities as well as the stylized facts of the market data. This paper introduces a new canonical vine dependence model to identify the asymmetric and non-linear dependence structures of asset returns without any prior independence assumptions. To simplify the model while maintaining its merit, a partial correlation based method is proposed to optimize the canonical vine. Compared with the original canonical vine, the new model can still maintain the most important dependence but many unimportant nodes are removed to simplify the canonical vine structure. Our model is applied to construct and analyze dependence structures of European stocks as case studies. Its performance is evaluated by measuring portfolio of Value at Risk, a widely used risk management measure. In comparison to a very recent canonical vine model and the 'full' model, our experimental results demonstrate that our model has a much better quality of Value at Risk, providing insightful knowledge for investors to control and reduce the aggregation risk of the portfolio.
【Keywords】: canonical vine; dependence structure; financial variables
【Paper Link】 【Pages】:1392-1401
【Authors】: Dayong Wang ; Steven Chu-Hong Hoi ; Ying He
【Abstract】: Auto face annotation plays an important role in many real-world multimedia information and knowledge management systems. Recently there has been a surge of research interest in mining weakly labeled facial images on the internet to tackle this long-standing research challenge in computer vision and image understanding. In this paper, we present a novel unified learning framework for face annotation by mining weakly labeled web facial images through interdisciplinary efforts of combining sparse feature representation, content-based image retrieval, transductive learning and inductive learning techniques. In particular, we first introduce a new search-based face annotation paradigm using transductive learning, then propose an effective inductive learning scheme for training classification-based annotators from weakly labeled facial images, and finally unify both transductive and inductive learning approaches to maximize the learning efficacy. We conduct extensive experiments on a real-world web facial image database, in which encouraging results show that the proposed unified learning scheme outperforms the state-of-the-art approaches.
【Keywords】: face annotation; image retrieval; inductive learning; sparse coding; transductive learning; web facial images
【Paper Link】 【Pages】:1402-1411
【Authors】: Fan Deng ; Stefan Siersdorfer ; Sergej Zerr
【Abstract】: We propose two efficient algorithms for exploring topic diversity in large document corpora such as user generated content on the social web, bibliographic data, or other web repositories. Analyzing diversity is useful for obtaining insights into knowledge evolution, trends, periodicities, and topic heterogeneity of such collections. Calculating diversity statistics requires averaging over the similarity of all object pairs, which, for large corpora, is prohibitive from a computational point of view. Our proposed algorithms overcome the quadratic complexity of the average pair-wise similarity computation, and allow for constant time (depending on dataset properties) or linear time approximation with probabilistic guarantees. We show examples of diversity-based studies on large samples from corpora such as the social photo sharing site Flickr, the DBLP bibliography, and US Census data.
【Keywords】: diversity; jaccard; clustering
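The abstract above notes that averaging similarity over all object pairs is quadratic in the corpus size. As a minimal sketch of the general idea of replacing the exact average with a sampled one, the following compares an exact average pairwise Jaccard similarity with a plain Monte Carlo estimate over random pairs. This is a generic estimator and not the paper's algorithms; the function names, the sample count, and the seed are all assumptions made for illustration.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def exact_avg_similarity(docs):
    """Average Jaccard similarity over all unordered pairs (quadratic cost)."""
    pairs = list(combinations(docs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def sampled_avg_similarity(docs, n_samples, seed=0):
    """Estimate the same average from n_samples uniformly drawn pairs."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    total = 0.0
    for _ in range(n_samples):
        a, b = rng.sample(docs, 2)  # two distinct documents
        total += jaccard(a, b)
    return total / n_samples
```

The cost of the sampled version depends on the number of samples rather than on the number of pairs, which is the trade-off the abstract alludes to; the probabilistic guarantees claimed in the paper are not reproduced here.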
【Paper Link】 【Pages】:1412-1421
【Authors】: Michele Coscia ; Viridiana Rios
【Abstract】: We develop a framework that uses Web content to obtain quantitative information about a phenomenon that would otherwise require the operation of large scale, expensive intelligence exercises. Exploiting indexed reliable sources such as online newspapers and blogs, we use unambiguous query terms to characterize a complex evolving phenomenon and solve a security policy problem: identifying the areas of operation and modus operandi of criminal organizations, in particular, Mexican drug trafficking organizations over the last two decades. We validate our methodology by comparing information that is known with certainty with the information we extracted using our framework. We show that our framework is able to use information available on the web to efficiently extract implicit knowledge about criminal organizations. In the scenario of Mexican drug trafficking, our findings provide evidence that criminal organizations are more strategic and operate in more differentiated ways than the current academic literature suggests.
【Keywords】: data retrieval; knowledge discovery process; query
【Paper Link】 【Pages】:1422-1431
【Authors】: Meng Jiang ; Peng Cui ; Fei Wang ; Qiang Yang ; Wenwu Zhu ; Shiqiang Yang
【Abstract】: Social networks enable users to create different types of personal items. In dealing with serious information overload, the major problems of social recommendation are sparsity and cold start. In existing approaches, relational and heterogeneous domains cannot be effectively utilized for social recommendation, which makes it challenging to model users and multiple types of items together on social networks. In this paper, we consider how to represent social networks with multiple relational domains and alleviate the major problems in an individual domain by transferring knowledge from other domains. We propose a novel Hybrid Random Walk (HRW), which can integrate multiple heterogeneous domains including directed/undirected links, signed/unsigned links and within-domain/cross-domain links into a star-structured hybrid graph with the user graph at the center. We perform a random walk until convergence and use the steady-state distribution for recommendation. We conduct experiments on a real social network dataset and show that our method can significantly outperform existing social recommendation approaches.
【Keywords】: hybrid random walk; relational domains; social recommendation; star-structured graph; transfer learning
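The abstract above describes running a random walk until convergence and ranking by the steady-state distribution. A minimal sketch of that step on a single, unsigned graph is given below, using power iteration with a restart vector. The restart mechanism, the `alpha` value, and the function name are assumptions for illustration; the star-structured hybrid graph and signed links of HRW are not modeled.

```python
def steady_state(transition, restart, alpha=0.15, tol=1e-10, max_iter=1000):
    """Power iteration for a random walk with restart.

    transition: row-stochastic matrix as a list of lists, where
                transition[i][j] is the probability of stepping from i to j.
    restart:    probability vector the walker jumps back to with
                probability alpha (an assumed personalization vector).
    """
    n = len(transition)
    p = list(restart)
    for _ in range(max_iter):
        nxt = [
            alpha * restart[j]
            + (1 - alpha) * sum(p[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
        # Stop once successive distributions differ by less than tol (L1 norm).
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p
```

Nodes with a higher steady-state probability would be ranked higher as recommendation candidates for the user encoded in `restart`.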
【Paper Link】 【Pages】:1432-1441
【Authors】: Yang Yang ; Jie Tang ; Jacklyne Keomany ; Yanting Zhao ; Juanzi Li ; Ying Ding ; Tian Li ; Liangwei Wang
【Abstract】: Detecting and monitoring competitors is fundamental for a company to stay ahead in the global market. Existing studies mainly focus on mining competitive relationships within a single data source, while competitive information is usually distributed across multiple networks. How to discover the underlying patterns and utilize the heterogeneous knowledge to avoid biased views of this issue is a challenging problem. In this paper, we study the problem of mining competitive relationships by learning across heterogeneous networks. We use Twitter and patent records as our data sources and statistically study the patterns behind the competitive relationships. We find that the two networks exhibit different but complementary patterns of competition. Our proposed model, the Topical Factor Graph Model (TFGM), defines a latent topic layer to bridge the two networks and learns a semi-supervised model to classify the relationships between entities (e.g., companies or products). We test the proposed model on two real data sets, and the experimental results validate the effectiveness of our model, with an average improvement of 46% over alternative methods.
【Keywords】: competitive relationship; social network; web mining
【Paper Link】 【Pages】:1442-1451
【Authors】: Chao Zhang ; Lidan Shou ; Ke Chen ; Gang Chen ; Yijun Bei
【Abstract】: The emerging location-based social network (LBSN) services not only allow people to maintain cyber links with their friends, but also enable them to share the events happening to them at different locations. The geo-social correlations among event participants make it possible to quantify mutual user influence for various events. Such a quantification of influence could benefit a wide spectrum of real-life applications such as targeted advertising and viral marketing. In this paper, we perform an in-depth analysis of the geo-social correlations among LBSN users at the event level, based on which we address two problems: user influence evaluation and influential events discovery. To capture the geo-social closeness between LBSN users, we propose a unified influence metric. This metric combines a novel social proximity measure named penalized hitting time with a geographical weight function modeled by a power-law distribution. We propose two approximate algorithms, namely global iteration (GI) and dynamic neighborhood expansion (DNE), to efficiently evaluate user influence with tight theoretical error bounds. We then adopt the sampling technique and the threshold algorithm to support efficient retrieval of top-K influential events. Extensive experiments on both real-life and synthetic LBSN data sets confirm that the proposed algorithms are effective, efficient, and scalable.
【Keywords】: information extraction; social network; structural analysis
【Paper Link】 【Pages】:1452-1461
【Authors】: Thang N. Dinh ; Yilin Shen ; My T. Thai
【Abstract】: With the rapid expansion of online social networks (OSNs), millions of users are tweeting and sharing their personal status daily without being aware of where that information eventually travels to. Likewise, the huge magnitude of data available on OSNs poses a substantial challenge to tracking how a piece of information leaks to specific targets. In this paper, we study the problem of smartly sharing information to control the propagation of sensitive information in OSNs. In particular, we formulate and investigate the Maximum Circle of Trust problem, in which we seek to construct a circle of trust on the fly so that OSN users can safely share their information knowing that it will not be propagated to their unwanted targets (those with whom they are not willing to share). Since most messages in OSNs are propagated within 2 to 5 hops, we first investigate this problem under 2-hop information propagation by showing the hardness of obtaining an optimal solution, along with an algorithm with a proven performance guarantee. In the general case where information can be propagated more than two hops, the problem is #P-hard, i.e., it cannot be solved in polynomial time unless P = NP. Thus we propose a novel greedy algorithm, hybridizing the handy but costly sampling method with a novel cut-based estimation. The quality of the hybrid algorithm is comparable to that of the sampling method while taking only a tiny fraction of the time. We have validated the effectiveness of our solutions on many real-world traces. These extensive experiments also highlight several important observations on information leakage which help to sharpen the security of OSNs in the future.
【Keywords】: algorithms; circle of trust; complexity; social networks
【Paper Link】 【Pages】:1462-1466
【Authors】: Guan Wang ; Qingbo Hu ; Philip S. Yu
【Abstract】: In social network research, the studies on social influence maximization and entity similarity are two important and orthogonal tasks. On homogeneous networks, social influence maximization research tries to identify an initial influential set that maximizes the spread of the information, while similarity studies focus on designing meaningful ways to quantify entities' similarities. As heterogeneous networks become ubiquitous and entities of different types are related to each other, we observe the possibility of merging the two directions to improve the performance of both. In fact, we found that influence values among one type of nodes and similarity scores among the other type of nodes reinforce each other towards better and more meaningful results. Therefore, we introduce a framework that computes social influence for one type of nodes and simultaneously measures similarity of the other type of nodes in a heterogeneous network. First, we decouple the target heterogeneous network (which we call the Influence Similarity (IS) network) into three different parts: the Influence network, the Similarity network, and the information tunnels (IT) between them. Through the IT, we exchange the influence scores and the similarity scores to calculate more precise similarity and influence scores and thereby improve the quality of both. The experimental results on real-world data show that our framework simultaneously enables the influence maximization framework to identify more influential seeds in the Influence network and the similarity measures to produce more meaningful similarity scores in the Similarity network.
【Keywords】: influence; similarity; social networks
【Paper Link】 【Pages】:1467-1471
【Authors】: Mahmudur Rahman ; Mansurul Bhuiyan ; Mohammad Al Hasan
【Abstract】: Graphlet frequency distribution (GFD) is an analysis tool for understanding the variance of local structure in a graph. Many recent works use GFD for comparing and characterizing real-life networks. However, the main bottleneck for graph analysis using GFD is the excessive computation cost of obtaining the frequency of each of the graphlets in a large network. To overcome this, we propose a simple yet powerful algorithm, called GRAFT, that obtains the approximate graphlet frequency for all graphlets that have up to 5 vertices. Compared to an exact counting algorithm, our algorithm achieves a speedup factor between 10 and 100 for a negligible counting error, which is, on average, less than 5%. For example, exact graphlet counting for ca-AstroPh takes approximately 3 days, but GRAFT runs for 45 minutes to perform the same task with a counting accuracy of 95.6%.
【Keywords】: approximate graphlet counting; graph analysis; graphlet frequency distribution
【Paper Link】 【Pages】:1472-1476
【Authors】: Wei Cheng ; Xiang Zhang ; Feng Pan ; Wei Wang
【Abstract】: Two-dimensional contingency tables or co-occurrence matrices arise frequently in various important applications such as text analysis and web-log mining. As a fundamental research topic, co-clustering aims to generate a meaningful partition of the contingency table to reveal hidden relationships between rows and columns. Traditional co-clustering algorithms usually produce a predefined number of flat partitions of both rows and columns, which do not reveal relationships among clusters. To address this limitation, hierarchical co-clustering algorithms have recently attracted a lot of research interest. Although successful in various applications, the existing hierarchical co-clustering algorithms are usually based on certain heuristics and do not have a solid theoretical background. In this paper, we present a new co-clustering algorithm with a solid information-theoretic background. It simultaneously constructs a hierarchical structure of both row and column clusters which retains sufficient mutual information between rows and columns of the contingency table. An efficient and effective greedy algorithm is developed which grows a co-cluster hierarchy by successively performing row-wise or column-wise splits that lead to the maximal mutual information gain. Extensive experiments on real datasets demonstrate that our algorithm can reveal essential relationships of row (and column) clusters and has better clustering precision than existing algorithms.
【Keywords】: co-clustering; contingency table; entropy; text analysis
【Paper Link】 【Pages】:1477-1481
【Authors】: Bin Tan ; Yuanhua Lv ; ChengXiang Zhai
【Abstract】: A user's web search history contains many valuable search patterns. In this paper, we study search patterns that represent a user's long-lasting and exploratory search interests. By focusing on long-lastingness and exploratoriness, we are able to discover search patterns that are most useful for recommending new and relevant information to the user. Our approach is based on language modeling and clustering, and is specifically designed to handle web search logs. We run our algorithm on a real web search log collection, and evaluate its performance using a novel simulated study on the same search log dataset. Experimental results support our hypothesis that long-lastingness and exploratoriness are necessary for generating successful recommendations. Our algorithm is shown to effectively discover such search interest patterns, and is thus directly useful for making recommendations based on personal search history.
【Keywords】: recommendation system; search log mining; user modeling
【Paper Link】 【Pages】:1482-1486
【Authors】: Deqing Wang ; Hui Zhang ; Rui Liu ; Weifeng Lv
【Abstract】: Much work has been done on feature selection. Existing methods, such as the Chi-Square Statistic and Information Gain, are based on document frequency. However, these methods have two shortcomings: one is that they are not reliable for low-frequency terms, and the other is that they only count whether a term occurs in a document and ignore the term frequency. In fact, high-frequency terms within a specific category are often regarded as discriminators. This paper focuses on how to construct the feature selection function based on term frequency, and proposes a new approach based on the t-test, which is used to measure the diversity of the distributions of a term between the specific category and the entire corpus. Extensive comparative experiments on two text corpora using three classifiers show that our new approach is comparable to or slightly better than the state-of-the-art feature selection methods (i.e., chi2 and IG) in terms of macro-F1 and micro-F1.
【Keywords】: T-test; feature selection; term frequency; text classification
【Paper Link】 【Pages】:1487-1491
【Authors】: Shuaiqiang Wang ; Jiankai Sun ; Byron J. Gao ; Jun Ma
【Abstract】: Collaborative filtering (CF) is an effective technique addressing the information overload problem. Recently ranking-based CF methods have shown advantages in recommendation accuracy, being able to capture the preference similarity between users even if their rating scores differ significantly. In this study, we seek accuracy improvement of ranking-based CF through adaptation of the vector space model, where we consider each user as a document and her pairwise relative preferences as terms. We then use a novel degree-specialty weighting scheme resembling TF-IDF to weight the terms. Then we use cosine similarity to select a neighborhood of users for the target user to make recommendations. Experiments on benchmarks in comparison with the state-of-the-art methods demonstrate the promise of our approach.
【Keywords】: collaborative filtering; ranking-based collaborative filtering; recommender systems; term weighting; vector space model
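The abstract above treats each user as a document whose terms are pairwise relative preferences, weights the terms, and compares users by cosine similarity. The following is a minimal sketch of that pipeline. The paper's degree-specialty weighting is not specified in the abstract, so a plain IDF-style weight is substituted here as a stand-in; that weight and all function names are assumptions.

```python
import math
from itertools import combinations


def preference_terms(ratings):
    """Turn {item: score} into terms (a, b), each meaning 'a is preferred to b'."""
    terms = set()
    for a, b in combinations(sorted(ratings), 2):
        if ratings[a] > ratings[b]:
            terms.add((a, b))
        elif ratings[b] > ratings[a]:
            terms.add((b, a))
    return terms  # ties produce no term


def weighted_vectors(users):
    """Build one sparse weighted vector per user from their preference terms."""
    docs = {u: preference_terms(r) for u, r in users.items()}
    n = len(docs)
    df = {}  # number of users holding each preference term
    for terms in docs.values():
        for t in terms:
            df[t] = df.get(t, 0) + 1
    # Stand-in IDF-style weight (assumption): rarer preferences weigh more.
    return {u: {t: math.log(1 + n / df[t]) for t in terms}
            for u, terms in docs.items()}


def cosine(v, w):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v[t] * w[t] for t in v.keys() & w.keys())
    norm = (math.sqrt(sum(x * x for x in v.values()))
            * math.sqrt(sum(x * x for x in w.values())))
    return dot / norm if norm else 0.0
```

Two users who order items identically get a similarity of 1 even when their raw scores differ, which is the property of ranking-based CF that the abstract highlights.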
【Paper Link】 【Pages】:1492-1496
【Authors】: Guangyou Zhou ; Kang Liu ; Jun Zhao
【Abstract】: Community question answering (cQA) has become a popular service for users to ask and answer questions. In recent years, the efficiency of cQA services has been hindered by a sharp increase in the number of questions in the community. This paper is concerned with the problem of question routing. Question routing in cQA aims to route new questions to eligible answerers who can give high-quality answers. However, the traditional methods suffer from the following two problems: (1) word mismatch between the new questions and the users' answering history; (2) high variance in perceived answer quality. To solve these two problems, this paper proposes a novel joint learning method that incorporates both word mismatch and answer quality into a unified framework for question routing. We conduct experiments on a large-scale real-world data set from Yahoo! Answers. Experimental results show that our proposed method significantly outperforms the traditional query likelihood language model (QLLM) as well as the state-of-the-art cluster-based language model (CBLM) and category-sensitive query likelihood language model (TCSLM).
【Keywords】: answer quality; language model; question routing; translation model
【Paper Link】 【Pages】:1497-1501
【Authors】: Andrey Gubichev ; Thomas Neumann
【Abstract】: Finding the minimum connected subtree of a graph that contains a given set of nodes (i.e., the Steiner tree problem) is a fundamental operation in keyword search in graphs, yet it is known to be NP-hard. Existing approximation techniques either make use of the heavy indexing of the graph, or entirely rely on online heuristics. In this paper we bridge the gap between these two extremes and present a scalable landmark-based index structure that, combined with a few lightweight online heuristics, yields a fast and accurate approximation of the Steiner tree. Our solution handles real-world graphs with millions of nodes and provides an approximation error of less than 5% on average.
【Keywords】: graph databases; keyword search; steiner trees
【Paper Link】 【Pages】:1502-1506
【Authors】: Hakan Ceylan ; Ioannis Arapakis ; Pinar Donmez ; Mounia Lalmas
【Abstract】: It is of great interest to news providers such as Yahoo! News to attain higher visitor rates by promoting greater engagement with their content. One aspect of engagement deals with keeping users on the site longer by allowing them to navigate through content with enhanced, click-through experiences. News portals have invested in ways to provide embedded links within news stories. So far these links have been manually curated by professional editors, and due to the manual effort involved, the use of such links has been limited. In this paper we propose an automated approach to detecting and linking newsworthy events to associated articles. Our analysis, conducted on Amazon's Mechanical Turk, reveals that our system's performance is comparable to that of professional editors, and that users find the automatically generated highlights interesting and the associated articles worthy of reading.
【Keywords】: automatic linking; engagement strategy; newsworthiness
【Paper Link】 【Pages】:1507-1511
【Authors】: Fanhua Shang ; Licheng Jiao ; Yuanyuan Liu ; Fei Wang
【Abstract】: Learning data representation is a fundamental problem in data mining and machine learning. Spectral embedding is one popular method for learning effective data representations. In this paper we propose a novel framework to learn enhanced spectral embedding, which not only considers the geometrical structure of the data space, but also takes advantage of the given pairwise constraints. The proposed formulation can be solved by an iterative eigenvalue thresholding (IET) algorithm. Specifically, we convert the problem of learning spectral embedding with pairwise constraints into that of completing an "ideal" kernel matrix. We introduce the spectral embedding of the graph Laplacian as auxiliary information and cast the problem as a small-scale positive semidefinite (PSD) matrix optimization problem with nuclear norm regularization. Then, we develop an IET algorithm to solve it efficiently. Moreover, we also present an effective semi-supervised clustering (SSC) approach with learned spectral embedding (LSE). Finally, we validate the proposed IET algorithm and LSE approach by extensive experiments on real-world data sets.
【Keywords】: fixed point method; learning spectral embedding; matrix completion; nuclear norm minimization
【Paper Link】 【Pages】:1512-1516
【Authors】: Rong-Hua Li ; Jeffrey Xu Yu ; Xin Huang ; Hong Cheng ; Zechao Shang
【Abstract】: Measuring the robustness of complex networks is a fundamental task for analyzing their structure and function. In this paper, we study network robustness under the maximal vertex coverage (MVC) attack, where the attacker aims to delete as many edges of the network as possible by attacking a small fraction of nodes. First, we present two robustness metrics of complex networks based on the MVC attack. We then propose an efficient randomized greedy algorithm with a near-optimal performance guarantee for computing the proposed metrics. Finally, we conduct extensive experiments on 20 real datasets. The results show that P2P and co-authorship networks are extremely robust under the MVC attack, while both the online social networks and the Email communication networks exhibit vulnerability under the MVC attack. In addition, the results demonstrate the efficiency and effectiveness of our proposed algorithms for computing the corresponding robustness metrics.
【Keywords】: fm sketch; mvc attack; robustness; submodular function
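The MVC attack in the abstract above removes a small set of nodes so as to delete as many edges as possible. A minimal sketch of the textbook deterministic greedy heuristic for this kind of coverage objective follows: at each step it picks the node incident to the most not-yet-deleted edges. It is not the paper's randomized, sketch-based algorithm; the function name and the tie-breaking rule are assumptions.

```python
def greedy_vertex_coverage(edges, k):
    """Pick up to k nodes greedily; return (chosen nodes, number of edges deleted).

    edges: iterable of (u, v) pairs of an undirected graph; node labels
           must be mutually comparable (e.g. all ints or all strings).
    """
    all_edges = {frozenset(e) for e in edges}
    remaining = set(all_edges)
    chosen = []
    for _ in range(k):
        # Degree of each node counted over the edges that are still present.
        degree = {}
        for e in remaining:
            for v in e:
                degree[v] = degree.get(v, 0) + 1
        if not degree:
            break  # every edge is already deleted
        # Ties are broken by the smallest node label (an arbitrary choice).
        best = max(sorted(degree), key=degree.get)
        chosen.append(best)
        remaining = {e for e in remaining if best not in e}
    return chosen, len(all_edges) - len(remaining)
```

The fraction of edges deleted for a given budget `k` gives a simple indication of how vulnerable a network is to this kind of attack.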
【Paper Link】 【Pages】:1517-1521
【Authors】: Chong Long ; Xiubo Geng ; Chang Xu ; Sathiya Keerthi
【Abstract】: We consider the problem of extracting, in a domain-centric fashion, a given set of attributes from a large number of semi-structured websites. Previous approaches [7, 5] to solve this problem are based on page level inference. We propose a distinct new approach that directly chooses attribute extractors for a site using a scoring mechanism that is designed at the domain level via simple classification methods using a training set from a small number of sites. To keep the number of candidate extractors in each site manageably small we use two observations that hold in most domains: (a) imprecise annotators can be used to identify a small set of candidate extractors for a few attributes (anchors); and (b) non-anchor attributes lie in close proximity to the anchor attributes. Experiments on three domains (Events, Books and Restaurants) show that our approach is very effective in spite of its simplicity.
【Keywords】: information extraction; text mining
【Paper Link】 【Pages】:1522-1526
【Authors】: Tomonari Masada ; Atsuhiro Takasu
【Abstract】: This paper provides a topic model for extracting topic evolutions as a corpus-wide transition matrix among latent topics. Recent trends in text mining point to a high demand for exploiting metadata. In particular, the exploitation of reference relationships among documents induced by hyperlinking Web pages, citing scientific articles, tumblring blog posts, retweeting tweets, etc., is at the forefront of efforts toward effective mining. We focus on scholarly activities and propose a topic model for obtaining a corpus-wide view on how research topics evolve along citation relationships. Our model, called TERESA, extends latent Dirichlet allocation (LDA) by introducing a corpus-wide topic transition probability matrix, which models reference relationships as transitions among topics. Our approximated variational inference updates LDA posteriors and topic transition posteriors alternately. The main issue is the execution time, which amounts to $O(MK^2)$, where $K$ is the number of topics and $M$ is the number of links in the citation network. Therefore, we accelerate the inference with Nvidia CUDA compatible GPUs. We compare the effectiveness of TERESA with that of LDA by introducing a new measure called diversity plus focusedness (D+F). We also present examples of the topic evolutions our method produces.
【Keywords】: citation analysis; gpu; topic modeling
【Paper Link】 【Pages】:1527-1531
【Authors】: Bin Cao ; Jianwei Yin ; ShuiGuang Deng ; Dongjing Wang ; Zhaohui Wu
【Abstract】: How to improve modeling efficiency and accuracy has become a pressing problem. The popularization of recommendation techniques in E-Commerce provides new directions for addressing this problem. In this paper, we propose a graph-based workflow recommendation approach for improving business process modeling. The starting point is a so-called "workflow repository" that includes a set of already developed process models. A graph mining method is used to extract the process patterns from the repository. Based on graph edit distance (GED) [2], we calculate the distance between the patterns and the partial business process under construction, which is viewed as the reference model, and select the candidate nodes with smaller distances for recommendation. The performance study shows its feasibility for practical use.
【Keywords】: business process modeling; graph edit distance; graph-based; workflow recommendation
【Paper Link】 【Pages】:1532-1536
【Authors】: Ziawasch Abedjan ; Johannes Lorey ; Felix Naumann
【Abstract】: To integrate Linked Open Data, which originates from various and heterogeneous sources, the use of well-defined ontologies is essential. However, oftentimes the utilization of these ontologies by data publishers differs from the intended application envisioned by ontology engineers. This may lead to unspecified properties being used ad-hoc as predicates in RDF triples or it may result in infrequent usage of specified properties. These mismatches impede the goals and propagation of the Web of Data as data consumers face difficulties when trying to discover and integrate domain-specific information. In this work, we identify and classify common misusage patterns by employing frequency analysis and rule mining. Based on this analysis, we introduce an algorithm to propose suggestions for a data-driven ontology re-engineering workflow, which we evaluate on two large-scale RDF datasets.
【Keywords】: data mining; linked data; ontology engineering
【Paper Link】 【Pages】:1537-1541
【Authors】: Tianyu Li ; Pirooz Chubak ; Laks V. S. Lakshmanan ; Rachel Pottinger
【Abstract】: Extracting ontological relationships (e.g., ISA and HASA) from free-text repositories (e.g., engineering documents and instruction manuals) can improve users' queries, as well as benefit applications built for these domains. Current methods to extract ontologies from text usually miss many meaningful relationships because they either concentrate on single-word terms and short phrases or neglect syntactic relationships between concepts in sentences. We propose a novel pattern-based algorithm to find ontological relationships between complex concepts by exploiting parsing information to extract multi-word concepts and nested concepts. Our procedure is iterative: we tailor the constrained sequential pattern mining framework to discover new patterns. Our experiments on three real data sets show that our algorithm consistently and significantly outperforms previous representative ontology extraction algorithms.
【Keywords】: ontologies; ontology extraction
【Paper Link】 【Pages】:1542-1546
【Authors】: Zheng Lin ; Songbo Tan ; Xueqi Cheng ; Xueke Xu ; Weisong Shi
【Abstract】: A bilingual sentiment lexicon is a fundamental resource for cross-language sentiment analysis, but its compilation remains a major bottleneck in computational linguistics. Traditional word alignment algorithms face a large alignment space, which may introduce redundant computations as well as alignment errors. In this paper, we use collocation alignment to extract a bilingual sentiment lexicon, overcoming the drawbacks of word alignment. The idea of collocation alignment is inspired by the strong cohesion between feature words and opinion words in sentiment corpora. Experimental results show that our approach not only decreases the computing time dramatically but also improves the precision of extracted bilingual word pairs due to the smaller alignment space.
【Keywords】: bilingual sentiment lexicon; collocation alignment; word alignment
【Paper Link】 【Pages】:1547-1551
【Authors】: Lina Yao ; Quan Z. Sheng
【Abstract】: With recent advances in radio-frequency identification (RFID), wireless sensor networks, and Web services, physical things are becoming an integral part of the emerging ubiquitous Web. While this integration offers many exciting opportunities such as efficient supply chains and improved environmental monitoring, it also presents many significant challenges. One such challenge lies in how to classify, discover, and manage ubiquitous things, which is critical for efficient and effective object search, recommendation, and composition. In this paper, we focus on automatically classifying ubiquitous things into manageable semantic category labels by exploiting the information hidden in interactions between users and ubiquitous things. We develop a novel approach to extract latent relevance by building a relational network of ubiquitous things (RNUbiT) where similar things are linked via virtual edges according to their latent relevance. A discriminative learning algorithm is also developed to automatically determine category labels for ubiquitous things. We conducted experiments using real-world data and the experimental results demonstrate the feasibility and validity of our proposed approach.
【Keywords】: modularity; multi-label classification; relational learning; ubiquitous things discovery; web of things
【Paper Link】 【Pages】:1552-1556
【Authors】: Mingqi Lv ; Ling Chen ; Gencai Chen
【Abstract】: A place is a locale that is frequently visited by an individual user and carries important semantic meanings (e.g. home, work, etc.). Many location-aware applications will be greatly enhanced by the ability to automatically discover personally semantic places. The discovery of a user's personally semantic places involves obtaining the physical locations and semantic meanings of these places. In this paper, we propose approaches to address both of these problems. For the physical place extraction problem, a hierarchical clustering algorithm is proposed to first extract visit points from the GPS trajectories; these visit points are then clustered to form physical places. For the semantic place recognition problem, Bayesian networks (encoding the temporal patterns in which the places are visited) are used in combination with a customized POI (i.e. place of interest) database (containing the spatial features of the places) to categorize the extracted physical places into pre-defined types. An extensive set of experiments has been conducted to demonstrate the effectiveness of the proposed approaches based on a dataset of real-world GPS trajectories.
【Keywords】: gps trajectory; location-aware computing; place discovery; place recognition; spatial-temporal data mining
【Paper Link】 【Pages】:1557-1561
【Authors】: Hanbo Dai ; Feida Zhu ; Ee-Peng Lim ; HweeHwa Pang
【Abstract】: The recent boom of weblogs and social media has attached increasing importance to the identification of suspicious users with unusual behavior, such as spammers or fraudulent reviewers. A typical spamming strategy is to employ multiple dummy accounts to collectively promote a target, be it a URL or a product. Consequently, these suspicious accounts exhibit certain coherent anomalous behavior identifiable as a collection. In this paper, we propose the concept of Coherent Anomaly Collection (CAC) to capture this kind of collections, and put forward an efficient algorithm to simultaneously find the top-K disjoint CACs together with their anomalous behavior patterns. Compared with existing approaches, our new algorithm can find disjoint anomaly collections with coherent extreme behavior without having to specify either their number or sizes. Results on real Twitter data show that our approach discovers meaningful and informative hashtag spammer groups of various sizes which are hard to detect by clustering-based methods.
【Keywords】: anomaly collection/cluster; anomaly/outlier detection
【Paper Link】 【Pages】:1562-1566
【Authors】: Daifeng Li ; Xin Shuai ; Guozheng Sun ; Jie Tang ; Ying Ding ; Zhipeng Luo
【Abstract】: This paper proposes a Topic-Level Opinion Influence Model (TOIM) that simultaneously incorporates topic factor, user opinions and social influence in a unified probabilistic model with a two-stage learning process. In the first stage, topic factor and user influence are integrated to generate users' influential relationships based on different topics; in the second stage, users' historical messages and social interaction records are leveraged by TOIM to construct their historical opinions and neighbors' opinion influence through a statistical learning process, which can be further utilized to predict users' future opinions on some specific topics. We evaluate TOIM on a large-scale dataset from Tencent Weibo, one of the largest microblogging websites in China. The experimental results show that TOIM can better predict users' opinions than other baseline methods.
【Keywords】: opinion mining; sentiment analysis; social influence; topic modeling
【Paper Link】 【Pages】:1567-1571
【Authors】: Xiangnan Kong ; Philip S. Yu ; Ying Ding ; David J. Wild
【Abstract】: Collective classification approaches exploit the dependencies of a group of linked objects whose class labels are correlated and need to be predicted simultaneously. In this paper, we focus on studying the collective classification problem in heterogeneous networks, which involve multiple types of data objects interconnected by multiple types of links. Intuitively, two objects are correlated if they are linked by many paths in the network. By considering different linkage paths in the network, one can capture the subtlety of different types of dependencies among objects. We introduce the concept of meta-path based dependencies among objects, where a meta path is a path consisting of a certain sequence of link types. We show that the quality of collective classification results strongly depends upon the meta paths used. To accommodate the large network size, a novel solution, called HCC (meta-path based Heterogeneous Collective Classification), is developed to effectively assign labels to a group of instances that are interconnected through different meta paths. The proposed HCC model can capture different types of dependencies among objects with respect to different meta paths. Empirical studies on real-world networks demonstrate the effectiveness of the proposed meta path-based collective classification approach.
【Keywords】: heterogeneous information networks; meta path
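The notion of a meta path in the abstract above (a path that follows a fixed sequence of link types) can be illustrated with a small counting routine. This is only a minimal sketch and not the HCC method itself: the edge-triple format, the function name `count_meta_path_instances`, and the toy link types are assumptions made for illustration.

```python
from collections import defaultdict


def count_meta_path_instances(edges, meta_path, start):
    """Count path instances from `start` that follow `meta_path`.

    edges: iterable of (source, link_type, target) triples
           (a hypothetical input format chosen for this sketch).
    meta_path: sequence of link types, e.g. ["cites", "written_by"].
    Returns a dict mapping each reachable end node to the number of
    distinct paths from `start` that match the link-type sequence.
    """
    # Index outgoing neighbours by (node, link_type).
    out = defaultdict(list)
    for src, link_type, dst in edges:
        out[(src, link_type)].append(dst)

    # counts[node] = number of matching path prefixes ending at node.
    counts = {start: 1}
    for link_type in meta_path:
        next_counts = defaultdict(int)
        for node, n_paths in counts.items():
            for nbr in out[(node, link_type)]:
                next_counts[nbr] += n_paths
        counts = dict(next_counts)
    return counts


# Toy heterogeneous network: papers cite papers, papers are written by authors.
toy_edges = [
    ("p1", "cites", "p2"),
    ("p1", "cites", "p3"),
    ("p2", "written_by", "a1"),
    ("p3", "written_by", "a1"),
    ("p3", "written_by", "a2"),
]
print(count_meta_path_instances(toy_edges, ["cites", "written_by"], "p1"))
```

In the toy network, paper p1 reaches author a1 through two cited papers and author a2 through one, which matches the intuition that objects linked by many paths along a meta path are more strongly correlated.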
【Paper Link】 【Pages】:1572-1576
【Authors】: Yi Song ; Panagiotis Karras ; Sadegh Nobari ; Giorgos Cheliotis ; Mingqiang Xue ; Stéphane Bressan
【Abstract】: The proliferation of online social networks has created intense interest in studying their nature and revealing information of interest to the end user. At the same time, such revelation raises privacy concerns. Existing research addresses this problem following an approach popular in the database community: a model of data privacy is defined, and the data is rendered in a form that satisfies the constraints of that model while aiming to maximize some utility measure. Still, there is no consensus on a clear and quantifiable utility measure over graph data. In this paper, we take a different approach: we define a utility guarantee, in terms of certain graph properties being preserved, that should be respected when releasing data, while otherwise distorting the graph to an extent desired for the sake of confidentiality. We propose a form of data release which builds on current practice in social network platforms: a user may want to see a subgraph of the network graph, in which that user as well as connections and affiliates participate. Such a snapshot should not allow malicious users to gain private information, yet provide useful information for benevolent users. We propose a mechanism to prepare data for user view under this setting. In an experimental study with real data, we demonstrate that our method preserves several properties of interest more successfully than methods that randomly distort the graph to an equal extent, while withstanding structural attacks proposed in the literature.
【Keywords】: data utility; security and privacy; social network
【Paper Link】 【Pages】:1577-1581
【Authors】: Maxim Zhukovskiy ; Dmitry Vinogradov ; Yuri Pritykin ; Liudmila Ostroumova ; Evgeny Grechnikov ; Gleb Gusev ; Pavel Serdyukov ; Andrei M. Raigorodskii
【Abstract】: We consider the Buckley-Osthus implementation of preferential attachment and its ability to model the web host graph in two aspects. One is the degree distribution, which we observe to follow the power law, as is often the case for real-world graphs. The other is the two-dimensional edge distribution, the number of edges between vertices of given degrees. We fit a single "initial attractiveness" parameter a of the model, first with respect to the degree distribution of the web host graph, and then, absolutely independently, with respect to the edge distribution. Surprisingly, the values of a we obtain turn out to be nearly the same. Therefore the same model with the same value of the parameter a fits the two independent and basic aspects of the web host graph very well. In addition, we demonstrate that other models completely lack the asymptotic behavior of the edge distribution of the web host graph, even when accurately capturing the degree distribution. To the best of our knowledge, this is the first study confirming the ability of preferential attachment models to reflect the distribution of edges between vertices with respect to their degrees in a real graph of the Internet.
【Keywords】: assortative mixing; buckley-osthus random graphs; edge distribution with respect to vertex degrees; power law degree distribution; preferential attachment; web host graph
【Paper Link】 【Pages】:1582-1586
【Authors】: Huiji Gao ; Jiliang Tang ; Huan Liu
【Abstract】: Location-based social networks (LBSNs) have attracted an increasing number of users in recent years. The availability of geographical and social information of online LBSNs provides an unprecedented opportunity to study the human movement from their socio-spatial behavior, enabling a variety of location-based services. Previous work on LBSNs reported limited improvements from using the social network information for location prediction; as users can check-in at new places, traditional work on location prediction that relies on mining a user's historical trajectories is not designed for this "cold start" problem of predicting new check-ins. In this paper, we propose to utilize the social network information for solving the "cold start" location prediction problem, with a geo-social correlation model to capture social correlations on LBSNs considering social networks and geographical distance. The experimental results on a real-world LBSN demonstrate that our approach properly models the social correlations of a user's new check-ins by considering various correlation strengths and correlation measures.
【Keywords】: geo-social correlation; location prediction; location recommendation; location-based social networks
【Paper Link】 【Pages】:1587-1591
【Authors】: Ido Guy ; Tal Steier ; Maya Barnea ; Inbal Ronen ; Tal Daniel
【Abstract】: Activity streams have become prevalent on the web and are starting to emerge in enterprises. In this work, we present Streamz, a novel application that uses a faceted search approach to provide employees with advanced capabilities of search, navigation, attention management, and other types of analytics on top of an enterprise activity stream. We provide a detailed description of the Streamz tool as well as usage analysis based on user interface logs and interviews of active users.
【Keywords】: activity streams; enterprise; social media
【Paper Link】 【Pages】:1592-1596
【Authors】: Ernesto Diaz-Aviles ; Lucas Drumond ; Zeno Gantner ; Lars Schmidt-Thieme ; Wolfgang Nejdl
【Abstract】: Users engaged in the Social Web increasingly rely upon continuous streams of Twitter messages (tweets) for real-time access to information and fresh knowledge about current affairs. However, given the deluge of tweets, it is a challenge for individuals to find relevant and appropriately ranked information. We propose to address this knowledge management problem by going beyond the general perspective of information finding in Twitter, that asks: "What is happening right now?", towards an individual user perspective, and ask: "What is interesting to me right now?" In this paper, we consider collaborative filtering as an online ranking problem and present RMFO, a method that creates, in real-time, user-specific rankings for a set of tweets based on individual preferences that are inferred from the user's past system interactions. Experiments on the 476 million Twitter tweets dataset show that our online approach largely outperforms recommendations based on Twitter's global trend and Weighted Regularized Matrix Factorization (WRMF), a highly competitive state-of-the-art Collaborative Filtering technique, demonstrating the efficacy of our approach.
【Keywords】: collaborative filtering; online ranking; twitter
【Paper Link】 【Pages】:1597-1601
【Authors】: Luca Bonomi ; Li Xiong ; Rui Chen ; Benjamin C. M. Fung
【Abstract】: In this paper, we study the problem of privacy preserving record linkage which aims to perform record linkage without revealing anything about the non-linked records. We propose a new secure embedding strategy based on frequent variable length grams which allows record linkage on the embedded space. The frequent grams used for constructing the embedding base are mined from the original database under the framework of differential privacy. Compared with the state-of-the-art secure matching schema [15], our approach provides formal, provable privacy guarantees and achieves better scalability while providing comparable utility.
【Keywords】: differential privacy; privacy; record linkage; security
【Paper Link】 【Pages】:1602-1606
【Authors】: Amir Asiaee T. ; Mariano Tepper ; Arindam Banerjee ; Guillermo Sapiro
【Abstract】: Extracting sentiment from Twitter data is one of the fundamental problems in social media analytics. Twitter's length constraint renders determining the positive/negative sentiment of a tweet difficult, even for a human judge. In this work we present a general framework for per-tweet (in contrast with batches of tweets) sentiment analysis which consists of: (1) extracting tweets about a desired target subject, (2) separating tweets with sentiment, and (3) setting apart positive from negative tweets. For each step, we study the performance of a number of classical and new machine learning algorithms. We also show that the intrinsic sparsity of tweets allows performing classification in a low dimensional space, via random projections, without losing accuracy. In addition, we present weighted variants of all employed algorithms, exploiting the available labeling uncertainty, which further improve classification accuracy. Finally, we show that spatially aggregating our per-tweet classification results produces a very satisfactory outcome, making our approach a good candidate for batch tweet sentiment analysis.
【Keywords】: bayes classification; compressed learning; sparse modeling; supervised learning; svm; twitter sentiment analysis
【Paper Link】 【Pages】:1607-1611
【Authors】: Chen Lin ; Runquan Xie ; Lei Li ; Zhenhua Huang ; Tao Li
【Abstract】: A variety of news recommender systems based on different strategies have been proposed to provide news personalization services for online news readers. However, little research work has been reported on utilizing the implicit "social" factors (i.e., the potential influential experts in news reading community) among news readers to facilitate news personalization. In this paper, we investigate the feasibility of integrating content-based methods, collaborative filtering and information diffusion models by employing probabilistic matrix factorization techniques. We propose PRemiSE, a novel Personalized news Recommendation framework via implicit Social Experts, in which the opinions of potential influencers on virtual social networks extracted from implicit feedbacks are treated as auxiliary resources for recommendation. Empirical results demonstrate the efficacy and effectiveness of our method, particularly, on handling the so-called cold-start problem.
【Keywords】: expert; matrix factorization; news recommendation; social network
【Paper Link】 【Pages】:1612-1616
【Authors】: Xianling Mao ; Jing He ; Hongfei Yan ; Xiaoming Li
【Abstract】: Many document collections are well organized in a hierarchical structure, and such structure can help users browse and understand these collections. Meanwhile, there are a large number of loosely organized plain document collections, and it is difficult for users to understand them effectively. In this paper we study how to automatically integrate latent topics in a plain collection with the topics in a hierarchically structured collection. We propose to use semi-supervised topic modeling to solve the problem in a principled way. The experiments show that the proposed method can both generate meaningful latent topics and expand high-quality hierarchical topic structures.
【Keywords】: hierarchical topic modeling; topical integration
【Paper Link】 【Pages】:1617-1621
【Authors】: Hengshu Zhu ; Huanhuan Cao ; Enhong Chen ; Hui Xiong ; Jilei Tian
【Abstract】: A key step for the mobile app usage analysis is to classify apps into some predefined categories. However, it is a nontrivial task to effectively classify mobile apps due to the limited contextual information available for the analysis. To this end, in this paper, we propose an approach to first enrich the contextual information of mobile apps by exploiting the additional Web knowledge from the Web search engine. Then, inspired by the observation that different types of mobile apps may be relevant to different real-world contexts, we also extract some contextual features for mobile apps from the context-rich device logs of mobile users. Finally, we combine all the enriched contextual information into a Maximum Entropy model for training a mobile app classifier. The experimental results based on 443 mobile users' device logs clearly show that our approach outperforms two state-of-the-art benchmark methods with a significant margin.
【Keywords】: automatic mobile app classification; real-world contexts; web knowledge
【Paper Link】 【Pages】:1622-1626
【Authors】: Fang Li ; Tingting He ; Xinhui Tu ; Xiaohua Hu
【Abstract】: This paper presents a tag-topic model with Dirichlet Forest prior (TTM-DF) for semantic knowledge acquisition from blogs. The TTM-DF model extends the tag-topic model (TTM) by replacing the Dirichlet prior with the Dirichlet Forest prior over the topic-word multinomial. The correlations between words are calculated to generate a set of Must-Links and Cannot-Links; the structures of Dirichlet trees are then obtained through encoding the constraints of Must-Links and Cannot-Links. Words under the same subtrees are expected to be more correlated than words under different subtrees. We conduct experiments on a synthetic dataset and a blog dataset. Both sets of experimental results show that the TTM-DF model performs much better than the TTM model. It can improve the coherence of the underlying topics and the tag-topic distributions, and capture semantic knowledge effectively.
【Keywords】: blog; dirichlet forest prior; tag; topic model
【Paper Link】 【Pages】:1627-1631
【Authors】: Rashmi Gangadharaiah ; Rose Catherine
【Abstract】: Online forums provide a channel for users to report and discuss problems related to products and troubleshooting, for faster resolution. These could garner negative publicity if left unattended by the companies. Manually monitoring these massive amounts of discussions is laborious. This paper makes the first attempt at collecting issues that require immediate action by the product supplier by analyzing the immense information on forums. Features that are specific to forum discussions, in conjunction with linguistic cues, help in capturing and better prioritizing issues. Any attempt to collect training data for learning a classifier for this task will require enormous labeling effort. Hence, this paper adopts a co-training approach, which uses minimal manual labeling, coupled with linguistic features extracted using a set-expansion algorithm to discover severe problems. Further, the most distinct and recent issues are obtained by incorporating measures of 'centrality' and 'diversity' and the temporal aspect of the forum threads. We show that this helps in better prioritizing longstanding issues and identifying issues that need to be addressed immediately.
【Keywords】: discovering and prioritizing severe technical issues
【Paper Link】 【Pages】:1632-1636
【Authors】: Raúl Ernesto Gutiérrez de Piñerez Reyes ; Juan Francisco Díaz-Frías
【Abstract】: Informal Mathematical Discourse (IMD) is characterized by the mixture of natural language and symbolic expressions in the context of textbooks, publications in mathematics and mathematical proof. We focus on IMD processing at the low level of discourse. In this paper, we propose a preprocessing phase that precedes IMD structure analysis within the context of Controlled Natural Language (CNL). Our contribution is defined in the context of IMD processing and the use of machine learning: first, we present a CNL, a pure corpus and a Mathematical Treebank for processing IMD; second, we present a preprocessing phase for IMD analysis with connective disambiguation and verb treatment; finally, we obtain a satisfactory result on input text parsing using a statistical parsing model. We will propagate these results for classification of argumentative informal practices via the low level discourse in IMD processing.
【Keywords】: connective tagging; controlled natural language; corpus; informal mathematical discourse; statistical parsing
【Paper Link】 【Pages】:1637-1641
【Authors】: Sangkeun Lee ; Sungchan Park ; Minsuk Kahng ; Sang-goo Lee
【Abstract】: In this paper, we present a novel random-walk based node ranking measure, PathRank, which is defined on a heterogeneous graph by extending the Personalized PageRank algorithm. Not only can our proposed measure exploit the semantics behind the different types of nodes and edges in a heterogeneous graph, but it can also emulate various recommendation semantics such as collaborative filtering, content-based filtering, and their combinations. The experimental results show that PathRank can produce more diverse and effective recommendation results compared to existing approaches.
【Keywords】: flexibility; graph; heterogeneity; network; pagerank; personalized pagerank; ranking; recommender systems
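PathRank is described above as an extension of Personalized PageRank. As background, the following is a minimal sketch of plain Personalized PageRank by power iteration on a homogeneous graph; it is not PathRank, and the adjacency-dict input format, the function name, and the default restart probability of 0.15 are assumptions made for illustration.

```python
def personalized_pagerank(graph, source, restart=0.15, iterations=50):
    """Approximate Personalized PageRank scores with respect to `source`.

    graph: dict mapping each node to a list of its out-neighbours;
           every node must appear as a key (assumed input format).
    restart: probability of jumping back to `source` at each step.
    """
    nodes = list(graph)
    rank = {n: 0.0 for n in nodes}
    rank[source] = 1.0  # the walk starts at the source node

    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            nbrs = graph[n]
            if nbrs:
                # Spread the non-restart mass evenly over out-links.
                share = (1.0 - restart) * rank[n] / len(nbrs)
                for m in nbrs:
                    nxt[m] += share
            else:
                # Dangling node: send its mass back to the source.
                nxt[source] += (1.0 - restart) * rank[n]
        # Total mass stays 1, so the restart step adds exactly `restart`.
        nxt[source] += restart
        rank = nxt
    return rank


# Toy graph: user "u" links to items "i1" and "i2"; "i3" is disconnected.
toy_graph = {"u": ["i1", "i2"], "i1": ["u"], "i2": ["u"], "i3": []}
scores = personalized_pagerank(toy_graph, "u")
print(sorted(scores, key=scores.get, reverse=True))
```

Nodes that cannot be reached from the source keep a score of zero, which is what makes the ranking specific to the chosen source node.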
【Paper Link】 【Pages】:1642-1646
【Authors】: Yue Lu ; Hongning Wang ; ChengXiang Zhai ; Dan Roth
【Abstract】: With more and more people freely expressing opinions as well as actively interacting with each other in discussion threads, online forums are becoming a gold mine with rich information about people's opinions and social behaviors. In this paper, we study an interesting new problem of automatically discovering opposing opinion networks of users from forum discussions, which are subsets of users who are strongly against each other on some topic. Toward this goal, we propose to use signals from both textual content (e.g., who says what) and social interactions (e.g., who talks to whom), which are both abundant in online forums. We also design an optimization formulation to combine all the signals in an unsupervised way. We created a data set by manually annotating forum data on five controversial topics, and our experimental results show that the proposed optimization method outperforms several baselines and existing approaches, demonstrating the power of combining both text analysis and social network analysis in analyzing and generating the opposing opinion networks.
【Keywords】: linear programming; online forums; opinion analysis; optimization; social network analysis
【Paper Link】 【Pages】:1647-1651
【Authors】: Guangyou Zhou ; Li Cai ; Kang Liu ; Jun Zhao
【Abstract】: This work investigates selecting concise labels for the newly-arising topics in community question answering. Previous methods of generating labels do not take the information of the existing category hierarchy into consideration. The main motivation of our paper is to utilize this information in the label generation process. We propose a general framework to address this problem. Firstly, we map the questions into Wikipedia concept sets, which are more meaningful than terms. Secondly, important concepts are identified to represent the main focus of the newly-arising topics. Thirdly, candidate labels are extracted from the Wikipedia category graph. Finally, candidate labels are filtered and reranked by a combination of structure information of the existing category hierarchy and the Wikipedia category graph. The experiments show that in our test collections, about 80% of the "correct" labels appear in the top ten labels recommended by our system.
【Keywords】: category hierarchy; community question answering; newly-arising topics
【Paper Link】 【Pages】:1652-1656
【Authors】: Wenpeng Yin ; Yulong Pei ; Fan Zhang ; Lian'en Huang
【Abstract】: Query-oriented relevance, information richness and novelty are important requirements in query-focused summarization, which, to a considerable extent, determine the summary quality. Previous work either rarely took into account all of the above requirements simultaneously or dealt with part of them in the dynamic process of choosing sentences to generate a summary. In this paper, we propose a novel approach that integrates all these requirements by treating them as sentence features, so that the generated summary fully reflects the combined effect of these properties. Experimental results on the DUC2005 and DUC2006 datasets demonstrate the effectiveness of our approach.
【Keywords】: lda; query-biased sentence feature; query-focused summarization; topical vector space model
【Paper Link】 【Pages】:1657-1661
【Authors】: Huizhi Liang ; Yue Xu ; Dian Tjondronegoro ; Peter Christen
【Abstract】: Topic recommendation can help users deal with the information overload issue in micro-blogging communities. This paper proposes to use the implicit information network formed by the multiple relationships among users, topics and micro-blogs, and the temporal information of micro-blogs to find semantically and temporally relevant topics of each topic, and to profile users' time-drifting topic interests. The Content based, Nearest Neighborhood based and Matrix Factorization models are used to make personalized recommendations. The effectiveness of the proposed approaches is demonstrated in experiments conducted on a real-world dataset collected from Twitter.com.
【Keywords】: collaborative filtering; micro-blogs; personalization; temporal dynamics; topic recommendation; web 2.0
【Paper Link】 【Pages】:1662-1666
【Authors】: Guangyou Zhou ; Siwei Lai ; Kang Liu ; Jun Zhao
【Abstract】: In this paper, we address the problem of expert finding in community question answering (CQA). Most of the existing approaches attempt to find experts in CQA by means of link analysis techniques. However, these traditional techniques only consider the link structure while ignoring the topical similarity among users (askers and answerers) as well as user expertise and user reputation. In this study, we propose a topic-sensitive probabilistic model, which is an extension of the PageRank algorithm, to find experts in CQA. Compared to the traditional link analysis techniques, our proposed method is more effective because it finds the experts by taking into account both the link structure and the topical similarity among users. We conduct experiments on a real-world data set from Yahoo! Answers. Experimental results show that our proposed method significantly outperforms the traditional link analysis techniques and achieves state-of-the-art performance for expert finding in CQA.
【Keywords】: expert finding; pagerank; yahoo! answers
【Paper Link】 【Pages】:1667-1671
【Authors】: Jinoh Oh ; Hwanjo Yu
【Abstract】: Sampling is one of the fundamental techniques for data preprocessing and mining. It helps to reduce computational costs and improve the mining quality. A sampling method is typically developed independently for a specific problem and for a specific user's interest, because it is hard to develop a method that generalizes across various users' interests. The absence of a general framework for sampling makes it inefficient to develop or revise a sampling method as a user's interest changes. This paper proposes a general framework, isampling, which helps a user develop sampling methods and easily modify the sampling interest encoded in the method. In the framework, a user explicitly describes her sampling interest in a graph model called an interest model. Then, isampling automatically selects a sample set according to the model, which satisfies the user's interest. In order to demonstrate the effectiveness of our framework, we develop new trajectory sampling methods using our framework; trajectory sampling has been a challenging problem due to the high complexity of its data and the variety of users' interests. We demonstrate the flexibility of our framework by showing how easily trajectory samples of different interests can be generated within our framework.
【Keywords】: model-based sampling; sampling framework for various interest; trajectory sampling
【Paper Link】 【Pages】:1672-1676
【Authors】: Andrea Moro ; Roberto Navigli
【Abstract】: In this paper we present an approach for building a Wikipedia-based semantic network by integrating Open Information Extraction with Knowledge Acquisition techniques. Our algorithm extracts relation instances from Wikipedia page bodies and ontologizes them by, first, creating sets of synonymous relational phrases, called relation synsets, second, assigning semantic classes to the arguments of these relation synsets and, third, disambiguating the initial relation instances with relation synsets. As a result we obtain WiSeNet, a Wikipedia-based Semantic Network with Wikipedia pages as concepts and labeled, ontologized relations between them.
【Keywords】: information extraction; knowledge acquisition; relation ontologization; semantic network
【Paper Link】 【Pages】:1677-1681
【Authors】: Arnau Prat-Pérez ; David Dominguez-Sal ; Josep M. Brunat ; Josep-Lluis Larriba-Pey
【Abstract】: Community detection has arisen as one of the most relevant topics in the field of graph data mining due to its importance in many fields such as biology, social networks or network traffic analysis. The metrics proposed to shape communities are too lax and do not consider the internal layout of the edges in the community, which leads to undesirable results. We define a new community metric called WCC. The proposed metric meets a minimum set of basic properties that guarantees communities with structure and cohesion. We experimentally show that WCC correctly quantifies the quality of communities and community partitions using real and synthetic datasets, and compare some of the most used community detection algorithms in the state of the art.
【Keywords】: community detection; conductance; modularity; social networks
【Paper Link】 【Pages】:1682-1686
【Authors】: Ida Mele ; Francesco Bonchi ; Aristides Gionis
【Abstract】: In this paper we present a novel graph-based data abstraction for modeling the browsing behavior of web users. The objective is to identify users who discover interesting pages before others. We call these users early adopters. By tracking the browsing activity of early adopters we can identify new interesting pages early, and recommend these pages to similar users. We focus on news and blog pages, which are more dynamic in nature and more appropriate for recommendation. Our proposed model is called the early-adopter graph. In this graph, nodes represent users and a directed arc between users u and v expresses the fact that u and v visit similar pages and, in particular, that user u tends to visit those pages before user v. The weight of the edge is the degree to which the temporal rule "u visits a page before v" holds. Based on the early-adopter graph, we build a recommendation system for news and blog pages, which outperforms other off-the-shelf recommendation systems based on collaborative filtering.
【Keywords】: early-adopter graph; log mining; user-browsing analysis; web-page recommendation
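The early-adopter graph described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define the edge weight precisely, so the sketch assumes it is the fraction of commonly visited pages that u reached first. The function name `early_adopter_graph`, the `min_common` parameter and the `(user, page, timestamp)` log format are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def early_adopter_graph(visits, min_common=1):
    """Build a weighted directed graph from (user, page, timestamp) records.

    Assumption: the weight of arc (u, v) is the fraction of pages visited by
    both users that u visited strictly before v.
    """
    # Earliest visit time of each page per user.
    first_visit = defaultdict(dict)
    for user, page, ts in visits:
        prev = first_visit[user].get(page)
        if prev is None or ts < prev:
            first_visit[user][page] = ts

    edges = {}
    for u, v in combinations(sorted(first_visit), 2):
        common = first_visit[u].keys() & first_visit[v].keys()
        if len(common) < min_common:
            continue
        u_first = sum(1 for p in common if first_visit[u][p] < first_visit[v][p])
        v_first = sum(1 for p in common if first_visit[v][p] < first_visit[u][p])
        if u_first:
            edges[(u, v)] = u_first / len(common)
        if v_first:
            edges[(v, u)] = v_first / len(common)
    return edges


log = [
    ("a", "p1", 1), ("b", "p1", 2),
    ("a", "p2", 5), ("b", "p2", 3),
    ("a", "p3", 1), ("b", "p3", 4),
]
graph = early_adopter_graph(log)
```

Here user "a" reaches two of the three shared pages first, so the arc from "a" to "b" carries the larger weight.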
【Paper Link】 【Pages】:1687-1691
【Authors】: Ping Li ; Jiajun Bu ; Chun Chen ; Zhanying He
【Abstract】: Co-clustering aims to group the samples and features simultaneously. It takes advantage of the duality between the samples and features. In many real-world applications, the data points or features usually reside on a submanifold of the ambient Euclidean space, but it is nontrivial to estimate the intrinsic manifolds in a principled way. In this study, we focus on improving the co-clustering performance via manifold ensemble learning, which aims to maximally approximate the intrinsic manifolds of both the sample and feature spaces. To achieve this, we develop a novel co-clustering algorithm called Relational Multi-manifold Co-clustering (RMC) based on symmetric nonnegative matrix tri-factorization, which decomposes the relational data matrix into three matrices. This method considers the inter-type relationship revealed by the relational data matrix and the intra-type information reflected by the affinity matrices. Specifically, we assume the intrinsic manifold of the sample or feature space lies in a convex hull of a group of pre-defined candidate manifolds, and we learn an appropriate convex combination of them to approximate the desired intrinsic manifold. To optimize the objective, multiplicative rules are used to update the factorized matrices and the entropic mirror descent algorithm is used to automatically learn the manifold coefficients. Experimental results demonstrate the superiority of the proposed algorithm.
【Keywords】: co-clustering; entropic mirror descent algorithm; manifold ensemble learning; nonnegative matrix tri-factorization
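The abstract above learns convex-combination coefficients with entropic mirror descent. A single generic step of that algorithm on the probability simplex looks like the sketch below. The gradient values are placeholders, since the abstract does not state the RMC objective, and the function name and `step` parameter are illustrative.

```python
import math


def entropic_mirror_descent_step(weights, grad, step):
    """One entropic mirror descent (exponentiated gradient) update.

    Each weight is multiplied by exp(-step * gradient) and the result is
    renormalized, so the weights stay non-negative and sum to one.
    """
    unnormalized = [w * math.exp(-step * g) for w, g in zip(weights, grad)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Two candidate manifolds with equal starting weight; the first has the
# larger (hypothetical) gradient, so its coefficient shrinks.
new_weights = entropic_mirror_descent_step([0.5, 0.5], [1.0, 0.0], step=math.log(2))
```

The multiplicative form is what keeps the coefficients a valid convex combination without an explicit projection step.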
【Paper Link】 【Pages】:1692-1696
【Authors】: George Tsatsaronis ; Iraklis Varlamis ; Kjetil Nørvåg
【Abstract】: Traditional document indexing techniques store documents using easily accessible representations, such as inverted indices, which can efficiently scale for large document sets. These structures offer scalable and efficient solutions in text document management tasks, but they omit the cornerstone of the documents' purpose: meaning. They also neglect semantic relations that bind terms into coherent fragments of text that convey messages. When semantic representations are employed, the documents are mapped to the space of concepts and the similarity measures are adapted appropriately to better fit the retrieval tasks. However, these methods can be slow both at indexing and retrieval time. In this paper we propose SemaFor, an indexing algorithm for text documents, which uses semantic spanning forests constructed from lexical resources, such as Wikipedia and WordNet, and spectral graph theory in order to represent documents for further processing.
【Keywords】: document indexing; semantic graphs; text representation
【Paper Link】 【Pages】:1697-1701
【Authors】: Pablo N. Mendes ; Peter Mika ; Hugo Zaragoza ; Roi Blanco
【Abstract】: Query logs record the actual usage of search systems and their analysis has proven critical to improving search engine functionality. Yet, despite the deluge of information, query log analysis often suffers from the sparsity of the query space. Based on the observation that most queries pivot around a single entity that represents the main focus of the user's need, we propose a new model for query log data called the entity-aware click graph. In this representation, we decompose queries into entities and modifiers, and measure their association with clicked pages. We demonstrate the benefits of this approach on the crucial task of understanding which websites fulfill similar user needs, showing that using this representation we can achieve a higher precision than other query log-based approaches.
【Keywords】: click graph; query logs; website similarity
【Paper Link】 【Pages】:1702-1706
【Authors】: Freddy Chong Tat Chua ; William W. Cohen ; Justin Betteridge ; Ee-Peng Lim
【Abstract】: Many event monitoring systems rely on counting known keywords in streaming text data to detect sudden spikes in frequency. But the dynamic and conversational nature of Twitter makes it hard to select known keywords for monitoring. Here we consider a method of automatically finding noun phrases (NPs) as keywords for event monitoring in Twitter. Finding NPs has two aspects: identifying the boundaries of the subsequence of words that represents the NP, and classifying the NP into a specific broad category such as politics or sports. To classify an NP, we define the feature vector for the NP using not just the words but also the author's behavior and social activities. Our results show that we can classify many NPs by using a sample of training data from a knowledge base.
【Keywords】: named entities; noun phrases; social media; twitter
【Paper Link】 【Pages】:1707-1711
【Authors】: Raju Balakrishnan ; Rushi P. Bhatt
【Abstract】: Group-buying ads seeking a minimum number of customers before the deal expiry are increasingly used by daily-deal providers. Unlike traditional web ads, the advertiser's profits for group-buying ads depend on the time to expiry and the additional customers needed to satisfy the minimum group size. Since both these quantities are time-dependent, optimal bid amounts to maximize profits change with every impression. Consequently, traditional static bidding strategies are far from optimal. Instead, bid values need to be optimized in real-time to maximize expected bidder profits. This online optimization of deal profits is made possible by the advent of ad exchanges offering real-time (spot) bidding. To this end, we propose a real-time bidding strategy for group-buying deals based on the online optimization of the bid values. We derive the expected bidder profit of deals as a function of the bid amounts, and dynamically vary bids to maximize profits. Further, to satisfy the time constraints of online bidding, we present methods for minimizing computation time. We evaluate the proposed bidding strategy on a multi-million click stream of 935 ads. The method shows significant profit improvement over existing strategies.
【Keywords】: daily deals; display ads; group-buying; real-time bidding
【Paper Link】 【Pages】:1712-1716
【Authors】: Nurcan Durak ; Ali Pinar ; Tamara G. Kolda ; C. Seshadhri
【Abstract】: Triangles are an important building block and distinguishing feature of real-world networks, but their structure is still poorly understood. Despite numerous reports on the abundance of triangles, there is very little information on what these triangles look like. We initiate the study of degree-labeled triangles, specifically, degree homogeneity versus heterogeneity in triangles. This yields new insight into the structure of real-world graphs. We observe that networks coming from social and collaborative situations are dominated by homogeneous triangles, i.e., degrees of vertices in a triangle are quite similar to each other. On the other hand, information networks (e.g., web graphs) are dominated by heterogeneous triangles, i.e., the degrees in triangles are quite disparate. Surprisingly, nodes within the top 1% of degrees participate in the vast majority of triangles in heterogeneous graphs. We investigate whether current graph models reproduce the types of triangles that are observed in real data and observe that most models fail to accurately capture these salient features.
【Keywords】: graph models; social networks; triangles in graphs
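To make the notion of degree-labeled triangles above concrete, the sketch below enumerates the triangles of a small undirected graph and labels each with the ratio of its largest to smallest vertex degree. Using that ratio as the heterogeneity measure is an assumption made for illustration; the abstract does not say how homogeneity is quantified. Node labels are assumed to be mutually comparable so they can be sorted.

```python
from itertools import combinations


def triangle_degree_ratios(edges):
    """Map each triangle (u, v, w) with u < v < w to max_degree / min_degree."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    degree = {node: len(nbrs) for node, nbrs in adj.items()}

    ratios = {}
    for u in adj:
        # Pairs come out ordered (v < w); requiring u < v reports each
        # triangle exactly once, from its smallest vertex.
        for v, w in combinations(sorted(adj[u]), 2):
            if u < v and w in adj[v]:
                degs = [degree[u], degree[v], degree[w]]
                ratios[(u, v, w)] = max(degs) / min(degs)
    return ratios


# One triangle (1, 2, 3) in which vertex 3 is a hub with three extra neighbors.
ratios = triangle_degree_ratios([(1, 2), (2, 3), (1, 3), (3, 4), (3, 5), (3, 6)])
```

A ratio near 1 corresponds to a homogeneous triangle, while a large ratio indicates a triangle attached to a high-degree hub.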
【Paper Link】 【Pages】:1717-1721
【Authors】: Suradej Intagorn ; Kristina Lerman
【Abstract】: User-generated content, such as photos and videos, is often annotated by users with free-text labels, called tags. Increasingly, such content is also georeferenced, i.e., it is associated with geographic coordinates. The implicit relationships between tags and their locations can tell us much about how people conceptualize places and relations between them. However, extracting such knowledge from social annotations presents many challenges, since annotations are often ambiguous, noisy, uncertain and spatially inhomogeneous. We introduce a probabilistic framework for modeling georeferenced annotations and a method for learning model parameters from data. The framework is flexible and general, and can be used in a variety of applications that mine geospatial knowledge from user-generated content. Specifically, we study three problems: extracting place semantics, predicting locations of photos and learning part-of relations between places. We show our method performs well compared to state-of-the-art approaches developed for the first two problems, and offers a novel solution to the problem of learning relations between places.
【Keywords】: data mining; geo-spatial; information extraction; social network
【Paper Link】 【Pages】:1722-1726
【Authors】: Fernando Gutierrez ; Dejing Dou ; Stephen Fickas ; Gina Griffiths
【Abstract】: Automatic grading systems for summaries and essays have been studied for years. Most commercial and research implementations are based on statistical methods, such as Latent Semantic Analysis (LSA), which can provide high accuracy on similarity between the essay and the graded or standard essays, but can offer only very limited feedback. In the present work, we propose a novel method to provide both grades and meaningful feedback for student summaries by Ontology-based Information Extraction (OBIE). We use ontological concepts and relationships to create extraction rules to identify correct statements. Based on ontology constraints (e.g., disjointness between concepts), we define patterns that are logically inconsistent with the ontology to create rules to extract incorrect statements. Experiments show that the grades given to 18 student summaries on Ecosystems by OBIE are correlated with human grades. OBIE also provides meaningful feedback on the errors those students made in their summaries.
【Keywords】: automatic grading; information extraction; ontology
【Paper Link】 【Pages】:1727-1731
【Authors】: Qi Li ; Haibo Li ; Heng Ji ; Wen Wang ; Jing Zheng ; Fei Huang
【Abstract】: Traditional isolated monolingual name taggers tend to yield inconsistent results across two languages. In this paper, we propose two novel approaches to jointly and consistently extract names from parallel corpora. The first approach uses standard linear-chain Conditional Random Fields (CRFs) as the learning framework, incorporating cross-lingual features propagated between two languages. The second approach is based on a joint CRFs model to jointly decode sentence pairs, incorporating bilingual factors based on word alignment. Experiments on Chinese-English parallel corpora demonstrated that the proposed methods significantly outperformed monolingual name taggers, were robust to automatic alignment noise and achieved state-of-the-art performance. With only 20% of the training data, our proposed methods can already achieve better performance compared to the baseline learned from the whole training set.
【Keywords】: bilingual; joint crfs; name tagging
【Paper Link】 【Pages】:1732-1736
【Authors】: Alvin Cheung ; Armando Solar-Lezama ; Samuel Madden
【Abstract】: This paper presents a new approach to select events of interest to users in a social media setting where events are generated from mobile devices. We argue that the problem is best solved by inductive learning, where the goal is to first generalize from the users' expressed "likes" and "dislikes" of specific events, then to produce a program that can be used to collect only data of interest. The key contribution of this paper is a new algorithm that combines machine learning techniques with program synthesis technology to learn users' preferences. We show that when compared with the more standard approaches, our new algorithm provides up to order-of-magnitude reductions in model training time, and significantly higher prediction accuracies for our target application.
【Keywords】: program synthesis; recommender systems; social networking applications
【Paper Link】 【Pages】:1737-1741
【Authors】: Amr Ahmed ; Mohamed Aly ; Abhimanyu Das ; Alexander J. Smola ; Tasos Anastasakos
【Abstract】: A typical behavioral targeting system optimizing purchase activities, called conversions, faces two main challenges: the web-scale amounts of user histories to process on a daily basis, and the relative sparsity of conversions. In this paper, we try to address these challenges through feature selection. We formulate a multi-task (or group) feature-selection problem among a set of related tasks (sharing a common set of features), namely advertising campaigns. We apply a group-sparse penalty consisting of a combination of an l1 and l2 penalty and an associated fast optimization algorithm for distributed parameter estimation. Our algorithm relies on a variant of the well-known Fast Iterative Shrinkage-Thresholding Algorithm (FISTA), a closed-form solution for mixed norm programming and a distributed subgradient oracle. To efficiently handle web-scale user histories, we present a distributed inference algorithm for the problem that scales to billions of instances and millions of attributes. We show the superiority of our algorithm in terms of both sparsity and ROC performance over baseline feature selection methods (both single-task l1-regularization and multi-task mutual-information gain).
【Keywords】: behavioral targeting; feature selection; large-scale learning; sparsity
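The abstract above mentions a closed-form solution for mixed-norm programming inside a FISTA-style solver. The sketch below shows the standard closed-form proximal step for a group (l2,1) penalty, known as group soft-thresholding, which is the usual building block in such solvers. Whether this matches the authors' exact penalty is an assumption; the layout of one row per feature with one column per campaign is also illustrative.

```python
import math


def group_soft_threshold(rows, lam):
    """Proximal operator of lam * sum of row-wise l2 norms.

    Each row (one feature's weights across all tasks) is shrunk toward zero
    by the factor max(0, 1 - lam / ||row||_2); rows whose norm is at most
    lam are zeroed out, which removes that feature from every task at once.
    """
    result = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        result.append([scale * x for x in row])
    return result


# Feature 0 is strong across both tasks and survives; feature 1 is weak
# and is dropped for all tasks together.
shrunk = group_soft_threshold([[3.0, 4.0], [0.3, 0.4]], lam=1.0)
```

Zeroing whole rows is what makes the selection multi-task: a feature is kept or discarded jointly for all campaigns.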
【Paper Link】 【Pages】:1742-1746
【Authors】: Takuya Makino ; Hiroya Takamura ; Manabu Okumura
【Abstract】: We propose a new model for the guided text summarization task. In this task, a generated summary is required to cover all the aspects that are predefined for the topic of the given document cluster; for example, aspects for the topic "Accidents and Natural Disasters" include WHAT, WHEN, WHERE, WHY, WHO AFFECTED, DAMAGES and COUNTERMEASURES. As the scorer for an aspect, we use a maximum entropy classifier that predicts whether each sentence reflects the aspect or not. We formalize the coverage of the aspects as a max-min problem, which enables a summary to cover aspects in a well-balanced manner. In the max-min problem, the minimum of the aspect scores is maximized so that the summary covers all the aspects as much as possible. Furthermore, we integrate the model based on the max-min problem with the maximum coverage summarization model, which generates a summary containing as many conceptual units as possible. Through experiments on benchmark datasets for guided summarization, we show that our model outperforms other approaches in terms of ROUGE-2.
【Keywords】: guided summarization; multi-document summarization; optimization problem
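The max-min idea in the abstract above can be illustrated with a small greedy heuristic. This is a stand-in for illustration only: the abstract does not say how the max-min problem is solved, and a greedy pass is not guaranteed to find the optimum. The sketch also assumes that a summary's score for an aspect is the best score among its sentences, and that per-sentence aspect scores (from some classifier) are already given.

```python
def greedy_max_min_summary(sentences, budget):
    """Pick sentence indices that raise the minimum aspect score.

    sentences: list of (length, {aspect: score}) pairs.
    budget: maximum total length of the summary.
    """
    aspects = sorted({a for _, scores in sentences for a in scores})
    if not aspects:
        return []
    chosen, used = [], 0
    best = {a: 0.0 for a in aspects}

    def objective(coverage):
        # Maximize the weakest aspect first; use the sum to break ties.
        return (min(coverage.values()), sum(coverage.values()))

    while True:
        candidate = None
        for i, (length, scores) in enumerate(sentences):
            if i in chosen or used + length > budget:
                continue
            coverage = {a: max(best[a], scores.get(a, 0.0)) for a in aspects}
            key = objective(coverage)
            if key > objective(best) and (candidate is None or key > candidate[0]):
                candidate = (key, i, coverage)
        if candidate is None:
            return chosen
        _, index, best = candidate
        chosen.append(index)
        used += sentences[index][0]


sentences = [
    (10, {"WHAT": 0.9, "WHEN": 0.1}),
    (10, {"WHAT": 0.2, "WHEN": 0.8}),
    (10, {"WHAT": 0.5, "WHEN": 0.5}),
]
picked = greedy_max_min_summary(sentences, budget=10)
```

With room for only one sentence, the balanced third sentence is preferred over the two sentences that each score highly on a single aspect.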
【Paper Link】 【Pages】:1747-1751
【Authors】: Joel Barajas ; Ram Akella ; Marius Holtan ; Jaimie Kwon ; Aaron Flores ; Victor Andrei
【Abstract】: In this paper, we develop a time series approach, based on Dynamic Linear Models (DLM), to estimate the impact of ad impressions on the daily number of commercial actions when no user tracking is possible. The proposed method uses aggregate data, and hence it is simple to implement without expensive infrastructure. Specifically, we model the impact of the daily number of ad impressions on the daily number of commercial actions. We incorporate persistence of campaign effects on actions assuming a decay factor. We relax the assumption of a linear impact of ads on actions using the log-transformation. We also account for outliers with long-tailed distributions fitted and estimated automatically without a pre-defined threshold. This is applied to observational data post-campaign and does not require an experimental set-up. We apply the method to data from one commercial ad network on 2,885 campaigns for 1,251 products during six months, to calibrate and perform model selection. We set up a randomized experiment for two campaigns where user tracking is feasible. We find that the output of the proposed method is consistent with the results of A/B testing with similar confidence intervals.
【Keywords】: attribution; dlm; marketing; online display advertising
【Paper Link】 【Pages】:1752-1756
【Authors】: Yulai Xie ; Dan Feng ; Zhipeng Tan ; Lei Chen ; Kiran-Kumar Muniswamy-Reddy ; Yan Li ; Darrell D. E. Long
【Abstract】: Efficient provenance storage is an essential step towards the adoption of provenance. In this paper, we analyze the provenance collected from multiple workloads with a view towards efficient storage. Based on our analysis, we characterize the properties of provenance with respect to long term storage. We then propose a hybrid scheme that takes advantage of the graph structure of provenance data and the inherent duplication in provenance data. Our evaluation indicates that our hybrid scheme, a combination of web graph compression (adapted for provenance) and dictionary encoding, provides the best tradeoff in terms of compression ratio, compression time and query performance when compared to other compression schemes.
【Keywords】: compression; provenance graphs; storage
【Paper Link】 【Pages】:1769-1773
【Authors】: Fiana Raiber ; Oren Kurland ; Moshe Tennenholtz
【Abstract】: In adversarial and noisy search settings such as the Web, the document-query surface level similarity can be a highly misleading relevance signal. Thus, devising content-based relevance estimation (ranking) approaches becomes highly challenging. We address this challenge using two methods that utilize inter-document similarities in an initially retrieved list. The first removes documents from the list that exhibit high query similarity, but for which there is insufficient additional support for relevance that is based on inter-document similarities. The method is based on a probabilistic model that decouples document-query similarities from relevance estimation. The second method re-ranks the list by "rewarding" documents that exhibit high similarity both to the query and to other documents in the list. Both methods incorporate, in addition, at the model level, query-independent document quality estimates. Extensive empirical evaluation demonstrates the merits of our methods.
【Keywords】: inter-document similarities; web search
【Paper Link】 【Pages】:1774-1778
【Authors】: Jin Huang ; Feiping Nie ; Heng Huang ; Yi-Cheng Tu
【Abstract】: Along with the increasing popularity of social web sites, users rely increasingly on trustworthiness information for many online activities. However, such social network data often suffers from severe sparsity and cannot provide users with enough information. Therefore, trust prediction has emerged as an important topic in social network research. Traditional approaches explore the topology of the trust graph. Previous research in sociology and our life experience suggest that people who are in the same social circle often exhibit similar behavior and tastes. Such ancillary information is often accessible and could therefore help trust prediction. In this paper, we address the link prediction problem by aggregating heterogeneous social networks and propose a novel joint manifold factorization (JMF) method. Our new joint learning model explores the user group level similarity between correlated graphs and simultaneously learns the individual graph structure, so that the shared structures and patterns from multiple social networks can be used to enhance the prediction tasks. As a result, we not only improve the trust prediction in the target graph, but also facilitate other information retrieval tasks in the auxiliary graphs. To optimize the objective function, we break it down into several manageable sub-problems, and then establish theoretical convergence with the aid of an auxiliary function. Extensive experiments were conducted on real-world data sets and all empirical results demonstrated the effectiveness of our method.
【Keywords】: nonnegative matrix factorization; social network; transfer learning; trust prediction
【Paper Link】 【Pages】:1779-1783
【Authors】: Katja Hofmann ; Shimon Whiteson ; Maarten de Rijke
【Abstract】: Interleaved comparison methods, which compare rankers using click data, are a promising alternative to traditional information retrieval evaluation methods that require expensive explicit judgments. A major limitation of these methods is that they assume access to live data, meaning that new data must be collected for every pair of rankers compared. We investigate the use of previously collected click data (i.e., historical data) for interleaved comparisons. We start by analyzing to what degree existing interleaved comparison methods can be applied and find that a recent probabilistic method allows such data reuse, even though it is biased when applied to historical data. We then propose an interleaved comparison method that is based on the probabilistic approach but uses importance sampling to compensate for bias. We experimentally confirm that probabilistic methods make the use of historical data for interleaved comparisons possible and effective.
【Keywords】: a/b testing; evaluation; implicit feedback; information retrieval; interleaved comparisons; reusability
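The abstract above uses importance sampling to compensate for bias when reusing historical click data. The generic estimator behind that idea is sketched below: each logged outcome is reweighted by the ratio of its probability under the target setting to its probability under the logging setting. This is the textbook estimator, not the paper's interleaving-specific method, and the names `target_prob` and `source_prob` are hypothetical.

```python
def importance_weighted_mean(samples, target_prob, source_prob):
    """Estimate the expected outcome under the target distribution.

    samples: list of (item, outcome) pairs drawn from the source distribution.
    target_prob, source_prob: functions returning the probability of an item
    under the target and source distributions; source_prob must be non-zero
    for every logged item.
    """
    total = 0.0
    for item, outcome in samples:
        total += outcome * target_prob(item) / source_prob(item)
    return total / len(samples)


# Historical data showed lists "A" and "B" equally often; the comparison of
# interest would show "A" 80% of the time.
history = [("A", 1.0), ("B", 0.0)]
estimate = importance_weighted_mean(
    history,
    target_prob=lambda item: {"A": 0.8, "B": 0.2}[item],
    source_prob=lambda item: 0.5,
)
```

The reweighting corrects for the mismatch between how the historical data was collected and the comparison being evaluated, at the cost of higher variance when the two distributions differ strongly.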
【Paper Link】 【Pages】:1784-1788
【Authors】: Zijia Lin ; Guiguang Ding ; Mingqing Hu ; Jianmin Wang ; Jiaguang Sun
【Abstract】: In this paper, we propose a novel image auto-annotation model using tag-related random search over range-constrained visual neighbors of the to-be-annotated image. The proposed model, termed TagSearcher, is motivated by two observations: the annotation performance of many previous visual-neighbor-based models is generally sensitive to the number of visual neighbors used, and the probability that a visual neighbor is selected should be tag-dependent, meaning that each candidate tag can have its own trustworthy subset of visual neighbors for score prediction. TagSearcher therefore uses a constrained range rather than an identical, fixed number of visual neighbors for auto-annotation. By performing a novel tag-related random search process over the graphical model made up of range-constrained visual neighbors, TagSearcher can find the trustworthy subset for each candidate tag, and further utilize both visual similarities and tag correlations for score prediction. With the range constraint for visual neighbors and the tag-related random search process, TagSearcher can not only achieve satisfactory annotation performance, but also reduce the performance sensitivity. Experiments on the Corel5k benchmark demonstrate its rationality and effectiveness.
【Keywords】: image annotation; random search; tagsearcher
【Paper Link】 【Pages】:1789-1793
【Authors】: Jing Wang ; Clement T. Yu ; Philip S. Yu ; Bing Liu ; Weiyi Meng
【Abstract】: An important issue that has been neglected so far is the identification of diversionary comments. Diversionary comments under political blog posts are defined as comments that deliberately twist the bloggers' intention and divert the topic to another one. The purpose is to distract readers from the original topic and draw attention to a new topic. Given that political blogs have significant impact on society, we believe it is imperative to identify such comments. We categorize diversionary comments into 5 types, and propose an effective technique to rank comments in descending order of being diversionary. To the best of our knowledge, the problem of detecting diversionary comments has not been studied so far. Our evaluation on 2,109 comments under 20 different blog posts from Digg.com shows that the proposed method achieves a high mean average precision (MAP) of 92.6%. Sensitivity analysis indicates that the effectiveness of the method is stable under different parameter settings.
【Keywords】: coreference resolution; diversionary comments; extraction from wikipedia; lda; spam; topic model
【Paper Link】 【Pages】:1794-1798
【Authors】: Anqi Cui ; Min Zhang ; Yiqun Liu ; Shaoping Ma ; Kuo Zhang
【Abstract】: In this paper, we utilize tags in Twitter (the hashtags) as an indicator of events. We first study the properties of hashtags for event detection. Based on several observations, we propose three attributes of hashtags: (1) instability for temporal analysis, (2) Twitter meme possibility to distinguish social events from virtual topics or memes, and (3) authorship entropy for mining the most contributing authors. Based on these attributes, breaking events are discovered with hashtags, which cover a wide range of social events among different languages in the real world.
【Keywords】: burst detection; hashtag; social media; twitter
【Paper Link】 【Pages】:1799-1803
【Authors】: Yuanhua Lv ; ChengXiang Zhai
【Abstract】: The query likelihood retrieval function has proven to be empirically effective for many retrieval tasks. From a theoretical perspective, however, the justification of the standard query likelihood retrieval function requires an unrealistic assumption that ignores the generation of a "negative query" from a document. This suggests that it is a potentially non-optimal retrieval function. In this paper, we attempt to improve the query likelihood function by bringing back the negative query generation. We propose an effective approach to estimate the probabilities of negative query generation based on the principle of maximum entropy, and derive a more complete query likelihood retrieval function that also contains the negative query generation component. The proposed approach not only bridges the theoretical gap in the existing query likelihood retrieval function, but also improves retrieval effectiveness significantly with no additional computational cost.
【Keywords】: language model; negative query generation; principle of maximum entropy; probability ranking principle; query likelihood
【Paper Link】 【Pages】:1804-1808
【Authors】: Chao Liu ; Yi-Min Wang
【Abstract】: Semantic analysis tries to solve problems arising from polysemy and synonymy that are abundant in natural languages. Recently, Gabrilovich and Markovitch proposed the Explicit Semantic Analysis (ESA) technique, which complements the well-known Latent Semantic Analysis (LSA) technique. In this paper, we show that the two techniques are not as distinct as their names suggest; instead, we find that ESA is equivalent to an LSA variant, and this equivalence generalizes to all kernel methods using kernels arising from the canonical dot product. Effectively, this result guarantees that ESA would not outperform the peak efficacy of LSA for any application using the above kernel methods. In short, this paper for the first time establishes the connections between ESA and LSA, quantifies their relative efficacy, and generalizes the result to a large category of kernel methods.
【Keywords】: explicit semantic analysis; kernel methods; latent semantic analysis
【Paper Link】 【Pages】:1809-1813
【Authors】: Wenbin Cai ; Ya Zhang
【Abstract】: Active learning for ranking, which is to selectively label the most informative examples, has been widely studied in recent years. In this paper, we propose a general active learning for ranking strategy called Variance Maximization (VM). The algorithm relies on noise injection to perturb the original unlabeled examples and generate the rank distribution of each example. Using a DCG-like gain function to measure each ranked list sampled from the rank distribution, Variance Maximization selects the unlabeled example with the largest variance in the gain. The VM strategy is applied at both the query level and the document level, and a two-stage active learning algorithm is further derived. Experimental results on both the LETOR 4.0 dataset and a real-world Web search ranking dataset have demonstrated the effectiveness of the proposed active learning approach.
【Keywords】: active learning; learning to rank; noise injection; variance maximization
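The Variance Maximization strategy in the abstract above can be sketched at the document level as follows. The abstract only names the ingredients (noise injection, a rank distribution, a DCG-like gain), so the details here are assumptions: Gaussian noise is added to the current ranker's scores, the gain of a document at rank r is taken to be 1 / log2(1 + r), and the document whose gain varies most across perturbations is selected for labeling. All names and parameters are illustrative.

```python
import math
import random
import statistics


def select_by_variance(scores, noise_std=0.1, n_samples=200, seed=0):
    """Return (selected_doc, variances) for one query's unlabeled documents.

    scores: {doc_id: ranker score}. Scores are perturbed n_samples times,
    the documents are re-ranked each time, and the variance of each
    document's position-based gain is computed.
    """
    rng = random.Random(seed)
    gains = {doc: [] for doc in scores}
    for _ in range(n_samples):
        noisy = {doc: s + rng.gauss(0.0, noise_std) for doc, s in scores.items()}
        ranking = sorted(noisy, key=noisy.get, reverse=True)
        for rank, doc in enumerate(ranking, start=1):
            gains[doc].append(1.0 / math.log2(1 + rank))
    variances = {doc: statistics.pvariance(g) for doc, g in gains.items()}
    return max(variances, key=variances.get), variances


# d1 and d4 are far from the others, so their ranks never change; d2 and d3
# are nearly tied, so small perturbations swap them.
selected, variances = select_by_variance(
    {"d1": 5.0, "d2": 1.0, "d3": 0.98, "d4": -5.0}
)
```

Documents whose position is stable under perturbation carry little information for the ranker, which is why the nearly tied pair is the one chosen for labeling.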
【Paper Link】 【Pages】:1814-1818
【Authors】: Xiaofei Zhu ; Jiafeng Guo ; Xueqi Cheng ; Yanyan Lan
【Abstract】: Query recommendation plays a critical role in helping users search. Most existing approaches to query recommendation aim to recommend relevant queries. However, the ultimate goal of query recommendation is to help users reformulate queries so that they can accomplish their search task successfully and quickly. Considering only relevance in query recommendation does not directly serve this goal. In this paper, we argue that it is more important to directly recommend queries with high utility, i.e., queries that can better satisfy users' information needs. For this purpose, we propose a novel generative model, referred to as the Query Utility Model (QUM), to capture query utility by simultaneously modeling users' reformulation and click behaviors. The experimental results on a publicly released query log show that our approach is more effective in helping users find relevant search results and thus in satisfying their information needs.
【Keywords】: generative model; query logs; query recommendation; utility
【Paper Link】 【Pages】:1819-1823
【Authors】: Po Hu ; Minlie Huang ; Peng Xu ; Weichang Li ; Adam K. Usadi ; Xiaoyan Zhu
【Abstract】: Patents are critical for a company to protect its core technologies. Effective patent mining in massive patent databases can provide companies with valuable insights to develop strategies for IP management and marketing. In this paper, we study a novel patent mining problem of automatically discovering core patents (i.e., patents with high novelty and influence in a domain). We address the unique patent vocabulary usage problem, which is not considered in traditional word-based statistical methods, and propose a topic-based temporal mining approach to quantify a patent's novelty and influence. Comprehensive experimental results on real-world patent portfolios show the effectiveness of our method.
【Keywords】: core patent mining; patent influence; patent novelty; textual temporal analysis
【Paper Link】 【Pages】:1824-1828
【Authors】: Yilin Shen ; Thang N. Dinh ; Huiyuan Zhang ; My T. Thai
【Abstract】: Online social networks have become an imperative channel for extremely fast information propagation and influence. Thus, the problem of finding a minimum number of seed users who can eventually influence as many users in the network as possible has recently become one of the central research topics. Unfortunately, most related work has focused only on network topologies and largely ignored many other important factors such as the users' engagement and the negative or positive impacts between users. Moreover, the behavior of information propagation across multiple networks simultaneously remains unexplored and urgently needs study. Our work is the first attempt to tackle the above problem in multiple networks while considering these missing factors. In order to capture the users' engagement, we propose targeting the set of interest-matching users whose interests are similar to what we try to propagate. Then, we develop our Iterative Semi-Supervising Learning based approach to identify the minimum seed users. We validate the effectiveness of our solution using real-world Twitter-Foursquare networks and multiple academic collaboration networks.
【Keywords】: information propagation; interest prediction; iterative semi-supervised learning; multiple online social networks
【Paper Link】 【Pages】:1829-1833
【Authors】: Theodoros Lappas ; Michail Vlachos
【Abstract】: Blog posts, news articles and other webpages are present on the web in multiple languages. Standard search engines evaluate the relevance of the candidate documents to the given query. However, when considering documents with overlapping content, many of them written in a foreign language other than the user's own native tongue, it is beneficial to promote documents that are easy enough for the user to read. Here, we show how to rank a collection of foreign documents based on both: a) relevance to the query, and b) the comprehension difficulty of the document. We design effective ranking operators that evaluate the difficulty of a foreign document with respect to the user's native language. We show that existing search engines can easily augment their scoring function by incorporating the proposed comprehensibility metrics. Finally, we provide extensive experimental evidence that the comprehensibility-aware ranking model significantly improves the standard relevance-based ranking paradigm.
【Keywords】: document comprehensibility; multilingual document search
【Paper Link】 【Pages】:1834-1838
【Authors】: Jaeho Choi ; W. Bruce Croft ; Jinyoung Kim
【Abstract】: Microblog services typically contain very short documents (e.g., tweets) containing comments about the latest news and events. Many of these documents are not informative or have very little content due to their personal and ephemeral nature. Providing effective retrieval in a microblog service will require addressing the challenge of distinguishing the high-quality, informative documents from the others. Recent work has focused on finding features that indicate the quality of microblog documents, but the impact of these quality features on retrieval is not clear. In this paper, we suggest a low-cost quality model using surrogate judgments based on user behavior (i.e., retweets) that can be collected automatically. We analyze the relationship between document informativeness and relevance judgments for microblog retrieval. Then we demonstrate that our behavior-based quality metric has a high correlation with manual judgments. Also, we perform experiments to study the impact of the quality model on microblog retrieval. The results based on the TREC Microblog track show that the proposed quality model, combined with a variety of retrieval models, can improve retrieval performance and is competitive with a model trained using manual relevance judgments.
【Keywords】: microblogs; quality model; quality-biased ranking
【Paper Link】 【Pages】:1839-1843
【Authors】: Xin Xin ; Irwin King ; Ritesh Agrawal ; Michael R. Lyu ; Heyan Huang
【Abstract】: Traditionally, click models predict the click-through rate (CTR) of an advertisement (ad) independently of other ads. Recent research, however, indicates that the CTR of an ad depends not only on the quality of the ad itself but also on that of the neighboring ads. Using historical click-through data of a commercially available ad server, we identify two types (competing and collaborating) of influences among sponsored ads and further propose a novel click model, Full Relation Model (FRM), which explicitly models dependencies between ads. On test data, FRM shows significant improvement in CTR prediction as compared to earlier click models.
【Keywords】: click models; collaborating and competing influence; sponsored search
【Paper Link】 【Pages】:1844-1848
【Authors】: Wei Zheng ; Hui Fang ; Conglei Yao
【Abstract】: The goal of result diversification is to maximize the coverage of query subtopics while minimizing the redundancy in the search results. Intuitively, it is more desirable for a diversification system to cover independent subtopics since it would retrieve sets of non-overlapped relevant documents, which leads to less redundancy in the search results. Unfortunately, existing diversification methods assume that query subtopics are independent and ignore their relations in the diversification process. To overcome this limitation, we propose to exploit concept hierarchies to extract query subtopics and infer their relations. We then apply axiomatic approaches to derive a structural diversification method that can leverage the subtopic relations in result diversification. Experimental results over an enterprise collection show that the relations among query subtopics are useful to improve the diversification performance.
【Keywords】: axiomatic approaches; concept hierarchy; enterprise search; structural diversification
【Paper Link】 【Pages】:1849-1853
【Authors】: Liang Kong ; Shan Jiang ; Rui Yan ; Shize Xu ; Yan Zhang
【Abstract】: In many cases, people would like to read the news with great importance on the Internet. However, what users can grasp covers a very small part compared with the huge amount of news which never stops increasing. In this paper, we try to find what users are most likely to be interested in. We notice that media focus plays an essential role in distinguishing news topics and user attention is also an important factor. Therefore, we first propose five strategies which only exploit media focus to decide news influence impact. Then we provide three strategies to combine user attention with media focus. Meanwhile, we also take four types of interaction between user attention and media focus into consideration. To the best of our knowledge, this is the first work to establish different models for computing influence decay of news topics. Experiments show that better influence scores will be achieved by a decay algorithm based on Ebbinghaus forgetting curve and information fusion by considering interactions between user attention and media focus.
【Keywords】: influence decay; media-user interaction; news ranking
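The abstract above names a decay algorithm based on the Ebbinghaus forgetting curve but does not give its formula. The following is a minimal sketch, assuming the commonly used exponential form R = exp(-t / S), where t is elapsed time and S is a memory-strength parameter. The function names, the `strength` default, and the `(day, count)` input format are hypothetical and are not taken from the paper.

```python
import math


def retention(elapsed_days: float, strength: float) -> float:
    """Ebbinghaus-style retention R = exp(-t / S) (assumed form)."""
    if strength <= 0:
        raise ValueError("strength must be positive")
    return math.exp(-elapsed_days / strength)


def decayed_influence(mentions, now, strength=7.0):
    """Sum per-day mention counts, each weighted by its retention at time `now`.

    `mentions` is an iterable of (day, count) pairs; days after `now` are ignored.
    """
    return sum(
        count * retention(now - day, strength)
        for day, count in mentions
        if day <= now
    )
```

Under these assumptions, a topic mentioned 10 times a week ago and 10 times today scores 10·e^(-1) + 10 when `strength` is 7, so older coverage still counts but contributes less than fresh coverage.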
【Paper Link】 【Pages】:1854-1858
【Authors】: Le Wu ; Enhong Chen ; Qi Liu ; Linli Xu ; Tengfei Bao ; Lei Zhang
【Abstract】: Collaborative Filtering (CF) is a popular way to build recommender systems and has been successfully employed in many applications. Generally, two kinds of approaches to CF, the local neighborhood methods and the global matrix factorization models, have been widely studied. Though some previous research aims to combine the complementary advantages of both approaches, the performance is still limited due to the extreme sparsity of the rating data. Therefore, it is necessary to consider more information to better reflect user preference and item content. To that end, in this paper, by leveraging the extra tagging data, we propose a novel unified two-stage recommendation framework, named Neighborhood-aware Probabilistic Matrix Factorization (NHPMF). Specifically, we first use the tagging data to select neighbors of each user and each item, then add unique Gaussian distributions on each user's (item's) latent feature vector in the matrix factorization to ensure similar users (items) will have similar latent features. Since the proposed method can effectively explore the external data source (i.e., tagging data) in a unified probabilistic model, it leads to more accurate recommendations. Extensive experimental results on two real-world datasets demonstrate that our NHPMF model outperforms the state-of-the-art methods.
【Keywords】: collaborative filtering; matrix factorization; neighborhood method
【Paper Link】 【Pages】:1859-1863
【Authors】: Yao Lu ; Wei Zhang ; Ke Zhang ; Xiangyang Xue
【Abstract】: There are a large number of images available on the web; meanwhile, only a subset of web images can be labeled by professionals because manual annotation is time-consuming and labor-intensive. Although we can now use the collaborative image tagging system, e.g., Flickr, to get a lot of tagged images provided by Internet users, these labels may be incorrect or incomplete. Furthermore, semantics richness requires more than one label to describe one image in real applications, and multiple labels usually interact with each other in semantic space. It is of significance to learn semantic context with large-scale weakly-labeled image set in the task of multi-label annotation. In this paper, we develop a novel method to learn semantic context and predict the labels of web images in a semi-supervised framework. To address the scalability issue, a small number of exemplar images are first obtained to cover the whole data cloud; then the label vector of each image is estimated as a local combination of the exemplar label vectors. Visual context, semantic context, and neighborhood consistency in both visual and semantic spaces are sufficiently leveraged in the proposed framework. Finally, the semantic context and the label confidence vectors for exemplar images are both learned in an iterative way. Experimental results on the real-world image dataset demonstrate the effectiveness of our method.
【Keywords】: image annotation; large scale; semantic context; weakly labeled
【Paper Link】 【Pages】:1864-1868
【Authors】: Samuel Huston ; J. Shane Culpepper ; W. Bruce Croft
【Abstract】: Formulating and processing phrases and other term dependencies to improve query effectiveness is an important problem in information retrieval. However, accessing these types of statistics using standard inverted indexes requires unreasonable processing time or incurs a substantial space overhead. Establishing a balance between these competing space and time trade-offs can dramatically improve system performance. In this paper, we present and analyze a new index structure designed to improve query efficiency in term dependency retrieval models, with bounded space requirements. By adapting a class of (ε,δ)-approximation algorithms originally proposed for sketch summarization in networking applications, we show how to accurately estimate various statistics important in term dependency models with low, probabilistically bounded error rates. The space requirements of the sketch index structure are largely independent of this size and the number of phrase term dependencies. Empirically, we show that the sketch index can reduce the space requirements of the vocabulary component of an index of all n-grams consisting of between 1 and 5 words extracted from the Clueweb-Part-B collection to less than 0.2% of the requirements of an equivalent full index. We show that n-gram queries of 5 words can be processed more efficiently than in current alternatives, such as next-word indexes. We show retrieval using the sketch index to be up to 400 times faster than with positional indexes, and 15 times faster than next-word indexes.
【Keywords】: indexing; scalability; sketching; term dependency models
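The abstract above refers to a class of (ε,δ)-approximation sketches without naming one. As an illustration only, the sketch below uses a Count-Min sketch, a well-known member of that class, to estimate n-gram frequencies; whether the paper's index uses exactly this structure is an assumption. The class name and the SHA-256-based row hashing are illustrative choices. With width ⌈e/ε⌉ and depth ⌈ln(1/δ)⌉, the estimate never falls below the true count and exceeds it by more than ε·N (N being the total count added) with probability at most δ.

```python
import hashlib
import math


class CountMinSketch:
    """Count-Min sketch for approximate frequency counts of string keys."""

    def __init__(self, epsilon=0.01, delta=0.01):
        # Standard Count-Min sizing for an (epsilon, delta) guarantee.
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _cells(self, key):
        # One hash per row, derived by prefixing the row index to the key.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        # The minimum over rows is the least-contaminated counter.
        return min(self.table[row][col] for row, col in self._cells(key))
```

The table size depends only on ε and δ, not on the number of distinct n-grams inserted, which is the property that makes this family of structures attractive for large phrase vocabularies.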
【Paper Link】 【Pages】:1869-1873
【Authors】: Francesco Bonchi ; Ophir Frieder ; Franco Maria Nardini ; Fabrizio Silvestri ; Hossein Vahabi
【Abstract】: Collaborative content creation and annotation creates vast repositories of all sorts of media, and user-defined tags play a central role as they are a simple yet powerful tool for organizing, searching and exploring the available resources. We observe that when a user annotates a resource with a set of tags, those tags are introduced one at a time. Therefore, when the fourth tag is introduced, the knowledge represented by the previous three tags, i.e., the context in which the fourth tag is produced, is available and exploitable for generating potential corrections of the current tag. This context, together with the "wisdom of the crowd" represented by the co-occurrences of tags in all the resources of the repository, can be exploited to provide interactive tag spell check and correction. We develop this idea in a framework based on a weighted tag co-occurrence graph and on node relatedness measures defined on weighted neighborhoods. We test our proposal on a dataset coming from YouTube. The results show that our framework is effective as it outperforms two important baselines. We also show that it is efficient, thus enabling its use in modern tagging services.
【Keywords】: tag co-occurrence graph; tag spell checking and correction
【Paper Link】 【Pages】:1874-1878
【Authors】: Dong Nguyen ; Thomas Demeester ; Dolf Trieschnigg ; Djoerd Hiemstra
【Abstract】: Federated search has the potential of improving web search: the user becomes less dependent on a single search provider and parts of the deep web become available through a unified interface, leading to a wider variety in the retrieved search results. However, a publicly available dataset for federated search reflecting an actual web environment has been absent. As a result, it has been difficult to assess whether proposed systems are suitable for the web setting. We introduce a new test collection containing the results from more than a hundred actual search engines, ranging from large general web search engines such as Google and Bing to small domain-specific engines. We discuss the design and analyze the effect of several sampling methods. For a set of test queries, we collected relevance judgements for the top 10 results of each search engine. The dataset is publicly available and is useful for researchers interested in resource selection for web search collections, result merging and size estimation of uncooperative resources.
【Keywords】: dataset; distributed information retrieval; evaluation; federated search; test collection; web search
【Paper Link】 【Pages】:1879-1884
【Authors】: Zhixiang Eddie Xu ; Minmin Chen ; Kilian Q. Weinberger ; Fei Sha
【Abstract】: In text mining, information retrieval, and machine learning, text documents are commonly represented through variants of sparse Bag of Words (sBoW) vectors (e.g. TF-IDF [1]). Although simple and intuitive, sBoW style representations suffer from their inherent over-sparsity and fail to capture word-level synonymy and polysemy. Especially when labeled data is limited (e.g. in document classification), or the text documents are short (e.g. emails or abstracts), many features are rarely observed within the training corpus. This leads to overfitting and reduced generalization accuracy. In this paper we propose Dense Cohort of Terms (dCoT), an unsupervised algorithm to learn improved sBoW document features. dCoT explicitly models absent words by removing and reconstructing random sub-sets of words in the unlabeled corpus. With this approach, dCoT learns to reconstruct frequent words from co-occurring infrequent words and maps the high dimensional sparse sBoW vectors into a low-dimensional dense representation. We show that the feature removal can be marginalized out and that the reconstruction can be solved for in closed-form. We demonstrate empirically, on several benchmark datasets, that dCoT features significantly improve the classification accuracy across several document classification tasks.
【Keywords】: denoising autoencoder; marginalized; stacked; text features
【Paper Link】 【Pages】:1885-1889
【Authors】: Ahmed Hassan Awadallah ; Ryen W. White
【Abstract】: Complex search tasks such as planning a vacation often comprise multiple queries and may span a number of search sessions. When engaged in such tasks, users may require holistic support in determining the required task activities. Unfortunately, current search engines do not offer such support to their users. In this paper, we propose methods to automatically generate task tours comprising a starting task and a set of relevant related tasks, some or all of which may be necessary to satisfy a user's information needs. Applications of the tours include helping users understand the required steps to complete a task, finding URLs related to the active task, and alerting users to activities they may have missed. We demonstrate through experimentation with human judges and large-scale search logs that our tours are of good quality and can benefit a significant fraction of search engine users.
【Keywords】: search task support; task graph; task tours
【Paper Link】 【Pages】:1890-1894
【Authors】: Sreenivas Gollapudi ; Samuel Ieong ; Anitha Kannan
【Abstract】: Recent work in commerce search has shown that understanding the semantics in user queries enables more effective query analysis and retrieval of relevant products. However, due to lack of sufficient domain knowledge, user queries often include terms that cannot be mapped directly to any product attribute. For example, a user looking for designer handbags might start with such a query because she is not familiar with the manufacturers, the price ranges, and/or the material that gives a handbag designer appeal. Current commerce search engines treat terms such as designer as keywords and attempt to match them to contents such as product reviews and product descriptions, often resulting in poor user experience. In this study, we propose to address this problem by reformulating queries involving terms such as designer, which we call modifiers, to queries that specify precise product attributes. We learn to rewrite the modifiers to attribute values by analyzing user behavior and leveraging structured data sources such as the product catalog that serves the queries. We first produce a probabilistic mapping between the modifiers and attribute values based on user behavioral data. These initial associations are then used to retrieve products from the catalog, over which we infer sets of attribute values that best describe the semantics of the modifiers. We evaluate the effectiveness of our approach based on a comprehensive Mechanical Turk study. We find that users agree with the attribute values selected by our approach in about 95% of the cases and they prefer the results surfaced for our reformulated queries to ones for the original queries 87% of the time.
【Keywords】: structured search
【Paper Link】 【Pages】:1895-1899
【Authors】: Xueke Xu ; Songbo Tan ; Yue Liu ; Xueqi Cheng ; Zheng Lin
【Abstract】: In this paper, we aim to jointly extract aspects and aspect-specific sentiment knowledge from online reviews, where the sentiment knowledge refers to the aspect-specific opinion words along with their aspect-aware sentiment polarities. To this end, we propose a Joint Aspect/Sentiment model (JAS). JAS detects aspect-specific opinion words by integrating opinion word lexicon knowledge to explicitly separate opinion words from factual words. More importantly, JAS exploits sentiment prior and aspect-contextual sentence-level co-occurrences of opinion words in reviews to further identify aspect-aware sentiment polarities for the opinion words. We apply the learned aspect-specific sentiment knowledge to practical aspect-level sentiment analysis tasks. Experimental results show the effectiveness of JAS in learning aspect-specific sentiment knowledge and the practical value of this knowledge when applied to aspect-level sentiment classification.
【Keywords】: aspect-level sentiment analysis; aspect-specific sentiment knowledge; joint aspect/sentiment model; online reviews
【Paper Link】 【Pages】:1900-1904
【Authors】: Ke Zhou ; Xin Li ; Hongyuan Zha
【Abstract】: It is well known that tail queries contribute to a substantial fraction of distinct queries submitted to search engines and thus become a major battle field for search engines. Unfortunately, compared with popular queries, it is much more difficult to obtain good search results for tail queries due to the lack of important relevance signals, such as user clicks, phrase matches and so on. In this paper, we propose to utilize the similarities between different queries to overcome the data sparsity problem for tail queries. Specifically, we propose to jointly learn query similarities and the ranking function from data so that the relevance signals of different but related queries can be collaboratively pooled to enhance the ranking of tail queries. We emphasize that the joint optimization is critical so that the learned query similarity function can adapt to the problem of learning ranking functions. Our proposed method is evaluated on two data sets and the results show that our method improves the relevance of tail queries over several baseline alternatives.
【Keywords】: collaborative ranking; gradient boosting; learning to rank; relevance; tail query
【Paper Link】 【Pages】:1905-1909
【Authors】: V. G. Vinod Vydiswaran ; ChengXiang Zhai ; Dan Roth ; Peter Pirolli
【Abstract】: Deciding whether a claim is true or false often requires understanding the evidence supporting and contradicting the claim. However, when learning about a controversial claim, human biases and viewpoints may affect which evidence documents are considered "trustworthy" or credible. It is important to overcome this bias and know both viewpoints to get a balanced perspective. In this paper, we study various factors that affect learning about the truthfulness of controversial claims. We designed a user study to understand the impact of these factors. Specifically, we studied the impact of presenting evidence with contrasting viewpoints and source expertise rating on how users accessed the evidence documents. This would help us optimize how to teach users about controversial topics in the most effective way, and to design better claim verification systems. We find that users do not seek contrasting viewpoints by themselves, but explicitly presenting contrasting evidence helps them get a well-rounded understanding of the topic. Furthermore, explicit knowledge of the source credibility and the context not only affects what users read, but also how credible they perceive the document to be.
【Keywords】: claim verification; information credibility; user study
【Paper Link】 【Pages】:1910-1914
【Authors】: Wenyi Huang ; Saurabh Kataria ; Cornelia Caragea ; Prasenjit Mitra ; C. Lee Giles ; Lior Rokach
【Abstract】: When we write or prepare to write a research paper, we always have appropriate references in mind. However, there are most likely references we have missed and should have read and cited. As such, a good citation recommendation system would not only improve our paper but, overall, the efficiency and quality of literature search. Usually, a citation's context contains explicit words explaining the citation. Using this, we propose a method that "translates" research papers into references. By considering the citations and their contexts from existing papers as parallel data written in two different "languages", we adopt the translation model to create a relationship between these two "vocabularies". Experiments on both the CiteSeer and CiteULike datasets show that our approach outperforms other baseline methods and increases the precision, recall and f-measure by at least 5% to 10%, respectively. In addition, our approach runs much faster in both the training and recommending stages, which demonstrates the effectiveness and the scalability of our work.
【Keywords】: citation recommendation; machine translation
【Paper Link】 【Pages】:1915-1919
【Authors】: Xin Zhang ; Ben He ; Tiejian Luo ; Baobin Li
【Abstract】: By incorporating diverse sources of evidence of relevance, learning to rank has been widely applied to real-time Twitter search, where users are interested in fresh relevant messages. Such approaches usually rely on a set of training queries to learn a general ranking model. We believe that the benefits of learning to rank may therefore not have been fully exploited, as the characteristics and aspects unique to the given target queries are ignored. In this paper, we propose to further improve the retrieval performance of learning to rank for real-time Twitter search, by taking the difference between queries into consideration. In particular, we learn a query-biased ranking model with a semi-supervised transductive learning algorithm so that the query-specific features, e.g. the unique expansion terms, are utilized to capture the characteristics of the target query. This query-biased ranking model is combined with the general ranking model to produce the final ranked list of tweets in response to the given target query. Extensive experiments on the standard TREC Tweets11 collection show that our proposed query-biased learning to rank approach outperforms a strong baseline, namely the conventional application of the state-of-the-art learning to rank algorithms.
【Keywords】: query-biased learning to rank; real-time twitter search; semi-supervised learning
【Paper Link】 【Pages】:1920-1924
【Authors】: Zhao Liu ; Xipeng Qiu ; Ling Cao ; Xuanjing Huang
【Abstract】: Most open-domain question answering systems achieve better performance with large corpora, such as the Web, by taking advantage of information redundancy. However, explicit answers are not always mentioned in the corpus; many answers are implicitly contained and can only be deduced by inference. In this paper, we propose an approach to discover logical knowledge for deep question answering, which automatically extracts knowledge in an unsupervised, domain-independent manner from background texts and reasons out implicit answers for the questions. Firstly, we use semantic role labeling to transform natural language expressions to predicates in first-order logic. Then we use association analysis to uncover the implicit relations among these predicates and build propositions for inference. Since our knowledge is drawn from different sources, we use Markov logic to merge multiple knowledge bases without resolving their inconsistencies. Our experiments show that these propositions can improve the performance of question answering significantly.
【Keywords】: markov logic; question answering; semantic role labeling
【Paper Link】 【Pages】:1925-1929
【Authors】: Zhongang Qi ; Ming Yang ; Zhongfei (Mark) Zhang ; Zhengyou Zhang
【Abstract】: In this paper we study the problem of mining noisy tagging. Most of the existing discriminative classification methods for this problem consider only one tag at a time as the classification target, and completely ignore the rest of the given tags. In this paper we argue that all the given multiple tags can be utilized simultaneously as an additional feature and the information contained in the multi-label space can be taken advantage of to improve the performance of the classification. We first propose a novel distance measure to compute the distance between instances in the multi-label space. Then we propose several novel methods to incorporate the information of the multi-label space into the discriminative classification methods in one-view learning or in two-view learning to solve a general multi-label classification problem and to mitigate the influence of the noise in the classification. We apply the proposed solutions to the problem in a more specific context, noisy image annotation, and evaluate the proposed methods on a standard dataset from the related literature. Experiments show that they are superior to the peer methods in the existing literature on solving the problem of mining noisy tagging.
【Keywords】: image annotation prediction; multi-label space; noisy tagging
【Paper Link】 【Pages】:1930-1934
【Authors】: Karthik Raman ; Krysta Marie Svore ; Ran Gilad-Bachrach ; Christopher J. C. Burges
【Abstract】: Many learning algorithms generate complex models that are difficult for a human to interpret, debug, and extend. In this paper, we address this challenge by proposing a new learning paradigm called correctable learning, where the learning algorithm receives external feedback about which data examples are incorrectly learned. We define a set of metrics which measure the correctability of a learning algorithm. We then propose a simple and efficient correctable learning algorithm which learns local models for different regions of the data space. Given an incorrect example, our method samples data in the neighborhood of that example and learns a new, more correct local model over that region. Experiments over multiple classification and ranking datasets show that our correctable learning algorithm offers significant improvements over the state-of-the-art techniques.
【Keywords】: classification; correctable learning; regression
【Paper Link】 【Pages】:1935-1939
【Authors】: Jaehoon Choi ; Donghyeon Kim ; Seongsoon Kim ; Junkyu Lee ; Sangrak Lim ; Sunwon Lee ; Jaewoo Kang
【Abstract】: Search engines have become an important decision making tool today. Decision making queries are often subjective, such as "a good birthday present for my girlfriend", "best action movies in 2010", to name a few. Unfortunately, such queries may not be answered properly by conventional search systems. In order to address this problem, we introduce Consento, a consensus search engine designed to answer subjective queries. Consento performs segment indexing, as opposed to document indexing, to capture semantics from user opinions more precisely. In particular, we define a new indexing unit, Maximal Coherent Semantic Unit (MCSU). An MCSU represents a segment of a document, which captures a single coherent semantic. We also introduce a new ranking method, called ConsensusRank that counts online comments referring to an entity as a weighted vote. In order to validate the efficacy of the proposed framework, we compare Consento with standard retrieval models and their recent extensions for opinion based entity ranking. Experiments using movie and hotel data show the effectiveness of our framework.
【Keywords】: consensus rank; consensus search; entity search; maximal coherent semantic unit; sentiment analysis
【Paper Link】 【Pages】:1940-1944
【Authors】: Benno Stein ; Tim Gollub ; Dennis Hoppe
【Abstract】: We propose a competence partitioning strategy for Web search result presentation: the unmodified head of a ranked result list is combined with a clustering of documents from the result list tail. We identify two principles to which such a clustering must adhere to improve the user's search experience: (1) Avoid the unwanted effect of query aspect repetition, which is called shadowing here. (2) Avoid extreme clusterings, i.e., neither the number of cluster labels nor the number of documents per cluster should exceed the size of the result list head. We present measures to quantify the shadowing effect, and with Faceted Clustering we introduce an algorithm that optimizes the identified principles. The key idea of Faceted Clustering is a dynamic, user-controlled reorganization of a clustering, similar to a faceted navigation system. We report on evaluations using the AMBIENT corpus and demonstrate the potential of our approach by a comparison with two well-known clustering search engines.
【Keywords】: cluster labeling; search result clustering
【Paper Link】 【Pages】:1945-1949
【Authors】: Rawia Awadallah ; Maya Ramanath ; Gerhard Weikum
【Abstract】: We consider the problem of automatically classifying quotations about political debates into both topic and polarity. These quotations typically appear in news media and online forums. Our approach maps quotations onto one or more topics in a category system of political debates, containing more than a thousand fine-grained topics. To overcome the difficulty that pro/con classification faces due to the brevity of quotations and sparseness of features, we have devised a model of quotation expansion that harnesses antonyms from thesauri like WordNet. We developed a suite of statistical language models, judiciously customized to our settings, and use these to define similarity measures for unsupervised or supervised classifications. Experiments show the effectiveness of our method.
【Keywords】: political opinion mining; web information extraction
【Paper Link】 【Pages】:1950-1954
【Authors】: Teerapong Leelanupab ; Guido Zuccon ; Joemon M. Jose
【Abstract】: In the TREC Web Diversity track, novelty-biased cumulative gain (α-NDCG) is one of the official measures to assess retrieval performance of IR systems. The measure is characterised by a parameter, α, the effect of which has not been thoroughly investigated. We find that common settings of α, i.e. α=0.5, may prevent the measure from behaving as desired when evaluating result diversification. This is because it excessively penalises systems that cover many intents while it rewards those that redundantly cover only few intents. This issue is crucial since it highly influences systems at top ranks. We revisit our previously proposed threshold, suggesting α be set on a query-basis. The intuitiveness of the measure is then studied by examining actual rankings from TREC 09-10 Web track submissions. By varying α according to our query-based threshold, the discriminative power of α-NDCG is not harmed and in fact, our approach improves α-NDCG's robustness. Experimental results show that the threshold for α can turn the measure to be more intuitive than using its common settings.
【Keywords】: diversity; evaluation measure
【Paper Link】 【Pages】:1955-1959
【Authors】: Xitong Liu ; Hui Fang ; Fei Chen ; Min Wang
【Abstract】: Enterprise search is important, and the search quality has a direct impact on the productivity of an enterprise. Many information needs of enterprise search center around entities. Intuitively, information related to the entities mentioned in the query, such as related entities, would be useful to reformulate the query and improve the retrieval performance. However, most existing studies on query expansion are term-centric. In this paper, we propose a novel entity-centric query expansion framework for enterprise search. Specifically, given a query containing entities, we first utilize both unstructured and structured information to find entities that are related to the ones in the query. We then discuss how to adapt existing feedback methods to use the related entities to improve search quality. Experimental results show that the proposed entity-centric query expansion strategy is more effective at improving search performance than state-of-the-art pseudo feedback methods on longer, natural language-like queries with entities.
【Keywords】: combining structured and unstructured data; enterprise search; entity centric; query expansion; retrieval
【Paper Link】 【Pages】:1960-1964
【Authors】: Chang Wan ; Ben Kao ; David W. Cheung
【Abstract】: In social tagging systems, resources such as images and videos are annotated with descriptive words called tags. It has been shown that tag-based resource searching and retrieval is much more effective than content-based retrieval. With the advances in mobile technology, many resources are also geo-tagged with location information. We observe that a traditional tag (word) can carry different semantics at different locations. We study how location information can be used to help distinguish the different semantics of a resource's tags and thus to improve retrieval accuracy. Given a search query, we propose a location-partitioning method that partitions all locations into regions such that the user query carries distinguishing semantics in each region. Based on the identified regions, we utilize location information in estimating the ranking scores of resources for the given query. These ranking scores are learned using the Bayesian Personalized Ranking (BPR) framework. Two algorithms, namely, LTD and LPITF, which apply Tucker Decomposition and Pairwise Interaction Tensor Factorization, respectively for modeling the ranking score tensor are proposed. Through experiments on real datasets, we show that LTD and LPITF outperform other tag-based resource retrieval methods.
【Keywords】: location-sensitive; ranking; resources recommendation
【Paper Link】 【Pages】:1965-1969
【Authors】: Mark Sanderson ; Andrew Turpin ; Ying Zhang ; Falk Scholer
【Abstract】: The relative performance of retrieval systems when evaluated on one part of a test collection may bear little or no similarity to the relative performance measured on a different part of the collection. In this paper we report the results of a detailed study of the impact that different sub-collections have on retrieval effectiveness, analyzing the effect over many collections, and with different approaches to sub-dividing the collections. The effect is shown to be substantial, impacting on comparisons between retrieval runs that are statistically significant. Some possible causes for the effect are investigated, and the implications of this work are examined for test collection design and for the strength of conclusions one can draw from experimental results.
【Keywords】: information retrieval evaluation; search engines; sub-collections
【Paper Link】 【Pages】:1970-1974
【Authors】: Mihai Georgescu ; Dang Duc Pham ; Claudiu S. Firan ; Wolfgang Nejdl ; Julien Gaugaz
【Abstract】: Detecting duplicate entities, usually by examining metadata, has been the focus of much recent work. Several methods try to identify duplicate entities, while focusing either on accuracy or on efficiency and speed - with still no perfect solution. We propose a combined layered approach for duplicate detection with the main advantage of using Crowdsourcing as a training and feedback mechanism. By using Active Learning techniques on human provided examples, we fine tune our algorithm toward better duplicate detection accuracy. We keep the training cost low by gathering training data on demand for borderline cases or for inconclusive assessments. We apply our simple and powerful methods to an online publication search system: First, we perform a coarse duplicate detection relying on publication signatures in real time. Then, a second automatic step compares duplicate candidates and increases accuracy while adjusting based on both feedback from our online users and from Crowdsourcing platforms. Our approach shows an improvement of 14% over the untrained setting and is at only 4% difference to the human assessors in accuracy.
【Keywords】: active learning; crowdsourcing; duplicate detection; machine learning; optimization
【Paper Link】 【Pages】:1975-1979
【Authors】: Xiaozhong Liu ; Jinsong Zhang ; Chun Guo
【Abstract】: The goal of this paper is to use innovative text and graph mining algorithms along with full-text citation analysis and topic modeling to enhance classical bibliometric analysis and publication ranking. By utilizing citation contexts extracted from a large number of full-text publications, each citation or publication is represented by a probability distribution over a set of predefined topics, where each topic is labeled by an author contributed keyword. We then used publication/citation topic distribution to generate a citation graph with vertex prior and edge transitioning probability distributions. The publication importance score for each given topic is calculated by PageRank with edge and vertex prior distributions. Based on 104 topics (labeled with keywords) and their review papers, the cited publications of each review paper are assumed as "important publications" for ranking evaluation. The result shows that full text citation and publication content prior topic distribution along with the PageRank algorithm can significantly enhance bibliometric analysis and scientific publication ranking performance for academic IR system.
【Keywords】: bibliometrics; citation analysis; pagerank; prior knowledge; publication ranking; topic modeling
【Paper Link】 【Pages】:1980-1984
【Authors】: Guang Xiang ; Bin Fan ; Ling Wang ; Jason I. Hong ; Carolyn Penstein Rosé
【Abstract】: In this paper, we propose a novel semi-supervised approach for detecting profanity-related offensive content in Twitter. Our approach exploits linguistic regularities in profane language via statistical topic modeling on a huge Twitter corpus, and detects offensive tweets using these automatically generated features. Our approach performs competitively with a variety of machine learning (ML) algorithms. For instance, our approach achieves a true positive rate (TP) of 75.1% over 4029 testing tweets using Logistic Regression, significantly outperforming the popular keyword matching baseline, which has a TP of 69.7%, while keeping the false positive rate (FP) at the same level as the baseline at about 3.77%. Our approach provides an alternative to large scale hand annotation efforts required by fully supervised learning approaches.
【Keywords】: hadoop; machine learning; topic modeling; twitter
【Paper Link】 【Pages】:1985-1989
【Authors】: Vitor Campos de Oliveira ; Guilherme de Castro Mendes Gomes ; Fabiano Belém ; Wladmir C. Brandão ; Jussara M. Almeida ; Nivio Ziviani ; Marcos André Gonçalves
【Abstract】: We here propose a new method for expanding entity-related queries that automatically filters, weights and ranks candidate expansion terms extracted from Wikipedia articles related to the original query. Our method is based on state-of-the-art tag recommendation methods that exploit heuristic metrics to estimate the descriptive capacity of a given term. Originally proposed for the context of tags, we here apply these recommendation methods to weight and rank terms extracted from multiple fields of Wikipedia articles according to their relevance for the article. We evaluate our method comparing it against three state-of-the-art baselines in three collections. Our results indicate that our method outperforms all baselines in all collections, with relative gains in MAP of up to 14% against the best ones.
【Keywords】: query expansion; tag recommendation; wikipedia
【Paper Link】 【Pages】:1990-1994
【Authors】: Karl Gyllstrom ; Carsten Eickhoff ; Arjen P. de Vries ; Marie-Francine Moens
【Abstract】: The continued development and maturation of advanced HTML features such as Cascading style sheets (CSS), Javascript, and AJAX, as well as their widespread adoption by browsers, has enabled web pages to flourish with sophistication and interactivity. Unfortunately, this presents challenges to the web search community, as a web page's representation in the browser (i.e., what users see) can diverge dramatically from its raw HTML content (i.e., what search engines index and retrieve). For example, interactive pages may contain content in regions that are not visible before a user action, such as focusing a tab, but which are nonetheless still contained within the raw HTML. We study this divergence by comparing raw HTML to its fully rendered form across a number of metrics spanning presentation, geometry, and content, using a large, representative sample of popular web pages. We find that a large divergence currently exists, and we show via a historical analysis that this divergence has grown more pronounced over the last decade. The general finding of our study is that continuing to index the web via simple HTML parsing will diminish the effectiveness of retrieval on the modern web, and that the IR community should work toward more sophisticated web page processing in indexing technology.
【Keywords】: html; indexing; rendering; web
【Paper Link】 【Pages】:1995-1999
【Authors】: Roi Blanco ; Diego Ceccarelli ; Claudio Lucchese ; Raffaele Perego ; Fabrizio Silvestri
【Abstract】: Recommender systems have become ubiquitous in content-based web applications, from news to shopping sites. Nonetheless, an aspect that has been largely overlooked so far in the recommender system literature is that of automatically building explanations for a particular recommendation. This paper focuses on the news domain, and proposes to enhance effectiveness of news recommender systems by adding, to each recommendation, an explanatory statement to help the user to better understand if, and why, the item can be of interest to her. We consider the news recommender system as a black-box, and generate different types of explanations employing pieces of information associated with the news. In particular, we engineer text-based, entity-based, and usage-based explanations, and make use of Markov Logic Networks to rank the explanations on the basis of their effectiveness. The assessment of the model is conducted via a user study on a dataset of news read consecutively by actual users. Experiments show that news recommender systems can greatly benefit from our explanation module as it allows users to discriminate between interesting and not interesting news in the majority of the cases.
【Keywords】: markov logic networks; news recommendation; query log analysis; recommendation snippets
【Paper Link】 【Pages】:2000-2004
【Authors】: Ismail Sengör Altingövde ; Roi Blanco ; Berkant Barla Cambazoglu ; Rifat Ozcan ; Erdem Sarigil ; Özgür Ulusoy
【Abstract】: Despite the continuous efforts to improve the web search quality, a non-negligible fraction of user queries end up with very few or even no matching results in leading web search engines. In this work, we provide a detailed characterization of such queries based on an analysis of a real-life query log. Our experimental setup allows us to characterize the queries with few/no results and compare the mechanisms employed by the major search engines in handling them.
【Keywords】: query difficulty; search result quality; web search engines
【Paper Link】 【Pages】:2005-2009
【Authors】: Konstantin Salomatin ; Tie-Yan Liu ; Yiming Yang
【Abstract】: This paper proposes a new unified optimization framework combining pay-per-click auctions and guaranteed delivery in sponsored search. Advertisers usually have different (and sometimes mixed) marketing goals: brand awareness and direct response. Different mechanisms are good at addressing different goals, e.g., guaranteed delivery was often used to build brand awareness and pay-per-click auctions were widely used for direct marketing. Our new method accommodates both in a unified framework, with the search engine revenue as an optimization objective. In this way, we can target a guaranteed number of ad clicks (or impressions) per campaign for advertisers willing to pay a premium and enable keyword auctions for all others. Specifically, we formulate this joint optimization problem using linear programming and a column generation strategy for efficiency. To select the best column (a ranked list of ads) given a query, we propose a novel dynamic programming algorithm that takes the special structure of the ad allocation and pricing mechanisms into account. We have tested the proposed framework and the algorithms on real ad data obtained from a commercial search engine. The results demonstrate that our proposed approach can outperform several baselines in guaranteeing the number of clicks for the given advertisers, and in increasing the total revenue for the search engine.
【Keywords】: linear programming; online advertising; optimization; sponsored search
【Paper Link】 【Pages】:2010-2014
【Authors】: Sergio Duarte Torres ; Djoerd Hiemstra ; Ingmar Weber ; Pavel Serdyukov
【Abstract】: One of the biggest problems that children experience while searching the web occurs during the query formulation process. Children have been found to struggle to formulate keyword-based queries, given their limited vocabulary and their difficulty in choosing the right keywords. In this work we propose a method that utilizes tags from social media to suggest queries related to children's topics. Concretely, we propose a simple yet effective approach to bias a random walk defined on a bipartite graph of web resources and tags through keywords that are more commonly used to describe resources for children. We evaluate our method using a large query log sample of queries aimed at retrieving information for children. We show that our method outperforms query suggestions of state-of-the-art search engines and state-of-the-art query suggestions based on random walks.
【Keywords】: children; query formulation; social media
【Paper Link】 【Pages】:2015-2019
【Authors】: Azin Ashkan ; Charles L. A. Clarke
【Abstract】: Clickthrough rate provides a fundamental measure of advertising quality, which is widely used in ad selection strategies. However, ads placed in contexts where they are rarely viewed, or where users are unlikely to be interested in commercial results, may receive few clicks regardless of their quality. In this paper, we gain insight into user browsing and click behavior for the purpose of click analysis in sponsored search domain. The list of ads displayed on a page, the user's initial motivation to browse this list, and the persistence of the user are among the contextual factors considered in this paper. We propose a probabilistic model for user's browsing and click behavior using these contextual factors. To evaluate the performance of the model, we compare it with state-of-the-art methods. The experimental results confirm that these contextual factors can better reflect user browsing and click behavior in sponsored search.
【Keywords】: bayesian inference; click model; query log; sponsored search
【Paper Link】 【Pages】:2020-2024
【Authors】: A. Gural Vural ; Berkant Barla Cambazoglu ; Pinar Senkul
【Abstract】: The sentiments and opinions that are expressed in web pages towards objects, entities, and products constitute an important portion of the textual content available in the Web. Despite the vast interest in sentiment analysis and opinion mining, somewhat surprisingly, the discovery of the sentimental or opinionated web content is mostly ignored. This work aims to fill this gap and address the problem of quickly discovering and fetching the sentimental content present in the Web. To this end, we design a sentiment-focused web crawling framework for faster discovery and retrieval of such content. In particular, we propose different sentiment-focused web crawling strategies that prioritize discovered URLs based on their predicted sentiment scores. Through simulations, these strategies are shown to achieve considerable performance improvement over general-purpose web crawling strategies in discovering sentimental content.
【Keywords】: focused web crawling; sentiment analysis
【Paper Link】 【Pages】:2025-2029
【Authors】: Xiao Yu ; Yizhou Sun ; Brandon Norick ; Tiancheng Mao ; Jiawei Han
【Abstract】: With the emergence of web-based social and information applications, entity similarity search in information networks, aiming to find entities with high similarity to a given query entity, has gained wide attention. However, due to the diverse semantic meanings in heterogeneous information networks, which contain multi-typed entities and relationships, similarity measurement can be ambiguous without context. In this paper, we investigate entity similarity search and the resulting ambiguity problems in heterogeneous information networks. We propose to use a meta-path-based ranking model ensemble to represent semantic meanings for similarity queries, and exploit the possibility of using user guidance to understand the user's query. Experiments on real-world datasets show that our framework significantly outperforms competitor methods.
【Keywords】: entity similarity search; heterogeneous information network; user guided
【Paper Link】 【Pages】:2030-2034
【Authors】: Hongxia Jin
【Abstract】: In this paper, we are interested in discovering semantically meaningful communities from a single user's perspective. We define a multi-layer analysis problem to derive a user's activity profile. Such an activity profile would include what activity areas a user is involved with, how important each activity is to the user, and who else is involved with the user on each activity as well as each participant's participation level. We believe a semantically meaningful community (corresponding to an activity area) must also consider the topics of the social messages rather than only the social links. While it is possible to use a hybrid approach based on traditional topic modeling, in this paper we propose a unified user modeling approach based on direct clustering over the social messages, taking into consideration both social connections and topics of social messages. Our clustering algorithm can be performed in a unified way in an unsupervised fashion as well as in a semi-supervised fashion when the user wants to give our algorithm some seeding inputs on his viewpoints. Moreover, when new data arrive, our algorithm can perform incremental updates on the new data without re-clustering the old data. Our experiments on social media datasets available from both within an enterprise and a public social network demonstrate the effectiveness of our approach.
【Keywords】: community discovery; social network; user profiling
【Paper Link】 【Pages】:2035-2039
【Authors】: Ricardo Campos ; Gaël Dias ; Alípio Jorge ; Celia Nunes
【Abstract】: In this paper, we present an approach to identify top relevant dates in Web snippets with respect to a given implicit temporal query. Our approach is two-fold. First, we propose a generic temporal similarity measure called GTE, which evaluates the temporal similarity between a query and a date. Second, we propose a classification model to accurately relate relevant dates to their corresponding query terms and withdraw irrelevant ones. We suggest two different solutions: a threshold-based classification strategy and a supervised classifier based on a combination of multiple similarity measures. We evaluate both strategies over a set of real-world text queries and compare the performance of our Web snippet approach with a query log approach over the same set of queries. Experiments show that determining the most relevant dates of any given implicit temporal query can be improved with GTE combined with the second order similarity measure InfoSimba, the Dice coefficient and the threshold-based strategy compared to (1) first-order similarity measures and (2) the query log based approach.
【Keywords】: implicit temporal queries; query log analysis; temporal information retrieval; temporal query understanding
【Paper Link】 【Pages】:2040-2044
【Authors】: Mark D. Smucker ; Charles L. A. Clarke
【Abstract】: Time-biased gain provides a unifying framework for information retrieval evaluation, generalizing many traditional effectiveness measures while accommodating aspects of user behavior not captured by these measures. By using time as a basis for calibration against actual user data, time-biased gain can reflect aspects of the search process that directly impact user experience, including document length, near-duplicate documents, and summaries. Unlike traditional measures, which must be arbitrarily normalized for averaging purposes, time-biased gain is reported in meaningful units, such as the total number of relevant documents seen by the user. In prior work, we proposed and validated a closed-form equation for estimating time-biased gain, explored its properties, and compared it to standard approaches. In this paper, we use stochastic simulation to numerically approximate time-biased gain. Stochastic simulation provides greater flexibility that will allow us, in future work, to easily accommodate different types of user behavior and increase the realism of the effectiveness measure.
【Keywords】: information retrieval; search evaluation
【Paper Link】 【Pages】:2045-2049
【Authors】: Abhijith Kashyap ; Reza Amini ; Vagelis Hristidis
【Abstract】: Earlier works on personalized Web search focused on the click-through graphs, while recent works leverage social annotations, which are often unavailable. On the other hand, many users are members of the social networks and subscribe to social groups. Intuitively, users in the same group may have similar relevance judgments for queries related to these groups. SonetRank utilizes this observation to personalize the Web search results based on the aggregate relevance feedback of the users in similar groups. SonetRank builds and maintains a rich graph-based model, termed Social Aware Search Graph, consisting of groups, users, queries and results click-through information. SonetRank's personalization scheme learns in a principled way to leverage the following three signals, of decreasing strength: the personal document preferences of the user, of the users of her social groups relevant to the query, and of the other users in the network. SonetRank also uses a novel approach to measure the amount of personalization with respect to a user and a query, based on the query-specific richness of the user's social profile. We evaluate SonetRank with users on Amazon Mechanical Turk and show a significant improvement in ranking compared to state-of-the-art techniques.
【Keywords】: results re-ranking; search personalization; social search
【Paper Link】 【Pages】:2050-2054
【Authors】: Qi Guo ; Dmitry Lagun ; Eugene Agichtein
【Abstract】: Detecting and predicting searcher success is essential for automatically evaluating and improving Web search engine performance. In the past, Web searcher behavior data, such as result clickthrough, dwell time, and query reformulation sequences, have been successfully used for a variety of tasks, including prediction of success in a search session. However, the effectiveness of the previous approaches has been limited, as they tend to ignore how searchers actually view and interact with the visited pages. We show that fine-grained interactions, such as mouse cursor movements and scrolling, provide additional clues for better predicting success of a search session as a whole. To this end, we identify patterns of examination and interaction behavior that correspond to search success, and design a new Fine-grained Session Behavior (FSB) model to capture these patterns. Our experimental results show that FSB is significantly more effective than the state-of-the-art approaches that do not use these additional interaction data.
【Keywords】: mouse cursor analysis; search session; success prediction
【Paper Link】 【Pages】:2055-2059
【Authors】: Sarah K. Tyler ; Yi Zhang
【Abstract】: Search engine users regularly re-issue queries that are the same or similar to ones they have previously issued. In this paper we study this act of query re-issuing, called re-search, focusing on multi session re-searching from an information seeking perspective. By focusing on the series of repeat or similar queries where the user shows a continued interest, new patterns of behavior not previously seen arise. We find that the well-studied re-finding behavior is only a piece of the re-search puzzle, and that even amidst repeated re-findings users exhibit diversification and novelty seeking behaviours for many re-search queries. This suggests diversity and re-finding behaviors should be jointly modelled and captured in evaluation measures, instead of being studied as two separate problems as is seen in many previous approaches.
【Keywords】: query log analysis; re-search; web search
【Paper Link】 【Pages】:2060-2064
【Authors】: Hadi Amiri ; Tat-Seng Chua
【Abstract】: The correspondence between sentiment terminology and the active language used for expressing opinions is a crucial prerequisite for effective sentiment analysis. Mining sentiment terminology includes the detection of new opinion words as well as inferring their polarities. In this paper, we first propose a novel approach based on the interchangeability characteristic of words to detect new opinion words through time. We then show that the current non-time-based polarity inference approaches may assign opposite polarity to the same opinion word at different times. To tackle this issue, we consider the polarity scores computed at different times as polarity evidences (with the possibility of flawed evidences) and combine them to compute a globally correct polarity score for each opinion word. The experiments show that our approach is effective both in terms of the quality of the discovered new opinion words as well as its ability in inferring their polarities through time. Furthermore, we show the application of mining sentiment terminology through time in the sentiment classification (SC) task. The experiments show that mining more recent new opinion words leads to greater improvement in the performance of SC. To the best of our knowledge, this is the first work that investigates "time" as an important factor in mining sentiment terminology.
【Keywords】: opinion word mining; sentiment orientation; temporal opinion lexicon; word polarity
【Paper Link】 【Pages】:2065-2069
【Authors】: Noriaki Kawamae
【Abstract】: This paper presents a topic model that discovers the correlation patterns in a given time-stamped document collection and how these patterns evolve over time. Our proposal, the theme chronicle model (TCM) divides traditional topics into temporal and stable topics to detect the change of each theme over time; previous topic models ignore these differences and characterize trends as merely bursts of topics. TCM introduces a theme topic (stable topic), a trend topic (temporal topic), timestamps, and a latent switch variable in each token to realize these differences. Its topic layers allow TCM to capture not only word co-occurrence patterns in each theme, but also word co-occurrence patterns at any given time in each theme as trends. Experiments on various data sets show that the proposed model is useful as a generative model to discover fine-grained tightly coherent topics, takes advantage of previous models, and then assigns values for new documents.
【Keywords】: bayesian hierarchical model; graphical models; text analysis; topic model; trend analysis
【Paper Link】 【Pages】:2070-2074
【Authors】: Yucheng Low ; Alice X. Zheng
【Abstract】: In this paper, we propose a novel method to efficiently compute the top-K most similar items given a query item, where similarity is defined by the set of items that have the highest vector inner products with the query. The task is related to the classical k-Nearest-Neighbor problem, and is widely applicable in a number of domains such as information retrieval, online advertising and collaborative filtering. Our method assumes an in-memory representation of the dataset and is designed to scale to query lengths of 100,000s of terms. Our algorithm uses a generalized Hölder's inequality to upper bound the inner product with the norms of the constituent vectors. We also propose a novel compression scheme that computes bounds for groups of candidate items, thereby speeding up computation and minimizing memory requirements per query. We conduct extensive experiments on the publicly available Wikipedia dataset, and demonstrate that, with a memory overhead of 21%, our method can provide 1-3 orders of magnitude improvement in query run-time compared to naive methods and state-of-the-art competing methods. Our median top-10 word query time is 25 µs on 7.5 million words and 2.3 million documents.
【Keywords】: inner product; nearest neighbor; top k
【Paper Link】 【Pages】:2075-2079
【Authors】: Hongbing Wang ; Xuan Zhou ; Wujin Chen ; Peisheng Ma
【Abstract】: This paper considers top-k retrieval using Conditional Preference Network (CP-Net). As a model for expressing user preferences on multiple mutually correlated attributes, CP-Net is of great interest for decision support systems. However, little work has addressed how to conduct efficient data retrieval using CP-Nets. This paper presents an approach to efficiently retrieve the most preferred data items based on a user's CP-Net. The proposed approach consists of a top-k algorithm and an indexing scheme. We conducted extensive experiments to compare our approach against a baseline top-k method - sequential scan. The results show that our approach outperforms sequential scan in several circumstances.
【Keywords】: cp-net; database; preference; top-k
【Paper Link】 【Pages】:2080-2084
【Authors】: Daniar Achakeev ; Bernhard Seeger ; Peter Widmayer
【Abstract】: Bulk-loading of R-trees has been an important problem in academia and industry for more than twenty years. Current algorithms create R-trees without any information about the expected query profile. However, query profiles are extremely useful for the design of efficient indexes. In this paper, we address this deficiency and present query-adaptive algorithms for building R-trees optimally designed for a given query profile. Since optimal R-tree loading is NP-hard (even without tuning the structure to a query profile), we provide efficient, easy to implement heuristics. Our sort-based algorithms for query-adaptive loading consist of two steps: First, sorting orders are identified resulting in better R-trees than those obtained from standard space-filling curves. Second, for a given sorting order, we propose a dynamic programming algorithm for generating R-trees in linear runtime. Our experimental results confirm that our algorithms generally create significantly better R-trees than the ones obtained from standard sort-based loading algorithms, even when the query profile is unknown.
【Keywords】: bulk-loading; dynamic-programming; r-tree; z-curve
【Paper Link】 【Pages】:2085-2089
【Authors】: Johannes Wust ; Joos-Hendrik Boese ; Frank Renkes ; Sebastian Blessing ; Jens Krüger ; Hasso Plattner
【Abstract】: The introduction of a 64 bit address space in commodity operating systems and the constant drop in hardware prices made large capacities of main memory in the order of terabytes technically feasible and economically viable. Especially column-oriented in-memory databases are a promising platform to improve data management for enterprise applications. As in-memory databases hold the primary persistence in volatile memory, some form of recovery mechanism is required to prevent potential data loss in case of failures. Two desirable characteristics of any recovery mechanism are (1) that it has a minimal impact on the running system, and (2) that the system recovers quickly and without any data loss after a failure. This paper introduces an efficient logging mechanism for dictionary-compressed column structures that addresses these two characteristics by (1) reducing the overall log size by writing dictionary-compressed values and (2) allowing for parallel writing and reading of log files. We demonstrate the efficiency of our logging approach by comparing the resulting log-file size with traditional logical logging on a workload produced by a productive enterprise system.
【Keywords】: column store; databases; in-memory; logging
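The idea of logging dictionary-compressed values can be illustrated with a minimal sketch. It is not the paper's log format: the record tags `"D"` and `"V"`, the class `DictionaryLog`, and the function `replay` are hypothetical names chosen for illustration, and the log is kept in memory rather than written to disk. Each distinct value is logged once together with its integer ID; every later occurrence is logged as the ID only.

```python
class DictionaryLog:
    """Toy log for one column: new values are logged once, then referenced by ID."""

    def __init__(self):
        self.value_to_id = {}   # dictionary of the column (value -> integer ID)
        self.records = []       # the log itself, kept in memory for this sketch

    def append(self, value):
        vid = self.value_to_id.get(value)
        if vid is None:
            # First occurrence: log the dictionary entry (ID plus full value).
            vid = len(self.value_to_id)
            self.value_to_id[value] = vid
            self.records.append(("D", vid, value))
        # Every insert is logged as a compact value-ID record.
        self.records.append(("V", vid))


def replay(records):
    """Rebuild the column contents from the log after a simulated failure."""
    id_to_value = {}
    column = []
    for rec in records:
        if rec[0] == "D":
            id_to_value[rec[1]] = rec[2]
        else:
            column.append(id_to_value[rec[1]])
    return column


log = DictionaryLog()
for city in ["Berlin", "Potsdam", "Berlin", "Berlin"]:
    log.append(city)
recovered = replay(log.records)
```

In this toy run the full string "Berlin" appears in the log only once, which is the source of the size reduction the abstract refers to.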
【Paper Link】 【Pages】:2090-2093
【Authors】: Lushan Han ; Tim Finin ; Anupam Joshi
【Abstract】: We need better ways to query large linked data collections such as DBpedia. Using the SPARQL query language requires not only mastering its syntax but also understanding the RDF data model, large ontology vocabularies and URIs for denoting entities. Natural language interface systems address the problem, but are still subjects of research. We describe a compromise in which non-experts specify a graphical query "skeleton" and annotate it with freely chosen words, phrases and entity names. The combination reduces ambiguity and allows the generation of an interpretation that can be translated into SPARQL. Key research contributions are the robust methods that combine statistical association and semantic similarity to map user terms to the most appropriate classes and properties in the underlying ontology.
【Keywords】: ontology mapping; question answering; schema-free query
【Paper Link】 【Pages】:2094-2098
【Authors】: Jana Bauckmann ; Ziawasch Abedjan ; Ulf Leser ; Heiko Müller ; Felix Naumann
【Abstract】: Data dependencies are used to improve the quality of a database schema, to optimize queries, and to ensure consistency in a database. Conditional dependencies have been introduced to analyze and improve data quality. A conditional dependency is a dependency with a limited scope defined by conditions over one or more attributes. Only the matching part of the instance must adhere to the dependency. In this paper we focus on conditional inclusion dependencies (CINDs). We generalize the definition of CINDs, distinguishing covering and completeness conditions. We present a new use case for such CINDs showing their value for solving complex data quality tasks. Further, we propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. Our algorithms choose not only the condition values but also the condition attributes automatically. Finally, we show that our approach efficiently provides meaningful and helpful results for our use case.
【Keywords】: association rule mining; cind; link discovery
【Paper Link】 【Pages】:2099-2103
【Authors】: Mahbub Hasan ; Abdullah Mueen ; Vassilis J. Tsotras ; Eamonn J. Keogh
【Abstract】: Queries on the web can easily result in a large number of results. Result Diversification, a process by which the query provides the k most diverse set of matches, enables the user to better understand/explore such large results. Computing the diverse subset from a large set of results needs a massive number of pair-wise distance computations as well as finding the subset that maximizes the total pair-wise distance, which is NP-hard and requires an efficient approximation algorithm. The problem becomes more difficult when querying semi-structured data, since diversity can occur not only in the document content but also (and more importantly) in the document structure; thus one needs to efficiently measure the structural differences between results. The tree edit distance is the standard choice but is too expensive for large result sets. Moreover, the generalized tree edit distance ignores the context of the query and also the content of the documents, resulting in poor diversification. We present a novel algorithm for meaningful diversification that considers both the structural context of the query and the content of the matched results while computing pair-wise distances. Our algorithm is an order of magnitude faster than the tree edit distance with an elegant worst case guarantee. We also present a novel algorithm that finds the top-k diverse subset of matches in time linear in the size of the result set. We experimentally demonstrate the utility of our algorithms as a plugin for standard query processors without introducing large error and latency to the output.
【Keywords】: diversity; semi-structured data; xml
【Paper Link】 【Pages】:2104-2108
【Authors】: Christoph Böhm ; Gerard de Melo ; Felix Naumann ; Gerhard Weikum
【Abstract】: Linked Data has emerged as a powerful way of interconnecting structured data on the Web. However, the cross-linkage between Linked Data sources is not as extensive as one would hope for. In this paper, we formalize the task of automatically creating "sameAs" links across data sources in a globally consistent manner. Our algorithm, presented in a multi-core as well as a distributed version, achieves this link generation by accounting for joint evidence of a match. Experiments confirm that our system scales beyond 100 million entities and delivers highly accurate results despite the vast heterogeneity and daunting scale.
【Keywords】: distributed entity matching; entity matching; linked data; mapreduce
【Paper Link】 【Pages】:2109-2113
【Authors】: Quoc Trung Tran ; Chee-Yong Chan
【Abstract】: Sorting is a fundamental operation in data processing. While the problem of sorting flat data records has been extensively studied, there is very little work on sorting hierarchical data such as XML documents. Existing hierarchy-aware sorting approaches for hierarchical data are based on creating sorted subtrees as initial sorted runs and merging sorted subtrees to create the sorted output using either explicit pointers or absolute node key comparisons for merging subtrees. In this paper, we propose SliceSort, a novel, level-wise sorting technique for hierarchical data that avoids the drawbacks of subtree-based sorting techniques. Our experimental performance evaluation shows that SliceSort outperforms the state-of-the-art approach, HErMeS, by up to 27%.
【Keywords】: hierarchical data; slicesort; sorting
【Paper Link】 【Pages】:2114-2118
【Authors】: Qing Xie ; Jia Zhu ; Mohamed A. Sharaf ; Xiaofang Zhou ; Chaoyi Pang
【Abstract】: Piecewise Linear Representation (PLR) has been a widely used method for approximating data streams in the form of compact line segments. The buffer-based approach to PLR enables a semi-global approximation which relies on the aggregated processing of batches of streamed data so as to adjust and improve the approximation results. However, one challenge in applying the buffer-based approach is allocating the necessary memory resources for stream buffering. This challenge is further complicated in a multi-stream environment where multiple data streams are competing for the available memory resources, especially in resource-constrained systems such as sensors and mobile devices. In this paper, we address precisely those challenges mentioned above and propose efficient buffer management techniques for the PLR of multiple data streams. In particular, we propose a new dynamic approach called Dynamic Buffer Management with Error Monitoring (DBMEM), which leverages the relationship between the buffer demands of each data stream and its exhibited pattern of data values to estimate its sufficient buffer size. This enables DBMEM to provide a global buffer allocation strategy that maximizes the overall PLR approximation quality for multiple data streams, as shown by our experimental results.
【Keywords】: data streams; dynamic buffer allocation; plr
【Paper Link】 【Pages】:2119-2123
【Authors】: Chengkai Li ; Nan Zhang ; Naeemul Hassan ; Sundaresan Rajasekaran ; Gautam Das
【Abstract】: We formulate and investigate the novel problem of finding the skyline k-tuple groups from an n-tuple dataset - i.e., groups of k tuples which are not dominated by any other group of equal size, based on an aggregate-based group dominance relationship. The major technical challenge is to identify effective anti-monotonic properties for pruning the search space of skyline groups. To this end, we show that the anti-monotonic property in the well-known Apriori algorithm does not hold for skyline group pruning. We then identify an order-specific property which applies to SUM, MIN, and MAX and a weak candidate-generation property which applies to MIN and MAX only. Experimental results on both real and synthetic datasets verify that the proposed algorithms achieve orders of magnitude performance gain over a baseline method.
【Keywords】: anti-monotonic properties; group recommendation; skyline queries
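The problem statement above can be made concrete with a brute-force sketch. This is only an exhaustive baseline for illustration, not the pruning algorithms the paper proposes. It assumes that larger attribute values are preferred and that one group dominates another when its aggregate vector is at least as good on every attribute and strictly better on at least one; the function names `aggregate`, `dominates`, and `skyline_groups` are hypothetical.

```python
from itertools import combinations


def aggregate(group, agg):
    """Apply the aggregate (sum, min, or max) attribute by attribute."""
    return tuple(agg(t[i] for t in group) for i in range(len(group[0])))


def dominates(u, v):
    """u dominates v: no worse on every attribute, strictly better on one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def skyline_groups(tuples, k, agg=sum):
    """Enumerate all k-tuple groups and keep those not dominated by any other."""
    groups = list(combinations(tuples, k))
    vectors = [aggregate(g, agg) for g in groups]
    return [g for g, v in zip(groups, vectors)
            if not any(dominates(w, v) for w in vectors)]


data = [(1, 4), (2, 2), (4, 1), (1, 1)]
by_sum = skyline_groups(data, 2, agg=sum)
by_max = skyline_groups(data, 2, agg=max)
```

The enumeration visits all C(n, k) groups and compares them pairwise, which is why effective pruning properties matter for anything beyond toy inputs.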
【Paper Link】 【Pages】:2124-2128
【Authors】: Yajun Yang ; Jeffrey Xu Yu ; Hong Gao ; Jianzhong Li
【Abstract】: Shortest path query is an important problem in graphs and has been well-studied. However, most approaches for shortest path query are based on single-cost (weight) graphs. In this paper, we introduce the definition of multi-cost graph and study a novel query: the optimal path query over multi-cost graphs. We propose a best-first branch and bound search algorithm with two optimizing strategies. Furthermore, we propose a novel index named k-cluster index to make our method more space and time efficient for large graphs. We discuss how to construct and utilize k-cluster index. We confirm the effectiveness and efficiency of our algorithms using real-life datasets in experiments.
【Keywords】: multi-cost graphs; non-linear functions; optimal path
【Paper Link】 【Pages】:2129-2133
【Authors】: Youzhong Ma ; Jia Rao ; Weisong Hu ; Xiaofeng Meng ; Xu Han ; Yu Zhang ; Yunpeng Chai ; Chunqiu Liu
【Abstract】: The Internet of Things (IOT) has been widely applied in many fields. IOT data are typically large in volume, frequently updated, and inherently multi-dimensional; these characteristics pose big challenges for traditional DBMSs. Although traditional DBMSs have rich functionality and can handle multi-attribute access efficiently, they cannot scale well enough to deal with large data volumes and cannot support high insert throughput. Cloud-based database systems have good scalability, but they do not support multi-dimensional access natively. In order to deal with the large volume of IOT data, we propose an update and query efficient index framework (UQE-Index) based on a key-value store that can support high insert throughput and provide efficient multi-dimensional queries simultaneously. We implemented a prototype based on HBase and conducted comprehensive experiments to test our solution's scalability and efficiency.
【Keywords】: cloud; index; internet of things
【Paper Link】 【Pages】:2134-2138
【Authors】: Thanh Hoang Nguyen ; Huong Dieu Nguyen ; Viviane Moreira ; Juliana Freire
【Abstract】: Wikipedia has emerged as an important source of structured information on the Web. But while the success of Wikipedia can be attributed in part to the simplicity of adding and modifying content, this has also created challenges when it comes to using, querying, and integrating the information. Even though authors are encouraged to select appropriate categories and provide infoboxes that follow pre-defined templates, many do not follow the guidelines or follow them loosely. This leads to undesirable effects, such as template duplication, heterogeneity, and schema drift. As a step towards addressing this problem, we propose a new unsupervised approach for clustering Wikipedia infoboxes. Instead of relying on manually assigned categories and template labels, we use the structured information available in infoboxes to group them and infer their entity types. Experiments using over 48,000 infoboxes indicate that our clustering approach is effective and produces high quality clusters.
【Keywords】: clustering; wikipedia infobox
【Paper Link】 【Pages】:2139-2143
【Authors】: Haoyu Tan ; Wuman Luo ; Lionel M. Ni
【Abstract】: During the past decade, various GPS-equipped devices have generated a tremendous amount of data with time and location information, which we refer to as big spatio-temporal data. In this paper, we present the design and implementation of CloST, a scalable big spatio-temporal data storage system to support data analytics using Hadoop. The main objective of CloST is to avoid scanning the whole dataset when a spatio-temporal range is given. To this end, we propose a novel data model which gives special treatment to three core attributes: an object id, a location, and a time. Based on this data model, CloST hierarchically partitions data using all core attributes, which enables efficient parallel processing of spatio-temporal range scans. According to the data characteristics, we devise a compact storage structure which reduces the storage size by an order of magnitude. In addition, we propose scalable bulk loading algorithms capable of incrementally adding new data into the system. We conduct our experiments using a very large GPS log dataset and the results show that CloST has fast data loading speed, desirable scalability in query processing, as well as a high data compression ratio.
【Keywords】: big data; spatio-temporal data; storage system
【Paper Link】 【Pages】:2144-2148
【Authors】: Guoliang Li ; Jing Xu ; Jianhua Feng
【Abstract】: With the ever-increasing number of spatio-textual objects, many applications need to find objects close to a given query point in spatial databases. In this paper, we study the problem of keyword-based k-nearest neighbor search in spatial databases, which, given a query point and a set of keywords, finds the k nearest neighbors of the query point that contain all query keywords. To efficiently answer such queries, we propose a new indexing framework by integrating a spatial component and a textual component, which can efficiently prune the search space in terms of both spatial information and textual descriptions. We develop effective index structures and pruning techniques to improve query performance. Experimental results show that our approach significantly outperforms state-of-the-art methods.
【Keywords】: k-nearest neighbors; space pruning; spatio-textual objects
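The query semantics described above can be stated as a brute-force sketch: keep only objects whose keyword set contains every query keyword, then return the k closest by Euclidean distance. This is a linear scan for illustration, not the indexing framework the paper proposes; the function name `keyword_knn` and the `(name, location, keywords)` object layout are assumptions.

```python
import math


def keyword_knn(objects, query_point, query_keywords, k):
    """Return names of the k nearest objects that contain all query keywords."""
    required = set(query_keywords)
    matches = [
        (math.dist(query_point, location), name)
        for name, location, keywords in objects
        if required <= set(keywords)      # textual filter: all keywords present
    ]
    matches.sort()                        # spatial ranking: nearest first
    return [name for _, name in matches[:k]]


places = [
    ("cafe_a", (1, 1), {"coffee", "wifi"}),
    ("cafe_b", (0, 1), {"coffee"}),
    ("cafe_c", (3, 4), {"coffee", "wifi", "bakery"}),
    ("bar", (0, 0), {"beer", "wifi"}),
]
nearest = keyword_knn(places, (0, 0), ["coffee", "wifi"], 2)
```

Note that `cafe_b` and `bar` are closer to the query point than `cafe_c` but are excluded because each lacks one of the query keywords.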
【Paper Link】 【Pages】:2149-2153
【Authors】: Rong Zhang ; Chaofeng Sha ; Minqi Zhou ; Aoying Zhou
【Abstract】: A fundamental issue for C2C transactions is how to rank the products based on the reviews written by previous customers. In this paper, we present an approach to improve product ranking by tackling the noisy ratings that exist in practical systems. The first problem is the credibility of the customers. We design an iterative algorithm to measure customer credibility. In the algorithm, we use a feedback strategy to increase or decrease the customer credibility. We increase the credibility of a customer if the customer gives a high (low) score to a good (bad) product and decrease it if the customer gives a low (high) score to a good (bad) product. The second problem is the inconsistency between the review comments and scores. To deal with it, we train a classifier on training data that is constructed automatically. The trained classifier is used to predict the scores of the comments. Finally, we calculate the scores of products by considering the customer credibility and the predicted scores. The experimental results show that our proposed approach provides better product ranking than the baseline systems.
【Keywords】: clustering; credibility; e-commerce
【Paper Link】 【Pages】:2154-2158
【Authors】: Yu Sun ; Jin Huang ; Yueguo Chen ; Rui Zhang ; Xiaoyong Du
【Abstract】: Given a set of client locations, a set of facility locations where each facility has a service capacity, and the assumptions that: (i) a client seeks service from its nearest facility; (ii) a facility provides service to clients in the order of their proximity, we study the problem of selecting all possible locations such that setting up a new facility with a given capacity at these locations will maximize the number of served clients. This problem has wide applications in practice, such as setting up new distribution centers for online sales business and building additional base stations for mobile subscribers. We formulate the problem as a location selection query for utility maximization. After applying three pruning rules to a baseline solution, we obtain an efficient algorithm to answer the query. Extensive experiments confirm the efficiency of our proposed algorithm.
【Keywords】: capacity constraints; location selection
【Paper Link】 【Pages】:2159-2163
【Authors】: Abdulhakim Ali Qahtan ; Xiangliang Zhang ; Suojin Wang
【Abstract】: In this paper, we propose a new method to estimate the dynamic density over data streams, named KDE-Track as it is based on the conventional and widely used Kernel Density Estimation (KDE) method. KDE-Track can efficiently estimate the density with linear complexity by using interpolation on a kernel model, which is incrementally updated upon the arrival of streaming data. Both theoretical analysis and experimental validation show that KDE-Track outperforms traditional KDE and a baseline method, Cluster-Kernels, on estimation accuracy of the complex density structures in data streams, computing time, and memory usage. KDE-Track is also shown to capture the dynamic density of synthetic and real-world data in a timely manner. In addition, KDE-Track is used to accurately detect outliers in sensor data and is compared with two existing methods developed for detecting outliers and cleaning sensor data.
【Keywords】: data streams; density estimation; interpolation; outlier detection
【Paper Link】 【Pages】:2164-2168
【Authors】: Dongzhe Ma ; Jianhua Feng ; Guoliang Li
【Abstract】: Most commercial database management systems sort tuples of a relation by their primary keys for the purpose of supporting efficient insertions, deletions, and updates. However, primary keys are usually auto-generated integers, which bear little useful information about user data. Secondary indexes have to be created sometimes to help retrieve tuples by columns other than the primary key. Evidently, a better solution is to sort the data by columns that appear frequently in retrieval conditions. Unfortunately, this method does not work, at least not immediately, when the relation is vertically partitioned, which is a popular technique to reduce I/O overhead, since it is difficult to keep tuples of two partitions in exactly the same order unless the sorting columns are replicated, which again wastes storage space and disk bandwidth unnecessarily. In this paper, we introduce a positional access method that allows a partition to be sorted by another one but incurs little storage overhead and provide details about how to improve its performance.
【Keywords】: positional access method; vertical partitioning
【Paper Link】 【Pages】:2169-2173
【Authors】: Liyue Fan ; Li Xiong
【Abstract】: Sharing real-time aggregate statistics of private data benefits the public by enabling data mining for understanding important phenomena, such as influenza outbreaks and traffic congestion. However, releasing time-series data with a standard differential privacy mechanism has limited utility due to high correlation between data values. We propose FAST, an adaptive system to release real-time aggregate statistics under differential privacy with improved utility. To minimize overall privacy cost, FAST adaptively samples long time-series according to detected data dynamics. To improve the accuracy of data release per time stamp, filtering is used to predict data values at non-sampling points and to estimate true values from noisy observations at sampling points. Our experiments with three real data sets confirm that FAST improves the accuracy of time-series release and has excellent performance even under very small privacy cost.
【Keywords】: differential privacy; estimation; sampling; time series
【Paper Link】 【Pages】:2174-2178
【Authors】: Bahman Bahmani ; Ashish Goel ; Rajendra Shinde
【Abstract】: Distributed frameworks are gaining increasingly widespread use in applications that process large amounts of data. One important example application is large scale similarity search, for which Locality Sensitive Hashing (LSH) has emerged as the method of choice, especially when the data is high-dimensional. To guarantee high search quality, the LSH scheme needs a rather large number of hash tables. This entails a large space requirement, and in the distributed setting, with each query requiring a network call per hash bucket look up, also a big network load. Panigrahy's Entropy LSH scheme significantly reduces the space requirement but does not help with (and in fact worsens) the search network efficiency. In this paper, focusing on the Euclidean space under the ℓ2 norm and building upon Entropy LSH, we propose the distributed Layered LSH scheme, and prove that it exponentially decreases the network cost, while maintaining a good load balance between different machines. Our experiments also verify our theoretical results.
【Keywords】: distributed systems; locality sensitive hashing; mapreduce; similarity search
【Paper Link】 【Pages】:2179-2183
【Authors】: Jianwen Wang ; Xiaohua Hu ; Xinhui Tu ; Tingting He
【Abstract】: This paper proposes a novel topic model, the Author-Conference Topic-Connection (ACTC) Model, for academic network search. The ACTC Model extends the author-conference-topic (ACT) model by adding the subject of the conference and the latent mapping information between subjects and topics. It simultaneously models topical aspects of papers, authors and conferences with two latent topic layers: a subject layer corresponding to the conference topic, and a topic layer corresponding to the word topic. Each author is associated with a multinomial distribution over conference subjects (e.g., KM, DB, IR for CIKM 2012); the conference (CIKM 2012) and the topics are respectively generated from a sampled subject. Then the words are generated from the sampled topics. We conduct experiments on a data set with 8,523 authors, 22,487 papers and 1,243 conferences from the well-known Arnetminer website, and train the model with different numbers of subjects and topics. For a qualitative evaluation, we compare ACTC with three other models, LDA, Author-Topic (AT) and ACT, in academic search services. Experiments show that ACTC can effectively capture the semantic connection between different types of information in an academic network and performs well in expert searching and conference searching.
【Keywords】: academic network search; gibbs sampling; topic model
【Paper Link】 【Pages】:2184-2188
【Authors】: Jung Hyun Kim ; K. Selçuk Candan ; Maria Luisa Sapino
【Abstract】: A graph neighborhood consists of a set of nodes that are nearby or otherwise related to each other. While existing definitions consider the structure (or topology) of the graph, we note that they fail to take into account the information propagation and diffusion characteristics, such as decay and reinforcement, common in many networks. In this paper, we first define the propagation efficiency of nodes and edges. We use this to introduce the novel concept of the zero-erasure (or impact) neighborhood (ZEN) of a given node, n, consisting of the set of nodes that receive information from (or are impacted by) n without any decay. Based on this, we present an impact neighborhood indexing (INI) algorithm that creates data structures to help quickly identify the impact neighborhood of any given node. Experimental results confirm the efficiency and effectiveness of the proposed INI algorithms.
【Keywords】: graph neighborhood; impact propagation; indexing
【Paper Link】 【Pages】:2189-2193
【Authors】: Zhitao Shen ; Muhammad Aamir Cheema ; Xuemin Lin
【Abstract】: A traditional query returns a set of objects that satisfy user-defined criteria at the time the query was issued. The results are based on the values of objects at query time and may be affected by outliers. Intuitively, an object better meets the user's needs if it persistently satisfies the criteria, i.e., it satisfies the criteria for the majority of the time in the past T time units. In this paper, we propose a measure named loyalty that reflects how persistently an object satisfies the criteria. Formally, the loyalty of an object is the total time (in the past T time units) it satisfies the query criteria. We study top-k loyalty queries over sliding windows that continuously report the k objects with the highest loyalties. Each object issues an update when it starts satisfying the criteria or when it stops satisfying the criteria. We show that the lower bound cost of updating the results of a top-k loyalty query is O(log N) for each object update, where N is the number of updates issued in the last T time units. We conduct a detailed complexity analysis and show that our proposed algorithm is optimal. Moreover, effective pruning techniques are proposed to improve the efficiency. We experimentally verify the effectiveness of the proposed approach by comparing it with a classic sweep line algorithm.
【Keywords】: data streams; loyalty queries; temporal data
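The loyalty measure defined above can be computed directly from the start/stop updates each object issues. The following is a minimal recomputation sketch, not the paper's incremental algorithm. It assumes each object's history is given as non-overlapping `(start, end)` intervals, with `end=None` meaning the object still satisfies the criteria; the names `loyalty` and `top_k_loyalty` are hypothetical.

```python
def loyalty(intervals, now, T):
    """Total time within [now - T, now] during which the criteria were satisfied."""
    window_start = now - T
    total = 0
    for start, end in intervals:
        if end is None:            # object is still satisfying the criteria
            end = now
        lo = max(start, window_start)
        hi = min(end, now)
        if hi > lo:                # count only the part inside the window
            total += hi - lo
    return total


def top_k_loyalty(objects, now, T, k):
    """Rank objects by loyalty (ties broken by name) and return the top k."""
    scored = [(loyalty(iv, now, T), name) for name, iv in objects.items()]
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [(name, score) for score, name in scored[:k]]


history = {
    "a": [(0, 4), (8, None)],
    "b": [(5, 9)],
    "c": [(1, 2)],
}
result = top_k_loyalty(history, now=10, T=6, k=2)
```

Recomputing every loyalty from scratch at each time step costs time linear in the number of intervals, which is the overhead an incremental approach with an O(log N) per-update cost avoids.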
【Paper Link】 【Pages】:2194-2198
【Authors】: Sitong Liu ; Guoliang Li ; Jianhua Feng
【Abstract】: Location-based services have attracted significant attention due to modern mobile phones equipped with GPS devices. These services generate large amounts of spatio-textual data which contain both spatial locations and textual descriptions. Since a spatio-textual object may have different representations, possibly because of deviations of GPS or different user descriptions, it calls for efficient methods to integrate spatio-textual data from different sources. In this paper we study a new research problem called spatio-textual similarity join: given two sets of spatio-textual objects, we find the similar object pairs. To the best of our knowledge, we are the first to study this problem. We make the following contributions: (1) We develop a filter-and-refine framework and devise several efficient algorithms. We first generate spatial and textual signatures for the objects and build an inverted index on top of these signatures. Then we generate candidate pairs using the inverted lists of signatures. Finally we refine the candidates and generate the final result. (2) We study how to generate high-quality signatures for spatial information. We develop an MBR-prefix based signature to prune large numbers of dissimilar object pairs. (3) Experimental results on real and synthetic datasets show that our algorithms achieve high performance and scale well.
【Keywords】: mbr-prefix; similarity join; spatio-textual
【Paper Link】 【Pages】:2199-2203
【Authors】: Jiacai Ni ; Guoliang Li ; Jun Zhang ; Lei Li ; Jianhua Feng
【Abstract】: Multi-tenant data management is a major application of Software as a Service (SaaS). Many companies outsource their data to a third party which hosts a multi-tenant database system to provide data management service. The system should have high performance, low space overhead, and excellent scalability. One big challenge is to devise a high-quality database schema. Independent Tables Shared Instances and Shared Tables Shared Instances are two state-of-the-art methods. However, the former has poor scalability, while the latter achieves good scalability at the expense of poor performance and high space overhead. In this paper, we strike a trade-off between the two methods and propose an adaptive database schema design approach to achieve good scalability and high performance with low space overhead. To this end, we identify the important attributes and use them to generate a base table. For other attributes, we construct supplementary tables. We propose a cost-based model to adaptively generate the tables above. Our method has the following advantages. First, our method achieves high scalability. Second, our method can trade off performance against space requirements. Third, our method can be easily applied to existing databases (e.g., MySQL) with minor revisions. Fourth, our method can adapt to any schemas and query workloads. Experimental results show that our method achieves high performance and good scalability with low space overhead and outperforms state-of-the-art methods.
【Keywords】: adaptive schema; multi-tenant database
【Paper Link】 【Pages】:2204-2208
【Authors】: Xiulei Qin ; Wenbo Zhang ; Wei Wang ; Jun Wei ; Xin Zhao ; Tao Huang
【Abstract】: As one database offloading strategy, elastic key-value stores are often introduced to speed up application performance with dynamic scalability. Since the workload varies, efficient data migration with minimal impact on service is critical for elasticity and scalability. However, due to new virtualization technology and real-time, low-latency requirements, data migration within cloud-based key-value stores has to face new challenges: the effects of VM interference, and the need to trade off between the two ingredients of migration cost, namely migration time and performance impact. To address these challenges, in this paper we explore a new approach to optimize the data migration. Specifically, we build two interference-aware models to predict the migration time and performance impact for each migration action using statistical machine learning, and then create a cost model to strike a balance between the two ingredients. Using the load rebalancing scenario as a case study, we have designed a cost-aware migration algorithm that utilizes the cost model to guide the choice of possible migration actions. Finally, we demonstrate the effectiveness of the approach using the Yahoo! Cloud Serving Benchmark (YCSB).
【Keywords】: data migration; key-value store; migration cost; vm interference
【Paper Link】 【Pages】:2209-2213
【Authors】: Sebastian Lehrack
【Abstract】: Relational queries applied on probabilistic databases have been established as a powerful tool for accessing huge data sets of uncertain data. Often various parts of such queries have different significances for a specific user. Thus, a query language should allow us to give subqueries different weights to quantify the individual user preferences. In this work we introduce a theoretical foundation for weighted algebra operators on probabilistic databases within a SQL-like query language.
【Keywords】: probabilistic database; proqua; weighted query; weighting
【Paper Link】 【Pages】:2214-2218
【Authors】: Young-Kyoon Suh ; Ahmad Ghazal ; Alain Crolotte ; Pekka Kostamaa
【Abstract】: This paper introduces a new tool that recommends an optimized partitioning solution called Multi-Level Partitioned Primary Index (MLPPI) for a fact table based on the queries in the workload. The tool implements a new technique using a greedy algorithm for search space enumeration. The space is driven by predicates in the queries. This technique fits the Teradata MLPPI scheme very well, as it is based on a general framework using general expressions, ranges and case expressions for partition definitions. The cost model implemented in the tool is based on the Teradata optimizer, and it is used to prune the search space for reaching a final solution. The tool resides completely on the client, and interfaces with the database through APIs, as opposed to previous work that requires optimizer code extension. The APIs are used to simplify the workload queries, and to capture fact table predicates and costs necessary to make the recommendation. The predicate-driven method implemented by the tool is general, and it can be applied to any clustering or partitioning scheme based on simple field expressions or complex SQL predicates. Experimental results for a particular workload show that the recommendation from the tool outperforms a human expert. The experiments also show that the solution is scalable both with the workload complexity and the size of the fact table.
【Keywords】: fact table; multi-level partitioning; star schema
【Paper Link】 【Pages】:2219-2223
【Authors】: Carlos Ordonez ; Naveen Mohanam ; Carlos Garcia-Alvarado ; Predrag T. Tosic ; Edgar Martinez
【Abstract】: Efficient and scalable execution of numerical methods inside a DBMS is difficult as its architecture is not suited for intense numerical computations. We study computing Principal Component Analysis (PCA) on large data sets via Singular Value Decomposition (SVD). Given the difficulty to program and optimize numerical methods on an existing DBMS, we explore an alternative reusability approach: calling the well-known numerical library LAPACK. Thus we study several alternatives to summarize the data set with aggregate User-Defined Functions (UDFs) and how to efficiently call SVD numerical methods available in LAPACK via Stored Procedures (SPs). We propose algorithmic and system optimizations to enhance scalability and to push processing into RAM. We show it is feasible to efficiently solve PCA by first summarizing the data set with arrays incrementally updated with aggregate UDFs and then pushing heavy matrix processing in SVD to RAM calling LAPACK via SPs. We benchmark our solution on a modern DBMS. Our solution requires only one pass on the data set and it exhibits linear scalability.
【Keywords】: big data; lapack; linear algebra; numerical methods; sql
【Paper Link】 【Pages】:2224-2228
【Authors】: Sahand Negahban ; Benjamin I. P. Rubinstein ; Jim Gemmell
【Abstract】: We consider a serious, previously-unexplored challenge facing almost all approaches to scaling up entity resolution (ER) to multiple data sources: the prohibitive cost of labeling training data for supervised learning of similarity scores for each pair of sources. While there exists a rich literature describing almost all aspects of pairwise ER, this new challenge is arising now due to the unprecedented ability to acquire and store data from online sources, interest in features driven by ER such as enriched search verticals, and the uniqueness of noisy and missing data characteristics for each source. We show on real-world and synthetic data that for state-of-the-art techniques, the reality of heterogeneous sources means that the number of labeled training data must scale quadratically in the number of sources, just to maintain constant precision/recall. We address this challenge with a brand new transfer learning algorithm which requires far less training data (or equivalently, achieves superior accuracy with the same data) and is trained using fast convex optimization. The intuition behind our approach is to adaptively share structure learned about one scoring problem with all other scoring problems sharing a data source in common. We demonstrate that our theoretically-motivated approach improves upon existing techniques for multi-source ER.
【Keywords】: entity resolution; multi-task learning; transfer learning
【Paper Link】 【Pages】:2229-2233
【Authors】: Mahsa Orang ; Nematollaah Shiri
【Abstract】: Numerous real-life applications, such as wireless sensor networks and location-based services, generate large amounts of uncertain time series, where the exact value at each timestamp is unavailable or unknown. In this paper, we formalize the notion of correlation for uncertain time series data and consider a family of probabilistic, threshold-based correlation queries over such data. The proposed formulation extends the notion of correlation developed for standard, certain time series. We show that uncertain correlation is a random variable approaching normal distribution. We also formalize the notion of uncertain time series normalization, which is at the core of our correlation query processing approach and also proves to be an important pre-processing technique, in particular for pattern discovery tasks. The results of our numerous experiments indicate that, unlike in the standard time series, there is a trade-off between false alarms and hit ratios, which can be controlled by the probability threshold provided by users. Our results also offer users a guideline for choosing proper threshold values.
【Keywords】: correlation; probabilistic queries; uncertain time series
【Paper Link】 【Pages】:2234-2238
【Authors】: De-Nian Yang ; Wang-Chien Lee ; Nai-Hui Chia ; Mao Ye ; Hui-Ju Hung
【Abstract】: Prior research on viral marketing mostly focuses on promoting one single product item. In this work, we explore the idea of bundling multiple items for viral marketing and formulate a new research problem, called Bundle Configuration for SpreAd Maximization (BCSAM). Efficiently obtaining an optimal product bundle under the setting of BCSAM is very challenging. Aiming to strike a balance between the quality of solution and the computational overhead, we systematically explore various heuristics to develop a suite of algorithms, including κ-Bundle Configuration and Aggregated Bundle Configuration. Moreover, we integrate all the proposed ideas into one efficient algorithm, called Aggregated Bundle Configuration (ABC). Finally, we conduct an extensive performance evaluation on our proposals. Experimental results show that ABC significantly outperforms its counterpart and two baseline approaches in terms of both computational overhead and bundle quality.
【Keywords】: personal preference; product bundling; viral marketing
【Paper Link】 【Pages】:2239-2242
【Authors】: Jiankai Sun ; Shuaiqiang Wang ; Byron J. Gao ; Jun Ma
【Abstract】: Most existing recommender systems can be classified into two categories: collaborative filtering and content-based filtering. Hybrid recommender systems combine the advantages of the two for improved recommendation performance. Traditional recommender systems are rating-based. However, predicting ratings is an intermediate step towards their ultimate goal of generating rankings or recommendation lists. Learning to rank is an established means of predicting rankings and has recently demonstrated high promise in improving quality of recommendations. In this paper, we propose LRHR, the first attempt that adapts learning to rank to hybrid recommender systems. LRHR first defines novel representations for both users and items so that they can be content-comparable. Then, LRHR identifies a set of novel meta-level features for learning purposes. Finally, LRHR adopts RankSVM, a pairwise learning to rank algorithm, to generate recommendation lists of items for users. Extensive experiments on benchmarks in comparison with the state-of-the-art algorithms demonstrate the performance gain of our approach.
【Keywords】: collaborative filtering; features; learning to rank; recommender systems
【Paper Link】 【Pages】:2243-2246
【Authors】: Shuaiqiang Wang ; Xiaoming Xi ; Yilong Yin
【Abstract】: Importance weighted active learning (IWAL) introduces a weighting scheme to measure the importance of each instance for correcting the sampling bias of the probability distributions between training and test datasets. However, the weighting scheme of IWAL involves the distribution of the test data, which can be straightforwardly estimated in active learning by interactively querying users for labels of selected test instances, but is difficult to estimate in conventional learning where there are no interactions with users, referred to as passive learning. In this paper, we investigate the insufficient sampling bias problem, i.e., bias occurs only because of insufficient samples, but the sampling process is unbiased. In doing this, we present two assumptions on the sampling bias, based on which we propose a practical weighting scheme for the empirical loss function in conventional passive learning, and present IWPL, an importance weighted passive learning framework. Furthermore, we provide IWSVM, an importance weighted SVM for validation. Extensive experiments demonstrate significant advantages of IWSVM on benchmarks and synthetic datasets.
【Keywords】: classification; discounted confidence; learning with confidence
【Paper Link】 【Pages】:2247-2250
【Authors】: Lina Yao ; Quan Z. Sheng
【Abstract】: This paper studies the web object classification problem with the novel exploration of social tags. Web objects are increasingly annotated with human interpretable labels (i.e., tags), which can be considered as an auxiliary attribute to assist the object classification. Automatically classifying web objects into manageable semantic categories has long been a fundamental pre-process for indexing, browsing, searching, and mining heterogeneous web objects. However, such heterogeneous web objects often suffer from a lack of easily extractable and uniform descriptive features. In this paper, we propose a discriminative tag-centric model for web object classification by jointly modeling the objects' category labels and their corresponding social tags and un-coding the relevance among social tags. Our approach is based on recent techniques for learning large-scale discriminative models. We conduct experiments to validate our approach using real-life data. The results show the feasibility and good performance of our approach.
【Keywords】: optimization; semantic annotation; social tagging; web objects classification
【Paper Link】 【Pages】:2251-2254
【Authors】: Duck-Ho Bae ; Seo Jeong ; Sang-Wook Kim ; Minsoo Lee
【Abstract】: An outlier is an object that is considerably dissimilar to the remainder of the dataset. In this paper, we first propose the notions of centrality and center-proximity as novel outlierness measures which can be considered to represent the characteristics of all of the objects in the dataset. We then propose a graph-based outlier detection method which can solve the problems of local density, micro-cluster, and fringe objects. Finally, through extensive experiments, we show the effectiveness of the proposed method.
【Keywords】: center-proximity; centrality; graph-based outlier detection
【Paper Link】 【Pages】:2255-2258
【Authors】: Kyoungman Bae ; Youngjoong Ko
【Abstract】: Classifying a user's question into several topics helps respondents answer the question in a cQA service. The word weighting method must estimate the appropriate weight of a word to improve the category (or topic) classification. In this paper, we propose a novel and effective word weighting method based on a language model for automatic category classification in the cQA service. We first calculate the occurrence probability of a word in each category by using a language model, and then the final weight of each word is estimated as the ratio of the occurrence probability of the word in a category to the occurrence probability of the word in the other categories. As a result, the proposed method significantly improves the performance of the category classification.
【Keywords】: cQA service; category (or topic) classification; category (or topic) recommendation; language model; word weighting
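The ratio-based weighting described in the abstract above can be illustrated with a small example. This is a minimal sketch and not the authors' implementation: it assumes a unigram language model with add-one smoothing, pools all other categories into a single "rest" category, and uses hypothetical function names (`category_word_probs`, `word_weights`) and made-up toy questions.

```python
from collections import Counter

def category_word_probs(docs_by_category, vocab):
    """Unigram P(word | category) with add-one smoothing (an assumption)."""
    probs = {}
    for cat, docs in docs_by_category.items():
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values()) + len(vocab)
        probs[cat] = {w: (counts[w] + 1) / total for w in vocab}
    return probs

def word_weights(docs_by_category):
    """Weight of a word in a category: P(w | category) / P(w | other categories)."""
    vocab = {w for docs in docs_by_category.values() for doc in docs for w in doc}
    weights = {}
    for cat in docs_by_category:
        # Pool every other category into one "rest" category (an assumption).
        rest = [doc for c, docs in docs_by_category.items() if c != cat for doc in docs]
        probs = category_word_probs({"in": docs_by_category[cat], "rest": rest}, vocab)
        weights[cat] = {w: probs["in"][w] / probs["rest"][w] for w in vocab}
    return weights

# Hypothetical toy data: tokenized questions grouped by category.
toy = {
    "sports": [["football", "match", "score"], ["match", "team"]],
    "cooking": [["recipe", "oven", "bake"], ["recipe", "team"]],
}
weights = word_weights(toy)
```

Words that are frequent in one category and rare elsewhere receive a weight above 1, while words spread evenly across categories stay close to 1.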
【Paper Link】 【Pages】:2259-2262
【Authors】: Xiaohui Yan ; Jiafeng Guo ; Shenghua Liu ; Xueqi Cheng ; Yanfeng Wang
【Abstract】: Non-negative matrix factorization (NMF) has been successfully applied in document clustering. However, experiments on short texts, such as microblogs, Q&A documents and news titles, suggest unsatisfactory performance of NMF. A major reason is that the traditional term weighting schemes, like binary weight and tfidf, cannot capture the terms' discriminative power and importance in short texts well, due to the sparsity of data. To tackle this problem, we propose a novel term weighting scheme for NMF, derived from the Normalized Cut (Ncut) problem on the term affinity graph. Different from idf, which emphasizes discriminability on document level, the Ncut weighting measures terms' discriminability on term level. Experiments on two data sets show our weighting scheme significantly boosts NMF's performance on short text clustering.
【Keywords】: NMF; clustering; normalized cut; short text
【Paper Link】 【Pages】:2263-2266
【Authors】: Shuaiqiang Wang ; Byron J. Gao ; Shuangling Wang ; Guibao Cao ; Yilong Yin
【Abstract】: In this paper, we introduce polygene-based evolution, a novel framework for evolutionary algorithms (EAs) that features distinctive operations in the evolution process. In traditional EAs, the primitive evolution unit is gene, where genes are independent components during evolution. In polygene-based evolutionary algorithms (PGEAs), the evolution unit is polygene, i.e., a set of co-regulated genes. Discovering and maintaining quality polygenes can play an effective role in evolving quality individuals. Polygenes generalize genes, and PGEAs generalize EAs. Implementing the PGEA framework involves three phases: polygene discovery, polygene planting, and polygene-compatible evolution. Extensive experiments on function optimization benchmarks in comparison with the conventional and state-of-the-art EAs demonstrate the potential of the approach in accuracy and efficiency improvement.
【Keywords】: data mining; evolutionary algorithms; optimization; polygene
【Paper Link】 【Pages】:2267-2270
【Authors】: Michael Symonds ; Peter D. Bruza ; Laurianne Sitbon ; Ian Turner
【Abstract】: This paper develops and evaluates an enhanced corpus based approach for semantic processing. Corpus based models that build representations of words directly from text do not require pre-existing linguistic knowledge, and have demonstrated psychologically relevant performance on a number of cognitive tasks. However, they have been criticised in the past for not incorporating sufficient structural information. Using ideas underpinning recent attempts to overcome this weakness, we develop an enhanced tensor encoding model to build representations of word meaning for semantic processing. Our enhanced model demonstrates superior performance when compared to a robust baseline model on a number of semantic processing tasks.
【Keywords】: semantics; tensor encoding
【Paper Link】 【Pages】:2271-2274
【Authors】: Guanhong Yao ; Deng Cai
【Abstract】: Matrix factorization techniques have been frequently applied in information retrieval, computer vision and pattern recognition. Among them, Non-negative Matrix Factorization (NMF) has received considerable attention due to its psychological and physiological interpretation of naturally occurring data whose representation may be parts-based in the human brain. Locality Preserving Non-negative Matrix Factorization (LPNMF) is a recently proposed graph-based NMF extension which tries to preserve the intrinsic geometric structure of the data. Compared with the original NMF, LPNMF has more discriminating power on data representation thanks to its geometrical interpretation and outstanding ability to discover the hidden topics. However, the computational complexity of LPNMF is O(n^3), where n is the number of samples. In this paper, we propose a novel approach called Accelerated LPNMF (A-LPNMF) to solve the computational issue of LPNMF. Specifically, A-LPNMF selects p (p ≪ n) landmark points from the data and represents all the samples as the sparse linear combination of these landmarks. The non-negative factors which incorporate the geometric structure can then be efficiently computed. Experimental results on the real data sets demonstrate the effectiveness and efficiency of our proposed method.
【Keywords】: non-negative matrix factorization; speedup
【Paper Link】 【Pages】:2275-2278
【Authors】: Patrick Bamba ; Julien Subercaze ; Christophe Gravier ; Nabil Benmira ; Jimi Fontaine
【Abstract】: In this paper we present a Friend Recommender System for micro-blogging. Traditional batch processing of massive amounts of data makes it difficult to provide a near-real time friend recommender system or even a system that can properly scale to millions of users. In order to overcome these issues, we have designed a solution that represents user-generated micro posts as a set of pseudo-cliques. These graphs are assigned a hash value using an original Concept-Sensitive Hash function, a new sub-kind of Locally-Sensitive Hash functions. Finally, since the user profiles are represented as a binary footprint, the pairwise comparison of footprints using the Hamming distance provides scalability to the recommender system. The paper comes with an online application relying on a large Twitter dataset, so that the reader can freely experiment with the system.
【Keywords】: friends; graph; locally-sensitive hash; pseudo-clique; recommender system; social networks; twitter
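The final step mentioned in the abstract above, comparing binary footprints by Hamming distance, can be sketched as follows. This is only an illustration under stated assumptions: the Concept-Sensitive Hash function itself is not reproduced, the 8-bit footprints and user names are made up, and `hamming` and `nearest_users` are hypothetical helper names.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two footprints stored as integers."""
    return bin(a ^ b).count("1")

def nearest_users(target, footprints, k=2):
    """Return the k user ids whose footprints are closest to the target's."""
    others = [(hamming(footprints[target], fp), uid)
              for uid, fp in footprints.items() if uid != target]
    # Sort by distance first, then by user id to break ties deterministically.
    return [uid for _, uid in sorted(others)[:k]]

# Hypothetical 8-bit footprints; a real system would use much longer hashes.
footprints = {
    "alice": 0b10110010,
    "bob":   0b10110011,
    "carol": 0b01001100,
    "dave":  0b10100010,
}
closest = nearest_users("alice", footprints)
```

Because each comparison reduces to an XOR followed by a bit count, pairwise comparison stays cheap even for many users, which is the scalability argument the abstract makes.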
【Paper Link】 【Pages】:2279-2282
【Authors】: Priyanka Garg ; Irwin King ; Michael R. Lyu
【Abstract】: The polarity of an opinion is a crucial part of information, and ignoring the asymmetry between positive and negative opinions can potentially result in an inaccurate estimation of the number of product adoptions and incorrect recommendations. We analyze the propagation patterns of the negative and positive opinions on two real world datasets, Flixster and Epinions, and observe that the presence of negative opinions significantly reduces the number of expressed opinions. To account for the asymmetry between the two kinds of opinions, we propose extensions of the two most popular information propagation models, the Independent Cascade and Linear Threshold models. The proposed extensions give a tractable influence problem and improve the prediction accuracy of future opinions by more than 3% on the Flixster and 5% on the Epinions datasets.
【Keywords】: information flow; negative opinions; social networks
【Paper Link】 【Pages】:2283-2286
【Authors】: Paul Dütting ; Monika Henzinger ; Ingmar Weber
【Abstract】: Suppose your sole interest in recommending a product to me is to maximize the amount paid to you by the seller for a sequence of recommendations. How should you recommend optimally if I become more inclined to ignore you with each irrelevant recommendation you make? Finding an answer to this question is a key challenge in all forms of marketing that rely on and explore social ties; ranging from personal recommendations to viral marketing. We prove that even if the recommendee regains her initial trust on each successful recommendation, the expected revenue the recommender can make over an infinite period due to payments by the seller is bounded. This can only be overcome when the recommendee also incrementally regains trust during periods without any recommendation. Here, we see a connection to "banner blindness," suggesting that showing fewer ads can lead to a higher long-term revenue.
【Keywords】: banner blindness; recommendations; trust loss in advertising
【Paper Link】 【Pages】:2287-2290
【Authors】: Prakash Mandayam Comar ; Lei Liu ; Sabyasachi Saha ; Antonio Nucci ; Pang-Ning Tan
【Abstract】: Malware detection from network traffic flows is a challenging problem due to data irregularity issues such as imbalanced class distribution, noise, missing values, and heterogeneous types of features. To address these challenges, this paper presents a two-stage classification approach for malware detection. The framework initially employs random forest as a macro-level classifier to separate the malicious from non-malicious network flows, followed by a collection of one-class support vector machine classifiers to identify the specific type of malware. A novel tree-based feature construction approach is proposed to deal with data imperfection issues. As the performance of the support vector machine classifier often depends on the kernel function used to compute the similarity between every pair of data points, designing an appropriate kernel is essential for accurate identification of malware classes. We present a simple algorithm to construct a weighted linear kernel on the tree transformed features and demonstrate its effectiveness in detecting malware from real network traffic data.
【Keywords】: feature construction; kernels; malware detection
【Paper Link】 【Pages】:2291-2294
【Authors】: Chieh-Jen Wang ; Hsin-Hsi Chen
【Abstract】: In an Internet ad campaign, the ranking of an ad on search result pages depends on the cost-per-click (CPC) of ad words offered by an advertiser and a quality score estimated by a search engine. Bidding for ad words with a higher CPC is more competitive than bidding for the same ad words with a lower CPC in the ad ranking competition. However, offering a higher CPC increases the burden on advertisers. In contrast, offering a lower CPC may decrease the exposure rate of their ads. Thus, selecting an appropriate CPC for ad words is essential for advertisers. In this paper, we extract different semantic levels of features, such as named entities, topic terminologies, and individual words from a large-scale real-world ad words corpus, and explore various learning based prediction algorithms. The thorough experimental results show that the CPC prediction models considering more ad words semantics achieve better prediction performance, and the prediction model using the support vector regression (SVR) and features from all semantic levels performs the best.
【Keywords】: CPC prediction; ad ranking; search engine optimization
【Paper Link】 【Pages】:2295-2298
【Authors】: Shengfeng Ju ; Shoushan Li ; Yan Su ; Guodong Zhou ; Yu Hong ; Xiaojun Li
【Abstract】: Semi-supervised sentiment classification aims to train a classifier with a small number of labeled data (called seed data) and a large amount of unlabeled data. A big advantage of this approach is that it saves annotation effort by using unlabeled data, which is usually freely available. In this paper, we propose an approach to further minimize the annotation effort of semi-supervised sentiment classification by actively selecting the seed data. Specifically, a novel selection strategy is proposed to simultaneously select good words and documents for manual annotation by considering both their annotation costs and informativeness. Experimental results demonstrate the effectiveness of our approach.
【Keywords】: opinion mining; seed selection; semi-supervised; sentiment classification
【Paper Link】 【Pages】:2299-2302
【Authors】: Rohit Babbar ; Ioannis Partalas ; Éric Gaussier ; Cécile Amblard
【Abstract】: While multi-class categorization of documents has been of research interest for over a decade, relatively fewer approaches have been proposed for large scale taxonomies in which the number of classes ranges from hundreds of thousands, as in Directory Mozilla, to over a million in Wikipedia. As a result of the ever increasing number of text documents and images from various sources, there is an immense need for automatic classification of documents in such large hierarchies. In this paper, we analyze the tradeoffs between the important characteristics of different classifiers employed in the top down fashion. The properties for relative comparison of these classifiers include (i) accuracy on test instances, (ii) training time, (iii) size of the model, and (iv) test time required for prediction. Our analysis is motivated by the well known error bounds from learning theory, and is further reinforced by the empirical observations on the publicly available data from the Large Scale Hierarchical Text Classification Challenge. We show that by exploiting the data heterogeneity across the large scale hierarchies, one can build an overall classification system which is approximately 4 times faster for prediction and 3 times faster to train, while sacrificing only 1 percentage point in accuracy.
【Keywords】: empirical tradeoffs; hierarchical classification
【Paper Link】 【Pages】:2303-2306
【Authors】: Jingsong Zhang ; Yinglin Wang ; Hao Wei
【Abstract】: Ontology plays a very important role in supporting knowledge-based applications. In cloud computing, ontology learning technology is facing new challenges in dealing with heterogeneous data sources from different domains and researchers, which may contain various particular concepts and relations. Traditional ontology learning frameworks usually focus only on the extraction of concepts and taxonomic relations from the multi-structured corpus. However, previous research has rarely studied the interactions among different researchers during the ontology learning process. A lack of interaction among people who build ontologies in different domains may cause inconsistent ontologies. Besides, a lack of incentive during the ontology building process will also result in low efficiency. To address these challenges, this paper specifies a novel solution to perform ontology learning. The solution includes a service-oriented ontology interaction framework and a service-oriented ontology learning strategy. A number of experiments in a demo system show that it advances ontology learning to a higher level of performance and portability.
【Keywords】: cloud computing; ontology; ontology interaction; ontology learning; service-oriented framework
【Paper Link】 【Pages】:2307-2310
【Authors】: Afroza Sultana ; Quazi Mainul Hasan ; Ashis Kumer Biswas ; Soumyava Das ; Habibur Rahman ; Chris H. Q. Ding ; Chengkai Li
【Abstract】: Given the sheer amount of work and expertise required in authoring Wikipedia articles, automatic tools that help Wikipedia contributors in generating and improving content are valuable. This paper presents our initial step towards building a full-fledged author assistant, particularly for suggesting infobox templates for articles. We build SVM classifiers to suggest infobox template types, among a large number of possible types, to Wikipedia articles without infoboxes. Different from prior works on Wikipedia article classification, which deal with only a few label classes for named entity recognition, the much larger 337-class setup in our study is geared towards realistic deployment of an infobox suggestion tool. We also emphasize testing on articles without infoboxes, because labeled and unlabeled data exhibit different distributions of features, which departs from the typical assumption that they are drawn from the same underlying population.
【Keywords】: text classification; wikipedia
【Paper Link】 【Pages】:2311-2314
【Authors】: Pedro G. Campos ; Alejandro Bellogín ; Fernando Díez ; Iván Cantador
【Abstract】: Popular online rental services such as Netflix and MoviePilot often manage household accounts. A household account is usually shared by various users who live in the same house, but in general does not provide a mechanism by which current active users are identified, and thus leads to considerable difficulties for making effective personalized recommendations. The identification of the active household members, defined as the discrimination of the users from a given household who are interacting with a system (e.g. an on-demand video service), is thus an interesting challenge for the recommender systems research community. In this paper, we formulate the above task as a classification problem, and address it by means of global and local feature selection methods and classifiers that only exploit time features from past item consumption records. The results obtained from a series of experiments on a real dataset show that some of the proposed methods are able to select relevant time features, which allow simple classifiers to accurately identify active members of household accounts.
【Keywords】: feature selection; household member identification; recommender systems; time features
【Paper Link】 【Pages】:2315-2318
【Authors】: Fumiyo Fukumoto ; Takeshi Yamamoto ; Suguru Matsuyoshi ; Yoshimi Suzuki
【Abstract】: This paper addresses the problem of dealing with a collection of negative training documents which is suitable for a relatively small number of positive documents, and presents a method for eliminating the need for manually collecting negative training documents based on supervised machine learning techniques. We applied an error correction technique to the results of negative training data obtained by the Positive Example Based Learning (PEBL). Moreover, we used a boosting technique to learn a set of negative data to train classifiers. The results using Japanese newspaper documents showed that the method contributes to reducing the cost of manual collection of negative training documents.
【Keywords】: small positive documents and unlabeled data; text classification
【Paper Link】 【Pages】:2319-2322
【Authors】: Wei Liu ; Andrey Kan ; Jeffrey Chan ; James Bailey ; Christopher Leckie ; Jian Pei ; Kotagiri Ramamohanarao
【Abstract】: Existing graph compression techniques mostly focus on static graphs. However, for many practical graphs such as social networks, the edge weights frequently change over time. This phenomenon raises the question of how to compress dynamic graphs while maintaining most of their intrinsic structural patterns at each time snapshot. In this paper we show that the encoding cost of a dynamic graph is proportional to the heterogeneity of a three dimensional tensor that represents the dynamic graph. We propose an effective algorithm that compresses a dynamic graph by reducing the heterogeneity of its tensor representation, and at the same time also maintains a maximum lossy compression error at any time stamp of the dynamic graph. The bounded compression error benefits compressed graphs in that they retain good approximations of the original edge weights, and hence properties of the original graph (such as shortest paths) are well preserved. To the best of our knowledge, this is the first work that compresses weighted dynamic graphs with bounded lossy compression error at any time snapshot of the graph.
【Keywords】: dynamic graphs; graph compression; graph mining
【Paper Link】 【Pages】:2323-2326
【Authors】: Yajuan Duan ; Furu Wei ; Ming Zhou ; Heung-Yeung Shum
【Abstract】: In this paper, we address the problem of classifying tweets into topical categories. Because of the short, noisy and ambiguous nature of tweets, we propose to conduct the classification collectively by exploiting the context information (i.e. related tweets), rather than individually as in conventional text classification methods. In particular, we augment the content-based representation of text with tweets sharing the same #hashtag or URL, which results in a tweet graph. We then formulate the tweet classification task under a graph optimization framework. We investigate three popular approaches, namely, Loopy Belief Propagation (LBP), Relaxation Labeling (RL), and Iterative Classification Algorithm (ICA). Extensive experimental results show that the graph-based tweet classification approach remarkably improves the performance, while the ICA model with relationship of sharing the same #hashtag gives the best result on separate tweet graph.
【Keywords】: graph-based classification; tweet classification
【Paper Link】 【Pages】:2327-2330
【Authors】: Lakshmi Ramachandran ; Edward F. Gehringer
【Abstract】: In this paper we propose a new word-order based graph representation for text. In our graph representation vertices represent words or phrases and edges represent relations between contiguous words or phrases. The graph representation also includes dependency information. Our text representation is suitable for applications involving the identification of relevance or paraphrases across texts, where word-order information would be useful. We show that this word-order based graph representation performs better than a dependency tree representation while identifying the relevance of one piece of text to another.
【Keywords】: text relevance; text representation; word-order graph
【Paper Link】 【Pages】:2331-2334
【Authors】: Brigitte Boden ; Stephan Günnemann ; Thomas Seidl
【Abstract】: Data sources representing social networks with additional attribute information about the nodes are widely available in today's applications. Recently, combined clustering methods were introduced that consider graph information and attribute information simultaneously to detect meaningful clusters in such networks. In many cases, such attributed graphs also evolve over time. Therefore, there is a need for clustering methods that are able to trace clusters over different time steps and analyze their evolution over time. In this paper, we extend our combined clustering method DB-CSC to the analysis of evolving combined clusters.
【Keywords】: community detection; evolution; graph clustering; networks
【Paper Link】 【Pages】:2335-2338
【Authors】: Andrey Kupavskii ; Liudmila Ostroumova ; Alexey Umnov ; Svyatoslav Usachev ; Pavel Serdyukov ; Gleb Gusev ; Andrey Kustarev
【Abstract】: Retweet cascades play an essential role in information diffusion in Twitter. Popular tweets reflect the current trends in Twitter, while Twitter itself is one of the most important online media. Thus, understanding the reasons why a tweet becomes popular is of great interest for sociologists, marketers and social media researchers. What is even more important is the possibility of forecasting a tweet's future popularity. Besides the scientific significance of such a possibility, this sort of prediction has many practical applications, such as breaking news detection and viral marketing. In this paper we try to forecast how many retweets a given tweet will gain during a fixed time period. We train an algorithm that predicts the number of retweets during time T since the initial moment. In addition to a standard set of features we utilize several new ones. One of the most important features is the flow of the cascade. Another one is PageRank on the retweet graph, which can be considered a measure of the influence of users.
【Keywords】: influence; information diffusion; retweet cascade
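The PageRank feature mentioned in the abstract above can be illustrated with a minimal power-iteration sketch. This is not the authors' implementation: the edge direction (retweeter to original author), the damping factor, the iteration count, and the user names are all assumptions made for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed graph given as (src, dst) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical retweet edges, assumed to point from retweeter to original author,
# so that rank flows towards users whose tweets get retweeted.
retweets = [("bob", "alice"), ("carol", "alice"), ("dave", "bob"), ("alice", "bob")]
scores = pagerank(retweets)
```

Under this edge direction, users who are retweeted by many (or by highly ranked) users receive higher scores, which is one way to read "influence" in the abstract.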
【Paper Link】 【Pages】:2339-2342
【Authors】: Guohua Liang ; Chengqi Zhang
【Abstract】: Imbalanced time series classification (TSC), which arises in many real-world applications, has increasingly captured the attention of researchers. Previous work has proposed an intelligent structure-preserving over-sampling method (SPO), which the authors claimed achieved better performance than other existing over-sampling and state-of-the-art methods in TSC. The main disadvantage of over-sampling methods is that they significantly increase the computational cost of training a classification model due to the addition of new minority class instances to balance data-sets with high dimensional features. These challenging issues have motivated us to find a simple and efficient solution for imbalanced TSC. Statistical tests are applied to validate our conclusions. The experimental results demonstrate that the proposed simple random under-sampling technique with SVM is efficient and can achieve results that compare favorably with the existing complicated SPO method for imbalanced TSC.
【Keywords】: SVM; imbalanced time series classification; under-sampling
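Random under-sampling, as named in the abstract above, can be sketched in a few lines. This is a generic illustration rather than the paper's exact procedure: the function name, the fixed seed, and the toy data are assumptions, and the SVM training step that would follow is omitted.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly drop instances so every class keeps as many as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    # Size of the minority class determines how many instances each class keeps.
    target = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        for sample in rng.sample(items, target):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    xs = [sample for sample, _ in balanced]
    ys = [label for _, label in balanced]
    return xs, ys


# Toy data: ten short "time series", two in the minority class (label 1).
series = [[0.1 * i, 0.2 * i] for i in range(10)]
labels = [1, 1] + [0] * 8
xs, ys = random_undersample(series, labels)
```

The balanced `xs`, `ys` would then be passed to a classifier such as an SVM. Unlike over-sampling, the training set shrinks, which is the source of the efficiency gain the abstract refers to.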
【Paper Link】 【Pages】:2343-2346
【Authors】: Jiwoon Ha ; Soon-Hyoung Kwon ; Sang-Wook Kim ; Christos Faloutsos ; Sunju Park
【Abstract】: The top-n recommendation focuses on finding the top-n items that the target user is likely to purchase rather than predicting his/her ratings on individual items. In this paper, we propose a novel method that provides top-n recommendation by probabilistically determining the target user's preference on items. This method models the purchasing relationships between users and items as a bipartite graph and employs Belief Propagation to compute the preference of the target user on items. We analyze the proposed method in detail by examining the changes in recommendation accuracy under different parameter settings. We also show that the proposed method is up to 40% more accurate than an existing method by comparing it with an RWR-based method via extensive experiments.
【Keywords】: belief propagation; data mining; top-n recommendation
【Paper Link】 【Pages】:2347-2350
【Authors】: Alfan Farizki Wicaksono ; Sung-Hyon Myaeng
【Abstract】: Weblogs, one of the fastest growing forms of user-generated content, often contain key learnings gleaned from people's past experiences that are well worth presenting to other people. One of the key learnings contained in weblogs is often expressed in the form of advice. In this paper, we aim to provide a methodology to extract sentences that reveal advice in weblogs. We examined our data to discover the characteristics of advice contained in weblogs. Based on this observation, we define our task as a classification problem using various linguistic features. We show that our proposed method significantly outperforms the baseline. The presence or absence of an imperative mood expression appears to be the most important feature in this task. It is also worth noting that the work presented in this paper is the first attempt at mining advice from English data.
【Keywords】: advice mining; text mining
【Paper Link】 【Pages】:2351-2354
【Authors】: Zhenfeng Zhu ; Xingquan Zhu ; Yangdong Ye ; Yue-Fei Guo ; Xiangyang Xue
【Abstract】: Proximal support vector machine (PSVM) is a simple but effective classifier, especially for solving large-scale data classification problems. An inherent deficiency of PSVM lies in its inefficiency in dealing with high-dimensional data. In this paper, we propose a parallel version of PSVM (PPSVM). Based on random dimensionality partitioning, PPSVM can obtain partitioned local model parameters in parallel, which are then combined to form the final global solution. PPSVM enjoys two properties: 1) It can calculate model parameters in parallel and is therefore a fast learning method with theoretically proved convergence; and 2) It can avoid the inversion of a large matrix, which makes it suitable for high-dimensional data. In the paper, we also propose a random PPSVM with randomly partitioned data in each iteration to improve the performance of PSVM. Experimental results on real-world data demonstrate that the proposed methods can obtain similar or even better prediction accuracy than PSVM with much better runtime efficiency.
【Keywords】: high dimensionality; parallel; proximal support vector machine
【Paper Link】 【Pages】:2355-2358
【Authors】: Won-Seok Hwang ; Ho-Jong Lee ; Sang-Wook Kim ; Minsoo Lee
【Abstract】: A variety of recommendation methods have been proposed to satisfy both performance and accuracy requirements; however, it is fairly difficult to satisfy both because there is a trade-off between them. In this paper, we introduce the notion of category experts and propose a recommendation method that exploits the ratings of category experts instead of those of the users similar to a target user. We also extend the method to use both the category preference of a target user and his/her similarity to category experts. Through extensive experiments with real-world data, we show that our method significantly outperforms the existing methods in terms of performance and accuracy.
【Keywords】: collaborative filtering; expert; performance evaluation; recommender system
【Paper Link】 【Pages】:2359-2362
【Authors】: Jinyoung Yeo ; Jin-Woo Park ; Seung-won Hwang
【Abstract】: In this paper, we propose a new type of market model called the social domination game model. Given a set C of customers and a set P of products, this model simulates market competition among P and estimates market shares, considering both the dominance relation between C and P and the influence relation among the members of C. With this model, we propose a greedy product positioning algorithm for designing a new product that approximately maximizes market share. Our experimental results show that the proposed algorithm creates a new product gaining up to 97.5% of the market share of the best product obtained by the exact method, while significantly outperforming the exact method in terms of running time, i.e., by up to two orders of magnitude.
【Keywords】: domination game; influence propagation
【Paper Link】 【Pages】:2363-2366
【Authors】: Madian Khabsa ; Pucktada Treeratpituk ; C. Lee Giles
【Abstract】: Given a set of automatically extracted entities E of size n, we would like to cluster all the various names referring to the same canonical entity together. The variations of each entity include acronyms, full names, and informal naming conventions. We propose using search engine results to cluster variations of each entity based on the URLs appearing in those results. We create a cluster C for each top search result returned by querying for the entity e ∈ E, assigning e to the cluster C. Our experiments on a manually created dataset show that our approach achieves higher precision and recall than string matching algorithms and hierarchical clustering based disambiguation methods.
【Keywords】: disambiguation; entity resolution; search engines
【Paper Link】 【Pages】:2367-2370
【Authors】: Hikaru Takemura ; Keishi Tajima
【Abstract】: Many microblog messages remain useful only within a short time, and users often find such a message after its informational value has vanished. Users also sometimes miss old but still useful messages buried among outdated ones. To solve these problems, we develop a method of classifying messages into the following three categories: (1) messages that users should read now because their value will diminish soon, (2) messages that users may read later because their value will not change much soon, and (3) messages that are not useful anymore because their value has vanished. Our method uses an error correcting output code consisting of binary classifiers, each of which determines whether a given message has value at a specific time point. Our experiments on Twitter data confirmed that it outperforms naive methods.
【Keywords】: filtering; microblog; real-time; time-dependency; twitter
【Paper Link】 【Pages】:2371-2374
【Authors】: Xiao Yang ; Zhaoxin Zhang ; Ke Wang
【Abstract】: Traditional collaborative filtering approaches have been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. To address these problems, we present a novel scalable item-based collaborative filtering method that uses incremental update and local link prediction. By subdividing the computations and analyzing the factors in different cases of item-to-item similarity, we design incremental update strategies for item-based CF, which can make the recommender system more efficient and scalable. Based on the transitive structure of the item similarity graph, we use the local link prediction method to find implicit candidates to alleviate the lack of neighbors in predictions and recommendations caused by the sparsity of data. The experimental results validate that our algorithm can improve the performance of traditional CF, and can increase the efficiency of recommendations.
【Keywords】: collaborative filtering; incremental update; link prediction; scalability; similarity graph
【Paper Link】 【Pages】:2375-2378
【Authors】: Cheng-Te Li ; Man-Kwan Shan
【Abstract】: One important function of current social networking services is allowing users to initiate different kinds of activity groups (e.g. study group, cocktail party, and group buying) and invite friends to attend, either manually or collaboratively. However, such a process of group formation is tedious, and could either include inappropriate group members or miss relevant ones. This work proposes to automatically compose activity groups in a social network according to user-specified activity information. Given the activity host, a set of labels representing the activity's subjects, the desired group size, and a set of persons who must be included, we aim to find a set of individuals as the activity group, in which members are required not only to be familiar with the host but also to communicate well with each other. We devise an approximation algorithm to greedily solve the group composing problem. Experiments on a real social network show the promising effectiveness of the proposed approach, as well as satisfactory results in a human subjective study.
【Keywords】: activity group; group formation; social network
【Paper Link】 【Pages】:2379-2382
【Authors】: Xu Chen ; Zhiyong Peng ; Cheng Zeng
【Abstract】: Patents are public scientific documents protected by law, and their abstracts contain highly valuable information. Semantic annotation of patents can help protect intellectual property rights and promote corporations' scientific research and innovation. Currently, automatic patent annotation mainly uses supervised machine learning algorithms, which require abundant, expensive labeled patent data. Because labeled Chinese patent data is scarce, this paper adopts a semi-supervised machine learning method named co-training, which starts from a small amount of labeled data. This method combines keyword extraction with list extraction and incrementally annotates functional clauses in patent abstracts. Experimental results indicate that this method can gradually improve recall without sacrificing precision.
【Keywords】: co-training; information extraction; patent mining; semantic annotation
【Paper Link】 【Pages】:2383-2386
【Authors】: Xianling Mao ; Zhaoyan Ming ; Zheng-Jun Zha ; Tat-Seng Chua ; Hongfei Yan ; Xiaoming Li
【Abstract】: Recently, statistical topic modeling has been widely applied in text mining and knowledge management due to its powerful ability. A topic, as a probability distribution over words, is usually difficult to understand. A common, major challenge in applying such topic models to other knowledge management problems is to accurately interpret the meaning of each topic. Topic labeling, as a major interpreting method, has attracted significant attention recently. However, previous works simply treat topics individually without considering the hierarchical relation among topics, and less attention has been paid to creating good hierarchical topic descriptors for a hierarchy of topics. In this paper, we propose two effective algorithms that automatically assign concise labels to each topic in a hierarchy by exploiting sibling and parent-child relations among topics. The experimental results show that the inter-topic relation is effective in boosting topic labeling accuracy and the proposed algorithms can generate meaningful topic labels that are useful for interpreting the hierarchical topics.
【Keywords】: statistical topic models; topic model labeling
【Paper Link】 【Pages】:2387-2390
【Authors】: Jing Liu ; Xinying Song ; Jingtian Jiang ; Chin-Yew Lin
【Abstract】: In this paper, we address the problem of author extraction (AE) from user generated content (UGC) pages. Most existing solutions for web information extraction, including AE, adopt supervised approaches, which require expensive manual annotation. We propose a novel unsupervised approach for automatically collecting and labeling training data based on two key observations of author names: (1) people tend to use a single name across sites if their preferred names are available; (2) people tend to create unique usernames to easily distinguish themselves from others, e.g. travelbug61. Our AE solution only requires features extracted from a single UGC page instead of relying on clues from multiple UGC pages. We conducted extensive experiments. (1) The evaluation of automatically labeled author field data shows 95.0% precision. (2) Our method achieves an F1 score of 96.1%, which significantly outperforms a state-of-the-art supervised approach with single page features (F1 score: 68.4%) and has a comparable performance to its multiple page solution (F1 score: 95.4%). (3) We also examine the robustness of our approach on various UGC pages from forums and review sites, and achieve promising results as well.
【Keywords】: author extraction; unsupervised approach
【Paper Link】 【Pages】:2391-2394
【Authors】: Krisztian Balog ; Robert Neumayer
【Abstract】: A significant portion of information needs in web search target entities. These may come in different forms or flavours, ranging from short keyword queries to more verbose requests, expressed in natural language. We address the task of automatically annotating queries with target types from an ontology. The identified types can subsequently be used, e.g., for creating semantically more informed query and retrieval models, filtering results, or directing the requests to specific verticals. Our study makes the following contributions. First, we formalise the task of hierarchical target type identification, argue that it is best viewed as a ranking problem, and propose multiple evaluation metrics. Second, we develop a purpose-built test collection by hand-annotating over 300 queries, from various recent entity search benchmarking campaigns, with target types from the DBpedia ontology. Finally, we introduce and examine two baseline models, inspired by federated search techniques. We show that these methods perform surprisingly well when target types are limited to a flat list of top level categories; finding the right level of granularity in the hierarchy, however, is particularly challenging and requires further investigation.
【Keywords】: entity retrieval; query classification; semantic search
【Paper Link】 【Pages】:2395-2398
【Authors】: Rishabh Mehrotra ; Rushabh Agrawal ; Syed Aqueel Haider
【Abstract】: Machine Learning algorithms are often only as good as the data they can learn from. An enormous amount of unlabeled data is readily available, and the ability to efficiently use such amounts of unlabeled data holds significant promise in terms of increasing the performance of various learning tasks. We consider the task of supervised Domain Adaptation and present a Self-Taught learning based framework which makes use of the K-SVD algorithm for learning a sparse representation of data in an unsupervised manner. To the best of our knowledge this is the first work that integrates the K-SVD algorithm into the self-taught learning framework. The K-SVD algorithm iteratively alternates between sparse coding of the instances based on the current dictionary and a process of updating/adapting the dictionary to better fit the data so as to achieve a sparse representation under strict sparsity constraints. Using the learnt dictionary, a rich feature representation of the few labeled instances is obtained, which is fed to a classifier along with class labels to build the model. We evaluate our framework on the task of domain adaptation for sentiment classification. Both self-domain (requiring very few domain-specific training instances) and cross-domain classification (requiring 0 labeled instances of target domain and very few labeled instances of source domain) are performed. Empirical comparisons of self-domain and cross-domain results establish the efficacy of the proposed framework.
【Keywords】: domain adaptation; sparse representation; transfer learning
【Paper Link】 【Pages】:2399-2402
【Authors】: Qi Zhang ; Yan Wu ; Xuanjing Huang
【Abstract】: Pseudo-relevance feedback via query expansion has been widely studied from various perspectives in the past decades. Its value in improving retrieval effectiveness has been shown in many tasks. A variety of criteria were proposed to select additional terms for the original queries. However, most of the existing methods weight and select terms individually and do not consider the impact of term-to-term relationships. In this paper, we first examine the influence of combinations of terms through data analysis, which demonstrates the significant effect of term-to-term relationships on retrieval effectiveness. Then, to address this problem, we formalize the query expansion task as an integer linear programming (ILP) problem. The model combines the weights learned from a supervised method for individual terms, and integrates constraints to capture relations between terms. Finally, three standard TREC collections are used to evaluate the proposed method. Experimental results demonstrate that the proposed method can significantly improve the effectiveness of retrieval.
【Keywords】: integer linear programming; relevance feedback
【Paper Link】 【Pages】:2403-2406
【Authors】: Killian Levacher ; Séamus Lawless ; Vincent Wade
【Abstract】: Content slicing addresses the need of adaptive systems to reuse open corpus material by converting it into re-composable information objects. However this conversion is highly dependent upon the ability to correctly fragment pages into structurally sound atomic pieces. A recently suggested approach to fragmentation, which relies on densitometric page representation, claims to achieve high accuracy and time performance. Although it has been well received within the research community, a full evaluation of this approach and identification of strengths and weaknesses across a range of characteristics hasn't been performed. This paper proposes an independent evaluation of the approach with respect to granularity control, accuracy, time performance, content diversity and linguistic dependency. Moreover, this paper also provides a significant contribution to address important weaknesses discovered during the analysis, in order to improve the suitability and impact of the original algorithm within the context of content slicing.
【Keywords】: analysis; densitometric; fragmentation; open-corpus
【Paper Link】 【Pages】:2407-2410
【Authors】: Shinil Kim ; Seon Yang ; Youngjoong Ko
【Abstract】: This paper proposes a method for effectively retrieving mathematical equations when plain words are given as a query. The proposed system requires no complicated mathematical symbols, no particular input tool and no constraints on the query. Users can enter a query in plain words, as in traditional information retrieval. For this, we extract features from the plain texts that are converted from the real math equations. Experimental results show outstanding performance, with an MRR of 0.6585.
【Keywords】: identifier & number; mathML; mathematical equation retrieval; operator & structure
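For readers unfamiliar with the MRR figure reported in the abstract above, mean reciprocal rank averages, over all queries, the reciprocal of the rank at which the first relevant result appears. A minimal sketch follows; the query and equation identifiers are hypothetical, and queries with no relevant result in the ranking are assumed to contribute 0.

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """Average of 1/rank of the first relevant item per query (0 if none is found)."""
    total = 0.0
    for query_id, ranking in ranked_results.items():
        for position, item_id in enumerate(ranking, start=1):
            if item_id in relevant[query_id]:
                total += 1.0 / position
                break
    return total / len(ranked_results)


# Hypothetical rankings of equation ids for three plain-word queries.
rankings = {"q1": ["e3", "e1", "e7"], "q2": ["e2", "e9"], "q3": ["e5", "e6"]}
relevant = {"q1": {"e1"}, "q2": {"e2"}, "q3": {"e8"}}
mrr = mean_reciprocal_rank(rankings, relevant)  # (1/2 + 1 + 0) / 3 = 0.5
```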
【Paper Link】 【Pages】:2411-2414
【Authors】: Mingda Wu ; Shan Jiang ; Yan Zhang
【Abstract】: Under the joint influence of the presentation of search results and users' browsing and clicking habits, the click probability distribution does not merely obey a monotonically decreasing Zipf function. In this paper, we present evidence that click behavior on the entries of search engines' result pages is influenced by the serial position effect, which is independent of how these entries are ordered, and introduce a new function to characterize the click probability distribution.
【Keywords】: click behavior; principle of least effort; serial position effect; zipf's law
【Paper Link】 【Pages】:2415-2418
【Authors】: Jin-Woo Jeong ; Xin-Jing Wang ; Dong-Ho Lee
【Abstract】: In this paper, we propose a new method to measure the visualness of a concept. The visualness of a concept is generally defined as the extent to which a concept has visual characteristics. Even though the visualness of a concept is important and useful for various image search tasks, it has not received much attention yet. In this work, we focus especially on how to measure the visualness of a complex concept such as "round table" or "dry bed" rather than a simple concept like "ball" or "apple". To measure the visualness, we first collect sample images of a complex concept using web image search engines, and then group the images based on their visual features. Finally, we compute visual purity and weighted entropy of the clusters, which act as a visualness score for the concept. Through various experiments, we show and discuss interesting results about the visualness of a concept.
【Keywords】: concept visualness; image clustering; visual purity
【Paper Link】 【Pages】:2419-2422
【Authors】: Nima Asadi ; Jimmy Lin
【Abstract】: Most modern web search engines employ a two-phase ranking strategy: a candidate list of documents is generated using a "cheap" but low-quality scoring function, which is then reranked by an "expensive" but high-quality method (usually machine-learned). This paper focuses on the problem of candidate generation for conjunctive query processing in this context. We describe and evaluate fast, approximate postings list intersection algorithms based on Bloom filters. Due to the power of modern learning-to-rank techniques and emphasis on early precision, significant speedups can be achieved without loss of end-to-end retrieval effectiveness. Explorations reveal a rich design space where effectiveness and efficiency can be balanced in response to specific hardware configurations and application scenarios.
【Keywords】: learning to rank; postings lists intersection; scalability and efficiency
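The general idea of approximate postings list intersection with Bloom filters can be sketched as follows. This is an illustrative toy, not the algorithms evaluated in the paper: the `BloomFilter` class, its size and hash choices, and the posting lists are all assumptions. One list is summarized as a Bloom filter and the other is probed against it, so the result may contain false positives but never misses a true match.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; num_bits and num_hashes are illustrative choices."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def approximate_intersection(short_postings, long_filter):
    """Keep doc ids from the short list that the filter reports as present."""
    return [doc_id for doc_id in short_postings if doc_id in long_filter]


# Hypothetical posting lists (document ids) for two query terms.
postings_a = [2, 5, 9, 14, 21]
postings_b = list(range(0, 200, 3))
filter_b = BloomFilter()
for doc_id in postings_b:
    filter_b.add(doc_id)
candidates = approximate_intersection(postings_a, filter_b)
```

In a two-phase ranking setup like the one the abstract describes, occasional false positives in `candidates` would simply be passed on to the second-stage reranker.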
【Paper Link】 【Pages】:2423-2426
【Authors】: Chaoran Cui ; Jun Ma ; Shuaiqiang Wang ; Shuai Gao ; Tao Lian
【Abstract】: Automatic image annotation plays an important role in modern keyword-based image retrieval systems. Recently, many neighbor-based methods have been proposed and achieved good performance for image annotation. However, existing work mainly focused on exploring a distance metric learning algorithm to determine the neighbors of an image, and neglected the subsequent keyword propagation process. They usually used some simple heuristic propagation rules, and propagated each keyword independently without considering the inherent semantic coherence among keywords. In this paper, we propose a novel learning-based keyword propagation strategy and incorporate it into the neighbor-based method framework. In particular, we employ the structural SVM to learn a scoring function which can evaluate different candidate keyword sets for a test image. Moreover, we explicitly enforce the semantic coherence constraint for the propagated keywords in our approach. The annotation of the test image is propagated as a whole rather than separate keywords. Experiments on two benchmark data sets demonstrate the effectiveness of our approach for image annotation and ranked retrieval.
【Keywords】: image annotation; semantic coherence; structural learning
【Paper Link】 【Pages】:2427-2430
【Authors】: Kareem Darwish ; Walid Magdy ; Ahmed Mourad
【Abstract】: The use of social media has profoundly affected social and political dynamics in the Arab world. In this paper, we explore Arabic microblog retrieval. We illustrate some of the challenges associated with Arabic microblog retrieval, which mainly stem from the use of different Arabic dialects that vary in lexical selection, morphology, and phonetics and lack orthographic and spelling conventions. We present some of the required processing for effective retrieval, such as improved letter normalization, elongated word handling, stopword removal, and stemming.
【Keywords】: arabic retrieval; arabic twitter; dialect arabic normalization; microblog search
【Paper Link】 【Pages】:2431-2434
【Authors】: Hichem Bannour ; Céline Hudelot
【Abstract】: Semantic hierarchies have been introduced recently to improve image annotation. They have been used as a framework for hierarchical image classification, and thus to improve classifier accuracy and reduce the complexity of managing large-scale data. In this paper, we investigate the contribution of semantic hierarchies to hierarchical image classification. We first propose a new method based on the hierarchy structure to train hierarchical classifiers efficiently. Our method, named One-Versus-Opposite-Nodes, allows decomposing the problem into several independent tasks and therefore scales well with large databases. We also propose two methods for computing a hierarchical decision function that serves to annotate new image samples. The former is performed by top-down classifier voting, while the latter is based on bottom-up score fusion. The experiments on the Pascal VOC'2010 dataset showed that our methods improve the image annotation results.
【Keywords】: hierarchical image classification; image annotation; machine learning; semantic hierarchies
【Paper Link】 【Pages】:2435-2438
【Authors】: Ronan Cummins
【Abstract】: Modelling the document scores returned from an IR system for a given query using parameterised score distributions is an area of research that has become more popular in recent years. Score distribution (SD) models are useful for a number of IR tasks. These include data fusion, query performance prediction, determining thresholds in filtering applications, and tasks in the area of distributed retrieval. The inference of performance metrics, such as average precision, from these SD models is an important consideration. In this paper, we study the accuracy of a number of methods of inferring average precision from an SD model.
【Keywords】: inference; information retrieval; score distributions
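As background for the abstract above, average precision itself (the metric being inferred, not the paper's inference methods from score distribution models) can be computed from a ranked list of binary relevance judgments. The sketch below is a standard formulation; the function name and the example ranking are illustrative assumptions.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Average of precision@k taken at each rank k where a relevant item appears.

    ranked_relevance: list of 0/1 judgments in rank order.
    total_relevant: number of relevant items for the query; defaults to the
    number of relevant items present in the ranking.
    """
    if total_relevant is None:
        total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant


# Relevant documents at ranks 1 and 3: (1/1 + 2/3) / 2 = 5/6.
ap = average_precision([1, 0, 1, 0, 0])
```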
【Paper Link】 【Pages】:2439-2442
【Authors】: Bevan Koopman ; Guido Zuccon ; Peter Bruza ; Laurianne Sitbon ; Michael Lawley
【Abstract】: Measures of semantic similarity between medical concepts are central to a number of techniques in medical informatics, including query expansion in medical information retrieval. Previous work has mainly considered thesaurus-based path measures of semantic similarity and has not compared different corpus-driven approaches in depth. We evaluate the effectiveness of eight common corpus-driven measures in capturing semantic relatedness and compare these against human judged concept pairs assessed by medical professionals. Our results show that certain corpus-driven measures correlate strongly (approx 0.8) with human judgements. An important finding is that performance was significantly affected by the choice of corpus used in priming the measure, i.e., used as evidence from which corpus-driven similarities are drawn. This paper provides guidelines for the implementation of semantic similarity measures for medical informatics and concludes with implications for medical information retrieval.
【Keywords】: medical information retrieval; semantic similarity
【Paper Link】 【Pages】:2443-2446
【Authors】: Ronan Cummins ; Colm O'Riordan
【Abstract】: Retrieval functions in information retrieval (IR) are fundamental to the effectiveness of search systems. However, considerable parameter tuning is often needed to increase the effectiveness of the retrieval. Document length normalisation is one such aspect that requires tuning on a per-query and per-collection basis for many retrieval functions. In this paper, we develop an approach that regularises the level of normalisation to apply on a per-query basis. We formally describe the interaction between query-terms and document length normalisation using a constraint. We then develop a general pre-retrieval approach to adapt a number of state-of-the-art ranking functions so that they adhere to the constraint. Finally, we empirically demonstrate that the adapted retrieval functions outperform default versions of the original retrieval functions, and perform at least comparably to tuned versions of the original functions, on a number of datasets. Essentially this regulates the normalisation parameter in a number of retrieval functions on a per-query basis in a principled manner.
【Keywords】: constraints; retrieval functions
【Paper Link】 【Pages】:2447-2450
【Authors】: Manuel Gomez-Rodriguez ; Monica Rogati
【Abstract】: The online and offline worlds are converging. Location-based services, ubiquitous mobile devices and on-the-go social network accessibility are blurring the distinction between in-person activities and their virtual counterpart. An important effect of this convergence is the rapid and powerful impact of offline events (meetings, conferences) on the evolution and temporal dynamics of the online connectivity between members of social and professional networks. However, these effects have been largely unexplored. We study these effects by using data from LinkedIn, a popular professional social networking site. We find that offline events may induce connectivity changes in the online network -- there is a dramatic increase in the number of connections between event attendees shortly after the date of the event. Building on these insights, we describe a non-supervised method that exploits connectivity changes temporally correlated to real world events to successfully infer more than 40% of specific event attendees. Finally, we revisit the link prediction problem by including user contributed information about off-line events to achieve higher link prediction performance.
【Keywords】: link prediction; real world events; social networks; temporal dynamics
【Paper Link】 【Pages】:2451-2454
【Authors】: Eyal Krikon ; David Carmel ; Oren Kurland
【Abstract】: We present a novel approach to predicting the performance of passage retrieval for question answering. That is, estimating the effectiveness, for answer extraction, of a list of passages retrieved in response to a question when relevance judgments are not available. Our prediction model integrates two types of estimates. The first estimates the probability that the information need expressed by the question is satisfied by the passages. This estimate is devised by adapting query-performance predictors developed for the document retrieval task. The second type estimates the probability that the passages contain the answers. This estimate relies on the occurrences of named entities that are likely to answer the question. Empirical evaluation demonstrates the merits of our prediction approach. For example, the prediction quality is much better than that of the only previous prediction method devised for the task at hand.
【Keywords】: passage retrieval; query performance prediction; question answering
【Paper Link】 【Pages】:2455-2458
【Authors】: Jun Xu ; Ruifeng Xu ; Qin Lu ; Xiaolong Wang
【Abstract】: This paper proposes a novel approach using a coarse-to-fine analysis strategy for sentence-level emotion classification which takes into consideration similarities to sentences in the training set as well as adjacent sentences in the context. First, we use intra-sentence based features to determine the emotion label set of a target sentence coarsely through the statistical information gained from the label sets of the k most similar sentences in the training data. Then, we use the emotion transfer probabilities between neighboring sentences to refine the emotion labels of the target sentences. Such iterative refinements terminate when the emotion classification converges. The proposed algorithm is evaluated on Ren-CECps, a Chinese blog emotion corpus. Experimental results show that the coarse-to-fine emotion classification algorithm improves the sentence-level emotion classification by 19.11% on the average precision metric, which outperforms the baseline methods.
【Keywords】: emotion classification; machine learning; multi-label classification
【Paper Link】 【Pages】:2459-2462
【Authors】: Oren Kurland ; Fiana Raiber ; Anna Shtok
【Abstract】: We show that two tasks which were independently addressed in the information retrieval literature actually amount to the exact same task. The first is query performance prediction; i.e., estimating the effectiveness of a search performed in response to a query in the absence of relevance judgments. The second task is cluster ranking, that is, ranking clusters of similar documents by their presumed effectiveness (i.e., relevance) with respect to the query. Furthermore, we show that several state-of-the-art methods that were independently devised for each of the two tasks are based on the same principles. Finally, we empirically demonstrate that using insights gained in work on query-performance prediction can help, in many cases, to improve the performance of a previously proposed cluster ranking method.
【Keywords】: cluster ranking; query-performance prediction
【Paper Link】 【Pages】:2463-2466
【Authors】: Nattiya Kanhabua ; Kjetil Nørvåg
【Abstract】: Retrieval effectiveness of temporal queries can be improved by taking into account the time dimension. Existing temporal ranking models follow one of two main approaches: 1) a mixture model linearly combining textual similarity and temporal similarity, and 2) a probabilistic model generating a query from the textual and temporal part of document independently. In this paper, we propose a novel time-aware ranking model based on learning-to-rank techniques. We employ two classes of features for learning a ranking model, entity-based and temporal features, which are derived from annotation data. Entity-based features are aimed at capturing the semantic similarity between a query and a document, whereas temporal features measure the temporal similarity. Through extensive experiments we show that our ranking model significantly improves the retrieval effectiveness over existing time-aware ranking models.
【Keywords】: temporal queries; time-aware ranking models
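The first family of existing models mentioned in the abstract above, a mixture that linearly combines textual and temporal similarity, can be illustrated with a minimal sketch. The function name `mixture_score`, the weight `alpha`, and the sample similarity values are assumptions made for illustration; they are not taken from the paper, and the sketch does not reproduce the learning-to-rank model the authors propose.

```python
def mixture_score(text_sim, temporal_sim, alpha=0.5):
    """Linear mixture of textual and temporal similarity.

    alpha is a hypothetical interpolation weight in [0, 1]:
    alpha = 0 ranks by text only, alpha = 1 by time only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * text_sim + alpha * temporal_sim


def rank_documents(candidates, alpha=0.5):
    """Sort (doc_id, text_sim, temporal_sim) triples by mixture score, best first."""
    return sorted(
        candidates,
        key=lambda c: mixture_score(c[1], c[2], alpha),
        reverse=True,
    )


if __name__ == "__main__":
    # Illustrative similarity values, not data from the paper.
    docs = [("d1", 0.9, 0.1), ("d2", 0.6, 0.8), ("d3", 0.3, 0.9)]
    for doc_id, _, _ in rank_documents(docs, alpha=0.5):
        print(doc_id)
```

With a single fixed weight, the trade-off between text and time is the same for every query, which is one reason a learned combination of many features may be attractive.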
【Paper Link】 【Pages】:2467-2470
【Authors】: Yu Cheng ; Kunpeng Zhang ; Yusheng Xie ; Ankit Agrawal ; Alok N. Choudhary
【Abstract】: Most of the existing active learning algorithms assume all the category labels as independent or consider them in a "flat" structure. However, in reality, there are many applications in which the set of possible labels are often organized in a hierarchical structure. In this paper, we consider the problem of active learning when the categories are represented as a tree. Our goal is to exploit the structure information of the label tree in active learning to select the most informative samples to be labeled. We propose an algorithm that estimates the semantic space, embedding the category hierarchy. In this space, each category label is represented as a prototype and the uncertainty is measured using a variance-based fashion. We also demonstrate notable performance improvement with the proposed approach on synthetic and real datasets.
【Keywords】: active learning; hierarchical classification; label tree embedding
【Paper Link】 【Pages】:2471-2474
【Authors】: Zongcheng Ji ; Fei Xu ; Bin Wang ; Ben He
【Abstract】: The major challenge for Question Retrieval (QR) in Community Question Answering (CQA) is the lexical gap between the queried question and the historical questions. This paper proposes a novel Question-Answer Topic Model (QATM) to learn the latent topics aligned across the question-answer pairs to alleviate the lexical gap problem, with the assumption that a question and its paired answer share the same topic distribution. Experiments conducted on a real-world CQA dataset from Yahoo! Answers show that combining both parts properly yields more knowledge than using either part alone or simply mixing the two. They also show that combining our QATM with the state-of-the-art translation-based language model, where the topic and translation information are learned from the question-answer pairs at two different semantic granularities, can significantly improve QR performance.
【Keywords】: community question answering; question retrieval; question-answer topic model; topic model; translation model
【Paper Link】 【Pages】:2475-2478
【Authors】: Harumi Murakami ; Yuki Miyake
【Abstract】: This research investigates how humans distinguish different people with identical names on the web to improve web people search. We asked subjects to classify 20 pages of web people-search results for each of 20 person names and analyzed their decision processes through questionnaire, protocol analysis, and interview. We found that keywords, vocations, works (for a real person, works are those made by the individual and, for a fictional person, works are those in which the individual appears), facial images, and the names of related people are important for distinguishing individuals. We proposed a model for distinguishing individuals and a knowledge-structure model based on the experiment's results.
【Keywords】: distinguishing individual model; knowledge-structure model; person name disambiguation; web people search
【Paper Link】 【Pages】:2479-2482
【Authors】: Bo Long ; Jiang Bian ; Anlei Dong ; Yi Chang
【Abstract】: With the rapid growth of E-Commerce on the Internet, online product search service has emerged as a popular and effective paradigm for customers to find desired products and select transactions. Most product search engines today are based on adaptations of relevance models devised for information retrieval. However, there is still a big gap between the mechanism of finding products that customers really desire to purchase and that of retrieving products of high relevance to customers' query. In this paper, we address this problem by proposing a new ranking framework for enhancing product search based on dynamic best-selling prediction in E-Commerce. Specifically, we first develop an effective algorithm to predict the dynamic best-selling, i.e. the volume of sales, for each product item based on its transaction history. By incorporating such best-selling prediction with relevance, we propose a new ranking model for product search, in which we rank higher the product items that are not only relevant to the customer's need but with higher probability to be purchased by the customer. Results of a large scale evaluation, conducted over the dataset from a commercial product search engine, demonstrate that our new ranking method is more effective for locating those product items that customers really desire to buy at higher rank positions without hurting the search relevance.
【Keywords】: best selling prediction; e-commerce; product search; transaction history
【Paper Link】 【Pages】:2483-2486
【Authors】: Gianni Amati ; Giuseppe Amodeo ; Carlo Gaibisso
【Abstract】: Freshness of information in real-time search is central in social networks, news, blogs and micro-blogs. Nevertheless, there is no clear experimental evidence showing which principled approach effectively combines time and content. We introduce a novel approach to model freshness using a survival analysis of relevance over time. In such models, freshness is measured by the tail probability of relevance over time. We also assume that the probability distributions for freshness are heavy-tailed. The heavy-tailed models of freshness are shown to be highly effective on the micro-blogging test collection of TREC 2011. The improvements over the state-of-the-art time-based models are statistically significant or moderately significant.
【Keywords】: blog search; combining searches
【Paper Link】 【Pages】:2487-2490
【Authors】: Ruey-Cheng Chen ; Chia-Jung Lee ; Chiung-Min Tsai ; Jieh Hsiang
【Abstract】: We develop a new static index pruning criterion based on the notion of information preservation. This idea is motivated by the fact that model degeneration, as does static index pruning, inevitably reduces the predictive power of the resulting model. We model this loss in predictive power using conditional entropy and show that the decision in static index pruning can therefore be optimized to preserve information as much as possible. We evaluated the proposed approach on three different test corpora, and the result shows that our approach is comparable in retrieval performance to state-of-the-art methods. When efficiency is of concern, our method has some advantages over the reference methods and is therefore suggested in Web retrieval settings.
【Keywords】: conditional entropy; index pruning; information retrieval
【Paper Link】 【Pages】:2491-2494
【Authors】: Jaeho Choi ; W. Bruce Croft
【Abstract】: Time information impacts relevance in retrieval for queries that are sensitive to trends and events. Microblog services focus particularly on recent news and events, so dealing with the temporal aspects of microblogs is essential for providing effective retrieval. Recent work on time-based retrieval has shown that selecting the relevant time period for query expansion is promising. In this paper, we suggest a method for selecting the time period for query expansion based on a user behavior (i.e., retweets) that can be collected easily. We then use these time periods for query expansion in a pseudo-relevance feedback setting. More specifically, we use the difference in the temporal distribution between the top retrieved documents and retweets. The experimental results based on the TREC Microblog collection show that our method for selecting periods for query expansion improves retrieval performance compared to another approach.
【Keywords】: microblogs; query expansion; time-based model
【Paper Link】 【Pages】:2495-2498
【Authors】: Prakhar Biyani ; Cornelia Caragea ; Amit Singh ; Prasenjit Mitra
【Abstract】: Online forums have become a popular source of information due to the unique nature of the information they contain. Internet users use these forums to get opinions of other people on issues and to find factual answers to specific questions. Topics discussed in online forum threads can be subjective, seeking personal opinions, or non-subjective, seeking factual information. Hence, knowing the subjectivity orientation of threads would help forum search engines to satisfy users' information needs more effectively by matching the subjectivities of the user's query and the topics discussed in the threads, in addition to the lexical match between the two. We study methods to analyze the subjectivity of online forum threads. Experimental results on a popular online forum demonstrate the effectiveness of our methods.
【Keywords】: online forums; subjectivity
【Paper Link】 【Pages】:2499-2502
【Authors】: Yllias Chali ; Sadid A. Hasan ; Kaisar Imam
【Abstract】: This paper addresses the task of answering complex questions using a multi-document summarization approach within a reinforcement learning setting. Given a set of complex questions, a list of relevant documents per question, and the corresponding human-generated summaries (i.e. answers to the questions) as training data, the reinforcement learning module iteratively learns a number of feature weights in order to facilitate the automatic generation of summaries i.e. answers to unseen complex questions. Previous works on this task have utilized a fully automatic reinforcement learning framework that selects the document sentences as the potential candidate (i.e. machine-generated) summary sentences by exploiting a relatedness measure with the available human-written summaries. In this paper, we propose an extension to this model that incorporates user interaction into the reinforcement learner to guide the candidate summary sentence selection process. Experimental results reveal the effectiveness of the user interaction component in the reinforcement learning framework.
【Keywords】: complex question answering; multi-document summarization; reinforcement learning; user interaction
【Paper Link】 【Pages】:2503-2506
【Authors】: Qing Zhang ; Jianwu Li ; Zhiping Zhang ; Li Wang
【Abstract】: Recommending related scientific articles to a researcher is very important and useful in practice, but it is also full of challenges due to the latent complex semantic relations among scientific articles. To deal with these challenges, this paper proposes a novel framework with link-missing data adaptation, which casts the recommendation task as subspace embedding and similarity ranking problems. The relation regularized subspace in this framework is constructed via Relation Regularized Matrix Factorization (RRMF) to model both content and link structure simultaneously. However, the link structure for an article is not always available in practical recommendation. To solve this problem, we further propose two alternative approaches based on Latent Dirichlet Allocation (LDA) for recommending link-missing articles as an extension of RRMF. Experiments on the CiteSeer dataset demonstrate that our method is more effective than some state-of-the-art approaches and is able to handle the link-missing case, which link-based methods cannot handle.
【Keywords】: latent dirichlet allocation; link-missing data; recommendation; regularized matrix factorization; related scientific articles
【Paper Link】 【Pages】:2507-2510
【Authors】: Fiana Raiber ; Oren Kurland
【Abstract】: We present a study of the cluster hypothesis, and of the performance of cluster-based retrieval methods, performed over large scale Web collections. Among the findings we present are (i) the cluster hypothesis can hold, as determined by a specific test, for large scale Web corpora to the same extent it does for newswire corpora; (ii) while spam documents do not affect the extent to which the cluster hypothesis holds, they considerably affect the performance of cluster based, as well as that of document-based, retrieval methods; and, (iii) as is the case for newswire corpora, cluster-based methods can yield better performance than document-based methods for Web corpora.
【Keywords】: cluster hypothesis; cluster-based retrieval
【Paper Link】 【Pages】:2511-2514
【Authors】: Shize Xu ; Liang Kong ; Yan Zhang
【Abstract】: Manual timelines have greatly helped us to keep pace with the big world. In this paper, we introduce a novel solution which generates image-text timelines for news events based on Evolutionary Image-Text Summarization, which is an important and challenging problem. We first extract image's semantic information under translation model, and then fuse the high quality images with text timeline under an image assignment algorithm which can optimize the global coordination of the final timeline. The experimental results show that news readers can receive more satisfaction from the image-text timelines we generate.
【Keywords】: cross-modality; image-text; summarization; timeline
【Paper Link】 【Pages】:2515-2518
【Authors】: Muhammad Atif Qureshi ; Colm O'Riordan ; Gabriella Pasi
【Abstract】: Finding domain specific key terms/phrases from a given set of documents is a challenging task. A domain may be defined as an area of interest over a collection of documents which may not be explicitly defined but implicitly observable in those documents. When considering a collection of documents related to academic research, examples of key terms/phrases may be "Information Retrieval", "Marine Biology", etc. In this paper a technique for extracting important key terms/phrases in a considered topical domain is proposed using external evidence from the titles of Wikipedia articles and the Wikipedia category graph. We performed some experiments over the document collection of Web sites of different post-graduate schools. Our preliminary evaluations show promising results for the detection of domain specific key terms/phrases from the given set of domain focused Web pages.
【Keywords】: community detection; n-gram model; open-domain knowledge; wikipedia
【Paper Link】 【Pages】:2519-2522
【Authors】: Shuzi Niu ; Yanyan Lan ; Jiafeng Guo ; Xueqi Cheng
【Abstract】: This paper is concerned with the top-k ranking problem, which reflects the fact that people pay more attention to the top-ranked objects in real ranking applications such as information retrieval. A popular approach to the top-k ranking problem is based on probabilistic models, such as the Luce model and the Mallows model. However, whether the sequential generative process described in these models is a suitable way for top-k ranking remains a question. According to the riffled independence factorization proposed in recent literature, which is a natural structural assumption on top-k ranking, we propose a new generative process of top-k ranking data. Our approach decomposes distributions over the top-k ranking into two layers: the first layer describes the relative ordering between the top k objects and the remaining n-k objects, and the second layer describes the full ordering on the top k objects. On this basis, we propose a new probabilistic model for the top-k ranking problem, called the hierarchical ordering model. Specifically, we use three different probabilistic models to describe different generative processes of the first layer, and the Luce model to describe the sequential generative process of the second layer; thus we obtain three different specific hierarchical ordering models. We also conduct extensive experiments on benchmark datasets to show that our proposed models can outperform previous models significantly.
【Keywords】: learning to rank; ranking model; top-k
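The sequential generative process of the Luce model referred to in the abstract above can be sketched as follows: objects are drawn one at a time, each with probability proportional to the exponential of its score among the objects not yet chosen. This is the standard Luce (Plackett-Luce) top-k probability, not the hierarchical ordering model proposed in the paper; the function name, the exponential parameterisation, and the scores are assumptions for illustration.

```python
import math


def luce_top_k_probability(scores, top_k_order):
    """Probability that `top_k_order` is the ordered top-k prefix under a Luce model.

    scores: dict mapping each object to a real-valued score (hypothetical values).
    top_k_order: list of k distinct objects, best first.
    """
    remaining = set(scores)
    prob = 1.0
    for item in top_k_order:
        # Normalise over the objects that have not been placed yet.
        denom = sum(math.exp(scores[j]) for j in remaining)
        prob *= math.exp(scores[item]) / denom
        remaining.remove(item)
    return prob


if __name__ == "__main__":
    scores = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
    print(luce_top_k_probability(scores, ["a", "b"]))
```

Because each step only normalises over the remaining objects, the probability of a top-k prefix does not depend on how the other n-k objects are ordered among themselves.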
【Paper Link】 【Pages】:2523-2526
【Authors】: Adam Jatowt ; Katsumi Tanaka
【Abstract】: Recently many historical texts have become digitized and made accessible for search and browsing. As human language is subject to constant evolution, these texts pose varying challenges to current users. In this paper we report the results of large-scale studies on the usage of words and the evolution of English language vocabulary over the last two centuries to help with understanding its impact on readability and retrieval of historical documents. We perform analysis of several lexical factors which may influence accessibility and readability of historical texts based on two large scale lexical corpora: the Corpus of Historical American English and Google Books 1-gram.
【Keywords】: historical texts; information retrieval; language evolution; readability
【Paper Link】 【Pages】:2527-2530
【Authors】: Alexandros Karatzoglou ; Linas Baltrunas ; Karen Church ; Matthias Böhmer
【Abstract】: The explosive growth of the mobile application (app) market has made it difficult for users to find the most interesting and relevant apps from the hundreds of thousands that exist today. Context is key in the mobile space and so too are proactive services that ease user input and facilitate effective interaction. We believe that to enable truly novel mobile app recommendation and discovery, we need to support real context-aware recommendation that utilizes the diverse range of implicit mobile data available in a fast and scalable manner. In this paper we introduce the Djinn model, a novel context-aware collaborative filtering algorithm for implicit feedback data that is based on tensor factorization. We evaluate our approach using a dataset from an Android mobile app recommendation service called appazaar. Our results show that our approach compares favorably with state-of-the-art collaborative filtering methods.
【Keywords】: collaborative filtering; implicit feedback; mobile apps; mobile recommendation; tensor factorization; context
【Paper Link】 【Pages】:2531-2534
【Authors】: Subhabrata Mukherjee ; Akshat Malu ; Balamurali A. R. ; Pushpak Bhattacharyya
【Abstract】: In this paper, we present TwiSent, a sentiment analysis system for Twitter. Based on the topic searched, TwiSent collects tweets pertaining to it and categorizes them into the different polarity classes positive, negative and objective. However, analyzing micro-blog posts has many inherent challenges compared to other text genres. Through TwiSent, we address the problems of 1) spam pertaining to sentiment analysis in Twitter, 2) structural anomalies in the text in the form of incorrect spellings, nonstandard abbreviations, slang, etc., 3) entity specificity in the context of the topic searched and 4) pragmatics embedded in text. The system performance is evaluated on manually annotated gold standard data and on an automatically annotated tweet set based on hashtags. It is a common practice to show the efficacy of a supervised system on an automatically annotated dataset. However, we show that such a system achieves lower classification accuracy when tested on a generic Twitter dataset. We also show that our system performs much better than an existing system.
【Keywords】: entity specific twitter sentiment; micro blogs; sentiment analysis; spam; twitter
【Paper Link】 【Pages】:2535-2538
【Authors】: Dehong Gao ; Renxian Zhang ; Wenjie Li ; Yuexian Hou
【Abstract】: Twitter, the most famous micro-blogging service and online social network, collects millions of tweets every day. Due to the length limitation, users usually need to explore other ways to enrich the content of their tweets. Some studies have provided findings to suggest that users can benefit from added hyperlinks in tweets. In this paper, we focus on the hyperlinks in Twitter and propose a new application, called hyperlink recommendation in Twitter. We expect that the recommended hyperlinks can be used to enrich the information of user tweets. A three-way tensor is used to model the user-tweet-hyperlink collaborative relations. Two tensor-based clustering approaches, tensor decomposition-based clustering (TDC) and tensor approximation-based clustering (TAC), are developed to group the users, tweets and hyperlinks with similar interests or similar contexts. Recommendation is then made based on the reconstructed tensor using cluster information. The evaluation results in terms of Mean Absolute Error (MAE) show the advantages of both the TDC and TAC approaches over a baseline recommendation approach, i.e., memory-based collaborative filtering. Comparatively, the TAC approach achieves better performance than the TDC approach.
【Keywords】: three-way clustering; twitter hyperlink recommendation
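The evaluation measure named in the abstract above, Mean Absolute Error (MAE), is the average absolute difference between predicted and observed values. A minimal sketch follows; the function name and the sample values are assumptions for illustration and are not results from the paper.

```python
def mean_absolute_error(predicted, actual):
    """Average of |predicted_i - actual_i| over paired observations."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one observation is required")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


if __name__ == "__main__":
    # Hypothetical reconstructed-tensor entries versus held-out ground truth.
    predicted = [1.0, 0.0, 0.5]
    actual = [1.0, 1.0, 0.0]
    print(mean_absolute_error(predicted, actual))
```

Lower MAE indicates predictions closer to the observed values, so a recommendation approach with a smaller MAE than the baseline is the better one under this measure.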
【Paper Link】 【Pages】:2539-2542
【Authors】: Stéphane Clinchant
【Abstract】: We study the impact of concavity in IR models and propose to use a generalized logarithm function, the n-logarithm, to weight words in documents. We extend the family of information-based Information Retrieval (IR) models with this function. We show that concavity is indeed an important property of IR models. Experiments conducted for IR tasks, Latent Semantic Indexing and Text Categorization show improvements.
【Keywords】: IR models; concavity; information models
【Paper Link】 【Pages】:2543-2546
【Authors】: Ilaria Bordino ; Debora Donato ; Barbara Poblete
【Abstract】: Toolbar navigation logs provide rich data for enhancing information discovery on the Web. The value of this data resides in its scope, which goes beyond that of traditional query-mining data sources, such as search-engine logs. In this paper we present a methodology for extracting relevant association rules for queries, based on historic user navigational data. In addition, we propose a graph-based approach for extracting related queries and URLs for a given query.
【Keywords】: association rules; toolbar data; web data mining
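The association rules mentioned in the abstract above are commonly scored by support and confidence. The sketch below computes both for a rule "antecedent implies consequent" over a list of navigation sessions, each treated as a set of items. Treating a session as an item set, the function name, and the sample sessions are assumptions for illustration; the paper's actual rule extraction methodology is not reproduced here.

```python
def rule_support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    transactions: iterable of item collections (e.g. queries and URLs in a session).
    support: fraction of transactions containing antecedent and consequent together.
    confidence: fraction of antecedent-containing transactions that also contain
    the consequent.
    """
    transactions = [set(t) for t in transactions]
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions) if transactions else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


if __name__ == "__main__":
    # Hypothetical sessions: one query plus the URLs visited afterwards.
    sessions = [
        {"q:jaguar", "url:cars.example"},
        {"q:jaguar", "url:zoo.example"},
        {"q:jaguar", "url:cars.example", "url:zoo.example"},
        {"q:puma", "url:cars.example"},
    ]
    print(rule_support_confidence(sessions, {"q:jaguar"}, {"url:cars.example"}))
```

Rules with both high support and high confidence are the usual candidates for suggesting related URLs or queries for a given query.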
【Paper Link】 【Pages】:2547-2550
【Authors】: Alexander Kolesnikov ; Yury Logachev ; Valeriy Topinskiy
【Abstract】: Predicting the CTR of ads on the search result page is an urgent topic, because choosing the right advertisement greatly affects the revenue of the search engine and advertisers as well as users' satisfaction. For ads with a large click history it is quite clear how to predict CTR by utilizing statistical data. But for new ads with a poor click history such an approach is not robust or reliable. We suggest a model for predicting the CTR of such new ads. Contrary to previous models for predicting the CTR of new ads, our model uses events (clicks and skips) instead of the observed CTR. In addition, we have implemented several novel features that resulted in an increase in the performance of our model. Offline and online experiments on a real search engine system demonstrated that our model outperforms the baseline and the approaches suggested in previous papers.
【Keywords】: CPC; CTR; click-through rate; paid search; sponsored search; web advertising
【Paper Link】 【Pages】:2551-2554
【Authors】: Richard McCreadie ; Craig Macdonald ; Iadh Ounis ; Jim Giles ; Ferris Jabr
【Abstract】: On the Web, content farms produce articles engineered such that search engines rank them highly, in order to turn a profit from online advertising. Recently, content farms have increasingly been the target of demotion strategies by Web search engines, since content farm articles are often considered to be of suspect quality. In this paper, we study the prevalence of content farms in the results returned by three major Web search engines over time. In particular, we develop a crowdsourced approach to identify content farm articles from the results returned by these search engines. Our results show that between the period of March and August 2011, the number of content farm articles observed on a number of indicative queries was reduced by up to 55% in the top ranks.
【Keywords】: content farms; crowdsourcing; web search
【Paper Link】 【Pages】:2555-2558
【Authors】: Eugene Kharitonov ; Pavel Serdyukov
【Abstract】: In this paper we study the usefulness of a user's demographic context for improving the ranking of ambiguous queries. A context-aware relevance model is learnt from implicit user behaviour by using a simple yet general modification of a state-of-the-art click model that is capable of capturing dependences on the search context. The machine-learned click model is then used in an offline re-ranking experiment, which demonstrates that the demographic context ranking features provide improvements in ranking quality. Further, we perform a study to investigate the impact of different facets of demographic features (gender, age, and income) on search ranking performance and manually analyse queries which exhibit strong context dependences to gain additional understanding of the model behaviour.
【Keywords】: click models; context-aware ranking; personalization; web search
【Paper Link】 【Pages】:2559-2562
【Authors】: Craig Macdonald ; Rodrygo L. T. Santos ; Iadh Ounis
【Abstract】: Learning to rank studies have mostly focused on query-dependent and query-independent document features, which enable the learning of ranking models of increased effectiveness. Modern learning to rank techniques based on regression trees can support query features, which are document-independent, and hence have the same values for all documents being ranked for a query. In doing so, such techniques are able to learn sub-trees that are specific to certain types of query. However, it is unclear which classes of features are useful for learning to rank, as previous studies leveraged anonymised features. In this work, we examine the usefulness of four classes of query features, based on topic classification, the history of the query in a query log, the predicted performance of the query, and the presence of concepts such as persons and organisations in the query. Through experiments on the ClueWeb09 collection, our results using a state-of-the-art learning to rank technique based on regression trees show that all four classes of query features can significantly improve upon an effective learned model that does not use any query feature.
【Keywords】: learning to rank; query features
【Paper Link】 【Pages】:2563-2566
【Authors】: Andrey Kustarev ; Yury Ustinovsky ; Anna Mazur ; Pavel Serdyukov
【Abstract】: Search sessions are known to be a rich source of diverse valuable information for individual query analysis. In this paper, we address the problem of query performance prediction by utilizing the entire logical search sessions containing the given query. Guided by intuitions based on observations made after analysing the properties of search sessions and the performance of the queries they contain, we propose a number of features that significantly advance the existing query performance prediction models. Some of them specifically make it possible to focus on tail queries with sparse click-through statistics.
【Keywords】: query flow graph; query performance prediction; user sessions
【Paper Link】 【Pages】:2567-2570
【Abstract】: Most of the current recommender systems heavily rely on explicit user feedback such as ratings on items to model users' interests. However, in many applications, it is very hard to collect the explicit feedback, while implicit feedback such as user clicks may be more available. Furthermore, it is often more suitable for many recommender systems to address a ranking problem than a rating predicting problem. This paper proposes a latent pairwise preference learning (LPPL) approach for recommendation with implicit feedback. LPPL directly models user preferences with respect to a set of items rather than the rating scores on individual items, which are modeled with a set of features by analyzing clickthrough data available in many real-world recommender systems. The LPPL approach models both the latent variables of group structure of users and the pairwise preferences simultaneously. We conduct experiments on the testbed from a real-world recommender system and demonstrate that the proposed approach can effectively improve the recommendation performance against several baseline algorithms.
【Keywords】: implicit feedback; pairwise preferences; recommender systems
【Paper Link】 【Pages】:2571-2574
【Authors】: Reede Ren ; John P. Collomosse ; Joemon M. Jose
【Abstract】: This paper improves the spatial pyramid kernel (SPK) and proposes a relevance learning approach to compare performers' poses in a large dance archive, the NRCD collection. Domain knowledge of Choreutics is exploited to define pose topics and a selection operator is developed for pose topic matching. The visual structure descriptor of self similarity (SSF) is extended to hierarchical self similarity (HSSF) to keep shape context. The framework of Bag-of-Visual Words (BOVW) is applied to encode as well as to speed up the matching on pose topics/topic combinations. This alleviates the complexity in limb allocation which is infeasible in our data. Extensive experiments show that the new approach outperforms the original SPK in both precision and robustness.
【Keywords】: dance retrieval; pose relevance learning; spatial pyramid kernel
【Paper Link】 【Pages】:2575-2578
【Authors】: Christopher Wienberg ; Andrew S. Gordon
【Abstract】: An effective means of retrieving relevant photographs from the web is to search for terms that would likely appear in the surrounding text in multimedia documents. In this paper, we investigate the complementary search strategy, where relevant multimedia documents are retrieved using the photographs they contain. We concentrate our efforts on the retrieval of large numbers of personal stories posted to Internet weblogs that are relevant to a particular search topic. Photographs are often included in posts of this sort, typically taken by the author during the course of the narrated events of the story. We describe a new story search tool, PhotoFall, which allows users to quickly find stories related to their topic of interest by judging the relevance of the photographs extracted from top search results. We evaluate the accuracy of relevance judgments made using this interface, and discuss the implications of the results for improving topic-based searches of multimedia content.
【Keywords】: photographs; storytelling; weblogs
【Paper Link】 【Pages】:2579-2582
【Authors】: Amin Y. Teymorian ; Xiao Qin ; Ophir Frieder
【Abstract】: Selective query forwarding is a promising technique to help scale high-quality and cost-efficient query evaluation in distributed search systems. The basic idea is simple. After a local site receives a query, it determines non-local sites to forward the query to and returns an aggregation of local and non-local results. We introduce "RESQ", a hybrid rank-energy selective query forwarding model. The novel contribution of RESQ is to simultaneously consider both ranking quality and energy costs when making forwarding decisions. Using a large-scale query log and publicly-available energy price time series, we demonstrate the ability of RESQ forwarding to achieve favorable tradeoffs between the possibility of returning high ranking query results and savings in temporally- and spatially-varying energy prices.
【Keywords】: distributed IR; energy; linear program; query forwarding
【Paper Link】 【Pages】:2583-2586
【Authors】: Gabriella Kazai ; Jaap Kamps ; Natasa Milic-Frayling
【Abstract】: Information retrieval systems require human contributed relevance labels for their training and evaluation. Increasingly such labels are collected under the anonymous, uncontrolled conditions of crowdsourcing, leading to varied output quality. While a range of quality assurance and control techniques have now been developed to reduce noise during or after task completion, little is known about the workers themselves and possible relationships between workers' characteristics and the quality of their work. In this paper, we ask what the relatively well- or poorly-performing crowds, working under specific task conditions, actually look like in terms of worker characteristics, such as demographics or personality traits. Our findings show that the face of a crowd is in fact indicative of the quality of their work.
【Keywords】: crowdsourcing; demographics; personality traits; worker accuracy
【Paper Link】 【Pages】:2587-2590
【Authors】: Pawel Dybala ; Rafal Rzepka ; Kenji Araki ; Kohichi Sayama
【Abstract】: In this paper we propose a method of filtering excessive amounts of textual data acquired from the Internet. In our research on pun generation in Japanese we experienced problems with extensively long data processing time, caused by the amount of phonetic candidates generated (i.e. phrases that can be used to generate actual puns) by our system. A simple, naive approach, in which we take into consideration only phrases with the highest occurrence on the Internet, can result in the deletion of candidates that are actually usable. Thus, we propose a data filtering method in which we compare two Internet-based rankings: a co-occurrence ranking and a hit rate ranking, and select only candidates which occupy the same or similar positions in these rankings. In this work we analyze the effects of such data reduction, considering five cases: when the candidates are on exactly the same positions in both rankings, and when their positions differ by 1, 2, 3 and 4. The analysis is conducted on data acquired by comparing pun candidates generated by the system (and filtered with our method) with phrases that were actually used in puns created by humans. The results show that the proposed method can be used to filter excessive amounts of textual data acquired from the Internet.
【Keywords】: AI; HCI; NLP; humor processing; web-based data extraction
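The rank-agreement filter described in the abstract above can be illustrated with a minimal Python sketch. It assumes each ranking is a plain list of candidate phrases ordered from best to worst; the function name `filter_by_rank_agreement` and the `max_diff` parameter are hypothetical and not taken from the paper.

```python
def filter_by_rank_agreement(cooccurrence_ranking, hit_rate_ranking, max_diff=0):
    """Keep candidates whose positions in the two rankings differ by at most max_diff.

    Both arguments are lists of candidates ordered from best to worst.
    Candidates missing from either ranking are dropped.
    """
    # Position of every candidate in the hit rate ranking.
    hit_position = {cand: pos for pos, cand in enumerate(hit_rate_ranking)}
    kept = []
    for pos, cand in enumerate(cooccurrence_ranking):
        other = hit_position.get(cand)
        if other is not None and abs(pos - other) <= max_diff:
            kept.append(cand)
    return kept


cooccurrence = ["a", "b", "c", "d"]
hit_rate = ["a", "c", "b", "d"]
exact = filter_by_rank_agreement(cooccurrence, hit_rate, max_diff=0)
near = filter_by_rank_agreement(cooccurrence, hit_rate, max_diff=1)
```

Setting `max_diff` to 0, 1, 2, 3 or 4 would correspond to the five cases the abstract enumerates.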
【Paper Link】 【Pages】:2591-2594
【Authors】: Changsung Kang ; Jeehaeng Lee ; Yi Chang
【Abstract】: We consider the problem of identifying primary categories of a business listing among the categories provided by the owner of the business. The category information submitted by business owners cannot be trusted with absolute certainty since they may purposefully add some secondary or irrelevant categories to increase recall in local search results, which makes category search very challenging for local search engines. Thus, identifying primary categories of a business is a crucial problem in local search. This problem can be cast as a multi-label classification problem with a large number of categories. However, the large scale of the problem makes it infeasible to use conventional supervised-learning-based text categorization approaches. We propose a large-scale classification framework that leverages multiple types of classification labels to produce a highly accurate classifier with fast training time. We effectively combine the complementary label sources to refine prediction. The experimental results indicate that our framework achieves very high precision and recall and outperforms a Centroid-based method.
【Keywords】: primary category; text categorization; vertical search
【Paper Link】 【Pages】:2595-2598
【Authors】: Zhen Yue ; Jiepu Jiang ; Shuguang Han ; Daqing He
【Abstract】: This paper presents a user study aiming to investigate query reformulation in collaborative Web search. Seven pairs of participants were recruited and each pair worked as a team on two collaborative exploratory Web search tasks. Through the log analysis, we compared possible sources for participants to draw query terms from. The results show that both search and collaborative actions are possible resources for new query terms. Traditional resources for query expansion such as previous search histories and relevant documents are still important resources for new query terms. The content in chat and workspace generated by participants themselves seems more likely to be the resource for new query terms than that of their partners. Task types also affect the influences on query reformulations. For the academic task, previously saved relevant documents are the most important resources for new query terms while chat histories are the most important resources for the leisure task.
【Keywords】: collaborative web search; query reformulation
【Paper Link】 【Pages】:2599-2602
【Authors】: Lei Guo ; Jun Ma ; Zhumin Chen ; Haoran Jiang
【Abstract】: Recommender systems with social networks (RSSN) have been well studied in recent works. However, these methods ignore the relationships among items, which may affect the quality of recommendations. Motivated by the observation that related items often have similar ratings, we propose a framework integrating items' relations, users' social graph and user-item rating matrix for recommendation. Experimental results show that our approach performs better than the state-of-the-art algorithm and the method with only users' social graph ensemble in terms of MAP and RMSE.
【Keywords】: item relation; matrix factorization; recommender systems; social network; social recommendation
【Paper Link】 【Pages】:2603-2606
【Authors】: Sumit Bhatia ; Bin He ; Qi He ; W. Scott Spangler
【Abstract】: Even though queries received by traditional information retrieval systems are quite short, there are many application scenarios where long natural language queries are more effective. Further, incorporating term position information can help improve results of long queries. However, the techniques for incorporating term position information have been developed for terse queries and hence cannot be directly applied to long queries. Though there exist some methods for performing proximal search for long queries, they are not scalable due to long query response times. We describe an intuitive and simple, yet effective technique that implicitly incorporates term position information for long queries in a scalable manner. Our proposed approach achieves more than 700% faster query response times while maintaining the quality of retrieved results when compared with a state-of-the-art method for performing proximal search for very long queries.
【Keywords】: long queries; patent search; prior art search; proximal search; term proximity; verbose queries
【Paper Link】 【Pages】:2607-2610
【Authors】: Adam Jatowt ; Katsumi Tanaka
【Abstract】: Readability is one of the key factors determining document quality and reader's satisfaction. In this paper we analyze the readability of Wikipedia, which is a popular source of information for searchers about unknown topics. Although Wikipedia articles are frequently listed by search engines on top ranks, they are often too difficult for average readers searching for information about difficult queries. We examine the average readability of content in Wikipedia and compare it to that of Simple Wikipedia and Britannica. Next, we investigate the readability of selected categories in Wikipedia. Apart from standard readability measures we use some new metrics based on words' popularity and their distributions across different document genres and topics.
【Keywords】: readability; web content analysis; web search
【Paper Link】 【Pages】:2611-2614
【Authors】: Young-joo Chung
【Abstract】: Rakuten recipe is a recipe site where users can submit their recipes and share them with others. Since recipe contents are generated by users, they usually contain many misspellings, abbreviations, synonyms, hypernyms and hyponyms. Identifying and normalizing these words is essential to retrieve recipes relevant to a user's request. In this paper, we introduce a new approach to finding related words in a recipe domain using the data structure. Based on the observation that people usually write the main ingredient in the first position of the ingredient list of each recipe and that such an ingredient is strongly related to the categories where recipes belong, we calculate relation scores of word pairs using real service data, which contains 790 categories and 405,519 recipes. The experimental result showed that we successfully found semantically related word pairs with an f-score of 0.93.
【Keywords】: food entity relation; recipe; search effectiveness
【Paper Link】 【Pages】:2615-2618
【Authors】: Bo Lu ; Ye Yuan ; Guoren Wang
【Abstract】: Tag-based social image search predominantly focuses on using user-annotated tags to find the results of a user query. However, the performance of tag-based social image search is usually unable to satisfy the needs of users. In this paper, we propose a novel framework based on Social Relationship Graph for Social Image Search (SRGSIS), which involves two stages. In the first stage, we use heterogeneous data from multiple modalities to build a social relationship graph. Then, for the given query keywords, we execute an efficient keyword search algorithm over the social relationship graph and obtain top-k candidate results based on relevance score. We model these results as the answer trees connecting keyword nodes that match keywords in the query. In the second stage, for refining the candidate results, each image in the social relationship graph is represented as a region adjacency graph by using the visual content of the image. We further model these region adjacency graphs as a closure tree and compute approximate graph similarity between the candidate results and the closure tree to obtain more desirable results. Extensive experimental results demonstrate the effectiveness of the proposed approach.
【Keywords】: closure-tree; keyword search; multimodality; social relationship graph
【Paper Link】 【Pages】:2619-2622
【Authors】: Xun Wang ; Lei Wang ; Jiwei Li ; Sujian Li
【Abstract】: Summarization and Keyword Selection are two important tasks in the NLP community. Although both aim to summarize the source articles, they are usually treated separately by using sentences or words. In this paper, we propose a two-level graph based ranking algorithm to generate summarization and extract keywords at the same time. Previous works have reached a consensus that an important sentence is composed of important keywords. In this paper, we further study the mutual impact between them through context analysis. We use Wikipedia to build a two-level concept-based graph, instead of a traditional term-based graph, to express their homogeneous relationship and heterogeneous relationship. We run PageRank and HITS rank on the graph to adjust both homogeneous and heterogeneous relationships. A more reasonable relatedness value is thus obtained for key sentence selection and keyword selection. We evaluate our algorithm on the TAC 2011 data set. The traditional term-based approach achieves a score of 0.255 in ROUGE-1 and a score of 0.037 in ROUGE-2, and our approach improves them to 0.323 and 0.048, respectively.
【Keywords】: graph; keyword; markov chain; summarization
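For readers unfamiliar with the PageRank step mentioned in the abstract above, the following is a generic power-iteration sketch in Python. It is not the authors' two-level algorithm: the adjacency-dictionary input, the function name `pagerank`, and the default damping factor of 0.85 are assumptions made for illustration only.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    graph maps each node to a list of nodes it links to; every linked
    node must also appear as a key. Nodes without outgoing links spread
    their rank uniformly over all nodes.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank mass held by nodes that have no outgoing links.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in graph[v]:
                new_rank[w] += damping * rank[v] / len(graph[v])
        rank = new_rank
    return rank


# Hypothetical toy graph: two sentences that both point at one keyword.
scores = pagerank({"s1": ["k"], "s2": ["k"], "k": []})
```

In this toy graph the node `k` receives the highest score because both other nodes link to it.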
【Paper Link】 【Pages】:2623-2626
【Authors】: Nattiya Kanhabua ; Kjetil Nørvåg
【Abstract】: News prediction retrieval has recently emerged as the task of retrieving predictions related to a given news story (or a query). Predictions are defined as sentences containing time references to future events. Such future-related information is crucially important for understanding the temporal development of news stories, as well as for strategy planning and risk management. The aforementioned work has been shown to retrieve a significant number of relevant predictions. However, only certain news topics achieve good retrieval effectiveness. In this paper, we study how to determine the difficulty in retrieving predictions for a given news story. More precisely, we address the query difficulty estimation problem for news prediction retrieval. We propose different entity-based predictors used for classifying queries into two classes, namely, Easy and Difficult. Our prediction model is based on a machine learning approach. Through experiments on real-world data, we show that our proposed approach can predict query difficulty with high accuracy.
【Keywords】: future events; news predictions; query difficulty estimation; relevance ranking
【Paper Link】 【Pages】:2627-2630
【Authors】: Maxim Zhukovskiy ; Dmitry Vinogradov ; Gleb Gusev ; Pavel Serdyukov ; Andrei M. Raigorodskii
【Abstract】: Traditional link-based web ranking algorithms run on a single web snapshot without regard to the dynamics of web pages and links. In particular, the correlation between web pages' freshness and their classic PageRank is negative (see [11]). For this reason, in recent years a number of authors have introduced algorithms for PageRank actualization. We introduce our new algorithm called Actual PageRank, which generalizes some previous approaches and therefore provides better capability for capturing the dynamics of the Web. To the best of our knowledge we are the first to conduct ranking evaluations of a fresh-aware variation of PageRank on a large data set. The results demonstrate that our method achieves more relevant and fresh results than both classic PageRank and its "fresh" modifications.
【Keywords】: freshness; link-based ranking; pagerank; web search
【Paper Link】 【Pages】:2631-2634
【Authors】: Ke Zhou ; Ronan Cummins ; Mounia Lalmas ; Joemon M. Jose
【Abstract】: The aggregation of search results from heterogeneous verticals (news, videos, blogs, etc) has become an important consideration in search. When aiming to select suitable verticals, from which items are selected to be shown along with the standard "ten blue links", there exists the potential to both help (selecting relevant verticals) and harm (selecting irrelevant verticals) the existing result set. In this paper, we present an approach that considers both reward and risk within the task of vertical selection (VS). We propose a novel risk-aware VS evaluation metric that incorporates users' risk-levels and users' individual preference of verticals. Using the proposed metric, we present a detailed analysis of both reward and risk of current resource selection approaches within a multi-label classification framework. The results bring insights into the effectiveness and robustness of current vertical selection approaches.
【Keywords】: aggregated search; evaluation; vertical selection
【Paper Link】 【Pages】:2635-2638
【Authors】: Jiepu Jiang ; Daqing He ; Shuguang Han ; Zhen Yue ; Chaoqun Ni
【Abstract】: We propose a method to dynamically estimate the utility of documents in a search session by modeling the users' browsing behaviors and novelty. The method can be applied to evaluate query reformulations in a search session.
【Keywords】: evaluation; interactive search; query reformulation; query suggestion; search session
【Paper Link】 【Pages】:2639-2642
【Authors】: Byron J. Gao ; Zhumin Chen ; Qi Kang
【Abstract】: Keyword search over graphs has a wide array of applications in querying structured, semi-structured and unstructured data. Existing models typically use minimal trees or bounded subgraphs as query answers. While such models emphasize relevancy, they would suffer from incompleteness of information and redundancy among answers, making it difficult for users to effectively explore query answers. To overcome these drawbacks, we propose a novel cluster-based model, where query answers are relevancy-connected clusters. A cluster is a subgraph induced from a maximal set of relevancy-connected nodes. Such clusters are coherent and relevant, yet complete and redundancy free. They can be of arbitrary shape in contrast to the sphere-shaped bounded subgraphs in existing models. We also propose an efficient search algorithm and a corresponding graph index for large, disk-resident data graphs.
【Keywords】: indexing; information complete; keyword search over graphs; redundancy free
【Paper Link】 【Pages】:2643-2646
【Authors】: Yafei Li ; Dingming Wu ; Jianliang Xu ; Byron Choi ; Weifeng Su
【Abstract】: Location-based social networks, such as Foursquare and Facebook Places, are bridging the gap between the physical world and online social networking services through acquired user locations. Some social networks released check-in services that allow users to share their visiting locations with their friends. In this paper, users' interests are modeled by check-in actions. We propose a new spatial-aware interest group (SIG) query that retrieves a user group of size k where every user is highly interested in the query keyword and also spatially close to each other. An efficient algorithm AIR based on the IR-tree is proposed for the processing of SIG queries. Furthermore, an optimization is developed and achieves a much better performance than the baseline algorithm.
【Keywords】: query processing; social networks; spatial databases
【Paper Link】 【Pages】:2647-2650
【Authors】: Thomas Bernecker ; Tobias Emrich ; Hans-Peter Kriegel ; Matthias Renz ; Andreas Züfle
【Abstract】: Ranking queries have been investigated extensively in the past due to their broad range of applications. In this paper, we study this problem in the context of fuzzy objects that have indeterministic boundaries. Fuzzy objects play an important role in many areas, such as biomedical image databases and GIS. To the best of our knowledge, we present the first efficient approach for similarity ranking in fuzzy object databases. The main challenge of ranking fuzzy objects is that these objects consist of multiple instances, each associated with a probability. We propose a framework to transform fuzzy objects into probabilistic objects which can then be ranked using existing algorithms for probabilistic objects.
【Keywords】: fuzzy data; probabilistic data; probabilistic ranking
【Paper Link】 【Pages】:2651-2654
【Authors】: Shuai Zheng ; Fusheng Wang ; James J. Lu ; Joel H. Saltz
【Abstract】: While current biomedical ontology repositories offer primitive query capabilities, it is difficult or cumbersome to support ontology based semantic queries directly in semantically annotated biomedical databases. The problem may be largely attributed to the mismatch between the models of the ontologies and the databases, and the mismatch between the query interfaces of the two systems. To fully realize semantic query capabilities based on ontologies, we develop a system DBOntoLink to provide unified semantic query interfaces by extending database query languages. With DBOntoLink, semantic queries can be directly and naturally specified as extended functions of the database query languages without any programming needed. DBOntoLink is adaptable to different ontologies through customizations and supports major biomedical ontologies hosted at the NCBO BioPortal. We demonstrate the use of DBOntoLink in a real world biomedical database with semantically annotated medical image annotations.
【Keywords】: ontology; query languages
【Paper Link】 【Pages】:2655-2658
【Authors】: Jakub Lokoc ; Jürgen Wünschmann ; Tomás Skopal ; Albrecht Rothermel
【Abstract】: In this paper, we present the vision of the usage of an object-based video data storage format for similarity search. The efficient (fast) and effective (accurate) search in video streams is an ongoing and still unsolved problem. Using an object-based format of multimedia data, all the information that is needed to answer queries is already available in a machine accessible format. This way, the process of creating (video) descriptors as well as the similarity search becomes easier, because the data is already organized in a manner that allows fast access to specific information. To demonstrate the concept of similarity search process using the object-based 3D video format, we present experiments conducted on generated clouds of points (an abstraction of 3D video data).
【Keywords】: object-based video coding; similarity search; video retrieval
【Paper Link】 【Pages】:2659-2662
【Authors】: Shirui Pan ; Xingquan Zhu
【Abstract】: In this paper, we propose to query correlated graphs in a data stream scenario, where an algorithm is required to retrieve the top k graphs which are most correlated to a query graph q. Due to the dynamically changing nature of the stream data and the inherent complexity of the graph query process, treating graph streams as static datasets is computationally infeasible or ineffective. In the paper, we propose a novel algorithm, Hoe-PGPL, to identify top-k correlated graphs from a data stream, by using a sliding window which covers a number of consecutive batches of stream data records. Our theme is to employ the Hoeffding bound to discover some potential candidates and use two-level candidate checking (one corresponding to the whole sliding window level and one corresponding to the local data batch level) to accurately estimate the correlation of the emerging candidate patterns, without rechecking the historical stream data. Experimental results demonstrate that the proposed algorithm not only achieves good performance in terms of query precision and recall, but also is several times, or even an order of magnitude, more efficient than the straightforward algorithm with respect to the time and the memory consumption. Our method represents the first research endeavor for data stream based top-k correlated graph query.
【Keywords】: correlated graph query; graph stream; pearson's correlation coefficient
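As background for the Hoeffding bound mentioned in the abstract above, the standard form of the bound can be sketched in Python. This shows only the textbook inequality; how Hoe-PGPL applies it to candidate checking is not described in the abstract, and the function name `hoeffding_epsilon` and its parameters are hypothetical.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """Standard Hoeffding bound.

    With probability at least 1 - delta, the mean of n independent
    observations of a variable with range value_range lies within the
    returned epsilon of its true mean.
    """
    if n <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("n must be positive and delta must lie in (0, 1)")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


# Pearson's correlation coefficient lies in [-1, 1], so its range is 2.
eps_small_sample = hoeffding_epsilon(2.0, 0.05, 100)
eps_large_sample = hoeffding_epsilon(2.0, 0.05, 10000)
```

The bound shrinks as more observations arrive, which is what makes it attractive for deciding early which candidates can be kept or discarded in a stream.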
【Paper Link】 【Pages】:2663-2666
【Authors】: Christoph Böhm ; Gjergji Kasneci ; Felix Naumann
【Abstract】: Large amounts of graph-structured data are emerging from various avenues, ranging from natural and life sciences to social and semantic web communities. We address the problem of discovering subgraphs of entities that reflect latent topics in graph-structured data. These topics are structured meta-information providing further insights into the data. The presented approach effectively detects such topics by exploiting only the structure of the underlying graph, thus avoiding the dependency on textual labels, which are a scarce asset in prevalent graph datasets. The viability of our approach is demonstrated in experiments on real-world datasets.
【Keywords】: conceptual patterns; latent topics; subgraph mining
【Paper Link】 【Pages】:2667-2670
【Authors】: Michael J. Welch ; Aamod Sane ; Chris Drome
【Abstract】: User facing topical web applications such as events or shopping sites rely on large collections of data records about real world entities that are updated at varying latencies ranging from days to seconds. For example, event venue details are changed relatively infrequently whereas ticket pricing and availability for an event is often updated in near-realtime. Users regard these sites as high quality if they seldom show duplicates, the URLs are stable, and their content is fresh, so it is important to resolve duplicate entity records with high quality and low latencies. High quality entity resolution typically evaluates the entire record corpus for similar record clusters at the cost of latency, while low latency resolution examines the least possible entities to keep time to a minimum, even at the cost of quality. In this paper we show how to keep low latency while achieving high quality, combining the best of both approaches: given an entity to be resolved, our incremental Fastpath system, in a matter of milliseconds, makes approximately the same decisions that the underlying batch system would have made. Our experiments show that the Fastpath system makes matching decisions for previously unseen entities with 90% precision and 98% recall relative to batch decisions, with latencies under 20ms on commodity hardware.
【Keywords】: deduplication; entity resolution; knowledge base
【Paper Link】 【Pages】:2671-2673
【Authors】: Steffen Metzger ; Michael Stoll ; Katja Hose ; Ralf Schenkel
【Abstract】: Semantic recognition and annotation of unique entities and their relations is key to understanding the essence contained in large text corpora. It typically requires a combination of efficient automatic methods and manual verification. Usually, both parts are seen as consecutive steps. In this demo we present MIKE, a user interface enabling the integration of user feedback into an iterative extraction process. We show how an extraction system can directly learn from such integrated user supervision. In general, this setup allows for stepwise training of the extraction system to a particular domain, while using user feedback early in the iterative extraction process improves extraction quality and reduces the overall human effort needed.
【Keywords】: gui; information extraction; knowledge acquisition; learning; user feedback; web service
【Paper Link】 【Pages】:2674-2676
【Authors】: Yafang Wang ; Maximilian Dylla ; Zhaochun Ren ; Marc Spaniol ; Gerhard Weikum
【Abstract】: Acquiring high-quality (temporal) facts for knowledge bases is a labor-intensive process. Although there has been recent progress in the area of semi-supervised fact extraction, these approaches still have limitations, including a restricted corpus, a fixed set of relations to be extracted or a lack of assessment capabilities. In this paper we introduce PRAVDA-live, a framework that overcomes these limitations and supports the entire pipeline of interactive knowledge harvesting. To this end, our demo exhibits fact extraction from ad-hoc corpus creation, via relation specification, labeling and assessment all the way to ready-to-use RDF exports.
【Keywords】: interactive knowledge harvesting; label propagation
【Paper Link】 【Pages】:2677-2679
【Authors】: Yunfei Chen ; Lanbo Zhang ; Aaron Michelony ; Yi Zhang
【Abstract】: With the increasing popularity of the social web, cyber bullying has become a more and more serious issue among children. Bullying causes huge negative effects on children, even suicide. SocialFilter is a realtime system that helps parents and educators track children's messages on Twitter, especially in order to detect whether they have been bullied or are bullying others. The aim of the system is the 4 I's: identity of bullies, inference of bullying messages, influence of bully behavior, and intervention. We solve this problem by using machine learning techniques. The current system is tracking tens of thousands of active child users on Twitter and automatically detects bullying messages in real time.
【Keywords】: bully; detecting; twitter
【Paper Link】 【Pages】:2680-2682
【Authors】: Eduard C. Dragut ; Mourad Ouzzani ; Amgad Madkour ; Nabeel Mohamed ; Peter Baker ; David E. Salt
【Abstract】:
【Keywords】: genomic data; ionomic data; visualization
【Paper Link】 【Pages】:2683-2685
【Authors】: Benjamin Bertin ; Vasile-Marian Scuturici ; Jean-Marie Pinon ; Emmanuel Risler
【Abstract】: We demonstrate CarbonDB, a web application for Life Cycle Inventory data management. Life Cycle Assessment provides a well-accepted methodology for modelling environmental impacts of human activities. This methodology relies on the decomposition of a studied system into interdependent processes in a phase called Life Cycle Inventory. Several organisations provide process databases containing thousands of processes with their interdependency links. The usual workflow to manage those databases is based on the manipulation of individual processes, which turns out to be very laborious work even if there are strong semantic similarities between the involved processes. In previous publications, we proposed a new workflow for LCA inventory database maintenance based on the addition of semantic information to the processes they contain. This method considerably eases the modeling process and offers a synthetic view of the dependency links. We created a web application based on this approach, composed of a back-end for data management and a front-end for searching processes and visualizing the dependency links in a graph.
【Keywords】: clustering; environmental database; life cycle assessment; ontology; semantic annotation
【Paper Link】 【Pages】:2686-2688
【Authors】: Nattiya Kanhabua ; Sara Romano ; Avaré Stewart ; Wolfgang Nejdl
【Abstract】: Microblogging services, such as Twitter, are gaining interest as a means of sharing information in social networks. Numerous works have shown the potential of using Twitter posts (or tweets) in order to infer the existence and magnitude of real-world events. In the medical domain, there has been a surge in detecting public health related tweets for early warning so that a rapid response from health authorities can take place. In this paper, we present a temporal analytics tool for supporting a comparative, temporal analysis of disease outbreaks between Twitter and official sources, such as the World Health Organization (WHO) and ProMED-mail. We automatically extract and aggregate outbreak events from official outbreak reports, producing time series data. Our tool can support a correlation analysis and an understanding of the temporal developments of outbreak mentions in Twitter, based on comparisons with official sources.
【Keywords】: Twitter; disease outbreaks; event detection; time series analysis
【Paper Link】 【Pages】:2689-2691
【Authors】: Hyun Duk Kim ; ChengXiang Zhai ; Thomas A. Rietz ; Daniel Diermeier ; Meichun Hsu ; Malú Castellanos ; Carlos Ceja Limon
【Abstract】: Topic modeling is popular for text mining tasks. Recently, topic modeling has been combined with time lines when textual data is related to external non-textual time series data such as stock prices. However, no previous work has used the external non-textual time series data in the process of topic modeling. In this paper, we describe a novel text mining system, Integrative Causal Topic Miner (InCaToMi) that integrates textual and non-textual time series data. InCaToMi automatically finds causal relationships and topics using text data and external non-textual time series data using Granger Testing. Moreover, InCaToMi considers the non-textual time series data in the topic modeling process, using the time series data to iteratively improve modeling results through interactions between it and the textual data at both topic and word levels.
【Keywords】: causal topic mining; integrative topic mining; time series
【Paper Link】 【Pages】:2692-2694
【Authors】: Philipp Kranen ; Stephan Wels ; Tim Rohlfs ; Sebastian Raubach ; Thomas Seidl
【Abstract】: Testing algorithms and systems involves trying different sets of parameter values on different domains or data sets. Even for a moderate number of parameters and domains the number of possible experiments can get very large due to the combinatorial explosion. Evaluating the outcome of these experiments requires comparing the results, which is often done by writing a script or inspecting the result files manually. For a new algorithm or version, the work has to be done over again. With hundreds, thousands, or even more possible experiments, both the preparation and the evaluation can become complex and tedious. In this demonstrator we present a software tool, called ET, for evaluating the parameters of an algorithm or system, either automatically or controlled by the user. It allows users to launch large numbers of experiments in just a few clicks, visually explore the results, and analyze the performance of the algorithm.
【Keywords】: evaluation; framework; structured testing
【Paper Link】 【Pages】:2695-2697
【Authors】: Walid Magdy ; Ahmed M. Ali ; Kareem Darwish
【Abstract】: Searching social content in general and microblogs (aka tweets) in particular has been basic and limited, especially for time-sensitive topics. The currently implemented microblog search on sites such as Twitter is based on simple word matching and retrieves the most recent microblogs that match a given query. Furthermore, a user may obtain hundreds or perhaps thousands of microblogs in response to a given query, leading to information overload. We present a new multidimensional microblog search tool that generates a comprehensive report from microblogs instead of a flat list of recent/relevant microblogs for a given query. Reports may include tag-clouds, topic time series, and most popular and funny microblogs, etc. The tool can be configured for monitoring time-sensitive topics using a set of predefined queries. We demonstrate our system on Arabic and English microblog collections. Additionally, we show a special configuration of the system for monitoring the 2012 Egyptian presidential elections.
【Keywords】: elections; microblog search; retrieval results summarization; twitter
【Paper Link】 【Pages】:2698-2700
【Authors】: Stewart Whiting ; Ke Zhou ; Joemon M. Jose ; Omar Alonso ; Teerapong Leelanupab
【Abstract】: Time plays a central role in many web search information needs relating to recent events. For recency queries where fresh information is most desirable, there is likely to be a great deal of highly-relevant information created very recently by crowds of people across the world, particularly on platforms such as Wikipedia and Twitter. With so many users, mainstream events are often very quickly reflected in these sources. The English Wikipedia encyclopedia consists of a vast collection of user-edited articles covering a range of topics. During events, users collaboratively create and edit existing articles in near real-time. Simultaneously, users on Twitter disseminate and discuss event details, with a small number of users becoming influential for the topic. In this demo, we propose a novel approach to presenting a summary of new information and users related to recent or ongoing events associated with the user's search topic, therefore aiding most recent information discovery. We outline methods to detect search topics which are driven by events, identify and extract changing Wikipedia article passages and find influential Twitter users. Using these, we provide a system which displays familiar tiles in search results to present recent changes in the event-related Wikipedia articles, as well as Twitter users who have tweeted recent relevant information about the event topics.
【Keywords】: Twitter; Wikipedia; events; time
【Paper Link】 【Pages】:2701-2703
【Authors】: Jie Yin ; Sarvnaz Karimi ; Bella Robinson ; Mark A. Cameron
【Abstract】: During a disastrous event, such as an earthquake or river flooding, information on what happened, who was affected and how, where help is needed, and how to aid people who were affected, is crucial. While communication is important in such times of crisis, damage to infrastructure such as telephone lines makes it difficult for authorities and victims to communicate. Microblogging has played a critical role as an important communication platform during crises when other media have failed. We demonstrate our ESA (Emergency Situation Awareness) system that mines microblogs in real-time to extract and visualise useful information about incidents and their impact on the community in order to equip the right authorities and the general public with situational awareness.
【Keywords】: filtering; social media monitoring; social web mining
【Paper Link】 【Pages】:2704-2706
【Authors】: Zhumin Chen ; Byron J. Gao ; Qi Kang
【Abstract】: Existing search engines use the page as the unit of information retrieval. They typically return a ranked list of pages, each being a search result containing the query keywords. This within-one-page constraint disallows utilization of relationship information that is often available and greatly beneficial. To utilize relationship information and improve search precision, we explore cross-page search, where each answer is a logical page consisting of multiple closely related pages that collectively contain the query keywords. We have implemented a prototype, Cager, providing cross-page search and visualization over a real dataset.
【Keywords】: cross-page search; keyword search over graphs
【Paper Link】 【Pages】:2707-2709
【Authors】: Wilson Wong ; Lawrence Cavedon ; John Thangarajah ; Lin Padgham
【Abstract】: One of the biggest bottlenecks for conversational systems is large-scale provision of suitable content. Our approach readily provides this without the need for custom-crafting. In this demonstration, we present the use of question-answer (QA) pairs mined from online question-and-answer websites to construct system utterances for a conversational agent. Our system uses QA pairs to formulate utterances that drive a conversation in addition to the answering of user questions as has been done in previous work. We use a collection of strategies that specify how and when the different parts of our question-answer pairs can be used and augmented with a small number of generic hand-crafted text snippets to generate natural and coherent system utterances.
【Keywords】: conversational system; question-answer pairs
【Paper Link】 【Pages】:2710-2712
【Authors】: Sergej Zerr ; Stefan Siersdorfer ; Jonathon S. Hare
【Abstract】: Photo publishing in Social Networks and other Web2.0 applications has become very popular due to the pervasive availability of cheap digital cameras, powerful batch upload tools and a huge amount of storage space. A portion of uploaded images are of a highly sensitive nature, disclosing many details of the users' private life. We have developed a web service which can detect private images within a user's photo stream and provide support in making privacy decisions in the sharing context. In addition, we present a privacy-oriented image search application which automatically identifies potentially sensitive images in the result set and separates them from the remaining pictures.
【Keywords】: classification; diversification; image analysis; privacy
【Paper Link】 【Pages】:2713-2715
【Authors】: Sheng Lin ; Peiquan Jin ; Xujian Zhao ; Lihua Yue
【Abstract】: Most Web pages contain temporal information, which can be utilized by search engines to improve searching performance for users. However, traditional search engines have little support for processing temporal-textual Web queries. To address this problem, in this paper we present and implement a prototype system for time-sensitive queries, which is called TASE (Time-Aware Search Engine). TASE extracts both the explicit and implicit temporal expressions for each Web page, calculates the relevance score between the Web page and each temporal expression, and then re-ranks search results based on the temporal-textual relevance between Web pages and the queries. It is demonstrated that TASE can improve the effectiveness of temporal-textual Web queries.
【Keywords】: re-ranking; time; web search
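The re-ranking step described in the TASE abstract above can be sketched as a weighted combination of a textual score and a temporal score. The abstract does not give the scoring formula, so the linear blend, the `alpha` weight, and the per-page `{year: score}` layout below are assumptions for illustration only.

```python
def rerank(results, query_year, alpha=0.5):
    """Re-rank results by a blend of textual and temporal relevance.

    results: list of (page_id, text_score, temporal_scores) tuples, where
    temporal_scores maps a year to the page's relevance score for that
    temporal expression (hypothetical layout, scores assumed in [0, 1]).
    alpha: weight of the temporal component (assumed, not from the paper).
    """
    def combined(result):
        _, text_score, temporal_scores = result
        temporal = temporal_scores.get(query_year, 0.0)
        return (1.0 - alpha) * text_score + alpha * temporal

    return sorted(results, key=combined, reverse=True)
```

A page that matches the query text slightly less well can move ahead of a stronger textual match when it is strongly associated with the queried time.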
【Paper Link】 【Pages】:2716-2718
【Authors】: Zhuowei Bao ; Benny Kimelfeld ; Yunyao Li ; Sriram Raghavan ; Huahai Yang
【Abstract】: Enterprise search is challenging due to various reasons, notably the dynamic terminology and domain structure that are specific to the enterprise, combined with the fact that search deployments are typically managed by domain experts who are not necessarily search experts. To address that, it has been proposed to design search architectures that feature two principles: comprehensibility of the ranking mechanism and customizability of the search engine by means of intuitive runtime rules. The proposed demonstration operates on top of an engine implementation based on this search philosophy, and provides an administrator toolkit to realize the two principles. In particular, the toolkit provides a complete visualization of the provenance (hence ranking) of search results, embeds an editor for programming runtime rules, facilitates the investigation of (the cause of) missing or low-ranked desired results, and provides suggestions of rewrite rules to handle such results.
【Keywords】: enterprise search; rule suggestion; search administration toolkit; search provenance visualization
【Paper Link】 【Pages】:2719-2721
【Authors】: Yuhki Shiraishi ; Jianwei Zhang ; Yukiko Kawai ; Toyokazu Akiyama
【Abstract】: We present a novel system that combines the advantages of social communication and Web search by simultaneously discovering important pages and users. First, the system provides a communication interface attached to pages, which allows users to talk with each other in real time while browsing the same page, i.e., page-centric communication. Then, the system can efficiently provide two ranking lists of pages and users by analyzing a hybrid structure of hyperlinks (page-page relationship) and social links (page-user relationship and user-user relationship). Thus, users can efficiently search for important pages as well as important users related to their queries through the ranking function, and immediately obtain useful information or knowledge from not only pages themselves but also from other users.
【Keywords】: communication; ranking; search; social networking
【Paper Link】 【Pages】:2722-2724
【Authors】: Mouna Kacimi ; Johann Gamper
【Abstract】: A query topic can be subjective involving a variety of opinions, judgments, arguments, and many other debatable aspects. Typically, search engines process queries independently from the nature of their topics using a relevance-based retrieval strategy. Hence, search results about subjective topics are often biased towards a specific view point or version. In this demo, we shall present MOUNA, a novel approach for opinion diversification. Given a query on a subjective topic, MOUNA ranks search results based on three scores: (1) relevance of documents, (2) semantic diversity to avoid redundancy and capture the different arguments used to discuss the query topic, and (3) sentiment diversity to cover a balanced set of documents having positive, negative, and neutral sentiments about the query topic. Moreover, MOUNA enhances the representation of search results with a summary of the different arguments and sentiments related to the query topic. Thus, the user can navigate through the results and explore the links between them. We provide an example scenario in this demonstration to illustrate the inadequacy of relevance-based techniques for searching subjective topics and highlight the innovative aspects of MOUNA. A video showing the demo can be found in http://www.youtube.com/user/mounakacimi/videos .
【Keywords】: diversification; ranking; sentiment analysis
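The three-score ranking in the MOUNA abstract above (relevance, semantic diversity, sentiment diversity) can be illustrated with a greedy selection loop. This is a sketch under stated assumptions rather than MOUNA's algorithm: the Jaccard similarity over term sets, the sentiment balance term, the weights, and the document layout are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def diversify(docs, k, w_rel=0.5, w_sem=0.25, w_sent=0.25):
    """Greedily pick k documents balancing relevance and diversity.

    docs: list of dicts with keys 'id', 'rel' (0..1), 'terms' (iterable),
    and 'sentiment' in {'pos', 'neg', 'neu'} (hypothetical layout).
    """
    selected = []
    remaining = list(docs)
    counts = {"pos": 0, "neg": 0, "neu": 0}

    def gain(doc):
        # Semantic novelty: 1 minus similarity to the closest selected doc.
        closest = max(
            (jaccard(doc["terms"], s["terms"]) for s in selected),
            default=0.0,
        )
        novelty = 1.0 - closest
        # Sentiment balance: favor sentiments that are under-represented.
        share = counts[doc["sentiment"]] / (len(selected) or 1)
        balance = 1.0 - share
        return w_rel * doc["rel"] + w_sem * novelty + w_sent * balance

    while remaining and len(selected) < k:
        best = max(remaining, key=gain)
        remaining.remove(best)
        selected.append(best)
        counts[best["sentiment"]] += 1
    return selected
```

With these weights, a near-duplicate of an already selected positive document loses out to less relevant documents that add a new argument or a missing sentiment.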
【Paper Link】 【Pages】:2725-2727
【Authors】: Ognjen Savkovic ; Paramita Mirza ; Sergey Paramonov ; Werner Nutt
【Abstract】: MAGIK demonstrates how to use meta-information about the completeness of a database to assess the quality of the answers returned by a query. The system holds so-called table-completeness (TC) statements, by which one can express that a table is partially complete, that is, it contains all facts about some aspect of the domain. Given a query, MAGIK determines from such meta-information whether the database contains sufficient data for the query answer to be complete. If, according to the TC statements, the database content is not sufficient for a complete answer, MAGIK explains which further TC statements are needed to guarantee completeness. MAGIK extends and complements theoretical work on modeling and reasoning about data completeness by providing the first implementation of a reasoner. The reasoner operates by translating completeness reasoning tasks into logic programs, which are executed by an answer set engine.
【Keywords】: answer set programming; data completeness; data quality
【Paper Link】 【Pages】:2728-2730
【Authors】: Tobias Emrich ; Hans-Peter Kriegel ; Johannes Niedermayer ; Matthias Renz ; André Suhartha ; Andreas Züfle
【Abstract】: This demo presents a framework for running probabilistic graph queries on uncertain graphs and visualizing their results. The framework supports the most common uncertainty model for uncertain graphs, i.e. existential uncertainty for the edges of the graph. A large variety of meaningful graph queries are supported, such as shortest path, range, kNN, reverse kNN, reachability and various aggregation queries. Since the problem of exact probability computation according to possible world semantics is in #P-Time for many combinations of model and query, and since ignoring uncertainty (e.g. by using expectations only) will yield counterintuitive and hard to interpret results, our framework uses an optimized version of Monte-Carlo sampling to estimate the results, which allows us not only to perform queries that conform to possible world semantics but also to sample only parts of a graph relevant for a given query. The main strength of this framework is the visualization combined with statistical hypothesis tests, which gives the user not only the estimated result of a query, but also an indication of how significant and reliable these results are. The aim of this demonstration is to give an intuition that a sampling based approach to probabilistic graphs is viable, and that the estimated results quickly converge even for very large graphs. A video demonstrating our framework can be downloaded at http://www.dbs.ifi.lmu.de/Publikationen/videos/PGraph.html
【Keywords】: monte-carlo; probabilistic graph; sampling; visualization
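The sampling idea in the abstract above can be sketched for a reachability query on a directed graph with independent edge probabilities. This is plain Monte-Carlo estimation, not the authors' optimized sampler: the adjacency layout and function names are assumptions. Edges are sampled lazily during the search, so only the part of the graph the query touches is ever drawn.

```python
import random
from collections import deque
from math import sqrt


def sample_reachable(edges, source, target, rng):
    """Draw one possible world lazily and test reachability with BFS.

    edges: dict mapping a node to a list of (neighbor, probability) pairs
    for directed edges (hypothetical layout). Each edge is examined at
    most once per sample, when its tail node is dequeued.
    """
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return True
        for v, p in edges.get(u, []):
            if v not in seen and rng.random() < p:
                seen.add(v)
                queue.append(v)
    return False


def estimate_reachability(edges, source, target, samples=10000, seed=0):
    """Return (estimate, standard_error) of P(target reachable from source)."""
    rng = random.Random(seed)
    hits = sum(
        sample_reachable(edges, source, target, rng) for _ in range(samples)
    )
    p_hat = hits / samples
    std_err = sqrt(p_hat * (1.0 - p_hat) / samples)
    return p_hat, std_err
```

The standard error gives a rough indication of how reliable the estimate is; it shrinks with the square root of the number of samples.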
【Paper Link】 【Pages】:2731-2733
【Authors】: Melanie Herschel ; Hanno Eichelberger
【Abstract】: When developing data transformations - a task omnipresent in applications like data integration, data migration, data cleaning, or scientific data processing - developers quickly face the need to verify the semantic correctness of the transformation. Declarative specifications of data transformations, e.g., SQL or ETL tools, increase developer productivity but usually provide limited or no means for inspection or debugging. In this situation, developers today have no choice but to manually analyze the transformation and, in case of an error, to (repeatedly) fix and test the transformation. The goal of the Nautilus project is to semi-automatically support this analysis-fix-test cycle. This demonstration focuses on one main component of Nautilus, namely the Nautilus Analyzer that helps developers in understanding and debugging their data transformations. The demonstration will show the capabilities of this component for data transformations specified in SQL on scenarios from different domains that are based on real-world data. We provide an overview of the Nautilus Analyzer, discuss components and implementation techniques, and outline our demonstration plan. The Nautilus website (http://nautilus-system.org) features a video, screenshots, and further details.
【Keywords】: data provenance; query analysis
【Paper Link】 【Pages】:2734-2736
【Authors】: Asma Souihli ; Pierre Senellart
【Abstract】: ProApproX 2.0 allows users to query uncertain tree-structured data in the form of probabilistic XML documents. The demonstrated version integrates a fully redesigned query engine that, first, produces a propositional formula that represents the probabilistic lineage of a given answer over the probabilistic XML document, and, second, searches for an optimal strategy to approximate the probability of the lineage. This latter part relies on a query-optimizer-like approach: exploring different evaluation plans for different parts of the formula and predicting the cost of each plan, using a cost model for the various evaluation algorithms. The demonstration presents the graphical user interface of ProApproX 2.0, that allows a user to input an XPath query and approximation parameters, and lists query results with their probabilities; the interface also gives insight into the way the computation is performed, by displaying the compilation of the query lineage as a tree annotated with evaluation operators.
【Keywords】: approximation algorithm; cost model; probabilistic data; query processing; xml
【Paper Link】 【Pages】:2737-2739
【Authors】: Hyebong Choi ; Kyong-Ha Lee ; Soo-Hyong Kim ; Yoon-Joon Lee ; Bongki Moon
【Abstract】: The volume of XML data is tremendous in many areas, but especially in data logging and scientific areas. XML data in these areas are accumulated over time as new data are continuously collected. It is a challenge to process massive XML data with multiple twig pattern queries given by multiple users in a timely manner. We showcase HadoopXML, a system that simultaneously processes many twig pattern queries for a massive volume of XML data with Hadoop. Specifically, HadoopXML provides an efficient way to process a single large XML file in parallel. It processes multiple twig pattern queries simultaneously with a shared input scan. Users do not need to iterate M/R jobs for each query. HadoopXML also reduces many I/Os by enabling twig pattern queries to share their path solutions with each other. Moreover, HadoopXML provides a sophisticated runtime load balancing scheme for fairly assigning multiple twig pattern joins across nodes. With synthetic and real world XML datasets, we demonstrate how efficiently HadoopXML processes many twig pattern queries in a shared and balanced way.
【Keywords】: mapreduce; parallel processing; query optimization; xml
【Paper Link】 【Pages】:2740-2742
【Authors】: Christan Earl Grant ; Joir-dan Gumbs ; Kun Li ; Daisy Zhe Wang ; George Chitouras
【Abstract】: In many domains, structured data and unstructured text are both important natural resources to fuel data analysis. Statistical text analysis needs to be performed over text data to extract structured information for further query processing. Typically, developers will need to connect multiple tools to build off-line batch processes to perform text analytic tasks. MADden is an integrated system developed for relational database systems such as PostgreSQL and Greenplum for real-time ad hoc query processing over structured and unstructured data. MADden implements four important text analytic functions that we have contributed to the MADlib open source library for textual analytics. In this demonstration, we will show the capability of the MADden text analytic library using computational journalism as the driving application. We show real-time declarative query processing over multiple data sources with both structured and text information.
【Keywords】: databases; query-driven; text analytics
【Paper Link】 【Pages】:2743-2745
【Authors】: K. Selçuk Candan ; Rosaria Rossini ; Maria Luisa Sapino ; Xiaolan Wang
【Abstract】: Since many applications rely on time-based data, visualizing temporal data and helping experts explore large time series data sets are critical in many application domains. In this interactive system preview, we argue that time series often carry structural features that can, if efficiently identified and effectively visualized, help reduce visual overload and help the user quickly focus on the relevant portions of the data sets. Relying on this observation, we introduce a novel STFMap system, which includes four innovative query- and feature-driven time series data set visualization techniques: (a) segment-maps, (b) warp-maps, (c) stretch-maps, and (d) feature-maps. These rely on the salient temporal features of the time series and their alignments with respect to the given user query to help users explore the data set in a query-driven fashion.
【Keywords】: data exploration; time series data sets
【Paper Link】 【Pages】:2746-2748
【Authors】: Imen Ben Dhia ; Talel Abdessalem ; Mauro Sozio
【Abstract】: While online social networks (OSN) present unprecedented opportunities for sharing information and multimedia content among users, they raise major privacy issues as users could often access personal or confidential data of other users. Most social networks provide some basic access control policies, which however seem to be very limited given the diversity of user relationships in the current social networks (e.g. friend, acquaintance, son) as well as the needs of social network users who might want to express sophisticated access control policies (e.g. invite all children of my colleagues to my child's birthday party). In this demonstration proposal, we present Primates, a privacy management system for social networks. Primates allows users to specify access control rules for their resources and enforces access control over all shared resources. The set of users who are allowed to access a given resource is defined by a set of constraints on the paths connecting the owner of a resource to its requester in the social graph. We demonstrate the accuracy of our access control model and the scalability of our system.
【Keywords】: access control; online social networks
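The path-based access control described in the Primates abstract above can be illustrated with rules expressed as sequences of relationship labels, such as "children of my colleagues". This is a simplified sketch and not Primates' rule language: the graph layout, the label-sequence rule form, and the function names are assumptions.

```python
def reachable_by_labels(graph, owner, labels):
    """Return the users reached from owner by following labels in order.

    graph: dict mapping a user to a list of (label, other_user) pairs
    describing directed, labeled relationships (hypothetical layout).
    """
    frontier = {owner}
    for label in labels:
        frontier = {
            target
            for user in frontier
            for edge_label, target in graph.get(user, [])
            if edge_label == label
        }
    return frontier


def can_access(graph, owner, requester, rules):
    """Grant access if any rule's label path links owner to requester.

    rules: list of label sequences, e.g. [["colleague", "child"]].
    """
    return any(
        requester in reachable_by_labels(graph, owner, rule) for rule in rules
    )
```

For the birthday-party example in the abstract, the rule `["colleague", "child"]` admits exactly the users who are a child of one of the owner's colleagues.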
【Paper Link】 【Pages】:2749-2751
【Authors】: Andrés Aranda-Andújar ; Francesca Bugiotti ; Jesús Camacho-Rodríguez ; Dario Colazzo ; François Goasdoué ; Zoi Kaoudi ; Ioana Manolescu
【Abstract】: We present AMADA, a platform for storing Web data (in particular, XML documents and RDF graphs) based on the Amazon Web Services (AWS) cloud infrastructure. AMADA operates in a Software as a Service (SaaS) approach, allowing users to upload, index, store, and query large volumes of Web data. The demonstration shows (i) the step-by-step procedure for building and exploiting the warehouse (storing, indexing, querying) and (ii) the monitoring tools enabling one to control the expenses (monetary costs) charged by AWS for the operations involved while running AMADA.
【Keywords】: aws; cloud computing; monetary cost; query processing; web data management
【Paper Link】 【Pages】:2752-2753
【Authors】: Jalal Mahmud ; James Caverlee ; Jeffrey Nichols ; John O'Donovan ; Michelle X. Zhou
【Abstract】: Massive amounts of data are being generated on social media sites, such as Twitter and Facebook. This data can be used to better understand people, such as their personality traits, perceptions, and preferences, and predict their behavior. This deeper understanding of users and their behaviors can benefit a wide range of intelligent applications, such as advertising, social recommender systems, and personalized knowledge management. These applications will also benefit individual users themselves by optimizing their experiences across a wide variety of domains, such as retail, healthcare, and education. Since mining and understanding user behavior from social media often requires interdisciplinary effort, including machine learning, text mining, human-computer interaction, and social science, our workshop aims to bring together researchers and practitioners from multiple fields to discuss the creation of deeper models of individual users by mining the content that they publish and the social networking behavior that they exhibit.
【Keywords】: data-driven; social media analysis; user modeling
【Paper Link】 【Pages】:2754-2755
【Authors】: Xiaofeng Meng ; Adam Silberstein ; Fusheng Wang
【Abstract】: The fourth ACM international workshop on cloud data management is held in Maui, Hawaii, USA on October 29, 2012 and co-located with the ACM 21st Conference on Information and Knowledge Management (CIKM). The main objective of the workshop is to address the challenges of large scale data management based on the cloud computing infrastructure. The workshop brings together researchers and practitioners from cloud computing, distributed storage, query processing, parallel algorithms, data mining, and system analysis. All attendees share common research interests in maximizing performance, reducing the cost of cloud data management, and enlarging the scale of their endeavors. We have constructed an exciting program of seven refereed papers and four invited keynote talks that will give participants a full dose of emerging research.
【Keywords】: cloud computing; data management
【Paper Link】 【Pages】:2756-2757
【Authors】: Veli Bicer ; Thanh Tran ; Fatma Ozcan ; Opher Etzion
【Abstract】: Cities today have become highly dense, dynamic living areas for the majority of the planet's population and also focal points of innovation, commerce, and growth in a highly modernized world. Due to their intensifying importance, cities need to transform into sustainable, smarter and credible places to enable a tenantable and comfortable life for their citizens. City data, which is the source of our digitized knowledge about the cities, is a highly important element to achieve this goal as it is the main input to build complex city ecosystems and to solve particular problems that are encountered in the cities today. In this respect, this workshop will provide a major forum to identify the challenges and opportunities in terms of better managing city data and to reveal its discriminating importance in various applications in a city ecosystem. As city data becomes more widespread and prevailing, it poses novel research problems which, importantly, are open to the investigation of a broad community of researchers in various fields.
【Keywords】: data analytics; data management; data monitoring; information retrieval; smart cities
【Paper Link】 【Pages】:2758-2759
【Authors】: Cui Tao ; Matt-Mouley Bouamrane
【Abstract】: Data management and knowledge engineering have long been important research fields in computer science, and rapid progress in recent years has increasingly seen these technologies successfully applied to solve complex biomedical challenges and support health services professionals in the course of their intellectually-demanding clinical duties, such as through the use of decision-support or expert systems. Yet, as the biomedical knowledge available in the modern digital world grows exponentially, there is a pressing need for a focused forum to promote technology and knowledge transfer from basic research to biomedical applications as well as allowing implementers of healthcare systems to share their experiences with the research community. The Managing Interoperability and Complexity in Health Systems (MIXHS) workshops are designed for such a purpose, with the view that multi-disciplinary approaches within a holistic forum are essential to rise to the ever new challenges of biomedical knowledge complexity and interoperability of health systems and services.
【Keywords】: bio-medical knowledge management; electronic health systems interoperability and integration
【Paper Link】 【Pages】:2760-2761
【Authors】: Spyros Kotoulas ; Yi Zeng ; Zhisheng Huang
【Abstract】: The rapid and perpetual growth of knowledge on the Web has given rise to many grand challenges (such as scalability, inconsistency, uncertainty, distribution and dynamics) for traditional knowledge processing methods and systems. Knowledge representation, retrieval and reasoning methods need to evolve and adapt to the Web to face these challenges and make this vast, heterogeneous knowledge useful and accessible. In this light, the International Workshop on Web-scale Knowledge Representation, Retrieval, and Reasoning (Web-KR) was initiated. This workshop is the third in the series. This summary discusses the scope of Web-KR and introduces the advances in this field through the accepted papers in the Web-KR 2012 workshop, co-located with CIKM 2012.
【Keywords】: knowledge representation; knowledge retrieval; scalability; semantic search; web reasoning
【Paper Link】 【Pages】:2762-2763
【Authors】: Christopher C. Yang ; Hsinchun Chen ; Howard D. Wactlar ; Carlo Combi ; Xuning Tang
【Abstract】: The Smart Health and Wellbeing workshop is organized to develop a platform for authors to discuss fundamental principles, algorithms or applications of intelligent data acquisition, processing and analysis of healthcare data. We are particularly interested in information and knowledge management papers, in which the approaches are accompanied by an in-depth experimental evaluation with real world data. This paper provides an overview of the workshop and the accepted contributions.
【Keywords】: healthcare; information technology; wellness
【Paper Link】 【Pages】:2764-2765
【Authors】: Gabriella Kazai ; Monica Landoni ; Carsten Eickhoff ; Peter Brusilovsky
【Abstract】: BooksOnline'12, the fifth workshop in the series, aims to offer a forum for bringing together expertise from academia, industry and libraries to facilitate the exchange of research results and technology in the field of digital libraries with specific focus on online books and complementary social media. The focus of this year's workshop is "engaging reading experiences", starting from the act of deciding what to read, through the exploration and interpretation of a book's content, to sharing the overall experience. Within this overall umbrella theme, the accepted papers naturally showed three salient themes: (1) Search and Discovery, (2) Personalization and Recommendation, and (3) Reading Experiences beyond Text. The contributions demonstrate a range of technologies, including a collaborative tabletop visual approach to support the searching and discovery of books; co-citation methods to enhance document retrieval; exploring open issues in audio-book production to support non-text based reading and improving e-book accessibility; new approaches to recommendation that take into account writing style as well as looking specifically to young readers and their needs in order to develop recommendation tools that consider both content and reading level and match these against the readers' specific interests and reading ability. Following in the theme of the reader playing a central role in the future of our digital era, we are honored to welcome Maribeth Back from FX Palo Alto and Natasa Milic-Frayling from Microsoft Research as our keynote speakers.
【Keywords】: booksonline'12 workshop papers summary
【Paper Link】 【Pages】:2766-2767
【Authors】: Min Song ; Doheon Lee ; Hua Xu ; Sophia Ananiadou
【Abstract】: The organizers of the ACM Sixth International Workshop on Data and Text Mining in Biomedical Informatics (DTMBIO 12) are happy to announce that the sixth DTMBIO will be held in conjunction with CIKM, one of the largest data management conferences. The major interests of DTMBIO are the state-of-the-art applications of data and text mining to biomedical research problems. DTMBIO 12 will be a forum for discussing and exchanging informatics related techniques and problems in the context of biomedical research.
【Keywords】: medical information systems
【Paper Link】 【Pages】:2768-2769
【Authors】: Ingmar Weber ; Ana-Maria Popescu ; Marco Pennacchiotti
【Abstract】: What is the role of the internet in politics in general and during campaigns in particular? And what is the role of large amounts of user data in all of this? In the 2008 U.S. presidential campaign the Democrats were far more successful than the Republicans in utilizing online media for mobilization, co-ordination and fundraising. For the first time, social media and the Internet played a fundamental role in political campaigns. However, technical research in this area has been surprisingly limited and fragmented. The goal of this workshop is to bring together, for the first time, researchers working at the intersection of social network analysis, computational social science and political science, to share and discuss their ideas in a common forum; and to inspire further developments in this growing, fascinating field. The workshop has Filippo Menczer as keynote speaker. It includes technical presentations of accepted papers and concludes with a panel discussion where scientists and media experts from different fields can interact and share views.
【Keywords】: Twitter; computational political science; facebook; politics; elections; social media
【Paper Link】 【Pages】:2770-2771
【Authors】: Haggai Roitman ; Iván Cantador ; Miriam Fernández
【Abstract】: This paper provides an overview of the 1st International Workshop on Multimodal Crowd Sensing (CrowdSens 2012), held at the 21st ACM International Conference on Information and Knowledge Management (CIKM 2012). This workshop aimed to provide an open forum for researchers from various fields such as Natural Language Processing, Information Extraction, Data Mining, Information Retrieval, User Modeling and Personalization, Stream Processing, and Sensor Networks, for addressing the challenges of effectively mining, analyzing, fusing, and exploiting information sourced from multimodal physical and social sensor data sources.
【Keywords】: algorithms; experimentation; human factors; performance
【Paper Link】 【Pages】:2772-2773
【Authors】: Jaap Kamps ; Jussi Karlgren ; Peter Mika ; Vanessa Murdock
【Abstract】: There is an increasing amount of structure on the Web as a result of modern Web languages, user tagging and annotation, emerging robust NLP tools, and an ever growing volume of linked data. These meaningful semantic annotations hold the promise to significantly enhance information access, by enhancing the depth of analysis of today's systems. Currently, we have only started exploring the possibilities and have only begun to understand how these valuable semantic cues can be put to fruitful use. To complicate matters, standard text search excels at shallow information needs expressed by short keyword queries, and here semantic annotation contributes very little, if anything. The main questions for the workshop are how to leverage the rich context currently available, especially in a mobile search scenario, giving powerful new handles to exploit semantic annotations, and how to fruitfully combine information retrieval and semantic web approaches and, for the first time, work actively toward a unified view on exploiting semantic annotations.
【Keywords】: semantic annotation
【Paper Link】 【Pages】:2774-2775
【Authors】: Rakesh Agrawal ; Douglas W. Oard ; Nitendra Rajput
【Abstract】: Several issues arise with management of content that is generated in developing regions. Some result from linguistic diversity (as in India and Africa), some result from content being available only in forms that are more difficult to computationally manipulate (e.g., handwriting, speech, and legacy digital text in nonstandard encodings), some result from underinvestment in language resources for the languages of these regions, and some result from increased contact between cultures that have different views regarding the proper use of information and information artifacts. Such issues warrant focused attention if we are to optimally leverage information and knowledge management to the advantage of populations in developing regions. That is the purpose of this workshop.
【Keywords】: information management; international development; knowledge management
【Paper Link】 【Pages】:2776-2777
【Authors】: Aparna S. Varde ; Fabian M. Suchanek
【Abstract】: The PIKM 2012 workshop is the 5th of its kind after 4 successful PhD workshops at ACM CIKM. This PhD workshop invites papers that describe the Ph.D. dissertation proposals of doctoral students in any of the CIKM areas: databases, information retrieval, data mining and knowledge management. Interdisciplinary work across these tracks is particularly encouraged. This year PIKM has received around 25 submissions from over 12 countries across the globe, among which 10 have been accepted as full papers for oral presentation while 4 have been accepted as short ones for poster presentation. The selection has been conducted based on reviews submitted by an expert team comprising 21 PC members spanning 12 countries and 6 continents with a good balance of industry and academia.
【Keywords】: phd; workshop
【Paper Link】 【Pages】:2778-2779
【Authors】: George H. L. Fletcher ; Prasenjit Mitra
【Abstract】: We give an overview of WIDM 2012, held in conjunction with CIKM 2012 in Maui, Hawaii. WIDM 2012 is the twelfth in a series of international workshops on Web Information and Data Management held in conjunction with CIKM since 1998. The objective of the workshop is to bring together researchers and industrial practitioners to present and discuss leading research into how web data and information can be extracted, stored, analyzed, and processed to provide useful knowledge to end users for advanced database and web applications.
【Keywords】: web data; web exploration; web information; web mining
【Paper Link】 【Pages】:2780-2781
【Authors】: Matteo Golfarelli ; Il-Yeol Song
【Abstract】: The ACM DOLAP workshop presents research on data warehousing and On-Line Analytical Processing (OLAP). The DOLAP 2012 program is organized in four sessions: data warehouse design and maintainability, OLAP querying and trends, warehousing of complex data, and performance optimization and benchmarking.
【Keywords】: data warehouse; olap