Proceedings of the Fifth International Conference on Web Search and Web Data Mining, WSDM 2012, Seattle, WA, USA, February 8-12, 2012. ACM 【DBLP Link】
【Paper Link】 【Pages】:1-2
【Authors】: Hal R. Varian
【Abstract】: It is now possible to acquire real time information on economic variables of interest from various commercial sources. I illustrate how one can use Google Trends data to measure the state of the macroeconomy in various sectors, and discuss some of the ramifications for research and policy.
【Keywords】: forecasting
【Paper Link】 【Pages】:3-12
【Authors】: Roelof van Zwol ; Lluis Garcia Pueyo
【Abstract】: The success of image object retrieval systems relies on the visual bag-of-words paradigm, which allows image retrieval systems to adopt a retrieval strategy analogous to text retrieval. In this paper we propose two spatially-aware retrieval strategies for image object retrieval that replaces the vector space model. The advantage of the proposed spatially-aware indexing and retrieval strategies are threefold: (1) It allows for the deployment of small visual vocabularies, (2) the number of images evaluated at retrieval time is significantly reduced, and (3) it eliminates the need for a post-retrieval phase, which is normally used to test the spatial composition of the visual words in the retrieved images. The first spatially-aware retrieval strategy explores the direct neighbourhood of two local features for common visual words to determine the similarity of the region surrounding the local features. The second strategy embeds the spatial composition of its neighbourhood directly in the index using edge signatures. Both strategies rely on the coherence of the neighbourhood of points in different images containing similar objects. The comparison of the spatially-aware retrieval strategies against the vector space baseline shows a significant improvement in terms of early precision, and at the same time significantly reduce the number of candidates to be considered at retrieval time.
【Keywords】: image retrieval; quantization error; spatially-aware indexing
【Paper Link】 【Pages】:13-22
【Authors】: Yuan Cao Zhang ; Diarmuid Ó Séaghdha ; Daniele Quercia ; Tamas Jambor
【Abstract】: Recommendation systems exist to help users discover content in a large body of items. An ideal recommendation system should mimic the actions of a trusted friend or expert, producing a personalised collection of recommendations that balance between the desired goals of accuracy, diversity, novelty and serendipity. We introduce the Auralist recommendation framework, a system that - in contrast to previous work - attempts to balance and improve all four factors simultaneously. Using a collection of novel algorithms inspired by principles of "serendipitous discovery", we demonstrate a method of successfully injecting serendipity, novelty and diversity into recommendations whilst limiting the impact on accuracy. We evaluate Auralist quantitatively over a broad set of metrics and, with a user study on music recommendation, show that Auralist's emphasis on serendipity indeed improves user satisfaction.
【Keywords】: accuracy; collaborative filtering; diversification; metrics; novelty; recommender systems; serendipity
【Paper Link】 【Pages】:23-32
【Authors】: Hanqiang Cheng ; Yu-Li Liang ; Xinyu Xing ; Xue Liu ; Richard Han ; Qin Lv ; Shivakant Mishra
【Abstract】: Online video chat services, such as Chatroulette, Omegle, and vChatter are becoming increasingly popular and have attracted millions of users. One critical problem encountered in such applications is the presence of misbehaving users ("flashers") and obscene content. Automatically filtering out obscene content from these systems in an efficient manner poses a difficult challenge. This paper presents a novel Fine-Grained Cascaded (FGC) classification solution that significantly speeds up the compute-intensive process of classifying misbehaving users by dividing image feature extraction into multiple stages and filtering out easily classified images in earlier stages, thus saving unnecessary computation costs of feature extraction in later stages. Our work is further enhanced by integrating new webcam-related contextual information (illumination and color) into the classification process, and a 2-stage soft margin SVM algorithm for combining multiple features. Evaluation results using real-world data set obtained from Chatroulette show that the proposed FGC based classification solution significantly outperforms state-of-the-art techniques.
【Keywords】: misbehaving user; online video chat
【Paper Link】 【Pages】:33-42
【Authors】: Haipeng Zhang ; Mohammed Korayem ; Erkang You ; David J. Crandall
【Abstract】: Studying relationships between keyword tags on social sharing websites has become a popular topic of research, both to improve tag suggestion systems and to discover connections between the concepts that the tags represent. Existing approaches have largely relied on tag co-occurrences. In this paper, we show how to find connections between tags by comparing their distributions over time and space, discovering tags with similar geographic and temporal patterns of use. Geo-spatial, temporal and geo-temporal distributions of tags are extracted and represented as vectors which can then be compared and clustered. Using a dataset of tens of millions of geo-tagged Flickr photos, we show that we can cluster Flickr photo tags based on their geographic and temporal patterns, and we evaluate the results both qualitatively and quantitatively using a panel of human judges. We also develop visualizations of temporal and geographic tag distributions, and show that they help humans recognize semantic relationships between tags. This approach to finding and visualizing similar tags is potentially useful for exploring any data having geographic and temporal annotations.
【Keywords】: flickr; geo-spatial and temporal clustering; tag semantics and visualization
【Paper Link】 【Pages】:43-52
【Authors】: Nilesh N. Dalvi ; Ravi Kumar ; Bo Pang
【Abstract】: Despite their 140-character limitation, tweets embody a lot of valuable information, especially temporal and spatial. In this paper we study the geographic aspects of tweets, for a given object domain. We propose a user-level model for spatial encoding in tweets that goes beyond the explicit geo-coding or place name mentions; this model can be used to match objects to tweets. We illustrate our model and methodology using restaurants as the objects, and show a significant improvement in performance over using standard language models. En route, we obtain a method to geolocate users who tweet about geolocated objects; this may be of independent interest.
【Keywords】: language model; object matching; spatial model; tweets
【Paper Link】 【Pages】:53-62
【Authors】: George Papadakis ; Ekaterini Ioannou ; Claudia Niederée ; Themis Palpanas ; Wolfgang Nejdl
【Abstract】: A prerequisite for leveraging the vast amount of data available on the Web is Entity Resolution, i.e., the process of identifying and linking data that describe the same real-world objects. To make this inherently quadratic process applicable to large data sets, blocking is typically employed: entities (records) are grouped into clusters - the blocks - of matching candidates and only entities of the same block are compared. However, novel blocking techniques are required for dealing with the noisy, heterogeneous, semi-structured, user-generateddata in the Web, as traditional blocking techniques are inapplicable due to their reliance on schema information. The introduction of redundancy, improves the robustness of blocking methods but comes at the price of additional computational cost. In this paper, we present methods for enhancing the efficiency of redundancy-bearing blocking methods, such as our attribute-agnostic blocking approach. We introduce novel blocking schemes that build blocks based on a variety of evidences, including entity identifiers and relationships between entities; they significantly reduce the required number of comparisons, while maintaining blocking effectiveness at very high levels. We also introduce two theoretical measures that provide a reliable estimation of the performance of a blocking method, without requiring the analytical processing of its blocks. Based on these measures, we develop two techniques for improving the performance of blocking: combining individual, complementary blocking schemes, and purging blocks until given criteria are satisfied. We test our methods through an extensive experimental evaluation, using a voluminous data set with 182 million heterogeneous entities. The outcomes of our study show the applicability and the high performance of our approach.
【Keywords】: attribute-agnostic blocking; data cleaning; entity resolution
【Paper Link】 【Pages】:63-72
【Authors】: Yi Fang ; Luo Si ; Naveen Somasundaram ; Zhengtao Yu
【Abstract】: This paper presents a novel opinion mining research problem, which is called Contrastive Opinion Modeling (COM). Given any query topic and a set of text collections from multiple perspectives, the task of COM is to present the opinions of the individual perspectives on the topic, and furthermore to quantify their difference. This general problem subsumes many interesting applications, including opinion summarization and forecasting, government intelligence and cross-cultural studies. We propose a novel unsupervised topic model for contrastive opinion modeling. It simulates the generative process of how opinion words occur in the documents of different collections. The ad hoc opinion search process can be efficiently accomplished based on the learned parameters in the model. The difference of perspectives can be quantified in a principled way by the Jensen-Shannon divergence among the individual topic-opinion distributions. An extensive set of experiments have been conducted to evaluate the proposed model on two datasets in the political domain: 1) statement records of U.S. senators; 2) world news reports from three representative media in U.S., China and India, respectively. The experimental results with both qualitative and quantitative analysis have shown the effectiveness of the proposed model.
【Keywords】: contrastive opinions; opinion mining; opinion retrieval; topic modeling
【Paper Link】 【Pages】:73-82
【Authors】: Partha Pratim Talukdar ; Derry Tanti Wijaya ; Tom M. Mitchell
【Abstract】: Recent research has made significant advances in automatically constructing knowledge bases by extracting relational facts (e.g., Bill Clinton-presidentOf-US) from large text corpora. Temporally scoping such relational facts in the knowledge base (i.e., determining that Bill Clinton-presidentOf-US is true only during the period 1993 - 2001) is an important, but relatively unexplored problem. In this paper, we propose a joint inference framework for this task, which leverages fact-specific temporal constraints, and weak supervision in the form of a few labeled examples. Our proposed framework, CoTS (Coupled Temporal Scoping), exploits temporal containment, alignment, succession, and mutual exclusion constraints among facts from within and across relations. Our contribution is multi-fold. Firstly, while most previous research has focused on micro-reading approaches for temporal scoping, we pose it in a macro-reading fashion, as a change detection in a time series of facts' features computed from a large number of documents. Secondly, to the best of our knowledge, there is no other work that has used joint inference for temporal scoping. We show that joint inference is effective compared to doing temporal scoping of individual facts independently. We conduct our experiments on large scale open-domain publicly available time-stamped datasets, such as English Gigaword Corpus and Google Books Ngrams, demonstrating CoTS's effectiveness.
【Keywords】: joint inference; knowledge base; temporal scoping
【Paper Link】 【Pages】:83-92
【Authors】: Anirban Dasgupta ; Maxim Gurevich ; Liang Zhang ; Belle L. Tseng ; Achint Oommen Thomas
【Abstract】: Many large Internet websites are accessed by users anonymously, without requiring registration or logging-in. However, to provide personalized service these sites build anonymous, yet persistent, user models based on repeated user visits. Cookies, issued when a web browser first visits a site, are typically employed to anonymously associate a website visit with a distinct user (web browser). However, users may reset cookies, making such association short-lived and noisy. In this paper we propose a solution to the cookie churn problem: a novel algorithm for grouping similar cookies into clusters that are more persistent than individual cookies. Such clustering could potentially allow more robust estimation of the number of unique visitors of the site over a certain long time period, and also better user modeling which is key to plenty of web applications such as advertising and recommender systems. We present a novel method to cluster browser cookies into groups that are likely to belong to the same browser based on a statistical model of browser visitation patterns. We address each step of the clustering as a binary classification problem estimating the probability that two different subsets of cookies belong to the same browser. We observe that our clustering problem is a generalized interval graph coloring problem, and propose a greedy heuristic algorithm for solving it. The scalability of this method allows us to cluster hundreds of millions of browser cookies and provides significant improvements over baselines such as constrained K-means.
【Keywords】: bayes factor; browser cookie churn; clustering algorithms; distributed computing; similarity measure
【Paper Link】 【Pages】:93-102
【Authors】: Jiliang Tang ; Huiji Gao ; Huan Liu
【Abstract】: Traditionally, research about trust assumes a single type of trust between users. However, trust, as a social concept, inherently has many facets indicating multiple and heterogeneous trust relationships between users. Due to the presence of a large trust network for an online user, it is necessary to discern multi-faceted trust as there are naturally experts of different types. Our study in product review sites reveals that people place trust differently to different people. Since the widely used adjacency matrix cannot capture multi-faceted trust relationships between users, we propose a novel approach by incorporating these relationships into traditional rating prediction algorithms to reliably estimate their strengths. Our work results in interesting findings such as heterogeneous pairs of reciprocal links. Experimental results on real-world data from Epinions and Ciao show that our work of discerning multi-faceted trust can be applied to improve the performance of tasks such as rating prediction, facet-sensitive ranking, and status theory.
【Keywords】: heterogeneous trust; multi-dimension tie strength; multi-faceted trust; trust network
【Paper Link】 【Pages】:103-112
【Authors】: Marc Najork ; Dennis Fetterly ; Alan Halverson ; Krishnaram Kenthapadi ; Sreenivas Gollapudi
【Abstract】: Many phenomena and artifacts such as road networks, social networks and the web can be modeled as large graphs and analyzed using graph algorithms. However, given the size of the underlying graphs, efficient implementation of basic operations such as connected component analysis, approximate shortest paths, and link-based ranking (e.g. PageRank) becomes challenging. This paper presents an empirical study of computations on such large graphs in three well-studied platform models, viz., a relational model, a data-parallel model, and a special-purpose in-memory model. We choose a prototypical member of each platform model and analyze the computational efficiencies and requirements for five basic graph operations used in the analysis of real-world graphs viz., PageRank, SALSA, Strongly Connected Components (SCC), Weakly Connected Components (WCC), and Approximate Shortest Paths (ASP). Further, we characterize each platform in terms of these computations using model-specific implementations of these algorithms on a large web graph. Our experiments show that there is no single platform that performs best across different classes of operations on large graphs. While relational databases are powerful and flexible tools that support a wide variety of computations, there are computations that benefit from using special-purpose storage systems and others that can exploit data-parallel platforms.
【Keywords】: data-parallel computing; databases; graph algorithms; graph servers; very large graphs
【Paper Link】 【Pages】:113-122
【Authors】: Bo Long ; Yi Chang ; Anlei Dong ; Jianzhang He
【Abstract】: Learning to rank arises in many information retrieval applications, ranging from Web search engine, online advertising to recommendation systems. Traditional ranking mainly focuses on one type of data source, and effective modeling relies on a sufficiently large number of labeled examples, which require expensive and time-consuming labeling process. However, in many real-world applications, ranking over multiple related heterogeneous domains becomes a common situation, where in some domains we may have a relatively large amount of training data while in some other domains we can only collect very little. Theretofore, how to leverage labeled information from related heterogeneous domain to improve ranking in a target domain has become a problem of great interests. In this paper, we propose a novel probabilistic model, pairwise cross-domain factor model, to address this problem. The proposed model learns latent factors(features) for multi-domain data in partially-overlapped heterogeneous feature spaces. It is capable of learning homogeneous feature correlation, heterogeneous feature correlation, and pairwise preference correlation for cross-domain knowledge transfer. We also derive two PCDF variations to address two important special cases. Under the PCDF model, we derive a stochastic gradient based algorithm, which facilitates distributed optimization and is flexible to adopt different loss functions and regularization functions to accommodate different data distributions. The extensive experiments on real world data sets demonstrate the effectiveness of the proposed model and algorithm.
【Keywords】: heterogeneous transfer ranking; homogeneous transfer ranking; pairwise cross-domain factor model; ranking; source domain; stochastic gradient descent; target domain
【Paper Link】 【Pages】:123-132
【Authors】: Amr Ahmed ; Mohamed Aly ; Joseph Gonzalez ; Shravan M. Narayanamurthy ; Alexander J. Smola
【Abstract】: Latent variable techniques are pivotal in tasks ranging from predicting user click patterns and targeting ads to organizing the news and managing user generated content. Latent variable techniques like topic modeling, clustering, and subspace estimation provide substantial insight into the latent structure of complex data with little or no external guidance making them ideal for reasoning about large-scale, rapidly evolving datasets. Unfortunately, due to the data dependencies and global state introduced by latent variables and the iterative nature of latent variable inference, latent-variable techniques are often prohibitively expensive to apply to large-scale, streaming datasets. In this paper we present a scalable parallel framework for efficient inference in latent variable models over streaming web-scale data. Our framework addresses three key challenges: 1) synchronizing the global state which includes global latent variables (e.g., cluster centers and dictionaries); 2) efficiently storing and retrieving the large local state which includes the data-points and their corresponding latent variables (e.g., cluster membership); and 3) sequentially incorporating streaming data (e.g., the news). We address these challenges by introducing: 1) a novel delta-based aggregation system with a bandwidth-efficient communication protocol; 2) schedule-aware out-of-core storage; and 3) approximate forward sampling to rapidly incorporate new data. We demonstrate state-of-the-art performance of our framework by easily tackling datasets two orders of magnitude larger than those addressed by the current state-of-the-art. Furthermore, we provide an optimized and easily customizable open-source implementation of the framework1.
【Keywords】: graphical models; inference; large-scale systems; latent models
【Paper Link】 【Pages】:133-142
【Authors】: Steffen Rendle
【Abstract】: Many factorization models like matrix or tensor factorization have been proposed for the important application of recommender systems. The success of such factorization models depends largely on the choice of good values for the regularization parameters. Without a careful selection they result in poor prediction quality as they either underfit or overfit the data. Regularization values are typically determined by an expensive search that requires learning the model parameters several times: once for each tuple of candidate values for the regularization parameters. In this paper, we present a new method that adapts the regularization automatically while training the model parameters. To achieve this, we optimize simultaneously for two criteria: (1) as usual the model parameters for the regularized objective and (2) the regularization of future parameter updates for the best predictive quality on a validation set. We develop this for the generic model class of Factorization Machines which subsumes a wide variety of factorization models. We show empirically, that the advantages of our adaptive regularization method compared to expensive hyperparameter search do not come to the price of worse predictive quality. In total with our method, learning regularization parameters is as easy as learning model parameters and thus there is no need for any time-consuming search of regularization values because they are found on-the-fly. This makes our method highly attractive for practical use.
【Keywords】: matrix factorization; regularization; tensor factorization
【Paper Link】 【Pages】:143-152
【Authors】: Suhrid Balakrishnan ; Sumit Chopra
【Abstract】: Typical recommender systems use the root mean squared error (RMSE) between the predicted and actual ratings as the evaluation metric. We argue that RMSE is not an optimal choice for this task, especially when we will only recommend a few (top) items to any user. Instead, we propose using a ranking metric, namely normalized discounted cumulative gain (NDCG), as a better evaluation metric for this task. Borrowing ideas from the learning to rank community for web search, we propose novel models which approximately optimize NDCG for the recommendation task. Our models are essentially variations on matrix factorization models where we also additionally learn the features associated with the users and the items for the ranking task. Experimental results on a number of standard collaborative filtering data sets validate our claims. The results also show the accuracy and efficiency of our models and the benefits of learning features for ranking.
【Keywords】: collaborative ranking; learning to rank; ndcg; recommender systems; rmse
【Paper Link】 【Pages】:153-162
【Authors】: Gianmarco De Francisci Morales ; Aristides Gionis ; Claudio Lucchese
【Abstract】: We propose a new methodology for recommending interesting news to users by exploiting the information in their twitter persona. We model relevance between users and news articles using a mix of signals drawn from the news stream and from twitter: the profile of the social neighborhood of the users, the content of their own tweet stream, and topic popularity in the news and in the whole twitter-land. We validate our approach on a real-world dataset of approximately 40k articles coming from Yahoo! News and one month of crawled twitter data. We train our model using a learning-to-rank approach and support-vector machines. The train and test set are drawn from Yahoo! toolbar log data. We heuristically identify 3214 users of twitter in the log and use their clicks on news articles to train our system. Our methodology is able to predict with good accuracy the news articles clicked by the users and rank them higher than other news articles. The results show that the combination of various signals from real-time Web and micro-blogging platforms can be a useful resource to understand user behavior.
【Keywords】: micro-blogging applications; news recommendation; personalization; real-time web; recommendation systems
【Paper Link】 【Pages】:163-172
【Authors】: Samaneh Moghaddam ; Mohsen Jamali ; Martin Ester
【Abstract】: Online reviews are valuable sources of information for a variety of decision-making processes such as purchasing products. As the number of online reviews is growing rapidly, it becomes increasingly difficult for users to identify those that are helpful. This has motivated research into the problem of identifying high quality and helpful reviews automatically. The current methods assume that the helpfulness of a review is independent from the readers of that review. However, we argue that the quality of a review may not be the same for different users. For example, a professional and an amateur photographer may rate the helpfulness of a review very differently. In this paper, we introduce the problem of predicting a personalized review quality for recommendation of helpful reviews. To address this problem, we propose a series of increasingly sophisticated probabilistic graphical models, based on Matrix Factorization and Tensor Factorization. We evaluate the proposed models using a database of 1.5 million reviews and more than 13 million quality ratings obtained from Epinions.com. The experiments demonstrate that the proposed latent factor models outperform the state-of-the art approaches using textual and social features. Finally, our experiments confirm that the helpfulness of a review is indeed not the same for all users and that there are some latent factors that affect a user's evaluation of the review quality.
【Keywords】: matrix factorization; personalized review quality prediction; review recommendation; tensor factorization
【Paper Link】 【Pages】:173-182
【Authors】: Artus Krohn-Grimberghe ; Lucas Drumond ; Christoph Freudenthaler ; Lars Schmidt-Thieme
【Abstract】: A key element of the social networks on the internet such as Facebook and Flickr is that they encourage users to create connections between themselves, other users and objects. One important task that has been approached in the literature that deals with such data is to use social graphs to predict user behavior (e.g. joining a group of interest). More specifically, we study the cold-start problem, where users only participate in some relations, which we will call social relations, but not in the relation on which the predictions are made, which we will refer to as target relations. We propose a formalization of the problem and a principled approach to it based on multi-relational factorization techniques. Furthermore, we derive a principled feature extraction scheme from the social data to extract predictors for a classifier on the target relation. Experiments conducted on real world datasets show that our approach outperforms current methods.
【Keywords】: cold-start; item prediction; item recommendation; joint factorization; matrix factorization; multi-relational learning; ranking; recommender systems; social network
【Paper Link】 【Pages】:183-192
【Authors】: Ravi Kant ; Srinivasan H. Sengamedu ; Krishnan S. Kumar
【Abstract】: Comments are supported by several web sites to increase user participation. Users can usually comment on a variety of media types - photos, videos, news articles, blogs, etc. Comment spam is one of the biggest challenges facing this feature. The traditional approach to combat spam is to train classifiers using various machine learning techniques. Since the commonly used classifiers work on the entire comment text, it is easy to mislead them by embedding spam content in good content. In this paper, we make several contributions towards comment spam detection. (1) We propose a new framework for spam detection that is immune to embed attacks. We characterize spam by a set of frequently occurring sequential patterns. (2) We introduce a variant (called min-closed) of the frequent closed sequence mining problem that succinctly captures all the frequently occurring patterns. We prove as well as experimentally show that the set of min-closed sequences is an order of magnitude smaller than the set of closed sequences and yet has exactly the same coverage. (3) We describe MCPRISM, extension of the recently published PRISM algorithm that effectively mines min-closed sequences, using prime encoding. In the process, we solve the open problem of using the prime-encoding technique to speed up traditional closed sequence mining. (4) We finally need to whittle down the set of frequent subsequences to a small set without sacrificing coverage. This problem is NP-Hard but we show that the coverage function is submodular and hence the greedy heuristic gives a fast algorithm that is close to optimal. We then describe the experiments that were carried out on a large real world comment data and the publicly available Gazelle dataset. (1) We show that nearly 80% of spam on real world data can be effectively captured by the mined sequences at very low false positive rates. (2) The sequences mined are highly discriminative. (3) On Gazelle data, the proposed algorithmic enhancements are faster by at least by a factor and by an order of magnitude on the larger comment dataset.
【Keywords】: comments spam detection; frequent subsequence mining
【Paper Link】 【Pages】:193-202
【Authors】: Hadi Amiri ; Tat-Seng Chua
【Abstract】: Current opinion lexicons contain most of the common opinion words, but they miss slang and so-called urban opinion words and phrases (e.g. delish, cozy, yummy, nerdy, and yuck). These subjectivity clues are frequently used in community questions and are useful for opinion question analysis. This paper introduces a principled approach to constructing an opinion lexicon for community-based question answering (cQA) services. We formulate the opinion lexicon induction as a semi-supervised learning task in the graph context. Our method makes use of existing opinion words to extract new opinion entities (slang and urban words/phrases) from community questions. It then models the opinion entities in a graph context to learn the polarity of the new opinion entities based on the graph connectivity information. In contrast to previous approaches, our method not only learns such polarities from the labeled data but also from the unlabeled data and is more feasible in the web context where the dictionary-based relations (such as synonym, antonym, or hyponym) between most words are not available for constructing a high quality graph. The experiments show that our approach is effective both in terms of the quality of the discovered new opinion entities as well as its ability in inferring their polarity. Furthermore, since the value of opinion lexicons lies in their usefulness in applications, we show the utility of the constructed lexicon in the sentiment classification task.
【Keywords】: opinion lexicon; opinion mining; sentiment analysis; sentiment orientation; slang; urban word
【Paper Link】 【Pages】:203-212
【Authors】: Jeff Huang ; Thomas Lin ; Ryen W. White
【Abstract】: Today's Web browsers allow users to open links in new windows or tabs. This action, which we call 'branching', is sometimes performed on search results when the user plans to eventually visit multiple results. We detect branching behavior on a large commercial search engine with a client-side script on the results page. Two-fifths of all users spawned new tabs on search results in the timeframe of our study; branching usage varied with different query types and vertical. Both branching and backtracking are viable methods for visiting multiple search results. To understand user search strategies, we treat multiple result clicks following a query as ordered events to understand user search strategies. Users branching in a query are more likely to click search results from top to bottom, while users who backtrack are less likely to do so; this is especially true for queries involving more than two clicks. These findings inform an experiment in which we take a popular click model and modify it to account for the differing user behavior when branching. By understanding that users continue examining search results before viewing a branched result, we can improve the click model for branching queries.
【Keywords】: browser tabs; click models; examining search results
【Paper Link】 【Pages】:213-222
【Authors】: Jin Young Kim ; Kevyn Collins-Thompson ; Paul N. Bennett ; Susan T. Dumais
【Abstract】: A user's expertise or ability to understand a document on a given topic is an important aspect of that document's relevance. However, this aspect has not been well-explored in information retrieval systems, especially those at Web scale where the great diversity of content, users, and tasks presents an especially challenging search problem. To help improve our modeling and understanding of this diversity, we apply automatic text classifiers, based on reading difficulty and topic prediction, to estimate a novel type of profile for important entities in Web search -- users, websites, and queries. These profiles capture topic and reading level distributions, which we then use in conjunction with search log data to characterize and compare different entities. We find that reading level and topic distributions provide an important new representation of Web content and user interests, and that using both together is more effective than using either one separately. In particular we find that: 1) the reading level of Web content and the diversity of visitors to a website can vary greatly by topic; 2) the degree to which a user's profile matches with a site's profile is closely correlated with the user's preference of the website in search results, and 3) site or URL profiles can be used to predict 'expertness' whether a given site or URL is oriented toward expert vs. non-expert users. Our findings provide strong evidence in favor of jointly incorporating reading level and topic distribution metadata into a variety of critical tasks in Web information systems.
【Keywords】: domain expertise; log analysis; reading level prediction; topic prediction; web search
【Paper Link】 【Pages】:223-232
【Authors】: Ugo Scaiella ; Paolo Ferragina ; Andrea Marino ; Massimiliano Ciaramita
【Abstract】: Search results clustering (SRC) is a challenging algorithmic problem that requires grouping together the results returned by one or more search engines in topically coherent clusters, and labeling the clusters with meaningful phrases describing the topics of the results included in them. In this paper we propose to solve SRC via an innovative approach that consists of modeling the problem as the labeled clustering of the nodes of a newly introduced graph of topics. The topics are Wikipedia-pages identified by means of recently proposed topic annotators [9, 11, 16, 20] applied to the search results, and the edges denote the relatedness among these topics computed by taking into account the linkage of the Wikipedia-graph. We tackle this problem by designing a novel algorithm that exploits the spectral properties and the labels of that graph of topics. We show the superiority of our approach with respect to academic state-of-the-art work [6] and well-known commercial systems (CLUSTY and LINGO3G) by performing an extensive set of experiments on standard datasets and user studies via Amazon Mechanical Turk. We test several standard measures for evaluating the performance of all systems and show a relative improvement of up to 20%.
【Keywords】: search result clustering; spectral clustering; topical annotation; user study
【Paper Link】 【Pages】:233-242
【Authors】: Chenhao Tan ; Evgeniy Gabrilovich ; Bo Pang
【Abstract】: Imagine a physician and a patient doing a search on antibiotic resistance. Or a chess amateur and a grandmaster conducting a search on Alekhine's Defence. Although the topic is the same, arguably the two users in each case will satisfy their information needs with very different texts. Yet today search engines mostly adopt the one-size-fits-all solution, where personalization is restricted to topical preference. We found that users do not uniformly prefer simple texts, and that the text comprehensibility level should match the user's level of preparedness. Consequently, we propose to model the comprehensibility of texts as well as the users' reading proficiency in order to better explain how different users choose content for further exploration. We also model topic-specific reading proficiency, which allows us to better explain why a physician might choose to read sophisticated medical articles yet simple descriptions of SLR cameras. We explore different ways to build user profiles, and use collaborative filtering techniques to overcome data sparsity. We conducted experiments on large-scale datasets from a major Web search engine and a community question answering forum. Our findings confirm that explicitly modeling text comprehensibility can significantly improve content ranking (search results or answers, respectively).
【Keywords】: personalization; re-ranking; text comprehensibility; user modeling
【Paper Link】 【Pages】:243-252
【Authors】: Bhavana Bharat Dalvi ; William W. Cohen ; Jamie Callan
【Abstract】: We describe a open-domain information extraction method for extracting concept-instance pairs from an HTML corpus. Most earlier approaches to this problem rely on combining clusters of distributionally similar terms and concept-instance pairs obtained with Hearst patterns. In contrast, our method relies on a novel approach for clustering terms found in HTML tables, and then assigning concept names to these clusters using Hearst patterns. The method can be efficiently applied to a large corpus, and experimental results on several datasets show that our method can accurately extract large numbers of concept-instance pairs.
【Keywords】: clustering; hyponymy relation acquisition; web mining
【Paper Link】 【Pages】:253-262
【Authors】: Pallika H. Kanani ; Andrew McCallum
【Abstract】: Given a database with missing or uncertain content, our goal is to correct and fill the database by extracting specific information from a large corpus such as the Web, and to do so under resource limitations. We formulate the information gathering task as a series of choices among alternative, resource-consuming actions and use reinforcement learning to select the best action at each time step. We use temporal difference q-learning method to train the function that selects these actions, and compare it to an online, error-driven algorithm called SampleRank. We present a system that finds information such as email, job title and department affiliation for the faculty at our university, and show that the learning-based approach accomplishes this task efficiently under a limited action budget. Our evaluations show that we can obtain 92.4% of the final F1, by only using 14.3% of all possible actions.
【Keywords】: active information acquisition; reinforcement learning; resource-bounded information extraction; web mining
【Paper Link】 【Pages】:263-272
【Authors】: Debmalya Panigrahi ; Atish Das Sarma ; Gagan Aggarwal ; Andrew Tomkins
【Abstract】: The phenomenal growth in the volume of easily accessible information via various web-based services has made it essential for service providers to provide users with personalized representative summaries of such information. Further, online commercial services including social networking and micro-blogging websites, e-commerce portals, leisure and entertainment websites, etc. recommend interesting content to users that is simultaneously diverse on many different axes such as topic, geographic specificity, etc. The key algorithmic question in all these applications is the generation of a succinct, representative, and relevant summary from a large stream of data coming from a variety of sources. In this paper, we formally model this optimization problem, identify its key structural characteristics, and use these observations to design an extremely scalable and efficient algorithm. We analyze the algorithm using theoretical techniques to show that it always produces a nearly optimal solution. In addition, we perform large-scale experiments on both real-world and synthetically generated datasets, which confirm that our algorithm performs even better than its analytical guarantees in practice, and also outperforms other candidate algorithms for the problem by a wide margin.
【Keywords】: online algorithm; result diversity
【Paper Link】 【Pages】:273-282
【Authors】: Reid Andersen ; David F. Gleich ; Vahab S. Mirrokni
【Abstract】: Most graph decomposition procedures seek to partition a graph into disjoint sets of vertices. Motivated by applications of clustering in distributed computation, we describe a graph decomposition algorithm for the paradigm where the partitions intersect. This algorithm covers the vertex set with a collection of overlapping clusters. Each vertex in the graph is well-contained within some cluster in the collection. We then describe a framework for distributed computation across a collection of overlapping clusters and describe how this framework can be used in various algorithms based on the graph diffusion process. In particular, we focus on two illustrative examples: (i) the simulation of a randomly walking particle and (ii) the solution of a linear system, e.g. PageRank. Our simulation results for these two cases show a significant reduction in swapping between clusters in a random walk, a significant decrease in communication volume during a linear system solve in a geometric mesh, and some ability to reduce the communication volume during a linear system solve in an information network.
【Keywords】: conductance; covering; diffusion; leave-time; pagerank
【Paper Link】 【Pages】:283-292
【Authors】: Panagiotis Papadimitriou ; Hector Garcia-Molina
【Abstract】: In sponsored search auctions advertisers compete for ad slots in the search engine results page, by bidding on keywords of interest. To improve advertiser expressiveness, we augment the bidding process with conflict constraints. With such constraints, advertisers can condition their bids on the non-appearance of certain undesired ads on the results page. We study the complexity of the allocation problem in these augmented SSA and we introduce an algorithm that can efficiently allocate the ad slots to advertisers. We evaluate the algorithm run time in simulated conflict scenarios and we study the implications of the conflict constraints on search engine revenue. Our results show that the allocation problem can be solved within few tens of milliseconds and that the adoption of conflict constraints can potentially increase search engine revenue.
【Keywords】: auction design; branch-and-bound; conflicts; sponsored search
【Paper Link】 【Pages】:293-302
【Authors】: Rómer Rosales ; Haibin Cheng ; Eren Manavoglu
【Abstract】: In on-line search and display advertising, the click-trough rate (CTR) has been traditionally a key measure of ad/campaign effectiveness. More recently, the market has gained interest in more direct measures of profitability, one early alternative is the conversion rate (CVR). CVRs measure the proportion of certain users who take a predefined, desirable action, such as a purchase, registration, download, etc.; as compared to simply page browsing. We provide a detailed analysis of conversion rates in the context of non-guaranteed delivery targeted advertising. In particular we focus on the post-click conversion (PCC) problem or the analysis of conversions after a user click on a referring ad. The key elements we study are the probability of a conversion given a click in a user/page context, P(conversion | click, context). We provide various fundamental properties of this process based on contextual information, formalize the problem of predicting PCC, and propose an approach for measuring attribute relevance when the underlying attribute distribution is non-stationary. We provide experimental analyses based on logged events from a large-scale advertising platform.
【Keywords】: conversion modeling; conversion rate; display advertising; non-guaranteed delivery; post-click conversion
【Paper Link】 【Pages】:303-312
【Authors】: Danqing Xu ; Yiqun Liu ; Min Zhang ; Shaoping Ma ; Liyun Ru
【Abstract】: Click-through behaviors are treated as invaluable sources of user feedback and they have been leveraged in several commercial search engines in recent years. However, estimating unbiased relevance is always a challenging task because of position bias. To solve this problem, many researchers have proposed a variety of assumptions to model click-through behaviors. Most of these models share a common examination hypothesis, which is that users examine search results from the top to the bottom. Nevertheless, this model cannot draw a complete picture of information-seeking behaviors. Many eye-tracking studies find that user interactions are not sequential but contain revisiting patterns. If a user clicks on a higher ranked document after having clicked on a lower-ranked one, we call this scenario a revisiting pattern, and we believe that the revisiting patterns are important signals regarding a user's click preferences. This paper incorporates revisiting behaviors into click models and introduces a novel click model named Temporal Hidden Click Model (THCM). This model dynamically models users' click behaviors with a temporal order. In our experiment, we collect over 115 million query sessions from a widely-used commercial search engine and then conduct a comparative analysis between our model and several state-of-the-art click models. The experimental results show that the THCM model achieves a significant improvement in the Normalized Discounted Cumulative Gain (NDCG), the click perplexity and click distributions metrics.
【Keywords】: click-through behavior; document relevance; revisit; temporal hidden click model
【Paper Link】 【Pages】:313-322
【Authors】: Weizhu Chen ; Dong Wang ; Yuchen Zhang ; Zheng Chen ; Adish Singla ; Qiang Yang
【Abstract】: Recent advances in click model have established it as an attractive approach to infer document relevance. Most of these advances consider the user click/skip behavior as binary events but neglect the context in which a click happens. We show that real click behavior in industrial search engines is often noisy and not always a good indication of relevance. For a considerable percentage of clicks, users select what turn out to be irrelevant documents and these clicks should not be directly used as evidence for relevance inference. Thus in this paper, we put forward an observation that the relevance indication degree of a click is not a constant, but can be differentiated by user preferences and the context in which the user makes her click decision. In particular, to interpret the click behavior discriminatingly, we propose a Noise-aware Click Model (NCM) by characterizing the noise degree of a click, which indicates the quality of the click for inferring relevance. Specifically, the lower the click noise is, the more important the click is in its role for relevance inference. To verify the necessity of explicitly accounting for the uninformative noise in a user click, we conducted experiments on a billion-scale dataset. Extensive experimental results demonstrate that as compared with two state-of-the-art click models in Web Search, NCM can better interpret user click behavior and achieve significant improvements in terms of both perplexity and NDCG.
【Keywords】: click model; click noise; log analysis
【Paper Link】 【Pages】:323-332
【Authors】: Si Shen ; Botao Hu ; Weizhu Chen ; Qiang Yang
【Abstract】: Click modeling aims to interpret the users' search click data in order to predict their clicking behavior. Existing models can well characterize the position bias of documents and snippets in relation to users' mainstream click behavior. Yet, current advances depict users' search actions only in a general setting by implicitly assuming that all users act in the same way, regardless of the fact that anyone, motivated with some individual interest, is more likely to click on a link than others. It is in light of this that we put forward a novel personalized click model to describe the user-oriented click preferences, which applies and extends matrix / tensor factorization from the view of collaborative filtering to connect users, queries and documents together. Our model serves as a generalized personalization framework that can be incorporated to the previously proposed click models and, in many cases, to their future extensions. Despite the sparsity of search click data, our personalized model demonstrates its advantage over the best click models previously discussed in the Web-search literature, supported by our large-scale experiments on a real dataset. A delightful bonus is the model's ability to gain insights into queries and documents through latent feature vectors, and hence to handle rare and even new query-document pairs much better than previous click models.
【Keywords】: click log analysis; click model; personalization; search engine; user behavior
【Paper Link】 【Pages】:333-342
【Authors】: Amr Ahmed ; Choon Hui Teo ; S. V. N. Vishwanathan ; Alexander J. Smola
【Abstract】: Relevance, diversity and personalization are key issues when presenting content which is apt to pique a user's interest. This is particularly true when presenting an engaging set of news stories. In this paper we propose an efficient algorithm for selecting a small subset of relevant articles from a streaming news corpus. It offers three key pieces of improvement over past work: 1) It is based on a detailed model of a user's viewing behavior which does not require explicit feedback. 2) We use the notion of submodularity to estimate the propensity of interacting with content. This improves over the classical context independent relevance ranking algorithms. Unlike existing methods, we learn the submodular function from the data. 3) We present an efficient online algorithm which can be adapted for personalization, story adaptation, and factorization models. Experiments show that our system yields a significant improvement over a retrieval system deployed in production.
【Keywords】: graphical models; online learning; personalization; submodularity
【Paper Link】 【Pages】:343-352
【Authors】: Chen Wang ; Keping Bi ; Yunhua Hu ; Hang Li ; Guihong Cao
【Abstract】: In web search, relevance ranking of popular pages is relatively easy, because of the inclusion of strong signals such as anchor text and search log data. In contrast, with less popular pages, relevance ranking becomes very challenging due to a lack of information. In this paper the former is referred to as head pages, and the latter tail pages. We address the challenge by learning a model that can extract search-focused key n-grams from web pages, and using the key n-grams for searches of the pages, particularly, the tail pages. To the best of our knowledge, this problem has not been previously studied. Our approach has four characteristics. First, key n-grams are search-focused in the sense that they are defined as those which can compose "good queries" for searching the page. Second, key n-grams are learned in a relative sense using learning to rank techniques. Third, key n-grams are learned using search log data, such that the characteristics of key n-grams in the search log data, particularly in the heads; can be applied to the other data, particularly to the tails. Fourth, the extracted key n-grams are used as features of the relevance ranking model also trained with learning to rank techniques. Experiments validate the effectiveness of the proposed approach with large-scale web search datasets. The results show that our approach can significantly improve relevance ranking performance on both heads and tails; and particularly tails, compared with baseline approaches. Characteristics of our approach have also been fully investigated through comprehensive experiments.
【Keywords】: key n-gram extraction; learning to rank; ranking; search relevance; tail page
【Paper Link】 【Pages】:353-362
【Authors】: Yang Song ; Dengyong Zhou ; Li-wei He
【Abstract】: Query suggestion is an interactive approach for search engines to better understand users information need. In this paper, we propose a novel query suggestion framework which leverages user re-query feedbacks from search engine logs. Specifically, we mined user query reformulation activities where the user only modifies part of the query by (1) adding terms after the query, (2) deleting terms within the query, or (3) modifying terms to new terms. We build a term-transition graph based on the mined data. Two models are proposed which address topic-level and term-level query suggestions, respectively. In the first topic-based unsupervised Pagerank model, we perform random walk on each of the topic-based term-transition graph and calculate the Pagerank for each term within a topic. Given a new query, we suggest relevant queries based on its topic distribution and term-transition probability within each topic. Our second model resembles the supervised learning-to-rank (LTR) framework, in which term modifications are treated as documents so that each query reformulation is treated as a training instance. A rich set of features are constructed for each (query, document) pair from Pagerank, Wikipedia, N-gram, ODP and so on. This supervised model is capable of suggesting new queries on a term level which addresses the limitation of previous methods. Experiments are conducted on a large data set from a commercial search engine. By comparing the with state-of-the-art query suggestion methods [4, 2], our proposals exhibit significant performance increase for all categories of queries.
【Keywords】: learning-to-rank; pagerank; query suggestion
【Paper Link】 【Pages】:363-372
【Authors】: Yosi Mass ; Yehoshua Sagiv
【Abstract】: In keyword search over data graphs, an answer is a non-redundant subtree that includes the given keywords. This paper focuses on improving the effectiveness of that type of search. A novel approach that combines language models with structural relevance is described. The proposed approach consists of three steps. First, language models are used to assign dynamic, query-dependent weights to the graph. Those weights complement static weights that are pre-assigned to the graph. Second, an existing algorithm returns candidate answers based on their weights. Third, the candidate answers are re-ranked by creating a language model for each one. The effectiveness of the proposed approach is verified on a benchmark of three datasets: IMDB, Wikipedia and Mondial. The proposed approach outperforms all existing systems on the three datasets, which is a testament to its robustness. It is also shown that the effectiveness can be further improved by augmenting keyword queries with very basic knowledge about the structure.
【Keywords】: data graphs; language models; ranking; semantic weights
【Paper Link】 【Pages】:373-382
【Authors】: Georg Buscher ; Ryen W. White ; Susan T. Dumais ; Jeff Huang
【Abstract】: Understanding the impact of individual and task differences on search result page examination strategies is important in developing improved search engines. Characterizing these effects using query and click data alone is common but insufficient since they provide an incomplete picture of result examination behavior. Cursor- or gaze-tracking studies reveal richer interaction patterns but are often done in small-scale laboratory settings. In this paper we leverage large-scale rich behavioral log data in a naturalistic setting. We examine queries, clicks, cursor movements, scrolling, and text highlighting for millions of queries on the Bing commercial search engine to better understand the impact of user, task, and user-task interactions on user behavior on search result pages (SERPs). By clustering users based on cursor features, we identify individual, task, and user-task differences in how users examine results which are similar to those observed in small-scale studies. Our findings have implications for developing search support for behaviorally-similar searcher cohorts, modeling search behavior, and designing search systems that leverage implicit feedback.
【Keywords】: individual differences; rich interaction logging; task differences
【Paper Link】 【Pages】:383-392
【Authors】: Jackie Chi Kit Cheung ; Xiao Li
【Abstract】: One popular form of semantic search observed in several modern search engines is to recognize query patterns that trigger instant answers or domain-specific search, producing semantically enriched search results. This often requires understanding the query intent in addition to the meaning of the query terms in order to access structured data sources. A major challenge in intent understanding is to construct a domain-dependent schema and to annotate search queries based on such a schema, a process that to date has required much manual annotation effort. We present an unsupervised method for clustering queries with similar intent and for producing a pattern consisting of a sequence of semantic concepts and/or lexical items for each intent. Furthermore, we leverage the discovered intent patterns to automatically annotate a large number of queries beyond those used in clustering. We evaluated our method on 10 selected domains, discovering over 1400 intent patterns and automatically annotating 125K (and potentially many more) queries. We found that over 90% of patterns and 80% of instance annotations tested are judged to be correct by a majority of annotators.
【Keywords】: clustering; query intent discovery; semantic search
【Paper Link】 【Pages】:393-402
【Authors】: Virgiliu Pavlu ; Shahzad Rajput ; Peter B. Golbus ; Javed A. Aslam
【Abstract】: The development of information retrieval systems such as search engines relies on good test collections, including assessments of retrieved content. The widely employed Cranfield paradigm dictates that the information relevant to a topic be encoded at the level of documents, therefore requiring effectively complete document relevance assessments. As this is no longer practical for modern corpora, numerous problems arise, including scalability, reusability, and applicability. We propose a new method for relevance assessment based on relevant information, not relevant documents. Once the relevant 'nuggets' are collected, our matching method can assess any document for relevance with high accuracy, and so any retrieved list of documents can be assessed for performance. In this paper we analyze the performance of the matching function by looking at specific cases and by comparing with other methods. We then show how these inferred relevance assessments can be used to perform IR system evaluation, and we discuss in particular reusability and scalability. Our main contribution is a methodology for producing test collections that are highly accurate, more complete, scalable, reusable, and can be generated with similar amounts of effort as existing methods, with great potential for future applications.
【Keywords】: nuggets; test collection
【Paper Link】 【Pages】:403-412
【Authors】: Alexander Kotov ; ChengXiang Zhai
【Abstract】: Query expansion is an important and commonly used technique for improving Web search results. Existing methods for query expansion have mostly relied on global or local analysis of document collection, click-through data, or simple ontologies such as WordNet. In this paper, we present the results of a systematic study of the methods leveraging the ConceptNet knowledge base, an emerging new Web resource, for query expansion. Specifically, we focus on the methods leveraging ConceptNet to improve the search results for poorly performing (or difficult) queries. Unlike other lexico-semantic resources, such as WordNet and Wikipedia, which have been extensively studied in the past, ConceptNet features a graph-based representation model of commonsense knowledge, in which the terms are conceptually related through rich relational ontology. Such representation structure enables complex, multi-step inferences between the concepts, which can be applied to query expansion. We first demonstrate through simulation experiments that expanding queries with the related concepts from ConceptNet has great potential for improving the search results for difficult queries. We then propose and study several supervised and unsupervised methods for selecting the concepts from ConceptNet for automatic query expansion. The experimental results on multiple data sets indicate that the proposed methods can effectively leverage ConceptNet to improve the retrieval performance of difficult queries both when used in isolation as well as in combination with pseudo-relevance feedback.
【Keywords】: conceptnet; knowledge bases; query analysis; query expansion
【Paper Link】 【Pages】:413-422
【Authors】: Samuel Ieong ; Nina Mishra ; Eldar Sadikov ; Li Zhang
【Abstract】: This paper uncovers a new phenomenon in web search that we call domain bias --- a user's propensity to believe that a page is more relevant just because it comes from a particular domain. We provide evidence of the existence of domain bias in click activity as well as in human judgments via a comprehensive collection of experiments. We begin by studying the difference between domains that a search engine surfaces and that users click. Surprisingly, we find that despite changes in the overall distribution of surfaced domains, there has not been a comparable shift in the distribution of clicked domains. Users seem to have learned the landscape of the internet and their click behavior has thus become more predictable over time. Next, we run a blind domain test, akin to a Pepsi/Coke taste test, to determine whether domains can shift a user's opinion of which page is more relevant. We find that domains can actually flip a user's preference about 25% of the time. Finally, we demonstrate the existence of systematic domain preferences, even after factoring out confounding issues such as position bias and relevance, two factors that have been used extensively in past work to explain user behavior. The existence of domain bias has numerous consequences including, for example, the importance of discounting click activity from reputable domains.
【Keywords】: domain bias; user behavior; web search
【Paper Link】 【Pages】:423-432
【Authors】: Dongdong Shan ; Shuai Ding ; Jing He ; Hongfei Yan ; Xiaoming Li
【Abstract】: Large web search engines are facing formidable performance challenges because they have to process thousands of queries per second on tens of billions of documents, within interactive response time. Among many others, Top-k query processing (also called early termination or dynamic pruning) is an important class of optimization techniques that can improve the search efficiency and achieve faster query processing by avoiding the scoring of documents that are unlikely to be in the top results. One recent technique is using Block-Max index. In the Block-Max index, the posting lists are organized as blocks and the maximum score for each block is stored to improve the query efficiency. Although query processing speedup is achieved with Block-Max index, the ranking function for the Top-k results is the term-based approach. It is well known that documents' static scores are also important for a good ranking function. In this paper, we show that the performance of the state-of-the-art algorithms with the Block-Max index is degraded when the static score is added in the ranking function. Then we study efficient techniques for Top-k query processing in the case where a page's static score is given, such as PageRank, in addition to the term-based approach. In particular, we propose a set of new algorithms based on the WAND and MaxScore with Block-Max index using local score, which outperform the existing ones. Then we propose new techniques to estimate a better score upper bound for each block. We also study the search efficiency on different index structures where the document identifiers are assigned by URL sorting or by static document scores. Experiments on TREC GOV2 and ClueWeb09B show that considerable performance gains are achieved.
【Keywords】: block-max inverted index; early termination; top-k query processing
【Paper Link】 【Pages】:433-442
【Authors】: David Sontag ; Kevyn Collins-Thompson ; Paul N. Bennett ; Ryen W. White ; Susan T. Dumais ; Bodo Billerbeck
【Abstract】: We present a new approach for personalizing Web search results to a specific user. Ranking functions for Web search engines are typically trained by machine learning algorithms using either direct human relevance judgments or indirect judgments obtained from click-through data from millions of users. The rankings are thus optimized to this generic population of users, not to any specific user. We propose a generative model of relevance which can be used to infer the relevance of a document to a specific user for a search query. The user-specific parameters of this generative model constitute a compact user profile. We show how to learn these profiles from a user's long-term search history. Our algorithm for computing the personalized ranking is simple and has little computational overhead. We evaluate our personalization approach using historical search data from thousands of users of a major Web search engine. Our findings demonstrate gains in retrieval performance for queries with high ambiguity, with particularly large improvements for acronym queries.
【Keywords】: machine learning; personalization; probabilistic models; re-ranking
【Paper Link】 【Pages】:443-452
【Authors】: Michael Bendersky ; Donald Metzler ; W. Bruce Croft
【Abstract】: Most standard information retrieval models use a single source of information (e.g., the retrieval corpus) for query formulation tasks such as term and phrase weighting and query expansion. In contrast, in this paper, we present a unified framework that automatically optimizes the combination of information sources used for effective query formulation. The proposed framework produces fully weighted and expanded queries that are both more effective and more compact than those produced by the current state-of-the-art query expansion and weighting methods. We conduct an empirical evaluation of our framework for both newswire and web corpora. In all cases, our combination of multiple information sources for query formulation is found to be more effective than using any single source. The proposed query formulations are especially advantageous for large scale web corpora, where they also reduce the number of terms required for effective query expansion, and improve the diversity of the retrieved results.
【Keywords】: external sources; query expansion; query formulation
【Paper Link】 【Pages】:453-462
【Authors】: Changsung Kang ; Xuanhui Wang ; Yi Chang ; Belle L. Tseng
【Abstract】: Many vertical search tasks such as local search focus on specific domains. The meaning of relevance in these verticals is domain-specific and usually consists of multiple well-defined aspects (e.g., text matching and distance in local search). Thus the overall relevance between a query and a document is a tradeoff between multiple relevance aspects. Such a tradeoff can vary for different types of queries or in different contexts. In this paper, we explore these vertical-specific aspects in the learning to rank setting. We propose a novel formulation in which the relevance between a query and a document is assessed with respect to each aspect, forming the multi-aspect relevance. In order to compute a ranking function, we study two types of learning-based approaches to estimate the tradeoff between these relevance aspects: a label aggregation method and a model aggregation method. Since there are only a few aspects, a minimal amount of training data is needed to learn the tradeoff. We conduct both offline and online test experiments on a local search engine and the experimental results show that our proposed multi-aspect relevance formulation is very promising. The two types of aggregation methods perform more effectively than a set of baseline methods including a conventional learning to rank method.
【Keywords】: aggregation; multi-aspect relevance; web search
【Paper Link】 【Pages】:463-472
【Authors】: Danqi Chen ; Weizhu Chen ; Haixun Wang ; Zheng Chen ; Qiang Yang
【Abstract】: Click models have been positioned as an effective approach to interpret user click behavior in search engines. Existing click models mostly focus on traditional Web search that considers only ten homogeneous Web HTML documents that appear on the first search-result page. However, in modern commercial search engines, more and more Web search results are federated from multiple sources and contain non-HTML results returned by other heterogeneous vertical engines, such as video or image search engines. In this paper, we study user click behavior in federated search. We observed that user click behavior in federated search is highly different from that in traditional Web search, making it difficult to interpret using existing click models. In response, we propose a novel federated click model (FCM) to interpret user click behavior in federated search. In particular, we take into considerations two new biases in FCM. The first comes from the observation that users tend to be attracted by vertical results and their visual attention on them may increase the examination probability of other nearby web results. The other illustrates that user click behavior on vertical results may lead to more clues of search relevance due to their presentation style in federated search. With these biases and an effective model to correct them, FCM is more accurate in characterizing user click behavior in federated search. Our extensive experimental results show that FCM can outperform other click models in interpreting user click behavior in federated search and achieve significant improvements in terms of both perplexity and log-likelihood.
【Keywords】: click model; federated search; log analysis
【Paper Link】 【Pages】:473-482
【Authors】: Yandong Liu ; Sandeep Pandey ; Deepak Agarwal ; Vanja Josifovski
【Abstract】: The ultimate goal of advertisers are conversions representing desired user actions on the advertisers' websites in the form of purchases and product information request. In this paper we address the problem of finding the right audience for display campaigns by finding the users that are most likely to convert. This challenging problem is at the heart of display campaign optimization and has to deal with several issues such as very small percentage of converters in the general population, high-dimensional representation of the user profiles, large churning rate of users and advertisers. To overcome these difficulties, in our approach we use two sources of information: a seed set of users that have converted for a campaign in the past; and a description of the campaign based on the advertiser's website. We explore the importance of the information provided by each of these two sources in a principled manner and then combine them to propose models for predicting converters. In particular, we show how seed set can be used to capture the campaign-specific targeting constraints, while the campaign metadata allows to share targeting knowledge across campaigns. We give methods for learning these models and perform experiments on real-world advertising campaigns. Our findings show that the seed set and the campaign metadata are complimentary to each other and both sources provide valuable information for conversion optimization.
【Keywords】: advertising; conversions; modeling
【Paper Link】 【Pages】:483-492
【Authors】: Deepak Agarwal ; Maxim Gurevich
【Abstract】: A crucial task in many recommender problems like computational advertising, content optimization, and others is to retrieve a small set of items by scoring a large item inventory through some elaborate statistical/machine-learned model. This is challenging since the retrieval has to be fast (few milliseconds) to load the page quickly. Fast retrieval is well studied in the information retrieval (IR) literature, especially in the context of document retrieval for queries. When queries and documents have sparse representation and relevance is measured through cosine similarity (or some variant thereof), one could build highly efficient retrieval algorithms that scale gracefully to increasing item inventory. The key components exploited by such algorithms is sparse query-document representation and the special form of the relevance function. Many machine-learned models used in modern recommender problems do not satisfy these properties and since brute force evaluation is not an option with large item inventory, heuristics that filter out some items are often employed to reduce model computations at runtime. In this paper, we take a two-stage approach where the first stage retrieves top-K items using our approximate procedures and the second stage selects the desired top-k using brute force model evaluation on the K retrieved items. The main idea of our approach is to reduce the first stage to a standard IR problem, where each item is represented by a sparse feature vector (a.k.a. the vector-space representation) and the query-item relevance score is given by vector dot product. The sparse item representation is learnt to closely approximate the original machine-learned score by using retrospective data. Such a reduction allows leveraging extensive work in IR that resulted in highly efficient retrieval systems. Our approach is model-agnostic, relying only on data generated from the machine-learned model. We obtain significant improvements in the computational cost vs. accuracy tradeoff compared to several baselines in our empirical evaluation on both synthetic models and on a click-through (CTR) model used in online advertising.
【Keywords】: inverted index; lasso; machine learning; support vector machines; top-k retrieval
【Paper Link】 【Pages】:493-502
【Authors】: Chenyan Xiong ; Taifeng Wang ; Wenkui Ding ; Yidong Shen ; Tie-Yan Liu
【Abstract】: This paper is concerned with the prediction of clicking an ad in sponsored search. The accurate prediction of user's click on an ad plays an important role in sponsored search, because it is widely used in both ranking and pricing of the ads. Previous work on click prediction usually takes a single ad as input, and ignores its relationship to the other ads shown in the same page. This independence assumption here, however, might not be valid in the real scenario. In this paper, we first perform an analysis on this issue by looking at the click-through rates (CTR) of the same ad, in the same position and for the same query, but surrounded by different ads. We found that in most cases the CTR varies largely, which suggests that the relationship between ads is really an important factor in predicting click probability. Furthermore, our investigation shows that the more similar the surrounding ads are to an ad, the lower the CTR of the ad is. Based on this observation, we design a continuous conditional random fields (CRF) based model for click prediction, which considers both the features of an ad and its similarity to the surrounding ads. We show that the model can be effectively learned using maximum likelihood estimation, and can also be efficiently inferred due to its closed form solution. Our experimental results on the click-through log from a commercial search engine show that the proposed model can predict clicks more accurately than previous independent models. To our best knowledge this is the first work that predicts ad clicks by considering the relationship between ads.
【Keywords】: continuous crf; online advertising; relational click prediction; sponsored search
【Paper Link】 【Pages】:503-512
【Authors】: Yang Liu ; Roy Chen ; Yan Chen ; Qiaozhu Mei ; Suzy Salib
【Abstract】: As a new paradigm of online communities, microfinance sites such as Kiva.org have attracted much public attention. To understand lender motivations on Kiva, we classify the lenders' self-stated motivations into ten categories with human coders and machine learning based classifiers. We employ text classifiers using lexical features, along with social features based on lender activity information on Kiva, to predict the categories of lender motivation statements. Although the task appears to be much more challenging than traditional topic-based categorization, our classifiers can achieve high precision in most categories. Using the results of this classification along with Kiva teams information, we predict lending activity from lender motivation and team affiliations. Finally, we make design recommendations regarding Kiva practices which might increase pro-social lending.
【Keywords】: kiva; lending motivation; microfinance; pro-social lending; text classification
【Paper Link】 【Pages】:513-522
【Authors】: Eduardo J. Ruiz ; Vagelis Hristidis ; Carlos Castillo ; Aristides Gionis ; Alejandro Jaimes
【Abstract】: We study the problem of correlating micro-blogging activity with stock-market events, defined as changes in the price and traded volume of stocks. Specifically, we collect messages related to a number of companies, and we search for correlations between stock-market events for those companies and features extracted from the micro-blogging messages. The features we extract can be categorized in two groups. Features in the first group measure the overall activity in the micro-blogging platform, such as number of posts, number of re-posts, and so on. Features in the second group measure properties of an induced interaction graph, for instance, the number of connected components, statistics on the degree distribution, and other graph-based properties. We present detailed experimental results measuring the correlation of the stock market events with these features, using Twitter as a data source. Our results show that the most correlated features are the number of connected components and the number of nodes of the interaction graph. The correlation is stronger with the traded volume than with the price of the stock. However, by using a simulator we show that even relatively small correlations between price and micro-blogging features can be exploited to drive a stock trading strategy that outperforms other baseline strategies.
【Keywords】: financial time series; micro-blogging; social networks
【Paper Link】 【Pages】:523-532
【Authors】: Rawia Awadallah ; Maya Ramanath ; Gerhard Weikum
【Abstract】: The wikileaks documents about the death of Osama Bin Laden and the debates about the economic crisis in Greece and other European countries are some of the controversial topics being played on the news everyday. Each of these topics has many different aspects, and there is no absolute, simple truth in answering questions such as: should the EU guarantee the financial stability of each member country, or should the countries themselves be solely responsible? To understand the landscape of opinions, it would be helpful to know which politician or other stakeholder takes which position - support or opposition - on these aspects of controversial topics.
【Keywords】: political opinion mining; web information extraction
【Paper Link】 【Pages】:533-542
【Authors】: Hila Becker ; Dan Iter ; Mor Naaman ; Luis Gravano
【Abstract】: User-contributed Web data contains rich and diverse information about a variety of events in the physical world, such as shows, festivals, conferences and more. This information ranges from known event features (e.g., title, time, location) posted on event aggregation platforms (e.g., Last.fm events, EventBrite, Facebook events) to discussions and reactions related to events shared on different social media sites (e.g., Twitter, YouTube, Flickr). In this paper, we focus on the challenge of automatically identifying user-contributed content for events that are planned and, therefore, known in advance, across different social media sites. We mine event aggregation platforms to extract event features, which are often noisy or missing. We use these features to develop query formulation strategies for retrieving content associated with an event on different social media sites. Further, we explore ways in which event content identified on one social media site can be used to retrieve additional relevant event content on other social media sites. We apply our strategies to a large set of user-contributed events, and analyze their effectiveness in retrieving relevant event content from Twitter, YouTube, and Flickr.
【Keywords】: cross-site document retrieval; event identification; social media
【Paper Link】 【Pages】:543-552
【Authors】: John W. Byers ; Michael Mitzenmacher ; Georgios Zervas
【Abstract】: Daily deal sites have become the latest Internet sensation, providing discounted offers to customers for restaurants, ticketed events, services, and other items. We begin by undertaking a study of the economics of daily deals on the web, based on a dataset we compiled by monitoring Groupon and LivingSocial sales in 20 large cities over several months. We use this dataset to characterize deal purchases; glean insights about operational strategies of these firms; and evaluate customers' sensitivity to factors such as price, deal scheduling, and limited inventory. We then marry our daily deals dataset with additional datasets we compiled from Facebook and Yelp users to study the interplay between social networks and daily deal sites. First, by studying user activity on Facebook while a deal is running, we provide evidence that daily deal sites benefit from significant word-of-mouth effects during sales events, consistent with results predicted by cascade models. Second, we consider the effects of daily deals on the longer-term reputation of merchants, based on their Yelp reviews before and after they run a daily deal. Our analysis shows that while the number of reviews increases significantly due to daily deals, average rating scores from reviewers who mention daily deals are 10% lower than scores of their peers on average.
【Keywords】: daily deals; reputation; social diffusion
【Paper Link】 【Pages】:553-562
【Authors】: Guo-Jun Qi ; Charu C. Aggarwal ; Thomas S. Huang
【Abstract】: The clustering of social media objects provides intrinsic understanding of the similarity relationships between documents, images, and their contextual sources. Both content and link structure provide important cues for an effective clustering algorithm of the underlying objects. While link information provides useful hints for improving the clustering process, it also contains a significant amount of noisy information. Therefore, a robust clustering algorithm is required to reduce the impact of noisy links. In order to address the aforementioned problems, we propose heterogeneous random fields to model the structure and content of social media networks. We design a probability measure on the social media networks which output a configuration of clusters that are consistent with both content and link structure. Furthermore, noisy links can also be detected, and their impact on the clustering algorithm can be significantly reduced. We conduct experiments on a real social media network and show the advantage of the method over other state-of-the-art algorithms.
【Keywords】: noisy links; robust clustering; social media networks
【Paper Link】 【Pages】:563-572
【Authors】: Edgar Meij ; Wouter Weerkamp ; Maarten de Rijke
【Abstract】: Microblogs have become an important source of information for the purpose of marketing, intelligence, and reputation management. Streams of microblogs are of great value because of their direct and real-time nature. Determining what an individual microblog post is about, however, can be non-trivial because of creative language usage, the highly contextualized and informal nature of microblog posts, and the limited length of this form of communication. We propose a solution to the problem of determining what a microblog post is about through semantic linking: we add semantics to posts by automatically identifying concepts that are semantically related to it and generating links to the corresponding Wikipedia articles. The identified concepts can subsequently be used for, e.g., social media mining, thereby reducing the need for manual inspection and selection. Using a purpose-built test collection of tweets, we show that recently proposed approaches for semantic linking do not perform well, mainly due to the idiosyncratic nature of microblog posts. We propose a novel method based on machine learning with a set of innovative features and show that it is able to achieve significant improvements over all other methods, especially in terms of precision.
【Keywords】: microblogs; semantic linking; wikipedia
【Paper Link】 【Pages】:573-582
【Authors】: Junming Huang ; Xueqi Cheng ; Huawei Shen ; Tao Zhou ; Xiaolong Jin
【Abstract】: Word-of-mouth has proven an effective strategy for promoting products through social relations. Particularly, existing studies have convincingly demonstrated that word-of-mouth recommendations can boost users' prior expectation and hence encourage them to adopt a certain innovation, such as buying a book or watching a movie. However, less attention has been paid to studying the posterior effect of word-of-mouth recommendations, i.e., whether or not word-of-mouth recommendations can influence users' posterior evaluation on the products or services recommended to them, the answer to which is critical to estimating user satisfaction when proposing a word-of-mouth marketing strategy. In order to fill this gap, in this paper we empirically study the above issue and verify that word-of-mouth recommendations are strongly associated with users' posterior evaluation. Through elaborately designed statistical hypothesis tests we prove the causality that word-of-mouth recommendations directly prompt the posterior evaluation of receivers. Finally, we propose a method for investigating users' social influence, namely, their ability to affect followers' posterior evaluation via word-of-mouth recommendations, by examining the number of their followers and their sensitivity of discovering good items. The experimental results on real datasets show that our method can successfully identify 78% influential friends with strong social influence.
【Keywords】: posterior evaluation; social influence; word-of-mouth recommendation
【Paper Link】 【Pages】:583-592
【Authors】: Xueke Xu ; Songbo Tan ; Yue Liu ; Xueqi Cheng ; Zheng Lin ; Jiafeng Guo
【Abstract】: This paper aims to find blog feeds having a principal inclination towards making opinionated comments on the given topic, so that we can subscribe to them to track influential and interesting opinions in the blogosphere. One major challenge is assigning topic-related opinion scores to blog feeds, which is embodied in two aspects. Firstly, we should identify whether the blog feed has a principal on-topic opinionated inclination. This inclination should be collectively revealed by all posts of the feed. We should fully consider evidences from all the posts of the feed to identify salient information among many posts of the feed. Secondly, we should capture topic-related opinions in the blog feed while ignoring irrelevant opinions. In this paper, we propose a unified framework for opinionated blog feed retrieval, which combines topic relevance and opinion scores with a generative model. Furthermore, we propose a language modeling approach to estimating opinion scores that is seamlessly integrated into the framework, where two language models, Topic-specific Opinion Model (TOM) and Topic-biased Feed Model (TFM), work collectively to reflect whether the blog feed shows a principal on-topic opinionated inclination. To estimate TFM, we propose a topic-biased random walk to exploit both content and structural information to capture topic-biased salient information in the feed. As for TOM estimation, we propose to use a generative mixture model with prior guidance to effectively capture topic-specific opinion expressing language usage. The conducted experiments in the context of the TREC 2009-2010 Blog Track show the effectiveness of our proposed approaches.
【Keywords】: mixture model; opinionated blog feed retrieval; topic-biased random walk; topic-related opinionatedness
【Paper Link】 【Pages】:593-602
【Authors】: Anish Das Sarma ; Sreenivas Gollapudi ; Rina Panigrahy ; Li Zhang
【Abstract】: Motivated by trends in popularity of products, we present a formal model for studying trends in our choice of products in terms of three parameters: (1) their innate utility; (2) individual boredom associated with repeated usage of an item; and (3) social influences associated with the preferences from other people. Different from previous work, in this paper we introduce boredom to explain the cyclic pattern in individual and social choices. We formally model boredom and show that a rational individual would make cyclic choices when considering the boredom factor. Furthermore, we extend the model to social choices by showing that a society that votes for a particular style or product can be viewed as a single individual cycling through different choices. We adopt a natural model of utility an individual derives from using an item, i.e., the utility of an item gets discounted by its repeated use and increases when the item is not used. We address the problem of optimally choosing items for usage, so as to maximize overall user satisfaction over a period of time. First we show that the simple greedy heuristic of always choosing the item with the maximum current composite utility can be arbitrarily worse than the optimal. Second, we prove that even with just a single individual, determining the optimal strategy for choosing items is NP-hard. Third, we show that a simple modification to the greedy algorithm is a provably close approximation to the optimal strategy. Finally, we present an experimental study over real-world data collected from query logs to compare our algorithms.
【Keywords】: cyclic trends; fashion trends; social choice
【Paper Link】 【Pages】:603-612
【Authors】: Smriti Bhagat ; Amit Goyal ; Laks V. S. Lakshmanan
【Abstract】: One of the key objectives of viral marketing is to identify a small set of users in a social network, who when convinced to adopt a product will influence others in the network leading to a large number of adoptions in an expected sense. The seminal work of Kempe et al. [13] approaches this as the problem of influence maximization. This and other previous papers tacitly assume that a user who is influenced (or, informed) about a product necessarily adopts the product and encourages her friends to adopt it. However, an influenced user may not adopt the product herself, and yet form an opinion based on the experiences of her friends, and share this opinion with others. Furthermore, a user who adopts the product may not like the product and hence not encourage her friends to adopt it to the same extent as another user who adopted and liked the product. This is independent of the extent to which those friends are influenced by her. Previous works do not account for these phenomena. We argue that it is important to distinguish product adoption from influence. We propose a model that factors in a user's experience (or projected experience) with a product. We adapt the classical Linear Threshold (LT) propagation model by defining an objective function that explicitly captures product adoption, as opposed to influence. We show that under our model, adoption maximization is NP-hard and the objective function is monotone and submodular, thus admitting an approximation algorithm. We perform experiments on three real popular social networks and show that our model is able to distinguish between influence and adoption, and predict product adoption much more accurately than approaches based on the classical LT model.
【Keywords】: influence; product adoption; viral marketing
【Paper Link】 【Pages】:613-622
【Authors】: Ingmar Weber ; Antti Ukkonen ; Aristides Gionis
【Abstract】: We investigate the problem of mining "tips" from Yahoo! Answers and displaying those tips in response to related web queries. Here, a "tip" is a short, concrete and self-contained bit of non-obvious advice such as "To zest a lime if you don't have a zester : use a cheese grater." First, we estimate the volume of web queries with "how-to" intent, which could be potentially addressed by a tip. Second, we analyze how to detect such queries automatically without solely relying on literal "how to *" patterns. Third, we describe how to derive potential tips automatically from Yahoo! Answers, and we develop machine-learning techniques to remove low-quality tips. Finally, we discuss how to match web queries with "how-to" intent to tips. We evaluate both the quality of these direct displays as well as the size of the query volume that can be addressed by serving tips.
【Keywords】: direct display; how-to queries; tips; web search
【Paper Link】 【Pages】:623-632
【Authors】: Peifeng Yin ; Ping Luo ; Min Wang ; Wang-Chien Lee
【Abstract】: Prediction of popular items in online content sharing systems has recently attracted a lot of attention due to the tremendous need of users and its commercial values. Different from previous works that make prediction by fitting a popularity growth model, we tackle this problem by exploiting the latent conforming and maverick personalities of those who vote to assess the quality of on-line items. We argue that the former personality prompts a user to cast her vote conforming to the majority of the service community while on the contrary the later personality makes her vote different from the community. We thus propose a Conformer-Maverick (CM) model to simulate the voting process and use it to rank top-k potentially popular items based on the early votes they received. Through an extensive experimental evaluation, we validate our ideas and find that our proposed CM model achieves better performance than baseline solutions, especially for smaller k.
【Keywords】: conformer; generative model; maverick; popular ranking
【Paper Link】 【Pages】:633-642
【Authors】: Onur Kucuktunc ; Berkant Barla Cambazoglu ; Ingmar Weber ; Hakan Ferhatosmanoglu
【Abstract】: Sentiment extraction from online web documents has recently been an active research topic due to its potential use in commercial applications. By sentiment analysis, we refer to the problem of assigning a quantitative positive/negative mood to a short bit of text. Most studies in this area are limited to the identification of sentiments and do not investigate the interplay between sentiments and other factors. In this work, we use a sentiment extraction tool to investigate the influence of factors such as gender, age, education level, the topic at hand, or even the time of the day on sentiments in the context of a large online question answering site. We start our analysis by looking at direct correlations, e.g., we observe more positive sentiments on weekends, very neutral ones in the Science & Mathematics topic, a trend for younger people to express stronger sentiments, or people in military bases to ask the most neutral questions. We then extend this basic analysis by investigating how properties of the (asker, answerer) pair affect the sentiment present in the answer. Among other things, we observe a dependence on the pairing of some inferred attributes estimated by a user's ZIP code. We also show that the best answers differ in their sentiments from other answers, e.g., in the Business & Finance topic, best answers tend to have a more neutral sentiment than other answers. Finally, we report results for the task of predicting the attitude that a question will provoke in answers. We believe that understanding factors influencing the mood of users is not only interesting from a sociological point of view, but also has applications in advertising, recommendation, and search.
【Keywords】: attitude; collaborative question answering; prediction; sentiment analysis; sentimentality
【Paper Link】 【Pages】:643-652
【Authors】: Oren Tsur ; Ari Rappoport
【Abstract】: Current social media research mainly focuses on temporal trends of the information flow and on the topology of the social graph that facilitates the propagation of information. In this paper we study the effect of the content of the idea on the information propagation. We present an efficient hybrid approach based on a linear regression for predicting the spread of an idea in a given time frame. We show that a combination of content features with temporal and topological features minimizes prediction error. Our algorithm is evaluated on Twitter hashtags extracted from a dataset of more than 400 million tweets. We analyze the contribution and the limitations of the various feature types to the spread of information, demonstrating that content aspects can be used as strong predictors thus should not be disregarded. We also study the dependencies between global features such as graph topology and content features.
【Keywords】: hashtags; information diffusion; microblogging; social media; twitter
【Paper Link】 【Pages】:653-662
【Authors】: Marisa A. Vasconcelos ; Saulo M. R. Ricci ; Jussara M. Almeida ; Fabrício Benevenuto ; Virgílio A. F. Almeida
【Abstract】: Online Location Based Social Networks (LBSNs), which combine social network features with geographic information sharing, are becoming increasingly popular. One such application is Foursquare, which doubled its user population in less than six months. Among other features, Foursquare allows users to leave tips (i.e., reviews or recommendations) at specific venues as well as to give feedback on previously posted tips by adding them to their to-do lists or marking them as done. In this paper, we analyze how Foursquare users exploit these three features - tips, dones and to-dos - uncovering different behavior profiles. Our study reveals the existence of very active and influential users, some of which are famous businesses and brands, that seem engaged in posting tips at a large variety of venues while also receiving a great amount of user feedback on them. We also provide evidence of spamming, showing the existence of users that post tips whose contents are unrelated to the nature or domain of the venue where the tips were left.
【Keywords】: location based social networks; spamming; user behavior characterization
【Paper Link】 【Pages】:663-672
【Authors】: Yizhou Sun ; Jiawei Han ; Charu C. Aggarwal ; Nitesh V. Chawla
【Abstract】: Link prediction, i.e., predicting links or interactions between objects in a network, is an important task in network analysis. Although the problem has attracted much attention recently, there are several challenges that have not been addressed so far. First, most existing studies focus only on link prediction in homogeneous networks, where all objects and links belong to the same type. However, in the real world, heterogeneous networks that consist of multi-typed objects and relationships are ubiquitous. Second, most current studies only concern the problem of whether a link will appear in the future but seldom pay attention to the problem of when it will happen. In this paper, we address both issues and study the problem of predicting when a certain relationship will happen in the scenario of heterogeneous networks. First, we extend the link prediction problem to the relationship prediction problem, by systematically defining both the target relation and the topological features, using a meta path-based approach. Then, we directly model the distribution of relationship building time with the use of the extracted topological features. The experiments on citation relationship prediction between authors on the DBLP network demonstrate the effectiveness of our methodology.
【Keywords】: heterogeneous information networks; relationship prediction
【Paper Link】 【Pages】:673-682
【Authors】: Sanjay Ram Kairam ; Dan J. Wang ; Jure Leskovec
【Abstract】: We pose a fundamental question in understanding how to identify and design successful communities: What factors predict whether a community will grow and survive in the long term? Social scientists have addressed this question extensively by analyzing offline groups which endeavor to attract new members, such as social movements, finding that new individuals are influenced strongly by their ties to members of the group. As a result, prior work on the growth of communities has treated growth primarily as a diffusion processes, leading to findings about group evolution which can be difficult to explain. The proliferation of online social networks and communities, however, has created new opportunities to study, at a large scale and with very fine resolution, the mechanisms which lead to the formation, growth, and demise of online groups. In this paper, we analyze data from several thousand online social networks built on the Ning platform with the goal of understanding the factors contributing to the growth and longevity of groups within these networks. Specifically, we investigate the role that two types of growth (growth through diffusion and growth by other means) play during a group's formative stages from the perspectives of both the individual member and the group. Applying these insights to a population of groups of different ages and sizes, we build a model to classify groups which will grow rapidly over the short-term and long-term. Our model achieves over 79% accuracy in predicting group growth over the following two months and over 78% accuracy in predictions over the following two years. We utilize a similar approach to predict which groups will die within a year. The results of our combined analysis provide insight into how both early non-diffusion growth and a complex set of network constraints appear to contribute to the initial and continued growth and success of groups within social networks. Finally we discuss implications of this work for the design, maintenance, and analysis of online communities.
【Keywords】: group formation; information diffusion; online communities; social networks
【Paper Link】 【Pages】:683-692
【Authors】: Chia-Jung Lee ; W. Bruce Croft ; Jinyoung Kim
【Abstract】: The prevalence of social media applications is generating potentially large personal archives of posts, tweets, and other communications. The existence of these archives creates a need for search tools, which can be seen as an extension of current desktop search services. Little is currently known about the best search techniques for personal archives of social data, because of the difficulty of creating test collections. In this paper, we describe how test collections for personal social data can be created by using games to collect queries. We then compare a range of retrieval models that exploit the semi-structured nature of social data. Our results show that a mixture of language models with field distribution estimation can be effective for this type of data, with certain fields, such as the name of the poster, being particularly important. We also analyze the properties of the queries that were generated by users with two versions of the games.
【Keywords】: desktop search; facebook; personal social media collections; search evaluation; semi-structured document; twitter
【Paper Link】 【Pages】:693-702
【Authors】: Ankan Saha ; Vikas Sindhwani
【Abstract】: As massive repositories of real-time human commentary, social media platforms have arguably evolved far beyond passive facilitation of online social interactions. Rapid analysis of information content in online social media streams (news articles, blogs,tweets etc.) is the need of the hour as it allows business and government bodies to understand public opinion about products and policies. In most of these settings, data points appear as a stream of high dimensional feature vectors. Guided by real-world industrial deployment scenarios, we revisit the problem of online learning of topics from streaming social media content. On one hand, the topics need to be dynamically adapted to the statistics of incoming datapoints, and on the other hand, early detection of rising new trends is important in many applications. We propose an online nonnegative matrix factorizations framework to capture the evolution and emergence of themes in unstructured text under a novel temporal regularization framework. We develop scalable optimization algorithms for our framework, propose a new set of evaluation metrics, and report promising empirical results on traditional TDT tasks as well as streaming Twitter data. Our system is able to rapidly capture emerging themes, track existing topics over time while maintaining temporal consistency and continuity in user views, and can be explicitly configured to bound the amount of information being presented to the user.
【Keywords】: dictionary learning; nmf; time series analysis; topic models
【Paper Link】 【Pages】:703-712
【Authors】: Ashton Anderson ; Daniel P. Huttenlocher ; Jon M. Kleinberg ; Jure Leskovec
【Abstract】: There are many settings in which users of a social media application provide evaluations of one another. In a variety of domains, mechanisms for evaluation allow one user to say whether he or she trusts another user, or likes the content they produced, or wants to confer special levels of authority or responsibility on them. Earlier work has studied how the relative status between two users - that is, their comparative levels of status in the group - affects the types of evaluations that one user gives to another. Here we study how similarity in the characteristics of two users can affect the evaluation one user provides of another. We analyze this issue under a range of natural similarity measures, showing how the interaction of similarity and status can produce strong effects. Among other consequences, we find that evaluations are less status-driven when users are more similar to each other; and we use effects based on similarity to provide a plausible mechanism for a complex phenomenon observed in studies of user evaluation, that evaluations are particularly low among users of roughly equal status. Our work has natural applications to the prediction of evaluation outcomes based on user characteristics, and the use of similarity information makes possible a novel application that we introduce here - to estimate the chance of a favorable overall evaluation from a group knowing only the attributes of the group's members, but not their expressed opinions.
【Keywords】: ballot-blind prediction; status; user similarity; user-to-user evaluations
【Paper Link】 【Pages】:713-722
【Authors】: Rina Panigrahy ; Marc Najork ; Yinglian Xie
【Abstract】: Previous research has suggested that people who are in the same social circle exhibit similar behaviors and tastes. The rise of social networks gives us insights into the social circles of web users, and recommendation services (including search engines, advertisement engines, and collaborative filtering engines) provide a motivation to adapt recommendations to the interests of the audience. An important primitive for supporting these applications is the ability to quantify how connected two users are in a social network. The shortest-path distance between a pair of users is an obvious candidate measure. This paper introduces a new measure of "affinity" in social networks that takes into account not only the distance between two users, but also the number of edge-disjoint paths between them, i.e. the "robustness" of their connection. Our measure is based on a sketch-based approach, and affinity queries can be answered extremely efficiently (at the expense of a one-time offline sketch computation). We compare this affinity measure against the "approximate shortest-path distance", a sketch-based distance measure with similar efficiency characteristics. Our empirical study is based on a Hotmail email exchange graph combined with demographic information and Bing query history, and a Twitter mention-graph together with the text of the underlying tweets. We found that users who are close to each other - either in terms of distance or affinity - have a higher similarity in terms of demographics, queries, and tweets.
【Keywords】: affinity; distance; influence; sketching; social networks
【Paper Link】 【Pages】:723-732
【Authors】: Adam Sadilek ; Henry A. Kautz ; Jeffrey P. Bigham
【Abstract】: Location plays an essential role in our lives, bridging our online and offline worlds. This paper explores the interplay between people's location, interactions, and their social ties within a large real-world dataset. We present and evaluate Flap, a system that solves two intimately related tasks: link and location prediction in online social networks. For link prediction, Flap infers social ties by considering patterns in friendship formation, the content of people's messages, and user location. We show that while each component is a weak predictor of friendship alone, combining them results in a strong model, accurately identifying the majority of friendships. For location prediction, Flap implements a scalable probabilistic model of human mobility, where we treat users with known GPS positions as noisy sensors of the location of their friends. We explore supervised and unsupervised learning scenarios, and focus on the efficiency of both learning and inference. We evaluate Flap on a large sample of highly active users from two distinct geographical areas and show that it (1) reconstructs the entire friendship graph with high accuracy even when no edges are given; and (2) infers people's fine-grained location, even when they keep their data private and we can only access the location of their friends. Our models significantly outperform current comparable approaches to either task.
【Keywords】: graphical models; link prediction; location modeling; machine learning; social networks; visualization
【Paper Link】 【Pages】:733-742
【Authors】: Yaron Singer
【Abstract】: Throughout the past decade there has been extensive research on algorithmic and data mining techniques for solving the problem of influence maximization in social networks: if one can incentivize a subset of individuals to become early adopters of a new technology, which subset should be selected so that the word-of-mouth effect in the social network is maximized? Despite the progress in modeling and techniques, the incomplete information aspect of the problem has been largely overlooked. While data can often provide the network structure and influence patterns may be observable, the inherent cost individuals have to become early adopters is difficult to extract. In this paper we introduce mechanisms that elicit individuals' costs while providing desirable approximation guarantees in some of the most well-studied models of social network influence. We follow the mechanism design framework which advocates for allocation and payment schemes that incentivize individuals to report their true information. We also performed experiments using the Mechanical Turk platform and social network data to provide evidence of the framework's effectiveness in practice.
【Keywords】: influence maximization; mechanism design; social networks; viral marketing
【Paper Link】 【Pages】:743-752
【Authors】: Jie Tang ; Tiancheng Lou ; Jon M. Kleinberg
【Abstract】: It is well known that different types of social ties have essentially different influence on people. However, users in online social networks rarely categorize their contacts into "family", "colleagues", or "classmates". While a bulk of research has focused on inferring particular types of relationships in a specific social network, few publications systematically study the generalization of the problem of inferring social ties over multiple heterogeneous networks. In this work, we develop a framework for classifying the type of social relationships by learning across heterogeneous networks. The framework incorporates social theories into a factor graph model, which effectively improves the accuracy of inferring the type of social relationships in a target network by borrowing knowledge from a different source network. Our empirical study on five different genres of networks validates the effectiveness of the proposed framework. For example, by leveraging information from a coauthor network with labeled advisor-advisee relationships, the proposed framework is able to obtain an F1-score of 90% (8-28% improvements over alternative methods) for inferring manager-subordinate relationships in an enterprise email network.
【Keywords】: inferring social ties; predictive model; social influence analysis; social network
【Paper Link】 【Pages】:753-754
【Authors】: Hilary Mason
【Abstract】: The social web is a messy place! At bitly, we see hundreds of millions of shares and clicks per day--clicks that contain all sorts of wonderful content from lolcats to spacecraft launches. I'll discuss our philosophy, tools, and techniques for looking at the data, and new research opportunities that weren't possible before.
【Keywords】: social content; social web
【Paper Link】 【Pages】:755-756
【Authors】: Luca Chiarandini
【Abstract】: The accumulation of large collections of social media data poses new challenges for the design of exploratory experiences, such as when a user browses through a collection to discover content (e.g. exploring photo collections, network of friends, etc). Cardinality and characteristics of the set, together with volatility of the information, resulting from fast and continuous creation, deletion and updating of entries, trigger novel research questions. In this context, we plan to investigate and contribute to the data analysis, and user interface design of exploratory experiences. The proposed approach is an iterative process where analysis and design phases are performed in cycles. The long-term vision is to understand the underlying reasoning in order to be able to automatically replicate it.
【Keywords】: data mining; human computer interaction; social media
【Paper Link】 【Pages】:757-758
【Authors】: Kushal S. Dave
【Abstract】: Computational advertising refers to finding the most relevant ads matching a particular context on the web. The core problem attacked in computational advertising CA is of the match making between the ads and the context. My research work aims at leveraging various user interaction, ad and advertiser related information and contextual information for improving the relevance, ranking and targeting of ads. The research work focuses on the identification of various factors that contribute in retrieving and ranking the most relevant set of ads that match best with the context. Specifically, information associated with the user, publisher and advertiser is leveraged for this purpose.
【Keywords】: computational advertising; contextual advertising; sponsored search; viral marketing
【Paper Link】 【Pages】:759-760
【Authors】: Brian Thompson
【Abstract】: In this work we propose a novel approach to anomaly detection in streaming communication data. We first build a stochastic model for the system based on temporal communication patterns across each edge, which we call the REWARDS (REneWal theory Approach for Real-time Data Streams) model. We then define a measure of anomaly for an arbitrary subgraph based on the likelihood of its recent activity given past behavior. Finally, we develop an algorithm to efficiently identify subgraphs with the most anomalous activity. Although our work has until now focused on the cybersecurity domain, the model we present is more broadly applicable to information retrieval in data streams and information networks.
【Keywords】: anomaly detection; communication networks; graph algorithms; information networks; renewal theory; streaming time series models; time-evolving graphs
【Paper Link】 【Pages】:761-762
【Authors】: Elizeu Santos-Neto
【Abstract】: Assessing the value of individual users' contributions in peer-production systems is paramount to the design of mechanisms that support collaboration and improve users' experience. For instance, to incentivize contributions, file sharing systems based on the BitTorrent protocol equate value with volume of contributed content and use a prioritization mechanism to reward users who contribute more. This approach and similar techniques used in resource sharing systems rely on the fact that the physical resources shared among users are easily quantifiable. In contrast, information-sharing systems, like social tagging systems, lack the notion of a physical resource unit (e.g., content size, bandwidth) that facilitates the task of evaluating user contributions. For this reason, the issue of estimating the value of user contributions in information sharing systems remains largely unexplored. This paper outlines a research project to tackle the problem of assessing the value of contributions in social tagging systems.
【Keywords】: information value; peer-production; social; systems; tagging
【Paper Link】 【Pages】:763-764
【Authors】: Eugene Agichtein ; Evgeniy Gabrilovich
【Abstract】: Proliferation of ubiquitous access to the Internet enables millions of Web users to collaborate online on a variety of activities. Many of these activities result in the construction of large repositories of knowledge, either as their primary aim (e.g., Wikipedia) or as a by-product (e.g., Yahoo! Answers). In this tutorial, we will discuss organizing and exploiting Collaboratively Generated Content (CGC) for information organization and retrieval. Specifically, we intend to cover two complementary areas of the problem: (1) using such content as a powerful enabling resource for knowledge-enriched, intelligent representations and new information retrieval algorithms, and (2) development of supporting technologies for extracting, filtering, and organizing collaboratively created content. The unprecedented amounts of information in CGC enable new, knowledge-rich approaches to information access, which are significantly more powerful than the conventional word-based methods. Considerable progress has been made in this direction over the last few years. Examples include explicit manipulation of human-defined concepts and their use to augment the bag of words (cf. Explicit Semantic Analysis), using large-scale taxonomies of topics from Wikipedia or the Open Directory Project to construct additional class-based features, or using Wikipedia for better word sense disambiguation. However, the quality and comprehensiveness of collaboratively created content vary significantly, and in order for this resource to be useful, a significant amount of preprocessing, filtering, and organization is necessary. Consequently, new methods for analyzing CGC and corresponding user interactions are required to effectively harness the resulting knowledge. Thus, not only the content repositories can be used to improve IR methods, but the reverse pollination is also possible, as better information extraction methods can be used for automatically collecting more knowledge, or verifying the contributed content. This natural connection between modeling the generation process of CGC and effectively using the accumulated knowledge suggests covering both areas together in a single tutorial. The intended audience of the tutorial includes IR researchers and graduate students, who would like to learn about the recent advances and research opportunities in working with collaboratively generated content. The emphasis of the tutorial is on comparing the existing approaches and presenting practical techniques that IR practitioners can use in their research. We also cover open research challenges, as well as survey available resources (software tools and data) for getting started in this research field.
【Keywords】: collaboratively-generated content
【Paper Link】 【Pages】:765-766
【Authors】: Chirag Shah
【Abstract】: The course will introduce the student to theories, methodologies, and tools that focus on information retrieval/seeking in collaboration. The student will have an opportunity to learn about the social aspect of IR with a focus on collaborative information seeking (CIS) situations, systems, and evaluation techniques. Traditionally, IR is considered an individual pursuit, and not surprisingly, the majority of tools, techniques, and models developed for addressing information need, retrieval, and usage have focused on single users. The assumption of information seekers being independent and IR problem being individual has been challenged often in the recent past. This course will introduce such works to the students, with an emphasis on understanding models and systems that support collaborative search or browsing. In addition, the course will provide samples of data collected through several experiments to demonstrate various mining and analysis techniques. Specifically, the course will (1) outline the research and latest developments in the field of collaborative IR, (2) list the challenges for designing and evaluating collaborative IR systems, and (3) show how traditional single user IR models and systems could be mapped to those for CIS. This will be achieved through introduction to appropriate literature, algorithms and interfaces that facilitate CIS, and methodologies for studying and evaluating them. Thus, the course will offer a balance between theoretical and practical elements of CIS.
【Keywords】: collaborative information seeking
【Paper Link】 【Pages】:767-768
【Abstract】: In web search, relevance is one of the most important factors to meet users' satisfaction, and the success of a web search engine heavily depends on its performance on relevance. It has been observed that many hard cases in search relevance are due to term mismatch between query and documnt (e.g., query 'ny times' does not match well with document only containing 'new york times'), and thus it is not exaggerated to say that dealing with mismatch between query and document is one of the most critical research problems in web search. Recently researchers have spent significant effort to address the grand challenge. The major approach is to conduct more query and document understanding, and perform better matching between enriched query and document representations. With the availability of large amount of log data and advanced machine learning techniques, this becomes more feasible and significant progress has been made recently. In this tutorial, we will give a systematic and detailed presentation on newly developed machine learning technologies for query document matching in search. We will focus on the fundamental problems, as well as the novel solutions for query document matching at word form level, word sense level, topic level, and structure level. We will talk about novel technologies about query spelling error correction [3, 13], query rewriting [1, 4, 6, 7], query classification [2], topic modeling of documents [5, 9], query document matching [8, 10, 11, 12], and query document-title translation. The ideas and solutions introduced in this tutorial may motivate industrial practitioners to turn the research fruits into product reality. The summary of the state-of-the-art methods and the discussions on the technical issues in this tutorial may stimulate academic researchers to find new research directions and solutions. Matching between query and document is not limited to search, and similar problems can be observed at online advertisement, recommendation system, and other applications, as matching between objects from two spaces. The technologies we introduce can be generalized into more general machine learning techniques, which we call learning to match.
【Keywords】: machine learning; query-document matching; web search
【Paper Link】 【Pages】:769-770
【Authors】: Craig Macdonald ; Jun Wang ; Charles L. A. Clarke
【Abstract】: When an ambiguous query is received, a sensible approach is for the information retrieval (IR) system to diversify the results retrieved for this query, in the hope that at least one of the interpretations of the query intent will satisfy the user. Diversity is an increasingly important topic, of interest to both academic researchers (such as participants in the TREC Web and Blog track diversity tasks, or the NTCIR INTENT task), as well as to search engines professionals. In the 2nd edition of the Diversity in Document Retrieval workshop (DDR 2012), we solicited submissions both on approaches and models for diversity, the evaluation of diverse search results, and on applications of diverse search results. This workshop builds upon a successful 1st edition of DDR which was held at ECIR 2011 in Dublin, Ireland.
【Keywords】: diversity; information retrieval; search
【Paper Link】 【Pages】:771-772
【Authors】: Pavel Serdyukov ; Nick Craswell ; Georges Dupret
【Abstract】: WSCD2012 is the second workshop on Web Search Click Data, following WSCD2009. It is a forum for new research relating to Web search usage logs and for discussing desirable properties of publicly released search log datasets. Research relating to search logs has been hampered by the limited availability of click datasets. This workshop comes with a new click dataset based on click logs and an accompanying challenge to predict the relevance of documents based on clicks.
【Keywords】: click data; web search