Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM 2015, Shanghai, China, February 2-6, 2015. ACM 【DBLP Link】
【Paper Link】 【Pages】:1-2
【Authors】: Michael Franklin
【Abstract】: The Berkeley AMPLab is creating a new approach to data analytics. Launched in early 2011, the lab aims to seamlessly integrate the three main resources available for making sense of data at scale: Algorithms (machine learning and statistical techniques), Machines (in the form of scalable clusters and elastic cloud computing), and People (both individually as analysts and in crowds). The lab is realizing its ideas through the development of a freely available open-source software stack called BDAS: the Berkeley Data Analytics Stack. In the four years the lab has been in operation, we've released major components of BDAS. Several of these components have gained significant traction in industry and elsewhere: the Mesos cluster resource manager, the Spark in-memory computation framework, and the Shark query processing system. BDAS features prominently in many industry discussions of the future of the Big Data analytics ecosystem -- a rare degree of impact for an ongoing academic project. Given this initial success, the lab is continuing on its research path, moving "up the stack" to better integrate and support advanced analytics and to make people a full-fledged resource for making sense of data. In this talk, I'll first outline the motivation and insights behind our research approach and describe how we have organized to address the cross-disciplinary nature of Big Data challenges. I will then describe the current state of BDAS with an emphasis on our newest efforts, including some or all of: the GraphX graph processing system, the Velox and MLBase machine learning platforms, and the SampleClean framework for hybrid human/computer data cleaning. Finally I will present our current views of how all the pieces will fit together to form a system that can adaptively bring the right resources to bear on a given data-driven question to meet time, cost and quality requirements throughout the analytics lifecycle.
【Keywords】: big data
【Paper Link】 【Pages】:3-4
【Authors】: Jure Leskovec
【Abstract】: Recommender systems are an integral part of how we experience the Web today and they have become so ubiquitous that we do not even notice them anymore. However, today's recommender systems mostly treat items they recommend as black boxes and primarily focus on extracting correlations and co-counts from user behavior data. In this talk I argue that next-generation recommender systems will require deep understanding of the items being recommended as well as modeling of the relationships between those items. I will present examples of how auxiliary data about items (descriptions, reviews, product specifications) can be used to improve recommendations.
【Keywords】: recommendation system
【Paper Link】 【Pages】:5-6
【Authors】: Andrei Z. Broder ; Lada A. Adamic ; Michael Franklin ; Maarten de Rijke ; Eric P. Xing ; Kai Yu
【Abstract】: Gartner's 2014 Hype Cycle, released last August, moves Big Data technology from the Peak of Inflated Expectations to the beginning of the Trough of Disillusionment, where interest starts to wane as reality does not live up to previous promises. As the hype starts to dissipate, it is worth asking what Big Data (however defined) means from a scientific perspective: did the emergence of gigantic corpora expose the limits of classical information retrieval and data mining and lead to new concepts and challenges, the way, say, the study of electromagnetism showed the limits of Newtonian mechanics and led to Relativity Theory, or is it all just "sound and fury, signifying nothing", simply a matter of scaling up well-understood technologies? To answer this question, we have assembled a distinguished panel of eminent scientists from both industry and academia: Lada Adamic (Facebook), Michael Franklin (University of California at Berkeley), Maarten de Rijke (University of Amsterdam), Eric Xing (Carnegie Mellon University), and Kai Yu (Baidu) will share their points of view and take questions from the moderator and the audience.
【Keywords】: big data
【Paper Link】 【Pages】:7-16
【Authors】: Saehoon Kim ; Yuxiong He ; Seung-won Hwang ; Sameh Elnikety ; Seungjin Choi
【Abstract】: A commercial web search engine shards its index among many servers, and therefore the response time of a search query is dominated by the slowest server that processes the query. Prior approaches target improving responsiveness by reducing the tail latency of an individual search server. They predict query execution time, and if a query is predicted to be long-running, it runs in parallel, otherwise it runs sequentially. These approaches are, however, not accurate enough for reducing a high tail latency when responses are aggregated from many servers because this requires each server to reduce a substantially higher tail latency (e.g., the 99.99th percentile), which we call extreme tail latency. We propose a prediction framework to reduce the extreme tail latency of search servers. The framework has a unique set of characteristics to predict long-running queries with high recall and improved precision. Specifically, prediction is delayed by a short duration to allow many short-running queries to complete without parallelization, and to allow the predictor to collect a set of dynamic features using runtime information. These features estimate query execution time with high accuracy. We also use them to estimate the prediction errors to override an uncertain prediction by selectively accelerating the query for a higher recall. We evaluate the proposed prediction framework to improve search engine performance in two scenarios using a simulation study: (1) query parallelization on a multicore processor, and (2) query scheduling on a heterogeneous processor. The results show that, for both scenarios, the proposed framework is effective in reducing the extreme tail latency compared to a state-of-the-art predictor because of its higher recall, and it improves server throughput by more than 70% because of its improved precision.
【Keywords】: parallelization; prediction; search engine; tail latency
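The paper's delayed-prediction idea can be illustrated with a small decision routine. This is only a sketch under assumed, hypothetical thresholds (`delay_ms`, `long_threshold_ms`), not the paper's actual predictor or parameters:

```python
def schedule(query_elapsed_ms, predicted_ms, error_estimate_ms,
             delay_ms=5, long_threshold_ms=50):
    """Sketch of delayed prediction for extreme tail latency.

    Every query first runs sequentially for delay_ms, so most
    short-running queries finish without ever being parallelized.
    After the delay, a prediction based on runtime features decides
    whether to parallelize; an uncertain prediction (large error
    estimate) is overridden toward parallelization for higher recall.
    """
    if query_elapsed_ms < delay_ms:
        return "sequential"  # still inside the observation window
    if predicted_ms + error_estimate_ms >= long_threshold_ms:
        return "parallel"    # predicted long, or too uncertain to risk
    return "sequential"
```

The key design point the sketch captures is that the delay itself filters out the bulk of short queries, so the predictor only has to be accurate on the survivors.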
【Paper Link】 【Pages】:17-26
【Authors】: Masrour Zoghi ; Shimon Whiteson ; Maarten de Rijke
【Abstract】: A key challenge in information retrieval is that of on-line ranker evaluation: determining which one of a finite set of rankers performs the best in expectation on the basis of user clicks on presented document lists. When the presented lists are constructed using interleaved comparison methods, which interleave lists proposed by two different candidate rankers, then the problem of minimizing the total regret accumulated while evaluating the rankers can be formalized as a K-armed dueling bandit problem. In the setting of web search, the number of rankers under consideration may be large. Scaling effectively in the presence of so many rankers is a key challenge not adequately addressed by existing algorithms. We propose a new method, which we call mergeRUCB, that uses "localized" comparisons to provide the first provably scalable K-armed dueling bandit algorithm. Empirical comparisons on several large learning to rank datasets show that mergeRUCB can substantially outperform the state of the art K-armed dueling bandit algorithms when many rankers must be compared. Moreover, we provide theoretical guarantees demonstrating the soundness of our algorithm.
【Keywords】: evaluation; implicit feedback; on-line learning
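The dueling-bandit setting described above can be sketched with a generic RUCB-style selection rule (mergeRUCB itself adds "localized" comparisons for scalability; this simplified version only shows how optimistic pairwise win-rate estimates pick which two rankers to interleave next):

```python
import math

def rucb_pick(wins, t, alpha=0.51):
    """Pick a (champion, challenger) pair of rankers to interleave.

    wins[i][j] counts interleaved comparisons won by ranker i over j;
    t is the current round. Upper confidence bounds on pairwise win
    probabilities drive optimistic exploration."""
    k = len(wins)
    ucb = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            n = wins[i][j] + wins[j][i]
            mean = wins[i][j] / n if n else 0.5
            bonus = math.sqrt(alpha * math.log(t) / n) if n else 1.0
            ucb[i][j] = min(1.0, mean + bonus)
    # champion: a ranker whose optimistic win rate vs. every other is >= 1/2
    champ = next((i for i in range(k)
                  if all(ucb[i][j] >= 0.5 for j in range(k) if j != i)), 0)
    # challenger: the ranker most optimistic about beating the champion
    chall = max((j for j in range(k) if j != champ),
                key=lambda j: ucb[j][champ])
    return champ, chall
```

With many rankers the all-pairs bookkeeping above is exactly what becomes expensive, which motivates the paper's localized, provably scalable variant.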
【Paper Link】 【Pages】:27-36
【Authors】: Alexey Drutsa ; Gleb Gusev ; Pavel Serdyukov
【Abstract】: Nowadays, billions of people use the Web in connection with their daily needs. A significant part of these needs consists of search tasks that are usually addressed by search engines. Thus, daily search needs result in regular user engagement with a search engine. User engagement with web sites and services has been studied in various aspects, but there appear to be no studies of its regularity and periodicity. In this paper, we studied the periodicity of user engagement with a popular search engine by applying spectrum analysis to temporal sequences of different engagement metrics. We found periodicity patterns of user engagement and revealed classes of users whose periodicity patterns do not change over a long period of time. In addition, we used the spectrum series as metrics to evaluate search quality.
【Keywords】: DFT; periodicity; quality metrics; spectrum analysis; user engagement
【Paper Link】 【Pages】:37-46
【Authors】: Lihong Li ; Jinyoung Kim ; Imed Zitouni
【Abstract】: A standard approach to estimating online click-based metrics of a ranking function is to run it in a controlled experiment on live users. While reliable and popular in practice, configuring and running an online experiment is cumbersome and time-intensive. In this work, inspired by recent successes of offline evaluation techniques for recommender systems, we study an alternative that uses historical search logs to reliably predict online click-based metrics of a new ranking function, without actually running it on live users. To tackle novel challenges encountered in Web search, variations of the basic techniques are proposed. The first is to take advantage of the diversified behavior of a search engine over a long period of time to simulate randomized data collection, so that our approach can be used at very low cost. The second is to replace exact matching (of recommended items in previous work) by fuzzy matching (of search result pages) to increase data efficiency, via a better trade-off of bias and variance. Extensive experimental results based on large-scale real search data from a major commercial search engine in the US market demonstrate our approach is promising and has potential for wide use in Web search.
【Keywords】: contextual bandits; evaluation; experimentation; information retrieval; web search
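The exact-matching baseline that the paper relaxes can be sketched as a simple log-replay estimator. This is the generic replay idea from the offline-evaluation literature, not the paper's fuzzy-matching method; `new_ranker` and the log record shape are illustrative assumptions:

```python
def replay_estimate(log, new_ranker):
    """Offline estimate of a new ranker's click metric by exact-match
    replay over a log of (query, shown_ranking, click) records: only
    records where the new ranker would have produced the same result
    page contribute to the estimate."""
    matched, clicks = 0, 0
    for query, shown, click in log:
        if new_ranker(query) == shown:
            matched += 1
            clicks += click
    return clicks / matched if matched else None
```

Exact matching discards most of the log (low data efficiency), which is precisely the motivation for the paper's fuzzy matching of search result pages.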
【Paper Link】 【Pages】:47-56
【Authors】: Giuseppe Ottaviano ; Nicola Tonellotto ; Rossano Venturini
【Abstract】: Inverted indexes are usually represented by dividing posting lists into constant-sized blocks and representing them with an encoder for sequences of integers. Different encoders yield different points in the space-time trade-off curve, with the fastest being several times larger than the most space-efficient. An important design decision for an index is thus the choice of the fastest encoding method such that the index fits in the available memory. However, a better usage of the space budget could be obtained by using faster encoders for frequently accessed blocks, and more space-efficient ones for those that are rarely accessed. To perform this choice optimally, we introduce a linear-time algorithm that, given a query distribution and a set of encoders, selects the best encoder for each index block to obtain the lowest expected query processing time while respecting a given space constraint. To demonstrate the effectiveness of this approach we perform an extensive experimental analysis, which shows that our algorithm produces indexes which are significantly faster than single-encoder indexes under several query processing strategies, while respecting the same space constraints.
【Keywords】: compression; inverted indexes; knapsack problems
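The trade-off the paper optimizes can be illustrated with a greedy heuristic: start every block on the fastest encoder, then repeatedly demote the block where saving space costs the least expected decode time. This is only an illustrative sketch, not the paper's optimal linear-time algorithm; the input shapes are assumptions:

```python
def assign_encoders(access_probs, encoders, space_budget):
    """Greedy per-block encoder choice.

    access_probs: access probability of each index block.
    encoders: (bytes_per_block, decode_time) pairs, sorted fastest
    first, with sizes strictly decreasing as decode time grows.
    Returns one encoder index per block."""
    choice = [0] * len(access_probs)            # start with the fastest encoder
    space = len(access_probs) * encoders[0][0]
    while space > space_budget:
        best, best_rate = None, float("inf")
        for b, p in enumerate(access_probs):
            c = choice[b]
            if c + 1 == len(encoders):
                continue                        # already on the smallest encoder
            d_space = encoders[c][0] - encoders[c + 1][0]
            d_time = p * (encoders[c + 1][1] - encoders[c][1])
            rate = d_time / d_space             # expected slowdown per byte saved
            if rate < best_rate:
                best, best_rate = b, rate
        if best is None:
            break                               # budget unreachable
        space -= encoders[choice[best]][0] - encoders[choice[best] + 1][0]
        choice[best] += 1
    return choice
```

As in the paper, rarely accessed blocks get demoted to compact encoders first, because their decode time barely contributes to the expected query processing time.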
【Paper Link】 【Pages】:57-66
【Authors】: Jiepu Jiang ; Ahmed Hassan Awadallah ; Xiaolin Shi ; Ryen W. White
【Abstract】: Understanding and estimating satisfaction with search engines is an important aspect of evaluating retrieval performance. Research to date has modeled and predicted search satisfaction on a binary scale, i.e., the searchers are either satisfied or dissatisfied with their search outcome. However, users' search experience is a complex construct and there are different degrees of satisfaction. As such, binary classification of satisfaction may be limiting. To the best of our knowledge, we are the first to study the problem of understanding and predicting graded (multi-level) search satisfaction. We examine sessions mined from search engine logs, where searcher satisfaction was also assessed on a multi-point scale by human annotators. Leveraging these search log data, we observe rich and non-monotonic changes in search behavior in sessions with different degrees of satisfaction. The findings suggest that we should predict finer-grained satisfaction levels. To address this issue, we model search satisfaction using features indicating search outcome, search effort, and changes in both outcome and effort during a session. We show that our approach can predict subtle changes in search satisfaction more accurately than state-of-the-art methods, affording greater insight into search satisfaction. The strong performance of our models has implications for search providers seeking to accurately measure satisfaction with their services.
【Keywords】: effort; evaluation; search satisfaction; session; utility
【Paper Link】 【Pages】:67-76
【Authors】: Pengyuan Wang ; Wei Sun ; Dawei Yin ; Jian Yang ; Yi Chang
【Abstract】: As the online advertising industry has evolved into an age of diverse ad formats and delivery channels, users are exposed to complex ad treatments involving various ad characteristics. The diversity and generality of ad treatments call for accurate and causal measurement of ad effectiveness, i.e., how the ad treatment causes the changes in outcomes without the confounding effect of user characteristics. Various causal inference approaches have been proposed to measure the causal effect of ad treatments. However, most existing causal inference methods focus on univariate and binary treatments and are not well suited for complex ad treatments. Moreover, to be practical in the data-rich online environment, the measurement needs to be highly general and efficient, which is not addressed in conventional causal inference approaches. In this paper we propose a novel causal inference framework for assessing the impact of general advertising treatments. Our new framework enables analysis of uni- or multi-dimensional ad treatments, where each dimension (ad treatment factor) can be discrete or continuous. We prove that our approach is able to provide an unbiased estimation of ad effectiveness by controlling the confounding effect of user characteristics. The framework is computationally efficient by employing a tree structure that specifies the relationship between user characteristics and the corresponding ad treatment. This tree-based framework is robust to model misspecification and highly flexible with minimal manual tuning. To demonstrate the efficacy of our approach, we apply it to two advertising campaigns. In the first campaign we evaluate the impact of different ad frequencies, and in the second one we consider the synthetic ad effectiveness across TV and online platforms. Our framework successfully provides the causal impact of ads with different frequencies in both campaigns. Moreover, it shows that the ad frequency usually has a treatment effect cap, which naive estimation tends to over-estimate.
【Keywords】: advertising; causal inference; online strategy measurement
【Paper Link】 【Pages】:77-86
【Authors】: Silvio Lattanzi ; Yaron Singer
【Abstract】: The friendship paradox is a sociological phenomenon first discovered by Feld which states that individuals are likely to have fewer friends than their friends do, on average. This phenomenon has become common knowledge, has several interesting applications, and has also been observed in various data sets. In his seminal paper Feld provides an intuitive explanation by showing that in any graph the average degree of edges in the graph is an upper bound on the average degree of nodes. Despite the appeal of this argument, it does not prove the existence of the friendship paradox. In fact, it is easy to construct networks -- even with power law degree distributions -- where the ratio between the average degree of neighbors and the average degree of nodes is high, but all nodes have the exact same degree as their neighbors. Which models, then, explain the friendship paradox? In this paper we give a strong characterization that provides a formal understanding of the friendship paradox. We show that for any power law graph with exponent in (1,3), when every edge is rewired with constant probability, the friendship paradox holds, i.e. there is an asymptotic gap between the average degree of the sample of polylogarithmic size and the average degree of a random set of its neighbors of equal size. To examine this characterization on real data, we performed several experiments on social network data sets that complement our theoretical analysis. We also discuss the applications of our result to influence maximization.
【Keywords】: friendship paradox; influence maximization; social networks
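Feld's bound quoted in the abstract is easy to check on any edge list: the expected degree of a random friend (a uniformly random endpoint of a uniformly random edge) is sum(d^2)/sum(d), which is at least the plain mean degree. A minimal sketch:

```python
from collections import defaultdict

def friendship_paradox(edges):
    """Return (mean degree of a node, mean degree of a random friend)
    for an undirected edge list. Feld's argument guarantees the second
    value is >= the first in every graph."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    mean_deg = sum(deg.values()) / len(deg)
    # a random friend is degree-weighted, hence E[deg] = sum(d^2)/sum(d)
    mean_friend_deg = sum(d * d for d in deg.values()) / sum(deg.values())
    return mean_deg, mean_friend_deg
```

On a star graph the gap is stark (friends are usually the hub); on a regular graph the two means coincide, which is exactly the abstract's point that Feld's bound alone does not prove a paradox.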
【Paper Link】 【Pages】:87-96
【Authors】: Jiliang Tang ; Shiyu Chang ; Charu C. Aggarwal ; Huan Liu
【Abstract】: Signed network analysis has attracted increasing attention in recent years. This is in part because research on signed network analysis suggests that negative links have added value in the analytical process. A major impediment in their effective use is that most social media sites do not enable users to specify them explicitly. In other words, a gap exists between the importance of negative links and their availability in real data sets. Therefore, it is natural to explore whether one can predict negative links automatically from the commonly available social network data. In this paper, we investigate the novel problem of negative link prediction with only positive links and content-centric interactions in social media. We make a number of important observations about negative links, and propose a principled framework NeLP, which can exploit positive links and content-centric interactions to predict negative links. Our experimental results on real-world social networks demonstrate that the proposed NeLP framework can accurately predict negative links with positive links and content-centric interactions. Our detailed experiments also illustrate the relative importance of various factors to the effectiveness of the proposed framework.
【Keywords】: negative link prediction; negative links; signed social networks; social media
【Paper Link】 【Pages】:97-106
【Authors】: Ashwin Rajadesingan ; Reza Zafarani ; Huan Liu
【Abstract】: Sarcasm is a nuanced form of language in which individuals state the opposite of what is implied. With this intentional ambiguity, sarcasm detection has always been a challenging task, even for humans. Current approaches to automatic sarcasm detection rely primarily on lexical and linguistic cues. This paper aims to address the difficult task of sarcasm detection on Twitter by leveraging behavioral traits intrinsic to users expressing sarcasm. We identify such traits using the user's past tweets. We employ theories from behavioral and psychological studies to construct a behavioral modeling framework tuned for detecting sarcasm. We evaluate our framework and demonstrate its efficiency in identifying sarcastic tweets.
【Keywords】: behavioral modeling; sarcasm detection; social media
【Paper Link】 【Pages】:107-116
【Authors】: Shuai Gao ; Jun Ma ; Zhumin Chen
【Abstract】: Popularity prediction on microblogging platforms aims to predict the future popularity of a message based on its retweeting dynamics in the early stages. Existing works mainly focus on exploring effective features for prediction, while ignoring the underlying arrival process of retweets. Also, the effect of user activity variation on the retweeting dynamics in the early stages has been neglected. In this paper, we propose an extended reinforced Poisson process model with a time mapping process to model the retweeting dynamics and predict the future popularity. The proposed model explicitly characterizes the process through which a message gains its retweets, by capturing a power-law temporal relaxation function corresponding to the aging in the ability of the message to attract new retweets and an exponential reinforcement mechanism characterizing the "richer-get-richer" phenomenon. Further, we introduce the notion of weibo time and integrate a time mapping process into the proposed model to eliminate the effect of user activity variation. Extensive experiments on two Weibo datasets, with 10K and 18K messages respectively, well demonstrate the effectiveness of our proposed model in popularity prediction.
【Keywords】: microblogging platforms; popularity prediction; reinforced poisson process; retweeting dynamics
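The two ingredients the abstract names, power-law aging and exponential reinforcement, combine into a retweet arrival intensity that can be sketched directly. The functional forms follow the abstract's description, but the parameters here are hypothetical, not fitted values from the paper:

```python
import math

def intensity(t, n_retweets, c=1.0, gamma=0.5, alpha=0.05):
    """Retweet arrival intensity at time t (hours since posting),
    given n_retweets arrivals so far.

    c: base attractiveness of the message (hypothetical scale).
    gamma: power-law aging exponent (attention decays over time).
    alpha: exponential reinforcement strength (richer-get-richer).
    """
    relaxation = (1.0 + t) ** (-gamma)            # power-law temporal aging
    reinforcement = math.exp(alpha * n_retweets)  # each retweet boosts the rate
    return c * relaxation * reinforcement
```

The tension between the two factors is what makes prediction interesting: aging pulls the rate down while every new retweet pushes it up, and the paper's time-mapping ("weibo time") additionally warps t to cancel platform-wide activity rhythms.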
【Paper Link】 【Pages】:117-126
【Authors】: Jialu Liu ; Charu C. Aggarwal ; Jiawei Han
【Abstract】: The problem of community detection has recently been studied widely in the context of the web and social media networks. Most algorithms for community detection assume that the entire network is available for online analysis. In practice, this is not really true, because only restricted portions of the network may be available at any given time for analysis. Many social networks such as Facebook have privacy constraints, which do not allow the discovery of the entire structure of the social network. Even in the case of more open networks such as Twitter, it may often be challenging to crawl the entire network from a practical perspective. For many other scenarios such as adversarial networks, the discovery of the entire network may itself be a costly task, and only a small portion of the network may be discovered at any given time. Therefore, it can be useful to investigate whether network mining algorithms can integrate the network discovery process tightly into the mining process, so that the best results are achieved for particular constraints on discovery costs. In this context, we discuss algorithms for integrating community detection with network discovery. We tightly integrate the cost of actually discovering the network with the community detection process, so that the two processes can support each other and are performed in a mutually cohesive way. We present experimental results illustrating the advantages of the approach.
【Keywords】: community detection; network discovery
【Paper Link】 【Pages】:127-136
【Authors】: David Flatow ; Mor Naaman ; Ke Eddie Xie ; Yana Volkovich ; Yaron Kanza
【Abstract】: Social media users share billions of items per year, only a small fraction of which is geotagged. We present a data-driven approach for identifying non-geotagged content items that can be associated with a hyper-local geographic area by modeling the location distributions of n-grams that appear in the text. We explore the trade-off between accuracy and coverage of this method. Further, we explore differences across content received from multiple platforms and devices, and show, for example, that content shared via different sources and applications produces significantly different geographic distributions, and that it is preferable to model and predict location for items according to their source. Our findings show the potential and the bounds of a data-driven approach to assigning location data to short social media texts, and offer implications for all applications that use data-driven approaches to locate content.
【Keywords】: geotagging; location-based services; social media
【Paper Link】 【Pages】:137-138
【Authors】: Thorsten Joachims
【Abstract】: The ability to learn from user interactions can give systems access to unprecedented amounts of world knowledge. This is already evident in search engines, recommender systems, and electronic commerce, and other applications are likely to follow in the near future (e.g., education, smart homes). More generally, the ability to learn from user interactions promises pathways for solving knowledge-intensive tasks ranging from natural language understanding to autonomous robotics. Learning from user interactions, however, means learning from data that does not necessarily fit the assumptions of the standard machine learning models. Since interaction data consists of the choices that humans make, it has to be interpreted with respect to how humans make decisions, which is influenced by the decision context and constraints like human motivation and human abilities. In this talk, I argue that we need learning approaches that explicitly model user-interaction data as the result of human decision making. To this effect, the talk explores how integrating microeconomic models of human behavior into the learning process leads to new learning algorithms that have provable guarantees under verifiable assumptions and to learning systems that perform robustly in practice. These findings imply that the design space of such human-interactive learning systems encompasses not only the machine learning algorithm itself, but also the design of the interaction under an appropriate model of user behavior.
【Keywords】: explicit feedback; implicit feedback; machine learning; user models
【Paper Link】 【Pages】:139-148
【Authors】: Bin Bi ; Hao Ma ; Bo-June Paul Hsu ; Wei Chu ; Kuansan Wang ; Junghoo Cho
【Abstract】: Over the past few years, major web search engines have introduced knowledge bases to offer popular facts about people, places, and things on the entity pane next to regular search results. In addition to information about the entity searched by the user, the entity pane often provides a ranked list of related entities. To keep users engaged, it is important to develop a recommendation model that tailors the related entities to individual user interests. We propose a probabilistic Three-way Entity Model (TEM) that provides personalized recommendation of related entities using three data sources: knowledge base, search click log, and entity pane log. Specifically, TEM is capable of extracting hidden structures and capturing underlying correlations among users, main entities, and related entities. Moreover, the TEM model can also exploit the click signals derived from the entity pane log. We further provide an inference technique to learn the parameters in TEM, and propose a principled preference learning method specifically designed for ranking related entities. Extensive experiments with two real-world datasets show that TEM with our probabilistic framework significantly outperforms a state of the art baseline, confirming the effectiveness of TEM and our probabilistic framework in related entity recommendation.
【Keywords】: entity pane; recommender systems; related entities; three-way entity model
【Paper Link】 【Pages】:149-158
【Authors】: Yuxiao Dong ; Reid A. Johnson ; Nitesh V. Chawla
【Abstract】: Scientific impact plays a central role in the evaluation of the output of scholars, departments, and institutions. A widely used measure of scientific impact is citations, with a growing body of literature focused on predicting the number of citations obtained by any given publication. The effectiveness of such predictions, however, is fundamentally limited by the power-law distribution of citations, whereby publications with few citations are extremely common and publications with many citations are relatively rare. Given this limitation, in this work we instead address a related question asked by many academic researchers in the course of writing a paper, namely: "Will this paper increase my h-index?" Using a real academic dataset with over 1.7 million authors, 2 million papers, and 8 million citation relationships from the premier online academic service ArnetMiner, we formalize a novel scientific impact prediction problem to examine several factors that can drive a paper to increase the primary author's h-index. We find that the researcher's authority on the publication topic and the venue in which the paper is published are crucial factors to the increase of the primary author's h-index, while the topic popularity and the co-authors' h-indices are of surprisingly little relevance. By leveraging relevant factors, we find a greater than 87.5% potential predictability for whether a paper will contribute to an author's h-index within five years. As a further experiment, we generate a self-prediction for this paper, estimating that there is a 76% probability that it will contribute to the h-index of the co-author with the highest current h-index in five years. We conclude that our findings on the quantification of scientific impact can help researchers to expand their influence and more effectively leverage their position of "standing on the shoulders of giants."
【Keywords】: citation prediction; popularity prediction; science of science; scientific impact
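The target variable of the prediction task above is mechanical to compute: the h-index, and whether one more paper with a predicted citation count would raise it. A minimal sketch of that bookkeeping (not the paper's prediction model, which learns this outcome from author, topic, and venue features):

```python
def h_index(citations):
    """h-index: the largest h such that the author has h papers with
    at least h citations each."""
    cites = sorted(citations, reverse=True)
    h = 0
    while h < len(cites) and cites[h] >= h + 1:
        h += 1
    return h

def raises_h_index(citations, predicted_cites):
    """Would a new paper with the predicted citation count increase
    the author's h-index? This is the binary outcome the paper
    predicts five years out."""
    return h_index(citations + [predicted_cites]) > h_index(citations)
```

The definition also explains the paper's finding that co-authors' h-indices matter little: only citations crossing the author's own current h threshold can move the index.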
【Paper Link】 【Pages】:159-168
【Authors】: Yiming Yang ; Hanxiao Liu ; Jaime G. Carbonell ; Wanli Ma
【Abstract】: This paper addresses an open challenge in educational data mining, i.e., the problem of using observed prerequisite relations among courses to learn a directed universal concept graph, and using the induced graph to predict unobserved prerequisite relations among a broader range of courses. This is particularly useful to induce prerequisite relations among courses from different providers (universities, MOOCs, etc.). We propose a new framework for inference within and across two graphs---at the course level and at the induced concept level---which we call Concept Graph Learning (CGL). In the training phase, our system projects the course-level links onto the concept space to induce directed concept links; in the testing phase, the concept links are used to predict (unobserved) prerequisite links for test-set courses within the same institution or across institutions. The dual mappings enable our system to perform an interlingua-style transfer learning, e.g. treating the concept graph as the interlingua, and inducing prerequisite links in a transferable manner across different universities. Experiments on our newly collected data sets of courses from MIT, Caltech, Princeton and CMU show promising results, including the viability of CGL for transfer learning.
【Keywords】: multi-scale directed graph learning; new inference algorithms; online education; transfer learning
【Paper Link】 【Pages】:169-178
【Authors】: Thanh-Son Nguyen ; Hady Wirawan Lauw ; Panayiotis Tsaparas
【Abstract】: Micro-reviews are a new type of user-generated content arising from the prevalence of mobile devices and social media in the past few years. Micro-reviews are bite-size reviews (usually under 200 characters), commonly posted on social media or check-in services, using a mobile device. They capture the immediate reaction of users, and they are rich in information, concise, and to the point. However, the abundance of micro-reviews, and their telegraphic nature, make it increasingly difficult to go through them and extract the useful information, especially on a mobile device. In this paper, we address the problem of summarizing the micro-reviews of an entity, such that the summary is representative, compact, and readable. We formulate the summarization problem as that of synthesizing a new ``review'' using snippets of full-text reviews. To produce a summary that naturally balances compactness and representativeness, we work within the Minimum Description Length framework. We show that finding the optimal summary is NP-hard, and we consider approximation and heuristic algorithms. We perform a thorough evaluation of our methodology on real-life data collected from Foursquare and Yelp. We demonstrate that our summaries outperform individual reviews, as well as existing summarization approaches.
【Keywords】: micro-review summarization; review synthesis
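Since the optimal summary is NP-hard, a natural heuristic in this family greedily picks the snippet that covers the most not-yet-covered micro-reviews per character of summary length. This is an illustrative MDL-flavored sketch under an assumed input shape, not the paper's algorithm:

```python
def greedy_summary(snippets, micro_reviews, budget):
    """Greedy snippet selection balancing coverage and compactness.

    snippets: list of (text, covered_ids) where covered_ids is the set
    of micro-review ids the snippet expresses.
    micro_reviews: set of all micro-review ids to represent.
    budget: maximum total summary length in characters."""
    chosen, covered, length = [], set(), 0
    while True:
        best, best_gain = None, 0.0
        for text, covers in snippets:
            if text in chosen or length + len(text) > budget:
                continue
            gain = len((covers & micro_reviews) - covered) / len(text)
            if gain > best_gain:
                best, best_gain = (text, covers), gain
        if best is None:
            return chosen
        chosen.append(best[0])
        covered |= best[1] & micro_reviews
        length += len(best[0])
```

Normalizing coverage gain by snippet length is what encodes the MDL intuition: a snippet earns its place only if the micro-reviews it represents "pay for" the characters it adds to the summary.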
【Paper Link】 【Pages】:179-188
【Authors】: Roi Blanco ; Giuseppe Ottaviano ; Edgar Meij
【Abstract】: Entity linking deals with identifying entities from a knowledge base in a given piece of text and has become a fundamental building block for web search engines, enabling numerous downstream improvements from better document ranking to enhanced search results pages. A key problem in the context of web search queries is that this process needs to run under severe time constraints as it has to be performed before any actual retrieval takes place, typically within milliseconds. In this paper we propose a probabilistic model that leverages user-generated information on the web to link queries to entities in a knowledge base. There are three key ingredients that make the algorithm fast and space-efficient. First, the linking process ignores any dependencies between the different entity candidates, which allows for an O(k^2) implementation in the number of query terms. Second, we leverage hashing and compression techniques to reduce the memory footprint. Finally, to equip the algorithm with contextual knowledge without sacrificing speed, we factor the distance between distributional semantics of the query words and entities into the model. We show that our solution significantly outperforms several state-of-the-art baselines by more than 14% while being able to process queries in sub-millisecond times---at least two orders of magnitude faster than existing systems.
【Keywords】: entity linking; queries; web search; wikipedia
【Paper Link】 【Pages】:189-198
【Authors】: Isac S. Ribeiro ; Rodrygo L. T. Santos ; Marcos André Gonçalves ; Alberto H. F. Laender
【Abstract】: Building expertise profiles is a crucial step towards identifying experts in different knowledge areas. However, summarizing the topics of expertise of a given individual is a challenging task, primarily due to the semi-structured and heterogeneous nature of the documentary evidence available for this task. In this paper, we investigate the suitability of tag recommendation as a mechanism to produce effective expertise profiles. In particular, we perform a large-scale user study with academic experts from different knowledge areas to assess the effectiveness of multiple supervised and unsupervised tag recommendation approaches as well as multiple sources of textual evidence. Our analysis reveals that traditional content-based tag recommenders perform well at identifying expertise-oriented tags, with article keywords being a particularly effective source of evidence across profiles in different knowledge areas and with various levels of sparsity. Moreover, by combining multiple recommenders and sources of evidence as learning signals, we further demonstrate the effectiveness of tag recommendation for expertise profiling.
【Keywords】: expertise profiling; learning to rank; tag recommendation
【Paper Link】 【Pages】:199-208
【Authors】: Yao Wu ; Martin Ester
【Abstract】: Aspect-based opinion mining from online reviews has attracted a lot of attention recently. Given a set of reviews, the main task of aspect-based opinion mining is to extract major aspects of the items and to infer the latent aspect ratings from each review. However, users may have different preferences which might lead to different opinions on the same aspect of an item. Even if fine-grained aspect rating analysis is provided for each review, it is still difficult for a user to judge whether a specific aspect of an item meets his own expectation. In this paper, we study the problem of estimating personalized sentiment polarities on different aspects of the items. We propose a unified probabilistic model called Factorized Latent Aspect ModEl (FLAME), which combines the advantages of collaborative filtering and aspect-based opinion mining. FLAME learns users' personalized preferences on different aspects from their past reviews, and predicts users' aspect ratings on new items by collective intelligence. Experiments on two online review datasets show that FLAME outperforms state-of-the-art methods on the tasks of aspect identification and aspect rating prediction.
【Keywords】: collaborative filtering; opinion mining; text mining
【Paper Link】 【Pages】:209-210
【Authors】: Juchao Zhuo ; Zeqian Huang ; Yunfeng Liu ; Zhanhui Kang ; Xun Cao ; Mingzhi Li ; Long Jin
【Abstract】: In recent years, with the growth of smartphones and applications, the app market has become an important mobile internet portal. As a core function of an application market, app search has attracted considerable attention. However, the mismatch between queries and apps is the most critical problem in app search, because the short texts involved leave term-matching search engines little to work with. In this talk, we describe a semantic matching architecture for app search that mines topics and tags from big data. It enriches query and app representations with topics and tags to achieve semantic matching in search. Several challenges must be addressed: 1) how to extract tag-app relationships from large volumes of web text; 2) how to use machine learning techniques for de-noising and confidence estimation; 3) how to blend the rankings of apps retrieved by different matching methods. We introduce some of our related work on these challenges and use it to illustrate how semantic matching is used in Tencent MyApp, an application market serving hundreds of millions of users.
【Keywords】: APP search; APP store; semantic matching; tag; topic model
【Paper Link】 【Pages】:211-212
【Authors】: Kaihua Zhu
【Abstract】: Recent years have witnessed dramatic changes in how people interact with search engines. Search engines are expected to be more intelligent in understanding users' intentions and fulfilling users' needs with direct answers rather than raw information. Furthermore, search engines are expected to be equipped with recommendation and dialogue capabilities, making the interaction with users more natural and smooth. In this talk, I will introduce Baidu's work on making some of these capabilities a reality through the deep understanding of users, queries and web pages, and discuss the challenges behind these technologies.
【Keywords】: deep question answering; dialogue; recommendation
【Paper Link】 【Pages】:213-222
【Authors】: Ravi Kumar ; Mohammad Mahdian ; Bo Pang ; Andrew Tomkins ; Sergei Vassilvitskii
【Abstract】: In this work we study the dynamics of geographic choice, i.e., how users choose one from a set of objects in a geographic region. We postulate a model in which an object is selected from a slate of candidates with probability that depends on how far it is (distance) and how many closer alternatives exist (rank). Under a discrete choice formulation, we argue that there exists a factored form in which unknown functions of rank and distance may be combined to produce an accurate estimate of the likelihood that a user will select each alternative. We then learn these hidden functions and show that each can be closely approximated by an appropriately parameterized lognormal, even though the respective marginals look quite different. We give a theoretical justification to support the presence of lognormal distributions. We then apply this framework to study restaurant choices in map search logs. We show that a four-parameter model based on combinations of lognormals has excellent performance at predicting restaurant choice, even compared to baseline models with access to the full (densely parameterized) marginal distribution for rank and distance. Finally, we show how this framework can be extended to simultaneously learn a per-restaurant quality score representing the residual likelihood of choice after distance and rank have been accounted for. We show that, compared to a per-place score that predicts likelihood without factoring out rank and distance, our score is a significantly better predictor of user quality judgments.
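As an illustration of the factored form described above, the sketch below scores each candidate by the product of a lognormal density of its rank and a lognormal density of its distance, then normalizes over the slate. The parameter values and function names are assumptions for illustration, not fitted values from the paper.

```python
import math

# Hedged sketch of a factored choice model: P(choose i) is proportional to
# f(rank_i) * g(dist_i), with f and g taken as lognormal densities.
# The (mu, sigma) parameters below are illustrative, not the paper's fits.

def lognormal_pdf(x, mu, sigma):
    """Density of a lognormal distribution at x > 0."""
    return math.exp(-(math.log(x) - mu) ** 2 / (2 * sigma ** 2)) / (
        x * sigma * math.sqrt(2 * math.pi))

def choice_probs(candidates, rank_params=(0.0, 1.0), dist_params=(1.0, 0.75)):
    """candidates: list of (rank, distance_km); returns normalized choice probabilities."""
    weights = [lognormal_pdf(r, *rank_params) * lognormal_pdf(d, *dist_params)
               for r, d in candidates]
    total = sum(weights)
    return [w / total for w in weights]
```

Under these illustrative parameters, a rank-1 restaurant 1 km away is chosen with higher probability than a rank-2 restaurant 3 km away, as one would expect from the rank and distance marginals.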
【Keywords】: driving directions; geographic choice; lognormal distributions; rank-based models; restaurant selection
【Paper Link】 【Pages】:223-232
【Authors】: Marios Kokkodis ; Panagiotis Papadimitriou ; Panagiotis G. Ipeirotis
【Abstract】: In an online labor marketplace, employers post jobs, receive freelancer applications and make hiring decisions. These hiring decisions are based on the freelancers' observed (e.g., education) and latent (e.g., ability) characteristics. Because of the heterogeneity of the observed characteristics and the existence of latent ones, identifying and hiring the best possible applicant is a very challenging task. In this work we study and model the employer's hiring behavior. We assume that employers are utility maximizers and make rational decisions by hiring the best possible applicant at hand. Based on this premise, we propose a series of probabilistic models that estimate the hiring probability of each applicant. We train and test our models on more than 600,000 job applications obtained from oDesk.com, and we show evidence that the proposed models outperform currently in-use baselines. To gain further insights, we conduct an econometric analysis and observe that the attributes most strongly correlated with the hiring probability are whether the freelancer and the employer have previously worked together, the available information on the freelancer's profile, the countries of the employer and the freelancer, and the skillset of the freelancer. Finally, we find that the sooner a freelancer applies to an opening, the higher the probability of getting the job.
【Keywords】: Bayesian network; econometric analysis; hiring decisions; online labor markets; ranking
【Paper Link】 【Pages】:233-242
【Authors】: Komal Kapoor ; Karthik Subbian ; Jaideep Srivastava ; Paul R. Schrater
【Abstract】: Recommendation methods have mainly dealt with the problem of recommending new items to the user, while users' visitation behavior toward familiar items (items they have consumed before) is little understood. In this paper, we analyze user activity streams and show that users' temporal consumption of familiar items is driven by boredom. Specifically, users move on to a different item when bored and return to the same item when their interest is restored. To model this behavior we include two latent psychological states of preference for items - sensitization and boredom. In the sensitization state the user is highly engaged with the item, while in the boredom state the user is disinterested. We model this behavior using a Hidden Semi-Markov Model for the gaps between user consumption activities. We show that our model performs much better than state-of-the-art temporal recommendation models at predicting the revisit time to an item, and we attribute this to two main reasons: (1) recommending items that are not in the bored state for the user, and (2) recommending items in which the user's interest has been restored.
【Keywords】: activity streams; boredom; dynamic preferences; temporal recommender systems
【Paper Link】 【Pages】:243-252
【Authors】: Honglei Zhuang ; Joel Young
【Abstract】: Data annotation bias is found in many situations. Often it can be ignored as just another component of the noise floor. However, it is especially prevalent in crowdsourcing tasks and must be actively managed. Annotation bias on single data items has been studied with regard to data difficulty, annotator bias, etc., while annotation bias on batches of multiple data items simultaneously presented to annotators has not been studied. In this paper, we verify the existence of "in-batch annotation bias" between data items in the same batch. We propose a factor graph based batch annotation model to quantitatively capture the in-batch annotation bias, and measure the bias during a crowdsourcing annotation process of inappropriate comments in LinkedIn. We discover that annotators tend to make polarized annotations for the entire batch of data items in our task. We further leverage the batch annotation model to propose a novel batch active learning algorithm. We test the algorithm on a real crowdsourcing platform and find that it outperforms in-batch bias naïve algorithms.
【Keywords】: active learning; annotation bias; crowdsourcing
【Paper Link】 【Pages】:253-262
【Authors】: Shuzi Niu ; Yanyan Lan ; Jiafeng Guo ; Xueqi Cheng ; Lei Yu ; Guoping Long
【Abstract】: Inferring a gold-standard ranking over a set of objects, such as documents or images, is a key task in building test collections for various applications like Web search and recommender systems. Crowdsourcing services provide an efficient and inexpensive way to collect judgments via labeling by sets of annotators. We thus study the problem of finding a consensus ranking from crowdsourced judgments. In contrast to conventional rank aggregation methods, which minimize the distance between the predicted ranking and the input judgments from either a pointwise or pairwise perspective, we argue that it is critical to consider the distance in a listwise way to emphasize position importance in ranking. Therefore, we introduce a new listwise approach in this paper, where ranking-measure-based objective functions are utilized for optimization. In addition, we also incorporate annotator quality into our model, since the reliability of annotators can vary significantly in crowdsourcing. For optimization, we transform the optimization problem into the Linear Sum Assignment Problem, and then solve it with a very efficient algorithm named CrowdAgg that guarantees the optimal solution. Experimental results on two benchmark data sets from different crowdsourcing tasks show that our algorithm is much more effective, efficient and robust than traditional methods.
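The assignment view of consensus ranking can be sketched as follows: placing object i at position p earns a position-discounted benefit that grows with how often annotators ranked i near the top, and the consensus is the assignment maximizing total benefit. The "topness" score, the DCG-style discount, and the brute-force permutation search below are illustrative assumptions; a Linear Sum Assignment solver (as in CrowdAgg) would replace the permutation search at realistic sizes.

```python
from itertools import permutations
import math

# Hedged sketch of consensus ranking as an assignment problem. The benefit
# matrix construction is illustrative, not the paper's objective.

def consensus_ranking(judged_rankings, n_items):
    """judged_rankings: per-annotator lists of item ids, best-first."""
    # topness[i] = accumulated reciprocal-rank credit for item i
    topness = [0.0] * n_items
    for ranking in judged_rankings:
        for pos, item in enumerate(ranking):
            topness[item] += 1.0 / (pos + 1)
    # DCG-style discount: top positions matter more (listwise emphasis)
    discount = [1.0 / math.log2(p + 2) for p in range(n_items)]
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n_items)):   # perm[p] = item at position p
        score = sum(discount[p] * topness[perm[p]] for p in range(n_items))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm)
```

For three annotators ranking three items, the consensus places items in decreasing order of accumulated credit, since pairing larger discounts with larger credits maximizes the sum.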
【Keywords】: crowdsourced labeling; evaluation measures; rank aggregation
【Paper Link】 【Pages】:263-272
【Authors】: Maria Daltayanni ; Luca de Alfaro ; Panagiotis Papadimitriou
【Abstract】: In online labor marketplaces two parties are involved: employers and workers. An employer posts a job in the marketplace to receive applications from interested workers. After evaluating the match to the job, the employer hires one (or more) workers to accomplish the job via an online contract. At the end of the contract, the employer can give the worker a rating that becomes visible in the worker's online profile. This form of explicit feedback guides future hiring decisions, since it is indicative of the worker's true ability. In this paper, we first discuss some of the shortcomings of existing reputation systems that are based on end-of-contract ratings. Then we propose a new reputation mechanism that uses Bayesian updates to combine employer implicit feedback signals in a link-analysis approach. The new system addresses the shortcomings of existing approaches, while yielding a better signal of worker quality for hiring decisions.
【Keywords】: crowdsourcing; link analysis; online markets; reputation systems
【Paper Link】 【Pages】:273-274
【Authors】: Lada A. Adamic
【Abstract】: Vast amounts of information are propagated in online social networks such as Facebook. This talk will describe several studies characterizing how information diffuses over social ties, from the growth of individual cascades to the predictability of their eventual size. It will also characterize the diffusion of specific kinds of information, including rumors, memes, and social movements.
【Keywords】: information diffusion; social networks
【Paper Link】 【Pages】:275-284
【Authors】: Ramanathan V. Guha ; Vineet Gupta ; Vivek Raghunathan ; Ramakrishnan Srikant
【Abstract】: We present a user modeling system that serves as the foundation of a personal assistant. The system ingests web search history for signed-in users, and identifies coherent contexts that correspond to tasks, interests, and habits. Unlike past work which focused on either in-session tasks or tasks over a few days, we look at several months of history in order to identify not just short-term tasks, but also long-term interests and habits. The features we use for identifying coherent contexts yield substantially higher precision and recall than past work. We also present an algorithm for identifying contexts that is 8 to 30 times faster than previous algorithms. The user modeling system has been deployed in production. It runs over hundreds of millions of users, and updates the models with a 10-minute latency. The contexts identified by the system serve as the foundation for generating recommendations in Google Now.
【Keywords】: personal assistant; user modeling
【Paper Link】 【Pages】:285-294
【Authors】: Ricardo A. Baeza-Yates ; Di Jiang ; Fabrizio Silvestri ; Beverly Harrison
【Abstract】: Given the large number of installed apps and the limited screen size of mobile devices, it is often tedious for users to search for the app they want to use. Although some mobile OSs provide categorization schemes that enhance the visibility of useful apps among those installed, the emerging category of homescreen apps aims to go one step further by automatically organizing the installed apps in a more intelligent and personalized way. In this paper, we study how to improve the homescreen app usage experience through a prediction mechanism that shows users which app they are going to use in the immediate future. The prediction technique is based on a set of features representing the real-time spatiotemporal contexts sensed by the homescreen app. We model the prediction of the next app as a classification problem and propose an effective personalized method to solve it that takes full advantage of both human-engineered features and automatically derived features. Furthermore, we study how to solve the two naturally associated cold-start problems: app cold-start and user cold-start. We conduct large-scale experiments on log data obtained from Yahoo Aviate, showing that our approach can accurately predict the next app that a person is going to use.
【Keywords】: aviate; machine learning; mobile app; prediction
【Paper Link】 【Pages】:295-304
【Authors】: Yuan Zhong ; Nicholas Jing Yuan ; Wen Zhong ; Fuzheng Zhang ; Xing Xie
【Abstract】: User profiling is crucial to many online services. Several recent studies suggest that demographic attributes are predictable from different online behavioral data, such as users' "Likes" on Facebook, friendship relations, and the linguistic characteristics of tweets. But location check-ins, as a bridge of users' offline and online lives, have by and large been overlooked in inferring user profiles. In this paper, we investigate the predictive power of location check-ins for inferring users' demographics and propose a simple yet general location to profile (L2P) framework. More specifically, we extract rich semantics of users' check-ins in terms of spatiality, temporality, and location knowledge, where the location knowledge is enriched with semantics mined from heterogeneous domains including both online customer review sites and social networks. Additionally, tensor factorization is employed to draw out low dimensional representations of users' intrinsic check-in preferences considering the above factors. Meanwhile, the extracted features are used to train predictive models for inferring various demographic attributes. We collect a large dataset consisting of profiles of 159,530 verified users from an online social network. Extensive experimental results based upon this dataset validate that: 1) Location check-ins are diagnostic representations of a variety of demographic attributes, such as gender, age, education background, and marital status; 2) The proposed framework substantially outperforms compared models for profile inference in terms of various evaluation metrics, such as precision, recall, F-measure, and AUC.
【Keywords】: demographics; location knowledge; prediction; spatiality; temporality; tensor factorization
【Paper Link】 【Pages】:305-314
【Authors】: Ning Chen ; Steven C. H. Hoi ; Shaohua Li ; Xiaokui Xiao
【Abstract】: With the popularity of smart phones and mobile devices, the number of mobile applications (a.k.a. "apps") has been growing rapidly. Detecting semantically similar apps from a large pool of apps is a basic and important problem, as it is beneficial for various applications, such as app recommendation, app search, etc. However, there is no systematic and comprehensive work so far that focuses on addressing this problem. In order to fill this gap, in this paper, we explore multi-modal heterogeneous data in app markets (e.g., description text, images, user reviews, etc.), and present "SimApp" -- a novel framework for detecting similar apps using machine learning. Specifically, it consists of two stages: (i) a variety of kernel functions are constructed to measure app similarity for each modality of data; and (ii) an online kernel learning algorithm is proposed to learn the optimal combination of similarity functions of multiple modalities. We conduct an extensive set of experiments on a real-world dataset crawled from Google Play to evaluate SimApp, from which the encouraging results demonstrate that SimApp is effective and promising.
【Keywords】: mobile applications; multi-modal data; multiple kernels; online kernel learning; similarity function
【Paper Link】 【Pages】:315-324
【Authors】: Bin Liu ; Deguang Kong ; Lei Cen ; Neil Zhenqiang Gong ; Hongxia Jin ; Hui Xiong
【Abstract】: Recent years have witnessed a rapid adoption of mobile devices and a dramatic proliferation of mobile applications (Apps for brevity). However, the large number of mobile Apps makes it difficult for users to locate relevant Apps. Therefore, recommending Apps becomes an urgent task. Traditional recommendation approaches focus on learning the interest of a user and the functionality of an item (e.g., an App) from a set of user-item ratings, and they recommend an item to a user if the item's functionality well matches the user's interest. However, Apps could have privileges to access a user's sensitive resources (e.g., contact, message, and location). As a result, a user chooses an App not only because of its functionality, but also because it respects the user's privacy preference. To the best of our knowledge, this paper presents the first systematic study on incorporating both interest-functionality interactions and users' privacy preferences to perform personalized App recommendations. Specifically, we first construct a new model to capture the trade-off between functionality and user privacy preference. Then we crawled a real-world dataset (16,344 users, 6,157 Apps, and 263,054 ratings) from Google Play and use it to comprehensively evaluate our model and previous methods. We find that our method consistently and substantially outperforms the state-of-the-art approaches, which implies the importance of user privacy preference on personalized App recommendations. Moreover, we explore the impact of different levels of privacy information on the performances of our method, which gives us insights on what resources are more likely to be treated as private by users and influence users' behaviors at selecting Apps.
【Keywords】: mobile apps; privacy and security; recommender systems
【Paper Link】 【Pages】:325-334
【Authors】: Mu Li ; Amr Ahmed ; Alexander J. Smola
【Abstract】: Inferring movement trajectories can be a challenging task, in particular when detailed tracking information is not available due to privacy and data collection constraints. In this paper we present a complete and computationally tractable model for estimating and predicting trajectories based on sparsely sampled, anonymous GPS landmarks that we call GPS snippets. To combat data sparsity we use mapping data as side information to constrain the inference process. We show the efficacy of our approach on a set of prediction tasks over data collected from different cities in the US.
【Keywords】: GPS; motion modeling; movement trajectories
【Paper Link】 【Pages】:335-336
【Authors】: Tushar Chandra
【Abstract】: This talk will focus on our experience in managing the complexity of Sibyl, a large scale machine learning system that is widely used within Google. We believe that a large fraction of the challenges faced by Sibyl are inherent to large scale production machine learning and that other production systems are likely to encounter them as well [1]. Thus, these challenges present interesting opportunities for future research. The Sibyl system is complex for a number of reasons. We have learnt that a complete end-to-end machine learning solution has to have subsystems to address a variety of different needs: data ingestion, data analysis, data verification, experimentation, model analysis, model serving, configuration, data transformations, support for different kinds of loss functions and modeling, machine learning algorithm implementations, etc. Machine learning algorithms themselves constitute a relatively small fraction of the overall system. Each subsystem consists of a number of distinct components to support the variety of product needs. For example, Sibyl supports more than 5 different model serving systems, each with its own idiosyncrasies and challenges. In addition, Sibyl configuration contains more lines of code than the core Sibyl learner itself. Finally, existing solutions for some of these challenges do not feel adequate, and we believe they present opportunities for future research. Though the overall system is complex, our users need to be able to deploy solutions quickly. This is because a machine learning deployment is typically an iterative process of model improvements. At each iteration, our users experiment with new features, find those that improve the model's prediction capability, and then "launch" a new model with those improved features. A user may go through 10 or more such productive launches.
Not only is speed of iteration crucial to our users, but they are often willing to sacrifice the improved prediction quality of a high quality but cumbersome system for the speed of iteration of a lower quality but nimble system. In this talk I will give an example of how simplification drives systems design and sometimes the design of novel algorithms.
【Keywords】: machine learning
【Paper Link】 【Pages】:337-338
【Authors】: Rong Ji
【Abstract】: Online display advertisement has been examined by numerous studies. Most online display ad systems take the greedy approach, namely they display, for each user, the set of ads that match best with the user's interests. One shortcoming of the greedy approach is that it does not take into account the budget limitation of each advertiser. As a result, we often observe that some ads are popular and match the interests of millions of users; but due to the budget restriction, these ads can only be presented a limited number of times, leading to suboptimal performance. To make our point clear, let's consider a simple case where we only have two advertisers (i.e., A and B) and two users (i.e., a and b). We assume that both advertisers have a budget of only one display. We further assume that user a is interested in both ads even though he is more interested in ad A, while user b is only interested in ad A. Now, if we take the greedy approach, we will always present ad A to user a; as a result, if user a comes before user b, we will have no appropriate ad to display for user b. On the other hand, if we take into account the budget limitation of both advertisers, a better approach is to present ad B to user a and ad A to user b. This simple example motivates us to develop a global optimization approach for online display advertisement that explicitly takes into account the budget limitations of advertisers when deciding the ad presentation for individual users. The key idea of the proposed approach is to compute a user-ad assignment matrix that maximizes the number of clicks under the budget constraints of individual advertisers. The main computational challenge is the number of variables to be optimized: since the numbers of users and advertisements involved in our system are one billion and tens of thousands, respectively, we need to estimate a matrix of one billion by tens of thousands of entries.
We address this challenge by converting the original optimization problem into its dual problem, in which the number of variables is reduced to only tens of thousands. A distributed computing algorithm, based on Nesterov's method and the map-reduce framework, was developed to efficiently solve the resulting optimization problem. We have observed that the proposed algorithm significantly improves the effectiveness of ad presentation compared to the greedy algorithm.
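The two-advertiser example above can be replayed in a few lines. The click probabilities and budgets below are illustrative assumptions; `greedy` serves each arriving user the best available ad, while `optimal` brute-forces the budget-feasible assignment maximizing total expected clicks, standing in for the large-scale dual optimization described in the abstract.

```python
from itertools import product

# Hedged sketch of the two-user / two-ad example: CLICK[u][a] is an assumed
# click probability, each ad's budget is one display.
CLICK = {"a": {"A": 0.9, "B": 0.6},   # user a likes both ads, prefers A
         "b": {"A": 0.8, "B": 0.0}}   # user b is only interested in ad A
BUDGET = {"A": 1, "B": 1}

def greedy(arrival_order):
    """Show each arriving user the best-matching ad with budget remaining."""
    left, total = dict(BUDGET), 0.0
    for user in arrival_order:
        avail = [ad for ad in left if left[ad] > 0 and CLICK[user][ad] > 0]
        if avail:
            ad = max(avail, key=lambda x: CLICK[user][x])
            left[ad] -= 1
            total += CLICK[user][ad]
    return total

def optimal():
    """Brute-force the budget-feasible assignment maximizing expected clicks."""
    users = list(CLICK)
    best = 0.0
    for assign in product(list(BUDGET) + [None], repeat=len(users)):
        shown = [a for a in assign if a is not None]
        if all(shown.count(ad) <= BUDGET[ad] for ad in BUDGET):
            best = max(best, sum(CLICK[u][a] for u, a in zip(users, assign)
                                 if a is not None))
    return best
```

With user a arriving first, greedy spends ad A on user a (0.9 expected clicks) and leaves nothing suitable for user b, while the budget-aware optimum shows B to a and A to b (1.4 expected clicks), reproducing the argument in the abstract.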
【Keywords】: online advertisement
【Paper Link】 【Pages】:339-348
【Authors】: Nam Khanh Tran ; Andrea Ceroni ; Nattiya Kanhabua ; Claudia Niederée
【Abstract】: Fully understanding an older news article requires context knowledge from the time of the article's creation. Finding information about such context is a tedious and time-consuming task that distracts the reader. Simple contextualization via Wikification is not sufficient here: the retrieved context information has to be time-aware, concise (not full Wikipedia pages) and focused on coherence with the article topic. In this paper, we present an approach for time-aware recontextualization that takes those requirements into account in order to improve the reading experience. For this purpose, we propose (1) different query formulation methods for retrieving contextualization candidates and (2) ranking methods that take into account topical and temporal relevance as well as complementarity with respect to the original text. We evaluate our proposed approaches through extensive experiments using real-world datasets and a ground truth consisting of over 9,400 article/context pairs. Our experimental results show that our approaches retrieve contextualization information for older articles from the New York Times Archive with high precision and significantly outperform the baselines.
【Keywords】: complementarity; interpretation; news; temporal context; time-aware re-contextualization; wikipedia
【Paper Link】 【Pages】:349-358
【Authors】: Alex Deng ; Victor Hu
【Abstract】: Online controlled experiments, also called A/B tests, are playing a central role in many data-driven web-facing companies. It is well known, and intuitively obvious to many practitioners, that when testing a feature with low coverage, analyzing all collected data without zooming into the part that could be affected by the treatment often leads to under-powered hypothesis testing. A common practice is to use triggered analysis. To estimate the overall treatment effect, a dilution formula is then applied to translate the estimated effect in the triggered analysis back to the original all-up population. In this paper, we discuss two different types of triggered analyses. We derive correct dilution formulas and show that for a set of widely used metrics, namely ratio metrics, correctly deriving and applying those dilution formulas is not trivial. We observe that many practitioners in the industry apply approximate or even wrong formulas when calculating effect dilution. To deal with that, instead of estimating the triggered treatment effect and then translating it using a dilution formula, we combine these two steps into one streamlined analysis, producing a more accurate estimate of the overall treatment effect together with even higher statistical power than a triggered analysis. The approach we propose in this paper is intuitive, easy to apply and general enough for all types of triggered analyses and all types of metrics.
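For intuition, the sketch below shows the textbook dilution step for a simple additive per-user metric: users outside the triggered population are unaffected, so the overall lift is the triggered lift scaled by the trigger rate. As the abstract stresses, the analogous translation for ratio metrics is not this simple; the formula here is the standard additive special case, not the paper's derivation.

```python
# Dilution for an additive per-user metric (e.g., sessions per user):
# untriggered users contribute zero treatment effect, so the all-up
# per-user effect is the triggered per-user effect times the trigger rate.
# This does NOT carry over to ratio metrics, which is the paper's point.

def diluted_effect(triggered_delta, trigger_rate):
    """Translate a per-user effect on triggered users to the all-up population."""
    return triggered_delta * trigger_rate

# Example: a +0.5 sessions/user lift among the 10% of users who saw the
# feature corresponds to a +0.05 sessions/user lift overall.
```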
【Keywords】: a/b testing; controlled experiment; dilution; feature coverage; metric; variance reduction
【Paper Link】 【Pages】:359-368
【Authors】: Ravi Kumar ; Andrew Tomkins ; Sergei Vassilvitskii ; Erik Vee
【Abstract】: We consider the problem of inferring choices made by users based only on aggregate data containing the relative popularity of each item. We propose a framework that models the problem as that of inferring a Markov chain given a stationary distribution. Formally, we are given a graph and a target steady-state distribution on its nodes. We are also given a mapping, drawn from a broad family of such mappings, from per-node scores to a transition matrix. The goal is to set the scores of each node such that the resulting transition matrix induces the desired steady state. We prove sufficient conditions under which this problem is feasible and, for the feasible instances, obtain a simple algorithm for a generic version of the problem. This iterative algorithm provably finds the unique solution to this problem and has a polynomial rate of convergence; in practice we find that the algorithm converges after fewer than ten iterations. We then apply this framework to choice problems in online settings and show that our algorithm is able to explain the observed data and predict user choices much better than other competing baselines across a variety of diverse datasets.
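A minimal sketch of the setup, under one assumed mapping from scores to transitions (move to neighbor j with probability proportional to score[j]): scores are rescaled multiplicatively until the chain's steady state, found by power iteration, matches the target distribution. The update rule below is an illustrative stand-in for the paper's algorithm and is only guaranteed to behave well on easy instances such as complete graphs.

```python
# Hedged sketch: fit per-node scores so that the induced Markov chain has a
# given stationary distribution. The mapping P(i -> j) ∝ score[j] over i's
# neighbors is one assumed member of the family of mappings; the multiplicative
# rescaling is illustrative, not the paper's provably convergent algorithm.

def steady_state(neighbors, score, iters=200):
    """Power iteration for the chain induced by the score-proportional mapping."""
    n = len(score)
    pi = [1.0 / n] * n
    for _ in range(iters):
        new = [0.0] * n
        for i in range(n):
            z = sum(score[j] for j in neighbors[i])
            for j in neighbors[i]:
                new[j] += pi[i] * score[j] / z
        pi = new
    return pi

def fit_scores(neighbors, target, iters=50):
    """Rescale each node's score by target/observed steady-state mass."""
    score = [1.0] * len(target)
    for _ in range(iters):
        pi = steady_state(neighbors, score)
        score = [s * t / p for s, t, p in zip(score, target, pi)]
    return score
```

On a complete graph with self-loops, the induced steady state is exactly the normalized score vector, so the loop converges in a couple of iterations; the abstract reports fewer than ten iterations in practice on real instances.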
【Keywords】: Markov chains; choice theory; steady-state distributions; transition matrix
【Paper Link】 【Pages】:369-378
【Authors】: Bhavana Bharat Dalvi ; Einat Minkov ; Partha Pratim Talukdar ; William W. Cohen
【Abstract】: While there has been much research on automatically constructing structured Knowledge Bases (KBs), most of it has focused on generating facts to populate a KB. However, a useful KB must go beyond facts. For example, glosses (short natural language definitions) have been found to be very useful in tasks such as Word Sense Disambiguation. However, the important problem of Automatic Gloss Finding, i.e., assigning glosses to entities in an initially gloss-free KB, is relatively unexplored. We address that gap in this paper. In particular, we propose GLOFIN, a hierarchical semi-supervised learning algorithm for this problem which makes effective use of limited amounts of supervision and available ontological constraints. To the best of our knowledge, GLOFIN is the first system for this task. Through extensive experiments on real-world datasets, we demonstrate GLOFIN's effectiveness. It is encouraging to see that GLOFIN outperforms other state-of-the-art SSL algorithms, especially in low supervision settings. We also demonstrate GLOFIN's robustness to noise through experiments on a wide variety of KBs, ranging from user contributed (e.g., Freebase) to automatically constructed (e.g., NELL). To facilitate further research in this area, we have made the datasets and code used in this paper publicly available.
【Keywords】: gloss finding; hierarchical learning; web mining
【Paper Link】 【Pages】:379-388
【Authors】: Oana Denisa Balalau ; Francesco Bonchi ; T.-H. Hubert Chan ; Francesco Gullo ; Mauro Sozio
【Abstract】: Finding dense subgraphs in large graphs is a key primitive in a variety of real-world application domains, encompassing social network analytics, event detection, biology, and finance. In most such applications, one typically aims at finding several (possibly overlapping) dense subgraphs which might correspond to communities in social networks or interesting events. While a large amount of work is devoted to finding a single densest subgraph, perhaps surprisingly, the problem of finding several dense subgraphs with limited overlap has not been studied in a principled way, to the best of our knowledge. In this work we define and study a natural generalization of the densest subgraph problem, where the main goal is to find at most $k$ subgraphs with maximum total aggregate density, while satisfying an upper bound on the pairwise Jaccard coefficient between the sets of nodes of the subgraphs. After showing that such a problem is NP-Hard, we devise an efficient algorithm that comes with provable guarantees in some cases of interest, as well as an efficient, practical heuristic. Our extensive evaluation on large real-world graphs confirms the efficiency and effectiveness of our algorithms.
【Keywords】: dense subgraph; graph algorithms; graph mining
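The standard building block for the single-subgraph case mentioned above is Charikar's greedy peeling 1/2-approximation (density = |E| / |V|); a minimal sketch of that baseline, not the paper's k-subgraph algorithm:

```python
import heapq
from collections import defaultdict

def densest_subgraph_peel(edges):
    """Greedy peeling: repeatedly remove the minimum-degree node and
    remember the densest intermediate subgraph seen along the way."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes = set(adj)
    m = sum(len(ns) for ns in adj.values()) // 2
    best_density, best_nodes = m / len(nodes), set(nodes)
    heap = [(len(adj[u]), u) for u in nodes]
    heapq.heapify(heap)
    removed = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in removed or d != len(adj[u]):
            continue  # stale heap entry
        removed.add(u)
        nodes.discard(u)
        m -= len(adj[u])
        for v in adj[u]:
            adj[v].discard(u)
            heapq.heappush(heap, (len(adj[v]), v))
        if nodes and m / len(nodes) > best_density:
            best_density, best_nodes = m / len(nodes), set(nodes)
    return best_nodes, best_density

# A 4-clique with a pendant vertex: the clique (density 6/4) wins.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
print(densest_subgraph_peel(edges))  # → ({1, 2, 3, 4}, 1.5)
```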
【Paper Link】 【Pages】:389-398
【Authors】: Bruno Ribeiro ; Christos Faloutsos
【Abstract】: How does a new startup drive the popularity of competing websites into oblivion like Facebook famously did to MySpace? This question is of great interest to academics, technologists, and financial investors alike. In this work we exploit the singular way in which Facebook wiped out the popularity of MySpace, Hi5, Friendster, and Multiply to guide the design of a new popularity competition model. Our model provides new insights into what Nobel Laureate Herbert A. Simon called the "marketplace of attention," which we recast as the attention-activity marketplace. Our model design is further substantiated by user-level activity of 250,000 MySpace users obtained between 2004 and 2009. The resulting model not only accurately fits the observed Daily Active Users (DAU) of Facebook and its competitors but also predicts their fate four years into the future.
【Keywords】: market of attention; non-equilibrium statistical physics; online social network competition forecast
【Paper Link】 【Pages】:399-408
【Authors】: Michael Röder ; Andreas Both ; Alexander Hinneburg
【Abstract】: Quantifying the coherence of a set of statements is a long-standing problem with many potential applications that has attracted researchers from different sciences. The special case of measuring the coherence of topics has recently been studied to remedy the problem that topic models give no guarantee on the interpretability of their output. Several benchmark datasets have been produced that record human judgements of the interpretability of topics. We are the first to propose a framework that allows constructing existing word-based coherence measures as well as new ones by combining elementary components. We conduct a systematic search of the space of coherence measures using all publicly available topic relevance data for the evaluation. Our results show that new combinations of components outperform existing measures with respect to correlation with human ratings. Finally, we outline how our results can be transferred to further applications in the context of text mining, information retrieval and the world wide web.
【Keywords】: topic coherence; topic evaluation; topic model
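One concrete point in the space of component combinations the abstract describes is the well-known UCI coherence: boolean-document probability estimation, pairwise PMI confirmation, and arithmetic-mean aggregation. A minimal sketch with an invented toy corpus:

```python
import math
from itertools import combinations

def pmi_coherence(topic_words, documents, eps=1e-12):
    """Boolean-document probability estimation + pairwise PMI
    confirmation + arithmetic-mean aggregation ('UCI' coherence)."""
    docsets = [set(d) for d in documents]
    n = len(docsets)
    def p(*words):
        # fraction of documents containing all the given words
        return sum(all(w in ds for w in words) for ds in docsets) / n
    pairs = list(combinations(topic_words, 2))
    return sum(math.log((p(w1, w2) + eps) / (p(w1) * p(w2) + eps))
               for w1, w2 in pairs) / len(pairs)

# A topically tight word set should score higher than a mixed one.
docs = [["cat", "dog", "pet"], ["cat", "dog"], ["stock", "market"],
        ["market", "trade"], ["cat", "pet"]]
print(round(pmi_coherence(["cat", "dog", "pet"], docs), 3))      # → 0.415
print(pmi_coherence(["cat", "dog", "pet"], docs) >
      pmi_coherence(["cat", "market", "trade"], docs))           # → True
```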
【Paper Link】 【Pages】:409-410
【Authors】: Hui Yang ; Marc Sloan ; Jun Wang
【Abstract】: In Dynamic Information Retrieval modeling we model dynamic systems which change or adapt over time or over a sequence of events, using a range of techniques from artificial intelligence and reinforcement learning. Many of the open problems in current IR research can be described as dynamic systems, for instance session search or computational advertising. State-of-the-art research provides solutions to these problems that are responsive to a changing environment, learn from past interactions and predict future utility. Advances in IR interfaces, personalization and ad display demand models that can react to users in real time and in an intelligent, contextual way. The objective of this half-day tutorial is to provide a comprehensive and up-to-date introduction to Dynamic Information Retrieval Modeling. We motivate a conceptual model linking static, interactive and dynamic retrieval and use this to define dynamics within the context of IR. We then cover a number of algorithms and techniques from the artificial intelligence (AI) and online learning literature, such as Markov Decision Processes (MDPs), their partially observable variant (POMDPs) and multi-armed bandits. Following this, we describe how to identify dynamics in an IR problem and demonstrate how to model them using the described techniques. The remainder of the tutorial covers an array of state-of-the-art research on dynamic systems in IR and how they can be modeled using dynamic IR. We use research on session search, multi-page search and online advertising as in-depth examples of such work. This tutorial is of relevance to IR practitioners and researchers; we will present the merits of dynamic information retrieval modeling and introduce the relevant techniques.
The content will be of particular interest to researchers working in the areas of statistical modeling, personalization and recommendation, and is also relevant to practitioners in Web search, online advertising and anyone who works with big data. After this tutorial, attendees will: Be able to identify the dynamics in an IR system; Be able to model these dynamics using techniques from AI and reinforcement learning; Have knowledge of the state-of-the-art research in dynamic information retrieval modeling.
【Keywords】: dynamic information retrieval modeling; probabilistic relevance model; reinforcement learning
【Paper Link】 【Pages】:411-412
【Authors】: Berkant Barla Cambazoglu ; Ricardo A. Baeza-Yates
【Abstract】: Commercial web search engines need to process thousands of queries every second and provide responses to user queries within a few hundred milliseconds. As a consequence of these tight performance constraints, search engines construct and maintain very large computing infrastructures for crawling the Web, indexing discovered pages, and processing user queries. The scalability and efficiency of these infrastructures require careful performance optimizations in every major component of the search engine. This tutorial aims to provide a fairly comprehensive overview of the scalability and efficiency challenges in large-scale web search engines. In particular, the tutorial provides an in-depth architectural overview of a web search engine, mainly focusing on the web crawling, indexing, and query processing components. The scalability and efficiency issues encountered in the above-mentioned components are presented at four different granularities: at the level of a single computer, a cluster of computers, a single data center, and a multi-center search engine. The tutorial also points at the open research problems and provides recommendations to researchers who are new to the field.
【Keywords】: caching; crawling; efficiency; indexing; query processing; scalability; web search engines
【Paper Link】 【Pages】:413-414
【Authors】: Lihong Li
【Abstract】: Evaluating and optimizing an interactive system (such as a search engine, recommender, or advertising system) from historical data against a predefined online metric is challenging, especially when that metric is computed from user feedback such as clicks and payments. The key challenge is counterfactual in nature: we only observe a user's feedback for actions taken by the system, but we do not know how that user would have reacted to a different action. The gold standard for evaluating such metrics of a user-interacting system is online A/B experiments (a.k.a. randomized controlled experiments), which can be expensive in terms of both time and engineering resources. Offline evaluation/optimization (sometimes referred to as off-policy learning in the literature) thus becomes critical, aiming to evaluate the same metrics without running (many) expensive A/B experiments on live users. One approach to offline evaluation is to build a user model that simulates user behavior (clicks, purchases, etc.) under various contexts, and then evaluate metrics of a system with this simulator. While straightforward and common in practice, the reliability of such model-based approaches depends heavily on how well the user model is built. Furthermore, it is often difficult to know a priori whether a user model is good enough to be trustworthy. Recent years have seen growing interest in another solution to the offline evaluation problem. Using statistical techniques like importance sampling and doubly robust estimation, the approach can give unbiased estimates of metrics for a wide range of problems. It enjoys other benefits as well. For example, it often allows data scientists to obtain a confidence interval for the estimate to quantify the amount of uncertainty, and it does not require building user models, so it is more robust and easier to apply. All these benefits make the approach particularly attractive to a wide range of problems.
Successful applications have been reported in the last few years by some of the industrial leaders. This tutorial gives a review of the basic theory and representative techniques. Applications of these techniques are illustrated through several case studies done at Microsoft and Yahoo!.
【Keywords】: advertising; contextual bandits; counterfactual analysis; information retrieval; interactive systems; offline evaluation; recommender systems; web search
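The importance-sampling idea referenced above can be sketched in a few lines; the two-ad bandit setup, click rates, and policy names below are invented purely for illustration.

```python
import random

def ips_estimate(logged, target_prob):
    """Inverse propensity scoring: reweight each logged reward by
    target_prob(action) / logging_propensity.  Unbiased as long as the
    logging propensities are correct and nonzero wherever the target
    policy puts mass."""
    return sum(r * target_prob(a) / p for a, r, p in logged) / len(logged)

# Toy setup: two ads; ad 0 is clicked with probability 0.5, ad 1 with
# probability 0.1.  Logs come from a uniform-random policy, so every
# record carries logging propensity 0.5.
random.seed(0)
logged = []
for _ in range(20000):
    a = random.randrange(2)
    r = 1 if random.random() < (0.5 if a == 0 else 0.1) else 0
    logged.append((a, r, 0.5))

# Estimate the value of "always show ad 0" (true value: 0.5) offline,
# without running a new A/B experiment.
always_ad0 = lambda a: 1.0 if a == 0 else 0.0
print(round(ips_estimate(logged, always_ad0), 2))  # close to 0.5
```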
【Paper Link】 【Pages】:415-416
【Authors】: Jun Wang ; Shuai Yuan
【Abstract】: In display and mobile advertising, the most significant development in recent years is Real-Time Bidding (RTB), which allows buying and selling a single ad impression at a time, in real time. RTB has fundamentally changed the landscape of digital marketing by scaling the buying process across a large number of available inventories. The demand for automation, integration and optimisation in RTB brings new research opportunities in the IR/DM/ML fields. However, despite its rapid growth and huge potential, many aspects of RTB remain unknown to the research community for many reasons. In this tutorial, together with invited distinguished speakers from the online advertising industry, we aim to bring insightful knowledge from real-world systems to bridge the gaps and provide an overview of the fundamental infrastructure, algorithms, and technical and research challenges of this new frontier of computational advertising. We will also introduce researchers to the datasets, tools, and platforms which are publicly available, so that they can get hands-on experience quickly.
【Keywords】: bidding algorithms; computational advertising; datasets; real-time bidding; revenue optimisation
【Paper Link】 【Pages】:417-418
【Authors】: Elad Yom-Tov ; Ingemar Johansson Cox ; Vasileios Lampos
【Abstract】: Surveys show that around 70% of US Internet users consult the Internet when they require medical information. People seek this information using both traditional search engines and via social media. The information created during the search process offers an unprecedented opportunity for applications to monitor and improve the quality of life of people with a variety of medical conditions. In recent years, research in this area has addressed public-health questions such as the effect of media on the development of anorexia, developed tools for measuring influenza rates and assessing drug safety, and examined the effects of health information on individual wellbeing. This tutorial will show how Internet data can facilitate medical research, providing an overview of the state-of-the-art in this area. During the tutorial we will discuss the information which can be gleaned from a variety of Internet data sources, including social media, search engines, and specialized medical websites. We will provide an overview of analysis methods used in recent literature, and show how results can be evaluated using publicly-available health information and online experimentation. Finally, we will discuss ethical and privacy issues and possible technological solutions. This tutorial is intended for researchers of user-generated content who are interested in applying their knowledge to improve health and medicine.
【Keywords】: information retrieval; machine learning; medicine; user-generated data
【Paper Link】 【Pages】:419-420
【Authors】: Silvio Lattanzi ; Vahab S. Mirrokni
【Abstract】: As a fundamental tool for modeling and analyzing social and information networks, large-scale graph mining is an important component of any tool set for big data analysis. Processing graphs with hundreds of billions of edges is only possible via distributed algorithms developed under distributed graph mining frameworks such as MapReduce, Pregel, Giraph, and the like. For these distributed algorithms to work well in practice, we need to take into account several metrics such as the number of rounds of computation and the communication complexity of each round. For example, given the popularity and ease of use of the MapReduce framework, developing practical algorithms with good theoretical guarantees for basic graph problems is of great importance. In this tutorial, we first discuss how to design and implement algorithms based on the traditional MapReduce architecture. In this regard, we discuss various basic graph-theoretic problems such as computing connected components, maximum matching, MST, triangle counting, and overlapping or balanced clustering. We describe a computation model for MapReduce and the sampling, filtering, local random walk, and core-set techniques used to develop efficient algorithms in this framework. At the end, we explore the possibility of employing other distributed graph processing frameworks. In particular, we study the effect of augmenting MapReduce with a distributed hash table (DHT) service and also discuss the use of a new graph processing framework called ASYMP based on asynchronous message passing. Finally, we will show that using ASYMP, one can improve CPU usage and achieve significantly better running times.
【Keywords】: large scale data-mining; mapreduce algorithms; parallel computing
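As one example of the round-oriented view of MapReduce graph algorithms, here is connected components via hash-min label propagation: each synchronous round corresponds to one MapReduce iteration (map: emit labels to neighbors; reduce: take the minimum). A single-machine sketch:

```python
from collections import defaultdict

def connected_components(edges):
    """Hash-min label propagation: every node starts labeled with its
    own id; each synchronous round replaces a node's label with the
    minimum over itself and its neighbors.  The number of rounds is
    exactly the metric the tutorial highlights."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    label = {u: u for u in adj}
    rounds = 0
    while True:
        new = {u: min([label[u]] + [label[v] for v in adj[u]])
               for u in adj}
        if new == label:
            break
        label, rounds = new, rounds + 1
    return label, rounds

edges = [(1, 2), (2, 3), (5, 6), (7, 7)]
label, rounds = connected_components(edges)
print(label, rounds)  # → {1: 1, 2: 1, 3: 1, 5: 5, 6: 5, 7: 7} 2
```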
【Paper Link】 【Pages】:421-422
【Authors】: Bin Gao ; Jiang Bian
【Abstract】: In recent years, deep learning has been a very hot topic in the machine learning community. It has brought breakthrough results in image classification and speech recognition. Most recently, researchers have also obtained promising results in natural language processing using deep learning techniques. As machine learning techniques are widely used in Web search and data mining applications, many researchers and practitioners are studying the possibility of applying recently developed deep learning techniques to these applications. Some have made very promising progress, and thus it is a good time to hold a workshop to discuss and share the problems and progress in using deep learning techniques to improve Web search and data mining tasks.
【Keywords】: data mining; deep learning; information retrieval; machine learning
【Paper Link】 【Pages】:423-424
【Authors】: Ke Zhou ; Roger Jie Luo ; Djoerd Hiemstra ; Joemon M. Jose
【Abstract】: The HIA'15 workshop aims to bring together information retrieval practitioners from industry and academic researchers concerned with heterogeneous information access and search federation. We would like to create a forum to encourage discussion and exchange of ideas on heterogeneous information access in different contexts. To facilitate the discussion, we encourage submissions on ideas and results from different aspects of heterogeneous information access including aggregated search, composite retrieval, personal search, structured search, etc. Another objective of the workshop is to encourage submissions with novel ideas (e.g. new applications) on heterogeneous information access and potential future directions of this area.
【Keywords】: aggregated search; composite retrieval; federated search; heterogeneous information access
【Paper Link】 【Pages】:425-426
【Authors】: Kaizhu Huang ; Haiqin Yang ; Irwin King ; Michael R. Lyu
【Abstract】: The SDA workshop at WSDM 2015 is the fifth International Workshop on Scalable Data Analytics, following the previous four SDA workshops held at IEEE Big Data 2013, PAKDD 2014, IEEE Big Data 2014, and IEEE ICDM 2014. This series of workshops aims to provide professionals, researchers, and technologists with a single forum where they can discuss and share the state-of-the-art theories and applications of scalable data analytics technologies. In the era of information explosion, the scientific, biomedical, and engineering research communities are undergoing a profound transformation in which discoveries and innovations increasingly rely on massive amounts of data. The volume, velocity, variety and veracity that characterize such massive data pose challenges to current data analytics techniques. The focus of the fifth SDA is to discuss how to scale up data analytics techniques for modeling and analyzing big data from various domains.
【Keywords】: big data; data analytics; scalable
【Paper Link】 【Pages】:427-428
【Authors】: Dawei Yin ; Chih-Chieh Hung ; Rui Li ; Yi Chang
【Abstract】: As web information grows exponentially and users' needs become more specific, traditional general web search engines can no longer fully satisfy user requirements. Vertical search engines have emerged in various domains; they focus on specific segments of online content, including local, shopping, medical information, and travel search. Vertical search engines are attracting more attention, and relevance ranking within them is becoming the key technology. In addition, vertical search results are often slotted into general web search results. Hence, designing effective ranking functions for vertical search has become practically important for improving users' experience in both web search and vertical search. The workshop brings together researchers from IR, ML, NLP, and other areas of computer and information science who are working on or interested in this area. It provides a forum for researchers to identify the issues and challenges, to share their latest research results, to express a diverse range of opinions about this topic, and to discuss future directions.
【Keywords】: ranking; vertical relevance; vertical search
【Paper Link】 【Pages】:429-434
【Authors】: Ekaterina Chernyak
【Abstract】: An approach to multi-label annotation of research papers is explored. We develop techniques for annotating/labeling research papers in informatics and computer science with key phrases taken from the ACM Computing Classification System. The techniques utilize a phrase-to-text relevance measure so that only the most relevant phrases go into the annotation. Three phrase-to-text relevance measures are experimentally compared in this setting: (a) the cosine relevance score between conventional vector-space representations of the texts with tf-idf weighting; (b) the popular BM25 term-generation probability score; and (c) an in-house measure, the conditional probability of symbols averaged over matching fragments in suffix trees representing texts and phrases (CPAMF). In an experiment conducted over a set of texts published in ACM journals and manually annotated by their authors, CPAMF outperforms both the cosine measure and BM25 by a wide margin.
【Keywords】: annotated suffix tree; phrase-to-text relevance; text labelling
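Measure (a) above is easy to make concrete; here is a minimal sketch of tf-idf cosine relevance between a paper's text and candidate phrases (the toy texts and the exact tf-idf weighting scheme are assumptions for illustration, and this is only the baseline, not CPAMF):

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """tf-idf vectors over a small corpus (raw tf, smoothed idf)."""
    n = len(texts)
    df = Counter(w for t in texts for w in set(t))
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    return [{w: c * idf[w] for w, c in Counter(t).items()} for t in texts]

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy data: rank candidate classification phrases against a paper's
# text by cosine relevance.
paper = "learning ranking functions for web search ranking".split()
phrases = ["learning to rank", "information systems", "web search ranking"]
vecs = tfidf_vectors([paper] + [p.split() for p in phrases])
scores = sorted(((cosine(vecs[0], v), p) for v, p in zip(vecs[1:], phrases)),
                reverse=True)
print(scores[0][1])  # → web search ranking
```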
【Paper Link】 【Pages】:435-440
【Authors】: Yongfeng Zhang
【Abstract】: Previous research on Recommender Systems (RS), especially the continuously popular approach of Collaborative Filtering (CF), has mostly focused on the information resource of explicit user numerical ratings or implicit (still numerical) feedback. However, the ever-growing availability of textual user reviews has become an important information resource, in which a wealth of explicit product attributes/features and user attitudes/sentiments are expressed. This information-rich resource has opened up brand-new approaches to solving many of the important problems that have been perplexing the research community for years, such as the cold-start paradox, the explanation of recommendations, and the automatic generation of user or item profiles. However, it is only recently that the fundamental importance of textual reviews has gained wide recognition, perhaps mainly because of the difficulty in formatting, structuring and analyzing free text. In this research, we stress the importance of incorporating textual reviews into recommendation through phrase-level sentiment analysis, and further investigate the role that such texts play in various important recommendation tasks.
【Keywords】: collaborative filtering; personalized recommendation; sentiment analysis; text mining
【Paper Link】 【Pages】:441-446
【Authors】: Mark Kibanov
【Abstract】: Ubiquitous computing is an emerging research area of computer science. Similarly, social network analysis and mining have become very important in recent years. We aim to combine these two research areas to explore the nature of processes happening around users. The presented research focuses on exploring and analyzing different groups of persons or entities (communities, clusters and classes), their stability and semantics. One example of ubiquitous social data is the social networks captured during scientific conferences using face-to-face RFID proximity tags; another is crowd-generated environmental sensor data. In this paper we generalize various problems connected to these and further datasets and consider them as a task of measuring group stability. Group stability can be used to improve state-of-the-art methods for analyzing data. We also aim to improve the performance of different data mining algorithms, e.g., by better handling data with a skewed density distribution. We describe significant results from experiments that show how the presented approach can be applied, and discuss the planned experiments.
【Keywords】: data mining; physical computing; ubiquitous systems
【Paper Link】 【Pages】:447-452
【Authors】: Duyu Tang
【Abstract】: In this paper, we propose a representation learning research framework for document-level sentiment analysis. Given a document as input, document-level sentiment analysis aims to automatically classify its sentiment/opinion (such as thumbs up or thumbs down) based on the textual information. Despite the success of feature engineering in many previous studies, hand-coded features do not capture the semantics of texts well. In this research, we argue that learning sentiment-specific semantic representations of documents is crucial for document-level sentiment analysis. We decompose the document semantics into four cascaded constituents: (1) word representation, (2) sentence structure, (3) sentence composition and (4) document composition. Specifically, we learn sentiment-specific word representations, which simultaneously encode the contexts of words and the sentiment supervision of texts into the continuous representation space. Following the principle of compositionality, we learn sentiment-specific sentence structures and sentence-level composition functions to produce the representation of each sentence from the representations of the words it contains. The semantic representations of documents are obtained through document composition, which leverages sentiment-sensitive discourse relations and sentence representations.
【Keywords】: deep learning; natural language processing; sentiment analysis
【Paper Link】 【Pages】:453-458
【Authors】: Zhuoren Jiang
【Abstract】: Scientific information recommendation is crucial for assisting scholars in their research. Citation recommendation is an important field of scientific recommendation, yet traditional approaches ignore the chronological nature of the task. In this study, I propose "Chronological Citation Recommendation," which assumes a user's initial information need can shift as they look for papers in different time slices. Specifically, I employ a supervised dynamic topic model to characterize the "time-varying" dynamics of content, and construct a novel heterogeneous graph that contains dynamic topic-based information, time-decayed citation information and word-based information. I apply different meta-paths for different ranking hypotheses, which carry different types of information for citation recommendation in different time slices as information needs shift. I plan to generate the final "Chronological Citation Recommendation" rankings by feature integration using Learning to Rank. "Chronological Citation Recommendation" will recommend time-series ranking lists based on the user's initial textual information need. Preliminary experiments on the ACM corpus show that chronological citation recommendation significantly improves citation recommendation performance.
【Keywords】: chronological citation recommendation; heterogeneous graph; information need shifting
【Paper Link】 【Pages】:459-464
【Authors】: Rishabh Mehrotra
【Abstract】: Accurately understanding a user's interests, preferences and behaviours is possibly one of the most critical research challenges in developing personalized systems for behavior targeting and information access. We intend to develop comprehensive latent variable models for web search personalization that jointly model a user's topical interests and click-based relevance preferences, while at the same time taking into account the user's intended search tasks along with information about other similar users. We further augment this model by incorporating topic-level relevance parameters, which, to the best of our knowledge, is the first attempt at modeling result-ranking preferences at the topic level. Additionally, we intend to explore the possibility of modeling users in terms of the search tasks they perform, thereby coupling users' topical interests with their search-task behavior to learn user representations. Finally, we wish to evaluate the proposition of extending user representations to hierarchical structures as an alternative to existing flat representations. The evaluation of these alternative approaches for user modeling is based on their performance on a variety of tasks such as collaborative query recommendation, user cohort modeling and search result personalization. This proposal motivates these research directions, summarizes the key research problems being targeted, glances through potential ways of tackling these challenges and highlights some initial results.
【Keywords】: latent variable models; personalization; search tasks; user modelling