9. WSDM 2016: San Francisco, CA, USA

Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, San Francisco, CA, USA, February 22-25, 2016. ACM 【DBLP Link】

Paper Num: 93 || Session Num: 17

Keynote Address 1

1. Large-Scale Deep Learning For Building Intelligent Computer Systems.

【Paper Link】 【Pages】:1

【Authors】: Jeffrey Dean

【Abstract】: For the past five years, the Google Brain team has focused on conducting research in difficult problems in artificial intelligence, on building large-scale computer systems for machine learning research, and, in collaboration with many teams at Google, on applying our research and systems to dozens of Google products. Our group has recently open-sourced the TensorFlow system (tensorflow.org), a system designed to easily express machine learning ideas, and to quickly train, evaluate and deploy machine learning systems. In this talk, I'll highlight some of the design decisions we made in building TensorFlow, discuss research results produced within our group, and describe ways in which these ideas have been applied to a variety of problems in Google's products, usually in close collaboration with other teams. This talk describes joint work with many people at Google.

【Keywords】: computer vision; deep learning; distributed systems; language understanding; machine learning; neural networks; speech recognition

Communities and Social Interaction 10

2. Who Will Reply to/Retweet This Tweet?: The Dynamics of Intimacy from Online Social Interactions.

【Paper Link】 【Pages】:3-12

【Authors】: Nicholas Jing Yuan ; Yuan Zhong ; Fuzheng Zhang ; Xing Xie ; Chin-Yew Lin ; Yong Rui

【Abstract】: Friendships are dynamic. Previous studies have converged to suggest that social interactions, in both online and offline social networks, are diagnostic reflections of friendship relations (also called social ties). However, most existing approaches consider a social tie as either a binary relation, or a fixed value (named tie strength). In this paper, we investigate the dynamics of dyadic friend relationships through online social interactions, in terms of a variety of aspects, such as reciprocity, temporality, and contextuality. In turn, we propose a model to predict repliers and retweeters given a particular tweet posted at a certain time in a microblog-based social network. More specifically, we have devised a learning-to-rank approach to train a ranker that considers elaborate user-level and tweet-level features (like sentiment, self-disclosure, and responsiveness) to address these dynamics. In the prediction phase, a tweet posted by a user is deemed a query and the predicted repliers/retweeters are retrieved using the learned ranker. We have collected a large dataset containing 73.3 million dyadic relationships with their interactions (replies and retweets). Extensive experimental results based on this dataset show that by incorporating the dynamics of friendship relations, our approach significantly outperforms state-of-the-art models in terms of multiple evaluation metrics, such as MAP, NDCG and Topmost Accuracy. In particular, the advantage of our model is even more promising in predicting the exact sequence of repliers/retweeters considering their orders. Furthermore, the proposed approach provides emerging implications for many high-value applications in online social networks.

【Keywords】: dynamic tie strength; friendship relation; online social interaction

3. Cross-modality Consistent Regression for Joint Visual-Textual Sentiment Analysis of Social Multimedia.

【Paper Link】 【Pages】:13-22

【Authors】: Quanzeng You ; Jiebo Luo ; Hailin Jin ; Jianchao Yang

【Abstract】: Sentiment analysis of online user generated content is important for many social media analytics tasks. Researchers have largely relied on textual sentiment analysis to develop systems to predict political elections, measure economic indicators, and so on. Recently, social media users are increasingly using additional images and videos to express their opinions and share their experiences. Sentiment analysis of such large-scale textual and visual content can help better extract user sentiments toward events or topics. Motivated by the needs to leverage large-scale social multimedia content for sentiment analysis, we propose a cross-modality consistent regression (CCR) model, which is able to utilize both the state-of-the-art visual and textual sentiment analysis techniques. We first fine-tune a convolutional neural network (CNN) for image sentiment analysis and train a paragraph vector model for textual sentiment analysis. On top of them, we train our multi-modality regression model. We use sentimental queries to obtain half a million training samples from Getty Images. We have conducted extensive experiments on both machine weakly labeled and manually labeled image tweets. The results show that the proposed model can achieve better performance than the state-of-the-art textual and visual sentiment analysis algorithms alone.

【Keywords】: cross-modality regression; multimodality analysis; sentiment analysis
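
As a rough illustration of the consistency idea above, the sketch below (in PyTorch) pairs a per-modality regressor with a penalty that ties the two sentiment predictions together on the same post. The architecture, the loss weight `lam`, and the feature dimensions are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of a cross-modality consistent regression objective: each modality
# gets its own regressor onto a shared sentiment score, plus a penalty that
# keeps the two predictions consistent on the same post.
import torch
import torch.nn as nn

class CCR(nn.Module):
    def __init__(self, d_img, d_txt):
        super().__init__()
        self.f_img = nn.Linear(d_img, 1)   # on top of CNN image features
        self.f_txt = nn.Linear(d_txt, 1)   # on top of paragraph vectors

    def loss(self, x_img, x_txt, y, lam=0.5):
        s_img, s_txt = self.f_img(x_img), self.f_txt(x_txt)
        fit = ((s_img - y) ** 2 + (s_txt - y) ** 2).mean()
        consistency = ((s_img - s_txt) ** 2).mean()   # cross-modality term
        return fit + lam * consistency

model = CCR(d_img=4096, d_txt=300)
x_i, x_t = torch.randn(16, 4096), torch.randn(16, 300)
y = torch.rand(16, 1)                      # toy sentiment labels in [0, 1]
model.loss(x_i, x_t, y).backward()
```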

4. How Relevant is the Irrelevant Data: Leveraging the Tagging Data for a Learning-to-Rank Model.

【Paper Link】 【Pages】:23-32

【Authors】: Noor Ifada ; Richi Nayak

【Abstract】: For the task of tag-based item recommendations, the underlying tensor model faces several challenges such as high data sparsity and inferring latent factors effectively. To overcome the inherent sparsity issue of tensor models, we propose the graded-relevance interpretation scheme that leverages the tagging data effectively. Unlike the existing schemes, the graded-relevance scheme interprets the tagging data richly, differentiates the non-observed tagging data insightfully, and annotates each entry with one of the "relevant", "likely relevant", "irrelevant", or "indecisive" labels. To infer the latent factors of tensor models correctly and produce high-quality recommendations, we develop a novel learning-to-rank method, Go-Rank, that optimizes Graded Average Precision (GAP). Evaluating the proposed method on real-world datasets, we show that the proposed interpretation scheme produces a denser tensor model by revealing "relevant" entries among the previously assumed "irrelevant" entries. Optimizing GAP as the ranking metric, the quality of the recommendations generated by Go-Rank is found to be superior to that of the benchmarking methods.

【Keywords】: graded average precision; graded-relevance scheme; item recommendation; tagging data
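
For readers unfamiliar with the metric Go-Rank optimizes, the sketch below computes Graded Average Precision under one common reading of its definition: the expectation of binary Average Precision over a distribution of grade thresholds. The threshold probabilities and the toy grades are illustrative assumptions, not values from the paper.

```python
# Illustrative computation of Graded Average Precision (GAP), treated here
# as the expected binary AP over a distribution of grade thresholds.

def average_precision(binary_rels):
    """Binary AP over a ranked list of 0/1 relevance labels."""
    hits, precisions = 0, []
    for rank, rel in enumerate(binary_rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def graded_average_precision(graded_rels, threshold_probs):
    """graded_rels: ranked list of integer grades (0 = non-relevant).
    threshold_probs: {g: P(user only accepts grade >= g)}, g >= 1."""
    return sum(p * average_precision([int(r >= g) for r in graded_rels])
               for g, p in threshold_probs.items())

# Toy ranking with grades 0-2; users split between lenient and strict.
print(graded_average_precision([2, 0, 1, 2], {1: 0.5, 2: 0.5}))  # ~0.778
```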

5. Quantifying Controversy in Social Media.

【Paper Link】 【Pages】:33-42

【Authors】: Kiran Garimella ; Gianmarco De Francisci Morales ; Aristides Gionis ; Michael Mathioudakis

【Abstract】: Which topics spark the most heated debates in social media? Identifying these topics is a first step towards creating systems which pierce echo chambers. In this paper, we perform a systematic methodological study of controversy detection using social media network structure and content. Unlike previous work, rather than identifying controversy in a single hand-picked topic using domain-specific knowledge, we focus on comparing topics in any domain. Our approach to quantifying controversy is a graph-based three-stage pipeline, which involves (i) building a conversation graph about a topic, which represents alignment of opinion among users; (ii) partitioning the conversation graph to identify potential sides of the controversy; and (iii) measuring the amount of controversy from characteristics of the graph. We perform an extensive comparison of controversy measures, as well as graph building approaches and data sources. We use both controversial and non-controversial topics on Twitter, as well as other external datasets. We find that our new random-walk-based measure outperforms existing ones in capturing the intuitive notion of controversy, and show that content features are vastly less helpful in this task.

【Keywords】: controversy; random walks; social media; twitter
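
A minimal sketch of the three-stage pipeline with a random-walk-style controversy score, assuming networkx: a graph bisection stands in for stage (ii), and personalized PageRank mass approximates where walks started on each side end up. The actual RWC measure conditions on reaching high-degree nodes and differs in detail; this is a simplification.

```python
# Random-walk controversy (RWC) style score, simplified: partition the
# conversation graph into two sides, then compare the probability that a
# walk from each side stays on its own side versus crossing over.
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

def rwc_score(G, restart=0.15):
    side_x, side_y = kernighan_lin_bisection(G)  # stage (ii): two sides
    def mass(seed_side, target_side):
        # Personalized PageRank seeded uniformly on seed_side; the mass on
        # target_side proxies P(walk from seed side ends on target side).
        pers = {v: (1.0 / len(seed_side) if v in seed_side else 0.0) for v in G}
        pr = nx.pagerank(G, alpha=1 - restart, personalization=pers)
        return sum(pr[v] for v in target_side)
    p_xx, p_xy = mass(side_x, side_x), mass(side_x, side_y)
    p_yy, p_yx = mass(side_y, side_y), mass(side_y, side_x)
    return p_xx * p_yy - p_xy * p_yx  # near 1 => highly controversial

G = nx.barbell_graph(10, 0)          # two dense cliques joined by one edge
print(rwc_score(G))
```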

6. Understanding and Identifying Advocates for Political Campaigns on Social Media.

【Paper Link】 【Pages】:43-52

【Authors】: Suhas Ranganath ; Xia Hu ; Jiliang Tang ; Huan Liu

【Abstract】: Social media is increasingly being used to access and disseminate information on sociopolitical issues like gun rights and general elections. The popularity and openness of social media makes it conducive for some individuals, known as advocates, who use social media to push their agendas on these issues strategically. Identifying these advocates will caution social media users before reading their information and also enable campaign managers to identify advocates for their digital political campaigns. A significant challenge in identifying advocates is that they employ nuanced strategies to shape user opinion and increase the spread of their messages, making it difficult to distinguish them from random users posting on the campaign. In this paper, we draw from social movement theories and design a quantitative framework to study the nuanced message strategies, propagation strategies, and community structure adopted by advocates for political campaigns in social media. Based on observations of their social media activities manifesting from these strategies, we investigate how to model these strategies for identifying them. We evaluate the framework using two datasets from Twitter, and our experiments demonstrate its effectiveness in identifying advocates for political campaigns with ramifications of this work directed towards assisting users as they navigate through social media spaces.

【Keywords】: advocacy; political campaigns; user interactions

7. Exploiting New Sentiment-Based Meta-level Features for Effective Sentiment Analysis.

【Paper Link】 【Pages】:53-62

【Authors】: Sérgio D. Canuto ; Marcos André Gonçalves ; Fabrício Benevenuto

【Abstract】: In this paper we address the problem of automatically learning to classify the sentiment of short messages/reviews by exploiting information derived from meta-level features, i.e., features derived primarily from the original bag-of-words representation. We propose new meta-level features especially designed for the sentiment analysis of short messages such as: (i) information derived from the sentiment distribution among the k nearest neighbors of a given short test document x, (ii) the distribution of distances of x to its neighbors, and (iii) the document polarity of these neighbors given by unsupervised lexical-based methods. Our approach is also capable of exploiting information from the neighborhood of document x regarding (highly noisy) data obtained from 1.6 million Twitter messages with emoticons. The set of proposed features is capable of transforming the original feature space into a new one, potentially smaller and more informed. Experiments performed with a substantial number of datasets (nineteen) demonstrate that the effectiveness of the proposed sentiment-based meta-level features is not only superior to the traditional bag-of-words representation (by up to 16%) but is also superior in most cases to state-of-the-art meta-level features previously proposed in the literature for text classification tasks that do not take into account some idiosyncrasies of sentiment analysis. Our proposal is also largely superior to the best lexicon-based methods as well as to supervised combinations of them. In fact, the proposed approach is the only one to produce the best results in all tested datasets in all scenarios.

【Keywords】: meta features; sentiment analysis
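
A small sketch of neighborhood-derived meta-level features in the spirit of items (i) and (ii) above, using scikit-learn. The exact feature set and distance handling in the paper are richer; everything here is an illustrative simplification.

```python
# Meta-level features: describe each document by the label distribution and
# distance statistics of its k nearest neighbors, instead of raw bag-of-words.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_meta_features(X_train, y_train, X, k=5, n_classes=2):
    y_train = np.asarray(y_train)
    nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(X_train)
    dist, idx = nn.kneighbors(X)
    feats = []
    for d, i in zip(dist, idx):
        label_hist = np.bincount(y_train[i], minlength=n_classes) / k
        feats.append(np.concatenate([
            label_hist,            # (i) neighbor sentiment distribution
            [d.mean(), d.std()],   # (ii) summary of the distance distribution
        ]))
    return np.array(feats)   # feed these to any downstream classifier
```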

8. Mobile App Tagging.

【Paper Link】 【Pages】:63-72

【Authors】: Ning Chen ; Steven C. H. Hoi ; Shaohua Li ; Xiaokui Xiao

【Abstract】: Mobile app tagging aims to assign a list of keywords indicating core functionalities, main contents, key features or concepts of a mobile app. Mobile app tags can be potentially useful for app ecosystem stakeholders or other parties to improve app search, browsing, categorization, and advertising, etc. However, most mainstream app markets, e.g., Google Play, Apple App Store, etc., currently do not explicitly support such tags for apps. To address this problem, we propose a novel auto mobile app tagging framework for annotating a given mobile app automatically, which is based on a search-based annotation paradigm powered by machine learning techniques. Specifically, given a novel query app without tags, our proposed framework (i) first explores online kernel learning techniques to retrieve a set of top-N similar apps that are semantically most similar to the query app from a large app repository; and (ii) then mines the text data of both the query app and the top-N similar apps to discover the most relevant tags for annotating the query app. To evaluate the efficacy of our proposed framework, we conduct an extensive set of experiments on a large real-world dataset crawled from Google Play. The encouraging results demonstrate that our technique is effective and promising.

【Keywords】: app tagging; mobile app markets; online kernel learning

9. CCCF: Improving Collaborative Filtering via Scalable User-Item Co-Clustering.

【Paper Link】 【Pages】:73-82

【Authors】: Yao Wu ; Xudong Liu ; Min Xie ; Martin Ester ; Qing Yang

【Abstract】: Collaborative Filtering (CF) is the most popular method for recommender systems. The principal idea of CF is that users might be interested in items that are favorited by similar users, and most of the existing CF methods measure users' preferences by their behaviours over all the items. However, users might have different interests over different topics, thus might share similar preferences with different groups of users over different sets of items. In this paper, we propose a novel and scalable method CCCF which improves the performance of CF methods via user-item co-clustering. CCCF first clusters users and items into several subgroups, where each subgroup includes a set of like-minded users and a set of items in which these users share their interests. Then, traditional CF methods can be easily applied to each subgroup, and the recommendation results from all the subgroups can be easily aggregated. Compared with previous works, CCCF has several advantages including scalability, flexibility, interpretability and extensibility. Experimental results on four real world data sets demonstrate that the proposed method significantly improves the performance of several state-of-the-art recommendation algorithms.

【Keywords】: co-clustering; collaborative filtering; recommender systems
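
A rough sketch of the subgroup-then-aggregate recipe under simplifying assumptions: NMF factor loadings stand in for the user-item co-clustering, and a trivial popularity scorer stands in for the per-subgroup CF method; per-subgroup scores are then summed. The paper's actual co-clustering and aggregation are more sophisticated.

```python
# CCCF-style pipeline, simplified: overlapping user-item subgroups from NMF,
# a toy CF scorer inside each subgroup, and additive aggregation.
import numpy as np
from sklearn.decomposition import NMF

def cccf_scores(R, n_subgroups=4, top=0.3):
    model = NMF(n_components=n_subgroups, init="nndsvda", max_iter=500)
    U = model.fit_transform(R)            # user memberships per subgroup
    V = model.components_.T               # item memberships per subgroup
    scores = np.zeros_like(R, dtype=float)
    for g in range(n_subgroups):
        users = U[:, g] >= np.quantile(U[:, g], 1 - top)
        items = V[:, g] >= np.quantile(V[:, g], 1 - top)
        sub = R[np.ix_(users, items)]
        # Toy in-subgroup CF: item popularity among the subgroup's
        # like-minded users, weighted by each user's membership strength.
        item_pop = sub.sum(axis=0)
        scores[np.ix_(users, items)] += np.outer(U[users, g], item_pop)
    return scores  # rank each user's unseen items by these aggregated scores

R = np.random.rand(50, 40) * (np.random.rand(50, 40) < 0.2)  # sparse toy data
print(cccf_scores(R).shape)
```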

10. On the Efficiency of the Information Networks in Social Media.

【Paper Link】 【Pages】:83-92

【Authors】: Mahmoudreza Babaei ; Przemyslaw A. Grabowicz ; Isabel Valera ; Krishna P. Gummadi ; Manuel Gomez-Rodriguez

【Abstract】: Social media sites are information marketplaces, where users produce and consume a wide variety of information and ideas. In these sites, users typically choose their information sources, which in turn determine what specific information they receive, how much information they receive and how quickly this information is shown to them. In this context, a natural question that arises is how efficient social media users are at selecting their information sources. In this work, we propose a computational framework to quantify users' efficiency at selecting information sources. Our framework is based on the assumption that the goal of users is to acquire a set of unique pieces of information. To quantify a user's efficiency, we ask if the user could have acquired the same pieces of information from another set of sources more efficiently. We define three different notions of efficiency -- link, in-flow, and delay -- corresponding to the number of sources the user follows, the amount of (redundant) information she acquires and the delay with which she receives the information. Our definitions of efficiency are general and applicable to any social media system with an underlying information network, in which every user follows others to receive the information they produce. In our experiments, we measure the efficiency of Twitter users at acquiring different types of information. We find that Twitter users exhibit sub-optimal efficiency across the three notions of efficiency, although they tend to be more efficient at acquiring non-popular pieces of information than they are at acquiring popular pieces of information. We then show that this lack of efficiency is a consequence of the triadic closure mechanism by which users typically discover and follow other users in social media. Thus, our study reveals a tradeoff between the efficiency and discoverability of information sources. Finally, we develop a heuristic algorithm that enables users to be significantly more efficient at acquiring the same unique pieces of information.

【Keywords】: cover set; efficiency; information; information network; lossless; optimization; rewiring algorithm; social media
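
The link-efficiency notion above has a natural set-cover reading: could fewer sources have delivered the same unique pieces of information? The sketch below uses the standard greedy set-cover heuristic as an illustration; the paper's own rewiring algorithm also accounts for redundancy and delay.

```python
# Greedy set cover over information sources: pick, at each step, the source
# that delivers the most not-yet-covered pieces of information.
def greedy_source_cover(sources):
    """sources: {source_id: set of information pieces it would deliver}."""
    needed = set().union(*sources.values())
    chosen = []
    while needed:
        best = max(sources, key=lambda s: len(sources[s] & needed))
        if not sources[best] & needed:
            break
        chosen.append(best)
        needed -= sources[best]
    return chosen

follows = {"a": {1, 2, 3}, "b": {2, 3}, "c": {3, 4}, "d": {4, 5, 6}}
print(greedy_source_cover(follows))  # ['a', 'd'] covers everything
```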

11. Modeling and Predicting Learning Behavior in MOOCs.

【Paper Link】 【Pages】:93-102

【Authors】: Jiezhong Qiu ; Jie Tang ; Tracy Xiao Liu ; Jie Gong ; Chenhui Zhang ; Qian Zhang ; Yufei Xue

【Abstract】: Massive Open Online Courses (MOOCs), which collect complete records of all student interactions in an online learning environment, offer us an unprecedented opportunity to analyze students' learning behavior at a much finer granularity than ever before. Using a dataset from xuetangX, one of the largest MOOC platforms in China, we analyze key factors that influence students' engagement in MOOCs and study to what extent we could infer a student's learning effectiveness. We observe significant behavioral heterogeneity in students' course selection as well as their learning patterns. For example, students who exert higher effort and ask more questions are not necessarily more likely to get certificates. Additionally, the probability that a student obtains the course certificate increases dramatically (3x higher) when she has one or more "certificate friends". Moreover, we develop a unified model to predict students' learning effectiveness, by incorporating user demographics, forum activities, and learning behavior. We demonstrate that the proposed model significantly outperforms (+2.03-9.03% by F1-score) several alternative methods in predicting students' performance on assignments and course certificates. The model is flexible and can be applied to various settings. For example, we are deploying a new feature into xuetangX to help teachers dynamically optimize the teaching process.

【Keywords】: moocs; online engagement; predictive model; user behavior

12. Beyond Ranking: Optimizing Whole-Page Presentation.

【Paper Link】 【Pages】:103-112

【Authors】: Yue Wang ; Dawei Yin ; Luo Jie ; Pengyuan Wang ; Makoto Yamada ; Yi Chang ; Qiaozhu Mei

【Abstract】: Modern search engines aggregate results from different verticals: webpages, news, images, video, shopping, knowledge cards, local maps, etc. Unlike "ten blue links", these search results are heterogeneous in nature and not even arranged in a list on the page. This revolution directly challenges the conventional "ranked list" formulation in ad hoc search. Therefore, finding a proper presentation for a gallery of heterogeneous results is critical for modern search engines. We propose a novel framework that learns the optimal page presentation to render heterogeneous results onto the search result page (SERP). Page presentation is broadly defined as the strategy to present a set of items on the SERP, much more expressive than a ranked list. It can specify item positions, image sizes, text fonts, and any other styles as long as variations are within business and design constraints. The learned presentation is content-aware, i.e. tailored to specific queries and returned results. Simulation experiments show that the framework automatically learns eye-catchy presentations for relevant results. Experiments on real data show that simple instantiations of the framework already outperform a leading algorithm in federated search result presentation. This means the framework can learn its own result presentation strategy purely from data, without even knowing the "probability ranking principle".

【Keywords】: federated search; user feedback; whole-page presentation

13. Understanding User Attention and Engagement in Online News Reading.

【Paper Link】 【Pages】:113-122

【Authors】: Dmitry Lagun ; Mounia Lalmas

【Abstract】: Prior work on user engagement with online media identified web page dwell time as a key metric reflecting the level of user engagement with online news articles. While, on average, dwell time gives a reasonable estimate of user experience with a news article, it cannot capture important aspects of user interaction with the page, such as how much time a user spends reading the article vs. viewing the comments posted by other users, or the actual proportion of the article read by the user. In this paper, we propose a set of user engagement classes along with new user engagement metrics that, unlike dwell time, more accurately reflect user experience with the content. Our user engagement classes provide a clear and interpretable taxonomy of user engagement with online news, and are defined based on the amount of time a user spends on the page, the proportion of the article the user actually reads, and the amount of interaction the user performs with the comments. Moreover, we demonstrate that our metrics are relatively easier to predict from the news article content than dwell time, making optimization of user engagement a more attainable goal.

【Keywords】: attention; engagement; large scale; news reading; topic modeling; user modeling; viewport

14. Publication Date Prediction through Reverse Engineering of the Web.

【Paper Link】 【Pages】:123-132

【Authors】: Liudmila Ostroumova Prokhorenkova ; Petr Prokhorenkov ; Egor Samosvat ; Pavel Serdyukov

【Abstract】: In this paper, we focus on one of the most challenging tasks in temporal information retrieval: detection of a web page's publication date. The natural approach to this problem is to find the publication date in the HTML body of a page. However, there are two fundamental problems with this approach. First, not all web pages contain the publication dates in their texts. Second, it is hard to distinguish the publication date among all the dates found in the page's text. The approach we suggest in this paper supplements methods of date extraction from the page's text with novel link-based methods of dating. Some of our link-based methods are based on a probabilistic model of the Web graph structure evolution, which relies on the publication dates of web pages as its parameters. We use this model to estimate the publication dates of web pages: based only on the link structure currently observed, we perform a "reverse engineering" to reveal the whole process of the Web's evolution.

【Keywords】: likelihood optimization; link-based method; publication dates; web pages

【Paper Link】 【Pages】:133-142

【Authors】: Makoto P. Kato ; Katsumi Tanaka

【Abstract】: We propose a method of optimizing search result presentation for queries with diverse intents, by selectively presenting query suggestions that lead users to more relevant search results. The optimization is based on a probabilistic model of users who click on query suggestions in accordance with their intents, and modified versions of intent-aware evaluation metrics that take into account the co-occurrence between intents. Showing many query suggestions simply increases the chance of satisfying users with diverse intents in this model, while in fact it requires users to spend additional time scanning and selecting suggestions, and may result in low satisfaction for some users. Therefore, we measured the loss of time caused by query suggestion presentation by conducting a user study in different settings, and included its negative effects in our optimization problem. Our experiments revealed that the optimized search result presentation significantly improved over a single ranked list, and was beneficial especially for patient users. Moreover, experimental results showed that our optimization was effective particularly when the intents of a query often co-occur with a small subset of intents.

【Keywords】: optimization; query suggestion; search result diversification

16. Term-by-Term Query Auto-Completion for Mobile Search.

【Paper Link】 【Pages】:143-152

【Authors】: Saúl Vargas ; Roi Blanco ; Peter Mika

【Abstract】: With the ever increasing usage of mobile search, where text input is typically slow and error-prone, assisting users to formulate their queries contributes to a more satisfactory search experience. Query auto-completion (QAC) techniques, which predict possible completions for user queries, are the archetypal example of query assistance and are present in most search engines. We argue, however, that classic QAC, which operates by suggesting whole-query completions, may be sub-optimal for the case of mobile search as the available screen real estate to show suggestions is limited and editing is typically slower than in desktop search. In this paper we propose the idea of term-by-term QAC, which is a new technique inspired by predictive keyboards that suggests to the user one term at a time, instead of whole-query completions. We describe an efficient mechanism to implement this technique and an adaptation of a prior user model to evaluate the effectiveness of both standard and term-by-term QAC approaches using query log data. Our experiments with a mobile query log from a commercial search engine show the validity of our approach according to this user model with respect to saved characters, saved terms and examination effort. Finally, a user study provides further insights about our term-by-term technique compared with standard QAC with respect to the variables analyzed in the query log-based evaluation and additional variables related to the successfulness, the speed of the interactions and the properties of the submitted queries.

【Keywords】: query auto completion; query logs; user models; word prediction
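
A toy illustration of term-by-term suggestion, assuming next-term candidates come from a bigram model over the query log; the paper's mechanism is more efficient and its user model far more detailed.

```python
# Term-by-term QAC sketch: suggest completions for the term being typed,
# conditioned on the previous completed term via query-log bigram counts.
from collections import Counter, defaultdict

class NextTermSuggester:
    def __init__(self, query_log):
        self.next_counts = defaultdict(Counter)
        for query in query_log:
            terms = query.split()
            for prev, nxt in zip(terms, terms[1:]):
                self.next_counts[prev][nxt] += 1

    def suggest(self, prefix, n=3):
        terms = prefix.split()
        last_done = terms[-2] if len(terms) >= 2 else None
        partial = terms[-1] if terms else ""
        pool = self.next_counts.get(last_done, Counter()) if last_done else Counter()
        cands = [t for t, _ in pool.most_common() if t.startswith(partial)]
        return cands[:n]

log = ["new york weather", "new york times", "new york pizza", "new jersey weather"]
s = NextTermSuggester(log)
print(s.suggest("new york w"))  # -> ['weather']
```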

17. Collaborative Denoising Auto-Encoders for Top-N Recommender Systems.

【Paper Link】 【Pages】:153-162

【Authors】: Yao Wu ; Christopher DuBois ; Alice X. Zheng ; Martin Ester

【Abstract】: Most real-world recommender services measure their performance based on the top-N results shown to the end users. Thus, advances in top-N recommendation have far-ranging consequences in practical applications. In this paper, we present a novel method, called Collaborative Denoising Auto-Encoder (CDAE), for top-N recommendation that utilizes the idea of Denoising Auto-Encoders. We demonstrate that the proposed model is a generalization of several well-known collaborative filtering models but with more flexible components. Thorough experiments are conducted to understand the performance of CDAE under various component settings. Furthermore, experimental results on several public datasets demonstrate that CDAE consistently outperforms state-of-the-art top-N recommendation methods on a variety of common evaluation metrics.

【Keywords】: collaborative filtering; denoising auto-encoders; recommender systems
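
A compact PyTorch sketch of the CDAE idea as described: corrupt a user's item-feedback vector with dropout, encode it together with a per-user embedding, and reconstruct the full vector. The hyperparameters and the loss choice are illustrative assumptions.

```python
# Collaborative Denoising Auto-Encoder (CDAE), minimal sketch.
import torch
import torch.nn as nn

class CDAE(nn.Module):
    def __init__(self, n_users, n_items, n_hidden=64, corruption=0.2):
        super().__init__()
        self.drop = nn.Dropout(corruption)               # the "denoising" corruption
        self.encode = nn.Linear(n_items, n_hidden)
        self.user_emb = nn.Embedding(n_users, n_hidden)  # per-user input node
        self.decode = nn.Linear(n_hidden, n_items)

    def forward(self, user_ids, item_vecs):
        h = torch.sigmoid(self.encode(self.drop(item_vecs))
                          + self.user_emb(user_ids))
        return self.decode(h)   # item scores; rank unobserved items by these

model = CDAE(n_users=100, n_items=500)
users = torch.arange(8)
x = (torch.rand(8, 500) < 0.05).float()     # toy implicit-feedback rows
loss = nn.BCEWithLogitsLoss()(model(users, x), x)
loss.backward()
```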

18. Personalized PageRank Estimation and Search: A Bidirectional Approach.

【Paper Link】 【Pages】:163-172

【Authors】: Peter Lofgren ; Siddhartha Banerjee ; Ashish Goel

【Abstract】: We present new algorithms for Personalized PageRank estimation and Personalized PageRank search. First, for the problem of estimating Personalized PageRank (PPR) from a source distribution to a target node, we present a new bidirectional estimator with simple yet strong guarantees on correctness and performance, and 3x to 8x speedup over existing estimators in experiments on a diverse set of networks. Moreover, it has a clean algebraic structure which enables it to be used as a primitive for the Personalized PageRank Search problem: Given a network like Facebook, a query like "people named John," and a searching user, return the top nodes in the network ranked by PPR from the perspective of the searching user. Previous solutions either score all nodes or score candidate nodes one at a time, which is prohibitively slow for large candidate sets. We develop a new algorithm based on our bidirectional PPR estimator which identifies the most relevant results by sampling candidates based on their PPR; this is the first solution to PPR search that can find the best results without iterating through the set of all candidate results. Finally, by combining PPR sampling with sequential PPR estimation and Monte Carlo, we develop practical algorithms for PPR search, and we show via experiments that our algorithms are efficient on networks with billions of edges.

【Keywords】: personalized pagerank; personalized search; social network analysis
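
A sketch of the bidirectional estimator's two halves, assuming the standard reverse-push/forward-walk decomposition: a reverse local push from the target t yields estimates p and residuals r satisfying pi(s,t) = p[s] + sum_v pi(s,v) * r[v], and the residual term is estimated by random walks from s. The tolerance and walk count are illustrative.

```python
# Bidirectional PPR estimate of pi(s, t) on a directed graph given as
# adjacency dicts (out_nbrs, in_nbrs mapping node -> list of neighbors).
import random
from collections import defaultdict

def bippr(out_nbrs, in_nbrs, s, t, alpha=0.15, r_max=1e-4, n_walks=10000):
    p, r = defaultdict(float), defaultdict(float)
    r[t] = 1.0
    frontier = [t]
    while frontier:                        # reverse push from the target
        v = frontier.pop()
        if r[v] <= r_max:
            continue
        p[v] += alpha * r[v]
        for u in in_nbrs[v]:
            r[u] += (1 - alpha) * r[v] / len(out_nbrs[u])
            if r[u] > r_max:
                frontier.append(u)
        r[v] = 0.0
    total = 0.0                            # forward Monte Carlo from the source
    for _ in range(n_walks):
        v = s
        while random.random() > alpha and out_nbrs[v]:
            v = random.choice(out_nbrs[v])
        total += r[v]                      # residual at the walk's endpoint
    return p[s] + total / n_walks

g_out = {0: [1], 1: [2, 0], 2: [0]}
g_in = {0: [1, 2], 1: [0], 2: [1]}
print(bippr(g_out, g_in, s=0, t=2))
```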

19. Your Cart tells You: Inferring Demographic Attributes from Purchase Data.

【Paper Link】 【Pages】:173-182

【Authors】: Pengfei Wang ; Jiafeng Guo ; Yanyan Lan ; Jun Xu ; Xueqi Cheng

【Abstract】: Demographic attributes play an important role in the retail market to characterize different types of users. Such signals, however, are often only available for a small fraction of users in practice, due to the difficulty of the manual collection process by retailers. In this paper, we aim to harness the power of big data to automatically infer users' demographic attributes based on their purchase data. Typically, demographic prediction can be formalized as a multi-task multi-class prediction problem, i.e., multiple demographic attributes (e.g., gender, age and income) are to be inferred for each user, where each attribute may belong to one of N possible classes (N ≥ 2). Most previous work on this problem explores different types of features and usually predicts different attributes independently. However, modeling the tasks separately may lose the ability to leverage the correlations among different attributes. Meanwhile, manually defined features require professional knowledge and often suffer from under-specification. To address these problems, we propose a novel Structured Neural Embedding (SNE) model to automatically learn the representations from users' purchase data for predicting multiple demographic attributes simultaneously. Experiments are conducted on a real-world retail dataset where five attributes (gender, marital status, income, age, and education level) are to be predicted. The empirical results show that our SNE model can improve the performance significantly compared with state-of-the-art baselines.

【Keywords】: demographic attribute; multitask multi-class prediction; structured neural embedding
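
A minimal multi-task sketch in PyTorch of the shared-representation idea: one embedding of the purchase history feeds a separate softmax head per demographic attribute, so the tasks share statistical strength. The paper's structured embedding of purchase data is richer than this; the dimensions and class counts below are assumptions.

```python
# Multi-task demographic prediction from purchase histories, sketch.
import torch
import torch.nn as nn

class MultiTaskDemographics(nn.Module):
    def __init__(self, n_products, n_classes_per_task, dim=64):
        super().__init__()
        self.item_emb = nn.EmbeddingBag(n_products, dim, mode="mean")
        self.heads = nn.ModuleList(nn.Linear(dim, c) for c in n_classes_per_task)

    def forward(self, product_ids, offsets):
        h = self.item_emb(product_ids, offsets)   # one vector per user
        return [head(h) for head in self.heads]   # logits per attribute

# Five tasks: gender(2), marital status(2), income(5), age(5), education(4).
model = MultiTaskDemographics(10000, [2, 2, 5, 5, 4])
ids = torch.tensor([3, 17, 256, 9, 9])      # flattened purchase histories
offs = torch.tensor([0, 3])                 # two users: items [0:3) and [3:)
logits = model(ids, offs)
loss = sum(nn.CrossEntropyLoss()(l, torch.zeros(2, dtype=torch.long))
           for l in logits)                 # toy targets: class 0 everywhere
loss.backward()
```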

【Paper Link】 【Pages】:183-192

【Authors】: Jiepu Jiang ; James Allan

【Abstract】: Search engines provide result summaries to help users quickly identify whether or not it is worthwhile to click on a result and read in detail. However, users may visit non-relevant results and/or skip relevant ones. These actions are usually harmful to the user experience, but few approaches have considered this problem in search result ranking. This paper optimizes the relevance of results and users' click and skip activities at the same time. Comparing two equally relevant results, our approach learns to rank the one that users are more likely to click on at a higher position. Similarly, it demotes non-relevant web pages with high click probabilities. Experimental results show this approach reduces about 10%-20% of the click and skip errors with a trade-off of a 2.1% decline in nDCG@10.

【Keywords】: click; interactive search; search result ranking; web search

21. Hierarchical Semi-supervised Classification with Incomplete Class Hierarchies.

【Paper Link】 【Pages】:193-202

【Authors】: Bhavana Dalvi ; Aditya Kumar Mishra ; William W. Cohen

【Abstract】: In an entity classification task, topic or concept hierarchies are often incomplete. Previous work by Dalvi et al. [12] showed that in non-hierarchical semi-supervised classification tasks, the presence of such unanticipated classes can cause semantic drift for seeded classes. The Exploratory learning [12] method was proposed to solve this problem; however, it is limited to the flat classification task. This paper builds such exploratory learning methods for hierarchical classification tasks. We experimented with subsets of the NELL [8] ontology and text, and HTML table datasets derived from the ClueWeb09 corpus. Our method (OptDAC-ExploreEM) outperforms the existing Exploratory EM method, and its naive extension (DAC-ExploreEM), in terms of seed class F1 on average by 10% and 7% respectively.

【Keywords】: concept discovery; hierarchical classification; ontology extension; semi-supervised learning

Practice & Experience Track 1

【Paper Link】 【Pages】:203

【Authors】: Yoelle Maarek

【Abstract】: The nature of Web mail traffic has significantly evolved in the last two decades, and consequently the behavior of Web mail users has also changed. For instance, a recent study conducted by Yahoo Labs showed that today 90% of Web mail traffic is machine-generated. This partly explains why email traffic continues to grow even if a significant amount of personal communications has moved towards social media. Most users today are receiving in their inbox important invoices, receipts, and travel itineraries, together with non-malicious junk mail such as hotel newsletters or shopping promotions that they could safely ignore. This is one of the reasons that a majority of messages remain unread, and many are deleted without being read. In that sense, Web mail has become quite similar to traditional snail mail. In spite of this drastic change in nature, many mail features remain unchanged. While 70% of mail users do not define even a single folder, folders are still predominant in the left rail of many Web mail clients. Mail search results are still mostly ranked by date, which makes retrieving older messages extremely challenging. This is even more painful to users since, unlike in Web search, they will know when a relevant previously read message has not been returned. In this talk, I present the results of multiple large-scale studies that have been conducted at Yahoo Labs in the last few years. I highlight the inherent challenges associated with such studies, especially around privacy concerns. I will discuss the new nature of consumer Web mail, which is dominated by machine-generated messages of highly heterogeneous forms and value. I will show how this change has not yet been fully recognized by most email clients. As an example, why should there still be a reply option associated with a message coming from a "do-not-reply@" address? I will introduce some approaches for large-scale mail mining specifically tailored to machine-generated email. I will conclude by discussing possible applications and research directions.

【Keywords】: machine-generated email; mail search; web mail

Observing Users 2

23. Portrait of an Online Shopper: Understanding and Predicting Consumer Behavior.

【Paper Link】 【Pages】:205-214

【Authors】: Farshad Kooti ; Kristina Lerman ; Luca Maria Aiello ; Mihajlo Grbovic ; Nemanja Djuric ; Vladan Radosavljevic

【Abstract】: Consumer spending accounts for a large fraction of the economic footprint of modern countries. Increasingly, consumer activity is moving to the web, where digital receipts of online purchases provide valuable data sources detailing consumer behavior. We consider such data extracted from emails and combined with consumers' demographic information, which we use to characterize, model, and predict purchasing behavior. We analyze such behavior of consumers in different age and gender groups, and find interesting, actionable patterns that can be used to improve ad targeting systems. For example, we found that the amount of money spent on online purchases grows sharply with age, peaking in the late 30s, while shoppers from wealthy areas tend to purchase more expensive items and buy them more frequently. Furthermore, we look at the influence of social connections on purchasing habits, as well as at the temporal dynamics of online shopping, where we discovered daily and weekly behavioral patterns. Finally, we build a model to predict when shoppers are most likely to make a purchase and how much they will spend, showing improvement over baseline approaches. The presented results paint a clear picture of a modern online shopper, and allow better understanding of consumer behavior that can help improve marketing efforts and make shopping a more pleasant and efficient experience for online customers.

【Keywords】: demographics; online shopping; prediction; shopping

24. Evolution of Privacy Loss in Wikipedia.

【Paper Link】 【Pages】:215-224

【Authors】: Marian-Andrei Rizoiu ; Lexing Xie ; Tiberio Caetano ; Manuel Cebrián

【Abstract】: The cumulative effect of collective online participation has an important and adverse impact on individual privacy. As an online system evolves over time, new digital traces of individual behavior may uncover previously hidden statistical links between an individual's past actions and her private traits. To quantify this effect, we analyze the evolution of individual privacy loss by studying the edit history of Wikipedia over 13 years, including more than 117,523 different users performing 188,805,088 edits. We trace each Wikipedia contributor using apparently harmless features, such as the number of edits performed on predefined broad categories in a given time period (e.g. Mathematics, Culture or Nature). We show that even at this unspecific level of behavior description, it is possible to use off-the-shelf machine learning algorithms to uncover usually undisclosed personal traits, such as gender, religion or education. We provide empirical evidence that the prediction accuracy for almost all private traits consistently improves over time. Surprisingly, the prediction performance for users who stopped editing after a given time still improves. The activities performed by new users seem to have contributed more to this effect than additional activities from existing (but still active) users. Insights from this work should help users, system designers, and policy makers understand and make long-term design choices in online content creation systems.

【Keywords】: de-anonymization; online privacy; temporal loss of privacy

Keynote Address 1

25. Keynote Speaker Bio.

【Paper Link】 【Pages】:225

【Authors】: Yiling Chen

【Abstract】:

【Keywords】:

Leveraging Users 14

26. Modeling Intransitivity in Matchup and Comparison Data.

【Paper Link】 【Pages】:227-236

【Authors】: Shuo Chen ; Thorsten Joachims

【Abstract】: We present a method for learning potentially intransitive preference relations from pairwise comparison and matchup data. Unlike standard preference-learning models that represent the properties of each item/player as a single number, our method infers a multi-dimensional representation for the different aspects of each item/player's strength. We show that our model can represent any pairwise stochastic preference relation and provide a comprehensive evaluation of its predictive performance on a wide range of pairwise comparison tasks and matchup problems, from online video games and sports to peer grading and elections. We find that several of these tasks -- especially matchups in online video games -- show substantial intransitivity that our method can model effectively.

【Keywords】: games; matchup; pairwise comparison; ranking; representation learning; sports
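
A sketch of a multi-dimensional, intransitivity-capable matchup model consistent with the abstract, using an antisymmetric two-vector ("blade/chest" style) inner-product form; the paper's exact parameterization may differ.

```python
# Intransitive matchup model: the match function is antisymmetric in its two
# players, so cycles like a > b > c > a are representable.
import torch
import torch.nn as nn

class IntransitiveMatchup(nn.Module):
    def __init__(self, n_players, dim=8):
        super().__init__()
        self.blade = nn.Embedding(n_players, dim)   # attacking aspects
        self.chest = nn.Embedding(n_players, dim)   # vulnerable aspects

    def forward(self, a, b):
        # M(a,b) = blade_a . chest_b - blade_b . chest_a  (antisymmetric)
        m = (self.blade(a) * self.chest(b)).sum(-1) \
            - (self.blade(b) * self.chest(a)).sum(-1)
        return torch.sigmoid(m)                      # P(a beats b)

model = IntransitiveMatchup(n_players=3)
a, b = torch.tensor([0]), torch.tensor([1])
loss = nn.BCELoss()(model(a, b), torch.ones(1))     # observed: 0 beat 1
loss.backward()
```

A single scalar rating (e.g. Elo) forces transitivity; the antisymmetric bilinear term above is what allows rock-paper-scissors style relations.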

27. Crowdsourcing High Quality Labels with a Tight Budget.

【Paper Link】 【Pages】:237-246

【Authors】: Qi Li ; Fenglong Ma ; Jing Gao ; Lu Su ; Christopher J. Quinn

【Abstract】: In the past decade, commercial crowdsourcing platforms have revolutionized the ways of classifying and annotating data, especially for large datasets. Obtaining labels for a single instance can be inexpensive, but for large datasets, it is important to allocate budgets wisely. With limited budgets, requesters must trade-off between the quantity of labeled instances and the quality of the final results. Existing budget allocation methods can achieve good quantity but cannot guarantee high quality of individual instances under a tight budget. However, in some scenarios, requesters may be willing to label fewer instances but of higher quality. Moreover, they may have different requirements on quality for different tasks. To address these challenges, we propose a flexible budget allocation framework called Requallo. Requallo allows requesters to set their specific requirements on the labeling quality and maximizes the number of labeled instances that achieve the quality requirement under a tight budget. The budget allocation problem is modeled as a Markov decision process and a sequential labeling policy is produced. The proposed policy greedily searches for the instance to query next as the one that can provide the maximum reward for the goal. The Requallo framework is further extended to consider worker reliability so that the budget can be better allocated. Experiments on two real-world crowdsourcing tasks as well as a simulated task demonstrate that when the budget is tight, the proposed Requallo framework outperforms existing state-of-the-art budget allocation methods from both quantity and quality aspects.

【Keywords】: budget allocation; crowdsourcing

28. Project Success Prediction in Crowdfunding Environments.

【Paper Link】 【Pages】:247-256

【Authors】: Yan Li ; Vineeth Rakesh ; Chandan K. Reddy

【Abstract】: Crowdfunding has gained widespread attention in recent years. Despite the huge success of crowdfunding platforms, the percentage of projects that succeed in achieving their desired goal amount is only around 40%. Moreover, many of these crowdfunding platforms follow an "all-or-nothing" policy, which means the pledged amount is collected only if the goal is reached within a certain predefined time duration. Hence, estimating the probability of success for a project is one of the most important research challenges in the crowdfunding domain. To predict project success, there is a need for new prediction models that can potentially combine the power of both classification (which incorporates both successful and failed projects) and regression (for estimating the time to success). In this paper, we formulate project success prediction as a survival analysis problem and apply the censored regression approach, where one can perform regression in the presence of partial information. We rigorously study the project success time distribution of crowdfunding data and show that the logistic and log-logistic distributions are a natural choice for learning from such data. We investigate various censored regression models using comprehensive data of 18K Kickstarter (a popular crowdfunding platform) projects and 116K corresponding tweets collected from Twitter. We show that the models that take complete advantage of both the successful and failed projects during the training phase perform significantly better at predicting the success of future projects, compared to the ones that only use the successful projects. We provide a rigorous evaluation on many sets of relevant features and show that adding a few temporal features obtained at the project's early stages can dramatically improve the performance.

【Keywords】: crowdfunding; prediction; project success; regression; survival analysis
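
To make the censored-regression framing concrete, the sketch below fits a log-logistic accelerated-failure-time model with the lifelines library: reaching the goal is the event, and projects that never reach it are censored at the deadline. The column names and toy rows are invented for illustration, not from the paper's dataset.

```python
# Project success prediction as censored (survival) regression, sketch.
import pandas as pd
from lifelines import LogLogisticAFTFitter

df = pd.DataFrame({
    "days_to_success": [3, 12, 30, 30, 8, 30],  # censored rows record the deadline
    "succeeded":       [1, 1, 0, 0, 1, 0],      # 0 = censored (goal never reached)
    "goal_log":        [6.2, 7.1, 9.3, 8.8, 5.9, 9.9],
    "early_tweets":    [40, 12, 2, 5, 55, 1],   # an early-stage temporal feature
})
aft = LogLogisticAFTFitter().fit(df, duration_col="days_to_success",
                                 event_col="succeeded")
print(aft.predict_median(df))   # predicted time-to-success per project
```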

29. Probabilistic Group Recommendation Model for Crowdfunding Domains.

【Paper Link】 【Pages】:257-266

【Authors】: Vineeth Rakesh ; Wang-Chien Lee ; Chandan K. Reddy

【Abstract】: Crowdfunding has gained widespread popularity by fueling the creative minds of entrepreneurs. Not only has it democratized the funding of startups, it has also bridged the gap between venture capitalists and entrepreneurs by providing a plethora of opportunities for people seeking to invest in new business ventures. Nonetheless, despite the huge success of the crowdfunding platforms, not every project reaches its funding goal. One of the main reasons for a project's failure is the difficulty in establishing a linkage between its founders and those investors who are interested in funding such projects. A potential solution to this problem is to develop recommendation systems that suggest suitable projects to crowdfunding investors by capturing their interests. In this paper, we explore Kickstarter, a popular reward-based crowdfunding platform. Being a highly heterogeneous platform, Kickstarter is fuelled by a dynamic community of people who constantly interact with each other before investing in projects. Therefore, the decision to invest in a project depends not only on the preference of individuals, but also on the influence of the groups that a person belongs to and the ongoing status of the projects. In this paper, we propose a probabilistic recommendation model, called CrowdRec, that recommends Kickstarter projects to a group of investors by incorporating the ongoing status of projects, the personal preference of individual members, and the collective preference of the group. Using a comprehensive dataset of over 40K crowdfunding groups and 5K projects, we show that our model is effective in recommending projects to groups of Kickstarter users.

【Keywords】: crowdfunding; group recommendation; probabilistic models; recommendation; topic models

30. Quality Management in Crowdsourcing using Gold Judges Behavior.

【Paper Link】 【Pages】:267-276

【Authors】: Gabriella Kazai ; Imed Zitouni

【Abstract】: Crowdsourcing relevance labels has become an accepted practice for the evaluation of IR systems, where the task of constructing a test collection is distributed over large populations of unknown users with widely varied skills and motivations. A typical method to check and ensure the quality of the crowd's output is to inject work tasks with known answers (gold tasks) on which workers' performance can be measured. However, gold tasks are expensive to create and have limited application. A more recent trend is to monitor the workers' interactions during a task and estimate their work quality based on their behavior. In this paper, we show that without gold behavior signals that reflect trusted interaction patterns, classifiers can perform poorly, especially for complex tasks, which can lead to high quality crowd workers getting blocked while poorly performing workers remain undetected. Through a series of crowdsourcing experiments, we compare the behaviors of trained professional judges and crowd workers and then use the trained judges' behavior signals as gold behavior to train a classifier to detect poorly performing crowd workers. Our experiments show that classification accuracy almost doubles in some tasks with the use of gold behavior data.

【Keywords】: experimentation; measurement

31. On Obtaining Effort Based Judgements for Information Retrieval.

【Paper Link】 【Pages】:277-286

【Authors】: Manisha Verma ; Emine Yilmaz ; Nick Craswell

【Abstract】: Document relevance has been the primary focus in the design, optimization and evaluation of retrieval systems. Traditional test collections are constructed by asking judges the relevance grade for a document with respect to an input query. Recent work of Yilmaz et al. found evidence that effort is another important factor in determining document utility, suggesting that more thought should be given to incorporating effort into information retrieval. However, that work did not ask judges to directly assess the level of effort required to consume a document or analyse how effort judgements relate to traditional relevance judgements. In this work, focusing on three aspects associated with effort, we show that it is possible to get judgements of effort from the assessors. We further show that, given documents of the same relevance grade, the effort needed to find the portion of the document relevant to the query is a significant factor in determining user satisfaction as well as user preference between these documents. Our results suggest that if the end goal is to build retrieval systems that optimize user satisfaction, effort should be included as an additional factor to relevance in building and evaluating retrieval systems. We further show that new retrieval features are needed if the goal is to build retrieval systems that jointly optimize relevance and effort, and propose a set of such features. Finally, we focus on the evaluation of retrieval systems and show that incorporating effort into retrieval evaluation could lead to significant differences regarding the performance of retrieval systems.

【Keywords】: effort; evaluation; information retrieval; judgements

32. A Semantic Graph based Topic Model for Question Retrieval in Community Question Answering.

【Paper Link】 【Pages】:287-296

【Authors】: Long Chen ; Joemon M. Jose ; Haitao Yu ; Fajie Yuan ; Dell Zhang

【Abstract】: Community Question Answering (CQA) services, such as Yahoo! Answers and WikiAnswers, have become popular with users as one of the central paradigms for satisfying users' information needs. The task of question retrieval aims to resolve one's query directly by finding the most relevant questions (together with their answers) from an archive of past questions. However, as the text of each question is short, there is usually a lexical gap between the queried question and the past questions. To alleviate this problem, we present a hybrid approach that blends several language modelling techniques for question retrieval, namely, the classic (query-likelihood) language model, the state-of-the-art translation-based language model, and our proposed semantics-based language model. The semantics of each candidate question is given by a probabilistic topic model which makes use of local and global semantic graphs for capturing the hidden interactions among entities (e.g., people, places, and concepts) in question-answer pairs. Experiments on two real-world datasets show that our approach can significantly outperform existing ones.

【Keywords】: community question answering; knowledge repository; language modelling; question retrieval; topic modelling

33. Modeling Check-in Preferences with Multidimensional Knowledge: A Minimax Entropy Approach.

【Paper Link】 【Pages】:297-306

【Authors】: Jingjing Wang ; Min Li ; Jiawei Han ; Xiaolong Wang

【Abstract】: We propose a single unified minimax entropy approach for user preference modeling with multidimensional knowledge. Our approach provides a discriminative learning protocol which is able to simultaneously a) leverage explicit human knowledge, which is encoded as explicit features, and b) model the more ambiguous hidden intent, which is encoded as latent features. A latent feature can be carved by any parametric form, which allows it to accommodate arbitrary underlying assumptions. We present our approach in the scenario of check-in preference learning and demonstrate it is capable of modeling user preference in an optimized manner. Check-in preference is a fundamental component of Point-of-Interest (POI) prediction and recommendation. A user's check-in can be affected along multiple dimensions, such as the particular time, the popularity of the place, his/her category and geographic preference, etc. With the geographic preferences modeled as latent features and the rest as explicit features, our approach provides an in-depth understanding of users' time-varying preferences over different POIs, as well as a reasonable representation of the hidden geographic clusters, in a joint manner. Experimental results based on the task of POI prediction/recommendation with two real-world check-in datasets demonstrate that our approach can accurately model the check-in preferences and significantly outperforms state-of-the-art models.

【Keywords】: check-in; minimax entropy; multidimensional knowledge

34. You've got Mail, and Here is What you Could do With It!: Analyzing and Predicting Actions on Email Messages.

【Paper Link】 【Pages】:307-316

【Authors】: Dotan Di Castro ; Zohar Shay Karnin ; Liane Lewin-Eytan ; Yoelle Maarek

【Abstract】: With email traffic increasing, leading Web mail services have started to offer features that assist users in reading and processing their inboxes. One approach is to identify "important" messages, while a complementary one is to bundle messages, especially machine-generated ones, in pre-defined categories. We rather propose here to go back to the task at hand and consider what actions the users might conduct on received messages. We thoroughly studied, in a privacy-preserving manner, the actions of a large number of users in Yahoo mail, and found out that the most frequent actions are typically read, reply, delete and a sub-type of delete, delete-without-read. We devised a learning framework for predicting these four actions, for users with various levels of activity per action. Our framework leverages both vertical learning for personalization and horizontal learning for regularization purposes. In order to verify the quality of our predictions, we conducted a large-scale experiment involving users who had previously agreed to participate in such research studies. Our results show that, for recall values of 90%, we can predict important actions such as read or reply at precision levels up to 40% for active users, which we consider pretty encouraging for an assistance task. For less active users, we show that our regularization achieves an increase in AUC of close to 50%. To the best of our knowledge, our work is the first to provide a unified framework of this scale for predicting multiple actions on Web email, which hopefully provides a new ground for inventing new user experiences to help users process their inboxes.

【Keywords】: actions; email; email message; predicting actions

35. Hierarchical Label Propagation and Discovery for Machine Generated Email.

【Paper Link】 【Pages】:317-326

【Authors】: James Bradley Wendt ; Michael Bendersky ; Lluis Garcia Pueyo ; Vanja Josifovski ; Balint Miklos ; Ivo Krka ; Amitabh Saikia ; Jie Yang ; Marc-Allen Cartright ; Sujith Ravi

【Abstract】: Machine-generated documents such as email or dynamic web pages are single instantiations of a pre-defined structural template. As such, they can be viewed as a hierarchy of template and document specific content. This hierarchical template representation has several important advantages for document clustering and classification. First, templates capture common topics among the documents, while filtering out the potentially noisy variabilities such as personal information. Second, template representations scale far better than document representations since a single template captures numerous documents. Finally, since templates group together structurally similar documents, they can propagate properties between all the documents that match the template. In this paper, we use these advantages for document classification by formulating an efficient and effective hierarchical label propagation and discovery algorithm. The labels are propagated first over a template graph (constructed based on either term-based or topic-based similarities), and then to the matching documents. We evaluate the performance of the proposed algorithm using a large donated email corpus and show that the resulting template graph is significantly more compact than the corresponding document graph and the hierarchical label propagation is both efficient and effective in increasing the coverage of the baseline document classification algorithm. We demonstrate that the template label propagation achieves more than 91% precision and 93% recall, while increasing the label coverage by more than 11%.

【Keywords】: hierarchical label propagation; machine-generated email; structural template

36. Enforcing k-anonymity in Web Mail Auditing.

【Paper Link】 【Pages】:327-336

【Authors】: Dotan Di Castro ; Liane Lewin-Eytan ; Yoelle Maarek ; Ran Wolff ; Eyal Zohar

【Abstract】: We study the problem of k-anonymization of mail messages in the realistic scenario of auditing mail traffic in a major commercial Web mail service. Mail auditing is necessary in various Web mail debugging and quality assurance activities, such as anti-spam or the qualitative evaluation of novel mail features. It is conducted by trained professionals, often referred to as "auditors", who are shown messages that could expose personally identifiable information. We address here the challenge of k-anonymizing such messages, focusing on machine-generated mail messages that represent more than 90% of today's mail traffic. We introduce a novel message signature Mail-Hash, specifically tailored to identifying structurally-similar messages, which allows us to put such messages in a same equivalence class. We then define a process that generates, for each class, masked mail samples that can be shown to auditors, while guaranteeing the k-anonymity of users. The productivity of auditors is measured by the amount of non-hidden mail content they can see every day, while considering normal working conditions, which set a limit to the number of mail samples they can review. In addition, we consider k-anonymity over time since, by definition of k-anonymity, every new release places additional constraints on the assignment of samples. We describe in detail the results we obtained over actual Yahoo mail traffic, and thus demonstrate that our methods are feasible at Web mail scale. Given the constantly growing concern of users over their email being scanned by others, we argue that it is critical to devise such algorithms that guarantee k-anonymity, and implement associated processes in order to restore the trust of mail users.

【Keywords】: k-anonymization; mail auditing; mail templating; quality assurance
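
As a concrete illustration of the equivalence-class idea, the following is a minimal, hypothetical Python sketch: it masks tokens that look user-specific, hashes the remaining template skeleton (a stand-in for the paper's far more elaborate Mail-Hash signature), and releases only classes covering at least k distinct users. All names and thresholds are illustrative.

```python
import hashlib
import re
from collections import defaultdict

def mail_hash(body: str) -> str:
    """Toy structural signature: mask digits and long tokens that are
    likely user-specific, then hash the remaining template skeleton.
    (A stand-in for the paper's Mail-Hash; not its actual definition.)"""
    skeleton = re.sub(r"\d+", "#", body)              # mask numbers
    skeleton = re.sub(r"\b\w{12,}\b", "*", skeleton)  # mask long IDs
    return hashlib.sha1(skeleton.encode("utf-8")).hexdigest()

def k_anonymous_classes(messages, k=5):
    """Group messages by structural signature; keep only equivalence
    classes that cover at least k distinct users."""
    classes = defaultdict(list)
    for user, body in messages:
        classes[mail_hash(body)].append((user, body))
    return {sig: msgs for sig, msgs in classes.items()
            if len({u for u, _ in msgs}) >= k}

msgs = [(f"user{i}", f"Your order {1000 + i} ships on 02/2{i % 10}.")
        for i in range(6)]
print(k_anonymous_classes(msgs, k=5).keys())  # one template class, 6 users
```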

37. An Information-Theoretic Approach to Individual Sequential Data Sanitization.

Paper Link】 【Pages】:337-346

【Authors】: Luca Bonomi ; Liyue Fan ; Hongxia Jin

【Abstract】: Fine-grained personal data, such as location check-ins, web histories, and physical activities, is now generated continuously and at large scale. These data sequences are typically shared with untrusted parties for data analysis and promotional services. However, individually-generated sequential data contains behavior patterns and may disclose sensitive information if not properly sanitized. Furthermore, the utility of the released sequence can be adversely affected by sanitization techniques. In this paper, we study the problem of individual sequence data sanitization with minimum utility loss, given user-specified sensitive patterns. We propose a privacy notion based on information theory and sanitize sequence data via generalization. We show that the optimization problem is hard and develop two efficient heuristic solutions. Extensive experimental evaluations are conducted on real-world datasets and the results demonstrate the efficiency and effectiveness of our solutions.

【Keywords】: data sanitization; mutual information; sequential patterns
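
The privacy notion here is information-theoretic, so one way to picture the leakage being controlled is to measure the empirical mutual information between an original symbol stream and its generalized release. The sketch below is a self-contained illustration under that reading; the sequence, the generalization map, and the interpretation are invented, not the paper's actual mechanism.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information (in bits) between two aligned
    symbol streams, given as (x, y) pairs."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Hypothetical check-in sequence and a generalization map that coarsens
# sensitive venues into a broad category.
sequence = ["cafe", "clinic", "gym", "clinic", "cafe", "bar"]
generalize = {"clinic": "place", "bar": "place", "cafe": "cafe", "gym": "gym"}
released = [generalize[s] for s in sequence]

# Residual information about the original symbols after generalization:
mi = mutual_information(list(zip(sequence, released)))
print(f"I(original; released) = {mi:.3f} bits")
```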

38. Improving IP Geolocation using Query Logs.

Paper Link】 【Pages】:347-356

【Authors】: Ovidiu Dan ; Vaibhav Parikh ; Brian D. Davison

【Abstract】: IP geolocation databases map IP addresses to their geographical locations. These databases are important for several applications such as local search engine relevance, credit card fraud protection, geotargeted advertising, and online content delivery. While they are the most popular method of geolocation, they can have low accuracy at the city level. In this paper we evaluate and improve IP geolocation databases using data collected from search engine logs. We generate a large ground-truth dataset using real-time global positioning data extracted from search engine logs. We show that incorrect geolocation information can have a negative impact on implicit user metrics. Using the dataset we measure the accuracy of three state-of-the-art commercial IP geolocation databases. We then introduce a technique to improve existing geolocation databases by mining explicit locations from query logs. We show significant accuracy gains in 44 to 49 out of the top 50 countries, depending on the IP geolocation database. Finally, we validate the approach with a large-scale A/B experiment that shows improvements in several user metrics.

【Keywords】: contextual relevance; geographic personalization; geographic targeting; geotargeting; ip geolocation; local search
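
A minimal sketch of the log-mining step as described: take a majority vote of explicit city mentions per IP prefix, with simple support and share thresholds. The log records, city gazetteer, and thresholds below are invented for illustration, not the paper's pipeline.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (ip_prefix, query). Explicit locations are
# mined from queries such as "pizza in seattle".
log = [
    ("203.0.113", "pizza in seattle"),
    ("203.0.113", "seattle weather"),
    ("203.0.113", "dentist bellevue"),
    ("198.51.100", "paris metro map"),
]
KNOWN_CITIES = {"seattle", "bellevue", "paris"}

def mine_locations(log, min_support=2, min_share=0.5):
    """Assign each IP prefix the city most often mentioned explicitly
    in its queries, if the evidence is strong enough."""
    votes = defaultdict(Counter)
    for prefix, query in log:
        for token in query.split():
            if token in KNOWN_CITIES:
                votes[prefix][token] += 1
    corrections = {}
    for prefix, counter in votes.items():
        city, count = counter.most_common(1)[0]
        if count >= min_support and count / sum(counter.values()) >= min_share:
            corrections[prefix] = city
    return corrections

print(mine_locations(log))   # {'203.0.113': 'seattle'}
```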

39. Geographic Segmentation via Latent Poisson Factor Model.

Paper Link】 【Pages】:357-366

【Authors】: Rose Yu ; Andrew Gelfand ; Suju Rajan ; Cyrus Shahabi ; Yan Liu

【Abstract】: Discovering latent structures in spatial data is of critical importance to understanding the user behavior of location-based services. In this paper, we study the problem of geographic segmentation of spatial data, which involves dividing a collection of observations into distinct geo-spatial regions and uncovering abstract correlation structures in the data. We introduce a novel, Latent Poisson Factor (LPF) model to describe spatial count data. The model describes the spatial counts as a Poisson distribution with a mean that factors over a joint item-location latent space. The latent factors are constrained with weak labels to help uncover interesting spatial dependencies. We study the LPF model on a mobile app usage data set and a news article readership data set. We empirically demonstrate its effectiveness on a variety of prediction tasks on these two data sets.

【Keywords】: geographic segmentation; mobile app usage; spatial data

Big Data Algorithms 10

40. Scaling up Link Prediction with Ensembles.

Paper Link】 【Pages】:367-376

【Authors】: Liang Duan ; Charu Aggarwal ; Shuai Ma ; Renjun Hu ; Jinpeng Huai

【Abstract】: A network with $n$ nodes contains $O(n^2)$ possible links. Even for networks of modest size, it is often difficult to evaluate all pairwise possibilities for links in a meaningful way. Furthermore, even though link prediction is closely related to missing value estimation problems, such as collaborative filtering, it is often difficult to use sophisticated models such as latent factor methods because of their computational complexity over very large networks. Due to this computational complexity, most known link prediction methods are designed for evaluating the link propensity over a specified subset of links, rather than for performing a global search over the entire network. In practice, however, it is essential to perform an exhaustive search over the entire network. In this paper, we propose an ensemble-enabled approach to scaling up link prediction, which is able to decompose traditional link prediction problems into subproblems of smaller size. These subproblems are each solved with the use of latent factor models, which can be effectively implemented over networks of modest size. Furthermore, the ensemble-enabled approach has several advantages in terms of performance. We show the advantage of using ensemble-based latent factor models with experiments on very large networks. Experimental results demonstrate the effectiveness and scalability of our approach.

【Keywords】: big data; ensembles; link prediction; networks

41. DiFacto: Distributed Factorization Machines.

Paper Link】 【Pages】:377-386

【Authors】: Mu Li ; Ziqi Liu ; Alexander J. Smola ; Yu-Xiang Wang

【Abstract】: Factorization Machines offer good performance and useful embeddings of data. However, they are costly to scale to large amounts of data and large numbers of features. In this paper we describe DiFacto, which uses a refined Factorization Machine model with sparse memory-adaptive constraints and frequency-adaptive regularization. We show how to distribute DiFacto over multiple machines using the Parameter Server framework by computing distributed subgradients on minibatches asynchronously. We analyze its convergence and demonstrate its efficiency on computational advertising datasets with billions of examples and features.

【Keywords】: distributed systems; factorization machines; optimization; recommender systems; statistics
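
For readers unfamiliar with Factorization Machines, the base predictor that DiFacto refines can be sketched as follows, using the standard O(nk) reformulation of the pairwise term; DiFacto's sparse memory-adaptive constraints, frequency-adaptive regularization, and asynchronous distributed updates are not shown, and all values are illustrative.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization Machine prediction
        y(x) = w0 + <w, x> + sum_{i<j} <V_i, V_j> x_i x_j,
    computed with the usual O(nk) reformulation of the pairwise term."""
    linear = w0 + x @ w
    s = V.T @ x                       # k-vector: sum_i v_ik x_i
    s2 = (V ** 2).T @ (x ** 2)        # k-vector: sum_i v_ik^2 x_i^2
    return linear + 0.5 * np.sum(s * s - s2)

rng = np.random.default_rng(0)
n, k = 8, 3                           # features, latent dimensions
x = rng.random(n)
print(fm_predict(x, 0.1, rng.normal(size=n), rng.normal(size=(n, k))))
```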

42. Distributed Balanced Partitioning via Linear Embedding.

Paper Link】 【Pages】:387-396

【Authors】: Kevin Aydin ; MohammadHossein Bateni ; Vahab S. Mirrokni

【Abstract】: Balanced partitioning is often a crucial first step in solving large-scale graph optimization problems: in some cases, a big graph is chopped into pieces that fit on one machine to be processed independently before stitching the results together, leading to certain suboptimality from the interaction among different pieces. In other cases, links between different parts may show up in the running time and/or network communications cost, hence the desire to have small cut size. We study a distributed balanced partitioning problem where the goal is to partition the vertices of a given graph into k pieces, minimizing the total cut size. Our algorithm is composed of a few steps that are easily implementable in distributed computation frameworks, e.g., MapReduce. The algorithm first embeds nodes of the graph onto a line, and then processes nodes in a distributed manner guided by the linear embedding order. We examine various ways to find the first embedding, e.g., via a hierarchical clustering or Hilbert curves. Then we apply four different techniques: local swaps, minimum cuts on partition boundaries, contraction, and dynamic programming. Our empirical study compares these techniques with each other, and to previous work in distributed algorithms, e.g., a label propagation method, FENNEL and Spinner. We report our results both on a private map graph and several public social networks, and show that our results beat previous distributed algorithms: we notice, e.g., a 15-25% reduction in cut size over [UB13]. We also observe that our algorithms allow for scalable distributed implementation for any number of partitions. Finally, we apply our techniques to Google Maps Driving Directions to minimize the number of multi-shard queries with the goal of saving CPU usage. In live experiments, we observe an ≈ 40% drop in the number of multi-shard queries when comparing our method with a standard geography-based method.

【Keywords】: cut minimization; embedding to line; imbalance; local improvement; mapreduce; maps; partitioning; social networks
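
The second stage of the pipeline is easy to picture: once nodes are embedded on a line, a balanced k-way partition falls out of cutting the order into contiguous ranges, which the paper then refines with local swaps, boundary min-cuts, contraction, and dynamic programming. A toy sketch of the cutting step only, with the embedding order assumed given:

```python
def balanced_partition(order, k):
    """Cut a linear embedding order into k contiguous, balanced pieces.
    `order` is the node sequence produced by the embedding step (e.g.,
    a space-filling-curve or hierarchical-clustering order)."""
    n = len(order)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [order[bounds[i]:bounds[i + 1]] for i in range(k)]

# Hypothetical embedding: nodes already ordered along a line.
nodes = list("ABCDEFGHIJ")
for i, part in enumerate(balanced_partition(nodes, 3)):
    print(i, part)   # sizes 3 / 4 / 3
```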

43. Kangaroo: Workload-Aware Processing of Range Data and Range Queries in Hadoop.

Paper Link】 【Pages】:397-406

【Authors】: Ahmed M. Aly ; Hazem Elmeleegy ; Yan Qi ; Walid G. Aref

【Abstract】: Despite the importance and widespread use of range data, e.g., time intervals, spatial ranges, etc., little attention has been devoted to studying the processing and querying of range data in the context of big data. The main challenge lies in the nature of the traditional index structures, e.g., the B-Tree and the R-Tree, which are centralized by nature and hence almost crippled when deployed in a distributed environment. To address this challenge, this paper presents Kangaroo, a system built on top of Hadoop to optimize the execution of range queries over range data. The main idea behind Kangaroo is to split the data into non-overlapping partitions in a way that minimizes the query execution time. Kangaroo is query workload-aware, i.e., it produces partitioning layouts that minimize the query processing time of given query patterns. In this paper, we study the design challenges Kangaroo addresses in order to be deployed on top of a distributed file system, i.e., HDFS. We also study four different partitioning schemes that Kangaroo can support. With extensive experiments using real range data of more than one billion records and a real query workload of more than 30,000 queries, we show that the partitioning schemes of Kangaroo can significantly reduce the I/O of range queries on range data.

【Keywords】: big data; hadoop; indexing; query processing and optimization

44. Feedback Control of Real-Time Display Advertising.

Paper Link】 【Pages】:407-416

【Authors】: Weinan Zhang ; Yifei Rong ; Jun Wang ; Tianchi Zhu ; Xiaofan Wang

【Abstract】: Real-Time Bidding (RTB) is revolutionising display advertising by facilitating per-impression auctions to buy ad impressions as they are being generated. Being able to use impression-level data, such as user cookies, encourages user behaviour targeting, and hence has significantly improved the effectiveness of ad campaigns. However, a fundamental drawback of RTB is its instability: because the bid decision is made per impression, there are enormous fluctuations in campaigns' key performance indicators (KPIs). As such, advertisers face great difficulty in controlling their campaign performance against the associated costs. In this paper, we propose a feedback control mechanism for RTB which helps advertisers dynamically adjust the bids to effectively control the KPIs, e.g., the auction winning ratio and the effective cost per click (eCPC). We further formulate an optimisation framework to show that the proposed feedback control mechanism is also able to optimise campaign performance. By setting the effective cost per click to an optimal reference value, the campaign's number of ad clicks can be maximised under the budget constraint. Our empirical study based on real-world data verifies the effectiveness and robustness of our RTB control system in various situations. The proposed feedback control mechanism has also been deployed on a commercial RTB platform, and the online test has shown its success in generating controllable advertising performance.

【Keywords】: demand-side platform; display advertising; feedback control; real-time bidding
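
Feedback control of a KPI such as eCPC can be pictured with a textbook PID loop. The sketch below is a hypothetical single control step, not the deployed controller: the gains, the exponential actuator, and the sample readings are all illustrative assumptions.

```python
import math

def pid_bid_adjust(kpi_history, reference, kp=0.5, ki=0.05, kd=0.1):
    """One PID step on a KPI such as eCPC: returns a multiplicative
    gain for the base bid. `kpi_history` holds the KPI readings so far;
    `reference` is the target value the advertiser wants to settle at."""
    errors = [reference - v for v in kpi_history]
    p = errors[-1]                                            # proportional
    i = sum(errors)                                           # integral
    d = errors[-1] - errors[-2] if len(errors) > 1 else 0.0   # derivative
    return math.exp(kp * p + ki * i + kd * d)                 # positive gain

ecpc = [1.8, 1.6, 1.5]        # hypothetical eCPC readings, in dollars
gain = pid_bid_adjust(ecpc, reference=1.2)
print(f"scale next bids by {gain:.3f}")  # < 1: bid lower to pull eCPC down
```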

45. Multi-Score Position Auctions.

Paper Link】 【Pages】:417-425

【Authors】: Denis Xavier Charles ; Nikhil R. Devanur ; Balasubramanian Sivan

【Abstract】: In this paper we propose a general family of position auctions used in paid search, which we call multi-score position auctions. These auctions contain the GSP auction and the GSP auction with squashing as special cases. We show experimentally that these auctions contain special cases that perform better than the GSP auction with squashing, in terms of revenue, and the number of clicks on ads. In particular, we study in detail the special case that squashes the first slot alone and show that this beats pure squashing (which squashes all slots uniformly). We study the equilibria that arise in this special case to examine both the first order and the second order effect of moving from the squashing-all-slots auction to the squash-only-the-top-slot auction. For studying the second order effect, we simulate auctions using the value-relevance correlated distribution suggested in Lahaie and Pennock [2007]. Since this distribution is derived from a study of value and relevance distributions in Yahoo! we believe the insights derived from this simulation to be valuable. For measuring the first order effect, in addition to the said simulation, we also conduct experiments using auction data from Bing over several weeks that includes a random sample of all auctions.

【Keywords】: multi-score position auctions; squashing
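
The mechanics of slot-specific squashing can be illustrated with a toy sequential allocation: for each slot, rank the remaining ads by bid * quality^s with that slot's exponent s. With s < 1 applied only to the top slot, a high-bid/low-quality ad can win position one while the other slots keep the standard ranking. The sequential-allocation reading and all numbers below are illustrative assumptions, not the paper's exact mechanism.

```python
def allocate(ads, slot_exponents):
    """Sequentially assign slots: for each slot, rank the remaining ads
    by bid * quality**s, where s is that slot's squashing exponent."""
    remaining, assignment = list(ads), []
    for s in slot_exponents:
        winner = max(remaining, key=lambda a: a["bid"] * a["quality"] ** s)
        assignment.append(winner["name"])
        remaining.remove(winner)
    return assignment

ads = [
    {"name": "A", "bid": 4.0, "quality": 0.05},   # high bid, low quality
    {"name": "B", "bid": 1.5, "quality": 0.20},
    {"name": "C", "bid": 2.0, "quality": 0.12},
]
print(allocate(ads, [1.0, 1.0, 1.0]))  # no squashing:         ['B', 'C', 'A']
print(allocate(ads, [0.5, 1.0, 1.0]))  # squash top slot only: ['A', 'B', 'C']
```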

46. Multi-view Machines.

Paper Link】 【Pages】:427-436

【Authors】: Bokai Cao ; Hucheng Zhou ; Guoqiang Li ; Philip S. Yu

【Abstract】: With the rapidly growing amount of data available on the web, it becomes increasingly feasible to obtain data from different perspectives for multi-view learning. Successful examples of such web applications include recommendation and targeted advertising. Specifically, to predict whether a user will click an ad in a query context, features are available from the user profile, the ad information and the query description, and each of them can only capture part of the task signals from a particular aspect/view. Different views provide complementary information to learn a practical model for these applications. Therefore, an effective integration of the multi-view information is critical to facilitate the learning performance. In this paper, we propose a general predictor, named multi-view machines (MVMs), that can effectively explore the full-order interactions between features from multiple views. A joint factorization is applied to the interaction parameters, which makes parameter estimation more accurate under sparsity and renders the model with the capacity to avoid overfitting. Moreover, MVMs can work in conjunction with different loss functions for a variety of machine learning tasks. The advantages of MVMs are illustrated through comparison with other methods for multi-view prediction, including support vector machines (SVMs), support tensor machines (STMs) and factorization machines (FMs). A stochastic gradient descent method and a distributed implementation on Spark are presented to learn the MVM model. Through empirical studies on two real-world web application datasets, we demonstrate the effectiveness of MVMs in modeling feature interactions in multi-view data. MVMs show a 3.51% accuracy improvement over FMs for movie rating prediction, and 0.57% for ad click prediction.

【Keywords】: factorization; feature interaction; multi-view learning
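
The full-order interaction in MVMs admits a compact form: append a constant feature to each view (so lower-order terms are included), project every view into a shared k-dimensional space, and sum the elementwise products across views. A small sketch under these assumptions, with all dimensions and values illustrative:

```python
import numpy as np

def mvm_score(views, factors):
    """Multi-view machine score:
        y = sum_k  prod_v  ( [x_v; 1] . V_v[:, k] )
    where the appended constant feature brings in lower-order terms."""
    k = factors[0].shape[1]
    prod = np.ones(k)
    for x, V in zip(views, factors):
        x_aug = np.append(x, 1.0)         # constant feature per view
        prod *= x_aug @ V
    return prod.sum()

rng = np.random.default_rng(1)
user, ad, query = rng.random(5), rng.random(4), rng.random(3)
factors = [rng.normal(scale=0.1, size=(d + 1, 8)) for d in (5, 4, 3)]
print(mvm_score([user, ad, query], factors))
```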

47. Transductive Classification on Heterogeneous Information Networks with Edge Betweenness-based Normalization.

Paper Link】 【Pages】:437-446

【Authors】: Phiradet Bangcharoensap ; Tsuyoshi Murata ; Hayato Kobayashi ; Nobuyuki Shimizu

【Abstract】: This paper proposes a novel method for transductive classification on heterogeneous information networks composed of multiple types of vertices. Such networks naturally represent many real-world Web data such as DBLP data (author, paper, and conference). Given a network where some vertices are labeled, the classifier aims to predict labels for the remaining vertices by propagating the labels to the entire network. In the label propagation process, many studies reduce the importance of edges connecting to a high-degree vertex. This assumption is unsatisfactory when the reliability of a vertex's label cannot be inferred from its degree. On the basis of our intuition that edges bridging across communities are less trustworthy, we adapt edge betweenness to quantify the importance of edges. Since directly applying the conventional edge betweenness is inefficient on heterogeneous networks, we propose two additional refinements. First, the centrality utilizes the fact that networks contain multiple types of vertices. Second, the centrality ignores flows originating from the endpoints of the edge under consideration. The experimental results on real-world datasets show our proposed method is more effective than a state-of-the-art method, GNetMine. On average, our method yields 92.79 ± 1.25% accuracy on a DBLP network even if only 1.92% of vertices are labeled. Our simple weighting scheme results in an increase of more than 5 percentage points in accuracy compared with GNetMine.

【Keywords】: edge betweenness centrality; heterogeneous information network; transductive classification

48. The Troll-Trust Model for Ranking in Signed Networks.

Paper Link】 【Pages】:447-456

【Authors】: Zhaoming Wu ; Charu C. Aggarwal ; Jimeng Sun

【Abstract】: Signed social networks have become increasingly important in recent years because of the ability to model trust-based relationships in review sites like Slashdot, Epinions, and Wikipedia. As a result, many traditional network mining problems have been re-visited in the context of networks in which signs are associated with the links. Examples of such problems include community detection, link prediction, and low rank approximation. In this paper, we will examine the problem of ranking nodes in signed networks. In particular, we will design a ranking model, which has a clear physical interpretation in terms of the sign of the edges in the network. Specifically, we propose the Troll-Trust model that models the probability of trustworthiness of individual data sources as an interpretation for the underlying ranking values. We will show the advantages of this approach over a variety of baselines.

【Keywords】: data mining; ranking; signed networks
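
A simplified fixed-point iteration in the spirit of the Troll-Trust model can convey the physical interpretation (this is not the authors' exact formulation): each node's trustworthiness is a logistic function of the trust-weighted sum of its incoming signed edges, and the converged values induce a ranking. The graph, parameters, and update rule below are illustrative.

```python
import math

def troll_trust(edges, nodes, beta=0.0, lam=1.0, iters=50):
    """Simplified fixed-point iteration: a node's trustworthiness pi is
    a logistic function of the trust-weighted sum of its incoming
    signed edges."""
    pi = {u: 0.5 for u in nodes}
    for _ in range(iters):
        pi = {u: 1.0 / (1.0 + math.exp(-(beta + lam * sum(
                  w * pi[v] for v, t, w in edges if t == u))))
              for u in nodes}
    return pi

# Signed edges (source, target, weight): +1 trust, -1 distrust.
edges = [("a", "b", +1), ("b", "c", +1), ("a", "c", +1),
         ("c", "d", -1), ("b", "d", -1)]
scores = troll_trust(edges, nodes="abcd")
print(sorted(scores, key=scores.get, reverse=True))  # distrusted 'd' ranks last
```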

49. Multileave Gradient Descent for Fast Online Learning to Rank.

Paper Link】 【Pages】:457-466

【Authors】: Anne Schuth ; Harrie Oosterhuis ; Shimon Whiteson ; Maarten de Rijke

【Abstract】: Modern search systems are based on dozens or even hundreds of ranking features. The dueling bandit gradient descent (DBGD) algorithm has been shown to effectively learn combinations of these features solely from user interactions. DBGD explores the search space by comparing a possibly improved ranker to the current production ranker. To this end, it uses interleaved comparison methods, which can infer with high sensitivity a preference between two rankings based only on interaction data. A limiting factor is that it can compare only to a single exploratory ranker. We propose an online learning to rank algorithm called multileave gradient descent (MGD) that extends DBGD to learn from so-called multileaved comparison methods that can compare a set of rankings instead of merely a pair. We show experimentally that MGD allows for better selection of candidates than DBGD without the need for more comparisons involving users. An important implication of our results is that orders of magnitude less user interaction data is required to find good rankers when multileaved comparisons are used within online learning to rank. Hence, fewer users need to be exposed to possibly inferior rankers and our method allows search engines to adapt more quickly to changes in user preferences.

【Keywords】: information retrieval; interleaved comparisons; learning to rank; multileaved comparisons
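
The MGD update is easy to sketch: sample n exploratory directions, compare all candidate rankers against the production ranker in one multileaved comparison, and move toward the mean of the winning directions. In the toy below, a utility function stands in for preferences inferred from user clicks; all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def multileave_winners(center, candidates, utility):
    """Stand-in for a multileaved comparison: in practice winners are
    inferred from clicks on a multileaved result list; here a toy
    utility function plays the role of the users' preferences."""
    base = utility(center)
    return [u for w, u in candidates if utility(w) > base]

def mgd_step(w, utility, n=9, delta=1.0, alpha=0.1):
    """One MGD step: sample n unit directions, compare all candidate
    rankers at once, move toward the mean of the winning directions."""
    dirs = rng.normal(size=(n, len(w)))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    candidates = [(w + delta * u, u) for u in dirs]
    winners = multileave_winners(w, candidates, utility)
    return w + alpha * np.mean(winners, axis=0) if winners else w

target = np.array([1.0, -2.0, 0.5])            # "ideal" feature weights
utility = lambda w: -np.linalg.norm(w - target)
w = np.zeros(3)
for _ in range(200):
    w = mgd_step(w, utility)
print(np.round(w, 2))                          # drifts toward the target
```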

Practice & Experience Track 2

50. AMiner: Toward Understanding Big Scholar Data.

Paper Link】 【Pages】:467

【Authors】: Jie Tang

【Abstract】: In this talk, I will present a novel academic search and mining system, AMiner, the second generation of the ArnetMiner system. Different from traditional academic search systems that focus on document (paper) search, AMiner aims to provide a systematic modeling approach for researchers (authors), ultimately to gain a deep understanding of the big (heterogeneous) network formed by authors, the papers they have published, and the venues in which they published those papers. The system extracts researchers' profiles automatically from the Web and integrates the researcher profiles with publication papers after name disambiguation. To date, the system has collected big scholar data comprising more than 130,000,000 researcher profiles and 100,000,000 papers from multiple publication databases. We also developed an approach named COSNET to connect AMiner with several professional social networks such as LinkedIn and VideoLectures, which significantly enriches the metadata of the scholarly data. Based on the integrated big scholar data, we devise a unified topic modeling approach for modeling the different entities (authors, papers, venues) simultaneously and provide topic-level expertise search by leveraging the modeling results. In addition, AMiner offers a set of researcher-centered functions including social influence analysis, influence visualization, collaboration recommendation, relationship mining, similarity analysis and community evolution. The system has been in operation since 2006 and has attracted more than 7,000,000 independent IP accesses from over 200 countries/regions.

【Keywords】: academic search; community evolution; integration; recommendation; social influence

51. Serving a Billion Personalized News Feeds.

Paper Link】 【Pages】:469

【Authors】: Lars Backstrom

【Abstract】: Feed ranking's goal is to provide each of over a billion people with a personalized experience. We strive to provide the most compelling content to each person, personalized to them so that they are most likely to see the content that is most interesting to them. Similar to a newspaper, putting the right stories above the fold has always been critical to engaging customers and interesting them in the rest of the paper. In feed ranking, we face a similar challenge, but on a grander scale. Each time a person visits, we need to find the best piece of content out of all the available stories and put it at the top of the feed where people are most likely to see it. To accomplish this, we do large-scale machine learning to model each person, figure out which friends, pages and topics they care about, and pick the stories each particular person is interested in. In addition to the large-scale machine learning problems we work on, another primary area of research is understanding the value we are creating for people and making sure that our objective function is in alignment with what people want.

【Keywords】: facebook; machine learning; news feed; social networks

Keynote Address 1

52. The Predictive Power of Massive Data about our Fine-Grained Behavior.

Paper Link】 【Pages】:471-472

【Authors】: Foster J. Provost

【Abstract】: What really is it about "big data" that makes it different from traditional data? In this talk I illustrate one important aspect: massive ultra-fine-grained data on individuals' behaviors holds remarkable predictive power. I examine several applications to marketing-related tasks, showing how machine learning methods can extract the predictive power and how the value of the data "asset" seems different from the value of traditional data used for predictive modeling. I then dig deeper into explaining the predictions made from massive numbers of fine-grained behaviors by applying a counter-factual framework for explaining model behavior based on treating the individual behaviors as evidence that is combined by the model. This analysis shows that the fine-grained behavior data incorporate various sorts of information that we traditionally have sought to capture by other means. For example, for marketing modeling the behavior data effectively incorporate demographics, psychographics, category interest, and purchase intent. Finally, I discuss the flip side of the coin: the remarkable predictive power based on fine-grained information on individuals raises new privacy concerns. In particular, I discuss privacy concerns based on inferences drawn about us (in contrast to privacy concerns stemming from violations to data confidentiality). The evidence counterfactual approach used to explain the predictions also can be used to provide online consumers with transparency into the reasons why inferences are drawn about them. In addition, it offers the possibility to design novel solutions such as a privacy-friendly "cloaking device" to inhibit inferences from being drawn based on particular behaviors.

【Keywords】: keynote talk

Social Networks 12

53. Information Evolution in Social Networks.

Paper Link】 【Pages】:473-482

【Authors】: Lada A. Adamic ; Thomas M. Lento ; Eytan Adar ; Pauline C. Ng

【Abstract】: Social networks readily transmit information, albeit with less than perfect fidelity. We present a large-scale measurement of this imperfect information copying mechanism by examining the dissemination and evolution of thousands of memes, collectively replicated hundreds of millions of times in the online social network Facebook. The information undergoes an evolutionary process that exhibits several regularities. A meme's mutation rate characterizes the population distribution of its variants, in accordance with the Yule process. Variants further apart in the diffusion cascade have greater edit distance, as would be expected in an iterative, imperfect replication process. Some text sequences can confer a replicative advantage; these sequences are abundant and transfer "laterally" between different memes. Subpopulations of the social network can preferentially transmit a specific variant of a meme if the variant matches their beliefs or culture. Understanding the mechanism driving change in diffusing information has important implications for how we interpret and harness the information that reaches us through our social networks.

【Keywords】: evolution; memes; social computing; social networks

54. Nonlinear Laplacian for Digraphs and its Applications to Network Analysis.

Paper Link】 【Pages】:483-492

【Authors】: Yuichi Yoshida

【Abstract】: In this work, we introduce a new Markov operator associated with a digraph, which we refer to as a nonlinear Laplacian. Unlike previous Laplacians for digraphs, the nonlinear Laplacian does not rely on the stationary distribution of the random walk process and is well defined on digraphs that are not strongly connected. We show that the nonlinear Laplacian has nontrivial eigenvalues and give a Cheeger-like inequality, which relates the conductance of a digraph and the smallest non-zero eigenvalue of its nonlinear Laplacian. Finally, we apply the nonlinear Laplacian to the analysis of real-world networks and obtain encouraging results.

【Keywords】: spectral graph theory

55. Querying and Tracking Influencers in Social Streams.

Paper Link】 【Pages】:493-502

【Authors】: Karthik Subbian ; Charu C. Aggarwal ; Jaideep Srivastava

【Abstract】: Influence analysis is an important problem in social network analysis due to its impact on viral marketing and targeted advertisements. Most of the existing influence analysis methods determine the influencers in a static network with an influence propagation model based on pre-defined edge propagation probabilities. However, none of these models can be queried to find influencers in streaming social data in a context- and time-sensitive fashion. In this paper, we propose an approach to maintain real-time influence scores of users in a social stream using a topic- and time-sensitive approach, while the network and topics are constantly evolving over time. We show that our approach is efficient in terms of online maintenance and effective for various types of real-time context- and time-sensitive queries. We evaluate our results on both social and collaborative network data sets.

【Keywords】: influence analysis; influence maximization; querying influencer; social streams; tracking influencer

56. Centrality-Aware Link Recommendations.

Paper Link】 【Pages】:503-512

【Authors】: Nikos Parotsidis ; Evaggelia Pitoura ; Panayiotis Tsaparas

【Abstract】: Link recommendations are critical for both improving the utility and expediting the growth of social networks. Most previous approaches focus on suggesting links that are highly likely to be adopted. In this paper, we add a different perspective to the problem by aiming at recommending links that also improve specific properties of the network. In particular, our goal is to recommend to users links that, if adopted, would improve their centrality in the network. Specifically, we introduce the centrality-aware link recommendation problem as the problem of recommending to a user u the k links from a pool of recommended links that maximize the expected decrease of the sum of the shortest-path distances from u to all other nodes in the network. We show that the problem is NP-hard, but our optimization function is monotone and submodular, which guarantees a constant approximation ratio for the greedy algorithm. We present a fast algorithm for computing the expected decrease caused by a set of recommendations, which we use as a building block in our algorithms. We provide experimental results that evaluate the performance of our algorithms with respect to both the accuracy of the prediction and the improvement in the centrality of the nodes, and we study the tradeoff between the two.

【Keywords】: link recommendations; node centrality; probabilistic networks; social networks
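
Because the objective is monotone and submodular, greedy selection is the natural algorithm. The sketch below is a deterministic simplification (the paper works with the expected decrease under probabilistic link adoption): pick, one at a time, the candidate link that most decreases the sum of BFS distances from u. Graph and pool are illustrative.

```python
from collections import deque

def dist_sum(adj, u):
    """Sum of shortest-path distances from u (BFS, undirected graph)."""
    dist, queue = {u: 0}, deque([u])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return sum(dist.values())

def dist_sum_with(adj, u, extra):
    """Distance sum from u after adding the links (u, v) for v in extra."""
    adj2 = {x: set(ys) for x, ys in adj.items()}
    for v in extra:
        adj2[u].add(v)
        adj2[v].add(u)
    return dist_sum(adj2, u)

def greedy_recommend(adj, u, pool, k):
    """Greedily pick the k candidate links that most decrease the
    distance sum from u; greedy enjoys a constant-factor guarantee
    because the objective is monotone and submodular."""
    chosen = []
    for _ in range(k):
        best = min(pool, key=lambda v: dist_sum_with(adj, u, chosen + [v]))
        chosen.append(best)
        pool = [v for v in pool if v != best]
    return chosen

# Path graph 0-1-2-3-4-5; recommend links for node 0.
adj = {i: set() for i in range(6)}
for i in range(5):
    adj[i].add(i + 1)
    adj[i + 1].add(i)
print(greedy_recommend(adj, 0, pool=[2, 3, 4, 5], k=2))
```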

57. Relational Learning with Social Status Analysis.

Paper Link】 【Pages】:513-522

【Authors】: Liang Wu ; Xia Hu ; Huan Liu

【Abstract】: Relational learning has been proposed to cope with the interdependency among linked instances in social network analysis, which often adopts network connectivity and social media content for prediction. A common assumption in existing relational learning methods is that data instances are equally important. Algorithms developed under this assumption may be significantly affected by outlier data and are thus less robust. In the meantime, it has been well established in the social sciences that actors are naturally of different social status in a social network. Motivated by findings from the social sciences, in this paper, we investigate whether social status analysis could facilitate relational learning. In particular, we propose a novel framework, RESA, to model social status using the network structure. It extracts robust and intrinsic latent social dimensions for social actors, which are further exploited as features for supervised learning models. The proposed method is applicable to real-world relational learning problems where noise exists. Extensive experiments are conducted on datasets obtained from real-world social media platforms. Empirical results demonstrate the effectiveness of RESA, and further experiments are conducted to help understand the effects of parameter settings on the proposed model and how local social status works.

【Keywords】: relational learning; social dimensions; social media

58. Equality and Social Mobility in Twitter Discussion Groups.

Paper Link】 【Pages】:523-532

【Authors】: Katherine Ellis ; Moisés Goldszmidt ; Gert R. G. Lanckriet ; Nina Mishra ; Omer Reingold

【Abstract】: Online groups, including chat groups and forums, are becoming important avenues for gathering and exchanging information ranging from troubleshooting devices, to sharing experiences, to finding medical information and advice. Thus, issues about the health and stability of these groups are of particular interest to both industry and academia. In this paper we conduct a large-scale study with the objectives of first, characterizing essential aspects of the interactions between the participants of such groups and second, characterizing how the nature of these interactions relates to the health of the groups. Specifically, we concentrate on Twitter Discussion Groups (TDGs), self-organized groups that meet on Twitter by agreeing on a hashtag, date and time. These groups have repeated, real-time meetings and are a rising phenomenon on Twitter. We examine the interactions in these groups in terms of the social equality and mobility of the exchange of attention between participants, according to the @mention convention on Twitter. We estimate the health of a group by measuring the retention rate of participants and the change in the number of meetings over time. We find that social equality and mobility are correlated, and that equality and mobility are related to a group's health. In fact, equality and mobility are as predictive of a group's health as some prior characteristics used to predict the health of other online groups. Our findings are based on studying 100 thousand sessions of over two thousand discussion groups over the period of June 2012 to June 2013. These findings are not only relevant to stakeholders interested in maintaining these groups, but to researchers and academics interested in understanding the behavior of participants in online discussions. We also find the parallel with findings on the relationship between economic mobility, equality and health indicators in real-world nations striking and thought-provoking.

【Keywords】: generative model; graph analysis; social networks

59. Learning Distributed Representations of Data in Community Question Answering for Question Retrieval.

Paper Link】 【Pages】:533-542

【Authors】: Kai Zhang ; Wei Wu ; Fang Wang ; Ming Zhou ; Zhoujun Li

【Abstract】: We study the problem of question retrieval in community question answering (CQA). The biggest challenge within this task is the lexical gap between questions, since similar questions are usually expressed with different but semantically related words. To bridge the gap, state-of-the-art methods incorporate extra information such as word-to-word translation and categories of questions into traditional language models. We find that the existing language model based methods can be interpreted using a new framework: they represent words and question categories in a vector space and calculate question-question similarities with a linear combination of dot products of the vectors. The problem is that these methods are either heuristic in their data representation or difficult to scale up. We propose a principled and efficient approach to learning representations of data in CQA. In our method, we simultaneously learn vectors of words and vectors of question categories by optimizing an objective function naturally derived from the framework. In question retrieval, we incorporate the learnt representations into traditional language models in an effective and efficient way. We conduct experiments on large-scale data from Yahoo! Answers and Baidu Knows, and compare our method with state-of-the-art methods on two public data sets. Experimental results show that our method can significantly improve on baseline methods for retrieval relevance. On 1 million training examples, our method takes less than 50 minutes to learn a model on a single multicore machine, while the translation-based language model needs more than 2 days to learn a translation table on the same machine.

【Keywords】: community question answering; question retrieval; unsupervised model; word vector representation

60. Inferring Latent Triggers of Purchases with Consideration of Social Effects and Media Advertisements.

Paper Link】 【Pages】:543-552

【Authors】: Yusuke Tanaka ; Takeshi Kurashima ; Yasuhiro Fujiwara ; Tomoharu Iwata ; Hiroshi Sawada

【Abstract】: This paper proposes a method for inferring from single-source data the factors that trigger purchases. Here, single-source data are the histories of item purchases and media advertisement views for each individual. We assume a sequence of purchase events to be a stochastic process incorporating the following three factors: (a) user preference, (b) social effects received from other users, and (c) media advertising effects. As our user-purchase model incorporates the latent relationships between users and advertisers, it can infer the latent triggers of purchases. Experiments on real single-source data show that our model can (a) achieve high prediction accuracy for purchases, (b) discover the key information, i.e., popular items, influential users, and influential advertisers, (c) estimate the relative impact of the three factors on purchases, and (d) find user segments according to the estimated factors.

【Keywords】: marked point processes; purchase behavior; single-source data

61. Towards Modelling Language Innovation Acceptance in Online Social Networks.

Paper Link】 【Pages】:553-562

【Authors】: Daniel Kershaw ; Matthew Rowe ; Patrick Stacey

【Abstract】: Language change and innovation is constant in online and offline communication, and has led to new words entering people's lexicon and even modern-day dictionaries, with recent additions of 'e-cig' and 'vape'. However, the manual work required to identify these 'innovations' is both time consuming and subjective. In this work we demonstrate how such innovations in language can be identified across two different OSNs (Online Social Networks) through the operationalisation of known language acceptance models that incorporate relatively simple statistical tests. By grounding our work in language theory, we identified three statistical tests that can be applied: variation in frequency, form and meaning. Each shows different success rates across the two networks (a geo-bound Twitter sample and a sample of Reddit). These tests were also applied at different community levels within the two networks, allowing different innovations to be identified across different community structures, for instance regional variation across Twitter and variation across groupings of subreddits, where identified example innovations included 'casualidad' and 'cym'.

【Keywords】: language change; language evolution; osn; reddit; twitter

62. Discriminative Learning of Infection Models.

Paper Link】 【Pages】:563-572

【Authors】: Nir Rosenfeld ; Mor Nitzan ; Amir Globerson

【Abstract】: Infection and diffusion processes over networks arise in many domains. These introduce many challenging prediction tasks, such as influence estimation, trend prediction, and epidemic source localization. The standard approach to such problems is generative: assume an underlying infection model, learn its parameters, and infer the required output. In order to learn efficiently, the chosen infection models are often simple, and learning is focused on inferring the parameters of the model rather than on optimizing prediction accuracy. Here we argue that for prediction tasks, a discriminative approach is more adequate. We introduce DIMPLE, a novel discriminative learning framework for training classifiers based on dynamic infection models. We show how highly non-linear predictors based on infection models can be "linearized" by considering a larger class of prediction functions. Efficient learning over this class is performed by constructing "infection kernels" based on the outputs of infection models, and can be plugged into any kernel-supporting framework. DIMPLE can be applied to virtually any infection-related prediction task and any infection model for which the desired output can be calculated or simulated. For influence estimation in well-known infection models, we show that the kernel can either be computed in closed form, or reduces to estimating co-influence of seed pairs. We apply DIMPLE to the tasks of influence estimation on synthetic and real data from Digg, and to predicting customer network value in Polly, a viral phone-based development-related service deployed in low-literate communities. Our results show that DIMPLE outperforms strong baselines.

【Keywords】: branching processes; diffusion processes; discriminative learning; improper learning; independent cascade model; infection kernels; infection processes; kernel methods; kernels; linear threshold model; social networks

63. Representation Learning for Information Diffusion through Social Networks: an Embedded Cascade Model.

Paper Link】 【Pages】:573-582

【Authors】: Simon Bourigault ; Sylvain Lamprier ; Patrick Gallinari

【Abstract】: In this paper, we focus on information diffusion through social networks. Based on the well-known Independent Cascade model, we embed users of the social network in a latent space to extract more robust diffusion probabilities than those defined by classical graphical learning approaches. Better generalization abilities provided by the use of such a projection space allows our approach to present good performances on various real-world datasets, for both diffusion prediction and influence relationships inference tasks. Additionally, the use of a projection space enables our model to deal with larger social networks.

【Keywords】: information diffusion; machine learning; representation learning

64. Ensemble Models for Data-driven Prediction of Malware Infections.

Paper Link】 【Pages】:583-592

【Authors】: Chanhyun Kang ; Noseong Park ; B. Aditya Prakash ; Edoardo Serra ; V. S. Subrahmanian

【Abstract】: Given a history of detected malware attacks, can we predict the number of malware infections in a country? Can we do this for different malware and countries? This is an important question which has numerous implications for cyber security, right from designing better anti-virus software, to designing and implementing targeted patches, to more accurately measuring the economic impact of breaches. This problem is compounded by the fact that, as externals, we can only detect a fraction of actual malware infections. In this paper we address this problem using data from Symantec covering more than 1.4 million hosts and 50 malware spread across 2 years and multiple countries. We first carefully design domain-based features from both the malware and the machine-host perspectives. Secondly, inspired by epidemiological and information diffusion models, we design a novel temporal non-linear model for malware spread and detection. Finally we present ESM, an ensemble-based approach which combines both these methods to construct a more accurate algorithm. Using extensive experiments spanning multiple malware and countries, we show that ESM can effectively predict malware infection ratios over time (both the actual number and the trend) up to 4 times better compared to several baselines on various metrics. Furthermore, ESM's performance is stable and robust even when the number of detected infections is low.

【Keywords】: anti-virus; cyber security; information diffusion; malware attacks; prediction model

Entities and Structure 8

65. WSDM Cup 2016: Entity Ranking Challenge.

Paper Link】 【Pages】:593-594

【Authors】: Alex D. Wade ; Kuansan Wang ; Yizhou Sun ; Antonio Gulli

【Abstract】: In this paper, we describe the WSDM Cup entity ranking challenge held in conjunction with the 2016 Web Search and Data Mining conference (WSDM 2016). Participants in the challenge were provided access to the Microsoft Academic Graph (MAG), a large heterogeneous graph of academic entities, and were invited to calculate the query-independent importance of each publication in the graph. Submissions for the challenge were open from August through November 2015, and a public leaderboard displayed teams' progress against a set of training judgements. Final evaluations were performed against a separate, withheld portion of the evaluation judgements. The top eight performing teams were then invited to submit papers to the WSDM Cup workshop, held at the WSDM 2016 conference.

【Keywords】: heterogeneous graphs; microsoft academic; microsoft academic graph

66. Dynamic Collective Entity Representations for Entity Ranking.

Paper Link】 【Pages】:595-604

【Authors】: David Graus ; Manos Tsagkias ; Wouter Weerkamp ; Edgar Meij ; Maarten de Rijke

【Abstract】: Entity ranking, i.e., successfully positioning a relevant entity at the top of the ranking for a given query, is inherently difficult due to the potential mismatch between the entity's description in a knowledge base, and the way people refer to the entity when searching for it. To counter this issue we propose a method for constructing dynamic collective entity representations. We collect entity descriptions from a variety of sources and combine them into a single entity representation by learning to weight the content from different sources that are associated with an entity for optimal retrieval effectiveness. Our method is able to add new descriptions in real time and learn the best representation as time evolves so as to capture the dynamics of how people search entities. Incorporating dynamic description sources into dynamic collective entity representations improves retrieval effectiveness by 7% over a state-of-the-art learning to rank baseline. Periodic retraining of the ranker enables higher ranking effectiveness for dynamic collective entity representations.

【Keywords】: content representation; entity ranking; entity retrieval; fielded retrieval

67. Relationship Queries on Extended Knowledge Graphs.

Paper Link】 【Pages】:605-614

【Authors】: Mohamed Yahya ; Denilson Barbosa ; Klaus Berberich ; Qiuyue Wang ; Gerhard Weikum

【Abstract】: Entity search over text corpora is not geared for relationship queries where answers are tuples of related entities and where a query often requires joining cues from multiple documents. With large knowledge graphs, structured querying on their relational facts is an alternative, but often suffers from poor recall because of mismatches between user queries and the knowledge graph or because of weakly populated relations. This paper presents the TriniT search engine for querying and ranking on extended knowledge graphs that combine relational facts with textual web contents. Our query language is designed on the paradigm of SPO triple patterns, but is more expressive, supporting textual phrases for each of the SPO arguments. We present a model for automatic query relaxation to compensate for mismatches between the data and a user's query. Query answers -- tuples of entities -- are ranked by a statistical language model. We present experiments with different benchmarks, including complex relationship queries, over a combination of the Yago knowledge graph and the entity-annotated ClueWeb'09 corpus.

【Keywords】: extended knowledge graphs; query relaxation; relationship queries

68. Improving Website Hyperlink Structure Using Server Logs.

Paper Link】 【Pages】:615-624

【Authors】: Ashwin Paranjape ; Robert West ; Leila Zia ; Jure Leskovec

【Abstract】: Good websites should be easy to navigate via hyperlinks, yet maintaining a high-quality link structure is difficult. Identifying pairs of pages that should be linked may be hard for human editors, especially if the site is large and changes frequently. Further, given a set of useful link candidates, the task of incorporating them into the site can be expensive, since it typically involves humans editing pages. In the light of these challenges, it is desirable to develop data-driven methods for automating the link placement task. Here we develop an approach for automatically finding useful hyperlinks to add to a website. We show that passively collected server logs, beyond telling us which existing links are useful, also contain implicit signals indicating which nonexistent links would be useful if they were to be introduced. We leverage these signals to model the future usefulness of yet nonexistent links. Based on our model, we define the problem of link placement under budget constraints and propose an efficient algorithm for solving it. We demonstrate the effectiveness of our approach by evaluating it on Wikipedia, a large website for which we have access to both server logs (used for finding useful new links) and the complete revision history (containing a ground truth of new links). As our method is based exclusively on standard server logs, it may also be applied to any other website, as we show with the example of the biomedical research site Simtk.

【Keywords】: browsing; link prediction; log analysis; navigation

69. Long-tail Vocabulary Dictionary Extraction from the Web.

Paper Link】 【Pages】:625-634

【Authors】: Zhe Chen ; Michael J. Cafarella ; H. V. Jagadish

【Abstract】: A dictionary (a set of instances belonging to the same conceptual class) is central to information extraction and is a useful primitive for many applications, including query log analysis and document categorization. Considerable work has focused on generating accurate dictionaries given a few example seeds, but methods to date cannot obtain long-tail (rare) items with high accuracy and recall. In this paper, we develop a novel method to construct high-quality dictionaries, especially for long-tail vocabularies, using just a few user-provided seeds for each topic. Our algorithm obtains long-tail (i.e., rare) items by building and executing high-quality webpage-specific extractors. We use webpage-specific structural and textual information to build more accurate per-page extractors in order to detect the long-tail items on a single webpage. These webpage-specific extractors are obtained via a co-training procedure using distantly-supervised training data. By aggregating the page-specific dictionaries of many webpages, our system, Lyretail, is able to output a high-quality comprehensive dictionary. Our experiments demonstrate that in long-tail vocabulary settings, we obtained a 17.3% improvement in mean average precision for the dictionary generation process, and a 30.7% improvement in F1 for the page-specific extraction, when compared to previous state-of-the-art methods.

【Keywords】: information extraction; long-tail dictionary; set expansion

70. Semantic Documents Relatedness using Concept Graph Representation.

Paper Link】 【Pages】:635-644

【Authors】: Yuan Ni ; Qiong Kai Xu ; Feng Cao ; Yosi Mass ; Dafna Sheinwald ; Huijia Zhu ; Shao Sheng Cao

【Abstract】: We deal with the problem of document representation for the task of measuring semantic relatedness between documents. A document is represented as a compact concept graph where nodes represent concepts extracted from the document through references to entities in a knowledge base such as DBpedia. Edges represent the semantic and structural relationships among the concepts. Several methods are presented to measure the strength of those relationships. Concepts are weighted through the concept graph using a closeness centrality measure, which reflects their relevance to the aspects of the document. A novel similarity measure between two concept graphs is presented. The similarity measure first represents concepts as continuous vectors by means of neural networks. Second, the continuous vectors are used to accumulate pairwise similarity between pairs of concepts while considering their assigned weights. We evaluate our method on a standard benchmark for document similarity. Our method outperforms state-of-the-art methods including ESA (Explicit Semantic Analysis), while our concept graphs are much smaller than the concept vectors generated by ESA. Moreover, we show that by combining our concept graph with ESA, we obtain an even further improvement.

【Keywords】: dbpedia; document representation; document semantic similarity; graph model; neural network

71. EgoSet: Exploiting Word Ego-networks and User-generated Ontology for Multifaceted Set Expansion.

Paper Link】 【Pages】:645-654

【Authors】: Xin Rong ; Zhe Chen ; Qiaozhu Mei ; Eytan Adar

【Abstract】: A key challenge of entity set expansion is that multifaceted input seeds can lead to significant incoherence in the result set. In this paper, we present a novel solution to handling multifaceted seeds by combining existing user-generated ontologies with a novel word-similarity metric based on skip-grams. By blending the two resources we are able to produce sparse word ego-networks that are centered on the seed terms and are able to capture semantic equivalence among words. We demonstrate that the resulting networks possess internally-coherent clusters, which can be exploited to provide non-overlapping expansions, in order to reflect different semantic classes of the seeds. Empirical evaluation against state-of-the-art baselines shows that our solution, EgoSet, is able to not only capture multiple facets in the input query, but also generate expansions for each facet with higher precision.

【Keywords】: entity set expansion; information extraction; web mining

72. Extracting Search Query Patterns via the Pairwise Coupled Topic Model.

Paper Link】 【Pages】:655-664

【Authors】: Takuya Konishi ; Takuya Ohwa ; Sumio Fujita ; Kazushi Ikeda ; Kohei Hayashi

【Abstract】: A fundamental yet new challenge in information retrieval is the identification of patterns behind search queries. For example, the queries "NY restaurant" and "boston hotel" share the common pattern "LOCATION SERVICE". However, because of the diversity of real queries, existing approaches require data preprocessing by humans or specification of the target query domains, which hinders their applicability. We propose a probabilistic topic model that assumes that each term (e.g., "NY") has a topic (LOCATION). The key idea is that we consider topic co-occurrence in a query rather than a topic sequence, which significantly reduces the computational cost yet enables us to acquire coherent topics without such preprocessing. Using two real query datasets, we demonstrate that the obtained topics are intelligible to humans, and are highly accurate in keyword prediction and query generation tasks.

【Keywords】: query intent analysis; search queries; topic model

Practice & Experience Track 1

73. The Past and Future of Systems for Current Events.

Paper Link】 【Pages】:665

【Authors】: Mor Naaman

【Abstract】: People share in social media an overwhelming amount of content from real-world events. These events range from major global events like an uprising or an earthquake, to local events and emergencies such as a fire or a parade; from media events like the Oscars, to events that enjoy little media coverage such as a conference or a music concert. This shared media represents an important part of our society, culture and history. At the same time, this social media content is still fragmented across services, hard to find, and difficult to consume and understand.

【Keywords】: events; information organization; social media; startup

Social Events 2

74. Barbara Made the News: Mining the Behavior of Crowds for Time-Aware Learning to Rank.

Paper Link】 【Pages】:667-676

【Authors】: Flávio Martins ; João Magalhães ; Jamie Callan

【Abstract】: In Twitter, and other microblogging services, the generation of new content by the crowd is often biased towards immediacy: what is happening now. Prompted by the propagation of commentary and information through multiple mediums, users on the Web interact with and produce new posts about newsworthy topics and give rise to trending topics. This paper proposes to leverage the behavioral dynamics of users to estimate the most relevant time periods for a topic. Our hypothesis stems from the fact that when a real-world event occurs it usually has peak times on the Web: a higher volume of tweets, new visits and edits to related Wikipedia articles, and news published about the event. In this paper, we propose a novel time-aware ranking model that draws on multiple sources of crowd signals. Our approach builds on two major novelties. First, a unifying approach that, given a query q, mines and represents temporal evidence from multiple sources of crowd signals. This allows us to predict the temporal relevance of documents for query q. Second, a principled retrieval model that integrates temporal signals in a learning-to-rank framework, to rank results according to the predicted temporal relevance. Evaluation on the TREC 2013 and 2014 Microblog track datasets demonstrates that the proposed model achieves a relative improvement of 13.2% over lexical retrieval models and 6.2% over a learning-to-rank baseline.

【Keywords】: learning to rank; microblog search; social media; temporal information retrieval; time-aware ranking models; twitter

75. Wiggins: Detecting Valuable Information in Dynamic Networks Using Limited Resources.

Paper Link】 【Pages】:677-686

【Authors】: Ahmad Mahmoody ; Matteo Riondato ; Eli Upfal

【Abstract】: Detecting new information and events in a dynamic network by probing individual nodes has many practical applications: discovering new webpages, analyzing influence properties in network, and detecting failure propagation in electronic circuits or infections in public drinkable water systems. In practice, it is infeasible for anyone but the owner of the network (if existent) to monitor all nodes at all times. In this work we study the constrained setting when the observer can only probe a small set of nodes at each time step to check whether new pieces of information (items) have reached those nodes. We formally define the problem through an infinite time generating process that places new items in subsets of nodes according to an unknown probability distribution. Items have an exponentially decaying novelty, modeling their decreasing value. The observer uses a probing schedule (i.e., a probability distribution over the set of nodes) to choose, at each time step, a small set of nodes to check for new items. The goal is to compute a schedule that minimizes the average novelty of undetected items. We present an algorithm, WIGGINS, to compute the optimal schedule through convex optimization, and then show how it can be adapted when the parameters of the problem must be learned or change over time. We also present a scalable variant of WIGGINS for the MapReduce framework. The results of our experimental evaluation on real social networks demonstrate the practicality of our approach.

【Keywords】: information diffusion; rumor catching; social networks

Tutorials 2

76. Understanding Offline Political Systems by Mining Online Political Data.

Paper Link】 【Pages】:687-688

【Authors】: David Lazer ; Oren Tsur ; Tina Eliassi-Rad

【Abstract】: "Man is by nature a political animal", as asserted by Aristotle. This political nature manifests itself in the data we produce and the traces we leave online. In this tutorial, we address a number of fundamental issues regarding mining of political data: What types of data could be considered political? What can we learn from such data? Can we use the data for prediction of political changes, etc? How can these prediction tasks be done efficiently? Can we use online socio-political data in order to get a better understanding of our political systems and of recent political changes? What are the pitfalls and inherent shortcomings of using online data for political analysis? In recent years, with the abundance of data, these questions, among others, have gained importance, especially in light of the global political turmoil and the upcoming 2016 US presidential election. We introduce relevant political science theory, describe the challenges within the framework of computational social science and present state of the art approaches bridging social network analysis, graph mining, and natural language processing.

【Keywords】: computational social science; graph mining; political data; social and information networks

77. An Introduction to Click Models for Web Search: WSDM 2016 Tutorial.

Paper Link】 【Pages】:689-690

【Authors】: Aleksandr Chuklin ; Ilya Markov ; Maarten de Rijke

【Abstract】: In this tutorial we give an overview of click models for web search. We show how the framework of probabilistic graphical models helps to explain user behavior, build new evaluation metrics and perform simulations. The tutorial discusses foundational aspects alongside experimental details and applications, with live demos and discussions of publicly available resources.
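
For readers unfamiliar with the area, the snippet below simulates the classic cascade model, one of the standard click models from this literature (the code is illustrative and not taken from the tutorial): the user scans results top-down and clicks the first attractive result, which produces the familiar position bias in click-through rates.

```python
import random

def cascade_model_click(attractiveness):
    """Simulate one query session under the cascade click model:
    the user examines results top-down and stops at the first click."""
    for rank, attr in enumerate(attractiveness):
        if random.random() < attr:   # attracted => click and stop
            return rank
    return None                      # abandoned session: no click

# Attractiveness (relevance) of the ranked results, top to bottom.
serp = [0.4, 0.2, 0.1]
clicks = [cascade_model_click(serp) for _ in range(10_000)]
for rank in range(len(serp)):
    print(f"CTR@{rank + 1}: {clicks.count(rank) / len(clicks):.3f}")
```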

【Keywords】: click models; web search

Workshop Summaries 4

Paper Link】 【Pages】:691-692

【Authors】: Amit Goyal ; Jianfeng Gao ; Hongbo Deng ; Yi Chang

【Abstract】:

【Keywords】: mobile search; query auto completion; query reformulation; query understanding; spelling correction

79. TargetAd2016: 2nd International Workshop on Ad Targeting at Scale.

Paper Link】 【Pages】:693-694

【Authors】: Mihajlo Grbovic ; Nemanja Djuric ; Vladan Radosavljevic

【Abstract】: The 2nd International Workshop on Ad Targeting at Scale will be held in San Francisco, California, USA on February 22nd, 2016, co-located with the 9th ACM International Conference on Web Search and Data Mining (WSDM). The main objective of the workshop is to address the challenges of ad targeting in web-scale settings. The workshop brings together interdisciplinary researchers in computational advertising, recommender systems, personalization, and related areas, to share, exchange, learn, and develop preliminary results, new concepts, ideas, principles, and methodologies on applying data mining technologies to ad targeting. We have constructed an exciting program of eight refereed papers and several invited talks that will help us better understand the future of ad targeting.

【Keywords】: ad targeting; audience modeling; computational advertising

80. WSDM 2016 Workshop on the Ethics of Online Experimentation.

Paper Link】 【Pages】:695-696

【Authors】: Fernando Diaz ; Solon Barocas

【Abstract】:

【Keywords】: ethics; experimentation

81. Search and Exploration of X-Rated Information (SEXI'16): WSDM Workshop Summary.

Paper Link】 【Pages】:697-698

【Authors】: Vanessa Murdock ; Charles L. A. Clarke ; Jaap Kamps ; Jussi Karlgren

【Abstract】: Adult content is pervasive on the web, has been a driving factor in the adoption of the Internet medium, and is responsible for a significant fraction of traffic and revenues, yet rarely attracts attention in research. The research questions surrounding adult content access behaviors are unique, and interesting and valuable research in this area can be done ethically. WSDM 2016 features a half day workshop on Search and Exploration of X-Rated Information (SEXI) for information access tasks related to adult content. While the scope of the workshop remains broad, special attention is devoted to the privacy and security issues surrounding adult content by inviting keynote speakers with extensive experience on these topics. The recent release of the personal data belonging to customers of the adult dating site Ashley Madison provides a timely context for the focus on privacy and security.

【Keywords】: adult content; privacy and security; research ethics; research practice

Doctoral Consortium 12

82. Mining Complaints to Improve a Product: a Study about Problem Phrase Extraction from User Reviews.

Paper Link】 【Pages】:699

【Authors】: Elena Tutubalina

【Abstract】: The rapidly growing availability of user reviews has become an important resource for companies to detect customer dissatisfaction from textual opinions. Much research in opinion mining focuses on extracting customers' opinions from product reviews and predicting their sentiment orientation or ratings, with the aim of helping other users decide whether to buy a product. However, there have been few recent studies on business-related opinion tasks that extract more refined opinions about a product's quality problems or technical failures. The focus of this study is the extraction of problem phrases mentioned in user reviews of products. We explore the main opinion mining tasks to determine whether a given text from reviews contains a mention of a problem. We formulate research questions and propose knowledge-based methods and probabilistic models to classify users' phrases and extract latent problem indicators, aspects and related sentiments from online reviews.
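
The thesis proposes knowledge-based methods and probabilistic models; purely as a generic baseline for the underlying binary task ("does this review text mention a problem?"), one could start with a simple supervised classifier, as sketched below. The training sentences and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set (label 1 = mentions a problem).
sentences = [
    "the battery dies after two hours",
    "screen cracked within a week",
    "love the sound quality",
    "great value for the price",
]
labels = [1, 1, 0, 0]

# TF-IDF unigrams/bigrams feeding a logistic regression classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(sentences, labels)
print(clf.predict(["the battery died after a week"]))  # likely: [1]
```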

【Keywords】: opinion mining; problem phrase extraction; user reviews

83. Web-scale Multimedia Search for Internet Video Content.

Paper Link】 【Pages】:701

【Authors】: Lu Jiang

【Abstract】: The Internet has been witnessing an explosion of video content. According to a Cisco study, video content is estimated to account for 80% of the world's internet traffic by 2019. Video data are becoming one of the most valuable sources to access information and knowledge. However, existing video search solutions are still based on text matching (text-to-text search), and can fail for the huge volumes of videos that have little relevant metadata or no metadata at all. The need for large-scale and intelligent video search, which bridges the gap between the user's information need and the video content, is urgent. In this thesis, we propose an accurate, efficient and scalable search method for video content. As opposed to text matching, the proposed method relies on automatic video content understanding, and allows for intelligent and flexible search paradigms over the video content, including text-to-video and text&video-to-video search. Suppose our goal is to search for videos about a birthday party. In traditional text-to-text queries, we have to search for the keywords in the user-generated metadata (titles or descriptions). In a text-to-video query, however, we might look for visual clues in the video content such as "cake", "gift" and "kids", audio clues like "birthday song" and "cheering sound", or visible text like "happy birthday". Text-to-video queries are flexible and can be further refined by Boolean and temporal operators. After watching the retrieved videos, the user may select a few interesting videos to find more videos like them. This can be achieved by issuing a text&video-to-video query which adds the selected video examples to the query. The proposed method provides a new way of looking at content-based video search, from finding a simple concept like "puppy" to searching for a complex incident like "a scene in an urban area where people run away after an explosion". To achieve this ambitious goal, we propose several novel methods focusing on accuracy, efficiency and scalability in the new search paradigm. First, we introduce a novel self-paced curriculum learning theory that allows for training more accurate semantic concepts. Second, we propose a novel and scalable approach to index semantic concepts that can significantly improve search efficiency with minimal accuracy loss. Third, we design a novel video reranking algorithm that can boost accuracy for video retrieval. Extensive experiments demonstrate that the proposed methods surpass state-of-the-art accuracy on multiple datasets. In addition, our method can efficiently scale the search up to hundreds of millions of videos: it takes only about 0.2 seconds to answer a semantic query on a collection of 100 million videos, and 1 second to process a hybrid query over 1 million videos. Based on the proposed methods, we implement E-Lamp Lite, the first large-scale semantic search engine of its kind for Internet videos. According to the National Institute of Standards and Technology (NIST), it achieved the best accuracy in the TRECVID Multimedia Event Detection (MED) task in 2013, 2014 and 2015, the most representative task for content-based video search. To the best of our knowledge, E-Lamp Lite is the first content-based semantic search engine that is capable of indexing and searching a collection of 100 million videos.
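
A minimal sketch of the concept-based search paradigm described above (not E-Lamp Lite itself): each video is indexed offline by detector confidence scores over a concept vocabulary, and a text-to-video query is answered by ranking videos on their aggregate scores for the query concepts. The index contents and scoring scheme are illustrative assumptions.

```python
# Each video is indexed by detector confidence scores over a concept
# vocabulary (produced offline by visual/audio classifiers).
video_index = {
    "vid1": {"cake": 0.9, "kids": 0.8, "birthday song": 0.7},
    "vid2": {"dog": 0.95, "park": 0.6},
    "vid3": {"cake": 0.4, "cheering sound": 0.5},
}

def text_to_video_search(query_concepts, index, top_k=2):
    """Rank videos by their summed confidence over the query concepts."""
    scores = {
        vid: sum(concepts.get(c, 0.0) for c in query_concepts)
        for vid, concepts in index.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Query: videos about a birthday party.
print(text_to_video_search(["cake", "kids", "birthday song"], video_index))
```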

【Keywords】: big data; content-based retrieval; multimedia event detection; video content search; web search

84. Affective Computing of Image Emotion Perceptions.

Paper Link】 【Pages】:703

【Authors】: Sicheng Zhao

【Abstract】:

【Keywords】: image emotion; personalized perceptions; supervised learning

85. Event Search and Analytics: Detecting Events in Semantically Annotated Corpora for Search & Analytics.

Paper Link】 【Pages】:705

【Authors】: Dhruv Gupta

【Abstract】: In this article, I present the questions that I seek to answer in my PhD research. I propose to analyze natural language text with the help of semantic annotations and to mine important events for navigating large text corpora. Semantic annotations such as named entities, geographic locations, and temporal expressions can help us mine events from the given corpora. These events then provide us with useful means to discover the knowledge locked in the corpora. I pose three problems that can help unlock this knowledge vault in semantically annotated text corpora: i. identifying important events; ii. semantic search; and iii. event analytics.
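
As a toy illustration of how such annotations can surface events (my own sketch, not the author's method), one can group documents by co-occurring (entity, temporal expression) pairs and treat well-supported pairs as candidate events. The documents and annotations below are invented for the example.

```python
from collections import defaultdict

# Documents with semantic annotations: named entities + temporal expressions.
docs = [
    {"id": 1, "entities": ["Berlin Wall"], "times": ["1989-11"]},
    {"id": 2, "entities": ["Berlin Wall", "Germany"], "times": ["1989-11"]},
    {"id": 3, "entities": ["Germany"], "times": ["1990-10"]},
]

# Candidate events = (entity, time) pairs, scored by document support.
candidates = defaultdict(list)
for doc in docs:
    for entity in doc["entities"]:
        for time in doc["times"]:
            candidates[(entity, time)].append(doc["id"])

for (entity, time), supporting in sorted(candidates.items(),
                                         key=lambda kv: -len(kv[1])):
    print(f"{entity} @ {time}: supported by docs {supporting}")
```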

【Keywords】: diversity & novelty; information retrieval; semantic annotations; semantic search; text analytics; text summarization

86. Understanding Diffusion Processes: Inference and Theory.

Paper Link】 【Pages】:707

【Authors】: Xinran He

【Abstract】: With the increasing popularity of social media and social networking sites, analyzing social networks offers great potential to shed light on human social structure and provides great marketing opportunities. Usually, social network analysis starts with extracting or learning the social network and its associated parameters. Contrary to other analytical tasks, this step is highly non-trivial due to the amorphous nature of social ties and the challenges of noisy and incomplete observations. My research focuses on improving accuracy in inferring the network, as well as on analyzing the consequences when the extracted network is noisy or erroneous. To be more precise, I propose to study the following two questions, with a special focus on analyzing diffusion behaviors: (1) How can we utilize special properties of social networks to improve the accuracy of the extracted network under noisy and missing data? (2) How can we characterize the impact of noise in the inferred network and carry out robust analysis and optimization?
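
As a toy illustration of network inference from diffusion data (a naive frequency estimator, not one of the methods proposed here), one can estimate the influence probability of an edge u -> v from how often v activates immediately after u across observed cascades. The cascades below are invented.

```python
from collections import defaultdict

# Each cascade is a time-ordered list of (timestep, user) activations.
cascades = [
    [(0, "a"), (1, "b"), (2, "c")],
    [(0, "a"), (1, "b")],
    [(0, "b"), (1, "c")],
]

contains = defaultdict(int)   # cascades in which u ever activates
follows = defaultdict(int)    # cascades where v activates right after u

for cascade in cascades:
    for user in {u for _, u in cascade}:
        contains[user] += 1
    for (t1, u) in cascade:
        for (t2, v) in cascade:
            if t2 == t1 + 1:
                follows[(u, v)] += 1

for (u, v), count in sorted(follows.items()):
    print(f"p({u} -> {v}) ~= {count / contains[u]:.2f}")
```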

【Keywords】: diffusion process; influence maximization; network inference; social network analysis

87. E-commerce Product Recommendation by Personalized Promotion and Total Surplus Maximization.

Paper Link】 【Pages】:709

【Authors】: Qi Zhao

【Abstract】: Existing recommendation algorithms treat the recommendation problem as rating prediction, and recommendation quality is measured by RMSE or other similar metrics. However, we argue that when it comes to E-commerce product recommendation, recommendation is more than rating prediction, because price plays a critical role in the recommendation result. In this work, we propose to build E-commerce product recommender systems based on fundamental economic notions. We first propose an incentive-compatible method that can effectively elicit a consumer's willingness-to-pay (WTP) in a typical E-commerce setting; going a step further, we formalize the recommendation problem as maximizing total surplus. We validated the proposed WTP elicitation algorithm through crowdsourcing, and the results demonstrate that the proposed approach can achieve higher seller profit through personalized promotion. We also propose a total surplus maximization (TSM) based recommendation framework. We specify TSM in three of the most representative settings: e-commerce, where the product quantity can be viewed as infinite; P2P lending, where the resource is bounded; and freelancer marketing, where the resource (a job) can be assigned to only one freelancer. Experimental results on the corresponding datasets show that TSM exceeds existing approaches in terms of total surplus.
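
In the freelancer setting mentioned above, where each job can be assigned to only one freelancer, maximizing total surplus reduces to an assignment problem. The sketch below (my own illustration, with made-up willingness-to-pay and cost values) solves it with SciPy's Hungarian-algorithm solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# wtp[i][j]: buyer's willingness-to-pay if job i goes to freelancer j;
# cost[i][j]: the freelancer's cost. Total surplus = WTP - cost.
wtp = np.array([[90, 70, 60],
                [50, 80, 40],
                [30, 20, 95]])
cost = np.array([[40, 40, 40],
                 [30, 30, 30],
                 [50, 50, 50]])
surplus = wtp - cost

# linear_sum_assignment minimizes, so negate to maximize total surplus.
rows, cols = linear_sum_assignment(-surplus)
print(list(zip(rows, cols)), "total surplus:", surplus[rows, cols].sum())
```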

【Keywords】: e-commerce; economics; personalized promotion; recommendation; surplus maximization

88. Detecting Social Media Icebergs by Their Tips: Rumors, Persuasion Campaigns, and Information Needs.

Paper Link】 【Pages】:711

【Authors】: Zhe Zhao

【Abstract】:

【Keywords】: early detection; signal activities; social media phenomena

89. User Modeling in Large Social Networks.

Paper Link】 【Pages】:713

【Authors】: Yuxiao Dong

【Abstract】: This proposal aims to harness the power of data, social, and network sciences to model user behavior in social networks. Specifically, we focus on individual users and investigate the interplay between their behavior and subsequently emergent social phenomena. Work in this proposal unveils significant social strategies that people use to satisfy their social needs. We apply computational methods to address user modeling problems, including demographic inference, link recommendation, and social impact prediction. The proposed research can be translated into applications in large social systems, such as mobile communication, online social media, and academic collaboration.

【Keywords】: computational social science; social impact; user behavior

90. Feature Generation and Selection on the Heterogeneous Graph for Music Recommendation.

Paper Link】 【Pages】:715

【Authors】: Chun Guo

【Abstract】:

【Keywords】: feature selection; meta-path; music recommendation

91. Temporal Formation and Evolution of Online Communities.

Paper Link】 【Pages】:717

【Authors】: Hossein Fani

【Abstract】: Researchers have already studied the identification of online communities and the possible impact or influence relationships among them from several perspectives. For instance, communities of users that are formed based on shared relationships and topological similarities, or communities that consist of users that share similar content. However, little work has been done on the detection of communities that simultaneously share topical and temporal similarities. Furthermore, these studies have not explored the causation relationships between communities. Causation provides a systematic explanation of why communities are formed and helps to predict future communities. This proposal will address two main research questions: i) how can communities that share topical and temporal similarities be identified, and ii) how can causation relations between different online communities be detected and modelled? We model users' behaviour towards topics of interest through multivariate time series to identify like-minded communities. Further, we employ Granger's concept of causality to infer causation between detected communities from the corresponding users' time series. Granger causality is a prominent approach in time series modelling and rests on a firm statistical foundation. We assess the proposed community detection methods through comparison with the state of the art, and verify the causal model through its prediction accuracy.
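
A minimal sketch of the causal-inference step (illustrative, not the proposal's implementation): given two communities' topic-interest time series, the standard Granger causality test in statsmodels checks whether one series helps predict the other. The synthetic series below make community B lag community A by one step.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(42)

# Synthetic weekly topic-interest series: community B follows A with lag 1.
a = rng.normal(size=100)
b = np.roll(a, 1) + 0.1 * rng.normal(size=100)
b[0] = 0.0

# Column order is (effect, cause): does A Granger-cause B?
data = np.column_stack([b, a])
results = grangercausalitytests(data, maxlag=2, verbose=False)
for lag, (tests, _) in results.items():
    print(f"lag {lag}: ssr F-test p-value = {tests['ssr_ftest'][1]:.4f}")
```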

【Keywords】: causality; community detection; online social network

92. Mining the Web for Intelligent Problem Solving for Programmers.

Paper Link】 【Pages】:719

【Authors】: Xin Rong

【Abstract】: Programming can be hard to learn and master. Novice programmers often find themselves struggling with terminology, concepts, or different solutions to the same problem with little clue on how to choose the best one. Professional programmers often spend a considerable amount of time learning to use third-party libraries, APIs, or an unfamiliar piece of code. Although programmers can turn to search engines or question-and-answer websites for help, the problem solving process can often take multiple iterations and can be time-consuming. An integrated system that can recognize a programmer's difficulties and provide contextualized solutions is thus desirable, as it may significantly reduce the amount of manual effort required in the loop of troubleshooting. Ideally, a programmer should be able to interact with such an intelligent system using natural language, in a way similar to how they document code or communicate with peers. However, using automatic natural language processing techniques to address programming questions is very difficult, mainly for the following reasons: (1) the terms and common expressions vary greatly across different domains and individual programmers, making it difficult to associate relevant concepts together; (2) the solution to the user's trouble in programming often requires multiple steps or different resources, which requires deep understanding of the relations or dependencies of the possible solutions, as well as the user's personal capability of handling those solutions; (3) the documents in the training data usually include a mixture of general-domain expressions with mentions of variables, functions, and classes, as well as source code, making low-level text processing difficult; (4) the evaluation of the system generally requires skilled experts to provide ground truth, which is expensive and often unreliable. We address the above difficulties and build an intelligent programming helper system by mining the massive programming-related data available online, including question-and-answer websites, tutorials, blogs, and code repositories. Specifically, the study involves three important components. First, we use information extraction techniques to extract common programming tasks, issues, and solutions from the Web data, and establish connections between these extracted elements by leveraging their discrete or distributed representations (e.g., using neural embedding models). Such techniques have been shown to be useful in helping general users solve problems that require interactions with a complex computer software application through a natural language interface. Second, we study how to handle complicated problems that require multiple steps to solve. The existing troubleshooting instances documented online are collectively modeled as a heterogeneous network, on which random walk paths can be exploited to recommend solutions. Third, we study how to personalize the problem-solving process for users with varying levels of skills and background knowledge. In particular, each user's past adoptions of technologies and the adoption behavior in his/her social community can be jointly leveraged to provide appropriate recommendations of technologies, and may even promote innovations (e.g., new algorithms) in the process.
Collectively, these three components form an integral solution to computer-assisted problem solving for programmers driven by big data, and may have an impact on various domains, including information extraction, language modeling, natural language understanding, automatic problem solving, and social network analysis.
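
The second component above models troubleshooting instances as a heterogeneous network explored via random walks. As a generic illustration (not the thesis system), the snippet below runs a random walk with restart, implemented as personalized PageRank in networkx, from a user's issue node over a toy issue/API/solution graph. All node names are invented.

```python
import networkx as nx

# Toy heterogeneous troubleshooting network: issues, APIs, and solutions.
G = nx.Graph()
G.add_edges_from([
    ("issue:segfault", "api:malloc"),
    ("api:malloc", "solution:check-null"),
    ("issue:segfault", "solution:use-valgrind"),
    ("issue:leak", "api:malloc"),
    ("issue:leak", "solution:use-valgrind"),
])

# Random walk with restart from the user's issue node; networkx exposes
# this as personalized PageRank.
scores = nx.pagerank(G, alpha=0.85, personalization={"issue:segfault": 1.0})

solutions = {n: s for n, s in scores.items() if n.startswith("solution:")}
for node, score in sorted(solutions.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```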

【Keywords】: human-computer interaction; language modeling; software engineering

Paper Link】 【Pages】:721

【Authors】: Nikita V. Spirin

【Abstract】: To help users cope with the scale and influx of new information, professional social networks (PSNs) provide search functionality. However, most search engines within PSNs today only support keyword queries and basic faceted search capabilities, overlooking serendipitous network exploration and search for relationships between entities. This results in siloed information and a limited search space. My thesis is that we must redesign all major elements of a search user interface, such as the input, control, and informational elements, to enable more effective search interactions within PSNs. I will introduce new insights and algorithms supporting this thesis.

【Keywords】: entity search; filtering; job search; online social network; query log analysis; search user interaction; snippet