Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM 2017, Cambridge, United Kingdom, February 6-10, 2017. ACM 【DBLP Link】
【Paper Link】 【Pages】:1-2
【Authors】: Ricardo A. Baeza-Yates
【Abstract】: In this keynote we attempt to cover the first ten years of ACM WSDM, the main conference on web search and data mining. We start from its inception in 2006 through the first conference, held at Stanford in 2008, driven by some key people in the main web search engines of that time. We were confident of its success, but we never expected it to become the top venue for our research topics so quickly. We also cover the main highlights of all these years as well as current and future trends.
【Keywords】: characteristics; history; trends
【Paper Link】 【Pages】:3-12
【Authors】: Jiawei Zhang ; Philip S. Yu ; Yuanhua Lv
【Abstract】: Professional career training for novice employees at elementary levels, to help them master necessary working skills, is critical both for achieving employees' professional success and for enhancing enterprise growth. Besides adopting professional services from external career training agencies, companies can train employees more effectively by involving them in various internal projects carried out within the company. In this paper, we study the "Employee Training" (ET) problem of assigning employees to concrete company-internal projects. From the company perspective, besides training the employees, another important objective of carrying out these projects is to finish them successfully. The successful accomplishment of projects depends on various factors, such as the skill qualification of the built teams and effective collaboration among team members. To achieve these two objectives simultaneously, a novel framework named "Team foRmAtion based employee traINing" (TRAIN) is proposed in this paper. TRAIN formulates the ET problem as a joint optimization problem, where the objective function considers the employees' overall skill gain and the team-internal communication costs at the same time. To ensure the success of the projects, a new team skill qualification constraint is proposed and added to the optimization problem. Extensive experiments conducted on a real-world enterprise employee project team dataset demonstrate the effectiveness of TRAIN in addressing the problem.
【Keywords】: data mining; employee training; enterprise social network; team formation
【Paper Link】 【Pages】:13-21
【Authors】: Farshad Kooti ; Mihajlo Grbovic ; Luca Maria Aiello ; Eric Bax ; Kristina Lerman
【Abstract】: With mobile shopping surging in popularity, people are spending ever more money on digital purchases made through their mobile devices. However, few large-scale studies of mobile shopping exist. In this paper we analyze a large data set consisting of more than 776M digital purchases made on Apple mobile devices, including songs, apps, and in-app purchases. We find that 61% of all spending is on in-app purchases and that the top 1% of users are responsible for 59% of all spending. These big spenders are more likely to be male and older, and less likely to be from the US. We study how they adopt and abandon individual apps, and find that, after an initial phase of increased daily spending, users gradually lose interest: the delay between their purchases increases and the spending decreases, with a sharp drop toward the end. Finally, we model in-app purchasing behavior in multiple steps: 1) we model the time between purchases; 2) we train a classifier to predict whether the user will make a purchase from a new app or continue purchasing from an existing app; and 3) based on the outcome of the previous step, we attempt to predict the exact app, new or existing, from which the next purchase will come. The results yield new insights into spending habits in the mobile digital marketplace.
【Keywords】: app purchases; demographics; iphone; prediction
【Paper Link】 【Pages】:23-32
【Authors】: Róbert Pálovics ; András A. Benczúr
【Abstract】: We analyze the fine-grained connections between the average degree and the power-law degree distribution exponent in growing information networks. Our starting observation is a power-law degree distribution with a decreasing exponent and increasing average degree as a function of the network size. Our experiments are based on three Twitter at-mention networks and three more from the Koblenz Network Collection. We observe that popular network models cannot explain a decreasing power-law degree distribution exponent and an increasing average degree at the same time. We propose a model that combines exponential growth with a power-law developing network, in which new "homophily" edges are continuously added to nodes proportional to their current homophily degree. Parameters of the average degree growth and the power-law degree distribution exponent functions depend on the ratio of the network growth exponent parameters. Specifically, we connect the growth of the average degree to the decreasing exponent of the power-law degree distribution. Prior to our work, only one of the two cases was handled. Existing models, and even their combinations, can only reproduce some of our key new observations in growing information networks.
【Keywords】: accelerated growth; densification; exponential growth; networks; power law
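To make the starting observation of the abstract above concrete, the snippet below fits a power-law exponent to the degree sequence of successive snapshots of a growing network, using the standard discrete maximum-likelihood approximation of Clauset et al.; a decreasing sequence of fitted exponents alongside a growing average degree is the phenomenon the paper models. This is a generic diagnostic sketch, not the authors' code.

```python
import math

def powerlaw_exponent_mle(degrees, d_min=1):
    """Discrete power-law exponent estimate (Clauset et al. approximation):
    alpha ~ 1 + n / sum(ln(d_i / (d_min - 0.5))) over degrees d_i >= d_min."""
    tail = [d for d in degrees if d >= d_min]
    n = len(tail)
    return 1.0 + n / sum(math.log(d / (d_min - 0.5)) for d in tail)

# Fit the exponent on successively larger snapshots of a growing network:
# snapshots = [degrees_at_t1, degrees_at_t2, ...]   # hypothetical input
# exponents = [powerlaw_exponent_mle(s) for s in snapshots]
# avg_degs  = [sum(s) / len(s) for s in snapshots]
```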
【Paper Link】 【Pages】:33-40
【Authors】: Xingjie Wei ; David Stillwell
【Abstract】: Profile images on social networks are users' opportunity to present themselves and to affect how others judge them. We examine what Facebook images say about users' perceived and measured intelligence. 1,122 Facebook users completed a matrices intelligence test and shared their current Facebook profile image. Strangers also rated the images for perceived intelligence. We use automatically extracted image features to predict both measured and perceived intelligence. Intelligence estimation from images is a difficult task even for humans, but experimental results show that human accuracy can be equalled using computing methods. We report the image features that predict both measured and perceived intelligence, and highlight misleading features such as "smiling" and "wearing glasses" that are correlated with perceived but not measured intelligence. Our results give insights into inaccurate stereotyping from profile images and also have implications for privacy, especially since in most social networks profile images are public by default.
【Keywords】: computational aesthetics; intelligence estimation; intelligence quotient; iq; measured intelligence; perceived intelligence
【Paper Link】 【Pages】:41-50
【Authors】: Jan R. Benetka ; Krisztian Balog ; Kjetil Nørvåg
【Abstract】: In this work we address the development of a smart personal assistant that is capable of anticipating a user's information needs based on a novel type of context: the person's activity inferred from her check-in records on a location-based social network. Our main contribution is a method that translates a check-in activity into an information need, which is in turn addressed with an appropriate information card. This task is challenging because of the large number of possible activities and related information needs, which need to be addressed in a mobile dashboard that is limited in size. Our approach considers each possible activity that might follow after the last (and already finished) activity, and selects the top information cards such that they maximize the likelihood of satisfying the user's information needs for all possible future scenarios. The proposed models also incorporate knowledge about the temporal dynamics of information needs. Using a combination of historical check-in data and manual assessments collected via crowdsourcing, we show experimentally the effectiveness of our approach.
【Keywords】: information cards; information needs; proactive ir; query-less; zero-query search
【Paper Link】 【Pages】:51-60
【Authors】: Ali Zarezade ; Utkarsh Upadhyay ; Hamid R. Rabiee ; Manuel Gomez-Rodriguez
【Abstract】: Users in social networks whose posts stay at the top of their followers' feeds the longest time are more likely to be noticed. Can we design an online algorithm to help them decide when to post to stay at the top? In this paper, we address this question as a novel optimal control problem for jump stochastic differential equations. For a wide variety of feed dynamics, we show that the optimal broadcasting intensity for any user is surprisingly simple: it is given by the position of her most recent post on each of her follower's feeds. As a consequence, we are able to develop a simple and highly efficient online algorithm, RedQueen, to sample the optimal times for the user to post. Experiments on both synthetic and real data gathered from Twitter show that our algorithm is able to consistently make a user's posts more visible over time, is robust to volume changes on her followers' feeds, and significantly outperforms the state of the art.
【Keywords】: smart broadcasting; social networks; stochastic optimal control; temporal point processes; twitter
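The control policy stated in the abstract above (posting intensity proportional to the position of one's latest post in a follower's feed) lends itself to a toy simulation. The sketch below discretizes time and posts with probability proportional to the current feed rank; it illustrates the stated policy under assumed feed dynamics, not the authors' sampling algorithm, and all parameter names are hypothetical.

```python
import random

def simulate_redqueen(follower_events, horizon, significance=1.0, dt=0.01):
    """Toy discretized simulation of a RedQueen-style broadcaster with a
    single follower. follower_events: sorted times at which competing posts
    arrive on the follower's feed. The broadcaster's intensity at time t is
    proportional to r(t), the position of her last post (0 = top)."""
    posts, rank, i, t = [], 0, 0, 0.0
    while t < horizon:
        while i < len(follower_events) and follower_events[i] <= t:
            rank += 1                  # a competing post pushes ours down
            i += 1
        # Post with probability u(t)*dt; valid approximation for small dt.
        if random.random() < significance * rank * dt:
            posts.append(t)
            rank = 0                   # our new post returns to the top
        t += dt
    return posts
```

Note the self-correcting behaviour: when the last post sits at the top, the intensity is zero and the broadcaster stays silent, which is what keeps the policy efficient.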
【Paper Link】 【Pages】:61-70
【Authors】: Utkarsh Upadhyay ; Isabel Valera ; Manuel Gomez-Rodriguez
【Abstract】: Learning from the crowd has become increasingly popular in the Web and social media. There is a wide variety of crowdlearning sites in which, on the one hand, users learn from the knowledge that other users contribute to the site, and, on the other hand, knowledge is reviewed and curated by the same users using assessment measures such as upvotes or likes. In this paper, we present a probabilistic modeling framework of crowdlearning, which uncovers the evolution of a user's expertise over time by leveraging other users' assessments of her contributions. The model allows for both off-site and on-site learning and captures forgetting of knowledge. We then develop a scalable estimation method to fit the model parameters from millions of recorded learning and contributing events. We show the effectiveness of our model by tracing the activity of ~25 thousand users on Stack Overflow over a 4.5-year period. We find that answers with high knowledge value are rare. Newbies and experts tend to acquire less knowledge than users in the middle range. Prolific learners also tend to be proficient contributors who post answers with high knowledge value.
【Keywords】: education; markets and crowds; social and information networks; user modelling
【Paper Link】 【Pages】:71-79
【Authors】: Mohammad Akbari ; Tat-Seng Chua
【Abstract】: Recently community detection has attracted much interest in social media as a means to understand the collective behaviours of users and to model individuals in the context of the group. Most existing approaches for community detection exploit either users' social links or their published content, aiming at discovering groups of densely connected or highly similar users. They often fail to find effective communities due to excessive noise in content, sparsity in links, and heterogeneous behaviours of users in social media. Further, they are unable to provide insights and rationales behind the formation of the group and the collective behaviours of the users. To tackle these challenges, we propose to discover communities in a low-dimensional latent space in which we simultaneously learn the representation of users and communities. In particular, we integrated different social views of the network into a low-dimensional latent space in which we sought dense clusters of users as communities. By imposing a Laplacian regularizer on the affiliation matrix, we further incorporated prior knowledge into the community discovery process. Finally, community profiles were computed by a linear operator integrating the profiles of members. Taking the wellness domain as an example, we conducted experiments on a large-scale real-world dataset of users tweeting about diabetes and its related concepts, which demonstrate the effectiveness of our approach in discovering and profiling user communities.
【Keywords】: factorization; group profiling; latent space learning; social networks; user profiling
【Paper Link】 【Pages】:81-90
【Authors】: Kiran Garimella ; Gianmarco De Francisci Morales ; Aristides Gionis ; Michael Mathioudakis
【Abstract】: Society is often polarized by controversial issues that split the population into groups with opposing views. When such issues emerge on social media, we often observe the creation of "echo chambers", i.e., situations where like-minded people reinforce each other's opinion, but do not get exposed to the views of the opposing side. In this paper we study algorithmic techniques for bridging these chambers, and thus reducing controversy. Specifically, we represent the discussion on a controversial issue with an endorsement graph, and cast our problem as an edge-recommendation problem on this graph. The goal of the recommendation is to reduce the controversy score of the graph, which is measured by a recently-developed metric based on random walks. At the same time, we take into account the acceptance probability of the recommended edge, which represents how likely the edge is to materialize in the endorsement graph. We propose a simple model based on a recently-developed user-level controversy score, which is competitive with state-of-the-art link-prediction algorithms. Our goal then becomes finding the edges that produce the largest expected reduction in the controversy score. To solve this problem, we propose an efficient algorithm that considers only a fraction of all the possible combinations of edges. Experimental results show that our algorithm is more efficient than a simple greedy heuristic, while producing comparable score reduction. Finally, a comparison with other state-of-the-art edge-addition algorithms shows that this problem is fundamentally different from what has been studied in the literature.
【Keywords】: controversy; echo chambers; filter bubble; opposing views; polarization; social media
【Paper Link】 【Pages】:91-100
【Authors】: Tao Wang ; Markus Brede ; Antonella Ianni ; Emmanouil Mentzakis
【Abstract】: Eating disorders are complex mental disorders and responsible for the highest mortality rate among mental illnesses. Recent studies reveal that user-generated content on social media provides useful information in understanding these disorders. Most previous studies focus on studying communities of people who discuss eating disorders on social media, while few studies have explored community structures and interactions among individuals who suffer from this disease over social media. In this paper, we first develop a snowball sampling method to automatically gather individuals who self-identify as eating disordered in their profile descriptions, as well as their social network connections with one another on Twitter. Then, we verify the effectiveness of our sampling method by: 1. quantifying differences between the sampled eating disordered users and two sets of reference data collected for non-disordered users in social status, behavioral patterns and psychometric properties; 2. building predictive models to classify eating disordered and non-disordered users. Finally, leveraging the data of social connections between eating disordered individuals on Twitter, we present the first homophily study among eating-disorder communities on social media. Our findings shed new light on how an eating-disorder community develops on social media.
【Keywords】: eating disorder; homophily; mental health; network analysis; social media; text mining
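The snowball-sampling step described in the abstract above can be pictured as a breadth-first traversal that expands only from users whose profile descriptions match self-identification patterns. The sketch below is a generic version under assumed accessors: ED_PATTERN, get_bio, and get_connections are all hypothetical stand-ins, not the paper's actual patterns or the Twitter API.

```python
import re
from collections import deque

# Illustrative pattern only; the paper derives its own self-identification cues.
ED_PATTERN = re.compile(r"\b(anorexi\w*|bulimi\w*|ed recovery)\b", re.I)

def snowball_sample(seeds, get_bio, get_connections, max_users=10000):
    """BFS-style snowball sampling of users who self-identify in their bios.
    get_bio(user) -> str and get_connections(user) -> iterable are
    hypothetical accessors standing in for social-network API calls."""
    sampled, queue, seen = [], deque(seeds), set(seeds)
    while queue and len(sampled) < max_users:
        user = queue.popleft()
        if ED_PATTERN.search(get_bio(user) or ""):
            sampled.append(user)
            for friend in get_connections(user):   # expand only from matches
                if friend not in seen:
                    seen.add(friend)
                    queue.append(friend)
    return sampled
```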
【Paper Link】 【Pages】:101-110
【Authors】: Daniel M. Romero ; Katharina Reinecke ; Lionel P. Robert Jr.
【Abstract】: Sequential group decision-making processes, such as online event scheduling, can be subject to social influence if the decisions involve individuals' subjective preferences and values. Indeed, prior work has shown that scheduling polls that allow respondents to see others' answers are more likely to succeed than polls that hide other responses, suggesting the impact of social influence and coordination. In this paper, we investigate whether this difference is due to information cascade effects in which later respondents adopt the decisions of earlier respondents. Analyzing more than 1.3 million Doodle polls, we found evidence that cascading effects take place during event scheduling, and in particular, that early respondents have a larger influence on the outcome of a poll than people who come late. Drawing on simulations of an event scheduling model, we compare possible interventions to mitigate this bias and show that we can optimize the success of polls by hiding the responses of a small percentage of low availability respondents.
【Keywords】: decision-making; doodle; herding behavior; information cascade; social influence
【Paper Link】 【Pages】:111-120
【Authors】: Luca Maria Aiello ; Nicola Barbieri
【Abstract】: Ego-networks are fundamental structures in social graphs, yet the process of their evolution is still widely unexplored. In an online context, a key question is how link recommender systems may skew the growth of these networks, possibly restraining diversity. To shed light on this matter, we analyze the complete temporal evolution of 170M ego-networks extracted from Flickr and Tumblr, comparing links that are created spontaneously with those that have been algorithmically recommended. We find that the evolution of ego-networks is bursty, community-driven, and characterized by subsequent phases of explosive diameter increase, slight shrinking, and stabilization. Recommendations favor popular and well-connected nodes, limiting the diameter expansion. With a matching experiment aimed at detecting causal relationships from observational data, we find that the bias introduced by the recommendations fosters global diversity in the process of neighbor selection. Last, with two link prediction experiments, we show how insights from our analysis can be used to improve the effectiveness of social recommender systems.
【Keywords】: communities; ego-networks; flickr; groups; link recommendation; network evolution; social media; tumblr
【Paper Link】 【Pages】:121-130
【Authors】: Jiawei Zhang ; Jianhui Chen ; Junxing Zhu ; Yi Chang ; Philip S. Yu
【Abstract】: Inferring the links among entities in networks is an important research problem for various disciplines. Depending on the specific application settings, the links to be inferred are usually subject to different cardinality constraints, like one-to-one, one-to-many and many-to-many. However, most existing research works on link prediction problems fail to consider such constraints. In this paper, we propose to study the link prediction problem with general cardinality constraints, which is formally defined as the CLP (Cardinality Constrained Link Prediction) problem. By minimizing the projection loss of links from feature vectors to labels, the CLP problem is formulated as an optimization problem involving multiple variables, where the cardinality constraints are modeled as mathematical constraints on node degrees. The objective function is shown to be not jointly convex, and the optimal solution subject to the cardinality constraints can be very time-consuming to obtain. To solve the optimization problem, an iterative variable updating based link prediction framework, ITERCLIPS (Iterative Constrained Link Prediction & Selection), is introduced in this paper, which alternates between link updating and link selection steps. To overcome the high time cost, a greedy link selection step is introduced, which picks links greedily while preserving the link cardinality constraints. Meanwhile, to ensure the effectiveness of ITERCLIPS on large-scale networks, a distributed implementation of ITERCLIPS is further presented as a scalable solution to the CLP problem. Extensive experiments have been conducted on three real-world network datasets with different types of cardinality constraints, and the experimental results on all these datasets demonstrate the effectiveness and advantages of ITERCLIPS in solving the CLP problem.
【Keywords】: aligned social networks; bibliographic networks; cardinality constraint; data mining; enterprise organizational chart; link prediction
【Paper Link】 【Pages】:131-139
【Authors】: Liangda Li ; Hongbo Deng ; Jianhui Chen ; Yi Chang
【Abstract】: Query auto completion (QAC) is a prominent feature in modern search engines. High-quality QAC substantially improves the search experience by helping users type less when submitting their queries. Many studies have sought to improve the quality and relevance of QAC methods from different perspectives, including leveraging context in long-term and short-term query histories, investigating temporal information for time-sensitive QAC, and analyzing user behaviors. Although these studies have shown that context, temporal, and user behavior data carry valuable information, most existing QAC approaches do not fully exploit, or even completely ignore, this information. We propose a novel Hawkes process based QAC algorithm that comprehensively takes into account the context, time, and position of clicked recommended query completions (a type of user behavior data) for reliable query completion prediction. Our understanding of ranking query completions is consistent with the mathematical rationale of the Hawkes process; such a coincidence in turn validates our motivation for using Hawkes processes for QAC. We also develop an efficient inference algorithm to compute the optimal solutions of the proposed QAC algorithm. The proposed method is evaluated on two real-world benchmark datasets in comparison with state-of-the-art methods, and the experimental results clearly demonstrate its effectiveness.
【Keywords】: contextual data; hawkes process; query autocompletion
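For readers unfamiliar with the model the abstract above leans on: a univariate Hawkes process has a conditional intensity that is self-excited by past events. A minimal sketch follows (parameter values are illustrative; the paper's actual model jointly handles context, time, and click position, which this does not attempt).

```python
import math

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    """Conditional intensity of a univariate Hawkes process with an
    exponential kernel:

        lambda(t) = mu + alpha * sum_{t_i < t} exp(-beta * (t - t_i))

    In a QAC setting, `history` could hold the times a candidate completion
    was previously clicked, so recent clicks self-excite its predicted
    reuse. mu, alpha, beta here are illustrative, not fitted values."""
    return mu + alpha * sum(math.exp(-beta * (t - ti)) for ti in history if ti < t)
```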
【Paper Link】 【Pages】:141-150
【Authors】: Cheng Luo ; Yiqun Liu ; Tetsuya Sakai ; Ke Zhou ; Fan Zhang ; Xue Li ; Shaoping Ma
【Abstract】: Time plays an essential role in multiple areas of Information Retrieval (IR) studies such as search evaluation, user behavior analysis, temporal search result ranking and query understanding. Especially, in search evaluation studies, time is usually adopted as a measure to quantify users' efforts in search processes. Psychological studies have reported that the time perception of human beings can be affected by many stimuli, such as attention and motivation, which are closely related to many cognitive factors in search. Considering the fact that users' search experiences are affected by their subjective feelings of time, rather than the objective time measured by timing devices, it is necessary to look into the different factors that have impacts on search users' perception of time. In this work, we make a first step towards revealing the time perception mechanism of search users with the following contributions: (1) We establish an experimental research framework to measure the subjective perception of time while reading documents in a search scenario, which originates from but also differs from traditional time perception measurements in psychological studies. (2) With the framework, we show that while users are reading result documents, document relevance has a small yet visible effect on search users' perception of time. By further examining the impact of other factors, we demonstrate that the effect on relevant documents can also be influenced by individuals and tasks. (3) We conduct a preliminary experiment in which the difference between perceived time and dwell time is taken into consideration in a search evaluation task. We found that the revised framework achieved a better correlation with users' satisfaction feedback. This work may help us better understand the time perception mechanism of search users and provide insights into how to better incorporate the time factor in search evaluation studies.
【Keywords】: interactive information retrieval; time perception; user behavior
【Paper Link】 【Pages】:151-159
【Authors】: Gong Cheng ; Cheng Jin ; Wentao Ding ; Danyun Xu ; Yuzhong Qu
【Abstract】: To embrace the open data movement, increasingly many datasets have been published on the Web to be reused. Users, when assessing the usefulness of an unfamiliar dataset, need means to quickly inspect its contents. To satisfy the needs, we propose to automatically extract an optimal small portion from a dataset, called a snippet, to concisely illustrate the contents of the dataset. We consider the quality of a snippet from three aspects: coverage, familiarity, and cohesion, which are jointly formulated in a new combinatorial optimization problem called the maximum-weight-and-coverage connected graph problem (MwcCG). We give a constant-factor approximation algorithm for this NP-hard problem, and experiment with our solution on real-world datasets. Our quantitative analysis and user study show that our approach outperforms a baseline approach.
【Keywords】: approximation algorithm; combinatorial optimization; dataset summarization; snippet generation
【Paper Link】 【Pages】:161-170
【Authors】: Xin Li ; Yiqun Liu ; Rongjie Cai ; Shaoping Ma
【Abstract】: With Web users' search tasks becoming increasingly complex, a single information source cannot necessarily satisfy their information needs. Searchers may rely on heterogeneous sources to complete their tasks, such as search engines, Community Question Answering (CQA), encyclopedia sites, and crowdsourcing platforms. Previous works focus on interaction behaviors with federated search results, including how to compose a federated Web search result page and what factors affect users' interaction behavior on aggregated search interfaces. However, little is known about which factors are crucial in determining users' search outcomes while facing multiple heterogeneous search services. In this paper, we design a lab-based user study to analyze what explicit and implicit factors affect search outcomes (information gain and user satisfaction) when users have access to heterogeneous information sources. In the study, each participant can access three different kinds of search services: a general search engine (Bing), a general CQA portal (Baidu Knows), and a high-quality CQA portal (Zhihu). Using questionnaires and interaction log data, we extract explicit and implicit signals to analyze how users' search outcomes are correlated with their behaviors on different information sources. Experimental results indicate that users' search experiences on CQA portals (such as users' perceived usefulness and number of result clicks) positively affect search outcome (information gain), while search satisfaction is significantly correlated with some other factors such as users' familiarity, interest and difficulty of the task. Besides, users' search satisfaction can be more accurately predicted by the implicit factors than search outcomes.
【Keywords】: heterogeneous information; search outcome; user study
【Paper Link】 【Pages】:171-180
【Authors】: Fidel Cacheda ; Nicola Barbieri ; Roi Blanco
【Abstract】: With the ubiquity of internet access and location services provided by smartphone devices, the volume of queries issued by users to find products and services that are located near them is rapidly increasing. Local search engines help users in this task by matching queries with a predefined geographical connotation ("local queries") against a database of local business listings. Local search differs from traditional web search because, to correctly capture users' click behavior, the estimation of relevance between query and candidate results must be integrated with geographical signals, such as distance. The intuition is that users prefer businesses that are physically closer to them. However, this notion of closeness is likely to depend upon other factors, like the category of the business, the quality of the service provided, the density of businesses in the area of interest, etc. In this paper we perform an extensive analysis of online users' behavior and investigate the problem of estimating the click-through rate on local search (LCTR) by exploiting the combination of standard retrieval methods with a rich collection of geo- and business-dependent features. We validate our approach on a large log collected from a real-world local search service. Our evaluation shows that the non-linear combination of business information, geo-local and textual relevance features leads to significant improvements over state-of-the-art alternative approaches based on a combination of relevance, distance and business reputation.
【Keywords】: distance; information retrieval; model
【Paper Link】 【Pages】:181-190
【Authors】: Faezeh Ensan ; Ebrahim Bagheri
【Abstract】: This paper addresses the task of document retrieval based on the degree of document relatedness to the meanings of a query by presenting a semantic-enabled language model. Our model relies on the use of semantic linking systems for forming a graph representation of documents and queries, where nodes represent concepts extracted from documents and edges represent semantic relatedness between concepts. Based on this graph, our model adopts a probabilistic reasoning model for calculating the conditional probability of a query concept given values assigned to document concepts. We present an integration framework for interpolating other retrieval systems with the model presented in this paper. Our empirical experiments on a number of TREC collections show that semantic retrieval has a synergistic impact on the results obtained through state-of-the-art keyword-based approaches, and that the consideration of semantic information obtained from entity linking on queries and documents can complement and enhance the performance of other retrieval models.
【Keywords】: information retrieval; language models; semantic linking; semantic relatedness; semantic search
【Paper Link】 【Pages】:191-200
【Authors】: Haitao Yu ; Adam Jatowt ; Roi Blanco ; Hideo Joho ; Joemon M. Jose ; Long Chen ; Fajie Yuan
【Abstract】: To cope with ambiguous and/or underspecified queries, search result diversification (SRD) is a key technique that has attracted a lot of attention. This paper focuses on implicit SRD, where the possible subtopics underlying a query are unknown beforehand. We formulate implicit SRD as a process of selecting and ranking k exemplar documents that utilizes integer linear programming (ILP). Unlike the common practice of relying on approximate methods, this formulation enables us to obtain the optimal solution of the objective function. Based on four benchmark collections, our extensive empirical experiments reveal that: (1) The factors, such as different initial runs, the number of input documents, query types and the ways of computing document similarity significantly affect the performance of diversification models. Careful examinations of these factors are highly recommended in the development of implicit SRD methods. (2) The proposed method can achieve substantially improved performance over the state-of-the-art unsupervised methods for implicit SRD.
【Keywords】: cluster-based ir; implicit srd; integer linear programming
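The abstract above formulates implicit SRD as selecting and ranking k exemplar documents via integer linear programming. One plausible exemplar-selection ILP of that flavor, written with the PuLP modeling library, is sketched below; the facility-location-style objective and constraints are a generic stand-in, not necessarily the paper's exact formulation.

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

def select_exemplars(sim, rel, k):
    """Pick k exemplar documents and assign every document to one of them,
    maximizing relevance-weighted within-cluster similarity.

    sim[i][j]: similarity of document i to candidate exemplar j
    rel[i]:    initial retrieval score of document i"""
    n = len(sim)
    prob = LpProblem("implicit_srd", LpMaximize)
    x = [LpVariable(f"x_{j}", cat=LpBinary) for j in range(n)]   # j is an exemplar
    y = [[LpVariable(f"y_{i}_{j}", cat=LpBinary) for j in range(n)]
         for i in range(n)]                                      # i assigned to j
    prob += lpSum(rel[i] * sim[i][j] * y[i][j]
                  for i in range(n) for j in range(n))
    prob += lpSum(x) == k                       # exactly k exemplars
    for i in range(n):
        prob += lpSum(y[i]) == 1                # each document assigned once
        for j in range(n):
            prob += y[i][j] <= x[j]             # only to chosen exemplars
    prob.solve()
    return [j for j in range(n) if x[j].value() == 1]
```

Solving the ILP exactly, rather than greedily approximating it, is precisely what lets the paper reach the optimum of its objective.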
【Paper Link】 【Pages】:201-210
【Authors】: Matt Crane ; J. Shane Culpepper ; Jimmy J. Lin ; Joel Mackenzie ; Andrew Trotman
【Abstract】: We present an empirical comparison between document-at-a-time (DaaT) and score-at-a-time (SaaT) document ranking strategies within a common framework. Although both strategies have been extensively explored, the literature lacks a fair, direct comparison: such a study has been difficult due to vastly different query evaluation mechanics and index organizations. Our work controls for score quantization, document processing, compression, implementation language, implementation effort, and a number of details, arriving at an empirical evaluation that fairly characterizes the performance of three specific techniques: WAND (DaaT), BMW (DaaT), and JASS (SaaT). Experiments reveal a number of interesting findings. The performance gap between WAND and BMW is not as clear as the literature suggests, and both methods are susceptible to tail queries that may take orders of magnitude longer than the median query to execute. Surprisingly, approximate query evaluation in WAND and BMW does not significantly reduce the risk of these tail queries. Overall, JASS is slightly slower than either WAND or BMW, but exhibits much lower variance in query latencies and is much less susceptible to tail query effects. Furthermore, JASS query latency is not particularly sensitive to the retrieval depth, making it an appealing solution for performance-sensitive applications where bounds on query latencies are desirable.
【Keywords】: block-max wand; document-ordered indexes; impact-ordered indexes; jass; wand
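The score-at-a-time strategy that JASS represents in the study above can be illustrated in a few lines: postings from all query terms are merged into one impact-ordered stream, scores accumulate per document, and traversal may stop after a fixed number of postings, which is what bounds tail latency. This is a sketch of the strategy only, not the JASS implementation.

```python
from collections import defaultdict

def saat_query(postings, rho=None):
    """Minimal score-at-a-time evaluation over an impact-ordered index.

    postings: list of (impact, doc_id) pairs pooled across all query terms,
              pre-sorted by quantized impact, descending.
    rho:      optional cap on postings processed (anytime early termination)."""
    accumulators = defaultdict(int)
    for n, (impact, doc_id) in enumerate(postings):
        if rho is not None and n >= rho:
            break                         # early termination bounds latency
        accumulators[doc_id] += impact    # partial scores accumulate per doc
    return sorted(accumulators.items(), key=lambda kv: -kv[1])
```

Because work is bounded by rho rather than by retrieval depth or document distribution, query latency is stable, matching the low-variance behaviour the abstract reports.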
【Paper Link】 【Pages】:211-220
【Authors】: Venkatesh Vinayakarao ; Anita Sarma ; Rahul Purandare ; Shuktika Jain ; Saumya Jain
【Abstract】: Code search with natural language terms performs poorly because programming concepts do not always lexically match their syntactic forms. For example, in Java, the programming concept "array" does not match its syntactic representation "[ ]". Code search engines could serve natural language queries more effectively if such mappings existed for a variety of programming languages. In this work, we present a programming language agnostic technique to discover such mappings between syntactic forms and natural language terms representing programming concepts. We use the questions and answers in Stack Overflow to create this mapping. We implement our approach in a tool called ANNE. To evaluate its effectiveness, we conduct a user study in an academic setting in which teaching assistants use ANNE to search for code snippets in student submissions. With the use of ANNE, we find that the participants are 29% quicker with no significant drop in correctness and completeness.
【Keywords】: assignment grading; code search; information retrieval; natural language processing
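A simple way to picture mining such mappings from Stack Overflow is to score the co-occurrence of question terms and code tokens across Q&A pairs, e.g. with pointwise mutual information. The sketch below is a hypothetical baseline in that spirit; ANNE's actual procedure may differ.

```python
import math
from collections import Counter

def mine_concept_mappings(qa_pairs, min_count=5):
    """PMI between natural-language question terms and code tokens.

    qa_pairs: iterable of (question_terms, code_tokens) pairs, e.g. parsed
    from Stack Overflow questions and their accepted answers."""
    term_c, tok_c, pair_c, n = Counter(), Counter(), Counter(), 0
    for terms, tokens in qa_pairs:
        n += 1
        for w in set(terms):
            term_c[w] += 1
        for s in set(tokens):
            tok_c[s] += 1
        for w in set(terms):
            for s in set(tokens):
                pair_c[(w, s)] += 1
    # PMI(w, s) = log( P(w, s) / (P(w) * P(s)) ), estimated from counts.
    return {(w, s): math.log(c * n / (term_c[w] * tok_c[s]))
            for (w, s), c in pair_c.items() if c >= min_count}

# A high-PMI pair could, for instance, link the term "array" to the token "[ ]".
```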
【Paper Link】 【Pages】:221-230
【Authors】: Yulu Wang ; Jimmy J. Lin
【Abstract】: The basic idea behind selective search is to partition a collection into topical clusters, and for each query, consider only a subset of the clusters that are likely to contain relevant documents. Previous work on web collections has shown that it is possible to retain high-quality results while considering only a small fraction of the collection. These studies, however, assume static collections where it is feasible to run batch clustering algorithms for partitioning. In this work, we consider the novel formulation of selective search on document streams (specifically, tweets), where partitioning must be performed incrementally. In our approach, documents are partitioned into temporal segments and selective search is performed within each segment: these segments can either be clustered using batch or online algorithms, and at different temporal granularities. For efficiency, we take advantage of word embeddings to reduce the dimensionality of the document vectors. Experiments with test collections from the TREC Microblog Tracks show that we are able to achieve precision indistinguishable from exhaustive search while considering only around 5% of the collection. Interestingly, we observe no significant effectiveness differences between batch vs. online clustering and between hourly vs. daily temporal segments, despite them being very different index organizations. This suggests that architectural choices should be primarily guided by efficiency considerations.
【Keywords】: clustering; effectiveness-efficiency tradeoffs; tweets
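The online-partitioning variant the abstract above describes can be approximated with a single-pass threshold clusterer over embedding-reduced document vectors, as sketched below; the similarity threshold is a hypothetical knob, and this is a generic incremental-clustering baseline rather than the paper's exact method.

```python
import numpy as np

def online_cluster(doc_vectors, threshold=0.6):
    """Single-pass incremental clustering for streaming documents: each new
    vector joins its most similar centroid (cosine similarity), or starts a
    new cluster if no centroid is close enough."""
    centroids, sizes, assignments = [], [], []
    for v in doc_vectors:
        v = v / np.linalg.norm(v)
        if centroids:
            sims = [c @ v for c in centroids]
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                # Running mean of the cluster, re-normalized for cosine sim.
                centroids[best] = (centroids[best] * sizes[best] + v) / (sizes[best] + 1)
                centroids[best] /= np.linalg.norm(centroids[best])
                sizes[best] += 1
                assignments.append(best)
                continue
        centroids.append(v)
        sizes.append(1)
        assignments.append(len(centroids) - 1)
    return assignments
```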
【Paper Link】 【Pages】:231-240
【Authors】: Vinay Setty ; Abhijit Anand ; Arunav Mishra ; Avishek Anand
【Abstract】: We deal with the problem of ranking news events on a daily basis for large news corpora, an essential building block for news aggregation. News ranking has been addressed in the literature before, but with individual news articles as the unit of ranking. However, estimating event importance accurately requires models to quantify current-day event importance as well as its significance in the historical context. Consequently, in this paper we show that a cluster of news articles representing an event is a better unit of ranking, as it provides an improved estimation of popularity, source diversity and authority cues. In addition, events facilitate quantifying their historical significance by linking them with long-running topics and recent chains of events. Our main contribution in this paper is to provide effective models for improved news event ranking. To this end, we propose novel event mining and feature generation approaches for improving estimates of event importance. Finally, we conduct extensive evaluation of our approaches on two large real-world news corpora, each of which spans more than a year with a large volume of up to tens of thousands of daily news articles. Our evaluations are large-scale and based on a clean, human-curated ground truth from the Wikipedia Current Events Portal. Experimental comparison with a state-of-the-art news ranking technique based on language models demonstrates the effectiveness of our approach.
【Keywords】: daily news ranking; event canonicalization; event linking; learning to rank; news clustering; news event mining; news event ranking
【Paper Link】 【Pages】:241-249
【Authors】: Kewen Liao ; Alistair Moffat ; Matthias Petri ; Anthony Wirth
【Abstract】: Vast amounts of data are collected and stored every day, as part of corporate knowledge bases and as a response to legislative compliance requirements. To reduce the cost of retaining such data, compression tools are often applied. But simply seeking the best compression ratio is not necessarily the most economical choice, and other factors also come into play, including compression and decompression throughput, the main memory required to support a given level of on-going access to the stored data, and the types of storage available. Here we develop a model for the total retention cost (TRC) of a data archiving regime and, by applying the charging rates associated with a cloud computing provider, are able to derive dollar amounts for a range of compression options, and hence guide the development of new approaches that are more cost-effective than current mechanisms. In particular, we describe an enhancement to the Relative Lempel Ziv (RLZ) compression scheme, and show that, in terms of TRC, it outperforms previous approaches for economical long-term data retention.
【Keywords】: cost model; data archiving; data compression; data retention; relative lempel-ziv
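The TRC idea in the abstract above reduces to simple arithmetic once charging rates are fixed: pay once to compress, pay monthly to store the compressed bytes, and pay compute for every decompression that ongoing access demands. The sketch below uses illustrative cloud rates; all numbers and the cost breakdown are hypothetical, not the paper's model or figures.

```python
def total_retention_cost(raw_gb, ratio, compress_hours, decompress_hours,
                         storage_rate=0.023, compute_rate=0.05,
                         months=60, accesses_per_month=10):
    """Back-of-the-envelope total retention cost (TRC) in dollars.

    storage_rate: $/GB-month; compute_rate: $/CPU-hour (illustrative).
    ratio: compression ratio (raw size / compressed size)."""
    storage = (raw_gb / ratio) * storage_rate * months      # ongoing storage
    compress = compress_hours * compute_rate                # one-off compression
    access = decompress_hours * compute_rate * accesses_per_month * months
    return storage + compress + access

# A scheme with a worse ratio but much cheaper decompression can win on TRC,
# which is the trade-off the enhanced RLZ scheme targets.
```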
【Paper Link】 【Pages】:251
【Authors】: Nick Craswell
【Abstract】: A fundamental concern in search engines is to determine which documents have the best content for satisfying the user, based on analysis of the user's query and the text of documents. For this query-content match, many learning to rank systems make use of IR features developed in the 1990s in the TREC framework. Such features are still important in a variety of search tasks, and particularly in the long tail where clicks, links and social media signals become sparse. I will present our current progress, in particular three different neural models, with the goal of surpassing the 1990s models in full text search. This will include evidence, using proprietary Bing datasets, that large-scale training data can be useful. I will also argue that for the field to make progress on query-content relevance modeling, it may be valuable to set up a shared blind evaluation similar to 1990s TREC, possibly with large-scale training data.
【Keywords】:
【Paper Link】 【Pages】:253-261
【Authors】: Yaliang Li ; Nan Du ; Chaochun Liu ; Yusheng Xie ; Wei Fan ; Qi Li ; Jing Gao ; Huan Sun
【Abstract】: Nowadays, more and more people are receiving medical diagnoses from healthcare-related question answering platforms, as people can get diagnoses quickly and conveniently. However, such diagnoses from non-expert crowdsourcing users are noisy or even wrong due to the lack of medical domain knowledge, which can cause serious consequences. To unleash the power of crowdsourcing for healthcare question answering, it is important to identify trustworthy answers and filter out noisy ones from user-generated data. Truth discovery methods estimate user reliability degrees and infer trustworthy information simultaneously, and thus these methods can be adopted to discover trustworthy diagnoses from crowdsourced answers. However, existing truth discovery methods do not take into account the rich semantic meanings of the answers. In light of this challenge, we propose a method to automatically capture the semantic meanings of answers, where answers are represented as real-valued vectors in the semantic space. To learn such vector representations from noisy user-generated data, we tightly combine the truth discovery and vector learning processes. In this way, the learned vector representations enable the truth discovery method to model the semantic relations among answers, and the information trustworthiness inferred by truth discovery can help the procedure of vector representation learning. To demonstrate the effectiveness of the proposed method, we collect a large-scale real-world dataset that involves 219,527 medical diagnosis questions and 23,657 non-expert users. Experimental results show that the proposed method improves the accuracy of identified trustworthy answers due to the successful consideration of answers' semantic meanings. Further, we demonstrate the fast convergence and good scalability of the proposed method, which makes it practical for real-world applications.
【Keywords】: medical question answering; semantic meanings; truth discovery
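The truth discovery half of the method above follows the usual alternating pattern: estimate the trustworthy answer from reliability-weighted votes, then re-estimate reliabilities from each user's distance to that estimate. The sketch below does this on fixed answer vectors, in the CRH style; the paper additionally learns the vectors jointly with the reliabilities, which is omitted here.

```python
import numpy as np

def truth_discovery(answer_vecs, user_of_answer, n_users, iters=20):
    """Alternating truth discovery over embedded answers to one question.

    answer_vecs:    (m, d) array, one semantic vector per answer
    user_of_answer: length-m array, which user gave each answer
    Assumes every user contributed at least one answer."""
    w = np.ones(n_users)
    X = np.asarray(answer_vecs)
    users = np.asarray(user_of_answer)
    for _ in range(iters):
        # (1) truth = reliability-weighted mean of the answer vectors
        truth = (w[users][:, None] * X).sum(0) / w[users].sum()
        # (2) reliability = -log of each user's share of the total distance
        dist = np.array([np.square(X[users == u] - truth).sum()
                         for u in range(n_users)]) + 1e-9
        w = -np.log(dist / dist.sum())   # reliable users sit close to the truth
    return truth, w
```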
【Paper Link】 【Pages】:263-272
【Authors】: Quan Yuan ; Wei Zhang ; Chao Zhang ; Xinhe Geng ; Gao Cong ; Jiawei Han
【Abstract】: The availability of massive geo-annotated social media data sheds light on studying human mobility patterns. Among them, the periodic pattern, i.e., an individual visiting a geographical region at some specific time interval, has been recognized as one of the most important. Mining periodic patterns has a variety of applications, such as location prediction, anomaly detection, and location- and time-aware recommendation. However, it is a challenging task: the regions of a person and the periods of each region are both unknown, and the interdependency between them makes the task even harder. Hence, existing methods are far from satisfactory for detecting periodic patterns from low-sampling-rate and noisy social media data. We propose a Bayesian non-parametric model, named Periodic REgion Detection (PRED), to discover periodic mobility patterns by jointly modeling the geographical and temporal information. Our method differs from previous studies in that it is non-parametric and thus does not require prior knowledge about an individual's mobility (e.g., number of regions, period length, region size). Meanwhile, it models the time gap between two consecutive records rather than the exact visit time, making it less sensitive to data noise. Extensive experimental results on both synthetic and real-world datasets show that PRED outperforms the state-of-the-art methods significantly in four tasks: periodic region discovery, outlier movement finding, period detection, and location prediction.
【Keywords】: location prediction; mobility modeling; periodicity detection
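Why model gaps rather than absolute visit times? Because the period survives in the distribution of inter-visit gaps even when individual visits are missed. A crude, non-Bayesian illustration of that signal follows; PRED itself places a non-parametric model over these gaps, which is what makes it robust to noise.

```python
import numpy as np

def estimate_period(visit_times, max_period=60):
    """Naive period estimate from inter-visit gaps: the mode of the gap
    histogram, in days. Requires at least two visits; bin width of one day
    is an illustrative choice."""
    gaps = np.diff(np.sort(visit_times))
    hist, edges = np.histogram(gaps, bins=np.arange(0.5, max_period + 1.5))
    return edges[np.argmax(hist)] + 0.5   # center of the tallest bin
```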
【Paper Link】 【Pages】:273-282
【Authors】: Yandi Li ; Nicola Barbieri ; Daniele Quercia
【Abstract】: To walk around the city, individuals use mobile mapping services, and such services mostly suggest shortest routes. To go beyond recommending such walkable routes, we propose a new framework for automatic wayfinding for pedestrians. This framework tackles two main drawbacks from which past work suffers, namely coarse-grained representation of space and absence of contextual dynamics. We model the human tendency to regularize space by borrowing a spatial representation, Space Syntax, from the discipline of Architecture. Moreover, the proposed framework accounts for contextual dynamics of individual streets by predicting the popularity of each street under different contexts (e.g., at a given time, with a certain weather condition). Using Foursquare check-ins (i.e., whereabouts of the users of the popular location-based service) and publicly available weather data, we validate our framework in the entire city of Barcelona. We find that, with paths slightly longer than the shortest ones, our framework is able to accommodate our mental topography and effectively capture contextual changes.
【Keywords】: experimental study; path recommendation; urban informatics
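Operationally, the framework in the abstract above amounts to routing on a street graph whose edge costs blend length with context-dependent predicted popularity. A minimal sketch with networkx, assuming edges carry a `length` attribute; `popularity` and `alpha` are hypothetical stand-ins for the paper's learned popularity model and its trade-off between shortness and liveliness.

```python
import networkx as nx

def contextual_route(G, src, dst, popularity, alpha=0.3):
    """Shortest path on a street graph with popularity-discounted lengths.

    popularity: maps an edge (u, v) to a score in [0, 1] predicted for the
    current context (time of day, weather); alpha is a hypothetical knob."""
    for u, v, data in G.edges(data=True):
        # Popular segments get cheaper, so slightly longer but livelier
        # paths can beat the purely shortest one.
        data["w"] = data["length"] * (1.0 - alpha * popularity.get((u, v), 0.0))
    return nx.shortest_path(G, src, dst, weight="w")
```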
【Paper Link】 【Pages】:283-292
【Authors】: Huiwen Liu ; Jiajie Xu ; Kai Zheng ; Chengfei Liu ; Lan Du ; Xian Wu
【Abstract】: Nowadays, users of social networks like Twitter and Weibo have generated massive geo-tagged records, and these records reveal their activities in the physical world together with spatio-temporal dynamics. Existing trajectory data management studies mainly focus on analyzing the spatio-temporal properties of trajectories, while leaving the understanding of their activities largely untouched. In this paper, we incorporate semantic analysis of the activity information embedded in trajectories into query modelling and processing, with the aim of providing end users more accurate and meaningful trip recommendations. To this end, we propose a novel trajectory query that not only considers spatio-temporal closeness but also, more importantly, leverages probabilistic topic modelling to capture the semantic relevance of the activities between data and query. To support efficient query processing, we design a novel hybrid index structure, namely ST-tree, to organize the trajectory points hierarchically, which enables us to prune the search space in the spatial and topic dimensions simultaneously. The experimental results on real datasets demonstrate the efficiency and scalability of the proposed index structure and search algorithms.
【Keywords】: activity trajectories query; semantic relevance; spatial keywords
【Paper Link】 【Pages】:293
【Authors】: Claire Cardie
【Abstract】: Increasingly central to online search and information discovery are methods from Natural Language Processing (NLP). Research in the field, however, is progressing at what feels like a frenetic pace, making it a daunting task to keep up with the latest results. This talk will describe some of the tasks and sub-areas of NLP that have experienced fundamental advances in the state of the art over the past few years, focusing on what these advances mean for understanding and extracting information from online text.
【Keywords】: natural language processing
【Paper Link】 【Pages】:295-304
【Authors】: Ting Chen ; Yizhou Sun
【Abstract】: In this paper, we study the problem of author identification under the double-blind review setting, which is to identify potential authors given information about an anonymized paper. Different from existing approaches that rely heavily on feature engineering, we propose to use a network embedding approach to address the problem, which can automatically represent nodes as lower-dimensional feature vectors. However, there are two major limitations in recent studies on network embedding: (1) they are usually general-purpose embedding methods, which are independent of the specific tasks; and (2) most of these approaches can only deal with homogeneous networks, where the heterogeneity of the network is ignored. Hence, the challenges faced here are twofold: (1) how to embed the network under the guidance of the author identification task, and (2) how to select the best type of information given the heterogeneity of the network. To address the challenges, we propose a task-guided and path-augmented heterogeneous network embedding model. In our model, nodes are first embedded as vectors in a latent feature space. Embeddings are then shared and jointly trained according to task-specific and network-general objectives. We extend existing unsupervised network embedding to incorporate meta paths in heterogeneous networks, and select paths according to the specific task. The guidance from the author identification task for network embedding is provided both explicitly in joint training and implicitly during meta path selection. Our experiments demonstrate that by using path-augmented network embedding with task guidance, our model can obtain significantly better accuracy at identifying the true authors compared to existing methods.
【Keywords】: author identification; heterogeneous information networks; meta path; network embedding; path-augmented; task-guided
【Paper Link】 【Pages】:305-314
【Authors】: Honghao Wei ; Fuzheng Zhang ; Nicholas Jing Yuan ; Chuan Cao ; Hao Fu ; Xing Xie ; Yong Rui ; Wei-Ying Ma
【Abstract】: An incisive understanding of user personality is not only essential to many scientific disciplines, but also has a profound business impact on practical applications such as digital marketing, personalized recommendation, mental diagnosis, and human resources management. Previous studies have demonstrated that language usage in social media is effective in personality prediction. However, beyond single language features, a less researched direction is how to leverage the heterogeneous information on social media to achieve a better understanding of user personality. In this paper, we propose a Heterogeneous Information Ensemble framework, called HIE, to predict users' personality traits by integrating heterogeneous information including self-language usage, avatar, emoticon, and responsive patterns. In our framework, to improve the performance of personality prediction, we have designed different strategies for extracting semantic representations to fully leverage heterogeneous information on social media. We evaluate our methods with extensive experiments based on real-world data covering both personality survey results and social media usage from thousands of volunteers. The results reveal that our approaches significantly outperform several widely adopted state-of-the-art baseline methods. To examine the utility of HIE in a real-world interactive setting, we also present DiPsy, a personalized chatbot that predicts user personality from heterogeneous information in digital traces and conversation logs.
【Keywords】: big five; heterogeneous information; user personality
【Paper Link】 【Pages】:315-324
【Authors】: Marius Pasca
【Abstract】: Given an instance (Julieta Pinto), most methods for open-domain information extraction focus on acquiring knowledge in the form of either class labels (Costa Rican short story writers, Women novelists) referring to concepts to which the instance belongs; or facts (nationality: Costa Rica) connecting the instance (Julieta Pinto) to other instances or concepts (Costa Rica), where the fact and the other instance often take the form of an attribute (nationality) and a value (Costa Rica) respectively. From extraction through internal representation and storage, class labels and facts are treated as if they carved out disconnected slices within the larger space of factual knowledge. This paper argues that class labels and facts pertaining to an instance exist in symbiosis rather than as a dichotomy. A constituent (Costa Rican) within a class label (Costa Rican short story writers) of an instance may be indicative of a fact (nationality: Costa Rica) applicable to the instance and vice-versa. As an illustration of the relationship between class labels and facts, the paper introduces an open-domain method for the better understanding of the semantics of class labels in one of the larger and most widely-used repositories of knowledge, namely the categories in the Wikipedia category network. The method exploits the category network to associate constituents (Costa Rican) within names of Wikipedia categories, with attributes (nationality) that explain their role.
【Keywords】: attributes; class labels; facts; open-domain information extraction
【Paper Link】 【Pages】:325-334
【Authors】: Xiang Ren ; Yuanhua Lv ; Kuansan Wang ; Jiawei Han
【Abstract】: This paper presents a novel research problem, Comparative Document Analysis (CDA), that is, joint discovery of commonalities and differences between two individual documents (or two sets of documents) in a large text corpus. Given any pair of documents from a (background) document collection, CDA aims to automatically identify sets of quality phrases to summarize the commonalities of both documents and highlight the distinctions of each with respect to the other informatively and concisely. Our solution uses a general graph-based framework to derive novel measures on phrase semantic commonality and pairwise distinction, where the background corpus is used for computing phrase-document semantic relevance. We use the measures to guide the selection of sets of phrases by solving two joint optimization problems. A scalable iterative algorithm is developed to integrate the maximization of phrase commonality or distinction measure with the learning of phrase-document semantic relevance. Experiments on large text corpora from two different domains, scientific papers and news, demonstrate the effectiveness and robustness of the proposed framework on comparing documents. Analysis on a 10GB+ text corpus demonstrates the scalability of our method, whose computation time grows linearly as the corpus size increases. Our case study on comparing news articles published at different dates shows the power of the proposed method on comparing sets of documents.
【Keywords】: automatic summarization; commonalities; comparative document analysis; distinction; large-scale text processing; massive text corpus
【Paper Link】 【Pages】:335-344
【Authors】: Sendong Zhao ; Quan Wang ; Sean Massung ; Bing Qin ; Ting Liu ; Bin Wang ; ChengXiang Zhai
【Abstract】: In this paper, we formally define the problem of representing and leveraging abstract event causality to power downstream applications. We propose a novel solution to this problem, which builds an abstract causality network and embeds the network into a continuous vector space. The abstract causality network is generalized from a specific one, with abstract event nodes represented by frequently co-occurring word pairs. To perform the embedding task, we design a dual cause-effect transition model. The proposed method can therefore obtain general, frequent, and simple causality patterns while simplifying event matching. Given the causality network and the learned embeddings, our model can be applied to a wide range of applications such as event prediction, event clustering and stock market movement prediction. Experimental results demonstrate that 1) the abstract causality network is effective for discovering high-level causality rules behind specific causal events; 2) the embedding models perform better than state-of-the-art link prediction techniques in predicting events; and 3) the event causality embedding is an easy-to-use and sophisticated feature for downstream applications such as stock market movement prediction.
【Keywords】: causality; embedding methods; event causality network; event prediction; stock price movement prediction
【Paper Link】 【Pages】:345-354
【Authors】: David Tsurel ; Dan Pelleg ; Ido Guy ; Dafna Shahaf
【Abstract】: A significant portion of web search queries directly refers to named entities. Search engines explore various ways to improve the user experience for such queries. We suggest augmenting search results with trivia facts about the searched entity. Trivia is widely played throughout the world, and was shown to increase users' engagement and retention. Most random facts are not suitable for the trivia section. There is skill (and art) to curating good trivia. In this paper, we formalize a notion of trivia-worthiness and propose an algorithm that automatically mines trivia facts from Wikipedia. We take advantage of Wikipedia's category structure, and rank an entity's categories by their trivia-quality. Our algorithm is capable of finding interesting facts, such as Obama's Grammy or Elvis' stint as a tank gunner. In user studies, our algorithm captures the intuitive notion of "good trivia" 45% higher than prior work. Search-page tests show a 22% decrease in bounce rates and a 12% increase in dwell time, proving our facts hold users' attention.
【Keywords】: factoids; fun facts; serendipity; surprise; trivia
【Paper Link】 【Pages】:355-364
【Authors】: Cheng Li ; Michael Bendersky ; Vijay Garg ; Sujith Ravi
【Abstract】: We consider the problem of discovering local events on the web, where events are entities extracted from webpages. Examples of such local events include small venue concerts, farmers markets, sports activities, etc. Given an event entity, we propose a graph-based framework for retrieving a ranked list of related events that a user is likely to be interested in attending. Due to the difficulty of obtaining ground-truth labels for event entities, which are temporal and are constrained by location, our retrieval framework is unsupervised, and its graph-based formulation addresses (a) the challenge of feature sparseness and noisiness, and (b) the semantic mismatch problem in a self-contained and principled manner. To validate our methods, we collect human annotations and conduct a comprehensive empirical study, analyzing the performance of our methods with regard to relevance, recall, and diversity. This study shows that our graph-based framework is significantly better than any individual feature source, and can be further improved with minimal supervision.
【Keywords】: event discovery; event retrieval; schema.org entities; web events
【Paper Link】 【Pages】:365-374
【Authors】: Aasish Pappu ; Roi Blanco ; Yashar Mehdad ; Amanda Stent ; Kapil Thadani
【Abstract】: Text analytics systems often rely heavily on detecting and linking entity mentions in documents to knowledge bases for downstream applications such as sentiment analysis, question answering and recommender systems. A major challenge for this task is to be able to accurately detect entities in new languages with limited labeled resources. In this paper we present an accurate and lightweight, multilingual named entity recognition (NER) and linking (NEL) system. The contributions of this paper are three-fold: 1) Lightweight named entity recognition with competitive accuracy; 2) Candidate entity retrieval that uses search click-log data and entity embeddings to achieve high precision with a low memory footprint; and 3) efficient entity disambiguation. Our system achieves state-of-the-art performance on TAC KBP 2013 multilingual data and on English AIDA CONLL data.
【Keywords】: clustering entities; document processing; entity extraction; entity linking; natural language processing; unsupervised learning
【Paper Link】 【Pages】:375-383
【Authors】: Luis Galárraga ; Simon Razniewski ; Antoine Amarilli ; Fabian M. Suchanek
【Abstract】: Knowledge bases such as Wikidata, DBpedia, or YAGO contain millions of entities and facts. In some knowledge bases, the correctness of these facts has been evaluated. However, much less is known about their completeness, i.e., the proportion of real facts that the knowledge bases cover. In this work, we investigate different signals to identify the areas where a knowledge base is complete. We show that we can combine these signals in a rule mining approach, which allows us to predict where facts may be missing. We also show that completeness predictions can help other applications such as fact prediction.
【Keywords】: incompleteness; knowledge bases; quality; recall
【Paper Link】 【Pages】:385-394
【Authors】: Adi Omari ; Sharon Shoham ; Eran Yahav
【Abstract】: We address the problem of synthesizing a robust data-extractor from a family of websites that contain the same kind of information. This problem is common when trying to aggregate information from many web sites, for example, when extracting information for a price-comparison site. Given a set of example annotated web pages from multiple sites in a family, our goal is to synthesize a robust data extractor that performs well on all sites in the family (not only on the provided example pages). The main challenge is the need to trade off precision for generality and robustness. Our key contribution is the introduction of forgiving extractors that dynamically adjust their precision to handle structural changes, without sacrificing precision on the training set. Our approach uses decision tree learning to create a generalized extractor and converts it into a forgiving extractor, in the form of an XPath query. The forgiving extractor captures a series of pruned decision trees with monotonically decreasing precision and monotonically increasing recall, and dynamically adjusts precision to guarantee sufficient recall. We have implemented our approach in a tool called TREEX and applied it to synthesize extractors for large-scale real-world web sites. We evaluate the robustness and generality of the forgiving extractors by evaluating their precision and recall on: (i) different pages from sites in the training set; (ii) pages from different versions of sites in the training set; and (iii) pages from different (unseen) sites. We compare the results of our synthesized extractor to those of classifier-based and pattern-based extractors, and show that TREEX significantly improves extraction accuracy.
【Keywords】: data extraction; data mining; web data extraction; wrapper induction; wrappers
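The following minimal sketch illustrates the forgiving-extractor idea: a cascade of extraction rules ordered from most to least precise, falling back until recall is sufficient. For self-containment it uses regular expressions on a hypothetical price-listing page, whereas the paper's extractors are pruned decision trees compiled into XPath queries.

```python
import re

def strict(html):
    # Highest-precision rule: exact class match (illustrative pattern).
    return re.findall(r'<span class="price">([^<]+)</span>', html)

def relaxed(html):
    # Pruned, more forgiving rule: any tag whose class mentions "price".
    return re.findall(r'<\w+ class="[^"]*price[^"]*">([^<]+)</\w+>', html)

def forgiving_extract(html, extractors=(strict, relaxed), min_hits=1):
    """Try extractors from most to least precise; fall back until recall
    is sufficient (here: at least min_hits matches)."""
    for extract in extractors:
        hits = extract(html)
        if len(hits) >= min_hits:
            return hits
    return []

# A page whose markup drifted from the training set: the strict rule fails,
# but the relaxed rule still recovers the value.
page = '<div class="offer-price">19.99</div>'
print(forgiving_extract(page))  # ['19.99']
```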
【Paper Link】 【Pages】:395-403
【Authors】: Pengwei Wang ; Yong Zhang ; Lei Ji ; Jun Yan ; Lianwen Jin
【Abstract】: Question retrieval, which aims to find questions similar to a given question, plays a pivotal role in various question answering (QA) systems. This task is quite challenging, mainly in three aspects: lexical gap, polysemy and word order. In this paper, we propose a unified framework to simultaneously handle these three problems. We use words combined with corresponding concept information to handle the polysemy problem. The concept embedding and word embedding are learned at the same time from both a context-dependent and a context-independent view. The lexical gap problem is handled since the semantic information has been encoded into the embedding. We then propose a high-level feature-embedded convolutional semantic model to learn the question embedding from the concept embedding and word embedding without manually labeled training data. The proposed framework nicely represents the hierarchical structure of word and concept information in sentences through layer-by-layer composition and pooling. Finally, the framework is trained in a weakly supervised manner on question-answer pairs, which can be obtained directly without manual labeling. Experiments on two real question answering datasets show that the proposed framework can significantly outperform the state-of-the-art solutions.
【Keywords】: concept embedding; question embedding; question retrieval
【Paper Link】 【Pages】:405-414
【Authors】: Hongya Song ; Zhaochun Ren ; Shangsong Liang ; Piji Li ; Jun Ma ; Maarten de Rijke
【Abstract】: We aim at summarizing answers in community question-answering (CQA). While most previous work focuses on factoid question-answering, we focus on non-factoid question-answering. Unlike factoid CQA, non-factoid question-answering usually requires passages as answers. The shortness, sparsity and diversity of answers pose interesting challenges for summarization. To tackle these challenges, we propose a sparse coding-based summarization strategy that includes three core ingredients: short document expansion, sentence vectorization, and a sparse-coding optimization framework. Specifically, we extend each answer in a question-answering thread to a more comprehensive representation via entity linking and sentence ranking strategies. From answers extended in this manner, each sentence is represented as a feature vector trained from a short text convolutional neural network model. We then use these sentence representations to estimate the saliency of candidate sentences via a sparse-coding framework that jointly considers candidate sentences and Wikipedia sentences as reconstruction items. Given the saliency vectors for all candidate sentences, we extract sentences to generate an answer summary based on a maximal marginal relevance algorithm. Experimental results on a benchmark data collection confirm the effectiveness of our proposed method in answer summarization of non-factoid CQA, and moreover, its significant improvement over state-of-the-art baselines in terms of ROUGE metrics.
【Keywords】: community question-answering; document summarization; short text processing; sparse coding
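The final extraction step above uses maximal marginal relevance; the sketch below shows that greedy selection in isolation, assuming saliency scores (e.g., from the sparse-coding step) and a sentence similarity matrix are already available. All inputs are toy values.

```python
import numpy as np

def mmr_select(candidates, saliency, sim, k=3, lam=0.5):
    """Greedy Maximal Marginal Relevance: trade off saliency against
    redundancy with already-selected sentences."""
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * saliency[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]

sentences = ["A", "B", "C"]   # stand-ins for candidate answer sentences
saliency = [0.9, 0.85, 0.3]   # e.g., produced by the sparse-coding step
sim = np.array([[1.0, 0.95, 0.1],
                [0.95, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
print(mmr_select(sentences, saliency, sim, k=2))  # ['A', 'C']: B is redundant
```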
【Paper Link】 【Pages】:415-424
【Authors】: Jong-Hoon Oh ; Kentaro Torisawa ; Canasai Kruengkrai ; Ryu Iida ; Julien Kloetzer
【Abstract】: Why-question answering (why-QA) is a task to retrieve answers (or answer passages) to why-questions (e.g., "why are tsunamis generated?") from a text archive. Several previously proposed methods for why-QA improved their performance by automatically recognizing causalities that are expressed with explicit cues such as "because" in answer passages and using the recognized causalities as a clue for finding proper answers. However, in answer passages, causalities might be expressed only implicitly (i.e., without any explicit cues): "An earthquake suddenly displaced sea water and a tsunami was generated." Previous works did not deal with such implicitly expressed causalities and failed to find proper answers that included them. We improve why-QA based on the following two ideas. First, implicitly expressed causalities in one text might be expressed in other texts with explicit cues. If we can automatically recognize such explicitly expressed causalities from a text archive and use them to complement the implicitly expressed causalities in an answer passage, we can improve why-QA. Second, the causes of similar events tend to be described with a similar set of words (e.g., "seismic energy" and "tectonic plates" for "the Great East Japan Earthquake" and "the 1906 San Francisco Earthquake"). As such, even if we cannot find in a text archive any explicitly expressed cause of an event (e.g., "the Great East Japan Earthquake") expressed in a question (e.g., "Why did the Great East Japan earthquake happen?"), we might be able to identify its implicitly expressed causes with a set of words (e.g., "tectonic plates") that appear in the explicitly expressed cause of a similar event (e.g., "the 1906 San Francisco Earthquake"). We implemented these two ideas in our multi-column convolutional neural networks with a novel attention mechanism, which we call causality attention. Through experiments on Japanese why-QA, we confirmed that our proposed method outperformed state-of-the-art systems.
【Keywords】: causality; convolutional neural network; neural attention; question answering; why-question answering
【Paper Link】 【Pages】:425-434
【Authors】: Lei Zheng ; Vahid Noroozi ; Philip S. Yu
【Abstract】: A large amount of information exists in reviews written by users. This source of information has been ignored by most current recommender systems, even though it can potentially alleviate the sparsity problem and improve the quality of recommendations. In this paper, we present a deep model to learn item properties and user behaviors jointly from review text. The proposed model, named Deep Cooperative Neural Networks (DeepCoNN), consists of two parallel neural networks coupled in the last layers. One of the networks focuses on learning user behaviors by exploiting reviews written by the user, and the other one learns item properties from the reviews written for the item. A shared layer is introduced on top to couple these two networks together. The shared layer enables latent factors learned for users and items to interact with each other in a manner similar to factorization machine techniques. Experimental results demonstrate that DeepCoNN significantly outperforms all baseline recommender systems on a variety of datasets.
【Keywords】: convolutional neural networks; deep learning; rating prediction; recommender systems
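A heavily simplified sketch of the two-tower idea follows: each tower here is just an embedding average plus a linear projection standing in for the paper's convolutional text processors, and the shared layer couples the two factor vectors through an FM-style inner product. All shapes and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMB, FACTORS = 100, 16, 8

emb = rng.normal(scale=0.1, size=(VOCAB, EMB))       # shared word embeddings
W_user = rng.normal(scale=0.1, size=(EMB, FACTORS))  # user tower (CNN stand-in)
W_item = rng.normal(scale=0.1, size=(EMB, FACTORS))  # item tower (CNN stand-in)
bias = 3.5                                           # global rating bias

def tower(word_ids, W):
    # Stand-in for a text CNN: average the word embeddings, then project.
    return emb[word_ids].mean(axis=0) @ W

def predict_rating(user_review_ids, item_review_ids):
    x_u = tower(user_review_ids, W_user)  # latent factors from the user's reviews
    y_i = tower(item_review_ids, W_item)  # latent factors from the item's reviews
    # Shared layer: FM-style second-order interaction of the two factor vectors.
    return bias + float(x_u @ y_i)

print(predict_rating([1, 5, 9], [2, 5, 7]))  # untrained, so close to the bias
```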
【Paper Link】 【Pages】:435-443
【Authors】: Qi Zhao ; Yongfeng Zhang ; Yi Zhang ; Daniel Friedman
【Abstract】: Basic economic relations such as substitutability and complementarity between products are crucial for recommendation tasks, since the utility of one product may depend on whether or not other products are purchased. For example, the utility of a camera lens could be high if the user possesses the right camera (complementarity), while the utility of another camera could be low because the user has already purchased one (substitutability). We propose multi-product utility maximization (MPUM) as a general approach to recommendation driven by economic principles. MPUM integrates the economic theory of consumer choice with personalized recommendation, and focuses on the utility of sets of products for individual users. MPUM considers what the users already have when recommending additional products. We evaluate MPUM against several popular recommendation algorithms on two real-world E-commerce datasets. Results confirm the underlying economic intuition, and show that MPUM significantly outperforms the comparison algorithms under top-K evaluation metrics.
【Keywords】: collaborative filtering; computational economics; recommender systems; utility maximization
【Paper Link】 【Pages】:445-453
【Authors】: Shay Ben-Elazar ; Gal Lavee ; Noam Koenigstein ; Oren Barkan ; Hilik Berezin ; Ulrich Paquet ; Tal Zaccai
【Abstract】: This paper describes an algorithm designed for Microsoft's Groove music service, which serves millions of users worldwide. We consider the problem of automatically generating personalized music playlists based on queries containing a "seed" artist and the listener's user ID. Playlist generation may be informed by a number of information sources including: user specific listening patterns, domain knowledge encoded in a taxonomy, acoustic features of audio tracks, and overall popularity of tracks and artists. The importance assigned to each of these information sources may vary depending on the specific combination of user and seed artist. The paper presents a method based on a variational Bayes solution for learning the parameters of a model containing a four-level hierarchy of global preferences, genres, sub-genres and artists. The proposed model further incorporates a personalization component for user-specific preferences. Empirical evaluations on both proprietary and public datasets demonstrate the effectiveness of the algorithm and showcase the contribution of each of its components.
【Keywords】: personalization; playlist generation; radio; recommendation; variational inference
【Paper Link】 【Pages】:455-464
【Authors】: Wenchao Yu ; Charu C. Aggarwal ; Wei Wang
【Abstract】: The problem of evolutionary network analysis has gained increasing attention in recent years, because of the increasing number of networks encountered in temporal settings. For example, social networks, communication networks, and information networks continuously evolve over time, and it is desirable to learn interesting trends about how their structure evolves. One challenging aspect of networks is that they are inherently resistant to parametric modeling that would directly express the edges of the network as functions of time. This is because, unlike multidimensional data, the edges in a network reflect interactions among nodes, and it is difficult to model an edge as a function of time independently, without taking into account its correlations and interactions with neighboring edges. Fortunately, we show that it is indeed possible to achieve this goal with a matrix factorization in which the entries are parameterized by time. This approach allows us to represent the edge structure of the network purely as a function of time, and to predict the evolution of the network over time. This opens the possibility of using the approach for a wide variety of temporal network analysis problems, such as predicting future trends in structures, predicting links, and node-centric anomaly/event detection. This flexibility stems from the general way in which the approach expresses the structure of the network as a function of time. We present experimental results on a number of temporal data sets showing the effectiveness of the approach.
【Keywords】: anomaly detection; evolutionary network analysis; link prediction; temporal matrix factorization
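A minimal sketch of factorizing a temporal network with time-parameterized entries follows: each node's latent factors are a base vector plus a linear drift, trained by SGD on time-stamped edge observations. The linear drift, the toy data, and the hyperparameters are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 2
U0, U1 = rng.normal(scale=0.1, size=(2, n, k))  # node factors: base + drift

def factors(t):
    # Factors parameterized by time; a linear drift is the simplest choice.
    return U0 + t * U1

def predict(i, j, t):
    u = factors(t)
    return float(u[i] @ u[j])  # predicted edge strength at time t

# Observed (i, j, t, weight) interactions across snapshots (toy data).
obs = [(0, 1, 0.0, 1.0), (0, 1, 1.0, 0.2), (2, 3, 0.5, 0.8)]

lr = 0.05
for _ in range(500):  # SGD on squared error of time-stamped edges
    for i, j, t, a in obs:
        u = factors(t)
        err = u[i] @ u[j] - a
        gi, gj = err * u[j], err * u[i]
        U0[i] -= lr * gi; U1[i] -= lr * t * gi
        U0[j] -= lr * gj; U1[j] -= lr * t * gj

# The same edge gets different predictions at different times (~1.0, ~0.2).
print(round(predict(0, 1, 0.0), 2), round(predict(0, 1, 1.0), 2))
```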
【Paper Link】 【Pages】:465-474
【Authors】: Xinran He ; Yan Liu
【Abstract】: Network inference, i.e., discovering latent diffusion networks from observed cascades, has been studied extensively in recent years, leading to a series of excellent work. However, it has been observed that the accuracy of existing methods deteriorates significantly when the number of cascades is limited (compared with the large number of nodes), which is the norm in real-world applications. Meanwhile, we are able to collect cascades on many different topics or over a long time period: the associated influence networks (either topic-specific or time-specific) are highly correlated while the number of cascade observations associated with each network is very limited. In this work, we propose a generative model, referred to as the MultiCascades model (MCM), to address the challenge of data scarcity by exploring the commonality between multiple related diffusion networks. MCM builds a hierarchical graphical model, where all the diffusion networks share the same network prior, e.g., the popular stochastic blockmodels or latent space models. The parameters of the network priors can be effectively learned by gleaning evidence from a large number of inferred networks. In return, each individual network can be inferred more accurately thanks to the prior information. Furthermore, we develop efficient inference and learning algorithms so that MCM is scalable for practical applications. The results on both synthetic and real-world datasets demonstrate that MCM infers both topic-specific and time-varying diffusion networks more accurately.
【Keywords】: graph prior; network inference; social influence
【Paper Link】 【Pages】:475-484
【Authors】: Jing Wang ; Jie Shen ; Ping Li ; Huan Xu
【Abstract】: This work studies the binary matrix completion problem, which underlies a large body of real-world applications such as signed link prediction and information propagation. Each entry of the matrix indicates a binary preference such as "like" or "dislike", "trust" or "distrust". However, the performance of existing matrix completion methods may be hindered by three practical challenges: 1) the observed data carry binary labels (i.e., not real values); 2) the data are typically sampled non-uniformly (i.e., positive links dominate negative ones); and 3) a network may have a huge volume of data (i.e., memory and computational issues). To remedy these problems, we propose a novel framework which (i) maximizes the resemblance between the predicted and observed matrices while penalizing a logistic loss to fit the binary data and produce binary estimates; (ii) constrains the matrix max-norm and maximizes the F-score to handle non-uniformness; and (iii) uses an online optimization technique, hence mitigating the memory cost. Extensive experiments performed on four large-scale datasets with up to hundreds of thousands of users demonstrate the superiority of our framework over state-of-the-art matrix completion based methods and popular link prediction approaches.
【Keywords】: link prediction; low-rank; matrix completion; online learning
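The sketch below implements only the logistic-loss core of such a framework: SGD matrix factorization on signed observations with a probability readout. The max-norm constraint, F-score maximization, and online machinery of the paper are omitted, and all data are toy values.

```python
import numpy as np

rng = np.random.default_rng(2)
n_users, n_items, k = 20, 15, 3
U = rng.normal(scale=0.1, size=(n_users, k))
V = rng.normal(scale=0.1, size=(n_items, k))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Observed signed entries: (user, item, label in {+1, -1}); toy sample.
obs = [(0, 1, +1), (0, 2, -1), (3, 1, +1), (4, 5, -1)]

lr, reg = 0.1, 0.01
for _ in range(200):  # SGD on the logistic loss over observed entries
    for i, j, y in obs:
        # d/dx log(1 + exp(-y*x)) = -y * sigmoid(-y*x)
        g = -y * sigmoid(-y * (U[i] @ V[j]))
        U[i] -= lr * (g * V[j] + reg * U[i])
        V[j] -= lr * (g * U[i] + reg * V[j])

# Probability that an unobserved pair carries a positive link.
print("P(+1) for (3, 2):", round(sigmoid(U[3] @ V[2]), 3))
```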
【Paper Link】 【Pages】:485-494
【Authors】: Zhaochun Ren ; Shangsong Liang ; Piji Li ; Shuaiqiang Wang ; Maarten de Rijke
【Abstract】: A recommendation is called explainable if it not only predicts a numerical rating for an item, but also generates explanations for users' preferences. Most existing methods for explainable recommendation apply topic models to analyze user reviews to provide descriptions along with the recommendations they produce. So far, such methods have neglected user opinions and influences from social relations as a source of information for recommendations, even though these are known to improve the rating prediction. In this paper, we propose a latent variable model, called social collaborative viewpoint regression (sCVR), for predicting item ratings based on user opinions and social relations. To this end, we use so-called viewpoints, represented as tuples of a concept, a topic, and a sentiment label, drawn from both user reviews and trusted social relations. In addition, such viewpoints can be used as explanations. We apply a Gibbs EM sampler to infer posterior distributions of sCVR. Experiments conducted on three large benchmark datasets show the effectiveness of our proposed method for predicting item ratings and for generating explanations.
【Keywords】: recommender systems; topic modeling; trusted social relations; user comment analysis
【Paper Link】 【Pages】:495-503
【Authors】: Chao-Yuan Wu ; Amr Ahmed ; Alex Beutel ; Alexander J. Smola ; How Jing
【Abstract】: Recommender systems traditionally assume that user profiles and movie attributes are static. Temporal dynamics are purely reactive, that is, they are inferred after they are observed, e.g. after a user's taste has changed or based on hand-engineered temporal bias corrections for movies. We propose Recurrent Recommender Networks (RRN) that are able to predict future behavioral trajectories. This is achieved by endowing both users and movies with a Long Short-Term Memory (LSTM) autoregressive model that captures dynamics, in addition to a more traditional low-rank factorization. On multiple real-world datasets, our model offers excellent prediction accuracy and it is very compact, since we need not learn latent state but rather just the state transition function.
【Keywords】: recommender systems; recurrent neural networks
【Paper Link】 【Pages】:505-514
【Authors】: Jérémie Rappaz ; Maria-Luiza Vladarean ; Julian McAuley ; Michele Catasta
【Abstract】: Bartering is a timeless practice that is becoming increasingly popular on the Web. Recommending trades for an online bartering platform shares many similarities with traditional approaches to recommendation, in particular the need to model the preferences of users and the properties of the items they consume. However, there are several aspects that make bartering problems interesting and challenging, specifically the fact that users are both suppliers and consumers, and that the trading environment is highly dynamic. Thus, a successful model of bartering requires us to understand not just users' preferences, but also the social dynamics of who trades with whom, and the temporal dynamics of when trades occur. We propose new models for bartering-based recommendation, for which we introduce three novel datasets from online bartering platforms. Surprisingly, we find that existing methods (based on matching algorithms) perform poorly on real-world platforms, as they rely on idealized assumptions that are not supported by real bartering data. We develop approaches based on Matrix Factorization in order to model the reciprocal interest between users and each other's items. We also find that the social ties between members have a strong influence, as does the time at which they trade, therefore we extend our model to be socially- and temporally-aware. We evaluate our approach on trades covering books, video games, and beers, where we obtain promising empirical performance compared to existing techniques.
【Keywords】: barter; collaborative filtering; exchange; matrix factorization; reciprocity; social dynamics; swap; temporal dynamics
【Paper Link】 【Pages】:515-524
【Authors】: How Jing ; Alexander J. Smola
【Abstract】: The ability to predict future user activity is invaluable when it comes to content recommendation and personalization. For instance, knowing when users will return to an online music service and what they will listen to increases user satisfaction and therefore user retention. We present a model based on Long Short-Term Memory to estimate when a user will return to a site and what their future listening behavior will be. In doing so, we aim to solve the problem of Just-In-Time recommendation, that is, to recommend the right items at the right time. We use tools from survival analysis for return time prediction and exponential families for future activity analysis. We show that the resulting multitask problem can be solved accurately, when applied to two real-world datasets.
【Keywords】: behavior modeling; neural network; recommendation; recurrent network; survival analysis
【Paper Link】 【Pages】:525-533
【Authors】: Ios Kotsogiannis ; Elena Zheleva ; Ashwin Machanavajjhala
【Abstract】: Recommender systems have become ubiquitous in online applications where companies personalize the user experience based on explicit or inferred user preferences. Most modern recommender systems concentrate on finding relevant items for each individual user. In this paper, we describe the problem of directed edge recommendations where the system recommends the best item that a user can gift, share or recommend to another user that he/she is connected to. We propose algorithms that utilize the preferences of both the sender and the recipient by integrating individual user preference models (e.g., based on items each user purchased for themselves) with models of sharing preferences (e.g., gift purchases for others) into the recommendation process. We compare our work to group recommender systems and social network edge labeling, showing that incorporating the task context leads to more accurate recommendations.
【Keywords】: directed edge recommenders; recommender systems; social networks
【Paper Link】 【Pages】:535
【Authors】: Ralf Herbrich
【Abstract】: In this talk I will give an introduction into the field of machine learning and discuss why it is a crucial technology for Amazon. Machine learning is the science of automatically extracting patterns from data in order to make automated predictions of future data. One way to differentiate machine learning tasks is by the following two factors: (1) How much noise is contained in the data? and (2) How far into the future is the prediction task? The former presents a limit to the learnability of the task --- regardless of which learning algorithm is used --- whereas the latter has a crucial implication on the representation of the predictions: while most tasks in search and advertising typically only forecast minutes into the future, tasks in e-commerce can require predictions up to a year into the future. The further the forecast horizon, the more important it is to take account of uncertainty in both the learning algorithm and the representation of the predictions. I will discuss which learning frameworks are best suited for the various scenarios, that is, short-term predictions with little noise vs. long-term predictions with lots of noise, and present some ideas to combine representation learning with probabilistic methods. In the second half of the talk, I will give an overview of the applications of machine learning at Amazon, ranging from demand forecasting and machine translation to the automation of computer vision tasks and robotics. I will also discuss the importance of tools for data scientists and share learnings on bringing machine learning algorithms into products.
【Keywords】: ecommerce; machine learning
【Paper Link】 【Pages】:537-546
【Authors】: Tim Althoff ; Pranav Jindal ; Jure Leskovec
【Abstract】: Many of today's most widely used computing applications utilize social networking features and allow users to connect, follow each other, share content, and comment on others' posts. However, despite the widespread adoption of these features, there is little understanding of the consequences that social networking has on user retention, engagement, and online as well as offline behavior. Here, we study how social networks influence user behavior in a physical activity tracking application. We analyze 791 million online and offline actions of 6 million users over the course of 5 years, and show that social networking leads to a significant increase in users' online as well as offline activities. Specifically, we establish a causal effect of how social networks influence user behavior. We show that the creation of new social connections increases user online in-application activity by 30%, user retention by 17%, and user offline real-world physical activity by 7% (about 400 steps per day). By exploiting a natural experiment we distinguish the effect of social influence of new social connections from the simultaneous increase in user's motivation to use the app and take more steps. We show that social influence accounts for 55% of the observed changes in user behavior, while the remaining 45% can be explained by the user's increased motivation to use the app. Further, we show that subsequent, individual edge formations in the social network lead to significant increases in daily steps. These effects diminish with each additional edge and vary based on edge attributes and user demographics. Finally, we utilize these insights to develop a model that accurately predicts which users will be most influenced by the creation of new social network connections.
【Keywords】: exercise; health; influence; physical activity; social features; social influence; social networks; susceptibility; user behavior
【Paper Link】 【Pages】:547-556
【Authors】: Guangde Chen ; Bee-Chung Chen ; Deepak Agarwal
【Abstract】: Most online social networks provide a mechanism for users to broadcast messages to their personalized network through actions like shares, likes and tweets. Receiving positive feedback from the network, such as likes, comments and retweets in response to such actions, can provide a strong incentive for users to broadcast more often in the future. We call such feedback by the network, which influences a user to perform certain desirable future actions, social incentives. For example, after a user shares an article to her social network, receiving positive feedback such as a "like" from a friend can potentially encourage her to continue sharing more regularly. Typically, for every user's visit to an online social network site, good messages need to be ranked and selected by a recommender system from a large set of candidate messages (broadcasted by the user's network). In this paper, we propose a novel recommendation problem: How should we recommend messages to users to incentivize neighbors in their personal network to perform desirable actions in the future with high likelihood, without significantly hurting overall engagement for the entire system? For instance, messages could be content shared by neighbors. The goal in this case would be to encourage more content shares in the future. We call this problem social incentive optimization and study an instance of it for LinkedIn's news feed. We observe that a user who receives positive social feedback from neighbors has a higher likelihood of broadcasting more frequently. Using this observation, we develop a novel recommendation framework that incentivizes users to broadcast more often, without significantly hurting overall feed engagement. We demonstrate the effectiveness of our approach through causal analysis on retrospective data and online A/B experiments.
【Keywords】: causal data analysis; constrained optimization; counterfactual models; doubly robust estimation; sharer retention; social incentive optimization; social networking
【Paper Link】 【Pages】:557-566
【Authors】: Marco Bressan ; Flavio Chierichetti ; Ravi Kumar ; Stefano Leucci ; Alessandro Panconesi
【Abstract】: Counting graphlets is a well-studied problem in graph mining and social network analysis. Recently, several papers explored very simple and natural approaches based on Monte Carlo sampling of Markov Chains (MC), and reported encouraging results. We show, perhaps surprisingly, that this approach is outperformed by a carefully engineered version of color coding (CC) [1], a sophisticated algorithmic technique that we extend to the case of graphlet sampling and for which we prove strong statistical guarantees. Our computational experiments on graphs with millions of nodes show CC to be more accurate than MC. Furthermore, we formally show that the mixing time of the MC approach is too high in general, even when the input graph has high conductance. All this comes at a price however. While MC is very efficient in terms of space, CC's memory requirements become demanding when the size of the input graph and that of the graphlets grow. And yet, our experiments show that a careful implementation of CC can push the limits of the state of the art, both in terms of the size of the input graph and of that of the graphlets.
【Keywords】: color coding; graph mining; graphlet counting; random walk; subgraph enumeration; subgraph sampling
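A compact illustration of the color-coding technique on the simplest graphlet, paths on k vertices: color nodes uniformly at random, count "colorful" paths exactly by dynamic programming over color subsets, and rescale by the probability k!/k^k that any fixed path is colorful. The graph and trial count below are toy choices; the paper engineers this idea for general graphlets at far larger scale.

```python
import random
from collections import defaultdict

def colorful_path_count(adj, k, trials=200, seed=0):
    """Estimate the number of simple k-vertex paths via color coding."""
    rng = random.Random(seed)
    fact = 1
    for i in range(2, k + 1):
        fact *= i
    hit_prob = fact / k**k          # P(a fixed path gets k distinct colors)
    total = 0
    for _ in range(trials):
        color = {v: rng.randrange(k) for v in adj}
        # dp[(v, S)] = colorful paths ending at v whose color set is S
        dp = defaultdict(int)
        for v in adj:
            dp[(v, 1 << color[v])] = 1
        for _ in range(k - 1):      # extend paths one vertex at a time
            new = defaultdict(int)
            for (v, S), c in dp.items():
                for w in adj[v]:
                    if not S & (1 << color[w]):
                        new[(w, S | (1 << color[w]))] += c
            dp = new
        total += sum(dp.values())   # every surviving path uses all k colors
    # Each undirected path is counted once per direction, hence the /2.
    return total / trials / hit_prob / 2

# The path graph 0-1-2-3 contains exactly two paths on 3 vertices.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(round(colorful_path_count(adj, 3), 1))  # close to 2.0
```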
【Paper Link】 【Pages】:567-575
【Authors】: Fuzhen Zhuang ; Dan Luo ; Nicholas Jing Yuan ; Xing Xie ; Qing He
【Abstract】: Recent decades have witnessed a vast amount of interest and research in recommendation systems. Collaborative filtering, which uses the known preferences of a group of users to make recommendations or predictions of the unknown preferences for other users, is one of the most successful approaches to building recommendation systems. Most previous collaborative filtering approaches employ matrix factorization techniques to learn latent user feature profiles and item feature profiles. Many subsequent works incorporate users' social network information and items' attributes to further improve recommendation performance under the matrix factorization framework. However, matrix factorization based methods may not make full use of the rating information, leading to unsatisfying performance. Recently, deep learning has proven able to find good representations in natural language processing, image classification, and so on. Along this line, we propose a collaborative ranking framework via representation learning with pair-wise constraints (REAP for short), in which an autoencoder is used to simultaneously learn the latent factors of both users and items, and a pair-wise ranking loss defined by (user, item) pairs is considered. Extensive experiments are conducted on five data sets to demonstrate the effectiveness of the proposed framework.
【Keywords】: autoencoder; collaborative ranking; pair-wise constraints; representation learning
【Paper Link】 【Pages】:577
【Authors】: Steve Young
【Abstract】: This talk will review the principal components of a spoken dialogue system and then discuss the opportunities for applying machine learning for building robust high performance open-domain systems. The talk will be illustrated by recent work at Cambridge University using machine learning for belief tracking, reward estimation, multi-domain policy learning and natural language generation. The talk will conclude by discussing some of the key challenges in scaling these solutions to work in practical systems.
【Keywords】: gaussian processes; machine learning; neural networks; reinforcement learning; spoken dialogue systems
【Paper Link】 【Pages】:579
【Authors】: Anjali Joshi
【Abstract】: Internet search has become the first stop in many users' health journeys. Today, about 1 in 20 Google searches are related to healthcare. These queries span a broad range of information needs as people are looking for possible conditions related to their symptoms and are seeking to understand their diagnoses and prescribed treatments, decipher their test results, find pathways of self-care as well as connect to people with similar experiences. This talk will cover the approaches that we at Google have used to meet these diverse user needs. We will also discuss how we constructed and curated the Health Knowledge Graph, a large scale resource of highly accurate medical knowledge that powers many of our health applications. In the second part of the talk, we will show how the confluence of advances in technology enables us to revolutionize health data collection and perform it at unprecedented scale and granularity. Combined with contextual signals, anonymous aggregated user activity can be used to quantify public health phenomena and provide concerned authorities with actionable information about seasonal or situational health issues. We will conclude the talk with an outline of research directions that could enable people and organizations in personal and public health settings obtain actionable information in a timely manner. The work presented here was the product of collaboration of multiple teams at Google.
【Keywords】: healthcare; web search
【Paper Link】 【Pages】:581-590
【Authors】: Haifeng Zhang ; Weinan Zhang ; Yifei Rong ; Kan Ren ; Wenxin Li ; Jun Wang
【Abstract】: In this paper, we deal with the uncertainty of bidding for display advertising. Similar to the financial market trading, real-time bidding (RTB) based display advertising employs an auction mechanism to automate the impression level media buying; and running a campaign is no different than an investment of acquiring new customers in return for obtaining additional converted sales. Thus, how to optimally bid on an ad impression to drive the profit and return-on-investment becomes essential. However, the large randomness of the user behaviors and the cost uncertainty caused by the auction competition may result in a significant risk from the campaign performance estimation. In this paper, we explicitly model the uncertainty of user click-through rate estimation and auction competition to capture the risk. We borrow an idea from finance and derive the value at risk for each ad display opportunity. Our formulation results in two risk-aware bidding strategies that penalize risky ad impressions and focus more on the ones with higher expected return and lower risk. The empirical study on real-world data demonstrates the effectiveness of our proposed risk-aware bidding strategies: yielding profit gains of 15.4% in offline experiments and up to 17.5% in an online A/B test on a commercial RTB platform over the widely applied bidding strategies.
【Keywords】: demand-side platform; display advertising; real-time bidding; risk-aware bidding strategy; value at risk
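A one-variable simplification of the value-at-risk idea follows: bid on a lower confidence quantile of the impression's value rather than its mean, so impressions with uncertain CTR estimates are penalized. The paper models uncertainty in both CTR estimation and auction competition; the Gaussian quantile and all numbers here are assumptions.

```python
def risk_aware_bid(ctr_mean, ctr_std, action_value, alpha=1.645):
    """Bid a value-at-risk style lower bound on the impression's value:
    a lower confidence bound on CTR (alpha=1.645 is the one-sided 95%
    normal quantile) times the value of a click/conversion."""
    return max(ctr_mean - alpha * ctr_std, 0.0) * action_value

# Same expected CTR, different uncertainty: the riskier impression
# receives a much lower bid.
print(risk_aware_bid(0.02, 0.002, action_value=100.0))  # ~1.67
print(risk_aware_bid(0.02, 0.010, action_value=100.0))  # ~0.36
```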
【Paper Link】 【Pages】:591-600
【Authors】: Jinyoung Yeo ; Sungchul Kim ; Eunyee Koh ; Seung-won Hwang ; Nedim Lipka
【Abstract】: Generally, 2% of shoppers make a purchase on their first visit to an online store, while the other 98% only enjoy window-shopping. To bring people back to the store and close the deal, "retargeting" has been a vital online advertising strategy that leads to the "conversion" of window-shoppers into buyers. Since retargeting is more effective as a focused tool, in this paper we study the problem of identifying the conversion rate for a given product and its current customers, which is an important analytics metric for the retargeting process. In contrast to existing approaches that use either customer- or product-level conversion patterns, we propose a joint modeling of both levels of patterns based on the well-studied buying decision process. To evaluate the effectiveness of our method, we perform extensive experiments on a simulated dataset generated from a set of real-world web logs. The evaluation results show that conversion predictions by our approach are consistently more accurate and robust than those by existing baselines in a dynamic market environment.
【Keywords】: customer model; e-commerce; retargeting; sales prediction
【Paper Link】 【Pages】:601-610
【Authors】: Ashwin Paranjape ; Austin R. Benson ; Jure Leskovec
【Abstract】: Networks are a fundamental tool for modeling complex systems in a variety of domains including social and communication networks as well as biology and neuroscience. The counts of small subgraph patterns in networks, called network motifs, are crucial to understanding the structure and function of these systems. However, the role of network motifs for temporal networks, which contain many timestamped links between nodes, is not well understood. Here we develop a notion of a temporal network motif as an elementary unit of temporal networks and provide a general methodology for counting such motifs. We define temporal network motifs as induced subgraphs on sequences of edges, design several fast algorithms for counting temporal network motifs, and prove their runtime complexity. We also show that our fast algorithms achieve 1.3x to 56.5x speedups compared to a baseline method. We use our algorithms to count temporal network motifs in a variety of real-world datasets. Results show that networks from different domains have significantly different motif frequencies, whereas networks from the same domain tend to have similar motif frequencies. We also find that measuring motif counts at various time scales reveals different behavior.
【Keywords】: motifs; temporal networks
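As a minimal concrete instance, the sketch below counts one simple two-edge temporal motif: an edge u->v followed by the reply v->u within delta time units. The data are toy values; the paper's algorithms handle general multi-edge motifs far more efficiently.

```python
from bisect import bisect_right
from collections import defaultdict

def count_reciprocated(edges, delta):
    """Count u->v edges answered by v->u within delta time units.
    `edges` is a list of (u, v, t) with timestamps in any order."""
    by_pair = defaultdict(list)
    for u, v, t in edges:
        by_pair[(u, v)].append(t)
    for ts in by_pair.values():
        ts.sort()
    count = 0
    for (u, v), ts in by_pair.items():
        replies = by_pair.get((v, u), [])
        for t in ts:
            # count replies strictly after t and no later than t + delta
            count += bisect_right(replies, t + delta) - bisect_right(replies, t)
    return count

edges = [(0, 1, 1.0), (1, 0, 2.0), (1, 0, 9.0), (2, 3, 4.0)]
print(count_reciprocated(edges, delta=10.0))  # 2: both replies fall in window
```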
【Paper Link】 【Pages】:611-620
【Authors】: Zhenni Feng ; Yanmin Zhu ; Jian Cao
【Abstract】: Understanding air travel choice behavior of air passengers is of great significance for various purposes such as travel demand prediction and trip recommendation. Existing approaches based on surveys can only provide aggregate level air travel choice behavior of passengers and they fail to provide comprehensive information for personalized services. In this paper we focus on modeling individual level air travel choice behavior of passengers, which is valuable for recommendations and personalized services. We employ a probabilistic model to represent individual level air travel choice behavior based on a large dataset of historical booking records, leveraging several key factors, such as takeoff time, arrival time, elapsed time between reservation and takeoff, price, and seat class. However, each passenger has only a limited number of historical booking records, causing a serious data sparsity problem. To this end, we propose a mixed kernel density estimation (mix-KDE) approach for each passenger with a mixture model that combines probabilistic estimation of both regularity of the individual himself and social conformity of similar passengers. The proposed model is trained and evaluated via the expectation-maximization (EM) algorithm with a huge dataset of booking records of over 10 million air passengers from a popular online travel agency in China. Experimental results demonstrate that our mix-KDE approach outperforms the Gaussian mixture model (GMM) and the simple kernel density estimation in the presence of the sparsity issue.
【Keywords】: air travel choice behavior; individual level modeling; mixed kernel density estimation
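The mixture idea can be sketched in a few lines: a kernel density estimate over the passenger's own sparse records is blended with one over the pooled records of similar passengers. In the paper the mixture weight is learned per passenger via EM; here it is fixed, and all numbers are toy values.

```python
import numpy as np

def gaussian_kde(samples, bandwidth):
    s = np.asarray(samples, dtype=float)
    def pdf(x):
        z = (x - s) / bandwidth
        return np.exp(-0.5 * z**2).mean() / (bandwidth * np.sqrt(2 * np.pi))
    return pdf

# Density over takeoff hour: the passenger's own two bookings vs. the
# pooled history of similar passengers (illustrative numbers).
own = gaussian_kde([8.0, 9.0], bandwidth=1.0)
similar = gaussian_kde([7, 8, 9, 18, 19, 20], bandwidth=1.0)

def mix_kde(x, w=0.4):
    # Individual regularity blended with social conformity.
    return w * own(x) + (1 - w) * similar(x)

# Morning flights dominate, but evening flights keep social support.
print(round(mix_kde(8.5), 3), round(mix_kde(19.0), 3))
```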
【Paper Link】 【Pages】:621-630
【Authors】: Muhammad Aamir Saleem ; Rohit Kumar ; Toon Calders ; Xike Xie ; Torben Bach Pedersen
【Abstract】: Location-based social networks (LBSN) are social networks complemented with location data such as geo-tagged activity data of its users. In this paper, we study how users of a LBSN are navigating between locations and based on this information we select the most influential locations. In contrast to existing works on influence maximization, we are not per se interested in selecting the users with the largest set of friends or the set of locations visited by the most users; instead, we introduce a notion of location influence that captures the ability of a set of locations to reach out geographically. We provide an exact on-line algorithm and a more memory-efficient but approximate variant based on the HyperLogLog sketch to maintain a data structure called Influence Oracle that allows to efficiently find a top-k set of influential locations. Experiments show that our algorithms are efficient and scalable and that our new location influence notion favors diverse sets of locations with a large geographical spread.
【Keywords】: influence maximization; location-based influence; location-based social networks; outdoor marketing
【Paper Link】 【Pages】:631-640
【Authors】: Vineeth Rakesh ; Niranjan Jadhav ; Alexander Kotov ; Chandan K. Reddy
【Abstract】: The pervasive growth of location-based services such as Foursquare and Yelp has enabled researchers to incorporate better personalization into recommendation models by leveraging the geo-temporal breadcrumbs left by a plethora of travelers. In this paper, we explore travel path recommendation, an application of intelligent urban navigation that aims to recommend sequences of points of interest (POIs) to tourists. Currently, travelers rely on a tedious and time-consuming process of searching the web, browsing through websites such as TripAdvisor, and reading travel blogs to compile an itinerary. On the other hand, people who do not plan ahead of their trip find it extremely difficult to do this in real-time, since there are no automated systems that can provide personalized itineraries for travelers. To tackle this problem, we propose a tour recommendation model that uses a probabilistic generative framework to incorporate users' categorical preferences, influence from their social circle, dynamic travel transitions (or patterns), and the popularity of venues to recommend sequences of POIs for tourists. Through comprehensive experiments over a rich dataset of travel patterns from Foursquare, we show that our model is capable of outperforming the state-of-the-art probabilistic tour recommendation model by providing contextual and meaningful recommendations for travelers.
【Keywords】: foursquare; geo-location; probabilistic generative models; recommender systems; social media; topic models
【Paper Link】 【Pages】:641-649
【Authors】: Alex Deng ; Jiannan Lu ; Jonthan Litz
【Abstract】: A/B tests (or randomized controlled experiments) play an integral role in the research and development cycles of technology companies. As in classic randomized experiments (e.g., clinical trials), the underlying statistical analysis of A/B tests is based on assuming that the randomization unit is independent and identically distributed (i.i.d.). However, the randomization mechanisms utilized in online A/B tests can be quite complex and may render this assumption invalid. Analysis that unjustifiably relies on this assumption can yield untrustworthy results and lead to incorrect conclusions. Motivated by challenging problems arising from actual online experiments, we propose a new method of variance estimation that relies only on practically plausible assumptions, is directly applicable to a wide range of randomization mechanisms, and can be implemented easily. We examine its performance and illustrate its advantages over two commonly used methods of variance estimation on both simulated and empirical datasets. Our results lead to a deeper understanding of the conditions under which the randomization unit can be treated as i.i.d. In particular, we show that for purposes of variance estimation, the randomization unit can be approximated as i.i.d. when the individual treatment effect variation is small; however, this approximation can lead to variance under-estimation when the individual treatment effect variation is large.
【Keywords】: asymptotic variance; causal inference; delta method; random effect; randomization unit
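One standard ingredient of this line of work is the delta method for ratio metrics computed over randomization units, which accounts for within-unit correlation without assuming individual observations are i.i.d. The sketch below shows that estimator alone, on toy data; it is not the paper's full method.

```python
import numpy as np

def delta_method_var(numer, denom):
    """Variance of the ratio metric sum(numer)/sum(denom), where each
    entry is one randomization unit (e.g., a cluster of correlated
    observations), via the delta method."""
    numer = np.asarray(numer, dtype=float)
    denom = np.asarray(denom, dtype=float)
    n = len(numer)
    mu_x, mu_y = numer.mean(), denom.mean()
    var_x, var_y = numer.var(ddof=1), denom.var(ddof=1)
    cov_xy = np.cov(numer, denom, ddof=1)[0, 1]
    r = mu_x / mu_y
    # Var(Xbar/Ybar) ~ (var_x - 2*r*cov_xy + r^2*var_y) / (n * mu_y^2)
    return (var_x - 2 * r * cov_xy + r**2 * var_y) / (n * mu_y**2)

# Toy example: clicks and pageviews per randomization unit in one arm.
clicks = [3, 0, 5, 2, 4, 1]
views = [10, 4, 12, 6, 9, 5]
print(delta_method_var(clicks, views))
```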
【Paper Link】 【Pages】:651-659
【Authors】: Eugene Kharitonov ; Alexey Drutsa ; Pavel Serdyukov
【Abstract】: Online search evaluation, and A/B testing in particular, is an irreplaceable tool for modern search engines. Typically, online experiments last for several days or weeks and require a considerable portion of the search traffic. This restricts their usefulness and applicability. To alleviate the need for large sample sizes in A/B experiments, several approaches were proposed. Primarily, these approaches are based on increasing the sensitivity (informally, the ability to detect changes with fewer observations) of the evaluation metrics. Such sensitivity improvements are achieved by applying variance reduction methods, e.g. stratification and control covariates. However, the ability to learn sensitive metric combinations that (a) agree with the ground-truth metric, and (b) are more sensitive, was not explored in the A/B testing scenario. In this work, we aim to close this gap. We formulate the problem of finding a sensitive metric combination as a data-driven machine learning problem and propose two intuitive optimization approaches to address it. Next, we perform an extensive experimental study of our proposed approaches. In our experiments, we use a dataset of 118 A/B tests performed by Yandex and study eight state-of-the-art ground-truth user engagement metrics, including Sessions per User and Absence Time. Our results suggest that considerable sensitivity improvements over the ground-truth metrics can be achieved by using our proposed approaches.
【Keywords】: a/b tests; metric combination; online controlled experiments; online evaluation; sensitivity improvement
【Paper Link】 【Pages】:661-670
【Authors】: Han Cai ; Kan Ren ; Weinan Zhang ; Kleanthis Malialis ; Jun Wang ; Yong Yu ; Defeng Guo
【Abstract】: The majority of online display ads are served through real-time bidding (RTB) --- each ad display impression is auctioned off in real-time when it is just being generated from a user visit. To place an ad automatically and optimally, it is critical for advertisers to devise a learning algorithm to cleverly bid an ad impression in real-time. Most previous works consider the bid decision as a static optimization problem of either treating the value of each impression independently or setting a bid price to each segment of ad volume. However, the bidding for a given ad campaign would repeatedly happen during its life span before the budget runs out. As such, each bid is strategically correlated by the constrained budget and the overall effectiveness of the campaign (e.g., the rewards from generated clicks), which is only observed after the campaign has completed. Thus, it is of great interest to devise an optimal bidding strategy sequentially so that the campaign budget can be dynamically allocated across all the available impressions on the basis of both the immediate and future rewards. In this paper, we formulate the bid decision process as a reinforcement learning problem, where the state space is represented by the auction information and the campaign's real-time parameters, while an action is the bid price to set. By modeling the state transition via auction competition, we build a Markov Decision Process framework for learning the optimal bidding policy to optimize the advertising performance in the dynamic real-time bidding environment. Furthermore, the scalability problem from the large real-world auction volume and campaign budget is well handled by state value approximation using neural networks. The empirical study on two large-scale real-world datasets and the live A/B testing on a commercial platform have demonstrated the superior performance and high efficiency compared to state-of-the-art methods.
【Keywords】: bid optimization; display ads; reinforcement learning
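The finite-horizon structure of the bidding MDP can be shown with a toy dynamic program: states are (auctions remaining, budget remaining), actions are integer bids, and the market price distribution is assumed known. This is a first-price simplification with made-up numbers; the paper handles second-price auctions at scale and approximates the value function with neural networks.

```python
import numpy as np

T, B = 10, 8                               # auctions remaining, integer budget
VALUE = 3.0                                # expected value of a won impression
market = np.array([0.3, 0.4, 0.2, 0.1])   # P(market price = 0, 1, 2, 3)

# V[t, b] = optimal expected value-to-go with t auctions and budget b left.
V = np.zeros((T + 1, B + 1))
for t in range(1, T + 1):
    for b in range(B + 1):
        best = V[t - 1, b]                 # bid 0: skip this auction
        for bid in range(1, min(b, len(market)) + 1):
            p_win = market[:bid].sum()     # win if market price < bid
            q = p_win * (VALUE - bid + V[t - 1, b - bid]) \
                + (1 - p_win) * V[t - 1, b]
            best = max(best, q)
        V[t, b] = best

print(round(V[T, B], 2))  # expected total value under the optimal policy
```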
【Paper Link】 【Pages】:671-680
【Authors】: Cheng Li ; Xiaoxiao Guo ; Qiaozhu Mei
【Abstract】: We consider the task of identifying attitudes towards a given set of entities from text. Conventionally, this task is decomposed into two separate subtasks: target detection that identifies whether each entity is mentioned in the text, either explicitly or implicitly, and polarity classification that classifies the exact sentiment towards an identified entity (the target) into positive, negative, or neutral. Instead, we show that attitude identification can be solved with an end-to-end machine learning architecture, in which the two subtasks are interleaved by a deep memory network. In this way, signals produced in target detection provide clues for polarity classification, and reversely, the predicted polarity provides feedback to the identification of targets. Moreover, the treatments for the set of targets also influence each other -- the learned representations may share the same semantics for some targets but vary for others. The proposed deep memory network, the AttNet, outperforms methods that do not consider the interactions between the subtasks or those among the targets, including conventional machine learning methods and the state-of-the-art deep learning models.
【Keywords】: attitude identification; deep learning; sentiment analysis
【Paper Link】 【Pages】:681-689
【Authors】: Kijung Shin ; Bryan Hooi ; Jisu Kim ; Christos Faloutsos
【Abstract】: How can we detect fraudulent lockstep behavior in large-scale multi-aspect data (i.e., tensors)? Can we detect it when data are too large to fit in memory or even on a disk? Past studies have shown that dense blocks in real-world tensors (e.g., social media, Wikipedia, TCP dumps, etc.) signal anomalous or fraudulent behavior such as retweet boosting, bot activities, and network attacks. Thus, various approaches, including tensor decomposition and search, have been used for rapid and accurate dense-block detection in tensors. However, all such methods have low accuracy, or assume that tensors are small enough to fit in main memory, which is not true in many real-world applications such as social media and web. To overcome these limitations, we propose D-Cube, a disk-based dense-block detection method, which also can be run in a distributed manner across multiple machines. Compared with state-of-the-art methods, D-Cube is (1) Memory Efficient: requires up to 1,600 times less memory and handles 1,000 times larger data (2.6TB), (2) Fast: up to 5 times faster due to its near-linear scalability with all aspects of data, (3) Provably Accurate: gives a guarantee on the densities of the blocks it finds, and (4) Effective: successfully spotted network attacks from TCP dumps and synchronized behavior in rating data with the highest accuracy.
【Keywords】: anomaly detection; dense-block detection; fraud detection; tensor
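The flavor of dense-block detection can be conveyed by a tiny in-memory greedy peeler: repeatedly delete the attribute value with the least total mass and remember the snapshot with the highest average mass. D-Cube itself runs disk-based and distributed with provable density guarantees; this sketch and its data are purely illustrative.

```python
from collections import defaultdict

def densest_block(entries, order=3):
    """Greedy peeling on tensor entries (a1, ..., ak, mass): remove the
    attribute value of minimum total mass, tracking the snapshot with
    the highest arithmetic-average mass density."""
    entries = list(entries)
    best_density, best_block = -1.0, None
    while entries:
        mass = sum(e[-1] for e in entries)
        sizes = [len({e[d] for e in entries}) for d in range(order)]
        density = mass / (sum(sizes) / order)
        if density > best_density:
            best_density = density
            best_block = [sorted({e[d] for e in entries}) for d in range(order)]
        val_mass = defaultdict(float)       # total mass per (dimension, value)
        for e in entries:
            for d in range(order):
                val_mass[(d, e[d])] += e[-1]
        d, v = min(val_mass, key=val_mass.get)
        entries = [e for e in entries if e[d] != v]
    return best_density, best_block

# (user, item, hour, count): users 0 and 1 hammer item 9 during hour 5.
data = [(0, 9, 5, 50), (1, 9, 5, 60), (2, 3, 1, 1), (4, 5, 2, 1)]
print(densest_block(data))  # recovers the block ([0, 1], [9], [5])
```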
【Paper Link】 【Pages】:691-699
【Authors】: Yuan He ; Cheng Wang ; Changjun Jiang
【Abstract】: A document network is a kind of intriguing dataset which provides both topical (texts) and topological (links) information. Most previous work assumes that documents closely linked with each other share common topics. However, the associations among documents are usually complex and are not limited to homophily (i.e., the tendency to link to similar others). Heterophily (i.e., the tendency to link to different others) is another pervasive phenomenon in social networks. In this paper, we introduce a new tool, called the copula, to separately model the documents and links, so that different copula functions can be applied to capture different correlation patterns. In statistics, a copula is a powerful framework for explicitly modeling the dependence of random variables by separating the marginals and their correlations. Though widely used in economics, copulas have received relatively little attention from researchers in the machine learning field. Besides, to further capture the potential associations among the unconnected documents, we apply the tree-averaged copula instead of a single copula function. This improvement gives our model better expressive power and also makes it algebraically more elegant. We derive efficient EM algorithms to estimate the model parameters, and evaluate the performance of our model on three different datasets. Experimental results show that our approach achieves significant improvements on both topic and link modeling compared with the current state of the art.
【Keywords】: copula; document networks; plsa; topic models
【Paper Link】 【Pages】:701-709
【Authors】: Chun-Ta Lu ; Lifang He ; Weixiang Shao ; Bokai Cao ; Philip S. Yu
【Abstract】: Many real-world problems, such as web image analysis, document categorization and product recommendation, often exhibit dual-heterogeneity: heterogeneous features obtained in multiple views, and multiple tasks might be related to each other through one or more shared views. To address these Multi-Task Multi-View (MTMV) problems, we propose a tensor-based framework for learning the predictive multilinear structure from the full-order feature interactions within the heterogeneous data. The usage of tensor structure is to strengthen and capture the complex relationships between multiple tasks with multiple views. We further develop efficient multilinear factorization machines (MFMs) that can learn the task-specific feature map and the task-view shared multilinear structures, without physically building the tensor. In the proposed method, a joint factorization is applied to the full-order interactions such that the consensus representation can be learned. In this manner, it can deal with the partially incomplete data without difficulty as the learning procedure does not simply rely on any particular view. Furthermore, the complexity of MFMs is linear in the number of parameters, which makes MFMs suitable to large-scale real-world problems. Extensive experiments on four real-world datasets demonstrate that the proposed method significantly outperforms several state-of-the-art methods in a wide variety of MTMV problems.
【Keywords】: multi-task; multi-view; tensor factorization
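The "without physically building the tensor" trick has a well-known precedent in classic second-order factorization machines, where the pairwise interaction term collapses to a linear-time expression. The sketch below shows that classic identity, which MFMs generalize to full-order, multi-view interactions; it is not the authors' MFM itself.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order factorization machine score for one feature vector x.

    x: (d,) features; w0: bias; w: (d,) linear weights; V: (d, k) factors.
    The pairwise term sum_{i<j} <V_i, V_j> x_i x_j is computed in O(d k)
    via the identity 0.5 * sum_f ((sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2),
    so the d x d interaction matrix is never materialized.
    """
    linear = w0 + w @ x
    s = V.T @ x                       # (k,) sums of weighted factors
    s2 = (V ** 2).T @ (x ** 2)        # (k,) sums of squared terms
    pairwise = 0.5 * np.sum(s ** 2 - s2)
    return linear + pairwise
```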
【Paper Link】 【Pages】:711-719
【Authors】: Paul N. Bennett ; David Maxwell Chickering ; Christopher Meek ; Xiaojin Zhu
【Abstract】: Software applications often use classification models to trigger specialized experiences for users. Search engines, for example, use query classifiers to trigger specialized "instant answer" experiences where information satisfying the user query is shown directly on the result page, and email applications use classification models to automatically move messages to a spam folder. When such applications have acceptable default (i.e., non-specialized) behavior, users are often more sensitive to failures in model precision than failures in model recall. In this paper, we consider model-selection algorithms for these precision-constrained scenarios. We develop adaptive model-selection algorithms to identify, using as few samples as possible, the best classifier from among a set of (precision) qualifying classifiers. We provide statistical correctness and sample complexity guarantees for our algorithms. We show with an empirical validation that our algorithms work well in practice.
【Keywords】: efficient evaluation; guaranteed precision; model selection
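The adaptive model-selection idea, spending labeled samples only until the best candidate is statistically separated from the rest, can be sketched as a successive-elimination race with Hoeffding bounds. The paper's algorithms are specialized to precision-constrained selection among qualifying classifiers; this sketch races on plain accuracy, and all names below are illustrative.

```python
import math

def select_best(classifiers, stream, delta=0.05, batch=100, max_n=100_000):
    """Successive elimination among candidate classifiers.

    classifiers: dict name -> predict(x) returning a label;
    stream: iterator yielding (x, true_label) pairs.
    Samples in batches until Hoeffding confidence intervals eliminate all
    but one candidate (or the sample budget runs out).
    """
    correct = {c: 0 for c in classifiers}
    alive = set(classifiers)
    n = 0
    while len(alive) > 1 and n < max_n:
        for _ in range(batch):
            x, y = next(stream)
            n += 1
            for c in alive:
                correct[c] += int(classifiers[c](x) == y)
        # anytime Hoeffding radius with a crude union bound
        eps = math.sqrt(math.log(2 * len(classifiers) * n / delta) / (2 * n))
        best_mean = max(correct[c] / n for c in alive)
        # drop candidates whose upper bound falls below the leader's lower bound
        alive = {c for c in alive if correct[c] / n + eps >= best_mean - eps}
    return max(alive, key=lambda c: correct[c])
```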
【Paper Link】 【Pages】:721-729
【Authors】: Rohit Babbar ; Bernhard Schölkopf
【Abstract】: Extreme multi-label classification refers to supervised multi-label learning involving hundreds of thousands or even millions of labels. Datasets in extreme classification exhibit a power-law distribution, i.e., a large fraction of labels have very few positive instances in the data distribution. Most state-of-the-art approaches for extreme multi-label classification attempt to capture correlation among labels by embedding the label matrix into a low-dimensional linear sub-space. However, in the presence of power-law distributed, extremely large and diverse label spaces, structural assumptions such as low rank can be easily violated. In this work, we present DiSMEC, a large-scale distributed framework for learning one-versus-rest linear classifiers coupled with explicit capacity control of model size. Unlike most state-of-the-art methods, DiSMEC does not make any low-rank assumptions on the label matrix. Using a double layer of parallelization, DiSMEC can learn classifiers for datasets consisting of hundreds of thousands of labels within a few hours. The explicit capacity control mechanism filters out spurious parameters, which keeps the model compact in size without losing prediction accuracy. We conduct extensive empirical evaluation on publicly available real-world datasets with up to 670,000 labels. We compare DiSMEC with recent state-of-the-art approaches, including SLEEC, a leading approach for learning sparse local embeddings, and FastXML, a tree-based approach optimizing a ranking-based loss function. On some of the datasets, DiSMEC significantly boosts prediction accuracies, by 10% over SLEEC and 15% over FastXML in absolute terms.
【Keywords】: extreme classification; large-scale classification; multi-label learning
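A minimal sketch of the one-versus-rest-plus-capacity-control recipe: train an independent linear classifier per label, then zero out near-zero weights so the stacked model stays sparse. DiSMEC's contribution is doing this at extreme scale with a double layer of parallelization; the threshold value and the use of scikit-learn's LinearSVC below are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.svm import LinearSVC

def train_ovr_pruned(X, Y, threshold=0.01):
    """One-versus-rest linear classifiers with post-training weight pruning.

    X: (n, d) feature matrix; Y: (n, L) dense 0/1 label matrix, where every
    label is assumed to have both positive and negative examples. After each
    label's classifier is trained, weights below `threshold` in magnitude
    are zeroed, keeping the stacked model sparse.
    """
    L = Y.shape[1]
    rows, cols, vals = [], [], []
    for l in range(L):                        # DiSMEC parallelizes this loop
        clf = LinearSVC(C=1.0).fit(X, Y[:, l])
        w = clf.coef_.ravel()
        keep = np.abs(w) >= threshold         # prune spurious tiny weights
        idx = np.nonzero(keep)[0]
        rows.extend([l] * len(idx))
        cols.extend(idx.tolist())
        vals.extend(w[idx].tolist())
    return csr_matrix((vals, (rows, cols)), shape=(L, X.shape[1]))
```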
【Paper Link】 【Pages】:731-739
【Authors】: Xiao Huang ; Jundong Li ; Xia Hu
【Abstract】: Attributed network embedding aims to seek low-dimensional vector representations for nodes in a network, such that the original network topological structure and node attribute proximity can be preserved in the vectors. These learned representations have been demonstrated to be helpful in many learning tasks such as network clustering and link prediction. While existing algorithms operate in an unsupervised manner, nodes in many real-world attributed networks are often associated with abundant label information, which is potentially valuable in seeking more effective joint vector representations. In this paper, we investigate how labels can be modeled and incorporated to improve attributed network embedding. This is a challenging task since label information can be noisy and incomplete. In addition, labels are completely distinct from the geometrical structure and node attributes. This bewildering combination of heterogeneous information makes joint vector representation learning more difficult. To address these issues, we propose a novel Label informed Attributed Network Embedding (LANE) framework. It can smoothly incorporate label information into the attributed network embedding while preserving their correlations. Experiments on real-world datasets demonstrate that the proposed framework achieves significantly better performance compared with state-of-the-art embedding algorithms.
【Keywords】: attributed networks; label informed; network embedding
【Paper Link】 【Pages】:741-749
【Authors】: Linchuan Xu ; Xiaokai Wei ; Jiannong Cao ; Philip S. Yu
【Abstract】: Network embedding is increasingly employed to assist network analysis, as it is effective for learning latent features that encode linkage information. Various network embedding methods have been proposed, but they are designed only for a single-network scenario. In the era of big data, different types of related information can be fused together to form a coupled heterogeneous network, which consists of two different but related sub-networks connected by inter-network edges. In this scenario, the inter-network edges can act as complementary information in the presence of intra-network ones. This complementary information is important because it can make latent features more comprehensive and accurate. It is even more important when intra-network edges are absent, which can be referred to as the cold-start problem. In this paper, we thus propose a method named embedding of embedding (EOE) for coupled heterogeneous networks. In EOE, latent features encode not only intra-network edges, but also inter-network ones. To tackle the challenge of the heterogeneity of the two networks, EOE incorporates a harmonious embedding matrix to further embed the embeddings that only encode intra-network edges. Empirical experiments on a variety of real-world datasets demonstrate that EOE consistently outperforms single-network embedding methods in applications including visualization, link prediction, multi-class classification, and multi-label classification.
【Keywords】: coupled heterogeneous networks; data mining; network embedding
【Paper Link】 【Pages】:751-760
【Authors】: Yi Tay ; Anh Tuan Luu ; Siu Cheung Hui ; Falk Brauer
【Abstract】: Link prediction on knowledge graphs is useful in numerous application areas such as semantic search, question answering, entity disambiguation, enterprise decision support, recommender systems and so on. While many of these applications require a reasonably quick response and may operate on data that is constantly changing, existing methods often lack the speed and adaptability to cope with these requirements. This is aggravated by the fact that knowledge graphs are often extremely large and may easily contain millions of entities, rendering many of these methods impractical. In this paper, we address the weaknesses of current methods by proposing Random Semantic Tensor Ensemble (RSTE), a scalable ensemble-enabled framework based on tensor factorization. Our proposed approach samples a knowledge graph tensor in its graph representation and performs link prediction via ensembles of tensor factorizations. Our experiments on both publicly available datasets and real-world enterprise/sales knowledge bases have shown that our approach is not only highly scalable, parallelizable and memory efficient, but also able to increase the prediction accuracy significantly across all datasets.
【Keywords】: ensemble methods; knowledge graphs; semantic search; tensor factorization
【Paper Link】 【Pages】:761-770
【Authors】: Jinoh Oh ; Kijung Shin ; Evangelos E. Papalexakis ; Christos Faloutsos ; Hwanjo Yu
【Abstract】: Multi-aspect data appear frequently in many web-related applications. For example, product reviews are quadruplets of (user, product, keyword, timestamp). How can we analyze such web-scale multi-aspect data? Can we analyze them on an off-the-shelf workstation with a limited amount of memory? Tucker decomposition has been widely used for discovering patterns in relationships among entities in multi-aspect data, naturally expressed as high-order tensors. However, existing algorithms for Tucker decomposition have limited scalability and, especially, fail to decompose high-order tensors, since they explicitly materialize intermediate data whose size rapidly grows as the order increases (≥ 4). We call this problem M-Bottleneck ("Materialization Bottleneck"). To avoid M-Bottleneck, we propose S-HOT, a scalable high-order Tucker decomposition method that employs on-the-fly computation to minimize the materialized intermediate data. Moreover, S-HOT is designed for handling disk-resident tensors, too large to fit in memory, without loading them all into memory at once. We provide theoretical analysis of the amount of memory space and the number of scans of data required by S-HOT. In our experiments, S-HOT showed better scalability than baseline methods, not only with the order but also with the dimensionality and the rank. In particular, S-HOT decomposed tensors 1000× larger than baseline methods in terms of dimensionality. S-HOT also successfully analyzed real-world tensors that are both large-scale and high-order on an off-the-shelf workstation with a limited amount of memory, while baseline methods failed. The source code of S-HOT is publicly available at http://dm.postech.ac.kr/shot to encourage reproducibility.
【Keywords】:
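To see where M-Bottleneck comes from, it helps to write out Tucker's mode-n products: reconstruction (and decomposition) chains tensor-times-matrix contractions whose intermediate results grow at every step as the order increases. The sketch below shows a dense in-memory mode-n product and a tiny 4-way Tucker reconstruction; S-HOT's point is precisely to avoid materializing such intermediates for disk-resident tensors, so treat this only as a statement of the operation itself.

```python
import numpy as np

def mode_n_product(T, M, n):
    """Tensor-times-matrix along mode n: contracts T's n-th axis with M's columns."""
    return np.moveaxis(np.tensordot(T, M, axes=(n, 1)), -1, n)

rng = np.random.default_rng(0)
dims, ranks = (5, 6, 7, 8), (2, 2, 2, 2)       # a tiny 4-way example
G = rng.standard_normal(ranks)                  # core tensor
Us = [rng.standard_normal((d, r)) for d, r in zip(dims, ranks)]

X = G
for n, U in enumerate(Us):                      # chain of mode-n products;
    X = mode_n_product(X, U, n)                 # intermediates grow each step
print(X.shape)                                  # (5, 6, 7, 8)
```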
【Paper Link】 【Pages】:771-779
【Authors】: Chin-Chi Hsu ; Yi-An Lai ; Wen-Hao Chen ; Ming-Han Feng ; Shou-De Lin
【Abstract】: PageRank has been the signature unsupervised model for ranking node importance in a graph. One potential drawback of PageRank is that its computation depends only on the input graph structure, without considering external information such as the attributes of nodes. This work proposes AttriRank, an unsupervised ranking model that considers not only the graph structure but also the attributes of nodes. AttriRank is unsupervised and domain-independent, unlike most existing works, which require either ground-truth labels or specific domain knowledge. Combining two reasonable assumptions about PageRank and node attributes, AttriRank transfers the extra node information into a Markov chain model to obtain the ranking. We further develop an approximation for AttriRank that reduces its complexity to linear in the number of nodes or links in the graph, which makes it feasible for large network data. The experiments show that AttriRank outperforms competing models in diverse graph ranking applications.
【Keywords】: node ranking; pagerank; unsupervised learning
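One plausible way to fold node attributes into a PageRank-style Markov chain, shown below, is to replace the uniform teleport distribution with an attribute-derived reset vector. AttriRank's actual construction combines structure and attributes differently in its details, so treat this purely as intuition for "extra node information in a Markov chain".

```python
import numpy as np

def attribute_biased_pagerank(A, reset, damping=0.85, iters=100):
    """Power iteration for PageRank with a non-uniform reset distribution.

    A: (n, n) non-negative adjacency matrix; reset: (n,) non-negative node
    scores derived from attributes (normalized inside). Dangling nodes are
    given a uniform transition row.
    """
    n = A.shape[0]
    deg = A.sum(axis=1, keepdims=True)
    # row-stochastic transition matrix; uniform rows for dangling nodes
    P = np.divide(A, deg, out=np.full_like(A, 1.0 / n, dtype=float),
                  where=deg > 0)
    reset = reset / reset.sum()
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = damping * (P.T @ r) + (1 - damping) * reset
    return r
```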
【Paper Link】 【Pages】:781-789
【Authors】: Thorsten Joachims ; Adith Swaminathan ; Tobias Schnabel
【Abstract】: Implicit feedback (e.g., clicks, dwell times, etc.) is an abundant source of data in human-interactive systems. While implicit feedback has many advantages (e.g., it is inexpensive to collect, user centric, and timely), its inherent biases are a key obstacle to its effective use. For example, position bias in search rankings strongly influences how many clicks a result receives, so that directly using click data as a training signal in Learning-to-Rank (LTR) methods yields sub-optimal results. To overcome this bias problem, we present a counterfactual inference framework that provides the theoretical basis for unbiased LTR via Empirical Risk Minimization despite biased data. Using this framework, we derive a Propensity-Weighted Ranking SVM for discriminative learning from implicit feedback, where click models take the role of the propensity estimator. In contrast to most conventional approaches to de-biasing the data using click models, this allows training of ranking functions even in settings where queries do not repeat. Beyond the theoretical support, we show empirically that the proposed learning method is highly effective in dealing with biases, that it is robust to noise and propensity model misspecification, and that it scales efficiently. We also demonstrate the real-world applicability of our approach on an operational search engine, where it substantially improves retrieval performance.
【Keywords】: click models; implicit feedback; learning to rank; propensity weighting; ranking svm
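The counterfactual recipe can be made concrete in a few lines: weight each click by the inverse of its estimated examination propensity so that position bias cancels in expectation. The sketch below applies this weighting to a pairwise hinge loss for a single query; the paper's Propensity-Weighted Ranking SVM wraps the same idea into an SVM objective, so this is a simplified stand-in.

```python
import numpy as np

def ips_pairwise_hinge(scores, clicked, propensity, margin=1.0):
    """Inverse-propensity-weighted pairwise hinge loss for one query.

    scores: (n,) model scores for the n results shown;
    clicked: (n,) 0/1 int array of observed clicks;
    propensity: (n,) examination probabilities, e.g. from a position-bias
    click model. Each clicked-over-skipped pair is weighted by the inverse
    propensity of the clicked result, debiasing position bias in expectation.
    """
    loss = 0.0
    for i in np.nonzero(clicked)[0]:
        w = 1.0 / propensity[i]
        for j in np.nonzero(1 - clicked)[0]:
            loss += w * max(0.0, margin - (scores[i] - scores[j]))
    return loss
```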
【Paper Link】 【Pages】:791-799
【Authors】: Michael Bendersky ; Xuanhui Wang ; Donald Metzler ; Marc Najork
【Abstract】: User interaction data (e.g., click data) has proven to be a powerful signal for learning-to-rank models in web search. However, such models require observing multiple interactions across many users for the same query-document pair to achieve statistically meaningful gains. Therefore, utilizing user interaction data for improving search over personal, rather than public, content is a challenging problem. First, the documents (e.g., emails or private files) are not shared across users. Second, user search queries are of personal nature (e.g., "alice's address") and may not generalize well across users. In this paper, we propose a solution to these challenges, by projecting user queries and documents into a multi-dimensional space of fine-grained and semantically coherent attributes. We then introduce a novel parameterization technique to overcome sparsity in the multi-dimensional attribute space. Attribute parameterization enables effective usage of cross-user interactions for improving personal search quality -- which is a first such published result, to the best of our knowledge. Experiments with a dataset derived from interactions of users of one of the world's largest personal search engines demonstrate the effectiveness of the proposed attribute parameterization technique.
【Keywords】: attribute parameterization; personal search; user interactions
【Paper Link】 【Pages】:801-810
【Authors】: Lu Jiang ; Yannis Kalantidis ; Liangliang Cao ; Sachin Farfade ; Jiliang Tang ; Alexander G. Hauptmann
【Abstract】: The ubiquity of mobile devices and cloud services has led to an unprecedented growth of online personal photo and video collections. Due to the scarcity of personal media search log data, research to date has mainly focused on searching images and videos on the web. However, in order to manage the exploding amount of personal photos and videos, we raise a fundamental question: what are the differences and similarities when users search their own photos versus photos on the web? To the best of our knowledge, this paper is the first to study personal media search using large-scale real-world search logs. We analyze different types of search sessions mined from Flickr search logs and discover a number of interesting characteristics of personal media search in terms of information needs and click behaviors. These observations will not only be instrumental in guiding future personal media search methods, but will also benefit related tasks such as personal photo browsing and recommendation. Our findings suggest there is a significant gap between personal queries and automatically detected concepts, which is responsible for the low accuracy of many personal media search queries. To bridge the gap, we propose a deep query understanding model that learns a mapping from personal queries to the concepts in the clicked photos. Experimental results verify the efficacy of the proposed method in improving personal media search, where it consistently outperforms baseline methods.
【Keywords】: content-based image search; deep learning; recurrent neural networks; search log analysis
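As a minimal stand-in for the query-to-concept mapping described above, the sketch below averages a query's word embeddings and applies a linear layer to score candidate concepts. The paper's model is recurrent (per its keywords); this bag-of-embeddings version, with hypothetical names throughout, only conveys the shape of the task.

```python
import numpy as np

def concept_scores(query_tokens, emb, W, b):
    """Score detected concepts for a personal-photo query.

    emb: dict token -> (d,) vector; W: (n_concepts, d); b: (n_concepts,).
    Averages the query's word embeddings and applies a linear layer; in a
    recurrent model, the average would be replaced by an RNN state.
    """
    vecs = [emb[t] for t in query_tokens if t in emb]
    if not vecs:
        return b                      # fall back to the concept priors
    q = np.mean(vecs, axis=0)
    return W @ q + b
```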
【Paper Link】 【Pages】:811
【Authors】: Andrew Blake
【Abstract】: The Alan Turing Institute is the UK's newly-created national centre for data science, headquartered at the British Library in the heart of London's vibrant Knowledge Quarter. Our vision is to become a world leader in data science research and innovation. This lecture will take a whistle-stop tour through the Institute's scientific and innovation programme. It tackles challenges of social and economic importance, from engineering to finance, from health and well-being to cloud computing and smart cities. The Institute aims to make unique contributions, focussing on work that is complementary to what can be done in universities. In particular, we aim to act as a national hub, attracting a diversity of talent, interest and influence. We emphasise engineering and the ability to build high quality prototypes. We encompass, in one connected physical space, a sweep of disciplines from pure mathematics, through statistics and machine learning, to social science.
【Keywords】:
【Paper Link】 【Pages】:813-814
【Authors】: Bhaskar Mitra ; Nick Craswell
【Abstract】: In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing tasks, such as language modelling and machine translation. This suggests that neural models will also achieve good performance on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. Although initial iterations of neural models do not outperform traditional lexical-matching baselines, the level of interest and effort in this area is increasing, potentially leading to a breakthrough. The popularity of the recent SIGIR 2016 workshop on Neural Information Retrieval provides evidence of the growing interest in neural models for IR. While recent tutorials have covered some aspects of deep learning for retrieval tasks, there is significant scope for a tutorial that focuses on the fundamentals of representation learning for text retrieval. The goal of this tutorial is to introduce state-of-the-art neural embedding models and bridge the gap between these neural models and early representation learning approaches in IR (e.g., LSA). We will discuss some of the key challenges and insights in making these models work in practice, and demonstrate one of the toolsets available to researchers interested in this area.
【Keywords】: information retrieval; neural networks; representation learning
【Paper Link】 【Pages】:815-816
【Authors】: Laura Dietz ; Alexander Kotov ; Edgar Meij
【Abstract】: The past decade has witnessed the emergence of several publicly available and proprietary knowledge graphs (KGs). The increasing depth and breadth of content in KGs makes them not only rich sources of structured knowledge by themselves but also valuable resources for search systems. A surge of recent developments in entity linking and retrieval methods gave rise to a new line of research that aims at utilizing KGs for text-centric retrieval applications, making this an ideal time to pause and report current findings to the community, summarizing successful approaches, and soliciting new ideas. This tutorial is the first to disseminate the progress in this emerging field to researchers and practitioners. All tutorial resources are available online at http://github.com/laura-dietz/tutorial-utilizing-kg
【Keywords】: document retrieval; entity linking; knowledge graphs; query understanding
【Paper Link】 【Pages】:817-818
【Authors】: Yan Liu ; Sanjay Chawla
【Abstract】: Anomaly detection is of critical importance for preventing malicious activities such as bullying, terrorist attack planning, and fraudulent information dissemination. With the recent popularity of social media, new types of anomalous behaviors arise, causing concerns from various parties. While a large body of work has been dedicated to traditional anomaly detection problems, we observe a surge of research interest in the new realm of social media anomaly detection. In this tutorial, we survey existing work on social media anomaly detection, focusing on the new anomalous phenomena in social media and the most recent techniques to detect those special types of anomalies. We aim to provide a general overview of the problem domain, common formulations, existing methodologies and future directions.
【Keywords】: anomaly detection; social media analysis
【Paper Link】 【Pages】:819-820
【Authors】: Robert M. Patton ; Thomas E. Potok ; Petr Knoth ; Drahomira Herrmannova
【Abstract】:
【Keywords】: information retrieval; scholarly publications; web mining
【Paper Link】 【Pages】:821-822
【Authors】: Faezeh Ensan ; Zeinab Noorian ; Ebrahim Bagheri
【Abstract】: The first international workshop on Mining Actionable Insights from Social Networks (MAISoN'17) is to be held on February 10, 2017; co-located with the Tenth ACM International Web Search and Data Mining (WSDM) Conference in Cambridge, UK. MAISoN'17 aims at bringing together researchers and participants from different disciplines such as computer science, big data mining, machine learning, social network analysis and other related areas in order to identify challenging problems and share ideas, algorithms, and technologies for mining actionable insight from social network data. We organized a workshop program that includes the presentation of eight peer-reviewed papers and keynote talks, which foster discussions around state-of-the-art in social network mining and will hopefully lead to future collaborations and exchanges.
【Keywords】: predictive modeling; social network analysis; web mining
【Paper Link】 【Pages】:823-824
【Authors】: Theodora Tsikrika ; Babak Akhgar ; Vasilis Katos ; Stefanos Vrochidis ; Pete Burnap ; Matthew L. Williams
【Abstract】: The deliberate misuse of technical infrastructure (including the Web and social media) for cyber deviant and cybercriminal behaviour, ranging from the spreading of extremist and terrorism-related material to online fraud and cyber security attacks, is on the rise. This workshop aims to better understand such phenomena and develop methods for tackling them in an effective and efficient manner. The workshop brings together interdisciplinary researchers and experts in Web search, security informatics, social media analysis, machine learning, and digital forensics, with particular interests in cyber security. The workshop programme includes refereed papers, invited talks and a panel discussion for better understanding the current landscape, as well as the future of data mining for detecting cyber deviance.
【Keywords】: cyber security; cybercrime; data mining; security informatics; terrorist and extremist content
【Paper Link】 【Pages】:825-826
【Authors】: Nigel Collier ; Nut Limsopatham ; Aron Culotta ; Mike Conway ; Ingemar J. Cox ; Vasileios Lampos
【Abstract】: The workshop on Mining Online Health Reports (MOHRS) draws upon the rapidly developing field of Computational Health, focusing on textual content that has been generated through the various facets of Web activity. Online user-generated information mining, especially from social media platforms and search engines, has been in the forefront of many research efforts, especially in the fields of Information Retrieval and Natural Language Processing. The incorporation of such data and techniques in a number of health-oriented applications has provided strong evidence about the potential benefits, which include better population coverage, timeliness and the operational ability in places with less established health infrastructure. The workshop aims to create a platform where relevant state-of-the-art research is presented, but at the same time discussions among researchers with cross-disciplinary backgrounds can take place. It will focus on the characterisation of data sources, the essential methods for mining this textual information, as well as potential real-world applications and the arising ethical issues. MOHRS '17 will feature 3 keynote talks and 4 accepted paper presentations, together with a panel discussion session.
【Keywords】: computational health; machine learning; natural language processing; user-generated content
【Paper Link】 【Pages】:827-828
【Authors】: Stefan Heindorf ; Martin Potthast ; Hannah Bast ; Björn Buchhold ; Elmar Haussmann
【Abstract】: The WSDM Cup 2017 was a data mining challenge held in conjunction with the 10th International Conference on Web Search and Data Mining (WSDM). It addressed key challenges of knowledge bases today: quality assurance and entity search. For quality assurance, we tackle the task of vandalism detection, based on a dataset of more than 82 million user-contributed revisions of the Wikidata knowledge base, all of which are annotated with regard to whether or not they are vandalism. For entity search, we tackle the task of triple scoring, using a dataset that comprises relevance scores for triples from type-like relations, including occupation and country of citizenship, based on about 10,000 human relevance judgments. For reproducibility's sake, participants were asked to submit their software on TIRA, a cloud-based evaluation platform, and they were incentivized to release their approaches as open source.
【Keywords】: data quality; knowledge base; search; vandalism
【Paper Link】 【Pages】:829
【Authors】: Matthew Ryan Mitsui
【Abstract】: Query recommendation in Web search is typically manifested in algorithms that 1) recommend previously issued queries from a query log or 2) make incremental changes to queries in a user's current session. While such approaches have been effective in improving retrieval, they are either limited to suggesting queries in a query log or fail to make the leaps that are necessary for query recommendation. More crucially, these approaches only recommend queries that are a coarse approximation of the information a user needs to complete their goal. They do not directly attempt to model the need and generate recommendations from it. This work will propose a framework for generating novel yet focused queries for query recommendation.
【Keywords】: diversity; query recommendation; search session analysis; user simulations
【Paper Link】 【Pages】:831
【Authors】: Manoj Reddy Dareddy
【Abstract】: Recommender systems are omnipresent on the web. They aim to address the problem of information overload by personalizing content for each individual user based on their preferences. Currently, these systems are used to recommend movies, music, news, products to buy, and more; they essentially assist us in making decisions. This paper presents the author's plan for his PhD research in this domain and research questions to be discussed at the WSDM 2017 Doctoral Consortium.
【Keywords】: neural networks; online evaluation; recommender systems
【Paper Link】 【Pages】:833
【Authors】: Venkatesh Vinayakarao
【Abstract】: Advances in text retrieval do not apply directly to source code retrieval because of the difference in characteristics of source code when compared to text. Recently, researchers have proposed specialized indexing and querying techniques that are based on structural representations and semantics of source code. Current tools and techniques heavily depend on user defined terms in source code to leverage text retrieval techniques. Further, they have limitations in handling partial programs, platform independence and index-time processing. We focus on building reusable models of source code for the purposes of indexing and querying.
【Keywords】: code search; natural language processing
【Paper Link】 【Pages】:835
【Authors】: Zijun Xue
【Abstract】: Latent Dirichlet Allocation (LDA) is an extremely popular probabilistic topic model used for a diverse class of applications. While highly effective, one important limitation of LDA is the high memory footprint of its inference algorithm, making it difficult to scale to large datasets. In my thesis, I propose sdLDA, a highly scalable disk-based LDA that (1) leverages the plentiful space available on disk to reduce its main-memory footprint and (2) preserves the sparsity of the extracted topics during sampling to improve both the memory efficiency and the sampling complexity of the inference algorithm. Our extensive experiments show that sdLDA scales to datasets that are two orders of magnitude larger than what existing main-memory-based algorithms can handle. Furthermore, sdLDA exhibits similar (or even better) performance compared to main-memory-based LDAs in terms of running time.
【Keywords】: scalability; text analysis; topic model
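For context, the inference step whose memory footprint sdLDA targets is the collapsed Gibbs sweep of standard LDA, with full conditional p(z = k) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ). The sketch below is the textbook dense in-memory sweep; sdLDA's contribution lies in keeping these count structures sparse and partly on disk, which this sketch does not attempt.

```python
import numpy as np

def gibbs_sweep(z, docs, n_dk, n_kw, n_k, alpha, beta, V, rng):
    """One sweep of collapsed Gibbs sampling for standard LDA.

    z[d][i] is the topic of token i in doc d; n_dk (D x K), n_kw (K x V)
    and n_k (K,) are the usual count matrices, updated in place.
    """
    K = n_k.shape[0]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]                         # remove current assignment
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # full conditional over topics for this token
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())    # resample the topic
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
```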
【Paper Link】 【Pages】:837
【Authors】: Kiran Garimella
【Abstract】: In this thesis, we develop methods to (i) detect and quantify the existence of filter bubbles in social media, (ii) monitor their evolution over time, and finally, (iii) devise methods to overcome the effects caused by filter bubbles. We are the first to propose an end-to-end system that addresses the problem of filter bubbles entirely algorithmically. We build on existing studies and ideas from social science, together with principles from graph theory, to design algorithms that are language independent, domain agnostic and scalable to large numbers of users.
【Keywords】: controversy; echo chambers; filter bubble; polarization; social media
【Paper Link】 【Pages】:839
【Authors】: Eliezer de Souza da Silva
【Abstract】: This project focuses on the design of probabilistic models for recommender systems and collaborative filtering, extending and creating new models to include rich contextual and content information (content, user social network, location, time, user intent, etc.), and developing scalable approximate inference algorithms for these models. The working hypothesis is that big data analytics combined with probabilistic modelling, through automatic mining of various data sources and the combination of different latent factors explaining the user's interaction with the items, can be used to better infer user behaviour and generate improved recommendations. Fundamentally, we are interested in the following questions: 1) Does additional contextual information improve the quality of recommender systems? 2) What factors (features, models, methods) are relevant in the design of personalized systems? 3) What is the relation between the social network structure, the user model and the information need of the user? How does the social context interfere with user preferences? How can the evolution of the social network structure explain changes in the user preference model? 4) Does the choice of approximate inference method have a significant impact on the quality of the system (quality-efficiency trade-offs)? To address some of these questions, we started by proposing a model (Figure 1) based on Poisson factorization models [2], combining a social factorization model [1] and a topic-based factorization [3]. The main idea is to combine content latent factors (topics, tags, etc.) and trust between users (trust weights in a social graph) in a way that both sources of information have additive effects on the observed ratings. In the case of Poisson models, this additive constraint induces non-negative latent factors that are more sparse and avoid overfitting, in comparison to Gaussian-based models [2]. The main objective at this point is to compare models that incorporate both sources of information (content and social networks). The next steps will include empirical validation. Concluding, we are interested in the interplay between large-scale data mining and probabilistic modeling in the design of recommender systems. One initial approach we are pursuing is to model content and social network features in a Poisson latent variable model. Our main objective in the future is the development of methods with competitive computational complexity to perform inference using heterogeneous data in dynamical probabilistic models, as well as exploring the scalability limits of the models we propose.
【Keywords】: hybrid collaborative filtering; latent variable models; poisson matrix factorization; social recommendation
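A minimal generative sketch of the additive combination described above: an observed rating is Poisson with a rate that sums a content term (inner product of non-negative user and item factors) and a social term (trust-weighted neighbor ratings). All names and the exact functional form are illustrative assumptions based on the abstract, not the author's final model.

```python
import numpy as np

def poisson_rate(theta_u, beta_i, tau_u, r_neighbors):
    """Additive rate for a social Poisson factorization model.

    theta_u, beta_i: non-negative latent factors of user u and item i;
    tau_u: trust weights over u's neighbors; r_neighbors: those neighbors'
    ratings of item i. Both terms are non-negative, so they combine
    additively, matching the constraint described in the abstract.
    """
    content = theta_u @ beta_i
    social = tau_u @ r_neighbors
    return content + social

rng = np.random.default_rng(0)
rate = poisson_rate(rng.gamma(0.3, 1.0, 10),          # user factors
                    rng.gamma(0.3, 1.0, 10),          # item factors
                    rng.gamma(0.5, 0.2, 3),           # trust in 3 neighbors
                    np.array([5.0, 3.0, 4.0]))        # neighbor ratings
print(rng.poisson(rate))                              # a simulated rating
```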
【Paper Link】 【Pages】:841
【Authors】: Sendong Zhao
【Abstract】: In the medical context, causal knowledge usually refers to causal relations between diseases and symptoms, living habits and diseases, symptoms that improve under a therapy, drugs and side-effects, etc. [3]. All these causal relations typically reside in medical literature, forums and clinical cases, and they form the core of medical diagnosis. Therefore, mining this causal knowledge to predict diseases and recommend therapies is of great value for assisting patients and professionals. The task of mining such causal knowledge for diagnosis assistance can be decomposed into four constituents: (1) mining medical causality from text; (2) measuring medical treatment effectiveness; (3) disease prediction; and (4) explicable medical treatment recommendation. However, these tasks have never been systematically studied before. For my PhD thesis, I plan to formally define the problem of mining medical-domain causality for diagnosis assistance and propose methods to solve it. 1. Mining these textual causalities can be very useful for discovering new knowledge and making decisions. Many studies have addressed causal extraction from text [1, 4, 5]. However, all of them are based on patterns or causal triggers, which greatly limits their power to extract causality, and they rarely consider the frequency of co-occurrence or contextual semantic features. Moreover, none of them exploits the transitivity of causality, thereby missing causalities that can be obtained by simple inference. Therefore, we formally define the task of mining causality via the frequency of event co-occurrence, the semantic distance between event pairs, and the transitivity rules of causality, and present a factor graph that combines these three resources for causality mining. 2. Treatment effectiveness analysis is usually treated as a subset of causal analysis on observational data. For such real observational data, PSM and RCM are the two dominant methods. On one hand, it is usually difficult for PSM to find matched cases due to the sparsity of symptoms. On the other hand, RCM requires checking every possible (symptom, treatment) pair, which leads to a combinatorial explosion, especially when we want to check the causal relation between a combination of symptoms and a combination of drugs. Moreover, the larger the number of symptoms or treatments in the combination, the fewer patient cases are retrieved, which leads to a lack of statistical significance. In particular, patients in Traditional Chinese Medicine (TCM) tend to take tens of herbs as a single treatment; evaluating the effectiveness of herbs both separately and jointly is therefore a major challenge, and a fundamental research topic supporting many downstream applications. 3. Both hospitals and online forums have accumulated a sheer amount of records, such as clinical text data and online diagnosis Q&A pairs. The availability of such data in large volume enables automatic disease prediction. There are papers on disease prediction with electronic health records (EHR) [2], but research on disease prediction from raw symptoms remains necessary and challenging. Therefore, we propose the general idea of using the rich contextual information of diseases and symptoms to bridge the gap between disease candidates and symptoms, and detach it from the specific implementation of the idea using network embedding. 4. Recommendation in the medical domain is usually a decision-making issue, which requires the ability to explain "why". This ability comes essentially from two paths. Consider a recommendation suggesting that you eat more vegetables: you probably would not believe it if nothing were attached, but if the recommendation gave literal reasons why eating more vegetables is good, you might follow the suggestion. In another scenario, if the recommendation showed contrastive data indicating that people who eat more vegetables are healthier than those who eat fewer, you would also be inclined to follow it. Based on these two intuitions, we present a recommendation model based on proofs, which are either literal reasons or differences from a contrast. This work was supported by the 973 program (No. 2014CB340503) and the NSFC (No. 61133012 and No. 61472107).
【Keywords】:
【Paper Link】 【Pages】:843
【Authors】: Maria Maistro
【Abstract】: To address the challenge of adapting Information Retrieval (IR) to constantly evolving user tasks and needs, and to adjust it to user interactions and preferences, we develop a new model of user behavior based on Markov chains. We aim to integrate the proposed model into several aspects of IR, i.e., evaluation measures, systems and collections. First, we studied IR evaluation measures and proposed a theoretical framework to describe their properties. Then, we presented a new family of evaluation measures, called Markov Precision (MP), based on the proposed model and able to explicitly link lab-style and online evaluation metrics. Future work will integrate the presented model into Learning to Rank (LtR) algorithms and will define a collection for the evaluation and comparison of Personalized Information Retrieval (PIR) systems.
【Keywords】: evaluation; markov precision; user model
【Paper Link】 【Pages】:845
【Authors】: Dimitar Dimitrov
【Abstract】: Navigation in an information space is a natural way to explore and discover its content. Information systems on the Web, such as digital encyclopedias (e.g., Wikipedia), are interested in providing good navigational support to their users. To that end, navigation models can be useful for estimating the general navigability of an information space and for understanding how users interact with it. Such models can also be applied to identify problems faced by users during navigation and to improve user interfaces. Studying navigation on the Web is a challenging task that has a long tradition in our scientific community. Based on large studies, researchers have made significant steps towards understanding navigational user behavior on the Web, identifying general usage patterns, regularities, and strategies users apply during navigation. The seminal information foraging theory suggests that people follow links by constantly estimating their quality in terms of information value and the cost associated with obtaining that value by interacting with the environment. Furthermore, models describing the network structure of the Web, such as the bow-tie model and small-world models, have been introduced. These models contributed valuable insights towards characterizing the underlying network topology on which users operate and the extent to which it allows efficient navigation. In the context of information networks, researchers have successfully modeled user navigation resorting to Markov chains and to decentralized search. With respect to users' navigational behavior and their click activities to traverse a link, researchers have found a valuable source of information in the log files of Web servers. Click data has also been collected by letting humans play navigational games on Wikipedia. With this data, researchers tested different navigational hypotheses; for example, (i) whether humans tend to navigate between semantically similar articles, and (ii) whether they experience a trade-off between following links leading towards semantically similar articles and following links leading towards possibly well-connected articles. For navigation with a particular target in mind, users are found to be greedy with respect to the next click if they are confident to be on the right path, whereas they tend to explore the information network at random if they feel insecure or lost and have no intuition about the next click. Although these research lines have advanced our understanding of navigational user behavior in information networks, for the goal of the proposed thesis, namely modeling navigation, related work does not address the following questions: (i) What is the relationship between the user's awareness of the structure and topology of the information network and the efficiency of navigation, modeled as decentralized search? (ii) How do users interact with the content to explore and discover it, i.e., are there specific links that are especially appealing, and what are their characteristics? My research focuses on modeling navigation in an information space represented as an information network. Regarding the first question, I introduce and apply partially informed decentralized search to model the extent to which a user is exposed to the network structure of the information space and can make informed decisions about her next step towards exploring the content [1]. I test different hypotheses regarding the type and the amount of network structural information used to model navigation. My results show that only a small amount of knowledge about the network structure is sufficient for efficient navigation. For the second question, I study large-scale click data from the English version of Wikipedia. I observe a focus of users' attention on specific links. With this part of the proposal, I want to shed light on a different aspect of navigation and concentrate on the question of why some links are more successful than others. In particular, I study the relationship between link properties and link popularity as measured by transitional click data. To that end, I formulate navigational hypotheses based on different link features, i.e., network features, semantic features and visual features [2, 3]. The plausibility of these hypotheses is then tested using a Markov chain-based Bayesian hypothesis testing framework. Results suggest that Wikipedia users tend to select links located at the top of the page. Furthermore, users are tempted to select links leading towards the periphery of the Wikipedia network. To conclude, I believe that the insights gained may have an impact on system design decisions, i.e., existing guidelines for Wikipedia contributors can be adapted to better reflect the usage of the system.
【Keywords】: human click behavior; navigation; wikipedia
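One illustrative reading of partially informed decentralized search: a walker who, at each step, can score only a fraction of its neighbors (its partial knowledge of the structure) and greedily moves toward the target. The similarity function, the knowledge model and all names below are assumptions for intuition, not the thesis' exact formulation.

```python
import numpy as np

def greedy_navigation(graph, sim, start, target, known_frac, rng,
                      max_steps=100):
    """Partially informed greedy navigation toward a target node.

    graph: dict node -> list of neighbor nodes;
    sim(a, b): similarity proxy for closeness of a to target b;
    known_frac: fraction of neighbors the walker is allowed to inspect,
    modeling partial awareness of the network structure.
    Returns the path taken (which may stop short of the target).
    """
    node, path = start, [start]
    for _ in range(max_steps):
        if node == target:
            return path
        nbrs = graph[node]
        if not nbrs:
            break                              # dead end
        k = max(1, int(known_frac * len(nbrs)))
        known = rng.choice(len(nbrs), size=k, replace=False)
        # move to the best-scored neighbor among those the walker "knows"
        node = max((nbrs[i] for i in known), key=lambda v: sim(v, target))
        path.append(node)
    return path
```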